sample-eval evaluation
All results:
| sampler_name | drop | gpqa | math | mgsm | mmlu |
|:---------------------------------|-----:|-----:|-----:|-----:|-----:|
| gpt-4-turbo-2024-04-09_assistant | 46.667 | 0.2 | 0.2 | 0.209091 | 0 |
| gpt-4-turbo-2024-04-09_chatgpt | 36.667 | 0.2 | 0.4 | 0.272727 | 0 |
| gpt-4o-mini-2024-07-18 | 50 | 0 | 0.4 | 0.227273 | 0 |
| gpt-4o_assistant | 46.667 | 0.2 | 0.2 | 0.263636 | 0 |
| gpt-4o_chatgpt | 48.667 | 0.2 | 0.2 | 0.281818 | 0 |
All results:
| sampler_name | drop | gpqa | math | mgsm | mmlu |
|:----------------|-----:|-----:|-----:|-----:|-----:|
| doubao-lite-32k | 48.4181 | 0.180303 | 0.2712 | 0.265818 | 0.5384 |