sample-eval evaluation

All results: 
| sampler_name                     |   drop |   gpqa |   math |     mgsm |   mmlu |
|:---------------------------------|-------:|-------:|-------:|---------:|-------:|
| gpt-4-turbo-2024-04-09_assistant | 46.667 |    0.2 |    0.2 | 0.209091 |      0 |
| gpt-4-turbo-2024-04-09_chatgpt   | 36.667 |    0.2 |    0.4 | 0.272727 |      0 |
| gpt-4o-mini-2024-07-18           | 50     |    0   |    0.4 | 0.227273 |      0 |
| gpt-4o_assistant                 | 46.667 |    0.2 |    0.2 | 0.263636 |      0 |
| gpt-4o_chatgpt                   | 48.667 |    0.2 |    0.2 | 0.281818 |      0 |

All results: 
| sampler_name    |    drop |     gpqa |   math |     mgsm |   mmlu |
|:----------------|--------:|---------:|-------:|---------:|-------:|
| doubao-lite-32k | 48.4181 | 0.180303 | 0.2712 | 0.265818 | 0.5384 |