LLM 大模型评测

背景

需要对于私有部署的模型进行效果评估。从而保证模型的质量。

过程

使用 simple-eval 评测。

方法

数据集	解释
drop	比较复杂
gpqa	填空题，使用 prompt 让模型返回具体答案，然后判断答案是否相等
math	填空题，使用 prompt 让模型返回具体答案，然后正则表达式切出具体的答案，然后使用大模型来判断答案是否相等
mgsm	填空题，使用 prompt 让模型返回具体答案，然后判断答案是否相等
mmlu	单选题，使用 prompt 让模型返回具体答案，然后正则表达式切出具体的答案，判断答案是否相等
humaneval	解答题，使用 prompt 让模型返回具体代码片段，然后执行 python 代码，assert 是否报错判断是否通过

结果

(.venv) (base) ➜  GolandProjects python -m simple-evals.demo

{'mmlu': <simple-evals.mmlu_eval.MMLUEval object at 0x1143bde70>, 'math': <simple-evals.math_eval.MathEval object at 0x1143bf0a0>, 'gpqa': <simple-evals.gpqa_eval.GPQAEval object at 0x1143bef20>, 'mgsm': <simple-evals.mgsm_eval.MGSMEval object at 0x1143bf130>, 'drop': <simple-evals.drop_eval.DropEval object at 0x1143befb0>}
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2500/2500 [10:06<00:00,  4.12it/s]
Writing report to /tmp/mmlu_doubao-lite-32k.html
{'other': np.float64(0.5790349417637272), 'other:std': np.float64(0.49371396372839627), 'score:std': np.float64(0.498523259236718), 'stem': np.float64(0.4224299065420561), 'stem:std': np.float64(0.49394623249998165), 'humanities': np.float64(0.5196195005945303), 'humanities:std': np.float64(0.4996149269151404), 'social_sciences': np.float64(0.6405353728489483), 'social_sciences:std': np.float64(0.4798435255145234), 'score': np.float64(0.5384)}
Writing results to /tmp/mmlu_doubao-lite-32k.json
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2500/2500 [31:48<00:00,  1.31it/s]
Writing report to /tmp/math_doubao-lite-32k.html
{'score:std': np.float64(0.4445790818290937), 'score': np.float64(0.2712)}
Writing results to /tmp/math_doubao-lite-32k.json
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1980/1980 [24:49<00:00,  1.33it/s]
Writing report to /tmp/gpqa_doubao-lite-32k.html
{'chars': np.float64(980.6404040404041), 'chars:std': np.float64(984.3785210532318), 'score:std': np.float64(0.384439653998615), 'score': np.float64(0.1803030303030303)}
Writing results to /tmp/gpqa_doubao-lite-32k.json
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2750/2750 [18:11<00:00,  2.52it/s]
Writing report to /tmp/mgsm_doubao-lite-32k.html
{'bn': np.float64(0.108), 'bn:std': np.float64(0.31038041175306147), 'group_non_latin': np.float64(0.30533333333333335), 'group_non_latin:std': np.float64(0.4605484652985925), 'score:std': np.float64(0.4417678983731818), 'de': np.float64(0.212), 'de:std': np.float64(0.40872484632084705), 'group_latin': np.float64(0.2184), 'group_latin:std': np.float64(0.413160307870928), 'en': np.float64(0.44), 'en:std': np.float64(0.4963869458396343), 'es': np.float64(0.236), 'es:std': np.float64(0.4246221850068599), 'fr': np.float64(0.196), 'fr:std': np.float64(0.3969685126052191), 'ja': np.float64(0.484), 'ja:std': np.float64(0.499743934430424), 'ru': np.float64(0.36), 'ru:std': np.float64(0.48), 'sw': np.float64(0.008), 'sw:std': np.float64(0.08908422980528034), 'te': np.float64(0.072), 'te:std': np.float64(0.2584879107424562), 'th': np.float64(0.184), 'th:std': np.float64(0.38748419322599476), 'zh': np.float64(0.624), 'zh:std': np.float64(0.4843800161030593), 'score': np.float64(0.26581818181818184)}
Writing results to /tmp/mgsm_doubao-lite-32k.json
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2000/2000 [11:08<00:00,  2.99it/s]
Writing report to /tmp/drop_doubao-lite-32k.html
{'em_score': np.float64(0.4315), 'em_score:std': np.float64(0.49528552371334256), 'f1_score': np.float64(48.418095), 'f1_score:std': np.float64(48.14274043322601), 'score:std': np.float64(0.499270467782744), 'score': np.float64(0.527)}
Writing results to /tmp/drop_doubao-lite-32k.json

All results: 
| sampler_name    |   ('metric', 'drop') |   ('metric', 'gpqa') |   ('metric', 'math') |   ('metric', 'mgsm') |   ('metric', 'mmlu') |
|:----------------|---------------------:|---------------------:|---------------------:|---------------------:|---------------------:|
| doubao-lite-32k |              48.4181 |             0.180303 |               0.2712 |             0.265818 |               0.5384 |

可以看到分数很低。

/tmp/drop_doubao-lite-32k.json

{
  "em_score": 0.4315,
  "em_score:std": 0.49528552371334256,
  "f1_score": 48.418095,
  "f1_score:std": 48.14274043322601,
  "score:std": 0.499270467782744,
  "score": 0.527
}

/tmp/gpqa_doubao-lite-32k.json

{
  "chars": 980.6404040404041,
  "chars:std": 984.3785210532318,
  "score:std": 0.384439653998615,
  "score": 0.1803030303030303
}

/tmp/math_doubao-lite-32k.json

{
  "score:std": 0.4445790818290937,
  "score": 0.2712
}

/tmp/mgsm_doubao-lite-32k.json

{
  "bn": 0.108,
  "bn:std": 0.31038041175306147,
  "group_non_latin": 0.30533333333333335,
  "group_non_latin:std": 0.4605484652985925,
  "score:std": 0.4417678983731818,
  "de": 0.212,
  "de:std": 0.40872484632084705,
  "group_latin": 0.2184,
  "group_latin:std": 0.413160307870928,
  "en": 0.44,
  "en:std": 0.4963869458396343,
  "es": 0.236,
  "es:std": 0.4246221850068599,
  "fr": 0.196,
  "fr:std": 0.3969685126052191,
  "ja": 0.484,
  "ja:std": 0.499743934430424,
  "ru": 0.36,
  "ru:std": 0.48,
  "sw": 0.008,
  "sw:std": 0.08908422980528034,
  "te": 0.072,
  "te:std": 0.2584879107424562,
  "th": 0.184,
  "th:std": 0.38748419322599476,
  "zh": 0.624,
  "zh:std": 0.4843800161030593,
  "score": 0.26581818181818184
}

/tmp/mmlu_doubao-lite-32k.json

{
  "other": 0.5790349417637272,
  "other:std": 0.49371396372839627,
  "score:std": 0.498523259236718,
  "stem": 0.4224299065420561,
  "stem:std": 0.49394623249998165,
  "humanities": 0.5196195005945303,
  "humanities:std": 0.4996149269151404,
  "social_sciences": 0.6405353728489483,
  "social_sciences:std": 0.4798435255145234,
  "score": 0.5384
}

sampler_name	drop	gpqa	gsm8k	mgsm	mmlu	math
doubao-lite-32k	52.7	18.0	63.8	62.4	64.1	27.1

背景​

过程​

方法​

结果​

背景

过程

方法

结果