
[BUG] Is AIME24 broken? #771


Open
eldarkurtic opened this issue May 22, 2025 · 2 comments
Labels
bug Something isn't working

Comments

@eldarkurtic
Contributor

Hi folks,

I've been trying to reproduce the DeepSeek-R1-Distill-Llama-8B evals, but it seems like something is going wrong with AIME24 (MATH-500 and GPQA:Diamond look fine):

|          Task          |Version|        Metric        |Value |   |Stderr|
|------------------------|------:|----------------------|-----:|---|-----:|
|all                     |       |math_pass@1:1_samples |0.6763|±  |0.0534|
|                        |       |math_pass@1:4_samples |0.6782|±  |0.0388|
|                        |       |math_pass@1:8_samples |0.4375|±  |0.0636|
|                        |       |math_pass@1:16_samples|0.2188|±  |0.0318|
|                        |       |math_pass@1:32_samples|0.1094|±  |0.0159|
|                        |       |math_pass@1:64_samples|0.0547|±  |0.0080|
|                        |       |gpqa_pass@1:1_samples |0.5051|±  |0.0356|
|                        |       |gpqa_pass@1:4_samples |0.4710|±  |0.0273|
|                        |       |gpqa_pass@1:8_samples |0.4836|±  |0.0264|
|lighteval:aime24:0      |      2|math_pass@1:1_samples |0.4667|±  |0.0926|
|                        |       |math_pass@1:4_samples |0.4583|±  |0.0668|
|                        |       |math_pass@1:8_samples |0.4375|±  |0.0636|
|                        |       |math_pass@1:16_samples|0.2188|±  |0.0318|
|                        |       |math_pass@1:32_samples|0.1094|±  |0.0159|
|                        |       |math_pass@1:64_samples|0.0547|±  |0.0080|  <-- 0.0547 vs 0.439 (Open-R1 README)
|lighteval:gpqa:diamond:0|      1|gpqa_pass@1:1_samples |0.5051|±  |0.0356|
|                        |       |gpqa_pass@1:4_samples |0.4710|±  |0.0273|
|                        |       |gpqa_pass@1:8_samples |0.4836|±  |0.0264|
|lighteval:math_500:0    |      2|math_pass@1:1_samples |0.8860|±  |0.0142|
|                        |       |math_pass@1:4_samples |0.8980|±  |0.0108|            

i.e., I am seeing math_pass@1:64_samples = 0.0547 (5.47) versus the expected 43.9 (based on https://github.com/huggingface/open-r1?tab=readme-ov-file#reproducing-deepseeks-evaluation-results).

(cross-posting this from Open-R1 huggingface/open-r1#655 since I am not sure who the owner is for reasoning evals)
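
For reference, a launch command along these lines reproduces the setup above (a sketch adapted from the Open-R1 README linked above; the exact model args and flags are assumptions and may vary between lighteval versions):

```shell
# Sketch of how the runs above were launched, adapted from the Open-R1 README.
# Model args, sampling parameters, and paths are illustrative; exact flags and
# model-args keys can differ between lighteval versions.
MODEL=deepseek-ai/DeepSeek-R1-Distill-Llama-8B
MODEL_ARGS="pretrained=$MODEL,dtype=bfloat16,max_model_length=32768,gpu_memory_utilization=0.8,generation_parameters={max_new_tokens:32768,temperature:0.6,top_p:0.95}"
OUTPUT_DIR=data/evals/$MODEL

# AIME24 via the built-in lighteval task (MATH-500 and GPQA:Diamond are analogous)
lighteval vllm "$MODEL_ARGS" "lighteval|aime24|0|0" \
    --use-chat-template \
    --output-dir "$OUTPUT_DIR"
```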

@eldarkurtic eldarkurtic added the bug Something isn't working label May 22, 2025
@clefourrier
Member

cc @lewtun just in case


I am seeing the same problem.
With lighteval==0.8.1, the AIME2024 accuracy for DeepSeek-R1-Distill-Qwen-1.5B is 26%. However, after upgrading lighteval to 0.10.0, the AIME2024 accuracy for the same model is approximately 36%. I haven't changed anything else.

What's more, the AMC2023 accuracy reported in the official docs is about 62.9%, but with lighteval==0.10.0 the result is as follows:

|     Task     |Version|     Metric     |Value|   |Stderr|
|--------------|------:|----------------|----:|---|-----:|
|all           |       |extractive_match| 0.75|±  |0.0693|
|custom:amc23:0|      1|extractive_match| 0.75|±  |0.0693|

I am confused about why this is happening.
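
If it helps with debugging, a simple way to confirm the version effect is to pin each release in a fresh environment and rerun the identical eval command (a sketch; only the two versions mentioned above are assumed):

```shell
# Two fresh environments that differ only in the pinned lighteval release.
python -m venv env-lighteval-0.8.1
env-lighteval-0.8.1/bin/pip install lighteval==0.8.1

python -m venv env-lighteval-0.10.0
env-lighteval-0.10.0/bin/pip install lighteval==0.10.0

# Then run the same AIME24 / AMC23 command from each environment and compare scores.
```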
