
[BUG] Is AIME24 broken? #771


Open
eldarkurtic opened this issue May 22, 2025 · 2 comments
Labels
bug Something isn't working

Comments

@eldarkurtic
Contributor

Hi folks,

I've been trying to reproduce the DeepSeek-R1-Distill-Llama-8B evals, but it seems like something is going wrong with AIME24 (MATH-500 and GPQA:Diamond look fine):

|          Task          |Version|        Metric        |Value |   |Stderr|
|------------------------|------:|----------------------|-----:|---|-----:|
|all                     |       |math_pass@1:1_samples |0.6763|±  |0.0534|
|                        |       |math_pass@1:4_samples |0.6782|±  |0.0388|
|                        |       |math_pass@1:8_samples |0.4375|±  |0.0636|
|                        |       |math_pass@1:16_samples|0.2188|±  |0.0318|
|                        |       |math_pass@1:32_samples|0.1094|±  |0.0159|
|                        |       |math_pass@1:64_samples|0.0547|±  |0.0080|
|                        |       |gpqa_pass@1:1_samples |0.5051|±  |0.0356|
|                        |       |gpqa_pass@1:4_samples |0.4710|±  |0.0273|
|                        |       |gpqa_pass@1:8_samples |0.4836|±  |0.0264|
|lighteval:aime24:0      |      2|math_pass@1:1_samples |0.4667|±  |0.0926|
|                        |       |math_pass@1:4_samples |0.4583|±  |0.0668|
|                        |       |math_pass@1:8_samples |0.4375|±  |0.0636|
|                        |       |math_pass@1:16_samples|0.2188|±  |0.0318|
|                        |       |math_pass@1:32_samples|0.1094|±  |0.0159|
|                        |       |math_pass@1:64_samples|0.0547|±  |0.0080|  <-- 0.0547 vs 0.439 (Open-R1 README)
|lighteval:gpqa:diamond:0|      1|gpqa_pass@1:1_samples |0.5051|±  |0.0356|
|                        |       |gpqa_pass@1:4_samples |0.4710|±  |0.0273|
|                        |       |gpqa_pass@1:8_samples |0.4836|±  |0.0264|
|lighteval:math_500:0    |      2|math_pass@1:1_samples |0.8860|±  |0.0142|
|                        |       |math_pass@1:4_samples |0.8980|±  |0.0108|            

i.e., I am seeing math_pass@1:64_samples = 0.0547 (5.47) versus the expected 43.9 (based on https://github.com/huggingface/open-r1?tab=readme-ov-file#reproducing-deepseeks-evaluation-results).

(cross-posting this from Open-R1 huggingface/open-r1#655 since I am not sure who the owner is for reasoning evals)
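
For reference, a launch command along these lines reproduces the setup above (a sketch adapted from the Open-R1 README linked above; the exact model args and flags are assumptions and may vary between lighteval versions):

```shell
# Sketch of how the runs above were launched, adapted from the Open-R1 README.
# Model args, sampling parameters, and paths are illustrative; exact flags and
# model-args keys can differ between lighteval versions.
MODEL=deepseek-ai/DeepSeek-R1-Distill-Llama-8B
MODEL_ARGS="pretrained=$MODEL,dtype=bfloat16,max_model_length=32768,gpu_memory_utilization=0.8,generation_parameters={max_new_tokens:32768,temperature:0.6,top_p:0.95}"
OUTPUT_DIR=data/evals/$MODEL

# AIME24 via the built-in lighteval task (MATH-500 and GPQA:Diamond are analogous)
lighteval vllm "$MODEL_ARGS" "lighteval|aime24|0|0" \
    --use-chat-template \
    --output-dir "$OUTPUT_DIR"
```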

@eldarkurtic eldarkurtic added the bug Something isn't working label May 22, 2025
@clefourrier
Member

cc @lewtun just in case


I am seeing the same problem.
With lighteval==0.8.1, the AIME2024 accuracy for DeepSeek-R1-Distill-Qwen-1.5B is 26%. However, after upgrading lighteval to 0.10.0, the AIME2024 accuracy for the same model is approximately 36%. I haven't changed anything else.

What's more, the AMC2023 accuracy reported in the official docs is about 62.9%, but with lighteval==0.10.0 the result is as follows:

|     Task     |Version|     Metric     |Value|   |Stderr|
|--------------|------:|----------------|----:|---|-----:|
|all           |       |extractive_match| 0.75|±  |0.0693|
|custom:amc23:0|      1|extractive_match| 0.75|±  |0.0693|

I am confused about why this is happening.
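
If it helps with debugging, a simple way to confirm the version effect is to pin each release in a fresh environment and rerun the identical eval command (a sketch; only the two versions mentioned above are assumed):

```shell
# Two fresh environments that differ only in the pinned lighteval release.
python -m venv env-lighteval-0.8.1
env-lighteval-0.8.1/bin/pip install lighteval==0.8.1

python -m venv env-lighteval-0.10.0
env-lighteval-0.10.0/bin/pip install lighteval==0.10.0

# Then run the same AIME24 / AMC23 command from each environment and compare scores.
```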
