I am seeing the same problem.

When I use lighteval==0.8.1, the AIME2024 accuracy for DeepSeek-R1-Distill-Qwen-1.5B is 26%. However, after upgrading lighteval to 0.10.0, AIME2024 on DeepSeek-R1-Distill-Qwen-1.5B comes out at approximately 36%. I haven't changed anything else.

What's more, the AMC2023 accuracy reported in the official docs is about 62.9%, but with lighteval==0.10.0 the result is as follows:
Hi folks,
I've been trying to reproduce the DeepSeek-R1-Distill-Llama-8B evals, but it seems like something is going wrong with AIME24 (MATH-500 and GPQA:Diamond look fine):
i.e. I am seeing

math_pass@1:64_samples = 5.47

versus the expected 43.9 (based on https://github.com/huggingface/open-r1?tab=readme-ov-file#reproducing-deepseeks-evaluation-results).

(Cross-posting this from huggingface/open-r1#655 since I am not sure who owns the reasoning evals.)
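For anyone sanity-checking the numbers: as I understand it, `math_pass@1:64_samples` is pass@1 estimated from 64 generations per problem, i.e. essentially the average fraction of correct completions. A minimal sketch of the standard estimator (not lighteval's actual implementation):

```python
import math
from typing import Sequence

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., Codex paper):
    1 - C(n - c, k) / C(n, k). For k = 1 this reduces to c / n."""
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

def mean_pass_at_1(correct_per_problem: Sequence[int], n_samples: int = 64) -> float:
    """Average pass@1 over all problems (30 of them for AIME24)."""
    return sum(pass_at_k(n_samples, c, k=1) for c in correct_per_problem) / len(correct_per_problem)

# Rough intuition: ~3-4 correct generations out of 64 per problem lands around
# the 5.47 I am seeing, while roughly 28 out of 64 would give the expected ~43.9.
```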