Crowdsourcing the recipe search for compute-optimal LLM RL with verifiable rewards.
<think>
Okay, so we want to figure out the best way to train LLMs with RL on verifiable rewards.
I remember from the DeepSeek R1 paper that training with GRPO on simple correctness and format rewards can teach models to solve math and coding problems.
But I don't think they've reported hyperparameter ablations for GRPO beyond the values used in the earlier DeepSeekMath paper. It's also not clear how much this extends to problems beyond math and code, particularly with smaller models.
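Concretely, the core GRPO move as I understand it is to replace a learned critic with group-relative reward normalization. A minimal sketch (my paraphrase of the DeepSeekMath description, not their code; the function name and toy example are mine):
```python
import statistics

def grpo_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    # Sample a group of completions per prompt, score each with the
    # verifiable reward, then normalize by the group's mean and std,
    # so no learned value model is needed.
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population vs. sample std is itself an implementation choice
    return [(r - mean) / (std + eps) for r in rewards]

# One prompt, 4 sampled completions: two verified correct, two wrong.
print(grpo_advantages([1.0, 1.0, 0.0, 0.0]))  # -> ~[1.0, 1.0, -1.0, -1.0]
```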
Hmm, so there are a number of LLM RL training libraries which might be useful here, each with its own implementation choices, e.g.:
- TRL (Hugging Face)
- verl
- OpenRLHF
There are also some benchmarks that are commonly studied in practice, though there isn't really an agreed-upon gold-standard set (a toy verifier sketch follows the list):
- Math (MATH, AIME, FrontierMath)
- Coding (HumanEval, IOI, CodeForces)
- Logic puzzles (Alice in Wonderland, Sudoku, Temporal Clue)
- Hard knowledge tasks (MMLU-Pro, GPQA, HLE)
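For most of these, correctness is mechanically checkable, which is what makes them usable as RL rewards. A toy MATH-style verifier (a hypothetical helper, not any benchmark's official grader):
```python
import re

def math_reward(completion: str, ground_truth: str) -> float:
    # Extract the last \boxed{...} answer and exact-match it against
    # the reference. Real graders also handle nested braces and
    # symbolic equivalence (e.g. 1/2 vs. 0.5).
    matches = re.findall(r"\\boxed\{([^{}]*)\}", completion)
    if not matches:
        return 0.0
    return 1.0 if matches[-1].strip() == ground_truth.strip() else 0.0

print(math_reward(r"... so the answer is \boxed{42}.", "42"))  # -> 1.0
```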
Wait, but there are also a lot of open training datasets for reasoning problems that might be useful, from teams like:
- Hugging Face (Open-R1)
- Bespoke Labs (OpenThoughts)
- Project Numina (NuminaMath)
Additionally, I remember reading a number of papers and blog posts about how RL algorithms like PPO can be very sensitive to implementation details. That seems especially important to figure out for LLMs, given how expensive and time-consuming they are to train.
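One concrete example of a detail that silently differs between codebases is how per-token policy-gradient losses get aggregated. A sketch, assuming PyTorch tensors of shape [batch, seq_len] and a 0/1 completion mask (function names are mine):
```python
import torch

def token_mean_loss(per_token_loss: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    # Average over every unmasked token in the batch:
    # longer completions contribute more to the gradient.
    return (per_token_loss * mask).sum() / mask.sum()

def sequence_mean_loss(per_token_loss: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    # Average within each sequence first, then across sequences:
    # every completion gets equal weight regardless of length.
    return ((per_token_loss * mask).sum(-1) / mask.sum(-1)).mean()
```
The two only agree when all completions are the same length, so the choice quietly changes the length bias of training.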
I also remember that Andrej Karpathy's nanoGPT project spawned a productive search for faster pretraining methods and yielded a number of notable breakthroughs, like the Muon optimizer, documented in Keller Jordan's modded-nanogpt speedrun.
What if we had a similar kind of competitive-collaborative leaderboard for RL and reasoning?
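To make that concrete, a run record for such a leaderboard might look something like this (entirely hypothetical field names, loosely modeled on modded-nanogpt's practice of ranking verified runs by training cost to a fixed target):
```python
from dataclasses import dataclass

@dataclass
class LeaderboardEntry:
    author: str          # who submitted the run
    recipe: str          # e.g. "GRPO + <some tweak>"
    benchmark: str       # e.g. "Temporal Clue"
    target_score: float  # fixed success threshold the run must reach
    gpu_hours: float     # compute actually spent; the ranked metric
    commit: str          # code reference so others can reproduce it
```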
[To be continued...]
</think>