Release v0.2.1 · NVIDIA-NeMo/RL · GitHub

Release v0.2.1

@terrykong released this 15 May 22:03
· 102 commits to main since this release
81d421f

🚀 Release v0.2.1

🎉 Official Open Source Release!

We are thrilled to announce that NeMo RL is now officially open source! We welcome the community to use it, contribute to it, and help shape the future of reinforcement learning.

✨ Highlights

🎯 DeepScaleR Reproducer in NeMo RL

This release features a reproducer for the DeepScaleR work by Agentica AI, where a 1.5B parameter model surpassed O1-Preview on the AIME benchmark (Pass@1). Our implementation replicates this by iteratively scaling DeepSeek's GRPO algorithm from 8K → 16K → 24K context lengths.

You can start the first stage of training (8K context window) using the following command:

uv run examples/run_grpo_math.py --config=examples/configs/grpo-deepscaler-1.5b-8K.yaml

For the complete 3-stage iterative training instructions and more details, please see our GRPO on DeepScaleR guide.

📝 OpenMathInstruct-2 SFT in NeMo RL

This release includes a Supervised Fine-Tuning (SFT) recipe that follows the OpenMathInstruct-2 paper. Using this recipe, training a Llama-3.1-8B model on the train_1M split of the nvidia/OpenMathInstruct-2 dataset achieves a score of 0.5020 on the MATH-500 benchmark, matching the reference implementation in NeMo-Skills.

You can run the OpenMathInstruct-2 recipe using the following command:

uv run examples/run_sft.py --config=examples/configs/sft_openmathinstruct2.yaml

For more details on dataset splits, training times, and evaluation, please see our SFT on OpenMathInstruct-2 guide.
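
If you want a quick look at the data this recipe consumes, the sketch below loads the train_1M split mentioned above with the Hugging Face datasets library and prints its size and fields. This is an illustration only, not part of the recipe; it assumes the datasets package is installed and that the split is exposed under the name train_1M.

# Illustrative only: inspect the SFT data outside of NeMo RL.
# Assumes `pip install datasets` and access to the public dataset on the Hub.
from datasets import load_dataset

ds = load_dataset("nvidia/OpenMathInstruct-2", split="train_1M")
print(ds)            # row count and column names
print(ds[0].keys())  # fields available for building prompt/response pairs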

⚡ Faster GRPO with Dynamic Batching

GRPO E2E performance has been significantly improved with the introduction of dynamic batching. This feature optimizes GPU utilization by sorting variable-length responses by sequence length and bucketing them into microbatches. These microbatches aim to have a total number of tokens close to train_mb_tokens and logprob_mb_tokens for the training and logprob stages, respectively.

Important: Dynamic batching requires dtensor to be enabled.

You can enable dynamic batching and dtensor in your YAML configuration like so:

policy:
  # Enable DTensor (required for dynamic batching)
  dtensor_cfg:
    enabled: True
    # Other dtensor settings like tensor_parallel_size, sequence_parallel, etc.
    # tensor_parallel_size: 1
    # sequence_parallel: False
    # activation_checkpointing: True

  # Dynamic batching settings
  dynamic_batching:
    enabled: True
    # Target number of tokens for training microbatches
    train_mb_tokens: ${mul:${policy.max_total_sequence_length}, ${policy.train_micro_batch_size}}
    # Target number of tokens for logprob microbatches
    logprob_mb_tokens: ${mul:${policy.max_total_sequence_length}, ${policy.logprob_batch_size}}
    # Round sequence lengths to the nearest multiple of this value for bucketing
    sequence_length_round: 64
  # Other policy settings like max_total_sequence_length, train_micro_batch_size, etc.
  # max_total_sequence_length: 4096
  # train_micro_batch_size: 4
  # logprob_batch_size: 8

Alternatively, you can enable these features and configure them via command-line overrides when running a script (e.g., run_grpo_math.py):

uv run examples/run_grpo_math.py \
  --config=<your_base_config.yaml> \
  policy.dtensor_cfg.enabled=True \
  policy.dynamic_batching.enabled=True
  # Optionally override other dynamic batching or dtensor parameters:
  # policy.dynamic_batching.train_mb_tokens=16384
  # policy.dynamic_batching.logprob_mb_tokens=32768
  # policy.dtensor_cfg.tensor_parallel_size=2

Make sure to adjust train_mb_tokens, logprob_mb_tokens, and other parameters according to your sequence length and batch size configuration.
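
For intuition, the following is a small, self-contained sketch of the packing idea described above, not NeMo RL's actual implementation: sort sequences by length, round padded lengths up to sequence_length_round, and close a microbatch once adding one more sequence would push the padded token count past the budget. With the example values above (max_total_sequence_length: 4096 and train_micro_batch_size: 4), the default training budget resolves to 4096 * 4 = 16384 tokens. The helper name and the exact cost model (padded size vs. summed tokens) are illustrative assumptions.

# Illustrative sketch only (hypothetical helper, not the NeMo RL API):
# pack variable-length sequences into microbatches whose padded token count
# stays close to a target budget such as train_mb_tokens.
def pack_into_microbatches(seq_lens, mb_tokens, round_to=64):
    # Sort by length so sequences of similar size share a microbatch,
    # which minimizes padding waste.
    order = sorted(range(len(seq_lens)), key=lambda i: seq_lens[i])
    batches, current, current_max = [], [], 0
    for i in order:
        # Round the would-be max length up to the bucketing multiple.
        new_max = -(-max(current_max, seq_lens[i]) // round_to) * round_to
        # Padded cost of the microbatch if this sequence were added.
        if current and new_max * (len(current) + 1) > mb_tokens:
            batches.append(current)
            current, current_max = [], 0
            new_max = -(-seq_lens[i] // round_to) * round_to
        current.append(i)
        current_max = new_max
    if current:
        batches.append(current)
    return batches

# Example: with max_total_sequence_length=4096 and train_micro_batch_size=4,
# the default budget is 4096 * 4 = 16384 tokens per training microbatch.
print(pack_into_microbatches([700, 3000, 350, 4000, 1200, 900], mb_tokens=16384))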

💎 Broad Model Support (including Gemma3)

NeMo RL enables users to leverage powerful open models from families such as Qwen, Llama, and Gemma for reinforcement learning. For this v0.2.1 release, we've enhanced support, particularly for Gemma3 models, addressing their unique characteristics like tied weights across all model sizes (which require special handling for tensor parallelism) and specific vLLM initialization needs. NeMo RL automatically handles these model quirks to ensure seamless training and inference. For more details on this, please see our Model Quirks guide.
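
As a concrete illustration of the tied-weights quirk, the short sketch below checks whether a Hugging Face checkpoint shares its input embedding and lm_head weights. It is a generic transformers snippet, not NeMo RL code; the model id is only an example, and Gemma checkpoints require gated access on the Hub.

# Generic sketch (not NeMo RL code): detect tied input/output embeddings.
from transformers import AutoConfig, AutoModelForCausalLM

model_id = "google/gemma-3-1b-it"  # example choice, assumed to be accessible
config = AutoConfig.from_pretrained(model_id)
print("tie_word_embeddings:", getattr(config, "tie_word_embeddings", None))

model = AutoModelForCausalLM.from_pretrained(model_id)
embed = model.get_input_embeddings().weight
lm_head = model.get_output_embeddings().weight
# When tied, both tensors share storage, so sharding one of them for tensor
# parallelism without handling the other would silently corrupt the model.
print("shared storage:", embed.data_ptr() == lm_head.data_ptr())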

🛠️ Bug Fixes

  • Gradient Accumulation: Resolved a common issue where naively averaging per-microbatch losses during gradient accumulation, especially with varying sequence lengths, led to inaccurate loss calculations. The fix (see #266) accumulates losses correctly across microbatches so that training and reported losses are accurate (illustrated in the sketch below).
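
The sketch below illustrates the failure mode with generic PyTorch, not the NeMo RL code: averaging each microbatch's mean loss and then averaging those means over-weights short microbatches, while weighting by valid-token counts recovers the true mean over all tokens.

import torch

# Generic sketch of the pitfall (not the NeMo RL implementation).
# Two microbatches with very different numbers of valid (non-padded) tokens.
losses_mb1 = torch.tensor([1.0, 1.0])                      # 2 token losses
losses_mb2 = torch.tensor([3.0, 3.0, 3.0, 3.0, 3.0, 3.0])  # 6 token losses

# Naive: average of per-microbatch means -> (1.0 + 3.0) / 2 = 2.0
naive = (losses_mb1.mean() + losses_mb2.mean()) / 2

# Correct: weight each microbatch by its token count
# -> (2*1.0 + 6*3.0) / 8 = 2.5, the true mean over all 8 tokens
token_weighted = (losses_mb1.sum() + losses_mb2.sum()) / (
    losses_mb1.numel() + losses_mb2.numel()
)

print(naive.item(), token_weighted.item())  # 2.0 vs 2.5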

📊 Release Runs

We have provided TensorBoard logs for the release runs to give you a head start on what to expect from our recipes.

To view these logs easily, we've also provided a Google Colab notebook that downloads and serves them.

What's Changed

  • fix: Fix fsdp1 grad clipping and log grad norm by @ashors1 in #251
  • fix: Update DPO and SFT configs to use dtensor by @ashors1 in #256
  • chore: better logging when insufficient resources by @terrykong in #271
  • feat: E2E multi-turn RL example with a sliding puzzle game by @SahilJain314 in #242
  • docs: instruct users to git clone before beginning by @terrykong in #257
  • fix: add bibtex entry by @parthchadha in #273
  • feat: Updated Name to NeMo RL by @SahilJain314 in #265
  • docs: Correcting build issues and CI by @aschilling-nv in #270
  • fix: improve port selection and exiting early from ray.sub by @terrykong in #272
  • feat: publish convergence/release runs by @terrykong in #214
  • fix: fixes #264 where tied weights check didn't work on fsdp1 by @parthchadha in #284
  • feat: Add hydra style overrides to SFT by @hemildesai in #208
  • feat: rename ratio_eps_{min/max} to ratio_clip_{min/max} for clarity by @SahilJain314 in #283
  • ci: add eval functional test by @yuki-666 in #269
  • chore: add isort rules and pyflakes in ruff/precommit by @terrykong in #291
  • test: add a test that checks if recipes can be merged into the base config by @terrykong in #288
  • feat: Remove 'last 100' hack for math verifier by @SahilJain314 in #287
  • chore: Remove online hf checkpointing by @ashors1 in #285
  • fix: Fixed max seqlen not respected correctly by @SahilJain314 in #299
  • chore: Remove outdated comment in DPO config by @ashors1 in #293
  • fix: fix dtype of empty token_ids for consistency by @ashors1 in #290
  • ci: Add initial code coverage report by @chtruong814 in #268
  • feat: add qwen3 support by @gshennvm in #289
  • feat: config.json -> config.yaml to keep configs in the same representation by @terrykong in #314
  • fix: Step LR scheduler once per grpo step by @ashors1 in #305
  • perf: update sft and dpo recipes to use bf16 by @ashors1 in #302
  • fix: Add division by temperature in training model by @parthchadha in #316
  • ci: Add DPO convergence recipes by @ashors1 in #297
  • docs: large tech doc edit by @terrykong in #303
  • feat: mute math verify to dev null by @SahilJain314 in #319
  • docs: Add an example for saving a HF checkpoint E2E by @terrykong in #320
  • fix: Fixed capitalization of 'NVIDIA/nemo-rl' -> 'NVIDIA/NeMo-RL' in URL refs. by @SahilJain314 in #330
  • feat: Add support for gemma-3 by @yfw in #298
  • test: switch tests to qwen3 0.6B by @terrykong in #315
  • docs: fix the front page readme heading levels by @terrykong in #336
  • fix: Loosen threshold for dpo functional test by @ashors1 in #344
  • feat: Add deepscaler dataset by @abukharin-nv in #335
  • fix: reinitialize ray cluster if required by @parthchadha in #341
  • feat: dual-clip in grpo loss by @ZhiyuLi-Nvidia in #311
  • feat: improve eval by @yuki-666 in #325
  • fix: sliding_window_overwrite by @ZhiyuLi-Nvidia in #331
  • docs: add docs for local concurrent clusters and fix paths by @terrykong in #346
  • feat: pin our python to 3.12 since python 3.13 can break ray by @terrykong in #343
  • fix: fix accumulation of loss across microbatches by @ashors1 in #266
  • docs: update readme.md features by @snowmanwwg in #348
  • fix: update the comment about why we init in fp32 (354) by @ko3n1g in #356
  • Cherry pick feat: add and log a very rough entropy approximation (342) into r0.2.1 by @ko3n1g in #358
  • fix: fix issues preventing running grpo on volta (294) by @ko3n1g in #359
  • docs: remove license that was erroneously copy-pasted (357) by @ko3n1g in #364
  • Cherry pick fix: recipes missing args (365) into r0.2.1 by @ko3n1g in #372
  • Cherry pick fix: add missing multi-turn, container information in README (369) into r0.2.1 by @ko3n1g in #376
  • Cherry pick fix: Save last checkpoint (368) into r0.2.1 by @ko3n1g in #380
  • Cherry pick feat: Handle Gemma3 special cases in code (379) into r0.2.1 by @ko3n1g in #386
  • Cherry pick feat: Fixed metric calculation and made all grpo metrics token-level (373) into r0.2.1 by @ko3n1g in #390
  • Cherry pick feat: SFT on OpenMathInstruct-2 (360) into r0.2.1 by @ko3n1g in #393
  • Cherry pick feat: add aime24 validation set (388) into r0.2.1 by @ko3n1g in #396
  • Cherry pick feat: add deepscaler guide (391) into r0.2.1 by @ko3n1g in #397
  • Cherry pick feat: dynamic batching for training and log prob stages (274) into r0.2.1 by @ko3n1g in #400
  • Cherry pick docs: deepscaler guide on sidebar (401) into r0.2.1 by @ko3n1g in #402

New Contributors

Full Changelog: v0.2.0...v0.2.1
