Release v0.2.1 · NVIDIA-NeMo/RL · GitHub

Release v0.2.1

@terrykong released this 15 May 22:03
· 102 commits to main since this release
81d421f

🚀 Release v0.2.1

🎉 Official Open Source Release!

We are thrilled to announce that NeMo RL is now officially open source! We welcome the community to use it, contribute to it, and help shape the future of reinforcement learning.

✨ Highlights

🎯 DeepScaleR Reproducer in NeMo RL

This release features a reproducer for the DeepScaleR work by Agentica AI, where a 1.5B parameter model surpassed O1-Preview on the AIME benchmark (Pass@1). Our implementation replicates this by iteratively scaling DeepSeek's GRPO algorithm from 8K → 16K → 24K context lengths.

You can start the first stage of training (8K context window) using the following command:

uv run examples/run_grpo_math.py --config=examples/configs/grpo-deepscaler-1.5b-8K.yaml

For the complete 3-stage iterative training instructions and more details, please see our GRPO on DeepScaleR guide.

📝 OpenMathInstruct-2 SFT in NeMo RL

This release includes a Supervised Fine-Tuning (SFT) recipe that follows the OpenMathInstruct-2 paper. Using this recipe, training a Llama-3.1-8B model on the train_1M split of the nvidia/OpenMathInstruct-2 dataset achieves a score of 0.5020 on the MATH-500 benchmark, matching the reference implementation in NeMo-Skills.

You can run the OpenMathInstruct-2 recipe using the following command:

uv run examples/run_sft.py --config=examples/configs/sft_openmathinstruct2.yaml

For more details on dataset splits, training times, and evaluation, please see our SFT on OpenMathInstruct-2 guide.
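
If you want a quick look at the data this recipe consumes, the sketch below loads the train_1M split mentioned above with the Hugging Face datasets library and prints its size and fields. This is an illustration only, not part of the recipe; it assumes the datasets package is installed and that the split is exposed under the name train_1M.

# Illustrative only: inspect the SFT data outside of NeMo RL.
# Assumes `pip install datasets` and access to the public dataset on the Hub.
from datasets import load_dataset

ds = load_dataset("nvidia/OpenMathInstruct-2", split="train_1M")
print(ds)            # row count and column names
print(ds[0].keys())  # fields available for building prompt/response pairs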

⚡ Faster GRPO with Dynamic Batching

GRPO E2E performance has been significantly improved with the introduction of dynamic batching. This feature optimizes GPU utilization by sorting variable-length responses by sequence length and bucketing them into microbatches. These microbatches aim to have a total number of tokens close to train_mb_tokens and logprob_mb_tokens for the training and logprob stages, respectively.

Important: Dynamic batching requires dtensor to be enabled.

You can enable dynamic batching and dtensor in your YAML configuration like so:

policy:
  # Enable DTensor (required for dynamic batching)
  dtensor_cfg:
    enabled: True
    # Other dtensor settings like tensor_parallel_size, sequence_parallel, etc.
    # tensor_parallel_size: 1
    # sequence_parallel: False
    # activation_checkpointing: True

  # Dynamic batching settings
  dynamic_batching:
    enabled: True
    # Target number of tokens for training microbatches
    train_mb_tokens: ${mul:${policy.max_total_sequence_length}, ${policy.train_micro_batch_size}}
    # Target number of tokens for logprob microbatches
    logprob_mb_tokens: ${mul:${policy.max_total_sequence_length}, ${policy.logprob_batch_size}}
    # Round sequence lengths to the nearest multiple of this value for bucketing
    sequence_length_round: 64
  # Other policy settings like max_total_sequence_length, train_micro_batch_size, etc.
  # max_total_sequence_length: 4096
  # train_micro_batch_size: 4
  # logprob_batch_size: 8

Alternatively, you can enable these features and configure them via command-line overrides when running a script (e.g., run_grpo_math.py):

uv run examples/run_grpo_math.py \
  --config=<your_base_config.yaml> \
  policy.dtensor_cfg.enabled=True \
  policy.dynamic_batching.enabled=True
  # Optionally override other dynamic batching or dtensor parameters:
  # policy.dynamic_batching.train_mb_tokens=16384
  # policy.dynamic_batching.logprob_mb_tokens=32768
  # policy.dtensor_cfg.tensor_parallel_size=2

Make sure to adjust train_mb_tokens, logprob_mb_tokens, and other parameters according to your sequence length and batch size configuration.
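
For intuition, the following is a small, self-contained sketch of the packing idea described above, not NeMo RL's actual implementation: sort sequences by length, round padded lengths up to sequence_length_round, and close a microbatch once adding one more sequence would push the padded token count past the budget. With the example values above (max_total_sequence_length: 4096 and train_micro_batch_size: 4), the default training budget resolves to 4096 * 4 = 16384 tokens. The helper name and the exact cost model (padded size vs. summed tokens) are illustrative assumptions.

# Illustrative sketch only (hypothetical helper, not the NeMo RL API):
# pack variable-length sequences into microbatches whose padded token count
# stays close to a target budget such as train_mb_tokens.
def pack_into_microbatches(seq_lens, mb_tokens, round_to=64):
    # Sort by length so sequences of similar size share a microbatch,
    # which minimizes padding waste.
    order = sorted(range(len(seq_lens)), key=lambda i: seq_lens[i])
    batches, current, current_max = [], [], 0
    for i in order:
        # Round the would-be max length up to the bucketing multiple.
        new_max = -(-max(current_max, seq_lens[i]) // round_to) * round_to
        # Padded cost of the microbatch if this sequence were added.
        if current and new_max * (len(current) + 1) > mb_tokens:
            batches.append(current)
            current, current_max = [], 0
            new_max = -(-seq_lens[i] // round_to) * round_to
        current.append(i)
        current_max = new_max
    if current:
        batches.append(current)
    return batches

# Example: with max_total_sequence_length=4096 and train_micro_batch_size=4,
# the default budget is 4096 * 4 = 16384 tokens per training microbatch.
print(pack_into_microbatches([700, 3000, 350, 4000, 1200, 900], mb_tokens=16384))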

💎 Broad Model Support (including Gemma3)

NeMo RL enables users to leverage powerful open models from families such as Qwen, Llama, and Gemma for reinforcement learning. For this v0.2.1 release, we've enhanced support, particularly for Gemma3 models, addressing their unique characteristics like tied weights across all model sizes (which require special handling for tensor parallelism) and specific vLLM initialization needs. NeMo RL automatically handles these model quirks to ensure seamless training and inference. For more details on this, please see our Model Quirks guide.
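
As a concrete illustration of the tied-weights quirk, the short sketch below checks whether a Hugging Face checkpoint shares its input embedding and lm_head weights. It is a generic transformers snippet, not NeMo RL code; the model id is only an example, and Gemma checkpoints require gated access on the Hub.

# Generic sketch (not NeMo RL code): detect tied input/output embeddings.
from transformers import AutoConfig, AutoModelForCausalLM

model_id = "google/gemma-3-1b-it"  # example choice, assumed to be accessible
config = AutoConfig.from_pretrained(model_id)
print("tie_word_embeddings:", getattr(config, "tie_word_embeddings", None))

model = AutoModelForCausalLM.from_pretrained(model_id)
embed = model.get_input_embeddings().weight
lm_head = model.get_output_embeddings().weight
# When tied, both tensors share storage, so sharding one of them for tensor
# parallelism without handling the other would silently corrupt the model.
print("shared storage:", embed.data_ptr() == lm_head.data_ptr())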

🛠️ Bug Fixes

  • Gradient Accumulation: Resolved a common issue where naively averaging per-microbatch losses during gradient accumulation, especially with varying sequence lengths, led to inaccurate loss calculations. The fix (see #266) accumulates losses correctly across microbatches so that training and reported losses are accurate (illustrated in the sketch below).
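
The sketch below illustrates the failure mode with generic PyTorch, not the NeMo RL code: averaging each microbatch's mean loss and then averaging those means over-weights short microbatches, while weighting by valid-token counts recovers the true mean over all tokens.

import torch

# Generic sketch of the pitfall (not the NeMo RL implementation).
# Two microbatches with very different numbers of valid (non-padded) tokens.
losses_mb1 = torch.tensor([1.0, 1.0])                      # 2 token losses
losses_mb2 = torch.tensor([3.0, 3.0, 3.0, 3.0, 3.0, 3.0])  # 6 token losses

# Naive: average of per-microbatch means -> (1.0 + 3.0) / 2 = 2.0
naive = (losses_mb1.mean() + losses_mb2.mean()) / 2

# Correct: weight each microbatch by its token count
# -> (2*1.0 + 6*3.0) / 8 = 2.5, the true mean over all 8 tokens
token_weighted = (losses_mb1.sum() + losses_mb2.sum()) / (
    losses_mb1.numel() + losses_mb2.numel()
)

print(naive.item(), token_weighted.item())  # 2.0 vs 2.5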

📊 Release Runs

We have provided TensorBoard logs for the release runs to give you a head start on what to expect from our recipes.

To view these logs easily, we've also provided a Google Colab notebook that downloads and serves them.

What's Changed

  • fix: Fix fsdp1 grad clipping and log grad norm by @ashors1 in #251
  • fix: Update DPO and SFT configs to use dtensor by @ashors1 in #256
  • chore: better logging when insufficient resources by @terrykong in #271
  • feat: E2E multi-turn RL example with a sliding puzzle game by @SahilJain314 in #242
  • docs: instruct users to git clone before beginning by @terrykong in #257
  • fix: add bibtex entry by @parthchadha in #273
  • feat: Updated Name to NeMo RL by @SahilJain314 in #265
  • docs: Correcting build issues and CI by @aschilling-nv in #270
  • fix: improve port selection and exiting early from ray.sub by @terrykong in #272
  • feat: publish convergence/release runs by @terrykong in #214
  • fix: fixes #264 where tied weights check didn't work on fsdp1 by @parthchadha in #284
  • feat: Add hydra style overrides to SFT by @hemildesai in #208
  • feat: rename ratio_eps_{min/max} to ratio_clip_{min/max} for clarity by @SahilJain314 in #283
  • ci: add eval functional test by @yuki-666 in #269
  • chore: add isort rules and pyflakes in ruff/precommit by @terrykong in #291
  • test: add a test that checks if recipes can be merged into the base config by @terrykong in #288
  • feat: Remove 'last 100' hack for math verifier by @SahilJain314 in #287
  • chore: Remove online hf checkpointing by @ashors1 in #285
  • fix: Fixed max seqlen not respected correctly by @SahilJain314 in #299
  • chore: Remove outdated comment in DPO config by @ashors1 in #293
  • fix: fix dtype of empty token_ids for consistency by @ashors1 in #290
  • ci: Add initial code coverage report by @chtruong814 in #268
  • feat: add qwen3 support by @gshennvm in #289
  • feat: config.json -> config.yaml to keep configs in the same representation by @terrykong in #314
  • fix: Step LR scheduler once per grpo step by @ashors1 in #305
  • perf: update sft and dpo recipes to use bf16 by @ashors1 in #302
  • fix: Add division by temperature in training model by @parthchadha in #316
  • ci: Add DPO convergence recipes by @ashors1 in #297
  • docs: large tech doc edit by @terrykong in #303
  • feat: mute math verify to dev null by @SahilJain314 in #319
  • docs: Add an example for saving a HF checkpoint E2E by @terrykong in #320
  • fix: Fixed capitalization of 'NVIDIA/nemo-rl' -> 'NVIDIA/NeMo-RL' in URL refs. by @SahilJain314 in #330
  • feat: Add support for gemma-3 by @yfw in #298
  • test: switch tests to qwen3 0.6B by @terrykong in #315
  • docs: fix the front page readme heading levels by @terrykong in #336
  • fix: Loosen threshold for dpo functional test by @ashors1 in #344
  • feat: Add deepscaler dataset by @abukharin-nv in #335
  • fix: reinitialize ray cluster if required by @parthchadha in #341
  • feat: dual-clip in grpo loss by @ZhiyuLi-Nvidia in #311
  • feat: improve eval by @yuki-666 in #325
  • fix: sliding_window_overwrite by @ZhiyuLi-Nvidia in #331
  • docs: add docs for local concurrent clusters and fix paths by @terrykong in #346
  • feat: pin our python to 3.12 since python 3.13 can break ray by @terrykong in #343
  • fix: fix accumulation of loss across microbatches by @ashors1 in #266
  • docs: update readme.md features by @snowmanwwg in #348
  • fix: update the comment about why we init in fp32 (354) by @ko3n1g in #356
  • Cherry pick feat: add and log a very rough entropy approximation (342) into r0.2.1 by @ko3n1g in #358
  • fix: fix issues preventing running grpo on volta (294) by @ko3n1g in #359
  • docs: remove license that was erroneously copy-pasted (357) by @ko3n1g in #364
  • Cherry pick fix: recipes missing args (365) into r0.2.1 by @ko3n1g in #372
  • Cherry pick fix: add missing multi-turn, container information in README (369) into r0.2.1 by @ko3n1g in #376
  • Cherry pick fix: Save last checkpoint (368) into r0.2.1 by @ko3n1g in #380
  • Cherry pick feat: Handle Gemma3 special cases in code (379) into r0.2.1 by @ko3n1g in #386
  • Cherry pick feat: Fixed metric calculation and made all grpo metrics token-level (373) into r0.2.1 by @ko3n1g in #390
  • Cherry pick feat: SFT on OpenMathInstruct-2 (360) into r0.2.1 by @ko3n1g in #393
  • Cherry pick feat: add aime24 validation set (388) into r0.2.1 by @ko3n1g in #396
  • Cherry pick feat: add deepscaler guide (391) into r0.2.1 by @ko3n1g in #397
  • Cherry pick feat: dynamic batching for training and log prob stages (274) into r0.2.1 by @ko3n1g in #400
  • Cherry pick docs: deepscaler guide on sidebar (401) into r0.2.1 by @ko3n1g in #402

New Contributors

Full Changelog: v0.2.0...v0.2.1
