Releases: NVIDIA/NeMo-RL
Release v0.2.1
🎉 Official Open Source Release!
We are thrilled to announce that NeMo RL is now officially open source! We welcome the community to use and contribute to it to help shape the future of reinforcement learning.
✨ Highlights
🎯 DeepScaleR Reproducer in NeMo RL
This release features a reproducer for the DeepScaleR work by Agentica AI, where a 1.5B parameter model surpassed O1-Preview on the AIME benchmark (Pass@1). Our implementation replicates this by iteratively scaling DeepSeek's GRPO algorithm from 8K → 16K → 24K context lengths.
You can start the first stage of training (8K context window) using the following command:
uv run examples/run_grpo_math.py --config=examples/configs/grpo-deepscaler-1.5b-8K.yaml
For the complete 3-stage iterative training instructions and more details, please see our GRPO on DeepScaleR guide.
📐 OpenMathInstruct-2 SFT in NeMo RL
This release includes a Supervised Fine-Tuning (SFT) recipe that follows the OpenMathInstruct-2 paper. Using this recipe, training a Llama-3.1-8B model on the train_1M split of the nvidia/OpenMathInstruct-2 dataset achieves a score of 0.5020 on the MATH-500 benchmark, matching the reference implementation in NeMo-Skills.
You can run the OpenMathInstruct-2 recipe using the following command:
uv run examples/run_sft.py --config=examples/configs/sft_openmathinstruct2.yaml
For more details on dataset splits, training times, and evaluation, please see our SFT on OpenMathInstruct-2 guide.
⚡ Faster GRPO with Dynamic Batching
GRPO E2E performance has been significantly improved with the introduction of dynamic batching. This feature optimizes GPU utilization by sorting variable-length responses by sequence length and bucketing them into microbatches. These microbatches aim to have a total number of tokens close to train_mb_tokens and logprob_mb_tokens for the training and logprob stages, respectively.
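The sort-and-bucket idea can be illustrated with a minimal sketch. This is not NeMo RL's implementation: the function names, the round-up padding rule, and the greedy packing strategy are all assumptions made for illustration.

```python
from typing import List


def round_up(length: int, multiple: int) -> int:
    """Round a sequence length up to the next multiple (assumed padding rule)."""
    return ((length + multiple - 1) // multiple) * multiple


def bucket_by_tokens(
    seq_lens: List[int], mb_tokens: int, length_round: int = 64
) -> List[List[int]]:
    """Greedily group sequence indices into microbatches of at most mb_tokens padded tokens."""
    # Sort indices by length so each microbatch holds similar-length sequences.
    order = sorted(range(len(seq_lens)), key=lambda i: seq_lens[i])
    microbatches: List[List[int]] = []
    current: List[int] = []
    for idx in order:
        padded = round_up(seq_lens[idx], length_round)
        # Lengths are ascending, so the newest sequence is the longest and
        # every sequence in the microbatch would be padded to its length.
        if current and padded * (len(current) + 1) > mb_tokens:
            microbatches.append(current)
            current = []
        current.append(idx)
    if current:
        microbatches.append(current)
    return microbatches


# Hypothetical response lengths (in tokens) and a 4096-token budget.
seq_lens = [100, 900, 120, 4000, 130, 1000]
batches = bucket_by_tokens(seq_lens, mb_tokens=4096)
print(batches)  # [[0, 2, 4, 1], [5], [3]]
```

In this sketch the short responses share one microbatch while each long response gets its own, so little compute is spent on padding.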
Important: Dynamic batching requires dtensor to be enabled.
You can enable dynamic batching and dtensor in your YAML configuration like so:
policy:
  # Enable DTensor (required for dynamic batching)
  dtensor_cfg:
    enabled: True
    # Other dtensor settings like tensor_parallel_size, sequence_parallel, etc.
    # tensor_parallel_size: 1
    # sequence_parallel: False
    # activation_checkpointing: True
  # Dynamic batching settings
  dynamic_batching:
    enabled: True
    # Target number of tokens for training microbatches
    train_mb_tokens: ${mul:${policy.max_total_sequence_length}, ${policy.train_micro_batch_size}}
    # Target number of tokens for logprob microbatches
    logprob_mb_tokens: ${mul:${policy.max_total_sequence_length}, ${policy.logprob_batch_size}}
    # Round sequence lengths to the nearest multiple of this value for bucketing
    sequence_length_round: 64
  # Other policy settings like max_total_sequence_length, train_micro_batch_size, etc.
  # max_total_sequence_length: 4096
  # train_micro_batch_size: 4
  # logprob_batch_size: 8
Alternatively, you can enable these features and configure them via command-line overrides when running a script (e.g., run_grpo_math.py):
uv run examples/run_grpo_math.py \
    --config=<your_base_config.yaml> \
    policy.dtensor_cfg.enabled=True \
    policy.dynamic_batching.enabled=True
# Optionally append further dynamic batching or dtensor overrides, for example:
#   policy.dynamic_batching.train_mb_tokens=16384
#   policy.dynamic_batching.logprob_mb_tokens=32768
#   policy.dtensor_cfg.tensor_parallel_size=2
Make sure to adjust train_mb_tokens, logprob_mb_tokens, and other parameters according to your sequence length and batch size configuration.
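As a quick check of how these budgets relate to the other settings, the sketch below reproduces the multiplication performed by the `${mul:...}` expressions in the YAML example, using the commented example values from that block (4096, 4, and 8). The helper name `mb_token_budget` is hypothetical.

```python
def mb_token_budget(max_total_sequence_length: int, micro_batch_size: int) -> int:
    """Token budget for one microbatch: sequence length times microbatch size."""
    return max_total_sequence_length * micro_batch_size


# Example values taken from the commented defaults in the YAML snippet.
train_mb_tokens = mb_token_budget(4096, 4)
logprob_mb_tokens = mb_token_budget(4096, 8)
print(train_mb_tokens, logprob_mb_tokens)  # 16384 32768
```

These are the same values that appear in the optional command-line overrides, so changing the sequence length or batch sizes should prompt a matching change to the token budgets.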
💎 Broad Model Support (including Gemma3)
NeMo RL enables users to leverage powerful open models from families such as Qwen, Llama, and Gemma for reinforcement learning. For this v0.2.1 release, we've enhanced support, particularly for Gemma3 models, addressing their unique characteristics like tied weights across all model sizes (which require special handling for tensor parallelism) and specific vLLM initialization needs. NeMo RL automatically handles these model quirks to ensure seamless training and inference. For more details on this, please see our Model Quirks guide.
🛠️ Bug Fixes
- Gradient Accumulation: Fixed an issue where naively averaging losses during gradient accumulation, especially with varying sequence lengths, produced inaccurate loss values (see #266).
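To see why naive averaging goes wrong, consider the small sketch below. It is an illustration of the general problem, not the code from #266: the function names and the per-token loss values are made up, and token-weighted averaging is shown as one common remedy.

```python
from typing import List


def naive_accumulated_loss(microbatches: List[List[float]]) -> float:
    """Average the per-microbatch mean losses, ignoring how many tokens each holds."""
    means = [sum(mb) / len(mb) for mb in microbatches]
    return sum(means) / len(means)


def token_weighted_loss(microbatches: List[List[float]]) -> float:
    """Average over all tokens, so each token contributes equally."""
    total = sum(sum(mb) for mb in microbatches)
    count = sum(len(mb) for mb in microbatches)
    return total / count


# Hypothetical per-token losses: a 2-token microbatch and an 8-token microbatch.
microbatches = [[1.0, 1.0], [4.0] * 8]
print(naive_accumulated_loss(microbatches))  # 2.5
print(token_weighted_loss(microbatches))     # 3.4
```

The naive result overweights the short microbatch; the two values only agree when every microbatch contains the same number of tokens.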
📊 Release Runs
We have provided TensorBoard logs for our release runs to give you a head start on what to expect from our recipes.
To view these logs easily, we've provided a Google Colab notebook that downloads and serves them.
What's Changed
- fix: Fix fsdp1 grad clipping and log grad norm by @ashors1 in #251
- fix: Update DPO and SFT configs to use dtensor by @ashors1 in #256
- chore: better logging when insufficient resources by @terrykong in #271
- feat: E2E multi-turn RL example with a sliding puzzle game by @SahilJain314 in #242
- docs: instruct users to git clone before beginning by @terrykong in #257
- fix: add bibtex entry by @parthchadha in #273
- feat: Updated Name to NeMo RL by @SahilJain314 in #265
- docs: Correcting build issues and CI by @aschilling-nv in #270
- fix: improve port selection and exiting early from ray.sub by @terrykong in #272
- feat: publish convergence/release runs by @terrykong in #214
- fix: fixes #264 where tied weights check didn't work on fsdp1 by @parthchadha in #284
- feat: Add hydra style overrides to SFT by @hemildesai in #208
- feat: rename ratio_eps_{min/max} to ratio_clip_{min/max} for clarity by @SahilJain314 in #283
- ci: add eval functional test by @yuki-666 in #269
- chore: add isort rules and pyflakes in ruff/precommit by @terrykong in #291
- test: add a test that checks if recipes can be merged into the base config by @terrykong in #288
- feat: Remove 'last 100' hack for math verifier by @SahilJain314 in #287
- chore: Remove online hf checkpointing by @ashors1 in #285
- fix: Fixed max seqlen not respected correctly by @SahilJain314 in #299
- chore: Remove outdated comment in DPO config by @ashors1 in #293
- fix: fix dtype of empty token_ids for consistency by @ashors1 in #290
- ci: Add initial code coverage report by @chtruong814 in #268
- feat: add qwen3 support by @gshennvm in #289
- feat: config.json -> config.yaml to keep configs in the same representation by @terrykong in #314
- fix: Step LR scheduler once per grpo step by @ashors1 in #305
- perf: update sft and dpo recipes to use bf16 by @ashors1 in #302
- fix: Add division by temperature in training model by @parthchadha in #316
- ci: Add DPO convergence recipes by @ashors1 in #297
- docs: large tech doc edit by @terrykong in #303
- feat: mute math verify to dev null by @SahilJain314 in #319
- docs: Add an example for saving a HF checkpoint E2E by @terrykong in #320
- fix: Fixed capitalization of 'NVIDIA/nemo-rl' -> 'NVIDIA/NeMo-RL' in URL refs. by @SahilJain314 in #330
- feat: Add support for gemma-3 by @yfw in #298
- test: switch tests to qwen3 0.6B by @terrykong in #315
- docs: fix the front page readme heading levels by @terrykong in #336
- fix: Loosen threshold for dpo functional test by @ashors1 in #344
- feat: Add deepscaler dataset by @abukharin-nv in #335
- fix: reinitialize ray cluster if required by @parthchadha in #341
- feat: dual-clip in grpo loss by @ZhiyuLi-Nvidia in #311
- feat: improve eval by @yuki-666 in #325
- fix: sliding_window_overwrite by @ZhiyuLi-Nvidia in #331
- docs: add docs for local concurrent clusters and fix paths by @terrykong in #346
- feat: p...
Release v0.2.0
⚙️ Advanced Parallelism — FSDP 2, TP & SP for Efficient Training
The headline feature of v0.2 is the new DTensorPolicyWorker.
It enables advanced parallelism (FSDP 2, Tensor Parallelism, and Sequence Parallelism), letting us scale to 32B-parameter models.
Enable it via YAML or CLI overrides:
policy.dtensor_cfg.enabled=True \
policy.dtensor_cfg.tensor_parallel_size=8 \
policy.dtensor_cfg.sequence_parallel=True \
policy.dtensor_cfg.activation_checkpointing=True
🧠 Learning Algorithms — DPO (Direct Preference Optimization)
Our algorithm suite now includes DPO, compatible with both FSDP 1 and DTensor.
uv run examples/run_dpo.py
More examples live in the docs.
🔄 Multi-Turn RL — Tool Use, Games & Beyond
We now support multi-turn generation and training with GRPO.
An E2E example of training to play a sliding puzzle game will be available in the next release, but you can try it by cherry-picking this PR: #242
# 8x80GB GPUs recommended
uv run python examples/run_grpo_sliding_puzzle.py
🏋️♂️ Large-Model Support — Native PyTorch up to 32 B @ 16k sequence length
FSDP 2 + TP + SP make RL and SFT on 32 B models possible:
uv run ./examples/run_grpo_math.py \
--config examples/configs/grpo_math_8B.yaml \
policy.model_name='Qwen/Qwen2.5-32B' \
policy.generation.vllm_cfg.tensor_parallel_size=4 \
policy.max_total_sequence_length=16384 \
cluster.num_nodes=16 \
policy.dtensor_cfg.enabled=True \
policy.dtensor_cfg.tensor_parallel_size=8 \
policy.dtensor_cfg.sequence_parallel=True \
policy.dtensor_cfg.activation_checkpointing=True
Full multi-node walkthrough in the docs.
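For a rough sense of how the GPUs in this command are divided, the sketch below applies the usual relation world size = data-parallel size × tensor-parallel size. It assumes 8 GPUs per node, which the command does not state, and that all GPUs not used for tensor parallelism form the FSDP (data-parallel) dimension; the helper name is hypothetical.

```python
def data_parallel_size(num_nodes: int, gpus_per_node: int, tensor_parallel_size: int) -> int:
    """Number of FSDP/data-parallel groups left after tensor parallelism."""
    world_size = num_nodes * gpus_per_node
    if world_size % tensor_parallel_size != 0:
        raise ValueError("world size must be divisible by tensor_parallel_size")
    return world_size // tensor_parallel_size


# cluster.num_nodes=16 and policy.dtensor_cfg.tensor_parallel_size=8 come from
# the command; gpus_per_node=8 is an assumption.
dp_size = data_parallel_size(num_nodes=16, gpus_per_node=8, tensor_parallel_size=8)
print(dp_size)  # 16
```

Under these assumptions, each tensor-parallel group of 8 GPUs fits within a single node, and the 16 groups shard the model with FSDP 2.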
🛡️ Environment Isolation — Per-Worker Deps with uv
In NeMo RL, workers can now launch cached, isolated uv virtual environments with their own Python dependencies, a setup we've found to be significantly faster than Ray's built-in conda/pip/uv flow. Details here.
🐞 Known Issues
- FSDP 1 gradient-clipping bug — see #251
- Qwen 32 B perf tweaks coming in the next patch
- Gemma3 convergence: #236
- SFT/DPO configs default to FSDP1, which is not recommended for 1B models with tied embeddings (#256). Enabling DTensor manually resolves the error.
- V100 configuration: #259
- The default SFT and DPO configs in examples/configs set policy.dtensor_cfg.enabled=False, but dtensor must be enabled to run with the default 1B models. Please make sure to set policy.dtensor_cfg.enabled=True when running with the default SFT and DPO configs.
📊 Release Runs
We have provided TensorBoard logs for our release runs to give you a head start on what to expect from our recipes.
You may download them here and serve them with tensorboard:
mkdir v0.2.0
tar -xzf release_runs.tar.gz -C v0.2.0/
tensorboard serve --logdir v0.2.0/
🚧 Coming soon… : In future releases, we will share a tensorboard viewer to make it easier to view and compare release runs.
What's Changed
- fix: ray.sub race condition when overlapping srun commands on same node by @terrykong in #39
- feat: add gpu mem and util logging to wandb/tensorboard by @terrykong in #37
- ci: tests now run with HF_DATASETS_CACHE to speed up e2e time by @terrykong in #41
- fix: update the instructions for multi-node setup; change the title f… by @parthchadha in #78
- fix: Mixed Prec memory improvements and better default configs (converge-able) by @SahilJain314 in #32
- fix: Remove reference of tokenizer from generation backend (#75) by @parthchadha in #82
- feat: unit test metric tracking by @terrykong in #40
- fix: unit test error when coverage wasn't specified by @terrykong in #88
- ci: temporarily disable CI on main since PRs must be up to date before merge by @terrykong in #91
- fix: error out early if ray cluster does not have resources by @parthchadha in #89
- ci: skip functional until more capacity available and/or tests speed up by @terrykong in #94
- feat: evaluation implement by @yuki-666 in #16
- fix: gradient should be averaged instead of summed across mbs by @parthchadha in #86
- fix: Use separate step_metric for GPU Monitoring by @yfw in #92
- feat: Update sft config to use single GPU by @ashors1 in #90
- fix: Grammar nit by @SahilJain314 in #98
- feat: add capability to set min/max eps separately as proposed in the… by @parthchadha in #95
- fix: change format messages to out of place by @KiddoZhu in #77
- fix: correct version and use setuptools.dynamic metadata for version/readme by @terrykong in #104
- fix: remove usage of vllm to get device uuid and instead use nvidia-m… by @parthchadha in #105
- fix: Change optional-dependencies to dependency-groups by @hemildesai in #81
- feat: Add support for hydra style overrides by @hemildesai in #80
- fix: Do not initialize reference model for sft by @ashors1 in #71
- fix: change grpo default to use 64 prompts per step and 32 generation… by @parthchadha in #111
- feat: use cuda_graph by default for vllm by @parthchadha in #116
- fix: ensure that we check for pad_token and not assume pad_token==eos… by @parthchadha in #120
- ci: Consolidate tests by @chtruong814 in #27
- feat: support local venvs for dependency isolation by @terrykong in #102
- fix: make message formatting compatible with tokenizers with no bos/eos token by @ashors1 in #118
- fix: reset prefix cache when sleep is called to ensure prefix cache i… by @parthchadha in #112
- ci: Fix unit test summary by @chtruong814 in #128
- fix: fix error padding by @yuki-666 in #87
- feat: Distributed checkpointing by @ashors1 in #99
- ci: Add DCO placeholder check for merge queue by @chtruong814 in #147
- ci: Clarify DCO check in merge_group by @chtruong814 in #154
- fix: host ip resolution uses ray vs socket by @terrykong in #153
- test: Add grpo/reinforce/ppo loss tests (prep for incoming vocab parallel changes) by @SahilJain314 in #162
- fix: always test vllm by @parthchadha in #167
- docs: Fix doc build warnings and add external CI config by @mckimn in #157
- fix: allow configuring ray ports in ray.sub in case conflict on cluster by @terrykong in #173
- feat: support arbitrary end_strings by @yuki-666 in #96
- ci: labels for docs/L0/L1/L2 and run even if only doc test by @terrykong in #181
- fix: don't use cuda-graphs for vllm generation by @parthchadha in #187
- ci: Update to include public/ folder for pages deployment by @mckimn in #182
- docs: run tests with --group test to avoid missing test deps by @terrykong in #188
- fix: default to less verbose logging + uv-venv log once per worker by @terrykong in #141
- docs: Correcting file names by @aschilling-nv in #161
- fix: convert DCP to HF script works without ray cluster by @terrykong in #185
- docs: remove backticks from uv.md title by @terrykong in #179
- feat: add a unique seed for each vllm llm engine by @parthchadha in #171
- fix: unit test script halts on first failure by @terrykong in #189
- feat: Upgrade to vllm v1 runtime by @parthchadha in #170
- ci: R...
Release v0.1.1
Patch release on top of v0.1.0
🛠️ Provides more stable mixed-precision configurations and resolves OOMs observed with Llama 8B
🛠️ Fixes a race condition in ray.sub where pyxis can fail if subsequent srun commands are run too early (with --overlap)
What's Changed
- fix: ray.sub race condition when overlapping srun commands on same node by @terrykong in #39
- feat: add gpu mem and util logging to wandb/tensorboard by @terrykong in #37
- ci: tests now run with HF_DATASETS_CACHE to speed up e2e time by @terrykong in #41
- fix: update the instructions for multi-node setup; change the title f… by @parthchadha in #78
- fix: Mixed Prec memory improvements and better default configs (converge-able) by @SahilJain314 in #32
Known Issues
- GPU memory and utilization logging in wandb/tensorboard has a bug when enabled. This is tracked in #83
Full Changelog: v0.1.0...v0.1.1
Release v0.1.0
- ✅ Fast Generation - vLLM backend for optimized inference
- ✅ HuggingFace Integration - Works with 1-8B models (Qwen1.5, Llama)
- ✅ Distributed Training - FSDP support and Ray-based infrastructure
- ✅ Environment Support - Support for multi-environment training.
- ✅ Learning Algorithms - GRPO (Group Relative Policy Optimization) and SFT (Supervised Fine-Tuning)
- ✅ Worker Isolation - Process isolation between RL Actors (no worries about global state)
What's Changed
- ci: Add initial GHA by @chtruong814 in #1
- feat: reinforcer initial commit by @terrykong in #3
- Checkpointing fixes by @ashors1 in #9
- docs: Move adding_new_models doc to guides section by @parthchadha in #11
- fix: disable mixed precision training until #13 is resolved by @parthchadha in #14
- docs: Small update to sft documentation by @ashors1 in #12
- ci: Update unit tests to run on self-hosted runner by @chtruong814 in #6
- feat: SFT improvements: refactor and add validation and checkpointing by @ashors1 in #15
- docs: GRPO documentation and Configuration cleanup by @SahilJain314 in #7
- feat: lots of fixes by @terrykong in #17
- feat: Configurable precision by @SahilJain314 in #19
- ci: OPTIONAL -> IS_OPTIONAL by @terrykong in #22
- feat: disable ray usage collection stats by default by @terrykong in #24
- docs: refresh our PR template by @terrykong in #23
- docs: micro doc update with a helpful reminder on environment variables by @SahilJain314 in #20
- fix: disable usage stats more forcefully since container env took precedence by @terrykong in #25
- feat: Enable amp with autocast (fix poor bf16 convergence on GRPO by @SahilJain314 in #26
- feat: Use openmathinstruct2 training in grpo math example by @parthchadha in #18
- docs: Updated adding models docs to fix latex rendering errors and fix math by @SahilJain314 in #28
- fix: updated stale cluster.md by @terrykong in #30
- feat: SFT convergence run changes by @yfw in #21
- docs: Add SFT quickstart by @ashors1 in #29
- feat: Change vllm frac to 0.6 by @parthchadha in #31
New Contributors
- @chtruong814 made their first contribution in #1
- @terrykong made their first contribution in #3
- @ashors1 made their first contribution in #9
- @parthchadha made their first contribution in #11
- @yfw made their first contribution in #21
Known Issues
- There is a known bug with SFT checkpointing that requires the full model to be gathered on GPU before saving a checkpoint. This causes OOM for larger model sizes. If you run into OOM when checkpointing, disable checkpointing by adding checkpointing.enabled=False to your run command.
Full Changelog: https://github.com/NVIDIA/NeMo-RL/commits/v0.1.0