Nemo-Reinforcer: A Scalable and Efficient Post-Training Library for Models Ranging from 1 GPU to 1000s, and from Tiny to >100B Parameters
Nemo-Reinforcer is a scalable and efficient post-training library that supports everything from single-GPU prototyping of tiny models to training models with over 100 billion parameters on thousands of GPUs.
What you can expect:
- Seamless integration with HuggingFace for ease of use, allowing users to leverage a wide range of pre-trained models and tools.
- High-performance implementation with Megatron Core, supporting various parallelism techniques for large models (>100B) and long context lengths.
- Efficient resource management using Ray, enabling scalable and flexible deployment across different hardware configurations.
- Flexibility with a modular design that allows easy integration and customization.
- Comprehensive documentation that is both detailed and user-friendly, with practical examples.
✅ Available now | 🔜 Coming in v0.2
- ✅ Fast Generation - vLLM backend for optimized inference
- ✅ HuggingFace Integration - Works with 1-8B models (Qwen1.5, Llama)
- ✅ Distributed Training - FSDP support and Ray-based infrastructure
- ✅ Environment Support - Multi-environment training
- ✅ Learning Algorithms - GRPO (Group Relative Policy Optimization) and SFT (Supervised Fine-Tuning)
- ✅ Worker Isolation - Process isolation between RL Actors (no worries about global state)
- 🔜 Larger Model Support - Native PyTorch support for models up to 70B parameters
- 🔜 Advanced Parallelism - FSDP2, TP, SP, and sequence packing for efficient training
- 🔜 Environment Isolation - Dependency isolation between components
- 🔜 DPO Algorithm - Direct Preference Optimization for alignment
# For faster setup we use `uv`
pip install uv
# Specify a virtual env that uses Python 3.12
uv venv -p python3.12.9 .venv
# Install NeMo-Reinforcer with vllm
uv pip install -e .
# Install NeMo-Reinforcer with dev/test dependencies
uv pip install -e '.[dev,test]'
# Use `uv run` to launch jobs.
# We recommend not activating the venv and using `uv run` instead, since it
# ensures consistent environment usage across different shells and sessions.
# Example: uv run python examples/run_grpo_math.py
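Once installed, a quick sanity check can confirm the environment works; the snippet below is illustrative only and simply verifies the interpreter version and that PyTorch can see your GPUs:

# Optional sanity check: print the Python version and whether PyTorch detects a GPU
uv run python -c "import sys, torch; print(sys.version); print('CUDA available:', torch.cuda.is_available())"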
Reminder: Set HF_HOME and WANDB_API_KEY (if needed). You'll also need to run huggingface-cli login to access gated Llama models.
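For example (the paths and key below are placeholders):

# Example environment setup (placeholder values)
export HF_HOME=/path/to/your/hf_cache
export WANDB_API_KEY=your_wandb_api_key  # only needed if logging to Weights & Biases
huggingface-cli login                    # required for gated models such as Llama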
We provide a sample SFT experiment that uses the SQuAD dataset.
The experiment is set up to run on 8 GPUs. On a machine with 8 GPUs, you can launch the experiment as follows:
uv run python examples/run_sft.py
This trains Llama3.1-8B on 8 GPUs. To run on a single GPU, override a few of the experiment settings: replace the 8B model with a smaller 1B model, decrease the batch size, and update the cluster configuration to use a single GPU:
uv run python examples/run_sft.py \
policy.model_name="meta-llama/Llama-3.2-1B" \
policy.train_global_batch_size=16 \
sft.val_global_batch_size=16 \
cluster.gpus_per_node=1
Refer to examples/configs/sft.yaml for a full list of parameters that can be overridden.
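Overrides compose freely. For instance, a single-GPU run with Weights & Biases logging enabled might look like the following sketch; the logger.* keys are borrowed from the multi-node example below and are assumed to apply to single-GPU runs as well:

# Single-GPU SFT run with W&B logging (logger.* keys borrowed from the multi-node example below)
uv run python examples/run_sft.py \
    policy.model_name="meta-llama/Llama-3.2-1B" \
    policy.train_global_batch_size=16 \
    sft.val_global_batch_size=16 \
    cluster.gpus_per_node=1 \
    logger.wandb_enabled=True \
    logger.wandb.name="sft-llama1b-1gpu"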
For distributed training across multiple nodes:
Set UV_CACHE_DIR to a directory that can be read from all workers before running any `uv run` command:
export UV_CACHE_DIR=/path/that/all/workers/can/access/uv_cache
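If you want to confirm that the cache directory is actually visible from the compute nodes, a simple check such as the following can help (the srun flags depend on your cluster setup):

# Optional: verify the cache path is readable from the allocated nodes (flags depend on your cluster)
srun --nodes=2 --ntasks-per-node=1 ls -d "$UV_CACHE_DIR"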
# Run from the root of NeMo-Reinforcer repo
NUM_ACTOR_NODES=2
# Add a timestamp to make each job name unique
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
# SFT experiment uses Llama-3.1-8B model
COMMAND="uv pip install -e .; uv run ./examples/run_sft.py --config examples/configs/sft.yaml cluster.num_nodes=2 cluster.gpus_per_node=8 checkpointing.checkpoint_dir='results/sft_llama8b_2nodes' logger.wandb_enabled=True logger.wandb.name='sft-llama8b'" \
RAY_DEDUP_LOGS=0 \
UV_CACHE_DIR=YOUR_UV_CACHE_DIR \
CONTAINER=YOUR_CONTAINER \
MOUNTS="$PWD:$PWD" \
sbatch \
--nodes=${NUM_ACTOR_NODES} \
--account=YOUR_ACCOUNT \
--job-name=YOUR_JOBNAME \
--partition=YOUR_PARTITION \
--time=4:0:0 \
--gres=gpu:8 \
ray.sub
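After submission, the job can be monitored with standard Slurm tooling; the exact log file name depends on ray.sub and your Slurm defaults:

# Check job status and follow the Slurm output log (file name depends on ray.sub / Slurm defaults)
squeue -u $USER
tail -f slurm-<jobid>.out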
We provide a reference GRPO experiment configuration for math benchmarks using the OpenMathInstruct-2 dataset.
To run GRPO on a single GPU for Llama-3.2-1B-Instruct:
# Run the GRPO math example using a 1B parameter model
uv run python examples/run_grpo_math.py
By default, this uses the configuration in examples/configs/grpo_math_1B.yaml. You can customize parameters with command-line overrides. For example, to run on 8 GPUs:
# Run the GRPO math example using a 1B parameter model using 8 GPUs
uv run python examples/run_grpo_math.py \
cluster.gpus_per_node=8
You can override any of the parameters listed in the YAML configuration file. For example:
uv run python examples/run_grpo_math.py \
policy.model_name="Qwen/Qwen2-1.5B" \
checkpointing.checkpoint_dir="results/qwen1_5b_math" \
logger.wandb_enabled=True \
logger.wandb.name="grpo-qwen1_5b_math" \
logger.num_val_samples_to_print=10
For the general multi-node setup, refer to the SFT multi-node documentation. The only thing that differs from SFT is the COMMAND:
# Run from the root of NeMo-Reinforcer repo
NUM_ACTOR_NODES=2
# Add a timestamp to make each job name unique
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
# grpo_math_8b uses Llama-3.1-8B-Instruct model
COMMAND="uv pip install -e .; uv run ./examples/run_grpo_math.py --config examples/configs/grpo_math_8B.yaml cluster.num_nodes=2 checkpointing.checkpoint_dir='results/llama8b_2nodes' policy.train_global_batch_size=64 logger.wandb_enabled=True logger.wandb.name='grpo-llama8b_math'" \
RAY_DEDUP_LOGS=0 \
UV_CACHE_DIR=YOUR_UV_CACHE_DIR \
CONTAINER=YOUR_CONTAINER \
MOUNTS="$PWD:$PWD" \
sbatch \
--nodes=${NUM_ACTOR_NODES} \
--account=YOUR_ACCOUNT \
--job-name=YOUR_JOBNAME \
--partition=YOUR_PARTITION \
--time=4:0:0 \
--gres=gpu:8 \
ray.sub
Please visit Cluster Start to learn how to get started on Slurm or Kubernetes.