NVIDIA/NeMo-RL

Nemo-Reinforcer: A Scalable and Efficient Post-Training Library for Models Ranging from 1 GPU to 1000s, and from Tiny to >100B Parameters

Nemo-Reinforcer is a scalable and efficient post-training library designed for models ranging from 1 GPU to thousands, and from tiny to over 100 billion parameters.

What you can expect:

  • Seamless integration with HuggingFace for ease of use, allowing users to leverage a wide range of pre-trained models and tools.
  • High-performance implementation with Megatron Core, supporting various parallelism techniques for large models (>100B) and long context lengths.
  • Efficient resource management using Ray, enabling scalable and flexible deployment across different hardware configurations.
  • Flexibility with a modular design that allows easy integration and customization.
  • Comprehensive documentation that is both detailed and user-friendly, with practical examples.

Features

✅ Available now | 🔜 Coming in v0.2

  • Fast Generation - vLLM backend for optimized inference
  • HuggingFace Integration - Works with 1-8B models (Qwen1.5, Llama)
  • Distributed Training - FSDP support and Ray-based infrastructure
  • Environment Support - Multi-environment training
  • Learning Algorithms - GRPO (Group Relative Policy Optimization) and SFT (Supervised Fine-Tuning)
  • Worker Isolation - Process isolation between RL Actors (no worries about global state)
  • 🔜 Larger Model Support - Native PyTorch support for models up to 70B parameters
  • 🔜 Advanced Parallelism - FSDP2, TP, SP, and sequence packing for efficient training
  • 🔜 Environment Isolation - Dependency isolation between components
  • 🔜 DPO Algorithm - Direct Preference Optimization for alignment

Installation

# For faster setup we use `uv`
pip install uv

# Specify a virtual env that uses Python 3.12
uv venv -p python3.12.9 .venv
# Install NeMo-Reinforcer with vllm
uv pip install -e .
# Install NeMo-Reinforcer with dev/test dependencies
uv pip install -e '.[dev,test]'

# Use `uv run` to launch all commands.
# It is recommended not to activate the venv; using `uv run` instead ensures a
# consistent environment across different shells and sessions.
# Example: uv run python examples/run_grpo_math.py
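
As an optional sanity check (not part of the official setup), you can confirm that the environment sees your GPUs:

# Assumes PyTorch was pulled in as a dependency by the install above
uv run python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"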

Quick start

Reminder: Don't forget to set your HF_HOME and WANDB_API_KEY (if needed). You'll also need to run huggingface-cli login to access gated models such as Llama.
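
For example (the cache path and API key below are placeholders, not values shipped with this repo):

export HF_HOME=/path/to/your/hf_cache
export WANDB_API_KEY=your_wandb_api_key   # only needed if you enable wandb logging
huggingface-cli login                      # required for gated models such as Llama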

SFT

We provide a sample SFT experiment that uses the SQuAD dataset.

Single Node

The experiment is configured to run on 8 GPUs. On a machine with access to 8 GPUs, you can launch the experiment as follows:

uv run python examples/run_sft.py

This trains Llama-3.1-8B on 8 GPUs. To run on a single GPU, we need to override a few of the experiment settings. We replace the 8B model with a smaller 1B model, decrease the batch size, and update the cluster configuration to use a single GPU:

uv run python examples/run_sft.py \
  policy.model_name="meta-llama/Llama-3.2-1B" \
  policy.train_global_batch_size=16 \
  sft.val_global_batch_size=16 \
  cluster.gpus_per_node=1

Refer to examples/configs/sft.yaml for the full list of parameters that can be overridden.
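
As an illustrative sketch (the checkpoint directory and run name below are arbitrary examples, not defaults from the config), any of those parameters can be overridden the same way:

uv run python examples/run_sft.py \
  checkpointing.checkpoint_dir="results/sft_llama1b" \
  logger.wandb_enabled=True \
  logger.wandb.name="sft-llama1b"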

Multi-node

For distributed training across multiple nodes:

Before running any uv run command, set UV_CACHE_DIR to a directory that all workers can read.

export UV_CACHE_DIR=/path/that/all/workers/can/access/uv_cache
# Run from the root of NeMo-Reinforcer repo
NUM_ACTOR_NODES=2
# Add a timestamp to make each job name unique
TIMESTAMP=$(date +%Y%m%d_%H%M%S)

# SFT experiment uses Llama-3.1-8B model
COMMAND="uv pip install -e .; uv run ./examples/run_sft.py --config examples/configs/sft.yaml cluster.num_nodes=2 cluster.gpus_per_node=8 checkpointing.checkpoint_dir='results/sft_llama8b_2nodes' logger.wandb_enabled=True logger.wandb.name='sft-llama8b'" \
RAY_DEDUP_LOGS=0 \
UV_CACHE_DIR=YOUR_UV_CACHE_DIR \
CONTAINER=YOUR_CONTAINER \
MOUNTS="$PWD:$PWD" \
sbatch \
    --nodes=${NUM_ACTOR_NODES} \
    --account=YOUR_ACCOUNT \
    --job-name=YOUR_JOBNAME_${TIMESTAMP} \
    --partition=YOUR_PARTITION \
    --time=4:0:0 \
    --gres=gpu:8 \
    ray.sub

GRPO

We provide a reference GRPO experiment configuration for math benchmarks, trained on the OpenMathInstruct-2 dataset.

Single Node

To run GRPO on a single GPU for Llama-3.2-1B-Instruct:

# Run the GRPO math example using a 1B parameter model
uv run python examples/run_grpo_math.py

By default, this uses the configuration in examples/configs/grpo_math_1B.yaml. You can customize parameters with command-line overrides. For example, to run on 8 GPUs:

# Run the GRPO math example with a 1B parameter model on 8 GPUs
uv run python examples/run_grpo_math.py \
  cluster.gpus_per_node=8

You can override any of the parameters listed in the YAML configuration file. For example:

uv run python examples/run_grpo_math.py \
  policy.model_name="Qwen/Qwen2-1.5B" \
  checkpointing.checkpoint_dir="results/qwen1_5b_math" \
  logger.wandb_enabled=True \
  logger.wandb.name="grpo-qwen1_5b_math" \
  logger.num_val_samples_to_print=10

Multi-node

For the general multi-node setup, refer to the SFT Multi-node section above. The only thing that differs from SFT is the COMMAND:

# Run from the root of NeMo-Reinforcer repo
NUM_ACTOR_NODES=2
# Add a timestamp to make each job name unique
TIMESTAMP=$(date +%Y%m%d_%H%M%S)

# grpo_math_8b uses Llama-3.1-8B-Instruct model
COMMAND="uv pip install -e .; uv run ./examples/run_grpo_math.py --config examples/configs/grpo_math_8B.yaml cluster.num_nodes=2 checkpointing.checkpoint_dir='results/llama8b_2nodes' policy.train_global_batch_size=64 logger.wandb_enabled=True logger.wandb.name='grpo-llama8b_math'" \
RAY_DEDUP_LOGS=0 \
UV_CACHE_DIR=YOUR_UV_CACHE_DIR \
CONTAINER=YOUR_CONTAINER \
MOUNTS="$PWD:$PWD" \
sbatch \
    --nodes=${NUM_ACTOR_NODES} \
    --account=YOUR_ACCOUNT \
    --job-name=YOUR_JOBNAME_${TIMESTAMP} \
    --partition=YOUR_PARTITION \
    --time=4:0:0 \
    --gres=gpu:8 \
    ray.sub

Cluster Start

Please refer to the Cluster Start documentation for how to get started on Slurm or Kubernetes.
