
NeMo-Reinforcer: A Scalable and Efficient Post-Training Library for Models Ranging from Tiny to >100B Parameters, Scaling from 1 GPU to 100s

NeMo-Reinforcer is a scalable and efficient post-training library designed for models ranging from tiny to over 100 billion parameters, scaling from a single GPU to thousands.

What you can expect:

  • Seamless HuggingFace integration for ease of use, giving access to a wide range of pre-trained models and tools.
  • High-performance implementation with Megatron Core, supporting various parallelism techniques for large models (>100B parameters) and long context lengths.
  • Efficient resource management using Ray, enabling scalable and flexible deployment across different hardware configurations.
  • Flexibility through a modular design that allows easy integration and customization.
  • Comprehensive, user-friendly documentation with practical examples.

Features

✅ Available now | 🔜 Coming in v0.2

  • ✅ Fast Generation - vLLM backend for optimized inference
  • ✅ HuggingFace Integration - Works with 1-8B models (Qwen1.5, Llama)
  • ✅ Distributed Training - FSDP support and Ray-based infrastructure
  • ✅ Environment Support - Multi-environment training
  • ✅ Learning Algorithms - GRPO (Group Relative Policy Optimization) and SFT (Supervised Fine-Tuning); a minimal GRPO sketch follows this list
  • ✅ Worker Isolation - Process isolation between RL Actors (no worries about global state)
  • 🔜 Larger Model Support - Native PyTorch support for models up to 70B parameters
  • 🔜 Advanced Parallelism - FSDP2, TP, SP, and sequence packing for efficient training
  • 🔜 Environment Isolation - Dependency isolation between components
  • 🔜 DPO Algorithm - Direct Preference Optimization for alignment
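
As promised above, here is a minimal sketch of the group-relative advantage computation at the heart of GRPO. It illustrates the published GRPO formulation generically; it is not NeMo-Reinforcer's actual implementation, and all names are illustrative.

import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # rewards: (num_prompts, group_size) -- one scalar reward for each of the
    # group_size responses sampled per prompt.
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    # Each response is scored relative to its own group's statistics, which is
    # why GRPO needs no learned value function (critic).
    return (rewards - mean) / (std + eps)

# Example: 2 prompts, 4 sampled responses each
rewards = torch.tensor([[1.0, 0.0, 1.0, 0.0],
                        [0.5, 0.5, 1.0, 0.0]])
print(group_relative_advantages(rewards))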

Installation

# For faster setup we use `uv`
pip install uv

# Create a virtual env that uses Python 3.12
uv venv -p python3.12.9 .venv
# Install NeMo-Reinforcer with vLLM
uv pip install -e .
# Install NeMo-Reinforcer with dev/test dependencies
uv pip install -e '.[dev,test]'

# Use `uv run` to launch any runs.
# It is recommended not to activate the venv directly; `uv run` ensures a
# consistent environment across different shells and sessions.
# Example: uv run python examples/run_grpo_math.py

Quick Start

Reminder: Set your HF_HOME and WANDB_API_KEY (if needed) before launching runs, and run huggingface-cli login if you plan to use gated Llama models. A minimal setup sketch follows.
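
A minimal setup sketch; the paths and key below are placeholders to replace with your own values.

export HF_HOME=/path/to/hf_cache          # HuggingFace download/cache directory
export WANDB_API_KEY=your_wandb_api_key   # only needed if wandb logging is enabled
huggingface-cli login                     # required for gated models such as Llama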

SFT

We provide a sample SFT experiment that uses the SQuAD dataset.

Single Node

The experiment is set up to run on 8 GPUs. If your machine has access to 8 GPUs, you can launch the experiment as follows:

uv run python examples/run_sft.py

This trains Llama-3.1-8B on 8 GPUs. To run on a single GPU, we have to override a few of the experiment settings: replace the 8B model with a smaller 1B model, decrease the batch size, and update the cluster configuration to use a single GPU:

uv run python examples/run_sft.py \
  policy.model_name="meta-llama/Llama-3.2-1B" \
  policy.train_global_batch_size=16 \
  sft.val_global_batch_size=16 \
  cluster.gpus_per_node=1

Refer to sft.yaml for a full list of parameters that can be overridden.
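
These dotted override keys map one-to-one onto the nested YAML structure. The excerpt below is illustrative only, reconstructed from the overrides above rather than copied from examples/configs/sft.yaml; the real file has more keys, and the batch-size defaults shown are placeholders.

policy:
  model_name: "meta-llama/Llama-3.1-8B"  # assumed default (the 8-GPU run trains Llama-3.1-8B)
  train_global_batch_size: 128           # placeholder default
sft:
  val_global_batch_size: 128             # placeholder default
cluster:
  gpus_per_node: 8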

Multi-Node

For distributed training across multiple nodes, set UV_CACHE_DIR to a directory that all workers can read before running any uv run command:

export UV_CACHE_DIR=/path/that/all/workers/can/access/uv_cache
# Run from the root of the NeMo-Reinforcer repo
NUM_ACTOR_NODES=2
# Add a timestamp to make each job name unique
TIMESTAMP=$(date +%Y%m%d_%H%M%S)

# The SFT experiment uses the Llama-3.1-8B model
COMMAND="uv pip install -e .; uv run ./examples/run_sft.py --config examples/configs/sft.yaml cluster.num_nodes=${NUM_ACTOR_NODES} cluster.gpus_per_node=8 checkpointing.checkpoint_dir='results/sft_llama8b_2nodes' logger.wandb_enabled=True logger.wandb.name='sft-llama8b'" \
RAY_DEDUP_LOGS=0 \
UV_CACHE_DIR=YOUR_UV_CACHE_DIR \
CONTAINER=YOUR_CONTAINER \
MOUNTS="$PWD:$PWD" \
sbatch \
    --nodes=${NUM_ACTOR_NODES} \
    --account=YOUR_ACCOUNT \
    --job-name=YOUR_JOBNAME_${TIMESTAMP} \
    --partition=YOUR_PARTITION \
    --time=4:0:0 \
    --gres=gpu:8 \
    ray.sub
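
Once the job is submitted, standard Slurm tooling can track it. The log file name below assumes sbatch's default output pattern (slurm-<jobid>.out); ray.sub may redirect output elsewhere.

squeue -u $USER                                # is the job pending or running?
tail -f slurm-<jobid>.out                      # follow driver output (default sbatch log name)
sacct -j <jobid> --format=JobID,State,Elapsed  # summary after completion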

GRPO

We provide a reference GRPO experiment configuration for training on math benchmarks with the OpenInstructMath2 dataset.

Single Node

To run GRPO on a single GPU for Llama-3.2-1B-Instruct:

# Run the GRPO math example using a 1B parameter model
uv run python examples/run_grpo_math.py

By default, this uses the configuration in examples/configs/grpo_math_1B.yaml. You can customize parameters with command-line overrides. For example, to run on 8 GPUs:

# Run the GRPO math example using a 1B parameter model using 8 GPUs
uv run python examples/run_grpo_math.py \
  cluster.gpus_per_node=8

You can override any of the parameters listed in the YAML configuration file. For example:

uv run python examples/run_grpo_math.py \
  policy.model_name="Qwen/Qwen2-1.5B" \
  checkpointing.checkpoint_dir="results/qwen1_5b_math" \
  logger.wandb_enabled=True \
  logger.wandb.name="grpo-qwen1_5b_math" \
  logger.num_val_samples_to_print=10
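
These dotted overrides mirror OmegaConf-style dotlist syntax. Below is a rough sketch of how such overrides merge into a base YAML config; using omegaconf here is an assumption for illustration, and NeMo-Reinforcer's own config loader may work differently.

from omegaconf import OmegaConf

# Load the base experiment config, then merge CLI-style dotlist overrides on top.
base = OmegaConf.load("examples/configs/grpo_math_1B.yaml")
overrides = OmegaConf.from_dotlist([
    "policy.model_name=Qwen/Qwen2-1.5B",
    "cluster.gpus_per_node=8",
])
cfg = OmegaConf.merge(base, overrides)  # later values win on conflicts
print(cfg.policy.model_name)            # Qwen/Qwen2-1.5B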

Multi-Node

# Run from the root of the NeMo-Reinforcer repo
NUM_ACTOR_NODES=2
# Add a timestamp to make each job name unique
TIMESTAMP=$(date +%Y%m%d_%H%M%S)

# grpo_math_8b uses the Llama-3.1-8B-Instruct model
COMMAND="uv pip install -e .; uv run ./examples/run_grpo_math.py --config examples/configs/grpo_math_8B.yaml cluster.num_nodes=${NUM_ACTOR_NODES} checkpointing.checkpoint_dir='results/llama8b_2nodes' logger.wandb_enabled=True logger.wandb.name='grpo-llama8b_math'" \
RAY_DEDUP_LOGS=0 \
UV_CACHE_DIR=YOUR_UV_CACHE_DIR \
CONTAINER=YOUR_CONTAINER \
MOUNTS="$PWD:$PWD" \
sbatch \
    --nodes=${NUM_ACTOR_NODES} \
    --account=YOUR_ACCOUNT \
    --job-name=YOUR_JOBNAME_${TIMESTAMP} \
    --partition=YOUR_PARTITION \
    --time=4:0:0 \
    --gres=gpu:8 \
    ray.sub

Cluster Start

See Cluster Start for how to get started on Slurm or Kubernetes.
