This is the official repository of Reasoning360, a project dedicated to open research on large-scale reasoning models. The repository currently includes data processing and filtering tools, reinforcement learning (RL) training pipelines, and an evaluation suite. It's initialized from verl.
- Our paper analyzing and improving multi-domain RL for LLM reasoning with the Guru data, "Revisiting Reinforcement Learning for LLM Reasoning from A Cross-Domain Perspective", is out on arXiv.
- The ready-to-train 92K Guru RL data across six domains is released under the LLM360 Hugging Face organization.
git clone git@github.com:LLM360/Reasoning360.git
cd Reasoning360
conda create -n Reasoning360 python=3.12
conda activate Reasoning360
conda install -c nvidia/label/cuda-12.4.0 cuda-toolkit cuda-nvcc
pip install uv # using uv to install packages is faster than pip
uv pip install torch==2.6.0
uv pip install flash-attn==2.7.3 --no-build-isolation
uv pip install -e .[gpu,math,vllm,test]
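After installation, a quick sanity check can confirm that the pinned PyTorch and flash-attn builds import correctly. A minimal sketch; it only assumes a CUDA-capable GPU is visible:

import torch
import flash_attn

# Confirm the pinned versions and that CUDA is visible to PyTorch.
print("torch:", torch.__version__, "| cuda:", torch.version.cuda)
print("flash-attn:", flash_attn.__version__)
print("GPU available:", torch.cuda.is_available())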
Alternatively, you can refer to the verl installation guide for setup.
The full ready-to-train 92K Guru RL data is already released under the LLM360 Hugging Face organization! If you would like to build (or walk through) the data pipeline from scratch, we also provide detailed guides for data preparation and for filtering by data difficulty level.
Quick data check:
from datasets import load_dataset
# Load dataset
train_data = load_dataset("LLM360/guru-RL-92k", split="train", streaming=True)
print(f"Columns: {train_data.column_names}")
print(f"First item: {next(iter(train_data))}")
Download the data and prepare it into .parquet files, the default format expected by the training scripts. We provide a simple script, scripts/tools/download_guru.py, to download and organize the Guru data, placing all dataset files for training, online evaluation, and offline evaluation into local directories.
By default, training files will be put in ./data/train, online evaluation files in ./data/online_eval, and offline evaluation files in ./data/offline_eval.
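To spot-check the downloaded files, you can open one shard directly. A minimal sketch, assuming the default ./data/train layout and that pandas/pyarrow are installed:

import glob
import pandas as pd

# List the downloaded training shards and peek at the first one.
files = sorted(glob.glob("./data/train/*.parquet"))
print(f"Found {len(files)} training files, e.g. {files[:3]}")
df = pd.read_parquet(files[0])
print(df.columns.tolist())
print(df.iloc[0])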
Run tools/change_tokenizer_config.py if you want to apply a 'think'-aware chat template. Currently, only the Qwen model families are supported.
python tools/change_tokenizer_config.py -i <input_model_directory> -o <output_model_directory>
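To check the result, you can render a sample prompt with the converted tokenizer and confirm the 'think'-aware template is applied. A minimal sketch (pass the same <output_model_directory> you used above):

from transformers import AutoTokenizer

# Render a chat prompt with the modified template to inspect it.
tok = AutoTokenizer.from_pretrained("<output_model_directory>")
prompt = tok.apply_chat_template(
    [{"role": "user", "content": "What is 1 + 1?"}],
    tokenize=False,
    add_generation_prompt=True,
)
print(prompt)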
SandboxFusion provides secure code execution for training and evaluation. It supports both containerized SLURM deployment and local installation.
Option 1: SLURM Container (Recommended for production)
# Download container
enroot import docker://varad0309/code_sandbox:server
# Deploy with SLURM
sbatch scripts/sandbox/run_server.sbatch
Option 2: Local Installation (Development only)
git clone https://github.com/bytedance/SandboxFusion.git
cd SandboxFusion
poetry install
make run-online
Configure sandbox servers in your training script:
# Single server
export SANDBOX_FUSION_SERVERS="fs-mbz-gpu-044"
# Multiple servers (load balancing)
export SANDBOX_FUSION_SERVERS="fs-mbz-gpu-044,fs-mbz-gpu-045"
Or programmatically:
from verl.utils.reward_score.coder1.sandboxfusion_exec import code_exec_sandboxfusion
# Single server
success, output = code_exec_sandboxfusion(
    code="print('Hello')",
    sandbox_servers="fs-mbz-gpu-044"
)

# Multiple servers
success, output = code_exec_sandboxfusion(
    code="print('Hello')",
    sandbox_servers=["fs-mbz-gpu-044", "fs-mbz-gpu-045"]
)
For detailed setup instructions, see verl/utils/reward_score/coder1/README.md.
We provide a multi-node training SLURM script using the math3k subset data (not the full data) for ablation. Change SHARED_DATA_PATH to your own data path.
sbatch scripts/train/example_multinode_rl_qwen32b_base.sh
If you need to train on the full data or include the STEM data in Guru, host the LLM-as-verifier model before launching the training:
sbatch scripts/tools/serve_llm_as_verifier.sh
Then set export STEM_LLM_JUDGE_URL="<STEM_LLM_JUDGE_URL>" to the LLM-as-verifier server IP. It currently uses one GPU node to serve a 1.5B general-verifier model.
(TODO: build a single-node script not using slurm)
We provide an evaluation suite of 17 tasks supporting multi-node inference based on verl. For a quick start, run
sbatch scripts/offline_eval/example_multinode_eval_guru7b.sh
Please refer to scripts/offline_eval/README.md if you would like to learn more about or customize the evaluation details.
Step 1: Data preprocessing script
In preprocessing, we will process the data into a list of dictionaries, and then save it into a parquet file.
- Prompt preprocessing
  We need to process the raw question into a prompt ready to be fed to the LLM. An example is [1]. Each data point is processed into a dict, and we need to specify the prompt within the data dict:
  "prompt": [{ "role": "user", "content": prompt }],
  Note that when we use verl to train the model, this will be turned into a prompt string with apply_chat_template. In particular:
  - You will probably need to add some task-specific instruction to the question. E.g., for math, we concatenate the raw problem with "Please output the final answer within \\boxed{}." so that it is easy to extract the answer from the model output.
  - You don't need to instruct the model to "think step by step" or to "wrap your thinking process in <think> </think>". This is taken care of by verl during training with apply_chat_template. To enable this, we have a script to modify the chat template of a Hugging Face model (currently only tested on Qwen).
  - Please add an instruction under the README of data_preprocess.
- Reward function
  We need to specify the information regarding reward calculation for the new dataset. This typically includes three keys in the dict: data_source, reward_model["ground_truth"], and extra_info. In our training, we use default_compute_score, which routes the reward computation to a specific reward function implementation based on data_source; ground_truth and extra_info are passed as arguments. (A minimal end-to-end preprocessing sketch follows this list.)
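Putting the two items above together, here is a minimal end-to-end preprocessing sketch. It is illustrative only: load_raw_examples, the my_task data source name, and the output path are hypothetical placeholders, not part of the repo.

import pandas as pd

def load_raw_examples():
    # Hypothetical loader; replace with however the new dataset is obtained.
    return [{"question": "What is 2 + 3?", "answer": "5"}]

records = []
for idx, ex in enumerate(load_raw_examples()):
    # Task-specific instruction appended so the answer is easy to extract.
    prompt = ex["question"] + " Please output the final answer within \\boxed{}."
    records.append({
        "prompt": [{"role": "user", "content": prompt}],
        "data_source": "my_task",  # routes reward computation in default_compute_score
        "reward_model": {"ground_truth": ex["answer"]},
        "extra_info": {"index": idx},
    })

# Save the list of dicts as a parquet file, the format expected by the training scripts.
pd.DataFrame(records).to_parquet("./data/train/my_task.parquet")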
Step 2: Reward function
Please look at default_compute_score. You can write your own reward function for the task and import it there. It is highly recommended to add a timeout module to avoid training getting stuck on a corner case of the reward function (example).
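As a sketch of such a reward function with a timeout guard (the function name, signature, and 5-second limit are illustrative assumptions; SIGALRM-based timeouts only work in the main thread on Unix):

import re
import signal
from contextlib import contextmanager

@contextmanager
def time_limit(seconds):
    # Abort reward computation that runs longer than `seconds`.
    def handler(signum, frame):
        raise TimeoutError("reward computation timed out")
    signal.signal(signal.SIGALRM, handler)
    signal.alarm(seconds)
    try:
        yield
    finally:
        signal.alarm(0)

def compute_score_my_task(solution_str, ground_truth, extra_info=None):
    # Score 1.0 if the last \boxed{...} answer matches the ground truth, else 0.0.
    try:
        with time_limit(5):
            matches = re.findall(r"\\boxed\{([^}]*)\}", solution_str)
            answer = matches[-1].strip() if matches else ""
            return 1.0 if answer == str(ground_truth).strip() else 0.0
    except Exception:
        return 0.0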
Step 3: Training script
Verify the inclusion of a new dataset by actually training models with it. Please refer to the template script in this repo.
We use pre-commit to enforce code formatting. Before committing, make sure you have run the pre-commit checks.
pre-commit install
pre-commit run --all-files
Please make a pull request including the data preprocessing script, reward function, and the training script.
If you find the repo helpful, please cite:
@misc{cheng2025revisiting,
title = {Revisiting Reinforcement Learning for LLM Reasoning from A Cross-Domain Perspective},
author = {Zhoujun Cheng and Shibo Hao and Tianyang Liu and Fan Zhou and Yutao Xie and Feng Yao and Yuexin Bian and Yonghao Zhuang and Nilabjo Dey and Yuheng Zha and Yi Gu and Kun Zhou and Yuqi Wang and Yuan Li and Richard Fan and Jianshu She and Chengqian Gao and Abulhair Saparov and Haonan Li and Taylor W. Killian and Mikhail Yurochkin and Zhengzhong Liu and Eric P. Xing and Zhiting Hu},
journal = {arXiv preprint arXiv:2506.14965},
year = {2025},
doi = {10.48550/arXiv.2506.14965},
url = {https://arxiv.org/abs/2506.14965}
}