This is the official repository of Reasoning360, a project dedicated to open research on large-scale reasoning models. The repository currently includes data processing and filtering tools, reinforcement learning (RL) training pipelines, and an evaluation suite. It's initialized from verl.
- Our paper analyzing and improving multi-domain RL for LLM reasoning with the Guru data, "Revisiting Reinforcement Learning for LLM Reasoning from A Cross-Domain Perspective", is out on arXiv.
- The ready-to-train 92K Guru RL data across six domains is released under the LLM360 Hugging Face organization.
git clone git@github.com:LLM360/Reasoning360.git
cd Reasoning360
conda create -n Reasoning360 python=3.12
conda activate Reasoning360
conda install -c nvidia/label/cuda-12.4.0 cuda-toolkit cuda-nvcc
pip install uv # using uv to install packages is faster than pip
uv pip install torch==2.6.0
uv pip install flash-attn==2.7.3 --no-build-isolation
uv pip install -e .[gpu,math,vllm,test]
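After installation, a quick sanity check can confirm that the pinned PyTorch and flash-attn builds import correctly. A minimal sketch; it only assumes a CUDA-capable GPU is visible:

import torch
import flash_attn

# Confirm the pinned versions and that CUDA is visible to PyTorch.
print("torch:", torch.__version__, "| cuda:", torch.version.cuda)
print("flash-attn:", flash_attn.__version__)
print("GPU available:", torch.cuda.is_available())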
Alternatively, you can refer to the verl installation guide for setup.
The full ready-to-train 92K Guru RL data is already released under the LLM360 Hugging Face organization! If you would like to build (or walk through) the data pipeline from scratch, we also provide detailed guides for data preparation and for filtering by data difficulty level.
Quick data check:
from datasets import load_dataset
# Load dataset
train_data = load_dataset("LLM360/guru-RL-92k", split="train", streaming=True)
print(f"Columns: {train_data.column_names}")
print(f"First item: {next(iter(train_data))}")
Download the data and prepare it into .parquet files, the default format expected by the training scripts. We provide a simple script, scripts/tools/download_guru.py, to download and organize the Guru data, placing all dataset files for training, online evaluation, and offline evaluation into local directories.
By default, training files will be put in ./data/train, online evaluation files in ./data/online_eval, and offline evaluation files in ./data/offline_eval.
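To spot-check the downloaded files, you can open one shard directly. A minimal sketch, assuming the default ./data/train layout and that pandas/pyarrow are installed:

import glob
import pandas as pd

# List the downloaded training shards and peek at the first one.
files = sorted(glob.glob("./data/train/*.parquet"))
print(f"Found {len(files)} training files, e.g. {files[:3]}")
df = pd.read_parquet(files[0])
print(df.columns.tolist())
print(df.iloc[0])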
Run tools/change_tokenizer_config.py if you want to apply a 'think'-aware chat template. Currently, only the Qwen model families are supported.
python tools/change_tokenizer_config.py -i <input_model_directory> -o <output_model_directory>
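To check the result, you can render a sample prompt with the converted tokenizer and confirm the 'think'-aware template is applied. A minimal sketch (pass the same <output_model_directory> you used above):

from transformers import AutoTokenizer

# Render a chat prompt with the modified template to inspect it.
tok = AutoTokenizer.from_pretrained("<output_model_directory>")
prompt = tok.apply_chat_template(
    [{"role": "user", "content": "What is 1 + 1?"}],
    tokenize=False,
    add_generation_prompt=True,
)
print(prompt)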
SandboxFusion provides secure code execution for training and evaluation. It supports both containerized SLURM deployment and local installation.
Option 1: SLURM Container (Recommended for production)
# Download container
enroot import docker://varad0309/code_sandbox:server
# Deploy with SLURM
sbatch scripts/sandbox/run_server.sbatch
Option 2: Local Installation (Development only)
git clone https://github.com/bytedance/SandboxFusion.git
cd SandboxFusion
poetry install
make run-online
Configure sandbox servers in your training script:
# Single server
export SANDBOX_FUSION_SERVERS="fs-mbz-gpu-044"
# Multiple servers (load balancing)
export SANDBOX_FUSION_SERVERS="fs-mbz-gpu-044,fs-mbz-gpu-045"
Or programmatically:
from verl.utils.reward_score.coder1.sandboxfusion_exec import code_exec_sandboxfusion
# Single server
success, output = code_exec_sandboxfusion(
    code="print('Hello')",
    sandbox_servers="fs-mbz-gpu-044"
)

# Multiple servers
success, output = code_exec_sandboxfusion(
    code="print('Hello')",
    sandbox_servers=["fs-mbz-gpu-044", "fs-mbz-gpu-045"]
)
For detailed setup instructions, see verl/utils/reward_score/coder1/README.md.
We provide a multi-node training SLURM script using the math3k subset data (not the full data) for ablation. Change SHARED_DATA_PATH to your own data path.
sbatch scripts/train/example_multinode_rl_qwen32b_base.sh
If you need to train on the full data or include the STEM data in Guru, host the LLM-as-verifier model before launching the training:
sbatch scripts/tools/serve_llm_as_verifier.sh
Then set export STEM_LLM_JUDGE_URL="<STEM_LLM_JUDGE_URL>" to the LLM-as-verifier server IP. It currently uses one GPU node to serve a 1.5B general-verifier model.
(TODO: build a single-node script not using slurm)
We provide an evaluation suite of 17 tasks supporting multi-node inference based on verl. For a quick start, run
sbatch scripts/offline_eval/example_multinode_eval_guru7b.sh
Please refer to scripts/offline_eval/README.md if you would like to learn more about or customize the evaluation details.
Step 1: Data preprocessing script
In preprocessing, we will process the data into a list of dictionaries, and then save it into a parquet file.
- Prompt preprocessing
  We need to process the raw question into a prompt ready to be fed to the LLM. An example is [1]. Each data point is processed into a dict, and we need to specify the prompt within the data dict:
  "prompt": [{ "role": "user", "content": prompt }],
  Note that when we use verl to train the model, this will be turned into a prompt string with apply_chat_template. In particular:
  - You will probably need to add some task-specific instruction to the question. E.g., for math, we concatenate the raw problem with "Please output the final answer within \\boxed{}." so that it is easy to extract the answer from the model output.
  - You don't need to instruct the model to "think step by step" or to "wrap your thinking process in <think> </think>". This is taken care of by verl during training with apply_chat_template. To enable this, we have a script to modify the chat template of a Hugging Face model (currently only tested on Qwen).
  - Please add an instruction under the README of data_preprocess.
- Reward function
  We need to specify the information regarding reward calculation for the new dataset. This typically includes three keys in the dict: data_source, reward_model["ground_truth"], and extra_info. In our training, we use default_compute_score, which routes the reward computation to a specific reward function implementation based on data_source; ground_truth and extra_info are passed as arguments. (A minimal end-to-end preprocessing sketch follows this list.)
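Putting the two items above together, here is a minimal end-to-end preprocessing sketch. It is illustrative only: load_raw_examples, the my_task data source name, and the output path are hypothetical placeholders, not part of the repo.

import pandas as pd

def load_raw_examples():
    # Hypothetical loader; replace with however the new dataset is obtained.
    return [{"question": "What is 2 + 3?", "answer": "5"}]

records = []
for idx, ex in enumerate(load_raw_examples()):
    # Task-specific instruction appended so the answer is easy to extract.
    prompt = ex["question"] + " Please output the final answer within \\boxed{}."
    records.append({
        "prompt": [{"role": "user", "content": prompt}],
        "data_source": "my_task",  # routes reward computation in default_compute_score
        "reward_model": {"ground_truth": ex["answer"]},
        "extra_info": {"index": idx},
    })

# Save the list of dicts as a parquet file, the format expected by the training scripts.
pd.DataFrame(records).to_parquet("./data/train/my_task.parquet")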
Step 2: Reward function
Please look at default_compute_score. You can write your own reward function for the task and import it there. It is highly recommended to add a timeout module to avoid training getting stuck on a corner case of the reward function (example).
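As a sketch of such a reward function with a timeout guard (the function name, signature, and 5-second limit are illustrative assumptions; SIGALRM-based timeouts only work in the main thread on Unix):

import re
import signal
from contextlib import contextmanager

@contextmanager
def time_limit(seconds):
    # Abort reward computation that runs longer than `seconds`.
    def handler(signum, frame):
        raise TimeoutError("reward computation timed out")
    signal.signal(signal.SIGALRM, handler)
    signal.alarm(seconds)
    try:
        yield
    finally:
        signal.alarm(0)

def compute_score_my_task(solution_str, ground_truth, extra_info=None):
    # Score 1.0 if the last \boxed{...} answer matches the ground truth, else 0.0.
    try:
        with time_limit(5):
            matches = re.findall(r"\\boxed\{([^}]*)\}", solution_str)
            answer = matches[-1].strip() if matches else ""
            return 1.0 if answer == str(ground_truth).strip() else 0.0
    except Exception:
        return 0.0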
Step 3: Training script
Verify the inclusion of a new dataset by actually training models with it. Please refer to the template script in this repo.
We use pre-commit to enforce code formatting. Before committing, make sure you have run the pre-commit checks.
pre-commit install
pre-commit run --all-files
Please make a pull request including the data preprocessing script, reward function, and the training script.
If you find the repo helpful, please cite:
@misc{cheng2025revisiting,
title = {Revisiting Reinforcement Learning for LLM Reasoning from A Cross-Domain Perspective},
author = {Zhoujun Cheng and Shibo Hao and Tianyang Liu and Fan Zhou and Yutao Xie and Feng Yao and Yuexin Bian and Yonghao Zhuang and Nilabjo Dey and Yuheng Zha and Yi Gu and Kun Zhou and Yuqi Wang and Yuan Li and Richard Fan and Jianshu She and Chengqian Gao and Abulhair Saparov and Haonan Li and Taylor W. Killian and Mikhail Yurochkin and Zhengzhong Liu and Eric P. Xing and Zhiting Hu},
journal = {arXiv preprint arXiv:2506.14965},
year = {2025},
doi = {10.48550/arXiv.2506.14965},
url = {https://arxiv.org/abs/2506.14965}
}