
POLARIS

🌠 A POst-training recipe for scaling RL on Advanced ReasonIng modelS 🚀

Links: Notion · Twitter · Hugging Face Model · Hugging Face Data · Paper

Overview

Polaris is an open-source post-training recipe that leverages reinforcement learning (RL) scaling to further optimize models that already have strong reasoning capabilities. Our work demonstrates that even state-of-the-art models like Qwen3-4B can achieve remarkable gains on complex reasoning tasks when enhanced with Polaris. Trained with open-source data and academic-grade resources, Polaris elevates the performance of open-recipe reasoning models to an entirely new level. In benchmark evaluations, our approach even outperforms leading commercial systems such as Claude-4-Opus, Grok-3-Beta, and o3-mini-high (2025/01/03).

This work is done as part of the HKU NLP Group and Bytedance Seed. Our training and evaluation codebase is built on Verl. To foster progress in scaling RL on advanced reasoning models, we are open-sourcing our complete dataset, code, and training details for the research community.

🔥 Releases

[2025/06/20]

  • 🧾 The blog post detailing our training recipe: Notion and Blog
  • 🤗 Model weights: Polaris-4B-Preview and Polaris-7B-Preview. Polaris-4B-Preview is fine-tuned from Qwen3-4B and Polaris-7B-Preview is fine-tuned from Deepseek-R1-Distill-Qwen-7B.
  • 📚 The filtered training dataset with difficulty distribution: Polaris-Dataset-53K
  • ⏰ Full training code and training scripts will be available in one week.

Running environment

cd Polaris
pip install -e ./verl 
pip install -e ./
pip install transformers==4.51.0
pip install vllm==0.8.4
pip install tensordict==0.6.2

# do not use xformers backend
unset VLLM_ATTENTION_BACKEND

Demo

import torch
from transformers import AutoTokenizer
from vllm import SamplingParams, LLM


example = {
        "question": "Find the largest possible real part of \\[(75+117i)z+\\frac{96+144i}{z}\\]where $z$ is a complex number with $|z|=4$.\nLet's think step by step and output the final answer within \\boxed{}.",
        "answer": "540"
}


model = "/path/to/Polaris-4B-Preview"

tokenizer = AutoTokenizer.from_pretrained(model)

llm = LLM(
    model=model,
    dtype=torch.bfloat16,
    tensor_parallel_size=1,
    gpu_memory_utilization=0.9
)

sampling_params = SamplingParams(
    temperature=1.4,
    top_p=1.0,
    max_tokens=90000
)

question = example["question"]
answer = example["answer"]
output = llm.generate(
            prompts=tokenizer.apply_chat_template(conversation=[{"content": question, "role": "user"}],
                                                  add_generation_prompt=True,
                                                  tokenize=False),
            sampling_params=sampling_params
        )
print(f"***QUESTION***:\n{question}\n***GROUND TRUTH***:\n{answer}\n***MODEL OUTPUT***:\n{output[0].outputs[0].text}\n")

Training

Step 1: Data preparation

The training data used in this work is filtered from DeepScaleR-dataset-40K and AReaL-dataset-106K. We provide parquet files that can be used directly for training. If your data is in json or jsonl format, please use the following command to convert it into parquet format.

# Convert a JSONL file into the corresponding .parquet file
python scripts/data/polaris_dataset.py --jsonl_file data/jsonl_data/polaris-data-53K.jsonl # => data/jsonl_data/polaris-data-53K.parquet
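If you just need the format conversion itself, it amounts to reading one JSON object per line and writing the records out with pandas. The sketch below is only an illustration; the column names and any extra fields expected by the training pipeline are defined in scripts/data/polaris_dataset.py, so prefer that script for producing training data.

import json
import pandas as pd

def jsonl_to_parquet(jsonl_path, parquet_path):
    """Read one JSON object per line and write all records to a parquet file."""
    with open(jsonl_path, "r", encoding="utf-8") as f:
        records = [json.loads(line) for line in f if line.strip()]
    pd.DataFrame(records).to_parquet(parquet_path, index=False)

jsonl_to_parquet("data/jsonl_data/polaris-data-53K.jsonl",
                 "data/jsonl_data/polaris-data-53K.parquet")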

Step 2: Temperature search for diverse rollouts

A temperature search is highly recommended before each training stage, as suggested in the diversity-based rollout sampling section of our blog.

# the following commands search for the optimal training temperature
cd evaluation
# --start: the decoding temperature recommended by the model developers; --model: the base model
python search_optimal_temperature.py --start 0.6 --end 1.5 --step 0.1 --model /path/to/qwen3-4b --n 16 --new_tokens 50000
# after the search finishes, run:
python get_optimal_temperature.py --start 0.6 --end 1.5 --step 0.1 --model /path/to/qwen3-4b
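The search samples several rollouts per question at each candidate temperature and looks for the setting that keeps rollouts diverse without collapsing accuracy, as described in the blog. The snippet below is a minimal sketch of one crude diversity proxy (the fraction of distinct completions among n samples); it is not the logic of search_optimal_temperature.py.

from vllm import LLM, SamplingParams

def distinct_ratio(llm, prompt, temperature, n=16, max_tokens=2048):
    """Sample n completions at the given temperature and return the fraction that are distinct."""
    params = SamplingParams(temperature=temperature, top_p=1.0, n=n, max_tokens=max_tokens)
    outputs = llm.generate(prompts=prompt, sampling_params=params)
    texts = [o.text for o in outputs[0].outputs]
    return len(set(texts)) / n

# Sweep the candidate range and inspect how diversity grows with temperature, e.g.:
# for t in [0.6 + 0.1 * i for i in range(10)]:
#     print(t, distinct_ratio(llm, some_prompt, t))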

Step 3: Multi-stage training

The training scripts for Polaris are available here. Training Polaris-4B-Preview requires at least 4 nodes. Our multi-node training is based on Ray; please run the launch command on every node.
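train_with_ray.py ships with the training code release; the sketch below only illustrates, as an assumption, how a head/worker launcher of this shape typically behaves with Ray (every node attaches to the same cluster, and the head node waits for all workers before kicking off the training script). The argument names mirror the commands below, but the body is not the actual script.

# Illustrative only -- not the actual train_with_ray.py.
import argparse
import subprocess
import time

import ray

parser = argparse.ArgumentParser()
parser.add_argument("--model", required=True)
parser.add_argument("--name", required=True)
parser.add_argument("--n_nodes", type=int, default=4)
parser.add_argument("--head", type=lambda s: s.lower() == "true", default=False)
parser.add_argument("--sh", required=True)
args = parser.parse_args()

# Every node attaches to the Ray cluster started beforehand with `ray start`.
ray.init(address="auto")

if args.head:
    # Head node: wait until all workers have joined, then launch the training script.
    while sum(1 for node in ray.nodes() if node["Alive"]) < args.n_nodes:
        time.sleep(10)
    print("Cluster ready:", ray.cluster_resources())
    # In practice the model path and experiment name would be forwarded to the script.
    subprocess.run(["bash", args.sh], check=True)
else:
    # Worker nodes simply stay attached and contribute their GPUs.
    while True:
        time.sleep(60)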

Stage1-training

# run ray stop if needed
# --name: your experiment name; --head: True on the head node, False on workers
python train_with_ray.py --model /path/to/qwen3-4b --name Polaris-4B-stage1 --n_nodes 4 --head True/False --sh ./scripts/train/polaris_4b_stage1_40k_t1.4.sh

Stage2-training

# convert the checkpoint after stage1-training to hf model
python verl/scripts/model_merger.py --local_dir /path/to/checkpoints/global_step_XXX/actor --target_dir /path/to/hf/stage1-checkpoint
# run ray stop if needed
python train_with_ray.py  --model /path/to/hf/stage1-checkpoint --name Polaris-4B-stage2 --n_nodes 4  --head True/False   --sh ./scripts/train/polaris_4b_stage2_48k_t1.45.sh

Stage3-training

# convert the checkpoint after stage2-training to hf model
python verl/scripts/model_merger.py --local_dir /path/to/checkpoints/global_step_XXX/actor --target_dir /path/to/hf/stage2-checkpoint
# run ray stop if needed
python train_with_ray.py  --model /path/to/hf/stage2-checkpoint --name Polaris-4B-stage3 --n_nodes 4  --head True/False  --sh ./scripts/train/polaris_4b_stage3_52k_t1.5.sh

📊 Evaluation

We recommend decoding with a higher temperature than the one suggested for Qwen3 (0.6 → 1.4). However, do not exceed the temperature used during training. For POLARIS, use a longer response length (> 64K tokens) to avoid performance degradation from truncation, which could otherwise push its performance below that of Qwen3. All other settings remain the same.

Evaluation command based on verl:

./scripts/eval/eval_model_aime24.sh --model [CHECKPOINT_PATH]  --n 32 --max_length 90000  --t 1.4
# for AIME25, use --t 1.4 or 1.45
./scripts/eval/eval_model_aime25.sh --model [CHECKPOINT_PATH]  --n 32 --max_length 90000  --t 1.4
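The avg@k numbers in the results table below are simply the mean accuracy over k sampled completions per problem (for example, 32 samples per AIME problem). If you score your own generations, the computation reduces to the sketch below; how each sample is judged correct is up to the grader (verl's reward function in our setup).

def avg_at_k(per_problem_correct):
    """avg@k: for each problem, the fraction of its k samples that are correct, averaged over problems."""
    return sum(sum(c) / len(c) for c in per_problem_correct) / len(per_problem_correct)

# Example: 2 problems, k = 4 samples each -> (3/4 + 1/4) / 2 = 0.5
print(avg_at_k([[True, True, False, True], [False, False, True, False]]))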

Example inference

Results

Models | AIME24 (avg@32) | AIME25 (avg@32) | Minerva Math (avg@4) | Olympiad Bench (avg@4) | AMC23 (avg@8)
Deepseek-R1-Distill-Qwen-7B | 55.0 | 39.7 | 36.7 | 56.8 | 81.9
AReal-boba-RL-7B | 61.9 | 48.3 | 39.5 | 61.9 | 86.4
Skywork-OR1-7B-Math | 69.8 | 52.3 | 40.8 | 63.2 | 85.3
POLARIS-7B-Preview | 72.6 | 52.6 | 40.2 | 65.4 | 89.0
Deepseek-R1-Distill-Qwen-32B | 72.6 | 54.9 | 42.1 | 59.4 | 84.3
Qwen3-32B | 81.4 | 72.9 | 44.2 | 66.7 | 92.4
Qwen3-4B | 73.8 | 65.6 | 43.6 | 62.2 | 87.2
POLARIS-4B-Preview | 81.2 | 79.4 | 44.0 | 69.1 | 94.8

Acknowledgements

The training and evaluation codebase is heavily built on Verl. The reward function in Polaris is from DeepScaleR. Our model is trained on top of Qwen3-4B and DeepSeek-R1-Distill-Qwen-7B. Thanks for their wonderful work.

Citation

@misc{Polaris2025,
    title = {POLARIS: A Post-Training Recipe for Scaling Reinforcement Learning on Advanced Reasoning Models},
    url = {https://hkunlp.github.io/blog/2025/Polaris},
    author = {An, Chenxin and Xie, Zhihui and Li, Xiaonan and Li, Lei and Zhang, Jun and Gong, Shansan and Zhong, Ming and Xu, Jingjing and Qiu, Xipeng and Wang, Mingxuan and Kong, Lingpeng},
    year = {2025}
}
