RLFactory is an easy and efficient RL post-training framework for Agentic Learning.
RL-Factory decouples the environment from RL post-training, enabling training with just a tool config and reward function while supporting async tool-calling to make RL post-training 2x faster.
The current version natively supports one-click DeepSearch training and features multi-turn tool-calling, model-judge rewards, and training of multiple models including Qwen3. More easy-to-use and efficient agentic learning modules will be added in upcoming releases.
Our goal is to enable users to focus on reward logic and tool setup for fast agentic learning with minimal code, while advanced developers can focus on improving training efficiency and model performance.
For ease of use, we decouple the environment from RL-based post-training, which brings several advantages.
- Easy-to-design reward function: Calculate rewards through rules, model judging, and even tools to meet all your reward-design requirements (a minimal sketch follows this list).
- Seamless tool setup: Simply provide the configuration file for your MCP tools and custom tools to integrate them into RL learning.
- Multi-agent extension: Convert your agent to the MCP format for easy multi-agent interaction. LLM chat simulation will also be added in the future to improve multi-turn dialogue capabilities.
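To make the reward side concrete, here is a minimal rule-based sketch in Python. The function name and signature are hypothetical (RLFactory's actual reward interface is described in `docs/rl_factory/main_tutorial.md`), and the `<answer>` tag convention follows Search-R1-style outputs; a model judge or tool call could replace the exact-match check.

```python
import re

def compute_score(solution_str: str, ground_truth: str) -> float:
    """Hypothetical rule-based reward: extract the final answer from the
    rollout and compare it against the ground truth (exact-match style)."""
    # Pull the text inside <answer>...</answer>, if present.
    match = re.search(r"<answer>(.*?)</answer>", solution_str, re.DOTALL)
    if match is None:
        return 0.0  # malformed output gets no reward
    answer = match.group(1).strip().lower()
    return 1.0 if answer == ground_truth.strip().lower() else 0.0

# Example usage:
print(compute_score("I think <answer>Paris</answer>", "paris"))  # -> 1.0
```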
For efficient learning, we develop several essential modules within the RL post-training framework, making training 2x faster.
- Efficient tool-call: Improve online RL training efficiency through batch processing and asynchronous parallel tool calls (see the sketch after this list).
- Efficient reward calculation: Deploy LRM (like QwQ-32B) in a distributed manner for efficient model judging, and use asynchronous parallelism to speed up reward calculation.
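As a rough illustration of the asynchronous tool-call pattern (the function names below are placeholders, not RLFactory's API), a batch of tool calls can be issued concurrently with `asyncio`, so rollout latency is bounded by the slowest call rather than the sum of all calls:

```python
import asyncio

async def call_tool(query: str) -> str:
    """Illustrative stand-in for a single (possibly slow) tool call,
    e.g. a search request issued over HTTP."""
    await asyncio.sleep(0.1)  # simulate network latency
    return f"result for: {query}"

async def batch_tool_calls(queries: list[str]) -> list[str]:
    """Fire all tool calls in a batch concurrently instead of one by one."""
    return await asyncio.gather(*(call_tool(q) for q in queries))

# Example: 32 rollouts in a batch each need one tool result.
results = asyncio.run(batch_tool_calls([f"query {i}" for i in range(32)]))
print(len(results))  # -> 32, obtained in roughly one tool-call round trip
```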
Going forward, we will continue to prioritize "easy" and "efficient".
- Easier: Use the WebUI to process data, define tools & environments, adjust training configurations, and manage projects. (The WebUI is under rapid development.)
- More efficient: Continuously iterating and improving the training framework (such as AsyncLLMEngine) and RL training algorithms.
We’ll keep a fast release cycle to quickly deliver and polish the upcoming features.
- Version 0.1
- Environment decoupling: define your tool-use environment easily (tool setup and reward function definition)
- Qwen3 model support: quickly train your agent model using Qwen3 (much stronger than Qwen2.5 at tool-calling)
- Efficient training: 2x faster than existing frameworks for rapid model iteration (mainly through async tool-use)
- Version 0.2 (within 2 weeks)
- WebUI: build a WebUI for data processing, tool & environment definition, training configuration, and project management
- More efficient training: support the AsyncLLMEngine for more efficient rollout
- More models: test more models (such as DeepSeek, Llama, etc.) and add corresponding support configurations
- More applications: help create more demos (such as TravelPlanner) to adapt to more benchmarks
- Dependencies (Key)
```
CUDA: >=12.0 (Recommended: 12.4)
Python: >=3.10 (Recommended: 3.10)
vllm: >=0.8.3 (Recommended: 0.8.5)  # For Qwen3 model support
```
- Install Requirements
```bash
pip3 install accelerate bitsandbytes datasets deepspeed==0.16.4 einops flash-attn==2.7.0.post2 isort jsonlines loralib optimum packaging peft "pynvml>=12.0.0" "ray[default]==2.46.0" tensorboard torch torchmetrics tqdm transformers==4.51.3 transformers_stream_generator wandb wheel
pip3 install vllm==0.8.5  # Mainly for Qwen3 model support
pip3 install "qwen-agent[code_interpreter]"
pip3 install llama_index bs4 pymilvus infinity_client codetiming tensordict==0.6 omegaconf torchdata==0.10.0 hydra-core easydict dill python-multipart mcp
pip3 install -e . --no-deps
pip3 install faiss-gpu-cu12  # Optional, needed for end-to-end search model training with rag_server
```
Note: Currently, only Qwen models are tested.
- What do you need to provide?
- An environment is enough! See the minimal tutorial in `docs/rl_factory/main_tutorial.md` (a hypothetical tool-config sketch follows the training command below).
- Training Command
```bash
# Before running, modify MODEL_PATH, REWARD_MODEL_PATH, and several actor_rollout_ref.env parameters as needed
bash main_grpo.sh
```
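As a concrete illustration of what "an environment" means, a tool configuration in the standard MCP client format (the same `mcpServers` shape Qwen-Agent accepts) might look like the sketch below. This is hypothetical: the server entries, the file name `mcp_tools.json`, and how the config is referenced from training are placeholders, not RLFactory's exact schema (see the tutorial for that).

```python
import json

# Hypothetical MCP tool configuration in the standard "mcpServers" client format.
# The server names, commands, and arguments below are placeholders -- point them
# at whatever MCP servers your task needs.
mcp_tools = {
    "mcpServers": {
        "search": {
            "command": "python",
            "args": ["-m", "my_search_mcp_server"],  # placeholder local server
        },
        "time": {
            "command": "uvx",
            "args": ["mcp-server-time"],
        },
    }
}

# Write the config to a file so it can be referenced from the training setup.
with open("mcp_tools.json", "w") as f:
    json.dump(mcp_tools, f, indent=2)
```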
- In `docs/rl_factory/main_tutorial.md`, we provide an RLFactory reproduction example of Search-R1. We use `Qwen3-4B` and `Qwen3-8B` as the base models for RL training.
- Easy: Start with Qwen3 and MCP tools to quickly train your own DeepSearch Agent Model.
  - Provide only one tool configuration and one reward function to start training!
  - Qwen3 demonstrates significant advantages in agentic learning. It can accurately call tools even without SFT, and it also supports the MCP protocol.
- Efficient: Enjoy the efficient training enabled by asynchronous parallel tool-calls.
  - Compared to Search-R1 built on the original verl, training time is reduced by a factor of 1.5 to 2, and the efficiency gain is even greater when a model judge is involved.
  - After 100 steps of training (about 5 hours on 8×A100), `Qwen3-4B` achieves a score of 0.458 and `Qwen3-8B` achieves a score of 0.463.
- The table below presents our training results under identical computational resources, software, and verl versions:
  - RLFactory trains in about half the time of Search-R1, demonstrating high efficiency.
  - Qwen3 as the base model outperforms Qwen2.5, enabling domain-specific tool-calling via RL post-training without SFT.
| Model Name | Test Score (NQ) | Total Training Time (100 steps) | Seconds per Step | Training Resources |
|---|---|---|---|---|
| Search-R1-Qwen2.5-3B-Instruct-GRPO | 0.356 | 7.39 h | 266 s | A100 × 8 |
| Search-R1-Qwen2.5-7B-Instruct-GRPO | 0.451 | 9.25 h | 333 s | A100 × 8 |
| Search-R1-Qwen3-4B-GRPO | 0.420 | 7.95 h | 286 s | A100 × 8 |
| RLFactory-Qwen3-4B-GRPO | 0.458 | 5.30 h | 190 s | A100 × 8 |
| RLFactory-Qwen3-8B-GRPO | 0.463 | 5.76 h | 207 s | A100 × 8 |
We welcome all users and developers to contribute code to RLFactory. If you have any questions, encounter bugs, or would like to collaborate on development, please feel free to contact us!
- Submit an issue directly on GitHub.
- Contact us via email at chaijiajun@meituan.com or gjyin@outlook.com.
- Join our WeChat group and become a pioneer in Agent model training!
This repo benefits from verl, Search-R1, and Qwen-Agent. Thanks for their wonderful work! We will also introduce TRL in the future to further expand the applicability of our framework.