We present Perception-R1, a scalable reinforcement learning (RL) framework that applies Group Relative Policy Optimization (GRPO) during MLLM post-training. Key innovations:
- 🎯 Perceptual Perplexity Analysis: We introduce a novel analytical framework that reveals critical thresholds for effective reinforcement learning in perception tasks, providing insight into when and how RL can improve visual understanding.
- 🚀 GRPO Optimization: Scalable policy learning with carefully crafted rule-based reward shaping (a brief sketch follows below).
- 🔥 Surprising Performance: Perception-R1 achieves substantial improvements across multiple visual perception benchmarks, notably reaching 31.9% mAP on the COCO2017 validation set, making it the first 3B-scale MLLM to achieve such performance.
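As a rough illustration of the group-relative update that GRPO relies on, the sketch below shows how a group of rollouts scored by a rule-based reward is turned into per-rollout advantages. This is a minimal, simplified sketch under our own naming; the actual implementation lives in the training code.

```python
# Minimal sketch of GRPO's group-relative advantage (illustrative, not the
# exact Perception-R1 implementation). For each prompt, several rollouts are
# sampled, each is scored by a rule-based reward, and each rollout's advantage
# is its reward normalized by the group's mean and standard deviation.
import numpy as np

def group_relative_advantages(rewards, eps=1e-6):
    """rewards: scalar rule-based rewards for one group of rollouts."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Example: four rollouts for the same image/question.
print(group_relative_advantages([0.9, 0.2, 0.5, 0.5]))
```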
News:
- 2025-04-10 🎄: Initial release of Perception-R1 models and evaluation code. 🧐: Released the training code and data of Perception-R1 on the grounding task.
- 2025-05-27 🎉: Additional perception tasks coming soon (detection, OCR, counting...).
# Create and activate a new conda environment
conda create -n pr1 python=3.10 -y
conda activate pr1
# Clone the repository and install dependencies
git clone https://github.com/linkangheng/PR1.git
cd PR1
pip install -e ".[dev]"
pip install flash-attn==2.7.0.post2 --no-build-isolation
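Optionally, you can sanity-check the environment before launching any training or evaluation. The snippet below is our own illustrative check, not part of the repository:

```python
# Quick environment check (illustrative, not part of the repository):
# confirms the pinned flash-attn build imports and that a CUDA device is visible.
import torch
import flash_attn

print("torch:", torch.__version__)
print("flash-attn:", flash_attn.__version__)  # expected: 2.7.0.post2
print("CUDA available:", torch.cuda.is_available())
```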
Before training, modify the script to specify your model and data paths. Then run the experiment using:
bash local_scripts/train/train_qwen2_2b_vl_grounding.sh
The training script includes comprehensive configurations for hyperparameters, data loading, and model checkpointing. For custom training scenarios, you can adjust parameters such as learning rate, batch size, and optimization settings directly in the script.
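For intuition about the rule-based reward shaping that drives training, the sketch below shows one plausible reward for the grounding task: the IoU between the predicted and ground-truth boxes, zeroed out when the output format is invalid. This is an illustrative simplification with our own function names; the exact reward functions are defined in the training code.

```python
# Illustrative rule-based reward for grounding (simplified; the exact reward
# shaping is defined in the training code). The predicted box is scored by its
# IoU with the ground-truth box, and malformed outputs receive zero reward.
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def grounding_reward(pred_box, gt_box, format_ok=True):
    return box_iou(pred_box, gt_box) if format_ok else 0.0

print(grounding_reward((10, 10, 50, 50), (12, 8, 48, 52)))
```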
Download the evaluation data from 🤗 Hugging Face and unzip it into the eval/ folder.

Important: the COCO images are not included in the package and must be downloaded separately from the official COCO website; place them in the eval/images/coco/ directory.

The directory structure should be:
eval/
├── images/
│   ├── coco/
│   ├── pixmo-count/
│   └── ocr/
└── jsons/
    ├── counting/
    ├── grounding/
    ├── ocr/
    └── detection/
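If you prefer to fetch the evaluation package programmatically, a minimal sketch using huggingface_hub is shown below; the repo id is a placeholder, so substitute the actual dataset repository linked from the project page.

```python
# Illustrative download sketch using huggingface_hub. The repo_id below is a
# placeholder, not the real dataset repository; replace it before running.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="<perception-r1-eval-data>",  # placeholder
    repo_type="dataset",
    local_dir="eval/",
)
```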
Counting evaluation:
python eval/evaluate_counting.py \
    --model_path 'Kangheng/PR1-Qwen2-VL-2B-Counting' \
    --anno_dir 'eval/jsons/counting/' \
    --image_dir 'eval/images/'
Grounding evaluation:
python eval/evaluate_grounding.py \
    --model_path 'Kangheng/PR1-Qwen2-VL-2B-Grounding' \
    --anno_dir 'eval/jsons/grounding/' \
    --image_dir 'eval/images/coco/'
Detection evaluation requires pycocotools:
pip install pycocotools
Then run:
python eval/evaluate_detection.py \
    --model_path Kangheng/PR1-Qwen2.5-VL-3B-Detection \
    --anno_dir 'eval/jsons/detection/coco_val2017.json' \
    --image_dir 'eval/images/coco/val2017/'
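For reference, the COCO-style mAP reported above is the standard AP@[0.50:0.95] computed by pycocotools. The sketch below shows that computation in isolation; the predictions file name is a placeholder, and evaluate_detection.py may already handle this step internally.

```python
# Sketch of the standard COCO mAP computation with pycocotools (illustrative;
# evaluate_detection.py may already do this internally). "predictions.json"
# is a placeholder for a COCO-format detection results file.
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("eval/jsons/detection/coco_val2017.json")
coco_dt = coco_gt.loadRes("predictions.json")
evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()  # first line printed is AP@[0.50:0.95], i.e. the mAP
```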
OCR evaluation:
python eval/evaluate_ocr.py \
    --model_path Kangheng/PR1-Qwen2-VL-2B-OCR \
    --anno_dir 'eval/jsons/ocr/' \
    --image_dir 'eval/images/ocr/'
This work builds upon several important open-source projects. We would like to acknowledge the following repositories that inspired our research:
If you find our paper and code useful in your research, please consider giving us a star ⭐ and citing our work ✏️:
@article{yu2025perception,
title={Perception R1: Pioneering Perception Policy with Reinforcement Learning},
author={Yu, En and Lin, Kangheng and Zhao, Liang and Yin, Jisheng and Peng, Yuang and Wei, Haoran and Sun, Jianjian and Han, Chunrui and Ge, Zheng and Zhang, Xiangyu and Jiang, Daxin and Wang, Jingyu and Tao, Wenbing},
journal={arXiv preprint arXiv:2504.07954},
year={2025}
}