- Visual (Single) Object Tracking aims to continuously localize and estimate the scale of a target in subsequent video frames, given only its initial state in the first frame. This task can be simplified to template matching between image pairs, with traditional trackers predominantly employing explicit classification-regression modeling through Correlation Filters, Siamese networks, and Vision Transformers (ViT). Leveraging advancements in Multi-Modal Large Language Models (MLLMs) such as Qwen2.5-VL and their robust grounding capabilities, we explore adopting MLLMs for end-to-end tracking, eliminating the need for fragmented subtask modeling.
- R1-Track supports flexible initialization from either text descriptions or bounding boxes.
- SFT or RL_wo_Think
{
"role": "user",
"content": [
{
"type": "image",
"image": xxx,
},
{
"type": "image",
"image": xxx,
},
{"type": "text", "text": "Please identify the target specified by the bounding box [241,66,329,154] in the first image and locate it in the second image. \n Return the coordinates in [x_min,y_min,x_max,y_max] format."},
# R1-Track-100k:
#Given two images, you need to:\n1. Analyze and Identify the target object marked by bounding box <BBOXFLAG> in <image_1>;\n2. Re-locate this target in <image_2>;\n3. Return [x_min, y_min, x_max, y_max] coordinates of the target in <image_2>.
],
}
- RL_with_Think
{
"role": "user",
"content": [
{
"type": "image",
"image": xxx,
},
{
"type": "image",
"image": xxx,
},
{"type": "text", "text": "You FIRST think about the reasoning process as an internal monologue and then provide the final answer. \n The reasoning process MUST BE enclosed within <think> </think> tags. The final answer MUST BE put in <answer> </answer> tags. \n Please identify the target specified by the bounding box [241,66,329,154] in the first image and locate it in the second image. \n Return the coordinates in [x_min,y_min,x_max,y_max] format."},
# R1-Track-100k:
#You FIRST think about the reasoning process as an internal monologue and then provide the final answer. \n The reasoning process MUST BE enclosed within <think> </think> tags. The final answer MUST BE put in <answer> </answer> tags. \n Given two images, you need to:\n1. Analyze and Identify the target object marked by bounding box <BBOXFLAG> in <image_1>;\n2. Re-locate this target in <image_2>;\n3. Return [x_min, y_min, x_max, y_max] coordinates of the target in <image_2>.
],
}
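For reference, below is a minimal sketch of how such a message can be assembled and how the predicted box can be pulled out of the reply. The helper names and the regex are illustrative assumptions, not part of the released code.

```python
import re
from typing import List, Optional


def build_track_messages(img1: str, img2: str, bbox: List[int], with_think: bool) -> list:
    """Illustrative helper: assemble the two-image tracking prompt shown above."""
    think_prefix = (
        "You FIRST think about the reasoning process as an internal monologue and then "
        "provide the final answer. \n The reasoning process MUST BE enclosed within "
        "<think> </think> tags. The final answer MUST BE put in <answer> </answer> tags. \n "
    )
    bbox_str = "[" + ",".join(str(v) for v in bbox) + "]"
    question = (
        f"Please identify the target specified by the bounding box {bbox_str} in the first "
        "image and locate it in the second image. \n "
        "Return the coordinates in [x_min,y_min,x_max,y_max] format."
    )
    return [{
        "role": "user",
        "content": [
            {"type": "image", "image": img1},
            {"type": "image", "image": img2},
            {"type": "text", "text": (think_prefix if with_think else "") + question},
        ],
    }]


def parse_bbox(reply: str) -> Optional[List[int]]:
    """Extract the last [x_min,y_min,x_max,y_max] from the reply, with or without <answer> tags."""
    answer = re.search(r"<answer>(.*?)</answer>", reply, re.S)
    text = answer.group(1) if answer else reply
    boxes = re.findall(r"\[\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\]", text)
    return [int(v) for v in boxes[-1]] if boxes else None
```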
- R1-Track-5k
- This dataset is available at https://huggingface.co/datasets/WangBiao/R1-Track-5k (For EasyR1) and https://huggingface.co/datasets/WangBiao/R1-Track-Data-ShareGPT (For LLaMA-Factory).
- $336 \times 336$, 2 images per sample.
- This dataset was randomly sampled from GOT10k and has not undergone manual review; some image pairs are of relatively low quality.
- R1-Track-100k
- This dataset is available at https://huggingface.co/datasets/WangBiao/R1-Track-100k.
- $112 \times 112$, $224 \times 224$, $336 \times 336$, $448 \times 448$; 2 or 3 images per sample.
- This dataset was also randomly sampled from GOT10k.
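Both datasets can be pulled directly with the Hugging Face `datasets` library. The split and column layout is not documented in this README, so the sketch below simply inspects it at runtime rather than assuming field names.

```python
from datasets import load_dataset

# Download R1-Track-5k (the EasyR1-format variant); swap in "WangBiao/R1-Track-100k"
# or "WangBiao/R1-Track-Data-ShareGPT" for the other releases.
ds = load_dataset("WangBiao/R1-Track-5k")
print(ds)                        # available splits
split = next(iter(ds.values()))  # take the first split
print(split.features)            # column names and types
print(split[0])                  # one raw sample
```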
- Generate data (Optional)
python generation_data/sampling_data.py
python generation_data/gen_huggingface_data.py
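The sampling script itself is not reproduced here; the sketch below only illustrates the kind of frame-pair sampling such a script might perform, assuming the standard GOT-10k layout (one folder per sequence containing numbered .jpg frames and a groundtruth.txt with per-frame [x, y, w, h] boxes).

```python
import os
import random


def sample_pair(seq_dir: str, max_gap: int = 50):
    """Illustrative only: sample a template/search frame pair from one GOT-10k sequence
    and convert its annotations to [x_min, y_min, x_max, y_max]."""
    with open(os.path.join(seq_dir, "groundtruth.txt")) as f:
        boxes = [[float(v) for v in line.strip().split(",")] for line in f]
    frames = sorted(n for n in os.listdir(seq_dir) if n.endswith(".jpg"))
    i = random.randrange(len(frames) - 1)
    j = min(i + random.randint(1, max_gap), len(frames) - 1)

    def to_xyxy(b):  # GOT-10k stores [x, y, w, h]
        x, y, w, h = b
        return [int(x), int(y), int(x + w), int(y + h)]

    return (os.path.join(seq_dir, frames[i]), to_xyxy(boxes[i]),
            os.path.join(seq_dir, frames[j]), to_xyxy(boxes[j]))
```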
- SFT (Optional)
- Env Preparation
Please refer to the official LLaMA-Factory repo for env configuration guidelines.
- Train LoRA
CUDA_VISIBLE_DEVICES=0,1,2,3 llamafactory-cli train examples/train_lora/r1_track_lora_sft.yaml
- Merge LoRA
llamafactory-cli export examples/merge_lora/r1_track_lora_sft.yaml
- RL
- Env Preparation
Please refer to the official EasyR1 repo for env configuration guidelines.
- Train
cd EasyR1
bash examples/qwen2_5_vl_3b_track5k_grpo_w_think.sh
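GRPO needs a scalar reward for each rollout. The exact reward is defined inside the EasyR1 example script; a typical choice for this task combines a format check with the IoU between the predicted and ground-truth boxes, as in the illustrative sketch below (an assumption, not the released reward function).

```python
import re


def iou(a, b):
    """IoU of two [x_min, y_min, x_max, y_max] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def track_reward(response: str, gt_box, format_weight: float = 0.1) -> float:
    """Illustrative reward: bonus for respecting the <think>/<answer> format, plus the IoU
    between the parsed prediction and the ground-truth box."""
    format_ok = bool(re.search(r"<think>.*?</think>\s*<answer>.*?</answer>", response, re.S))
    found = re.findall(r"\[\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\]", response)
    acc = iou([int(v) for v in found[-1]], gt_box) if found else 0.0
    return format_weight * float(format_ok) + (1.0 - format_weight) * acc
```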
- Infer
- Deploy an OpenAI-compatible server with vLLM (a minimal Python client sketch follows the tracking command below)
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m vllm.entrypoints.openai.api_server --served-model-name R1-Track --model WangBiao/R1-Track-GRPO --gpu-memory-utilization 0.9 --tensor-parallel-size 4 --port 8888 --limit-mm-per-prompt image=2
- Tracking on GOT10k
python infer_script/r1track.py
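If you are not using infer_script/r1track.py, a minimal client for the server above could look like the sketch below. Passing the two frames as base64 data URIs is the standard OpenAI-compatible route supported by vLLM; the file names are placeholders.

```python
import base64
from openai import OpenAI


def to_data_uri(path: str) -> str:
    with open(path, "rb") as f:
        return "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()


# Talk to the vLLM server launched above (served model name R1-Track, port 8888).
client = OpenAI(base_url="http://localhost:8888/v1", api_key="EMPTY")

prompt = (
    "Please identify the target specified by the bounding box [241,66,329,154] in the "
    "first image and locate it in the second image. \n "
    "Return the coordinates in [x_min,y_min,x_max,y_max] format."
)

resp = client.chat.completions.create(
    model="R1-Track",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": to_data_uri("template.jpg")}},
            {"type": "image_url", "image_url": {"url": to_data_uri("search.jpg")}},
            {"type": "text", "text": prompt},
        ],
    }],
    temperature=0.0,
)
print(resp.choices[0].message.content)  # raw reply containing the predicted box
```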
- Our R1-Track-GRPO-wo-Think(-0503) model achieved an AO score of 0.680 on the GOT-10k test set.
| Tracker / GOT-10k | Crop Size | Finetune Data | AO | SR0.5 | SR0.75 | Params | Ckpt |
|---|---|---|---|---|---|---|---|
| LLaVA-1.5 | - | - | 0.235 | 0.202 | - | 7B | - |
| Qwen2.5-VL-7B-Instruct | - | - | 0.126 | 0.011 | - | 7B | - |
| Qwen2.5-VL-72B-Instruct | - | - | - | - | - | 72B | - |
| R1-Track-SFT | | R1-Track-5k | 0.543 | 0.633 | 0.338 | 3B | R1-Track-SFT |
| R1-Track-GRPO | | R1-Track-5k | 0.586 | 0.676 | 0.470 | 3B | R1-Track-GRPO |
| R1-Track-GRPO-wo-Think | | R1-Track-5k | 0.585 | 0.673 | 0.500 | 3B | R1-Track-GRPO-wo-Think |
| R1-Track-SFT | | R1-Track-100k | 0.667 | 0.746 | 0.620 | 3B | R1-Track-SFT-0503-lora |
| R1-Track-GRPO | | R1-Track-100k | 0.672 | 0.759 | 0.624 | 3B | - |
| R1-Track-GRPO-wo-Think | | R1-Track-100k | 0.680 | 0.766 | 0.637 | 3B | R1-Track-GRPO-wo-Think-0503 |
Note: In our experiments, we found that letting the 3B base model (which is not cold-started on CoT data) output the result directly, instead of following the `<think></think><answer></answer>` format, leads to a much higher score on GOT-10k.
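For reference, the GOT-10k columns above follow the benchmark's standard metrics: AO is the average overlap (mean IoU over all frames), and SR0.5 / SR0.75 are the fractions of frames whose overlap exceeds 0.5 / 0.75. A minimal sketch of how they are computed from per-frame IoUs (the official toolkit does this server-side for the test split):

```python
def got10k_metrics(ious):
    """Average overlap and success rates from a list of per-frame IoUs."""
    n = len(ious)
    ao = sum(ious) / n
    sr_050 = sum(i > 0.5 for i in ious) / n
    sr_075 = sum(i > 0.75 for i in ious) / n
    return ao, sr_050, sr_075


print(got10k_metrics([0.9, 0.6, 0.7, 0.2]))  # (0.6, 0.75, 0.25)
```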
- [2025/04/02] We released 🤗R1-Track-5K dataset!
- [2025/04/20] We released 🤗R1-Track-SFT lora weights, fine-tuned on the R1-Track-5k!
- [2025/04/24] We released 🤗R1-Track-Data-ShareGPT dataset. You can effortlessly integrate it with LLaMA-Factory for use!
- [2025/04/27] We released the R1-Track inference script for GOT-10k dataset evaluation!
- [2025/04/27] We released the R1-Track-GRPO training script based on EasyR1!
- [2025/04/28] We released 🤗R1-Track-GRPO model!
- [2025/04/28] We released 🤗R1-Track-GRPO-wo-Think model!
- [2025/05/03] We released 🤗R1-Track-100K dataset!
- [2025/05/07] We released a simple script for generating RL training data!
- [2025/05/07] We released 🤗R1-Track-GRPO-wo-Think-0503 model!
- [2025/05/11] We released 🤗R1-Track-SFT lora weights, fine-tuned on the R1-Track-100k!
- Generate a more refined, large-scale, and diversified dataset based on existing tracking training data;
- Train the 7B model;
- Support for multi-image or video (i.e. temporal information modeling);
- Support for RGB, RGB-T, RGB-D, RGB-E modalities simultaneously;
- Explore improved methods for generating cold-start CoT data and train a real CoT model;
- Support evaluation on other tracking datasets;
- ......
We will strive to elevate R1-Track to the T0 level of trackers.
Please open an issue if you have any questions.
- Base Model: We use Qwen2.5-VL-3B-Instruct as our base model.
- SFT: We perform supervised fine-tuning based on LLaMA-Factory.
- RL: We conduct reinforcement learning fine-tuning using EasyR1.
- Inference: vLLM.
- Data: We utilized a portion of the GOT10k-train dataset to assemble our training set and conducted one-shot testing exclusively on GOT10k-test.
@misc{wang2025r1track,
title = {R1-Track: Direct Application of MLLMs to Visual Object Tracking via Reinforcement Learning},
author = {Biao Wang},
howpublished = {\url{https://github.com/Wangbiao2/R1-Track}},
year = {2025}
}