Official Baseline Implementation for Track 1
Based on DriveBench -- "Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data, and Metric Perspectives"
(https://github.com/drive-bench/toolkit)
🏆 Prize Pool: $2,000 USD for Track 1 Winners
Track 1: Drive with Language challenges participants to develop intelligent driving systems that can understand and act upon natural language instructions in dynamic driving environments. In an era where autonomous vehicles must interpret complex human commands, participants will design algorithms that bridge the gap between natural language understanding and driving actions.
The challenge focuses on enabling autonomous agents to process commands involving perception, prediction, and planning. Submissions must address key challenges such as accurate scene perception, safe decision-making, and robustness against visual degradation in complex driving scenarios.
This track evaluates the capability of VLMs to answer high-level driving questions in complex urban environments. Given a question covering perception, prediction, or planning, together with multi-view camera input, participants are expected to answer correctly even when the images are visually corrupted.
- Perception: Understand the scene and answer questions about it or about specific objects within it.
- Prediction: Predict the future trajectory of objects in the scene.
- Planning: Plan the safe driving actions based on the objects in the scene.
- Venue: IROS 2025, Hangzhou (Oct 19-25, 2025)
- Registration: Google Form (Open until Aug 15)
- Contact: robosense2025@gmail.com
| Prize | Award |
| --- | --- |
| 🥇 1st Place | $1,000 + Certificate |
| 🥈 2nd Place | $600 + Certificate |
| 🥉 3rd Place | $400 + Certificate |
| 🏅 Innovation Award | Cash Award + Certificate |
| Participation | Certificate |
This track uses the RoboSense Track 1 Drive with Language Dataset, which builds on the DriveLM and DriveBench benchmarks and includes:
- Multi-view Camera Input: multi-view camera images (six views) from the nuScenes dataset.
- Language Instruction: natural language questions covering perception, prediction, and planning.
- Object Localization: objects are referred to by their center-point coordinates in the corresponding camera view (an illustrative sample layout is sketched below).
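To make this layout concrete, here is a sketch of what a single QA sample might look like once loaded in Python. The field names and values are illustrative assumptions for this README, not the official schema of the released files.

```python
# Hypothetical shape of one QA sample (field names are illustrative, not the official schema).
sample = {
    "question": "What is the moving status of the object at the marked location?",
    "task": "perception",                  # perception / prediction / planning
    "question_type": "MCQ",                # MCQ or VQA
    "images": {                            # six nuScenes camera views
        "CAM_FRONT": "samples/CAM_FRONT/xxx.jpg",
        "CAM_FRONT_LEFT": "samples/CAM_FRONT_LEFT/xxx.jpg",
        "CAM_FRONT_RIGHT": "samples/CAM_FRONT_RIGHT/xxx.jpg",
        "CAM_BACK": "samples/CAM_BACK/xxx.jpg",
        "CAM_BACK_LEFT": "samples/CAM_BACK_LEFT/xxx.jpg",
        "CAM_BACK_RIGHT": "samples/CAM_BACK_RIGHT/xxx.jpg",
    },
    "object": {"camera": "CAM_FRONT", "center": [960.0, 540.0]},  # center-point object reference
    "answer": "A",                         # ground truth; hidden for the test set
}
```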
| Driving Task | Num. of Questions | Question Types |
| --- | --- | --- |
| Perception | 361 | MCQs, VQA |
| Prediction | 522 | MCQs |
| Planning | 513 | VQA |
We further distinguish the VQA questions into two types:
- VQAobj: questions about a specific object in the scene.
- VQAscene: questions about the scene as a whole.
We use Qwen2.5-VL-7B-Instruct as the baseline model. The baseline performance is as follows:
| Task | Question Type | Score (%) |
| --- | --- | --- |
| Perception | MCQ | 75.5 |
| | VQAobj | 29.2 |
| | VQAscene | 22.2 |
| Prediction | MCQ | 59.2 |
| Planning | VQAobj | 29.6 |
| | VQAscene | 31.2 |
| Average | All Types | 42.5 |
We provide a simple demo to run the baseline model.
Assuming you have conda installed, create a conda environment, activate it, and install the dependencies:
conda create -n drive python=3.10
conda activate drive
pip install -r requirements.txt
First, convert the data format by running:
python convert_format.py <input_file> <output_file>
You can also include temporal frames by adding the `--use-temporal` flag together with `--num-frames <num_frames>`:
python convert_format.py <input_file> <output_file> --use-temporal --num-frames <num_frames>
We deploy the model using vLLM:
bash service.sh <GPU_NUM>
Then, simply run:
bash inference.sh
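If you want to query the deployed model directly instead of going through `inference.sh`, the sketch below shows one possible client, assuming `service.sh` exposes vLLM's OpenAI-compatible API at `http://localhost:8000/v1` (vLLM's default port); the endpoint, image paths, and question are illustrative assumptions, not part of the official scripts.

```python
# Minimal client sketch for the vLLM OpenAI-compatible server (assumed to run on port 8000).
# The server must allow multiple images per prompt for all six camera views to fit into one request.
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def to_data_url(path: str) -> str:
    """Read a local image and return it as a base64 data URL."""
    with open(path, "rb") as f:
        return "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()

# Hypothetical image paths; replace with the actual multi-view frames of a sample.
views = ["CAM_FRONT.jpg", "CAM_FRONT_LEFT.jpg", "CAM_FRONT_RIGHT.jpg",
         "CAM_BACK.jpg", "CAM_BACK_LEFT.jpg", "CAM_BACK_RIGHT.jpg"]
question = "What are the important objects in the current scene?"  # example question

content = [{"type": "image_url", "image_url": {"url": to_data_url(p)}} for p in views]
content.append({"type": "text", "text": question})

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-VL-7B-Instruct",  # should match the model name served by service.sh
    messages=[{"role": "user", "content": content}],
    temperature=0.0,
)
print(response.choices[0].message.content)
```

Base64 data URLs keep the request self-contained, so the server does not need direct access to your local image files.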
TBA
TBA
- Phase 1: Submit results on the clean test set with reproducible code
- Phase 2: Submit results on the corrupted test set with reproducible code
- Code: Submit reproducible code with your final results
- Model: Include trained model weights
- Report: Technical report describing your approach
Our benchmark uses the following metrics: Accuracy and LLM Score.
| Metric | Description |
| --- | --- |
| Accuracy | Used for all multiple-choice questions (MCQs). |
| LLM Score | Used for all visual question answering (VQA) questions; we prompt an LLM to score each answer against detailed rubrics. |
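As a reference for self-checking your results, here is a sketch of how the two metrics could be computed locally. The record fields (`question_type`, `answer`, `gt`) and the judge-prompt wording are assumptions for illustration; the official evaluation uses its own rubrics and scripts.

```python
# Illustrative metric helpers, not the official evaluation code.
from typing import Dict, List

def mcq_accuracy(records: List[Dict]) -> float:
    """Exact-match accuracy (%) over multiple-choice questions, e.g. answers 'A'-'D'."""
    mcqs = [r for r in records if r["question_type"] == "MCQ"]
    if not mcqs:
        return 0.0
    correct = sum(r["answer"].strip().upper()[:1] == r["gt"].strip().upper()[:1] for r in mcqs)
    return 100.0 * correct / len(mcqs)

def build_judge_prompt(question: str, reference: str, answer: str) -> str:
    """Assemble a rubric-style prompt for an LLM judge (rubric wording is an assumption)."""
    return (
        "You are grading an answer to an autonomous-driving VQA question.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Model answer: {answer}\n"
        "Score the model answer from 0 to 100 for correctness and completeness. "
        "Reply with the numeric score only."
    )
```

The judge prompt can then be sent to an LLM of your choice and the returned scores averaged per question type, mirroring the per-type numbers in the baseline table above.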
- Registration: Google Form
- Phase 1 Deadline: August 15th
- Phase 2 Deadline: September 15th
- Awards Announcement: IROS 2025
- Challenge Website: robosense2025.github.io
- Track Details: Track 1 Page
- Track Dataset: HuggingFace Dataset
- Baseline Model: HuggingFace Model
- Related Paper: arXiv:2501.04003
- Email: robosense2025@gmail.com
- Official Website: https://robosense2025.github.io
- Issues: Please use GitHub Issues for technical questions
If you use the code and dataset in your research, please cite:
@article{xie2025drivebench,
title = {Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data, and Metric Perspectives},
author = {Xie, Shaoyuan and Kong, Lingdong and Dong, Yuhao and Sima, Chonghao and Zhang, Wenwei and Chen, Qi Alfred and Liu, Ziwei and Pan, Liang},
journal = {arXiv preprint arXiv:2501.04003},
year = {2025}
}
@inproceedings{sima2024drivelm,
title = {DriveLM: Driving with graph visual question answering},
author = {Sima, Chonghao and Renz, Katrin and Chitta, Kashyap and Chen, Li and Zhang, Hanxue and Xie, Chengen and Bei{\ss}wenger, Jens and Luo, Ping and Geiger, Andreas and Li, Hongyang},
booktitle = {European Conference on Computer Vision},
pages = {256-274},
year = {2024},
organization = {Springer}
}
🤖 Ready to sense the world robustly? Register now and compete for $2,000!
Register Here | Challenge Website | Contact Us
Made with ❤️ by the RoboSense 2025 Team