Official Baseline Implementation for Track 1
Based on DriveBench -- "Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data, and Metric Perspectives"
(https://github.com/drive-bench/toolkit)
🏆 Prize Pool: $2,000 USD for Track 1 Winners
Track 1: Drive with Language challenges participants to develop intelligent driving systems that can understand and act upon natural language instructions in dynamic driving environments. In an era where autonomous vehicles must interpret complex human commands, participants will design algorithms that bridge the gap between natural language understanding and driving actions.
The challenge focuses on enabling autonomous agents to process commands involving perception, prediction, and planning. Submissions must address key challenges such as accurate scene perception, safe decision-making, and robustness against visual degradation in complex driving scenarios.
This track evaluates the capability of VLMs to answer high-level driving questions in complex urban environments. Given a question covering perception, prediction, or planning, together with multi-view camera input, participants are expected to answer correctly even when the images are visually corrupted.
- Perception: Understand the scene and answer questions about it or about specific objects within it.
- Prediction: Predict the future trajectory of objects in the scene.
- Planning: Plan the safe driving actions based on the objects in the scene.
- Venue: IROS 2025, Hangzhou (Oct 19-25, 2025)
- Registration: Google Form (Open until Aug 15)
- Contact: robosense2025@gmail.com
| Prize | Award |
| --- | --- |
| 🥇 1st Place | $1,000 + Certificate |
| 🥈 2nd Place | $600 + Certificate |
| 🥉 3rd Place | $400 + Certificate |
| 🏅 Innovation Award | Cash Award + Certificate |
| Participation | Certificate |
This track uses the RoboSense Track 1 Drive with Language Dataset, which builds on the DriveLM and DriveBench benchmarks and includes:
- Multi-view Camera Input: multi-view camera images (six views) from the nuScenes dataset.
- Language Instruction: natural language questions covering perception, prediction, and planning.
- Object Localization: objects are referred to by their center-point coordinates in the corresponding camera view (an illustrative sample layout is sketched below).
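To make this layout concrete, here is a sketch of what a single QA sample might look like once loaded in Python. The field names and values are illustrative assumptions for this README, not the official schema of the released files.

```python
# Hypothetical shape of one QA sample (field names are illustrative, not the official schema).
sample = {
    "question": "What is the moving status of the object at the marked location?",
    "task": "perception",                  # perception / prediction / planning
    "question_type": "MCQ",                # MCQ or VQA
    "images": {                            # six nuScenes camera views
        "CAM_FRONT": "samples/CAM_FRONT/xxx.jpg",
        "CAM_FRONT_LEFT": "samples/CAM_FRONT_LEFT/xxx.jpg",
        "CAM_FRONT_RIGHT": "samples/CAM_FRONT_RIGHT/xxx.jpg",
        "CAM_BACK": "samples/CAM_BACK/xxx.jpg",
        "CAM_BACK_LEFT": "samples/CAM_BACK_LEFT/xxx.jpg",
        "CAM_BACK_RIGHT": "samples/CAM_BACK_RIGHT/xxx.jpg",
    },
    "object": {"camera": "CAM_FRONT", "center": [960.0, 540.0]},  # center-point object reference
    "answer": "A",                         # ground truth; hidden for the test set
}
```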
| Driving Task | Num. of Questions | Question Types |
| --- | --- | --- |
| Perception | 361 | MCQs, VQA |
| Prediction | 522 | MCQs |
| Planning | 513 | VQA |
We further distinguish the VQA questions into two types:
- VQAobj: questions about a specific object in the scene.
- VQAscene: questions about the scene as a whole.
We use Qwen2.5-VL-7B-Instruct as the baseline model. The baseline performance is as follows:
| Task | Question Type | Score (%) |
| --- | --- | --- |
| Perception | MCQ | 75.5 |
| | VQAobj | 29.2 |
| | VQAscene | 22.2 |
| Prediction | MCQ | 59.2 |
| Planning | VQAobj | 29.6 |
| | VQAscene | 31.2 |
| Average | All Types | 42.5 |
We provide a simple demo to run the baseline model.
Assuming you have conda installed, create a conda environment, activate it, and install the dependencies:
conda create -n drive python=3.10
conda activate drive
pip install -r requirements.txt
First, convert the data format by running:
python convert_format.py <input_file> <output_file>
You can also include temporal frames by adding the `--use-temporal` flag together with `--num-frames <num_frames>`:
python convert_format.py <input_file> <output_file> --use-temporal --num-frames <num_frames>
We deploy the model using vLLM:
bash service.sh <GPU_NUM>
Then, simply run:
bash inference.sh
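If you want to query the deployed model directly instead of going through `inference.sh`, the sketch below shows one possible client, assuming `service.sh` exposes vLLM's OpenAI-compatible API at `http://localhost:8000/v1` (vLLM's default port); the endpoint, image paths, and question are illustrative assumptions, not part of the official scripts.

```python
# Minimal client sketch for the vLLM OpenAI-compatible server (assumed to run on port 8000).
# The server must allow multiple images per prompt for all six camera views to fit into one request.
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def to_data_url(path: str) -> str:
    """Read a local image and return it as a base64 data URL."""
    with open(path, "rb") as f:
        return "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()

# Hypothetical image paths; replace with the actual multi-view frames of a sample.
views = ["CAM_FRONT.jpg", "CAM_FRONT_LEFT.jpg", "CAM_FRONT_RIGHT.jpg",
         "CAM_BACK.jpg", "CAM_BACK_LEFT.jpg", "CAM_BACK_RIGHT.jpg"]
question = "What are the important objects in the current scene?"  # example question

content = [{"type": "image_url", "image_url": {"url": to_data_url(p)}} for p in views]
content.append({"type": "text", "text": question})

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-VL-7B-Instruct",  # should match the model name served by service.sh
    messages=[{"role": "user", "content": content}],
    temperature=0.0,
)
print(response.choices[0].message.content)
```

Base64 data URLs keep the request self-contained, so the server does not need direct access to your local image files.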
TBA
TBA
- Phase 1: Submit results on the clean test set with reproducible code
- Phase 2: Submit results on the corrupted test set with reproducible code
- Code: Submit reproducible code with your final results
- Model: Include trained model weights
- Report: Technical report describing your approach
Our benchmark uses the following metrics: Accuracy and LLM Score.
| Metric | Description |
| --- | --- |
| Accuracy | Used for all multiple-choice questions (MCQs). |
| LLM Score | Used for all visual question answering (VQA) questions; we prompt an LLM to score each answer against detailed rubrics. |
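As a reference for self-checking your results, here is a sketch of how the two metrics could be computed locally. The record fields (`question_type`, `answer`, `gt`) and the judge-prompt wording are assumptions for illustration; the official evaluation uses its own rubrics and scripts.

```python
# Illustrative metric helpers, not the official evaluation code.
from typing import Dict, List

def mcq_accuracy(records: List[Dict]) -> float:
    """Exact-match accuracy (%) over multiple-choice questions, e.g. answers 'A'-'D'."""
    mcqs = [r for r in records if r["question_type"] == "MCQ"]
    if not mcqs:
        return 0.0
    correct = sum(r["answer"].strip().upper()[:1] == r["gt"].strip().upper()[:1] for r in mcqs)
    return 100.0 * correct / len(mcqs)

def build_judge_prompt(question: str, reference: str, answer: str) -> str:
    """Assemble a rubric-style prompt for an LLM judge (rubric wording is an assumption)."""
    return (
        "You are grading an answer to an autonomous-driving VQA question.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Model answer: {answer}\n"
        "Score the model answer from 0 to 100 for correctness and completeness. "
        "Reply with the numeric score only."
    )
```

The judge prompt can then be sent to an LLM of your choice and the returned scores averaged per question type, mirroring the per-type numbers in the baseline table above.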
- Registration: Google Form
- Phase 1 Deadline: August 15th
- Phase 2 Deadline: September 15th
- Awards Announcement: IROS 2025
- Challenge Website: robosense2025.github.io
- Track Details: Track 1 Page
- Track Dataset: HuggingFace Dataset
- Baseline Model: HuggingFace Model
- Related Paper: arXiv:2501.04003
- Email: robosense2025@gmail.com
- Official Website: https://robosense2025.github.io
- Issues: Please use GitHub Issues for technical questions
If you use the code and dataset in your research, please cite:
@article{xie2025drivebench,
title = {Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data, and Metric Perspectives},
author = {Xie, Shaoyuan and Kong, Lingdong and Dong, Yuhao and Sima, Chonghao and Zhang, Wenwei and Chen, Qi Alfred and Liu, Ziwei and Pan, Liang},
journal = {arXiv preprint arXiv:2501.04003},
year = {2025}
}
@inproceedings{sima2024drivelm,
title = {DriveLM: Driving with graph visual question answering},
author = {Sima, Chonghao and Renz, Katrin and Chitta, Kashyap and Chen, Li and Zhang, Hanxue and Xie, Chengen and Bei{\ss}wenger, Jens and Luo, Ping and Geiger, Andreas and Li, Hongyang},
booktitle = {European Conference on Computer Vision},
pages = {256-274},
year = {2024},
organization = {Springer}
}
🤖 Ready to sense the world robustly? Register now and compete for $2,000!
Register Here | Challenge Website | Contact Us
Made with ❤️ by the RoboSense 2025 Team