πŸ€– RoboSense Track 1: Driving with Language

Official Baseline Implementation for Track 1

Based on DriveBench: "Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data, and Metric Perspectives"
(https://github.com/drive-bench/toolkit)


πŸ† Prize Pool: $2,000 USD for Track 1 Winners

Challenge Overview

Track 1: Driving with Language challenges participants to develop intelligent driving systems that can understand and act upon natural language instructions in dynamic driving environments. In an era where autonomous vehicles must interpret complex human commands, participants will design algorithms that bridge the gap between natural language understanding and driving actions.

The challenge focuses on enabling autonomous agents to process commands involving perception, prediction, and planning. Submissions must address key challenges such as accurate scene perception, safe decision-making, and robustness against visual degradation in complex driving scenarios.

🎯 Objectives

This track evaluates the capability of VLMs to answer high-level driving questions in complex urban environments. Given multi-view camera input and questions covering perception, prediction, and planning, participants are expected to answer the questions, including when the input images are visually corrupted.

  • Perception: Understand the scene and answer questions about the scene or the objects in it.
  • Prediction: Predict the future trajectory of objects in the scene.
  • Planning: Plan safe driving actions based on the objects in the scene.

Competition Details

πŸ† Awards

Prize                Award
🥇 1st Place          $1,000 + Certificate
🥈 2nd Place          $600 + Certificate
🥉 3rd Place          $400 + Certificate
🌟 Innovation Award   Cash Award + Certificate
Participation        Certificate

πŸ“Š Official Dataset

This track uses the RoboSense Track 1 Drive with Language Dataset, which is built on the DriveLM and DriveBench benchmarks and includes:

  • Multi-view Camera Input: multi-view camera images from the nuScenes dataset.
  • Language Instruction: natural-language questions covering perception, prediction, and planning.
  • Object Localization: objects are referred to by their center-point coordinates in the corresponding camera view (see the sketch after this list).
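
For concreteness, the sketch below shows what a single sample might look like after conversion. The field names, file paths, and the DriveLM-style <c1,CAM_FRONT,x,y> object tag are illustrative assumptions rather than the official schema; inspect the files produced by convert_format.py for the authoritative format.

# Hypothetical sample record, for illustration only (field names and paths are assumed).
example_sample = {
    "images": {  # multi-view camera frames from nuScenes
        "CAM_FRONT": "samples/CAM_FRONT/example.jpg",
        "CAM_FRONT_LEFT": "samples/CAM_FRONT_LEFT/example.jpg",
        "CAM_FRONT_RIGHT": "samples/CAM_FRONT_RIGHT/example.jpg",
        "CAM_BACK": "samples/CAM_BACK/example.jpg",
        "CAM_BACK_LEFT": "samples/CAM_BACK_LEFT/example.jpg",
        "CAM_BACK_RIGHT": "samples/CAM_BACK_RIGHT/example.jpg",
    },
    # Object references pair a camera name with the object's center point in pixels.
    "question": "What is the moving status of the object at <c1,CAM_FRONT,800.0,500.0>?",
    "category": "perception",  # perception / prediction / planning
    "answer": "The object is moving forward.",  # ground truth, available for training samples
}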

Dataset Statistics

Driving Task   Num. of Questions   Question Types
Perception     361                 MCQ, VQA
Prediction     522                 MCQ
Planning       513                 VQA

We further divide the VQA questions into two types:

  • VQAobj: questions about a specific object in the scene.
  • VQAscene: questions about the scene as a whole.

Baseline Performance (Phase 1)

We use Qwen2.5-VL-7B-Instruct as the baseline model. The baseline performance is as follows:

Task         Question Type   Accuracy (%)
Perception   MCQ             75.5
Perception   VQAobj          29.2
Perception   VQAscene        22.2
Prediction   MCQ             59.2
Planning     VQAobj          29.6
Planning     VQAscene        31.2
Average      All Types       42.5

πŸš€ Quick Start

We provide a simple demo to run the baseline model.

1. Prepare the conda environment

Assuming you have conda installed, create and activate the environment, then install the dependencies:

conda create -n drive python=3.10
conda activate drive
pip install -r requirements.txt

2. Prepare the dataset

First, convert the data format by running:

python convert_format.py <input_file> <output_file> 

You can also include temporal frames by adding the --use-temporal flag with --num-frames <num_frames>.
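
After conversion, a quick sanity check like the sketch below confirms that the output loads. This assumes convert_format.py writes a JSON file (adjust if the script emits another format), and the file name is only a placeholder.

import json

# Placeholder path: use whatever <output_file> you passed to convert_format.py.
with open("converted_data.json") as f:
    samples = json.load(f)

print(f"Loaded {len(samples)} samples")
print(samples[0] if isinstance(samples, list) else next(iter(samples.items())))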

3. Deploy

We deploy the model using vLLM:

bash service.sh <GPU_NUM>
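
Once the service is running, you can sanity-check it with a single request. The sketch below assumes service.sh launches vLLM's OpenAI-compatible server on localhost:8000 and serves Qwen/Qwen2.5-VL-7B-Instruct; check service.sh and inference.sh for the actual host, port, and served model name.

import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # vLLM ignores the API key

def to_data_url(path: str) -> str:
    # Encode a local camera image as a base64 data URL so it can be sent in the request.
    with open(path, "rb") as f:
        return "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-VL-7B-Instruct",  # assumed served model name
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": to_data_url("CAM_FRONT.jpg")}},
            {"type": "text", "text": "Describe the driving scene in one sentence."},
        ],
    }],
    temperature=0.0,
)
print(response.choices[0].message.content)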

4. Evaluate the baseline

Simply run:

bash inference.sh

πŸ“¦ Submission Packaging

TBA

⏱ Evaluation Time

TBA

πŸŽ–οΈ Challenge Participation

Submission Requirements

  1. Phase 1: Submit results on clean test set with reproducible code
  2. Phase 2: Submit results on corrupted test set with reproducible code
  3. Code: Submit reproducible code with your final results
  4. Model: Include trained model weights
  5. Report: Technical report describing your approach

πŸ“ Evaluation Metrics

Our benchmark uses the following metrics: Accuracy and LLM Score.

Metric      Description
Accuracy    Used for all multiple-choice questions (MCQs).
LLM Score   Used for all visual question answering (VQA) questions; an LLM is prompted to score each answer against detailed rubrics.
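
As a rough illustration of how these metrics behave, MCQ accuracy is exact matching over the chosen option, while the LLM Score comes from a judge prompt with a rubric. The sketch below is only indicative; the official evaluation scripts and rubric are authoritative.

def mcq_accuracy(predictions: list[str], ground_truths: list[str]) -> float:
    """Percentage of multiple-choice predictions that exactly match the ground-truth option."""
    correct = sum(p.strip().upper() == g.strip().upper() for p, g in zip(predictions, ground_truths))
    return 100.0 * correct / max(len(ground_truths), 1)

# Hypothetical judge-prompt skeleton for the LLM Score (the rubric wording here is illustrative).
JUDGE_PROMPT = (
    "You are grading an answer for a driving VQA benchmark.\n"
    "Question: {question}\n"
    "Reference answer: {reference}\n"
    "Model answer: {prediction}\n"
    "Following the rubric, output a single score from 0 to 100."
)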

Timeline

  • Registration: Google Form
  • Phase 1 Deadline: August 15th
  • Phase 2 Deadline: September 15th
  • Awards Announcement: IROS 2025

πŸ”— Resources

πŸ“§ Contact & Support

πŸ“„ Citation

If you use the code and dataset in your research, please cite:

@article{xie2025drivebench,
  title = {Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data, and Metric Perspectives},
  author = {Xie, Shaoyuan and Kong, Lingdong and Dong, Yuhao and Sima, Chonghao and Zhang, Wenwei and Chen, Qi Alfred and Liu, Ziwei and Pan, Liang},
  journal = {arXiv preprint arXiv:2501.04003},
  year = {2025}
}
@inproceedings{sima2024drivelm,
  title = {DriveLM: Driving with graph visual question answering},
  author = {Sima, Chonghao and Renz, Katrin and Chitta, Kashyap and Chen, Li and Zhang, Hanxue and Xie, Chengen and Bei{\ss}wenger, Jens and Luo, Ping and Geiger, Andreas and Li, Hongyang},
  booktitle = {European Conference on Computer Vision},
  pages = {256--274},
  year = {2024},
  organization = {Springer}
}

Acknowledgements

RoboSense 2025 Challenge Organizers

RoboSense 2025 Program Committee


πŸ€– Ready to sense the world robustly? Register now and compete for $2,000!

πŸ“ Register Here | 🌐 Challenge Website | πŸ“§ Contact Us

Made with ❀️ by the RoboSense 2025 Team
