A general-purpose vision-language-action (VLA) model designed to unify vision, language, and action for robotics and autonomous driving.
📜 [technical report] 🤗 [model weights] 🤖 [project page]
- 2025.6.27: Code released for robotic simulations.
- 2025.6.25: Paper released on arXiv.
- Unified Vision-Language-Action Model: supports image grounding, video generation, and action prediction.
- Strong Performance on Several Robotics Benchmarks: supports CALVIN, LIBERO, and SimplerEnv.
- Interleaved Video Training: supports interleaved vision-action training formulated as a Markov decision process.
- Broader Applications: real-robot ALOHA and autonomous driving.
- Policy learning for CALVIN, LIBERO, and SimplerEnv.
- Support for evaluation.
- World model pretraining for video generation.
- Support for real-robot ALOHA.
- Support for autonomous driving.
- Support for general grounding.
You can download the pretrained models from Hugging Face; the links are provided below.
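As one way to fetch a checkpoint from the command line, the sketch below uses `huggingface-cli`; the repo id is a placeholder, so substitute the actual model id from the links.

```bash
# Sketch: download a checkpoint from the Hugging Face Hub via the CLI.
# <org>/<model-name> is a placeholder; use the repo id from the links above.
pip install -U "huggingface_hub[cli]"
huggingface-cli download <org>/<model-name> --local-dir ./pretrained/univla
```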
```bash
# train the world model
bash scripts/pretrain/train_video_1node.sh
```
This model serves as the pretrained model for downstream policy learning tasks such as CALVIN, LIBERO, and SimplerEnv.
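As a rough illustration of how the pretrained checkpoint might be wired into downstream policy learning (the variable name and path below are hypothetical; check the script headers for the actual argument):

```bash
# Hypothetical: expose the world-model checkpoint to a downstream script.
# The variable name and how the script consumes it are assumptions.
export PRETRAINED_CKPT=/path/to/world_model_checkpoint
bash scripts/simulator/calvin/train_calvin_abcd_video.sh
```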
| Method | Mode | Setting | AVG | CKPT |
| --- | --- | --- | --- | --- |
| UniVLA | video sft | ABCD->D | 4.63 (5×: 4.71) | huggingface |

Note: 5× means 5× inference steps, i.e., 180 steps in total.
- A single-node training script is provided here; multi-node training is recommended (see the multi-node sketch after the command below).
```bash
# video sft
bash scripts/simulator/calvin/train_calvin_abcd_video.sh
```
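For multi-node training, a common pattern is to launch the same entry point with `torchrun` on every node. The sketch below is illustrative only: the node count, entry-point path, and arguments are assumptions, and the actual launch logic lives inside the scripts under `scripts/`.

```bash
# Illustrative multi-node launch (2 nodes x 8 GPUs) with torchrun.
# train/train.py is a placeholder entry point; adapt the logic in
# scripts/simulator/calvin/train_calvin_abcd_video.sh rather than copying this verbatim.
torchrun --nnodes=2 --nproc_per_node=8 \
    --node_rank=$NODE_RANK \
    --master_addr=$MASTER_ADDR --master_port=29500 \
    train/train.py
```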
| Method | Mode | SPATIAL | OBJECT | GOAL | LONG (10) | AVG | CKPT |
| --- | --- | --- | --- | --- | --- | --- | --- |
| UniVLA | img sft | 97.0 | 99.0 | 92.6 | 90.8 | 94.8 | huggingface |
| UniVLA | video sft | 95.4 | 98.8 | 93.6 | 94.0 | 95.5 | huggingface |

```bash
# video sft
bash scripts/simulator/libero/train_libero_video.sh
```
| Method | Robot | Mode | Put Spoon | Put Carrot | Stack Block | Put Eggplant | AVG | CKPT |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| UniVLA | Bridge (WidowX) | video sft | 83.3 | 66.7 | 33.3 | 95.8 | 69.8 | huggingface |

```bash
# video sft
bash scripts/simulator/simplerenv/train_simplerenv_bridge_video.sh
```

Here we provide a conda environment setup for the project.
```bash
conda create -n emu_vla python=3.10
conda activate emu_vla
pip install -r requirements.txt
```

```
OmniSim/
├── configs/      # Model configuration files
├── models/       # Tokenizer and diffusion test
├── train/        # Training dataset and pipeline
├── reference/    # Reference code
│   ├── Emu3/     # Base code
│   └── RoboVLMs/ # Evaluation code
├── scripts/      # Shell scripts for training & evaluation
├── tools/        # Data preprocessing tools
└── README.md     # Project description and user guide
```

Our work is built upon the following projects. Thanks for their great open-source work!