JARVIS-VLA: Post-Training Large-Scale Vision Language Models to Play Visual Games with Keyboards and Mouse
- [2025.03.21] Our paper is available on arXiv.
Install dependencies.
git clone https://github.com/CraftJarvis/JarvisVLA.git
conda create -n mcvla python=3.10
conda activate mcvla
cd JarvisVLA
conda install --channel=conda-forge openjdk=8 -y
pip install -e .
After installation, run the following command to check that the installation succeeded and the environment works:
python -m minestudio.simulator.entry # using Xvfb
MINESTUDIO_GPU_RENDER=1 python -m minestudio.simulator.entry # using VirtualGL
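If the simulator check fails, one quick thing to rule out is a missing system tool. The sketch below is not part of this repository; it assumes the Xvfb route needs the `java` and `Xvfb` executables on `PATH` (and the VirtualGL route `vglrun` instead of `Xvfb`), and the helper name `find_missing_tools` is hypothetical.

```python
import shutil


def find_missing_tools(tools):
    """Return the names in `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    # Assumption: these are the executables the Xvfb route relies on.
    # For the VirtualGL route, swap "Xvfb" for "vglrun".
    missing = find_missing_tools(["java", "Xvfb"])
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All listed tools were found on PATH.")
```

An empty result only means the executables are discoverable; it does not confirm that the simulator itself starts.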
You can serve the model with vLLM to support multi-GPU and multi-process rollouts.
CUDA_VISIBLE_DEVICES=0 vllm serve jarvis_vla_qwen2_vl_7b_sft --port 8000
Then edit the rollout script to use the correct base_url and port. Finally, run the rollout script.
sh scripts/evaluate/rollout-kill.sh
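Before running the rollout, it can help to confirm that the base_url you put in the script actually reaches the server. The following is a minimal sketch using only the Python standard library. It assumes the server started by `vllm serve` exposes its OpenAI-compatible API under `/v1` on the chosen port; `build_base_url` and `list_served_models` are hypothetical helper names, not functions from this repository.

```python
import json
import urllib.request


def build_base_url(host: str, port: int) -> str:
    """Build the OpenAI-compatible base URL for a vLLM server."""
    return f"http://{host}:{port}/v1"


def list_served_models(base_url: str, timeout: float = 5.0) -> list:
    """Query GET {base_url}/models and return the ids of the served models.

    Raises an OSError subclass (e.g. urllib.error.URLError) if the server
    cannot be reached.
    """
    with urllib.request.urlopen(f"{base_url}/models", timeout=timeout) as resp:
        payload = json.load(resp)
    # The OpenAI-style response lists models under the "data" key.
    return [model["id"] for model in payload["data"]]
```

With the server from the command above running, `list_served_models(build_base_url("localhost", 8000))` should by default include `jarvis_vla_qwen2_vl_7b_sft`, which is the model name the rollout script would then need to use.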
Prepare the dataset and base model, and set their locations in the corresponding shell script below.
- Single GPU
sh scripts/vla/vla_qwen2_vl_7b_sft.sh
- Multi-GPU
sh scripts/vla/vla_qwen2_vl_7b_sft-multi-GPU.sh
- Multi-Node
sh scripts/vla/vla_qwen2_vl_7b_sft-multi-node.sh
If you find our code or models useful in your work, please cite our paper:
@article{li2025jarvisvla,
title = {JARVIS-VLA: Post-Training Large-Scale Vision Language Models to Play Visual Games with Keyboards and Mouse},
author = {Muyao Li and Zihao Wang and Kaichen He and Xiaojian Ma and Yitao Liang},
journal = {arXiv preprint arXiv:2503.16365},
year = {2025}
}