JARVIS-VLA: Post-Training Large-Scale Vision Language Models to Play Visual Games with Keyboards and Mouse

Updates

[2025.03.21] Our paper is available on arXiv.

Installation

Install dependencies.

git clone https://github.com/CraftJarvis/JarvisVLA.git
conda create -n mcvla python=3.10
conda activate mcvla
cd JarvisVLA
conda install --channel=conda-forge openjdk=8 -y
pip install -e .

After installation, run one of the following commands to check that the installation succeeded and the environment works:

python -m minestudio.simulator.entry # using Xvfb
MINESTUDIO_GPU_RENDER=1 python -m minestudio.simulator.entry # using VirtualGL
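
If the check fails, one thing worth ruling out is the Java version, since the steps above install OpenJDK 8. The following is a minimal sketch and not part of the repository: the helper name `is_java8` is hypothetical, and it assumes that `java -version` reports a version string starting with `1.8`, as OpenJDK 8 builds do.

```shell
#!/bin/sh
# Hypothetical helper (not from the repository): succeeds if the given
# `java -version` output looks like a Java 8 runtime (version string 1.8.x).
is_java8() {
    case "$1" in
        *'version "1.8.'*) return 0 ;;
        *) return 1 ;;
    esac
}

# Only query Java if it is on PATH; `java -version` writes to stderr.
if command -v java >/dev/null 2>&1; then
    java_output=$(java -version 2>&1)
    if is_java8 "$java_output"; then
        echo "Java 8 detected"
    else
        echo "Java 8 not detected; check that the mcvla environment is active"
    fi
else
    echo "java not found on PATH"
fi
```

A different Java version on PATH usually means the `mcvla` conda environment is not active in the current shell.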

Inference

You can serve the model with vLLM to support multi-GPU and multi-process rollout.

CUDA_VISIBLE_DEVICES=0 vllm serve jarvis_vla_qwen2_vl_7b_sft --port 8000
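
Before running a rollout, it can help to confirm that the server is reachable. The sketch below assumes the vLLM OpenAI-compatible server from the command above is listening on `localhost:8000`, and polls its `/v1/models` endpoint with `curl`. The helper names `vla_base_url` and `wait_for_server` and the `RUN_CHECK` switch are hypothetical and not part of this repository. Whether the rollout script expects the `/v1` suffix in its base_url is also an assumption, so check the script itself.

```shell
#!/bin/sh
# Hypothetical helpers (not from the repository).

# Build the OpenAI-compatible base URL for a given host and port.
vla_base_url() {
    echo "http://$1:$2/v1"
}

# Poll <base_url>/models until it answers or the attempts run out.
# $1 = base URL, $2 = max attempts (default 30), one second apart.
wait_for_server() {
    attempts=${2:-30}
    i=0
    while [ "$i" -lt "$attempts" ]; do
        if curl -sf "$1/models" >/dev/null 2>&1; then
            return 0
        fi
        i=$((i + 1))
        sleep 1
    done
    return 1
}

BASE_URL=$(vla_base_url localhost 8000)
echo "base_url: $BASE_URL"

# Set RUN_CHECK=1 to actually contact the server.
if [ "${RUN_CHECK:-0}" = "1" ]; then
    if wait_for_server "$BASE_URL" 30; then
        echo "server is up"
    else
        echo "server did not respond"
    fi
fi
```

Large checkpoints can take a while to load, so a polling loop with a generous attempt limit tends to be more reliable than a single request.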

Then edit the rollout script to use the correct base_url and port. Finally, run the rollout script.

sh scripts/evaluate/rollout-kill.sh

Train

Prepare the dataset and base model, and set their locations in the shell script you plan to run below.

Single GPU

sh scripts/vla/vla_qwen2_vl_7b_sft.sh

Multi-GPU

sh scripts/vla/vla_qwen2_vl_7b_sft-multi-GPU.sh

Multi-Node

sh scripts/vla/vla_qwen2_vl_7b_sft-multi-node.sh

Citation

If you find our code or models useful in your work, please cite our paper:

@article{li2025jarvisvla,
  title   = {JARVIS-VLA: Post-Training Large-Scale Vision Language Models to Play Visual Games with Keyboards and Mouse},
  author  = {Muyao Li and Zihao Wang and Kaichen He and Xiaojian Ma and Yitao Liang},
  journal = {arXiv preprint arXiv:2503.16365}, 
  year    = {2025}
}
