
This README is functionally complete as of 06/26/2025. If you find something missing, please open an issue and we will take care of it as soon as possible.

🆕 [2025-06-26] Tutorial for Evaluation Updated.

🆕 [2025-06-15] Model Checkpoints Uploaded. Tutorial for Training/Fine-tuning Updated.

🆕 [2025-06-12] Made Public.

INT-ACT

[Page] | [Paper]

This is the official implementation of From Intention to Execution: Probing the Generalization Boundaries of Vision-Language-Action Models.

Table of Contents

  • TODO
  • Installation
  • Install Inference Environment
  • Acquire Data for Training/Fine-tuning
  • Acquire Checkpoints for Evaluation
  • Train and Fine-tune
  • Evaluate/Benchmark
  • How to Set ENV Variables

TODO

  • Add more complete documentation for training and evaluation. Currently, the code is all there, but the documentation is sparse.

  • Release all relevant model checkpoints on HF

Installation

Install this codebase by first cloning it:

git clone --recurse-submodules https://github.com/ai4ce/INT-ACT.git
cd INT-ACT
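
If you cloned without --recurse-submodules, you can still fetch the submodules afterwards:

git submodule update --init --recursive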

Important

See the How to Set ENV Variables section for setting up the environment variables.

Note

This codebase relies on uv to manage the virtual environments. It's not strictly required, but the authors can only provide support for this environment management system.
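
If uv is not installed yet, Astral's standalone installer is the quickest route (see the uv documentation for alternatives):

curl -LsSf https://astral.sh/uv/install.sh | sh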

Now simply run

uv sync
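
uv sync creates a project-local virtual environment in .venv. You can either activate it or prefix commands with uv run:

# activate the synced environment ...
source .venv/bin/activate
# ... or run a one-off command through uv without activating
uv run python --version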

Important

This only installs the dependencies for training and the inference server. Full-scale inference also requires installing the inference client (simulator) dependencies.

Inference under different environments, such as Simpler, Simpler-ManiSkill3, Libero, or the real world, requires installing each environment's own dependencies in a separate virtual environment.

Note

Server refers to the policy ($\pi_0$, Octo, etc.). Client refers to the simulator (ManiSkill) or a real-world robot. The client feeds its observations to the server, which returns the action to execute.

We do this to allow a server-client architecture that separates the very different compute demands of training from those of running experiments on a robot.
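
In practice this means running two processes, each in its own environment. Below is a minimal sketch of the workflow; the script names and port are illustrative placeholders rather than the repo's actual entry points (see doc/evaluation.md for the real commands).

# Terminal 1: start the policy server in the training environment
source .venv/bin/activate                        # environment created by `uv sync`
python serve_policy.py --port 5000               # placeholder server entry point

# Terminal 2: start the simulator client in its own environment
source src/experiments/envs/simpler/.venv/bin/activate
python run_simpler_client.py --host localhost --port 5000  # placeholder client entry point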

Install Inference Environment

Install Inference Client (Simulator) Environment

This section uses Simpler as the example.

  1. Create a separate virtual environment for this simulator.
cd src/experiments/envs/simpler
uv venv --python=3.10 # The version can be changed to accommodate your simulator's needs.
  2. Activate the inference virtual environment. This is important because we don't want to install the simulator dependencies in the training environment.
source .venv/bin/activate
  3. Install the simulator dependencies using pyproject.toml.
uv pip install -r pyproject.toml
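
As a quick sanity check that the simulator landed in this environment rather than the training one, try importing it (the package name simpler_env is an assumption based on the upstream SimplerEnv project):

python -c "import simpler_env; print('Simpler OK')"  # package name assumed, adjust if needed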

(Octo and Magma) Install Inference Server (Policy) Environment

Octo and Magma both require specialized policy environments due to conflicting dependencies. This section uses Octo as the example.

  1. Create a separate virtual environment for this policy.
cd src/experiments/policies/octo_policy_server
uv venv --python=3.10 # The version can be changed to accommodate your policy's needs.
  2. Activate the inference virtual environment. This is important because we don't want to install the policy dependencies in the training environment.
source .venv/bin/activate
  3. Install the policy dependencies using pyproject.toml.
uv pip install -r pyproject.toml

Acquire Data for Training/Fine-tuning

For now, we refer you to Allen Ren's README.

Acquire Checkpoints for Evaluation

We have released our trained Pi0 variants on Hugging Face. You can find them under the INTACT collection. Specifically, they are:

Model | Notes | Download Link
Pi0 finetune | Pi0 finetuned on BridgeV2 | HF hub
Pi0 finetune rephrase | Pi0 finetuned on BridgeV2 with task paraphrase | HF hub
Pi0 scratch | Pi0 trained from scratch on BridgeV2 | HF hub

You can find the details in each checkpoint's model card.
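
As an example, a checkpoint can be fetched with the Hugging Face CLI; the repository id below is a placeholder, so substitute the actual id from the INTACT collection:

uv pip install -U "huggingface_hub[cli]"
huggingface-cli download <org>/<checkpoint-repo> --local-dir checkpoints/pi0_finetune  # placeholder repo id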

For convenience, we also include links to the baselines which have been generously provided by their original authors:

Model | Reference | Download Link
Magma | Magma: A Foundation Model for Multimodal AI Agents (CVPR 2025) | HF hub
SpatialVLA | SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model (RSS 2025) | HF hub
Octo models | Octo: An Open-Source Generalist Robot Policy (RSS 2024) | Small (HF) / Base (HF)

Train and Fine-tune

The documentation can be found in doc/training_finetuning.md.

Evaluate/Benchmark

The documentation can be found in doc/evaluation.md.

How to Set ENV Variables

  1. Create a set_path.sh file in the project's root directory.
  2. Fill out the following variables:
#!/bin/bash
# used to sync the path on HPC with data from collaborators and the model from the baseline directory
# to avoid redundant data download
# training dataset
export VLA_DATA_DIR=

# logging for trained models, logs, etc
export VLA_LOG_DIR=

# WandB
export VLA_WANDB_ENTITY=

# HF cache. TRANSFORMERS_CACHE is deprecated, but some libraries still use it (and are a bit confused about it themselves).
export TRANSFORMERS_CACHE=
export HF_HOME=

# SIMPLER
export MS2_REAL2SIM_ASSET_DIR=
export MS_ASSET_DIR=
export XLA_PYTHON_CLIENT_PREALLOCATE=false

# uv (optional if you don't mind uv using your home directory, which may not be desirable on HPC)
export UV_CACHE_DIR=
export UV_PYTHON_INSTALL_DIR=

# Singularity (This is obviously optional if you don't use Singularity)
export OVERLAY_EXT3=
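  3. Source the file in each shell before launching training or evaluation, so the variables are visible to the scripts:
source set_path.sh
echo "$VLA_DATA_DIR"   # quick check that the variables were picked up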
