*Equal contribution. ‡ Corresponding author.
The Chinese University of Hong Kong

Previous research has investigated the application of Multimodal Large Language Models (MLLMs) in understanding 3D scenes by interpreting them as videos. These approaches generally depend on comprehensive 3D data inputs, such as point clouds or reconstructed Bird's-Eye View (BEV) maps. In our research, we advance this field by enhancing the capability of MLLMs to understand and reason in 3D space directly from video data, without the need for additional 3D input.
VG-LLM integrates a 3D visual geometry encoder (based on VGGT) with a conventional 2D visual encoder.
- Input video frames are processed by both encoders. The 2D encoder extracts semantic-aware visual features from individual images. The 3D visual geometry encoder processes the sequence to produce globally geometry-aware visual features, capturing inter-frame correspondences.
- Features from both encoders are fused at the patch level.
- These fused, geometry-augmented visual features, along with text embeddings of a question, are fed into an MLLM backbone (Qwen2.5-VL) to generate a response.
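To make the fusion step above concrete, here is a minimal PyTorch-style sketch (not the official implementation): it assumes both encoders emit per-patch features aligned to the same patch grid, and that fusion is a concatenation followed by a linear projection to the LLM width.

import torch
import torch.nn as nn

class GeometryAugmentedFusion(nn.Module):
    """Illustrative patch-level fusion of 2D semantic and 3D geometry features."""

    def __init__(self, dim_2d: int, dim_3d: int, dim_llm: int):
        super().__init__()
        # Project the concatenated 2D-semantic + 3D-geometry features to the LLM width.
        self.proj = nn.Linear(dim_2d + dim_3d, dim_llm)

    def forward(self, feats_2d: torch.Tensor, feats_3d: torch.Tensor) -> torch.Tensor:
        # feats_2d: (B, T, N, dim_2d) per-frame semantic patch features
        # feats_3d: (B, T, N, dim_3d) globally geometry-aware patch features
        fused = torch.cat([feats_2d, feats_3d], dim=-1)  # (B, T, N, dim_2d + dim_3d)
        return self.proj(fused)  # (B, T, N, dim_llm) -> visual tokens for the MLLM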
- 3D Visual Grounding (ScanRefer): Our model directly predicts the 3D oriented bounding box in the camera's coordinate system without any 3D data input, achieving 34.1% Acc@0.25.
- 3D Dense Captioning (Scan2Cap): Achieves competitive results (e.g., 74.1 CIDEr@0.5 on Scan2Cap) without explicit 3D scene data input.
- 3D Video Object Detection (curated from EmbodiedScan): Shows significant recall improvement (e.g., +19.3 F1 for common classes in 6-frame setting) by better handling egocentric-allocentric transformations.
- Spatial Reasoning (VSI-Bench): Our 4B model achieves an average score of 46.1%, surpassing Gemini-1.5-Pro.
- Generic Multimodal Benchmarks (CVBench, VideoMME, BLINK, TempCompass, NextQA): Enhancing spatial understanding incurs negligible loss on general multimodal performance.
Visualization results of VG-LLM on 3D visual grounding tasks.
Our model can identify the frame index in which the target object appears in a video stream, as well as its oriented 3D bounding box in the current frame. In this illustration, we show the video, the model's predicted oriented 3D bounding boxes (highlighted in green), and the ground-truth oriented 3D bounding boxes (highlighted in blue). As shown in the figure, our model can effectively identify spatial relationships such as "far away," "opposite," and "next to" based on the video input.
Visualization results of VG-LLM on 3D video object detection.
Our model can identify all objects throughout a video and output their oriented 3D bounding boxes in a unified coordinate system. As shown in the figure, our model can effectively detect objects of different granularities, including sink, bin, telephone, etc., and output their bounding boxes in a unified coordinate system.
- Clone the repository:
  git clone https://github.com/lavi-lab/VG-LLM
  cd VG-LLM
- Create a Conda environment and install dependencies (we recommend Python 3.10):
  conda create -n vgllm python=3.10
  conda activate vgllm
  pip install -e .
VG-LLM is trained and evaluated on a variety of datasets:
- 3D Scene Understanding:
- 3D Visual Grounding: ScanRefer, with 24 uniformly sampled frames per scene.
- 3D Dense Captioning: Scan2Cap, using Mask3D-detected object proposals extracted from LEO. We uniformly sample 16 frames for each scene.
- 3D Video Object Detection: Curated from EmbodiedScan, with consecutive frames sampled at 1 FPS.
- Spatial Reasoning Instruction Tuning:
- SPAR-7M: We used a subset of ~234K samples (3% of the original). Data preparation follows the official codebase; navigation-type samples are discarded.
- LLaVA-Video-178K (LLaVA-Hound split): We used a subset of ~63K samples (25% of original). Frames sampled at 2 FPS, 4-8 frames total.
- Evaluation Benchmarks: We adopt VSI-Bench, CV-Bench, BLINK, Video-MME, TempCompass, NextQA for evaluation.
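For reference, a minimal sketch of the uniform frame sampling described above (e.g., 24 frames per scene for ScanRefer, 16 for Scan2Cap); the helper name is ours, not from the repository.

import numpy as np

def uniform_sample_indices(num_total_frames: int, num_samples: int) -> list[int]:
    """Return evenly spaced frame indices, e.g. 24 for ScanRefer or 16 for Scan2Cap."""
    if num_total_frames <= num_samples:
        return list(range(num_total_frames))
    return np.linspace(0, num_total_frames - 1, num_samples).round().astype(int).tolist()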
We release the following finetuned models:
- VG-LLM-4B (3D Scene Understanding): VGLLM_For_3D_Scene_Understanding_4B
- VG-LLM-4B (Spatial Reasoning): VGLLM_for_Spatial_Reasoning_4B
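For example, a checkpoint can be downloaded with huggingface_hub; the spatial-reasoning repository ID below is taken from the evaluation script further down, and the local directory is only a suggestion.

from huggingface_hub import snapshot_download

# Download the spatial-reasoning checkpoint (repo ID taken from the evaluation script below).
snapshot_download(
    repo_id="zd11024/VGLLM_for_Spatial_Reasoning_4B",
    local_dir="checkpoints/VGLLM_for_Spatial_Reasoning_4B",
)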
- Download the demo data at this link and place it at data/demo_data.
- Download the required model checkpoints according to the previous section.
- Run the script demo.ipynb.
Before starting the training process, you need to download the required datasets and annotations according to the following folder structure.
data
├── evaluation
│   ├── scan2cap
│   ├── scanrefer
│   └── threedod
├── media
│   ├── llava_hound
│   ├── scannet
│   └── spar
└── train
    ├── llava_hound_255k.json
    ├── scan2cap_train_16frames.json
    ├── scannet_det_train_4frames.json
    ├── scanrefer_train_24frames.json
    └── spar_7m.jsonl
- 3D Scene Understanding:
  - Annotations: Download the annotation files from VG-LLM-Data.
  - Media Data: Prepare the preprocessed video frames following the instructions of Video-3D LLM.
- Spatial Reasoning:
  - Annotations: Download the annotation files from VG-LLM-Data.
  - Video Data: Download the media data of LLaVA-Video-178K (LLaVA-Hound split) from ShareGPTVideo.
  - SPAR Data: Download the media data of SPAR from SPAR-7M.
We provide two example entries below.
Example for LLaVA-Video-178K (LLaVA-Hound Split).
{
"id": "23230678_1",
"conversations": [
{
"from": "human",
"value": "<video>\nWhat is the contrast provided in the video's midway point?"
},
{
"from": "gpt",
"value": "In the midway point of the video, a handgun is displayed on a surface covered with documents, providing a stark contrast to the earlier images of the cigarette being inhaled."
}
],
"data_source": "llava_hound",
"video": "llava_hound/frames/23230678"
}
Example for SPAR-7M.
{
"id": "scene0012_01_1661",
"conversations": [
{
"from": "human",
"value": "<image>\n<image>\n<image>\nAssume the depth of box (red point) is 2.0. How much deeper or shallower is chair (green point) relative to table (blue point), measured in meters? Calculate or judge based on the 3D center points of these objects. The depth is calculated based on the image where the markers corresponding to these objects are located. Provide a numeric response with just one value."
},
{
"from": "gpt",
"value": "1.5"
}
],
"images": [
"spar/scannet/images/scene0012_01/image_color/2626.jpg",
"spar/scannet/images/scene0012_01/image_color/3321.jpg",
"spar/scannet/images/scene0012_01/image_color/133.jpg"
],
"spar_info": "{\"red_point\": [[395, 89]], \"blue_point\": [[494, 620]], \"green_point\": [[878, 737]], \"point_img_idx\": [[0, 2, 1]], <
8A93
span class="pl-cce">\"type\": \"depth_prediction_oo_mv\"}"
}
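As a quick sanity check, the annotation files can be inspected as follows; this assumes llava_hound_255k.json is a JSON array and spar_7m.jsonl is JSON Lines, matching the file extensions in the folder structure above.

import json

# Inspect the first entry of each annotation file (paths follow the folder structure above).
with open("data/train/llava_hound_255k.json") as f:
    llava_hound = json.load(f)
print(llava_hound[0]["id"], llava_hound[0]["video"])

with open("data/train/spar_7m.jsonl") as f:
    first_spar = json.loads(next(f))
print(first_spar["id"], first_spar["images"])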
Next, you need to configure the data paths in the source code, following Qwen-2.5-VL. Modify the src/qwen_vl/data/__init__.py file to ensure the scripts can locate your datasets.
- annotation_path: This should point to the JSON or JSONL file containing your downloaded dataset annotations.
- data_path: This can be left empty if the image and video paths specified in your annotation files are absolute paths. Otherwise, provide the directory where your data is stored.
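As an illustration, a dataset entry in src/qwen_vl/data/__init__.py might look like the following; the exact dictionary layout is defined by the Qwen-2.5-VL codebase, so treat the field names here only as the two settings described above.

# Hypothetical example entry; check the existing entries in
# src/qwen_vl/data/__init__.py for the exact expected format.
SPAR_7M = {
    "annotation_path": "data/train/spar_7m.jsonl",  # JSON/JSONL annotation file
    "data_path": "data/media",  # leave "" if annotation paths are absolute
}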
We train two models separately for 3D scene understanding and spatial reasoning tasks. The following instructions are for 3D scene understanding.
To start the training, execute the following script:
bash scripts/train/train_3d.sh
For spatial reasoning, run the following command:
bash scripts/train/train_sr.sh
- Hardware: Our experiments were conducted on a setup with 8x NVIDIA H800 (80G) GPUs.
- Hyperparameters: We trained the model for one epoch using the Adam optimizer with a batch size of 16, a warmup ratio of 0.03, and a learning rate of 5e-6.
- Frozen Components: During training, the visual encoder of the MLLM, the 3D geometry encoder, and the multimodal connector are kept frozen.
- Training Duration:
- 3D Scene Understanding: Approximately 8 hours.
- Spatial Reasoning: Approximately 12 hours.
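For illustration, the freezing scheme above can be expressed as follows; the module names in the usage comment are placeholders rather than the actual attributes in this repository.

import torch.nn as nn

def freeze(*modules: nn.Module) -> None:
    """Disable gradients for the frozen components (2D visual encoder, 3D geometry encoder, connector)."""
    for module in modules:
        for param in module.parameters():
            param.requires_grad = False

# Usage (attribute names are placeholders):
# freeze(model.visual_encoder, model.geometry_encoder, model.mm_connector)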
Evaluation is performed using LMMs-Eval with greedy sampling for generation. For video benchmarks, 32 frames are uniformly sampled for VSI-Bench.
Please refer to the example evaluation script (scripts/evaluation/eval.sh) below for detailed command usage. You may need to adjust model_path, benchmark, or other parameters based on your specific setup and requirements.
set -e
export LMMS_EVAL_LAUNCHER="accelerate"
export NCCL_NVLS_ENABLE=0
benchmark=vsibench # choices: [vsibench, cvbench, blink_spatial]
output_path=logs/$(TZ="Asia/Shanghai" date "+%Y%m%d")
model_path=zd11024/VGLLM_for_Spatial_Reasoning_4B
accelerate launch --num_processes=8 -m lmms_eval \
--model vgllm \
--model_args pretrained=$model_path,use_flash_attention_2=true,max_num_frames=32,max_length=12800 \
--tasks ${benchmark} \
--batch_size 1 \
--output_path $output_path
For 3D scene understanding, please refer to the script scripts/evaluation/eval_3d.sh for more details. Note that for 3D visual grounding, a frame index must be inserted before each frame by setting add_frame_index to true.
- Release the model weights.
- Release the inference demo.
- Release the evaluation code, preprocessing data and training scripts for spatial reasoning.
- Release the evaluation code, preprocessing data and training scripts for 3D scene understanding.
If you find our work useful, please consider citing:
@article{zheng2025learning,
title={Learning from Videos for 3D World: Enhancing MLLMs with 3D Vision Geometry Priors},
author={Zheng, Duo and Huang, Shijia and Li, Yanyang and Wang, Liwei},
journal={arXiv preprint arXiv:2505.24625},
year={2025}
}
- This work is built upon excellent previous research, including Qwen2.5-VL, VGGT, SPAR-7M, LLaVA-Video-178K, and various 3D datasets like ScanNet, ScanRefer, Scan2Cap, EmbodiedScan.
- We thank the developers of LMMs-Eval for their evaluation framework.