*Equal contribution. ‡ Corresponding author.
The Chinese University of Hong Kong

Previous research has investigated the application of Multimodal Large Language Models (MLLMs) in understanding 3D scenes by interpreting them as videos. These approaches generally depend on comprehensive 3D data inputs, such as point clouds or reconstructed Bird's-Eye View (BEV) maps. In our research, we advance this field by enhancing the capability of MLLMs to understand and reason in 3D space directly from video data, without the need for additional 3D input.
VG-LLM integrates a 3D visual geometry encoder (based on VGGT) with a conventional 2D visual encoder.
- Input video frames are processed by both encoders. The 2D encoder extracts semantic-aware visual features from individual images. The 3D visual geometry encoder processes the sequence to produce globally geometry-aware visual features, capturing inter-frame correspondences.
- Features from both encoders are fused at the patch level.
- These fused, geometry-augmented visual features, along with text embeddings of a question, are fed into an MLLM backbone (Qwen2.5-VL) to generate a response.
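To make the fusion step above concrete, here is a minimal PyTorch-style sketch (not the official implementation): it assumes both encoders emit per-patch features aligned to the same patch grid, and that fusion is a concatenation followed by a linear projection to the LLM width.

import torch
import torch.nn as nn

class GeometryAugmentedFusion(nn.Module):
    """Illustrative patch-level fusion of 2D semantic and 3D geometry features."""

    def __init__(self, dim_2d: int, dim_3d: int, dim_llm: int):
        super().__init__()
        # Project the concatenated 2D-semantic + 3D-geometry features to the LLM width.
        self.proj = nn.Linear(dim_2d + dim_3d, dim_llm)

    def forward(self, feats_2d: torch.Tensor, feats_3d: torch.Tensor) -> torch.Tensor:
        # feats_2d: (B, T, N, dim_2d) per-frame semantic patch features
        # feats_3d: (B, T, N, dim_3d) globally geometry-aware patch features
        fused = torch.cat([feats_2d, feats_3d], dim=-1)  # (B, T, N, dim_2d + dim_3d)
        return self.proj(fused)  # (B, T, N, dim_llm) -> visual tokens for the MLLM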
- 3D Visual Grounding (ScanRefer): Our model directly predicts the 3D oriented bounding box in the camera's coordinate system without any 3D data input, achieving 34.1% Acc@0.25.
- 3D Dense Captioning (Scan2Cap): Achieves competitive results (e.g., 74.1 CIDEr@0.5 on Scan2Cap) without explicit 3D scene data input.
- 3D Video Object Detection (curated from EmbodiedScan): Shows significant recall improvement (e.g., +19.3 F1 for common classes in 6-frame setting) by better handling egocentric-allocentric transformations.
- Spatial Reasoning (VSI-Bench): Our 4B model achieves an average score of 46.1%, surpassing Gemini-1.5-Pro.
- Generic Multimodal Benchmarks (CVBench, VideoMME, BLINK, TempCompass, NextQA): Enhancing spatial understanding incurs negligible loss on general multimodal performance.
Visualization results of VG-LLM on 3D visual grounding tasks.
Our model can identify the frame index in which the target object appears in a video stream, as well as its oriented 3D bounding box in the current frame. In this illustration, we show the video, the model's predicted oriented 3D bounding boxes (highlighted in green), and the ground-truth oriented 3D bounding boxes (highlighted in blue). As shown in the figure, our model can effectively identify spatial relationships such as "far away," "opposite," and "next to" based on the video input.
Visualization results of VG-LLM on 3D video object detection.
Our model can identify all objects throughout a video and output their oriented 3D bounding boxes in a unified coordinate system. As shown in the figure, our model can effectively detect objects of different granularities, including sink, bin, telephone, etc., and output their bounding boxes in a unified coordinate system.
- Clone the repository:
  git clone https://github.com/lavi-lab/VG-LLM
  cd VG-LLM
- Create a Conda environment and install dependencies (we recommend Python 3.10):
  conda create -n vgllm python=3.10
  conda activate vgllm
  pip install -e .
VG-LLM is trained and evaluated on a variety of datasets:
- 3D Scene Understanding:
- 3D Visual Grounding: ScanRefer, with 24 uniformly sampled frames per scene.
- 3D Dense Captioning: Scan2Cap, using Mask3D-detected object proposals extracted from LEO. We uniformly sample 16 frames for each scene.
- 3D Video Object Detection: Curated from EmbodiedScan, with consecutive frames sampled at 1 FPS.
- Spatial Reasoning Instruction Tuning:
- SPAR-7M: We used a subset of ~234K samples (3% of the original). Data preparation follows the official codebase; navigation-type samples are discarded.
- LLaVA-Video-178K (LLaVA-Hound split): We used a subset of ~63K samples (25% of original). Frames sampled at 2 FPS, 4-8 frames total.
- Evaluation Benchmarks: We adopt VSI-Bench, CV-Bench, BLINK, Video-MME, TempCompass, NextQA for evaluation.
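For reference, a minimal sketch of the uniform frame sampling described above (e.g., 24 frames per scene for ScanRefer, 16 for Scan2Cap); the helper name is ours, not from the repository.

import numpy as np

def uniform_sample_indices(num_total_frames: int, num_samples: int) -> list[int]:
    """Return evenly spaced frame indices, e.g. 24 for ScanRefer or 16 for Scan2Cap."""
    if num_total_frames <= num_samples:
        return list(range(num_total_frames))
    return np.linspace(0, num_total_frames - 1, num_samples).round().astype(int).tolist()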
We release the following finetuned models:
- VG-LLM-4B (3D Scene Understanding): VGLLM_For_3D_Scene_Understanding_4B
- VG-LLM-4B (Spatial Reasoning): VGLLM_for_Spatial_Reasoning_4B
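For example, a checkpoint can be downloaded with huggingface_hub; the spatial-reasoning repository ID below is taken from the evaluation script further down, and the local directory is only a suggestion.

from huggingface_hub import snapshot_download

# Download the spatial-reasoning checkpoint (repo ID taken from the evaluation script below).
snapshot_download(
    repo_id="zd11024/VGLLM_for_Spatial_Reasoning_4B",
    local_dir="checkpoints/VGLLM_for_Spatial_Reasoning_4B",
)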
- Download the demo data at this link and place it at data/demo_data.
- Download the required model checkpoints according to the previous section.
- Run the script demo.ipynb.
Before starting the training process, you need to download the required datasets and annotations according to the following folder structure.
data
├── evaluation
│   ├── scan2cap
│   ├── scanrefer
│   └── threedod
├── media
│   ├── llava_hound
│   ├── scannet
│   └── spar
└── train
    ├── llava_hound_255k.json
    ├── scan2cap_train_16frames.json
    ├── scannet_det_train_4frames.json
    ├── scanrefer_train_24frames.json
    └── spar_7m.jsonl
- 3D Scene Understanding:
  - Annotations: Download the annotation files from VG-LLM-Data.
  - Media Data: Prepare the preprocessed video frames following the instructions of Video-3D LLM.
- Spatial Reasoning:
  - Annotations: Download the annotation files from VG-LLM-Data.
  - Video Data: Download the media data of LLaVA-Video-178K (LLaVA-Hound split) from ShareGPTVideo.
  - SPAR Data: Download the media data of SPAR from SPAR-7M.
We provide two example entries below.
Example for LLaVA-Video-178K (LLaVA-Hound Split).
{
"id": "23230678_1",
"conversations": [
{
"from": "human",
"value": "<video>\nWhat is the contrast provided in the video's midway point?"
},
{
"from": "gpt",
"value": "In the midway point of the video, a handgun is displayed on a surface covered with documents, providing a stark contrast to the earlier images of the cigarette being inhaled."
}
],
"data_source": "llava_hound",
"video": "llava_hound/frames/23230678"
}
Example for SPAR-7M.
{
"id": "scene0012_01_1661",
"conversations": [
{
"from": "human",
"value": "<image>\n<image>\n<image>\nAssume the depth of box (red point) is 2.0. How much deeper or shallower is chair (green point) relative to table (blue point), measured in meters? Calculate or judge based on the 3D center points of these objects. The depth is calculated based on the image where the markers corresponding to these objects are located. Provide a numeric response with just one value."
},
{
"from": "gpt",
"value": "1.5"
}
],
"images": [
"spar/scannet/images/scene0012_01/image_color/2626.jpg",
"spar/scannet/images/scene0012_01/image_color/3321.jpg",
"spar/scannet/images/scene0012_01/image_color/133.jpg"
],
"spar_info": "{\"red_point\": [[395, 89]], \"blue_point\": [[494, 620]], \"green_point\": [[878, 737]], \"point_img_idx\": [[0, 2, 1]], <
8A93
span class="pl-cce">\"type\": \"depth_prediction_oo_mv\"}"
}
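As a quick sanity check, the annotation files can be inspected as follows; this assumes llava_hound_255k.json is a JSON array and spar_7m.jsonl is JSON Lines, matching the file extensions in the folder structure above.

import json

# Inspect the first entry of each annotation file (paths follow the folder structure above).
with open("data/train/llava_hound_255k.json") as f:
    llava_hound = json.load(f)
print(llava_hound[0]["id"], llava_hound[0]["video"])

with open("data/train/spar_7m.jsonl") as f:
    first_spar = json.loads(next(f))
print(first_spar["id"], first_spar["images"])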
Next, you need to configure the data paths in the source code, following Qwen-2.5-VL. Modify the src/qwen_vl/data/__init__.py file to ensure the scripts can locate your datasets.
- annotation_path: This should point to the JSON or JSONL file containing your downloaded dataset annotations.
- data_path: This can be left empty if the image and video paths specified in your annotation files are absolute paths. Otherwise, provide the directory where your data is stored.
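As an illustration, a dataset entry in src/qwen_vl/data/__init__.py might look like the following; the exact dictionary layout is defined by the Qwen-2.5-VL codebase, so treat the field names here only as the two settings described above.

# Hypothetical example entry; check the existing entries in
# src/qwen_vl/data/__init__.py for the exact expected format.
SPAR_7M = {
    "annotation_path": "data/train/spar_7m.jsonl",  # JSON/JSONL annotation file
    "data_path": "data/media",  # leave "" if annotation paths are absolute
}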
We train two models separately for 3D scene understanding and spatial reasoning tasks. The following instructions are for 3D scene understanding.
To start the training, execute the following script:
bash scripts/train/train_3d.sh
For spatial reasoning, run the following command:
bash scripts/train/train_sr.sh
- Hardware: Our experiments were conducted on a setup with 8x NVIDIA H800 (80G) GPUs.
- Hyperparameters: We trained the model for one epoch using the Adam optimizer with a batch size of 16, a warmup ratio of 0.03, and a learning rate of 5e-6.
- Frozen Components: During training, the visual encoder of the MLLM, the 3D geometry encoder, and the multimodal connector are kept frozen.
- Training Duration:
- 3D Scene Understanding: Approximately 8 hours.
- Spatial Reasoning: Approximately 12 hours.
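For illustration, the freezing scheme above can be expressed as follows; the module names in the usage comment are placeholders rather than the actual attributes in this repository.

import torch.nn as nn

def freeze(*modules: nn.Module) -> None:
    """Disable gradients for the frozen components (2D visual encoder, 3D geometry encoder, connector)."""
    for module in modules:
        for param in module.parameters():
            param.requires_grad = False

# Usage (attribute names are placeholders):
# freeze(model.visual_encoder, model.geometry_encoder, model.mm_connector)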
Evaluation is performed using LMMs-Eval with greedy sampling for generation. For video benchmarks, 32 frames are uniformly sampled for VSI-Bench.
Please refer to the example evaluation script (scripts/evaluation/eval.sh) below for detailed command usage. You may need to adjust model_path, benchmark, or other parameters based on your specific setup and requirements.
set -e
export LMMS_EVAL_LAUNCHER="accelerate"
export NCCL_NVLS_ENABLE=0
benchmark=vsibench # choices: [vsibench, cvbench, blink_spatial]
output_path=logs/$(TZ="Asia/Shanghai" date "+%Y%m%d")
model_path=zd11024/VGLLM_for_Spatial_Reasoning_4B
accelerate launch --num_processes=8 -m lmms_eval \
--model vgllm \
--model_args pretrained=$model_path,use_flash_attention_2=true,max_num_frames=32,max_length=12800 \
--tasks ${benchmark} \
--batch_size 1 \
--output_path $output_path
For 3D scene understanding, please refer to the script scripts/evaluation/eval_3d.sh for more details. Note that for 3D visual grounding, a frame index must be inserted before each frame by setting add_frame_index to true.
- Release the model weights.
- Release the inference demo.
- Release the evaluation code, preprocessing data and training scripts for spatial reasoning.
- Release the evaluation code, preprocessing data and training scripts for 3D scene understanding.
If you find our work useful, please consider citing:
@article{zheng2025learning,
title={Learning from Videos for 3D World: Enhancing MLLMs with 3D Vision Geometry Priors},
author={Zheng, Duo and Huang, Shijia and Li, Yanyang and Wang, Liwei},
journal={arXiv preprint arXiv:2505.24625},
year={2025}
}
- This work is built upon excellent previous research, including Qwen2.5-VL, VGGT, SPAR-7M, LLaVA-Video-178K, and various 3D datasets like ScanNet, ScanRefer, Scan2Cap, EmbodiedScan.
- We thank the developers of LMMs-Eval for their evaluation framework.