Jinlong Li · Cristiano Saltori · Fabio Poiesi · Nicu Sebe
This repository contains the official PyTorch implementation of the paper "CUA-O3D: Cross-Modal and Uncertainty-Aware Agglomeration for Open-Vocabulary 3D Scene Understanding" (CVPR 2025). The paper is available on arXiv, and the project page is online at CUA-O3D.
The lack of a large-scale 3D-text corpus has led recent works to distill open-vocabulary knowledge from vision-language models (VLMs). However, these methods typically rely on a single VLM to align the feature space of 3D models within a common language space, which limits the potential of 3D models to leverage the diverse spatial and semantic capabilities encapsulated in various foundation models. In this paper, we propose Cross-modal and Uncertainty-aware Agglomeration for Open-vocabulary 3D Scene Understanding, dubbed CUA-O3D, the first model to integrate multiple foundation models, such as CLIP, DINOv2, and Stable Diffusion, into 3D scene understanding. We further introduce a deterministic uncertainty estimation to adaptively distill and harmonize the heterogeneous 2D feature embeddings from these models. Our method addresses two key challenges: (1) incorporating semantic priors from VLMs alongside the geometric knowledge of spatially-aware vision foundation models, and (2) using a novel deterministic uncertainty estimation to capture model-specific uncertainties across diverse semantic and geometric sensitivities, helping to reconcile heterogeneous representations during training. Extensive experiments on ScanNetV2 and Matterport3D demonstrate that our method not only advances open-vocabulary segmentation but also achieves robust cross-domain alignment and competitive spatial perception, delivering state-of-the-art performance on tasks such as:
- Zero-shot 3D semantic segmentation
- Cross-modal zero-shot segmentation
- Linear probing segmentation
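At its core, CUA-O3D distills several 2D teachers into a single 3D backbone and uses a deterministic, per-point uncertainty estimate to down-weight distillation targets the model finds unreliable. The PyTorch sketch below illustrates one plausible form of such an uncertainty-weighted multi-teacher objective, assuming per-point projected teacher features as targets; the module names, dimensions, and exact loss terms are illustrative assumptions and do not mirror the repository's code.

# Illustrative sketch of uncertainty-weighted multi-teacher distillation.
# NOT the official training code; names, dimensions, and loss forms are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTeacherHeads(nn.Module):
    """One projection head per 2D teacher plus a per-point uncertainty estimate."""
    def __init__(self, backbone_dim, teacher_dims):
        super().__init__()
        self.heads = nn.ModuleDict(
            {name: nn.Linear(backbone_dim, dim) for name, dim in teacher_dims.items()}
        )
        self.uncertainty = nn.ModuleDict(
            {name: nn.Linear(backbone_dim, 1) for name in teacher_dims}
        )

    def forward(self, point_feats):  # point_feats: (N, backbone_dim)
        preds = {k: head(point_feats) for k, head in self.heads.items()}
        log_sigma = {k: u(point_feats).squeeze(-1) for k, u in self.uncertainty.items()}
        return preds, log_sigma

def distillation_loss(preds, log_sigma, targets):
    """Cosine distance for LSeg/SD and L1 for DINOv2 (as the config names suggest),
    each term re-weighted by the predicted per-point uncertainty."""
    total = 0.0
    for name, target in targets.items():
        if name == "dinov2":
            err = F.l1_loss(preds[name], target, reduction="none").mean(dim=-1)
        else:
            err = 1.0 - F.cosine_similarity(preds[name], target, dim=-1)
        total = total + (torch.exp(-log_sigma[name]) * err + log_sigma[name]).mean()
    return total

In this formulation, the exp(-log_sigma) factor lets unreliable teacher projections contribute less per point, while the additive log_sigma term prevents the trivial solution of predicting unbounded uncertainty everywhere.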
Visit the CUA-O3D website to explore more details about the project, methodology, and results.
- 2D feature extraction release
- Distillation training release
- Linear probing training release
Requirements
- Python 3.x
- PyTorch 1.7.1
- CUDA 11.x or higher
The following installation instructions assume python=3.8, pytorch=1.7.1, and cuda=11.x.
- Create a conda virtual environment
conda create -n CUA_O3D python=3.8
conda activate CUA_O3D
- Clone the repository
git clone https://github.com/TyroneLi/CUA_O3D
- Install the dependencies
  - Install the environment dependencies
pip install -r requirements.txt
  - Install PyTorch 1.7.1
pip install torch==1.7.1+cu110 torchvision==0.8.2+cu110 -f https://download.pytorch.org/whl/torch_stable.html
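An optional sanity check (not part of the repository) to confirm that the CUDA 11.0 build of PyTorch was the one installed:

# Optional sanity check for the PyTorch install (illustrative, not part of the repository).
import torch

print(torch.__version__)          # expected: 1.7.1+cu110
print(torch.version.cuda)         # expected: 11.0
print(torch.cuda.is_available())  # expected: True on a machine with a CUDA-capable GPU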
  - Install MinkowskiEngine from source
conda install openblas-devel -c anaconda
git clone https://github.com/NVIDIA/MinkowskiEngine.git
cd MinkowskiEngine
python setup.py install --blas_include_dirs=${CONDA_PREFIX}/include --blas=openblas
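MinkowskiEngine builds against the local PyTorch and CUDA toolkit, so a quick smoke test (again, not part of the repository) helps catch version mismatches early:

# Optional smoke test: build a small sparse tensor and run one sparse convolution.
import torch
import MinkowskiEngine as ME

coords = torch.IntTensor([[0, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 1]])  # (batch, x, y, z)
feats = torch.rand(3, 4)
x = ME.SparseTensor(features=feats, coordinates=coords)
conv = ME.MinkowskiConvolution(in_channels=4, out_channels=8, kernel_size=3, dimension=3)
print(conv(x).F.shape)  # expected: torch.Size([3, 8])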
- Download the ScanNet v2 dataset. Put the downloaded scans and scans_test folders as follows:
CUA_O3D
├── data
│ ├── scannet
│ │ ├── scans
│ │ ├── scans_test
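A small, hypothetical helper (the root path is an assumption; adjust it to your setup) to confirm the layout before running the feature-extraction scripts:

# Hypothetical layout check for the ScanNet v2 data; adjust root to your own path.
from pathlib import Path

root = Path("data/scannet")
for split in ("scans", "scans_test"):
    scenes = sorted((root / split).glob("scene*"))
    print(f"{split}: {len(scenes)} scenes")  # ScanNet v2 ships 1513 scans and 100 scans_test scenes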
Pre-process the ScanNet data from 2D multi-view images using the LSeg, DINOv2, and Stable Diffusion models.
cd 2D_feature_extraction/
(1) For LSeg feature extraction and projection
CUDA_VISIBLE_DEVICES=0 python 2D_feature_extraction/embedding_projection/fusion_scannet_lseg.py \
--data_dir <ScanNetV2_save_path> \
--output_dir <save_path_for_lseg_projection_embeddings> \
--save_aligned False \
--split train \
--process_id_range 0,1600
(2) For DINOv2 feature extraction and projection
CUDA_VISIBLE_DEVICES=0 python 2D_feature_extraction/embedding_projection/fusion_scannet_dinov2.py \
--data_dir <ScanNetV2_save_path> \
--output_dir <save_path_for_DINOv2_projection_embeddings> \
--save_aligned False \
--split train \
--process_id_range 0,1600
(3) For Stable Diffusion (SD) feature extraction and projection
CUDA_VISIBLE_DEVICES=0 python 2D_feature_extraction/embedding_projection/fusion_scannet_sd.py \
--data_dir <ScanNetV2_save_path> \
--output_dir <save_path_for_SD_projection_embeddings> \
--save_aligned False \
--split train \
--process_id_range 0,1600
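Each script computes per-pixel 2D features, back-projects them onto the 3D points, and aggregates the features a point receives across views. The sketch below is an illustrative rendition of that multi-view fusion step, assuming world-space points, camera-to-world poses, depth maps, and dense per-view feature maps; it is not the repository's fusion code.

# Illustrative multi-view 2D-to-3D feature fusion (not the repository's fusion script).
import torch

def project_points(points_world, pose_c2w, intrinsics, depth, depth_thresh=0.05):
    """Return pixel coordinates (u, v) and a visibility mask for each 3D point."""
    pose_w2c = torch.inverse(pose_c2w)
    pts_h = torch.cat([points_world, torch.ones_like(points_world[:, :1])], dim=1)
    pts_cam = (pose_w2c @ pts_h.T).T[:, :3]                      # points in the camera frame
    z = pts_cam[:, 2].clamp(min=1e-6)
    u = (intrinsics[0, 0] * pts_cam[:, 0] / z + intrinsics[0, 2]).round().long()
    v = (intrinsics[1, 1] * pts_cam[:, 1] / z + intrinsics[1, 2]).round().long()
    h, w = depth.shape
    in_img = (u >= 0) & (u < w) & (v >= 0) & (v < h) & (pts_cam[:, 2] > 0)
    vis = in_img.clone()
    # occlusion check against the sensor depth map
    vis[in_img] = (depth[v[in_img], u[in_img]] - z[in_img]).abs() < depth_thresh
    return u, v, vis

def fuse_views(points_world, views, feat_dim):
    """views: iterable of (feat_map [C, H, W], depth [H, W], pose_c2w [4, 4], intrinsics [3, 3])."""
    fused = torch.zeros(points_world.shape[0], feat_dim)
    counts = torch.zeros(points_world.shape[0], 1)
    for feat_map, depth, pose, K in views:
        u, v, vis = project_points(points_world, pose, K, depth)
        fused[vis] += feat_map[:, v[vis], u[vis]].T   # gather the 2D feature under each visible point
        counts[vis] += 1
    return fused / counts.clamp(min=1)                # average over the views that saw each point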
After that, set the corresponding 2D projection embedding paths in the config through the following keys: data_root_2d_fused_feature, data_root_2d_fused_feature_dinov2, and data_root_2d_fused_feature_sd.
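A read-only check of those keys could look like this (assuming they are top-level entries in the YAML file; inspect the config if they are nested):

# Hedged example: confirm the three 2D-embedding paths are set in the config.
import yaml

cfg_path = "config_CUA_O3D/scannet/ours_lseg_ep50_lsegCosine_dinov2L1_SDCosine.yaml"
with open(cfg_path) as f:
    cfg = yaml.safe_load(f)

for key in ("data_root_2d_fused_feature",
            "data_root_2d_fused_feature_dinov2",
            "data_root_2d_fused_feature_sd"):
    print(key, "->", cfg.get(key))   # assumes the keys are top-level entries of the YAML file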
Perform Distillation Training
bash run/distill_with_dinov2_sd_adaptiveWeightLoss_demean.sh \
training_testing_logs/CUA_O3D_LSeg_DINOv2_SD \
config_CUA_O3D/scannet/ours_lseg_ep50_lsegCosine_dinov2L1_SDCosine.yaml
(1) Perform 2D Fusion Evaluation
sh run/evaluate_with_dinov2_sd.py \
training_testing_logs/CUA_O3D_LSeg_DINOv2_SD \
config_CUA_O3D/scannet/ours_lseg_ep50_lsegCosine_dinov2L1_SDCosine.yaml \
fusion
(2) Perform 2D Distillation Evaluation
sh run/evaluate_with_dinov2_sd.py \
training_testing_logs/CUA_O3D_LSeg_DINOv2_SD \
config_CUA_O3D/scannet/ours_lseg_ep50_lsegCosine_dinov2L1_SDCosine.yaml \
distill
(3) Perform 2D Ensemble Evaluation
sh run/evaluate_with_dinov2_sd.py \
training_testing_logs/CUA_O3D_LSeg_DINOv2_SD \
config_CUA_O3D/scannet/ours_lseg_ep50_lsegCosine_dinov2L1_SDCosine.yaml \
ensemble
(4) Perform Cross-Domain Evaluation on Matterport3D (21, 40, 80, and 160 classes)
sh run/evaluate_with_dinov2_sd.py \
training_testing_logs/CUA_O3D_LSeg_DINOv2_SD \
config_CUA_O3D/matterport/test_21classes.yaml \
ensemble
sh run/evaluate_with_dinov2_sd.py \
training_testing_logs/CUA_O3D_LSeg_DINOv2_SD \
config_CUA_O3D/matterport/test_40classes.yaml \
ensemble
sh run/evaluate_with_dinov2_sd.py \
training_testing_logs/CUA_O3D_LSeg_DINOv2_SD \
config_CUA_O3D/matterport/test_80classes.yaml \
ensemble
sh run/evaluate_with_dinov2_sd.py \
training_testing_logs/CUA_O3D_LSeg_DINOv2_SD \
config_CUA_O3D/matterport/test_160classes.yaml \
ensemble
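For context, the evaluation modes differ in which per-point features are compared against the text embeddings of the class names: the multi-view fused 2D features, the distilled 3D features, or a per-point ensemble of the two. The sketch below is an illustrative rendition of that label-assignment step, not the evaluation script itself; the confidence-based ensemble rule is an assumption.

# Illustrative open-vocabulary label assignment (not the evaluation script).
import torch
import torch.nn.functional as F

def classify(point_feats, text_embeds):
    """Cosine similarity between (N, D) point features and (K, D) class-name text embeddings."""
    return F.normalize(point_feats, dim=-1) @ F.normalize(text_embeds, dim=-1).T  # (N, K)

def ensemble_predict(fused_2d, distilled_3d, text_embeds):
    """Assumed ensemble rule: per point, keep whichever feature is more confident."""
    s2d, s3d = classify(fused_2d, text_embeds), classify(distilled_3d, text_embeds)
    use_2d = s2d.max(dim=-1).values > s3d.max(dim=-1).values
    scores = torch.where(use_2d.unsqueeze(-1), s2d, s3d)
    return scores.argmax(dim=-1)  # predicted class index per point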
(1) Concatenate the LSeg, DINOv2, and SD features to perform linear probing
sh run/distill_cat_prob_seg_all.sh \
config_CUA_O3D/scannet/ours_lseg_ep20_seg.yaml \
<best_model_saved_path_from_distillation_training>
(2) LSeg head to perform linear probing
sh run/distill_sep_prob_seg_Lseg.sh \
config_CUA_O3D/scannet/ours_lseg_ep20_seg.yaml \
<best_model_saved_path_from_distillation_training>
(3) DINOv2 head to perform linear probing
sh run/distill_sep_prob_seg_DINOv2.sh \
config_CUA_O3D/scannet/ours_lseg_ep20_seg.yaml \
<best_model_saved_path_from_distillation_training>
(4) SD head to perform linear probing
sh run/distill_sep_prob_seg_SD.sh \
config_CUA_O3D/scannet/ours_lseg_ep20_seg.yaml \
<best_model_saved_path_from_distillation_training>
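Linear probing trains only a linear classifier on top of the frozen, distilled per-point features (either a single head or the three heads concatenated). A minimal sketch, with the number of classes and feature dimension as placeholders:

# Minimal linear-probing sketch (illustrative; not the repository's probing script).
import torch
import torch.nn as nn

num_classes = 20       # e.g. the ScanNet benchmark classes
feat_dim = 512         # placeholder: use the dimension of the (concatenated) frozen head features
probe = nn.Linear(feat_dim, num_classes)
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss(ignore_index=255)  # 255 assumed as the ignore label

def probe_step(frozen_point_feats, labels):
    """frozen_point_feats: (N, feat_dim) detached backbone features; labels: (N,) class indices."""
    logits = probe(frozen_point_feats)
    loss = criterion(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()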
If you use our work in your research, please cite our publication:
@article{li2025cross,
title={Cross-Modal and Uncertainty-Aware Agglomeration for Open-Vocabulary 3D Scene Understanding},
author={Li, Jinlong and Saltori, Cristiano and Poiesi, Fabio and Sebe, Nicu},
journal={arXiv preprint arXiv:2503.16707},
year={2025}
}
We extend our gratitude to all contributors and supporters of the CUA-O3D project. Your valuable insights and contributions drive innovation and progress in the field of 3D and language-based AI systems.
This project is licensed under the Creative Commons Attribution-ShareAlike 4.0 International License.
For more information, visit the Creative Commons License page.