DSPNet: Dual-vision Scene Perception for Robust 3D Question Answering

This is the official repository of DSPNet: Dual-vision Scene Perception for Robust 3D Question Answering, accepted by CVPR2025.

🏠 Abstract

3D Question Answering (3D QA) requires the model to comprehensively understand the 3D scene in which it is situated, as described by the text, then reason about its surrounding environment and answer a question under that situation. However, existing methods usually rely on global scene perception from pure 3D point clouds and overlook the importance of rich local texture details from multi-view images. Moreover, due to the inherent noise in camera poses and complex occlusions, significant feature degradation and reduced feature robustness arise when aligning 3D point clouds with multi-view images. In this paper, we propose a Dual-vision Scene Perception Network (DSPNet) to comprehensively integrate multi-view and point cloud features and improve robustness in 3D QA. Our Text-guided Multi-view Fusion (TGMF) module prioritizes image views that closely match the semantic content of the text. To adaptively fuse back-projected multi-view images with point cloud features, we design the Adaptive Dual-vision Perception (ADVP) module, enhancing 3D scene comprehension. Additionally, our Multimodal Context-guided Reasoning (MCGR) module facilitates robust reasoning by integrating contextual information across visual and linguistic modalities. Experimental results on the SQA3D and ScanQA datasets demonstrate the superiority of DSPNet.
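
For intuition only, here is a minimal sketch of the text-guided view weighting idea behind TGMF: per-view image features are weighted by their similarity to the question embedding before fusion. This is an illustrative simplification under assumed tensor shapes and cosine-similarity scoring, not the paper's implementation; the function name and layout are our own.

    import torch
    import torch.nn.functional as F

    def text_guided_view_weighting(view_feats, text_feat):
        """Illustrative sketch: weight V image views by similarity to the question.

        view_feats: (V, C) pooled features of V image views.
        text_feat:  (C,)   pooled question embedding.
        Returns a (C,) fused feature in which views that better match the
        question contribute more.
        """
        sim = F.cosine_similarity(view_feats, text_feat.unsqueeze(0), dim=-1)  # (V,)
        weights = torch.softmax(sim, dim=0)                                    # (V,)
        return (weights.unsqueeze(-1) * view_feats).sum(dim=0)                 # (C,)

    # Toy usage: 20 views with 256-dim pooled features.
    fused = text_guided_view_weighting(torch.randn(20, 256), torch.randn(256))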

📚 Installation

Please refer to the installation guide.

📋 Dataset

Please refer to the data preparation guide for preparing the ScanNet v2, ScanQA and SQA3D datasets.

🤖 Usage

Training

  • (Optional) Start by training PointNet++ on the ScanNet object detection task, or directly download the checkpoint. We found that skipping PointNet++ pretraining has little impact on the ScanQA and SQA3D tasks.

      CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node=4 tools/train.py \
      configs/scannet-det/scannet-votenet-12xb12.py --work-dir=work_dirs/scannet-det/scannet-votenet --launcher pytorch
  • Training DSPNet on ScanQA task:

      CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node=4 tools/train.py \
      configs/scanqa/mv-scanqa-pointnetpp-swin-sbert-12xb12.py --work-dir=work_dirs/mv-scanqa/mv-scanqa-pointnetpp-swin-sbert-12xb12 --launcher pytorch
  • Training DSPNet on SQA task:

      CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node=4 tools/train.py \
      configs/sqa/mv-sqa-pointnetpp-swin-sbert-12xb12.py --work-dir=work_dirs/mv-sqa/mv-sqa-pointnetpp-swin-sbert-12xb12 --launcher pytorch
  • For ease of analysis, evaluation is run at the end of every validation interval during training; the config sketch below shows where this interval is typically set.
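
How often this in-training evaluation runs is typically controlled by the training-loop settings in the config. A minimal sketch, assuming the standard mmengine convention used by the EmbodiedScan/mmdetection3d codebase this repository builds on; the values are assumptions and the actual config may differ:

    # Hypothetical excerpt from a training config such as
    # configs/scanqa/mv-scanqa-pointnetpp-swin-sbert-12xb12.py (mmengine-style).
    train_cfg = dict(
        type='EpochBasedTrainLoop',
        max_epochs=12,    # total training epochs (assumed value)
        val_interval=1,   # run evaluation at the end of every `val_interval` epochs
    )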

Inference

  • Evaluation of trained models with the ScanQA test dataset:

      CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node=4 tools/test.py \
      configs/scanqa/mv-scanqa-pointnetpp-swin-sbert-12xb12.py work_dirs/mv-scanqa/mv-scanqa-pointnetpp-swin-sbert-12xb12/best_EM@1_epoch_<epoch_id>.pth \
      --work-dir=work_dirs/scanqa_test_w_object  --launcher pytorch
  • The ScanQA benchmark is hosted on EvalAI. Please submit work_dirs/scanqa_test_{split}/test_result.json to the ScanQA EvalAI page for evaluation on the test w/ objects and test w/o objects splits. You can select the test split by modifying the ann_file and qa_file of test_dataloader in mv-scanqa-pointnetpp-swin-sbert-12xb12.py (see the config sketch after this list).

  • Evaluation of trained models with the SQA3D test dataset:

      CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node=4 tools/test.py \
      configs/sqa/mv-sqa-pointnetpp-swin-sbert-12xb12.py work_dirs/mv-sqa/mv-sqa-pointnetpp-swin-sbert-12xb12/best_EM@1_epoch_<epoch_id>.pth \
      --work-dir=work_dirs/sqa_test  --launcher pytorch
  • Due to the inherent randomness of multi-GPU distributed evaluation, re-evaluation results may differ slightly from the evaluation results obtained during training (we report the results obtained during training).
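
A minimal sketch of the test-split switch mentioned above, assuming the mmengine-style config layout; the annotation paths are placeholders, not the repository's actual file names:

    # Hypothetical excerpt from configs/scanqa/mv-scanqa-pointnetpp-swin-sbert-12xb12.py.
    # Point ann_file / qa_file at the desired split (paths below are placeholders).
    test_dataloader = dict(
        dataset=dict(
            ann_file='<ScanQA test-split annotation file>',  # test w/ or w/o objects
            qa_file='<ScanQA test-split question file>',
        ),
    )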

📦 Checkpoints

| Checkpoint | Link | Note |
| --- | --- | --- |
| VoteNet | link | VoteNet pre-trained on ScanNet. |
| DSPNet-ScanQA | link | Trained on ScanQA. |
| DSPNet-SQA | link | Trained on SQA3D. |

🔗 Citation

If you find our work helpful for your research, please consider citing our paper:

@misc{luo2025dspnet,
      title={DSPNet: Dual-vision Scene Perception for Robust 3D Question Answering}, 
      author={Jingzhou Luo and Yang Liu and Weixing Chen and Zhen Li and Yaowei Wang and Guanbin Li and Liang Lin},
      year={2025},
      eprint={2503.03190},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2503.03190}, 
}

📄 License

This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

👏 Acknowledgements

We would like to thank EmbodiedScan for the 3D training pipeline codebase and mmdetection3d for the PointNet++ and VoteNet codebases.
