Mert Bulent Sariyildiz · Philippe Weinzaepfel · Thomas Lucas · Pau de Jorge · Diane Larlus · Yannis Kalantidis
NAVER LABS Europe
CVPR 2025
- Model overview
- Pre-trained models
- Installation
- Training models
- Evaluating models
- PCA visualization
- Citation
DUNE is a vision encoder trained via multi-teacher distillation. Specifically, DUNE encoders are distilled from three heterogeneous pre-trained vision models used as teachers: DINOv2, MASt3R and Multi-HMR. We use 19 datasets for distillation, covering the visual domains of all three teachers and comprising approximately 20.7 million images in total. The full list of datasets is given below and in Table 5 of the paper. For all teachers, we used their publicly available ViT-Large checkpoints.
DUNE achieves strong performance on a range of 2D and 3D downstream tasks, such as monocular depth estimation, semantic segmentation, multi-view depth estimation, multi-human mesh recovery, multi-view pose regression and 3D reconstruction. Notably, a MASt3R model with a DUNE encoder achieves new state-of-the-art performance on Map-free Visual Relocalization, improving over the original MASt3R while using a much smaller encoder.
Architecture | Resolution | Checkpoint | Sem.Seg. ADE20K | Sem.Seg. CityScapes | Sem.Seg. NYU | Sem.Seg. Scannet | Depth NYUd | BEDLAM (PA-PVE) | MapFree (AUC) |
---|---|---|---|---|---|---|---|---|---|
ViT-Base/14 (420MB) | 336 | dune_vitbase14_336.pth | 45.0 | 69.3 | 66.9 | 64.6 | 0.384 | 64.3 | 94.1 |
ViT-Base/14 | 448 | dune_vitbase14_448.pth | 46.2 | 71.3 | 68.3 | 65.4 | 0.365 | 60.1 | 94.2 |
ViT-Base/14* | 448 | dune_vitbase14_448_paper.pth | 45.6 | 70.6 | 68.2 | 65.2 | 0.358 | 56.0 | 94.7 |
- *Model reported in the paper and trained using an earlier (internal) version of this codebase.
Architecture | Resolution | Checkpoint | Sem.Seg. ADE20K | Sem.Seg. CityScapes | Sem.Seg. NYU | Sem.Seg. Scannet | Depth NYUd | BEDLAM (PA-PVE) | MapFree (AUC) |
---|---|---|---|---|---|---|---|---|---|
ViT-Small/14 (110MB) | 336 | dune_vitsmall14_336.pth | 39.6 | 61.7 | 63.5 | 60.1 | 0.424 | WIP | WIP |
ViT-Small/14 | 448 | dune_vitsmall14_448.pth | 41.4 | 63.7 | 65.5 | 61.2 | 0.404 | 69.0 | 94.5 |
- WIP: Work in progress.
- Semantic segmentation (Sem.Seg.) results in the tables are obtained after the DINOv2 projectors, following the convention in the paper.
To load a pre-trained model, you can either clone this repository and download a checkpoint:
from model.dune import load_dune_from_checkpoint
model = load_dune_from_checkpoint("./dune_vitbase14_448_paper.pth")
or use `torch.hub` directly (see hubconf.py for all available models):
import torch

# full model with projectors and teacher norms
model = torch.hub.load("naver/dune", "dune_vitbase_14_448_paper")
# just the ViT encoder part of the model
model = torch.hub.load("naver/dune", "dune_vitbase_14_448_paper_encoder")
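Either way, a quick smoke test of the loaded encoder can look like the sketch below. The dummy input is ours, and we assume the encoder accepts a normalized 3-channel image batch at a resolution divisible by the patch size of 14; the exact output structure is not guaranteed by this sketch, hence the final print:

import torch

# Encoder-only entry point (see hubconf.py).
encoder = torch.hub.load("naver/dune", "dune_vitbase_14_448_paper_encoder")
encoder.eval()

# Dummy batch standing in for a normalized image; 448 is a multiple of
# the patch size (14).
x = torch.randn(1, 3, 448, 448)
with torch.no_grad():
    out = encoder(x)

# Inspect the output structure before building on it.
print(type(out))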
- Create a conda environment with all the necessary packages.
env_name="dune"
conda create -n ${env_name}
conda activate ${env_name}
conda install python=3.12
pip install -U torch=='2.7.0' torchvision torchfix timm 'huggingface_hub>=0.22' transformers accelerate einops torchmetrics optuna tensorboard matplotlib pandas jaxtyping scikit-learn-intelex omegaconf opencv-python ipython black flake8 pylint rich ipykernel
- Set the path of your conda installation in scripts/setup_env.sh, i.e. update the `conda_dir` variable. Your environment will then be used automatically by the training script.
- To download the teacher models we used in this work, you can check the bash scripts under the scripts/teachers folder. To download all teachers at once, use scripts/teachers/prepare_all.sh:
# BEFORE EXECUTING THIS COMMAND, MAKE SURE TO REVIEW THE CONTENTS OF THE SCRIPTS!
(cd scripts/teachers && ./prepare_all.sh <path_to_download_directory>)
- Once the teacher checkpoints are downloaded, update the `ckpt_path` keys in the `TEACHER_CFG` dictionary in teachers/config.py to point to the correct paths. For MASt3R, the preparation script mentioned above will additionally clone the MASt3R repository; you also need to set the `code_dir` key in `TEACHER_CFG` to point to the directory where this MASt3R repo is located.
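For orientation, the updated entries might look like the sketch below. This is hypothetical: the key names `ckpt_path` and `code_dir` come from this README, while the teacher keys, nesting and paths are assumptions, so check teachers/config.py for the actual structure.

# Hypothetical sketch of teachers/config.py; verify against the real file.
TEACHER_CFG = {
    "dino2": {
        "ckpt_path": "/path/to/teachers/dinov2_vitl14.pth",
    },
    "mast3r": {
        "ckpt_path": "/path/to/teachers/mast3r_vitl.pth",
        # directory where prepare_all.sh cloned the MASt3R repository
        "code_dir": "/path/to/teachers/mast3r",
    },
    "multihmr": {
        "ckpt_path": "/path/to/teachers/multihmr_vitl.pth",
    },
}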
We train DUNE models on a combination of 19 datasets comprising roughly 20.7M images. The full list of datasets is given below and in Table 5 of the paper. Dataloaders for these datasets are provided in data/dino2.py, data/mast3r.py and data/multihmr.py. However, we leave downloading and preprocessing of the datasets to the user. Once you have the datasets, set their paths in data/paths.py.
If downloading the 19 datasets is too cumbersome, it is also possible to train DUNE on ImageNet-1K only. To do that, set the `IN1K_DIRS` variable in data/paths.py to the path of your ImageNet-1K copy.
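For example (a hypothetical sketch; whether `IN1K_DIRS` expects a single string or a list of directories is an assumption, so check data/paths.py):

# In data/paths.py; hypothetical path, adjust to your setup.
# Assumption: a list of ImageNet-1K root directories.
IN1K_DIRS = ["/datasets/imagenet-1k"]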
Name | # Images | Nature |
---|---|---|
ImageNet-19K | 13,153,480 | Real |
Mapillary | 1,205,907 | Real |
Google Landmarks v2 | 4,132,914 | Real |
Habitat | 284,968 | Rendered |
ARKitScenes | 456,108 | Rendered |
Blended MVS | 98,937 | Rendered |
MegaDepth | 36,949 | Real |
ScanNet++ | 60,188 | Rendered |
CO3D-v2 | 185,100 | Real |
Map-free | 41,300 | Real |
WildRgb | 224,400 | Real |
VirtualKitti | 1,200 | Synthetic |
Unreal4K | 14,386 | Synthetic |
TartanAir | 136,225 | Synthetic |
DL3DV | 208,800 | Rendered |
BEDLAM | 353,118 | Synthetic |
AGORA | 14,314 | Synthetic |
CUFFS | 54,944 | Synthetic |
UBody | 54,234 | Real |
DUNE follows a two-stage training recipe: initial pre-training at resolution 336 for 100 "epochs", followed by fine-tuning at resolution 448 for 50 epochs. We define one epoch as 1,281,167 images, i.e., the size of the ImageNet-1K training set; 100 pre-training epochs therefore correspond to roughly 128M images seen.
# Pre-training at resolution 336
# To distill only on ImageNet-1K, pass --dataset="in1k" to the script
output_dir_pretrain="/path/to/dune/pretrain/dir/"
bash ./scripts/train.sh ${output_dir_pretrain}
# Fine-tuning at resolution 448
# Adjust batch size according to your GPU memory
output_dir_finetune="/path/to/dune/finetune/dir/"
bash ./scripts/train.sh ${output_dir_finetune} \
--pretrained=${output_dir_pretrain}/checkpoint.pth \
--image_size=448 \
--lr=5e-5 \
--epochs=50 \
--batch_size_per_gpu=128
We do not provide evaluation scripts for the downstream tasks. Please refer to the original repositories of each task for evaluation:
- https://github.com/facebookresearch/dinov2
- https://github.com/naver/mast3r
- https://github.com/naver/multi-hmr
We provide an example script, scripts/pca_vis.py, which shows how to load the encoder part of a DUNE model and visualize a PCA projection of its patch features. To execute this script:
PYTHONPATH=${PWD}:${PYTHONPATH} python scripts/pca_vis.py
This will generate a PCA visualization of the patch features that the best DUNE model reported in the paper extracts from a test image.
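For reference, the core of such a visualization can be sketched as follows. This is a minimal sketch, not the contents of scripts/pca_vis.py: the dummy input, the assumed output shape of (1, num_patches, dim), and the 32x32 patch grid (448/14) are assumptions.

import matplotlib.pyplot as plt
import torch
from sklearn.decomposition import PCA

# Load the encoder-only entry point (see hubconf.py).
encoder = torch.hub.load("naver/dune", "dune_vitbase_14_448_paper_encoder")
encoder.eval()

# Dummy input standing in for a preprocessed test image.
x = torch.randn(1, 3, 448, 448)
with torch.no_grad():
    feats = encoder(x)  # assumption: (1, num_patches, dim) patch features

patches = feats[0].cpu().numpy()                   # (1024, dim) for a 32x32 grid
rgb = PCA(n_components=3).fit_transform(patches)   # top-3 principal components
rgb = (rgb - rgb.min(0)) / (rgb.max(0) - rgb.min(0) + 1e-8)  # scale to [0, 1]

plt.imshow(rgb.reshape(32, 32, 3))                 # one color per patch
plt.axis("off")
plt.savefig("pca_vis.png", bbox_inches="tight")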
If you find this repository useful, please consider citing us:
@inproceedings{sariyildiz2025dune,
title={{DUNE}: Distilling a Universal Encoder from Heterogeneous {2D} and {3D} Teachers},
author={Sariyildiz, Mert Bulent and Weinzaepfel, Philippe and Lucas, Thomas and De Jorge, Pau and Larlus, Diane and Kalantidis, Yannis},
booktitle={IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
year={2025},
}