Mert Bulent Sariyildiz · Philippe Weinzaepfel · Thomas Lucas · Pau de Jorge · Diane Larlus · Yannis Kalantidis
NAVER LABS Europe
CVPR 2025
- Model overview
- Pre-trained models
- Installation
- Training models
- Evaluating models
- PCA visualization
- Citation
DUNE is a vision encoder trained via multi-teacher distillation. Specifically, DUNE encoders are distilled from three heterogeneous pre-trained vision models used as teachers: DINOv2, MASt3R and Multi-HMR. We use 19 datasets for distillation, covering the visual domains of all three teachers and comprising approximately 20.7 million images in total. The full list of datasets is given below and in Table 5 of the paper. For all teachers, we used their publicly available ViT-Large checkpoints.
DUNE achieves strong performance on a range of 2D and 3D downstream tasks, such as monocular depth estimation, semantic segmentation, multi-view depth estimation, multi-human mesh recovery, multi-view pose regression and 3D reconstruction. Notably, a MASt3R model with a DUNE encoder achieves new state-of-the-art performance on Map-free Visual Relocalization, improving over the original MASt3R while using a much smaller encoder.
Architecture | Resolution | Checkpoint | Sem.Seg. ADE20K | Sem.Seg. CityScapes | Sem.Seg. NYU | Sem.Seg. Scannet | Depth NYUd | BEDLAM (PA-PVE) | MapFree (AUC) |
---|---|---|---|---|---|---|---|---|---|
ViT-Base/14 (420MB) | 336 | dune_vitbase14_336.pth | 45.0 | 69.3 | 66.9 | 64.6 | 0.384 | 64.3 | 94.1 |
ViT-Base/14 | 448 | dune_vitbase14_448.pth | 46.2 | 71.3 | 68.3 | 65.4 | 0.365 | 60.1 | 94.2 |
ViT-Base/14* | 448 | dune_vitbase14_448_paper.pth | 45.6 | 70.6 | 68.2 | 65.2 | 0.358 | 56.0 | 94.7 |
- *Model reported in the paper and trained using an earlier (internal) version of this codebase.
Architecture | Resolution | Checkpoint | Sem.Seg. ADE20K | Sem.Seg. CityScapes | Sem.Seg. NYU | Sem.Seg. Scannet | Depth NYUd | BEDLAM (PA-PVE) | MapFree (AUC) |
---|---|---|---|---|---|---|---|---|---|
ViT-Small/14 (110MB) | 336 | dune_vitsmall14_336.pth | 39.6 | 61.7 | 63.5 | 60.1 | 0.424 | WIP | WIP |
ViT-Small/14 | 448 | dune_vitsmall14_448.pth | 41.4 | 63.7 | 65.5 | 61.2 | 0.404 | 69.0 | 94.5 |
- WIP: Work in progress.
- Semantic segmentation (Sem.Seg.) results in the tables are obtained after the DINOv2 projectors, following the convention in the paper.
To load a pre-trained model, you can either clone this repository and download a checkpoint:
from model.dune import load_dune_from_checkpoint
model = load_dune_from_checkpoint("./dune_vitbase14_448_paper.pth")
or use `torch.hub` directly (see hubconf.py for all available models):
import torch

# full model with projectors and teacher norms
model = torch.hub.load("naver/dune", "dune_vitbase_14_448_paper")
# just the ViT encoder part of the model
model = torch.hub.load("naver/dune", "dune_vitbase_14_448_paper_encoder")
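Either way, a quick smoke test of the loaded encoder can look like the sketch below. The dummy input is ours, and we assume the encoder accepts a normalized 3-channel image batch at a resolution divisible by the patch size of 14; the exact output structure is not guaranteed by this sketch, hence the final print:

import torch

# Encoder-only entry point (see hubconf.py).
encoder = torch.hub.load("naver/dune", "dune_vitbase_14_448_paper_encoder")
encoder.eval()

# Dummy batch standing in for a normalized image; 448 is a multiple of
# the patch size (14).
x = torch.randn(1, 3, 448, 448)
with torch.no_grad():
    out = encoder(x)

# Inspect the output structure before building on it.
print(type(out))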
- Create a conda environment with all the necessary packages.
env_name="dune"
conda create -n ${env_name}
conda activate ${env_name}
conda install python=3.12
pip install -U torch=='2.7.0' torchvision torchfix timm 'huggingface_hub>=0.22' transformers accelerate einops torchmetrics optuna tensorboard matplotlib pandas jaxtyping scikit-learn-intelex omegaconf opencv-python ipython black flake8 pylint rich ipykernel
- Set the path of your conda installation in scripts/setup_env.sh, i.e. update the `conda_dir` variable. Your environment will then be used automatically by the training script.
- To download the teacher models we used in this work, you can check the bash scripts under the scripts/teachers folder. To download all teachers at once, use scripts/teachers/prepare_all.sh:
# BEFORE EXECUTING THIS COMMAND, MAKE SURE TO REVIEW THE CONTENTS OF THE SCRIPTS!
(cd scripts/teachers && ./prepare_all.sh <path_to_download_directory>)
- Once the teacher checkpoints are downloaded, update the `ckpt_path` keys in the `TEACHER_CFG` dictionary in teachers/config.py to point to the correct paths. For MASt3R, the preparation script mentioned above will additionally clone the MASt3R repository; you also need to set the `code_dir` key in `TEACHER_CFG` to point to the directory where this MASt3R repo is located.
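For orientation, the updated entries might look like the sketch below. This is hypothetical: the key names `ckpt_path` and `code_dir` come from this README, while the teacher keys, nesting and paths are assumptions, so check teachers/config.py for the actual structure.

# Hypothetical sketch of teachers/config.py; verify against the real file.
TEACHER_CFG = {
    "dino2": {
        "ckpt_path": "/path/to/teachers/dinov2_vitl14.pth",
    },
    "mast3r": {
        "ckpt_path": "/path/to/teachers/mast3r_vitl.pth",
        # directory where prepare_all.sh cloned the MASt3R repository
        "code_dir": "/path/to/teachers/mast3r",
    },
    "multihmr": {
        "ckpt_path": "/path/to/teachers/multihmr_vitl.pth",
    },
}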
We train DUNE models on a combination of 19 datasets comprising roughly 20.7M images. The full list of datasets is given below and in Table 5 of the paper. Dataloaders for these datasets are provided in data/dino2.py, data/mast3r.py and data/multihmr.py. However, we leave downloading and preprocessing of the datasets to the user. Once you have the datasets, set their paths in data/paths.py.
If downloading the 19 datasets is too cumbersome, it is also possible to train DUNE on ImageNet-1K only. To do that, set the `IN1K_DIRS` variable in data/paths.py to the path of your ImageNet-1K copy.
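For example (a hypothetical sketch; whether `IN1K_DIRS` expects a single string or a list of directories is an assumption, so check data/paths.py):

# In data/paths.py; hypothetical path, adjust to your setup.
# Assumption: a list of ImageNet-1K root directories.
IN1K_DIRS = ["/datasets/imagenet-1k"]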
Name | # Images | Nature |
---|---|---|
ImageNet-19K | 13,153,480 | Real |
Mapillary | 1,205,907 | Real |
Google Landmarks v2 | 4,132,914 | Real |
Habitat | 284,968 | Rendered |
ARKitScenes | 456,108 | Rendered |
Blended MVS | 98,937 | Rendered |
MegaDepth | 36,949 | Real |
ScanNet++ | 60,188 | Rendered |
CO3D-v2 | 185,100 | Real |
Map-free | 41,300 | Real |
WildRgb | 224,400 | Real |
VirtualKitti | 1,200 | Synthetic |
Unreal4K | 14,386 | Synthetic |
TartanAir | 136,225 | Synthetic |
DL3DV | 208,800 | Rendered |
BEDLAM | 353,118 | Synthetic |
AGORA | 14,314 | Synthetic |
CUFFS | 54,944 | Synthetic |
UBody | 54,234 | Real |
DUNE follows a two-stage training recipe: initial pre-training at resolution 336 for 100 "epochs", followed by fine-tuning at resolution 448 for 50 epochs. We define one epoch as 1,281,167 images, i.e., the size of the ImageNet-1K training set; 100 pre-training epochs therefore correspond to roughly 128M images seen.
# Pre-training at resolution 336
# To distill only on ImageNet-1K, pass --dataset="in1k" to the script
output_dir_pretrain="/path/to/dune/pretrain/dir/"
bash ./scripts/train.sh ${output_dir_pretrain}
# Fine-tuning at resolution 448
# Adjust batch size according to your GPU memory
output_dir_finetune="/path/to/dune/finetune/dir/"
bash ./scripts/train.sh ${output_dir_finetune} \
--pretrained=${output_dir_pretrain}/checkpoint.pth \
--image_size=448 \
--lr=5e-5 \
--epochs=50 \
--batch_size_per_gpu=128
We do not provide evaluation scripts for the downstream tasks. Please refer to the original repositories of each task for evaluation:
- https://github.com/facebookresearch/dinov2
- https://github.com/naver/mast3r
- https://github.com/naver/multi-hmr
We provide an example script, scripts/pca_vis.py, which shows how to load the encoder part of a DUNE model and visualize a PCA projection of its patch features. To execute this script:
PYTHONPATH=${PWD}:${PYTHONPATH} python scripts/pca_vis.py
This will generate a PCA visualization of the patch features that the best DUNE model reported in the paper extracts from a test image.
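For reference, the core of such a visualization can be sketched as follows. This is a minimal sketch, not the contents of scripts/pca_vis.py: the dummy input, the assumed output shape of (1, num_patches, dim), and the 32x32 patch grid (448/14) are assumptions.

import matplotlib.pyplot as plt
import torch
from sklearn.decomposition import PCA

# Load the encoder-only entry point (see hubconf.py).
encoder = torch.hub.load("naver/dune", "dune_vitbase_14_448_paper_encoder")
encoder.eval()

# Dummy input standing in for a preprocessed test image.
x = torch.randn(1, 3, 448, 448)
with torch.no_grad():
    feats = encoder(x)  # assumption: (1, num_patches, dim) patch features

patches = feats[0].cpu().numpy()                   # (1024, dim) for a 32x32 grid
rgb = PCA(n_components=3).fit_transform(patches)   # top-3 principal components
rgb = (rgb - rgb.min(0)) / (rgb.max(0) - rgb.min(0) + 1e-8)  # scale to [0, 1]

plt.imshow(rgb.reshape(32, 32, 3))                 # one color per patch
plt.axis("off")
plt.savefig("pca_vis.png", bbox_inches="tight")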
If you find this repository useful, please consider citing us:
@inproceedings{sariyildiz2025dune,
title={{DUNE}: Distilling a Universal Encoder from Heterogeneous {2D} and {3D} Teachers},
author={Sariyildiz, Mert Bulent and Weinzaepfel, Philippe and Lucas, Thomas and De Jorge, Pau and Larlus, Diane and Kalantidis, Yannis},
booktitle={IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
year={2025},
}