Official implementation of the CVPR 2025 paper
"SyncVP: Joint Diffusion for Synchronous Multi-Modal Video Prediction"
by Enrico Pallotta, Sina Mokhtarzadeh Azar, Shuai Li, Olga Zatsarynna, and Juergen Gall.
To train your model, you can use the predefined config files or define custom ones; a sketch for deriving a custom config is shown after the steps below. Training proceeds in the following steps:
1. Train an autoencoder for each modality:

   python3 main.py --config configs/run/train/ae_city_rgb.yaml
2. (Optional) GAN fine-tuning: a few fine-tuning iterations may improve the autoencoder's reconstruction performance.
3. Train a video prediction diffusion model for each modality; this can already be used as a standalone model for video prediction:

   python3 main.py --config configs/run/train/ddpm_city_rgb.yaml
4. Train the joint SyncVP model. You can either initialize it with the pre-trained modality-specific diffusion models or train it from scratch; we recommend the first option, as discussed in the paper:

   python3 main.py --config configs/run/train/sync_city.yaml
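If you want to define a custom config, one simple pattern is to load a predefined one and override a few fields programmatically. The sketch below is only illustrative: the overridden keys (`batch_size`, `lr`) and the output file name are hypothetical and should be matched to the actual schema of the YAML files under `configs/run/train/`.

```python
# Minimal sketch: derive a custom config from a predefined one.
# NOTE: the overridden keys below are hypothetical examples; check the
# real schema of the files under configs/run/train/ before using them.
import yaml

with open("configs/run/train/ddpm_city_rgb.yaml") as f:
    cfg = yaml.safe_load(f)

cfg["batch_size"] = 8   # hypothetical key
cfg["lr"] = 1e-4        # hypothetical key

with open("configs/run/train/ddpm_custom.yaml", "w") as f:
    yaml.safe_dump(cfg, f)
```

The resulting file can then be passed to training as usual, e.g. `python3 main.py --config configs/run/train/ddpm_custom.yaml`.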
To evaluate a trained model, run:

python3 main.py --config configs/run/eval/sync_city.yaml
The Cityscapes autoencoder and multi-modal model checkpoints can be downloaded with:
bash download.sh
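To sanity-check the downloaded files, one option is to load a checkpoint and inspect its top-level keys. This is a generic PyTorch sketch, not part of the repository; the checkpoint path is a placeholder for wherever download.sh stores its files.

```python
# Inspect a downloaded checkpoint (the path is a placeholder).
import torch

ckpt = torch.load("checkpoints/sync_city.pt", map_location="cpu")
# Checkpoints are usually dicts holding a state_dict plus metadata,
# so printing the top-level keys shows what was saved.
if isinstance(ckpt, dict):
    print(list(ckpt.keys()))
else:
    print(type(ckpt))
```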
A preprocessed version of Cityscapes at 128x128 resolution, including disparity (depth) maps, can be downloaded here.
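As a rough illustration of what the preprocessed data contains, the sketch below loads one RGB frame together with its disparity map and checks the 128x128 resolution. The directory layout and file names are assumptions, not the actual archive structure; adapt them once you have unpacked the data.

```python
# Load one RGB frame and its disparity (depth) map from the
# preprocessed dataset. The paths are assumed, not the real layout.
import numpy as np
from PIL import Image

rgb = np.array(Image.open("cityscapes_128/rgb/seq_0000/frame_00.png"))
disp = np.array(Image.open("cityscapes_128/disparity/seq_0000/frame_00.png"))

assert rgb.shape[:2] == (128, 128)   # frames are preprocessed to 128x128
assert disp.shape[:2] == (128, 128)  # disparity is aligned with the RGB frames
print(rgb.shape, disp.shape, disp.dtype)
```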
- Non-1:1 aspect ratio implementation
- Full evaluation code release
- Training code released
@InProceedings{Pallotta_2025_CVPR,
author = {Pallotta, Enrico and Azar, Sina Mokhtarzadeh and Li, Shuai and Zatsarynna, Olga and Gall, Juergen},
title = {SyncVP: Joint Diffusion for Synchronous Multi-Modal Video Prediction},
booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR)},
month = {June},
year = {2025},
pages = {13787-13797}
}
This repository is mainly based on the PVDM codebase.