Christopher Wewer, Bart Pogodzinski, Bernt Schiele, Jan Eric Lenssen
Max Planck Institute for Informatics, Saarland Informatics Campus
- [25-05-01] 🎉 Spatial Reasoning with Denoising Models is accepted at ICML 2025! Meet us at our poster! 😁
- [25-03-03] 🚀 Code is available on GitHub. Note that this is a minimal code example to reproduce paper results. We plan to release a comprehensive toolbox for our framework soon. Stay tuned!
- [25-03-03] 👀 Release of arXiv paper and project website.
We introduce Spatial Reasoning Models (SRMs), a framework for reasoning over sets of continuous variables via denoising generative models. SRMs infer continuous representations for a set of unobserved variables, given observations of the observed variables. Current generative models on spatial domains, such as diffusion and flow matching models, often collapse into hallucination in the case of complex distributions. To measure this, we introduce a set of benchmark tasks that test the quality of complex reasoning in generative models and can quantify hallucination. The SRM framework allows us to report key findings about the importance of sequentialization in generation, the associated generation order, and the sampling strategies during training. It demonstrates, for the first time, that the order of generation can be successfully predicted by the denoising network itself. Using these findings, we increase the accuracy of specific reasoning tasks from <1% to >50%.
To get started, create a virtual environment using Python 3.12+:
python3.12 -m venv srm
source srm/bin/activate
pip install -r requirements.txt
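As a quick sanity check (a minimal sketch; it assumes PyTorch is installed via `requirements.txt`, which the fast-mode `torch.compile` setting mentioned below implies), you can verify that the environment finds a GPU:

```bash
# Run inside the activated "srm" environment after installing the requirements.
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```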
We provide the relevant files for the datasets as part of our releases here.
Please extract `datasets.zip` in the project root directory or modify the root path of the dataset config files in `config/dataset`.
For counting polygons on FFHQ background, please download FFHQ first and provide the path in `config/dataset/counting_polygons_ffhq.yaml`.
We provide checkpoints of all trained models in our releases here. Simply download all of them and extract them in the project root directory.
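A minimal sketch of unpacking the downloads (only `datasets.zip` is named in the instructions above; the checkpoint archive name and the exact YAML key for the FFHQ path are assumptions, so adapt them to the actual release files):

```bash
# Extract the datasets into the project root, matching the paths expected by
# the config files in config/dataset.
unzip datasets.zip -d .

# Hypothetical archive name: extract the downloaded checkpoint archive(s) the
# same way, so the trained models end up under the project root.
unzip checkpoints.zip -d .

# For the FFHQ-based counting task, open the dataset config and set your local
# FFHQ path (check the file for the exact key name):
#   config/dataset/counting_polygons_ffhq.yaml
```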
We have two different modes: debugging (running offline, with typechecking enabled at runtime) and fast training (including torch.compile and wandb logging) and sampling (typechecking deactivated). Use `debug_train.sh`/`debug_test.sh` for training/testing in debugging mode and `train.sh`/`test.sh` for training/testing without it.
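For example (a minimal sketch; `[experiment config name]` follows the same convention as in the training section below):

```bash
# Debugging mode: runs offline, with runtime typechecking enabled.
bash debug_train.sh [experiment config name]

# Fast mode: torch.compile and wandb logging, typechecking deactivated.
bash train.sh [experiment config name]

# The same pattern applies to evaluation: debug_test.sh vs. test.sh.
```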
Start training via `train.sh` like:

`bash train.sh [experiment config name] [optional experiment id] [optional hydra overrides]`

where
- experiment config name is the file name of the experiment config in `config/experiment` without extension,
- experiment id (datetime as default) is the optional id of a previous training run to resume (given in `outputs/[experiment config name]/[experiment id]`), and
- hydra overrides for individual hyperparameters can be specified as described here.
The training code will automatically run in distributed mode on all available GPUs, if there are multiple.
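A few illustrative invocations (a sketch only: the experiment id and the override key below are hypothetical placeholders; `ms1000_28` is the experiment config used in the evaluation example further down):

```bash
# Start a fresh training run; the experiment id defaults to the current datetime.
bash train.sh ms1000_28

# Resume a previous run stored in outputs/ms1000_28/2025-03-03_12-00-00.
bash train.sh ms1000_28 2025-03-03_12-00-00

# Additionally pass hydra overrides; "trainer.max_epochs" is a hypothetical key,
# use the hyperparameter names defined in this repository's configs.
bash train.sh ms1000_28 2025-03-03_12-00-00 trainer.max_epochs=100
```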
To run evaluation, use `test.sh` like:

`bash test.sh [experiment config name] [experiment id] [test config name] [optional hydra overrides]`

where all arguments are the same as for training except for test config name being the file name of the test config in `config/test` without extension. Note that the test script loads the checkpoint from `outputs/[experiment config name]/[experiment id]/checkpoints/last.ckpt`. Evaluation outputs are stored in `outputs/[experiment config name]/[experiment id]/test`.
For example, after downloading our datasets and checkpoints, run the following command for our best setup on the hard difficulty of the MNIST Sudoku dataset:
bash test.sh ms1000_28 paper ms_hard_seq_adaptive000
When using this code in your project, consider citing our work as follows:
@inproceedings{wewer25srm,
title = {Spatial Reasoning with Denoising Models},
author = {Wewer, Christopher and Pogodzinski, Bartlomiej and Schiele, Bernt and Lenssen, Jan Eric},
booktitle = {International Conference on Machine Learning ({ICML})},
year = {2025},
}
This project was partially funded by the Saarland/Intel Joint Program on the Future of Graphics and Media. We thank Thomas Wimmer for proofreading and helpful discussions.