This repository is the official implementation of Chain-of-Zoom: Extreme Super-Resolution via Scale Autoregression and Preference Alignment, led by
Bryan Sangwoo Kim, Jeongsol Kim, and Jong Chul Ye
Modern single-image super-resolution (SISR) models deliver photo-realistic results at the scale factors on which they are trained, but show notable drawbacks:
- Blur and artifacts when pushed to magnify beyond their training regime
- High computational cost and inefficiency of retraining models whenever further magnification is needed
This brings us to the fundamental question:
How can we effectively utilize super-resolution models to explore much higher resolutions than they were originally trained for?
We address this via Chain-of-Zoom 🔎, a model-agnostic framework that factorizes SISR into an autoregressive chain of intermediate scale-states with multi-scale-aware prompts. CoZ repeatedly reuses a backbone SR model, decomposing the conditional probability into tractable sub-problems to achieve extreme resolutions without additional training. Because visual cues diminish at high magnifications, we augment each zoom step with multi-scale-aware text prompts generated by a prompt extractor VLM. This prompt extractor can be fine-tuned through GRPO with a critic VLM to further align text guidance towards human preference.
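Conceptually, each zoom step crops into the current image, extracts a multi-scale-aware prompt, and re-applies the same SR backbone. The sketch below illustrates this recursive loop in Python; `sr_model`, `extract_prompt`, and `center_crop` are hypothetical placeholders for illustration only, not the actual APIs of this repository.

```python
# Minimal sketch of the scale-autoregressive zoom loop.
# `sr_model` and `extract_prompt` are hypothetical placeholders and
# do not correspond to actual function names in this repository.
from PIL import Image

def center_crop(img: Image.Image, factor: int = 4) -> Image.Image:
    """Crop the central 1/factor region of the image."""
    w, h = img.size
    cw, ch = w // factor, h // factor
    left, top = (w - cw) // 2, (h - ch) // 2
    return img.crop((left, top, left + cw, top + ch))

def chain_of_zoom(img: Image.Image, sr_model, extract_prompt, steps: int = 4):
    """Repeatedly crop and re-apply the same frozen SR backbone,
    guided at each step by a multi-scale-aware text prompt."""
    states = [img]
    for _ in range(steps):
        patch = center_crop(states[-1])         # zoom into the center region
        prompt = extract_prompt(patch, states)  # VLM prompt from current + earlier scales
        patch = patch.resize(img.size)          # upsample back to working resolution
        states.append(sr_model(patch, prompt))  # one SR step of the chain
    return states
```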
- [May 2025] Code and paper are released.
First, create your environment. We recommend using the following commands.
```shell
git clone https://github.com/bryanswkim/Chain-of-Zoom.git
cd Chain-of-Zoom

conda create -n coz python=3.10
conda activate coz

pip install -r requirements.txt
```
| Models | Checkpoints |
|---|---|
| Stable Diffusion v3 | Hugging Face |
| Qwen2.5-VL-3B-Instruct | Hugging Face |
| RAM | Hugging Face |
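Place the downloaded weights so that the paths you pass to the scripts resolve correctly. The layout below is inferred from the checkpoint paths used in the inference command further down; it is a suggestion, not a requirement.

```
ckpt/
├── SR_LoRA/model_20001.pkl
├── SR_VAE/vae_encoder_20001.pt
├── DAPE/DAPE.pth
└── RAM/ram_swin_large_14m.pth
```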
You can quickly check the results of using CoZ with the following example:
```shell
python inference_coz.py \
    -i samples \
    -o inference_results/coz_vlmprompt \
    --rec_type recursive_multiscale \
    --prompt_type vlm \
    --lora_path ckpt/SR_LoRA/model_20001.pkl \
    --vae_path ckpt/SR_VAE/vae_encoder_20001.pt \
    --pretrained_model_name_or_path 'stabilityai/stable-diffusion-3-medium-diffusers' \
    --ram_ft_path ckpt/DAPE/DAPE.pth \
    --ram_path ckpt/RAM/ram_swin_large_14m.pth
```
This will produce results like the example below:
Using `--efficient_memory` allows CoZ to run on a single GPU with 24GB VRAM, but significantly increases inference time due to model offloading. We recommend using two GPUs.
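For example, the single-GPU variant of the quick-start command would simply append the flag, assuming all other arguments stay unchanged:

```shell
python inference_coz.py \
    -i samples \
    -o inference_results/coz_vlmprompt \
    --rec_type recursive_multiscale \
    --prompt_type vlm \
    --efficient_memory \
    --lora_path ckpt/SR_LoRA/model_20001.pkl \
    --vae_path ckpt/SR_VAE/vae_encoder_20001.pt \
    --pretrained_model_name_or_path 'stabilityai/stable-diffusion-3-medium-diffusers' \
    --ram_ft_path ckpt/DAPE/DAPE.pth \
    --ram_path ckpt/RAM/ram_swin_large_14m.pth
```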
If you find our method useful, please cite as below or leave a star on this repository.
```bibtex
@article{kim2025chain,
  title={Chain-of-Zoom: Extreme Super-Resolution via Scale Autoregression and Preference Alignment},
  author={Kim, Bryan Sangwoo and Kim, Jeongsol and Ye, Jong Chul},
  journal={arXiv preprint arXiv:2505.18600},
  year={2025}
}
```
We thank the authors of OSEDiff for sharing their awesome work!
> [!NOTE]
> This work is currently in the preprint stage, and there may be some changes to the code.