Important: We have made the weights and code for STAR available in a new repository. Click here to access it!
- [2025-02] We have released the official codebase and weights on Hugging Face!
- [2024-06] STAR Technical Report is released.
STAR, the first scale-wise text-to-image model based on VAR, supports resolutions from 256×256 to 1024×1024.
By incorporating text conditioning, normalized 2D RoPE, and causal-driven stable sampling, STAR outperforms existing models in fidelity, consistency, and quality, while generating a 1024×1024 image in 2.21 s on an A100.
CLICK for Detailed Introduction & Architecture
Unlike VAR, which focuses on toy category-conditioned auto-regressive generation at 256×256, STAR explores the potential of the scale-wise auto-regressive paradigm in real-world scenarios, aiming to make AR as effective as diffusion models. To achieve this, we:
+ replace the single category token with a text encoder and cross-attention for detailed text guidance;
+ introduce cross-scale normalized RoPE to stabilize structural learning and reduce training costs, unleashing the potential for high-resolution training;
+ propose a new sampling method to overcome the intrinsic simultaneous sampling issue in AR models.

While these approaches have been (partially) explored for diffusion models, we are the first to validate and apply them in auto-regressive image generation, enabling high-resolution, text-conditioned synthesis with performance comparable to Stable Diffusion 2.
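To illustrate the cross-scale normalization idea, here is a minimal NumPy sketch of a 2D RoPE whose grid coordinates are normalized by each scale's resolution, so tokens at different scales share a consistent coordinate frame. This is an illustrative sketch only (the function name, layout, and frequency schedule are our assumptions, not the official STAR implementation):

```python
import numpy as np

def normalized_2d_rope(x, h, w, base=10000.0):
    """Apply 2D rotary position embedding with positions normalized to [0, 1).

    Normalizing by (h, w) keeps the coordinate range identical across scales,
    which is the intuition behind cross-scale normalized RoPE.
    x: array of shape (batch, h*w, dim), dim divisible by 4
       (half the channels encode the row position, half the column position).
    NOTE: hypothetical sketch, not the official STAR code.
    """
    b, n, d = x.shape
    assert n == h * w and d % 4 == 0
    quarter = d // 4
    # Normalized grid coordinates: same range regardless of resolution.
    ys = np.arange(h) / h
    xs = np.arange(w) / w
    grid_y, grid_x = np.meshgrid(ys, xs, indexing="ij")          # (h, w)
    pos_y, pos_x = grid_y.reshape(-1), grid_x.reshape(-1)        # (n,)
    # Standard RoPE inverse-frequency schedule.
    inv_freq = 1.0 / base ** (np.arange(quarter) / quarter)      # (quarter,)
    ang = np.concatenate(
        [pos_y[:, None] * inv_freq, pos_x[:, None] * inv_freq], axis=-1
    )                                                            # (n, d//2)
    cos, sin = np.cos(ang), np.sin(ang)
    # Rotate-half formulation of RoPE.
    x1, x2 = x[..., : d // 2], x[..., d // 2:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)
```

Because the rotation is norm-preserving and the normalized coordinates are resolution-independent, a token at the center of an 8×8 grid receives the same positional phase as one at the center of a 32×32 grid, which is what stabilizes structure across scales.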
Per-category FID on MJHQ-30K
Efficiency & CLIP-Score of 1024×1024 generation
See the repo for details.
Thanks to the developers of Visual Autoregressive Modeling for their excellent work. Our code is adapted from VAR. If our work assists your research, feel free to give us a star ⭐ or cite us using:
```bibtex
@article{ma2024star,
  title={STAR: Scale-wise Text-conditioned AutoRegressive image generation},
  author={Xiaoxiao Ma and Mohan Zhou and Tao Liang and Yalong Bai and Tiejun Zhao and Biye Li and Huaian Chen and Yi Jin},
  journal={arXiv preprint arXiv:2406.10797},
  year={2024}
}
```