Ziyu Yao1,2, Jialin Li2, Yifeng Zhou2, Yong Liu2, Xi Jiang2,3, Chengjie Wang2, Feng Zheng3, Yuexian Zou1, Lei Li4
1 Peking University, 2 Tencent Youtu Lab, 3 Southern University of Science and Technology, 4 University of Washington
We have released the CAR-d16 weights for demo purposes. Larger models will be made available with future upgrades and extensions of CAR.
The CAR models are available on Hugging Face and can also be downloaded from the following links:
Model | Resolution | Condition | HF weights 🤗 |
---|---|---|---|
CAR-d16 | 256 | Canny Edge | car_canny_d16.pth |
CAR-d16 | 256 | HED Map | car_hed_d16.pth |
CAR-d16 | 256 | Depth Map | car_depth_d16.pth |
CAR-d16 | 256 | Normal Map | car_normal_d16.pth |
CAR-d16 | 256 | Sketch | car_sketch_d16.pth |
As CAR is based on the pre-trained VAR model, the following pre-trained weights also need to be downloaded: vae_ch160v4096z32.pth, var_d16.pth.
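Before running training or inference, it can help to confirm that all three checkpoints needed for a given condition are in place. The following is a minimal sketch, assuming every file is stored in a single local directory under the names listed above; the helper names `required_checkpoints` and `missing_checkpoints` are hypothetical and not part of the CAR code base.

```python
from pathlib import Path
from typing import List

# File names as listed in the table and in the VAR prerequisites above.
CONDITIONS = ("canny", "hed", "depth", "normal", "sketch")
BASE_WEIGHTS = ("vae_ch160v4096z32.pth", "var_d16.pth")


def required_checkpoints(condition: str) -> List[str]:
    """Return the checkpoint file names needed for one condition type."""
    if condition not in CONDITIONS:
        raise ValueError(f"unknown condition {condition!r}, expected one of {CONDITIONS}")
    return [*BASE_WEIGHTS, f"car_{condition}_d16.pth"]


def missing_checkpoints(ckpt_dir: str, condition: str) -> List[str]:
    """Return the required file names that are not present in ckpt_dir."""
    root = Path(ckpt_dir)  # hypothetical layout: all weights in one folder
    return [name for name in required_checkpoints(condition) if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_checkpoints("./checkpoints", "hed")
    print("all checkpoints found" if not missing else f"missing: {missing}")
```

An empty list from `missing_checkpoints` means the VAE, VAR, and CAR weights for that condition were all found.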
The --data_path argument should point to the ImageNet dataset.
You can extract conditions from all 1000 ImageNet categories or from a subset of them. Run the following commands:
# canny
python extract_canny.py
# hed
python extract_hed.py
# depth
python extract_depth.py
# normal
python extract_normal.py
# sketch
python extract_sketch.py
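If you want several condition types at once, the five extraction commands above can be driven from a small wrapper. This is only a sketch: the script names come from the commands above, while `extraction_commands`, `run_extraction`, and the `dry_run` switch are hypothetical helpers. No extra flags are passed because the commands above show none; add whatever arguments your copy of the scripts expects.

```python
import subprocess
import sys
from typing import Iterable, List, Optional

# Script names exactly as used in the commands above.
EXTRACTION_SCRIPTS = {
    "canny": "extract_canny.py",
    "hed": "extract_hed.py",
    "depth": "extract_depth.py",
    "normal": "extract_normal.py",
    "sketch": "extract_sketch.py",
}


def extraction_commands(conditions: Optional[Iterable[str]] = None) -> List[List[str]]:
    """Build one command per requested condition (all five when None)."""
    selected = list(EXTRACTION_SCRIPTS) if conditions is None else list(conditions)
    unknown = [c for c in selected if c not in EXTRACTION_SCRIPTS]
    if unknown:
        raise ValueError(f"unknown condition(s): {unknown}")
    return [[sys.executable, EXTRACTION_SCRIPTS[c]] for c in selected]


def run_extraction(conditions: Optional[Iterable[str]] = None, dry_run: bool = True) -> None:
    """Print each command and, unless dry_run is set, execute it in order."""
    for cmd in extraction_commands(conditions):
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)  # stop at the first failing script


if __name__ == "__main__":
    run_extraction(["canny", "hed"], dry_run=True)
```

Running the scripts one after another keeps failures easy to attribute; `check=True` aborts the loop as soon as one extraction fails.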
# d16, 256x256
torchrun --nproc_per_node=8 --nnodes=... --node_rank=... --master_addr=... --master_port=... train.py \
--data_path=/path/to/imagenet --condition_path=/path/to/condition/extract/above \
--vae_ckpt=/path/to/pretrained/vae/ckpt --pretrained_var_ckpt=/path/to/pretrained/var/ckpt \
--tblr=0.0001 --depth=16 --bs=768 --ep=200 --fp16=1 --alng=1e-3 --wpe=0.1
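To get a feel for what the training flags imply, the arithmetic below may be useful. It rests on two assumptions that this README does not confirm: that --bs is the global batch size shared across all GPUs, and that CAR follows the VAR convention of scaling the base learning rate linearly with batch size relative to 256. The function names and the node count used in the example are hypothetical.

```python
def peak_learning_rate(tblr: float, global_batch_size: int, reference_batch_size: int = 256) -> float:
    """Linear scaling rule (assumed, as in VAR): lr = tblr * bs / 256."""
    return tblr * global_batch_size / reference_batch_size


def per_gpu_batch_size(global_batch_size: int, nproc_per_node: int, nnodes: int) -> int:
    """Split an assumed global batch evenly across all processes."""
    world_size = nproc_per_node * nnodes
    if global_batch_size % world_size != 0:
        raise ValueError(f"batch size {global_batch_size} is not divisible by {world_size} processes")
    return global_batch_size // world_size


if __name__ == "__main__":
    # --tblr=0.0001 and --bs=768 come from the command above;
    # nnodes=1 is an illustrative value, the command leaves it unspecified.
    print(f"peak lr ~ {peak_learning_rate(0.0001, 768):.1e}")
    print(f"per-GPU batch = {per_gpu_batch_size(768, nproc_per_node=8, nnodes=1)}")
```

Under these assumptions, changing --bs without adjusting --tblr changes the effective learning rate, so check the argument definitions in train.py before altering either flag.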
# cls is an index ranging from 0 to 999 in the ImageNet label set
# type indicates which condition is extracted from the original image (canny, hed, depth, normal, sketch)
python inference.py --vae_ckpt=/path/to/pretrained/vae/ckpt --var_ckpt=/path/to/pretrained/var/ckpt \
--car_ckpt=/path/to/car/ckpt --img_path=/path/to/original/image/to/extract/condition \
--save_path=/path/to/save/image --cls=3 --type=hed
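When generating many images, it is convenient to assemble the inference command programmatically and reject invalid arguments early. The sketch below uses only the flags shown in the command above; the function `build_inference_command` and its parameter names are hypothetical, and the range and type checks simply restate the two comments above.

```python
from typing import List

# Condition types named in the comment above.
CONDITION_TYPES = ("canny", "hed", "depth", "normal", "sketch")


def build_inference_command(
    vae_ckpt: str,
    var_ckpt: str,
    car_ckpt: str,
    img_path: str,
    save_path: str,
    cls: int,
    cond_type: str,
) -> List[str]:
    """Return the inference.py invocation as an argument list."""
    if not 0 <= cls <= 999:
        raise ValueError(f"cls must be an ImageNet index in [0, 999], got {cls}")
    if cond_type not in CONDITION_TYPES:
        raise ValueError(f"type must be one of {CONDITION_TYPES}, got {cond_type!r}")
    return [
        "python", "inference.py",
        f"--vae_ckpt={vae_ckpt}",
        f"--var_ckpt={var_ckpt}",
        f"--car_ckpt={car_ckpt}",
        f"--img_path={img_path}",
        f"--save_path={save_path}",
        f"--cls={cls}",
        f"--type={cond_type}",
    ]


if __name__ == "__main__":
    cmd = build_inference_command(
        "/path/to/pretrained/vae/ckpt", "/path/to/pretrained/var/ckpt",
        "/path/to/car/ckpt", "/path/to/original/image/to/extract/condition",
        "/path/to/save/image", cls=3, cond_type="hed",
    )
    print(" ".join(cmd))
```

The returned list can be passed directly to `subprocess.run(cmd, check=True)`. Make sure the CAR checkpoint matches the chosen condition type, for example car_hed_d16.pth with --type=hed.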
The development of CAR is based on VAR. We deeply appreciate this significant contribution to the community.
If you find our work helpful in your research, we would be grateful if you could consider giving us a star ⭐ or citing it using:
@article{yao2024car,
title={{CAR}: Controllable Autoregressive Modeling for Visual Generation},
author={Yao, Ziyu and Li, Jialin and Zhou, Yifeng and Liu, Yong and Jiang, Xi and Wang, Chengjie and Zheng, Feng and Zou, Yuexian and Li, Lei},
journal={arXiv preprint arXiv:2410.04671},
year={2024}
}