Sihui Ji
·
Hao Luo
·
Xi Chen
·
Yuanpeng Tu
·
Yiyang Wang
·
Hengshuang Zhao
The University of Hong Kong | DAMO Academy, Alibaba Group | Hupan Lab
- [2025.06.17]: Release the inference code.
- [2025.06.04]: Release the project page and the arxiv paper.
- [2025.03.29]: LayerFlow is accepted by SIGGRAPH 2025 🎉🎉🎉.
TL;DR: We present LayerFlow, a unified solution for layer-aware video generation. Given per-layer prompts, LayerFlow generates videos for the transparent foreground, clean background, and blended scene. It also supports versatile variants like decomposing a blended video or generating the background for the given foreground and vice versa.
- Inference code
- Model checkpoints
- Training code
Begin by cloning the repository:
git clone https://github.com/SihuiJi/LayerFlow.git
cd LayerFlow
Our project is developed on the SAT-version code of CogVideoX. You can follow the CogVideoX instructions to install dependencies, or:
conda create -n layer python==3.10
conda activate layer
pip install torch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0
pip install -r requirements.txt
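As an optional sanity check (our addition, not part of the original setup), verify that PyTorch sees your GPU:
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"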
| Models | Download Link (RGB version) | Download Link (RGBA version) |
|---|---|---|
| Multi-layer generation | 🤗 Huggingface 🤖 ModelScope | 🤗 Huggingface 🤖 ModelScope |
| Multi-layer decomposition | 🤗 Huggingface 🤖 ModelScope | 🤗 Huggingface 🤖 ModelScope |
| Foreground-conditioned generation | 🤗 Huggingface 🤖 ModelScope | 🤗 Huggingface 🤖 ModelScope |
| Background-conditioned generation | 🤗 Huggingface 🤖 ModelScope | 🤗 Huggingface 🤖 ModelScope |
💡Note:
- All models are finetuned from CogVideoX-2B.
- The RGB version generates the foreground layer without an alpha-matte, while the RGBA version simultaneously generates foreground videos and their alpha-mattes, which can be combined into RGBA videos (see the sketch after this list). However, due to the difficulties of cross-domain generation and channel alignment, RGBA results are generally less stable than those of the RGB version.
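For illustration, here is a minimal sketch of merging a foreground video and its alpha-matte into one RGBA video with ffmpeg; the input file names are placeholders, not the actual outputs of the inference scripts:
# Placeholder file names; substitute the real foreground/matte outputs.
# alphamerge uses the second input's luma as the alpha channel, and the
# qtrle (QuickTime Animation) codec preserves that alpha channel on disk.
ffmpeg -i foreground_rgb.mp4 -i alpha_matte.mp4 \
  -filter_complex "[0:v][1:v]alphamerge[out]" \
  -map "[out]" -c:v qtrle foreground_rgba.mov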
Download models using huggingface-cli:
pip install "huggingface_hub[cli]"
huggingface-cli download zjuJish/LayerFlow --local-dir ./sat/ckpts_2b_lora
or using git:
git lfs install
git clone https://huggingface.co/zjuJish/LayerFlow
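If you use the git route, the checkpoints are cloned into a LayerFlow/ folder; to match the path used by the huggingface-cli command above, presumably they should be moved to ./sat/ckpts_2b_lora (our assumption; adjust to your layout):
# Assumed target path, mirroring the --local-dir of the huggingface-cli command.
mv LayerFlow sat/ckpts_2b_lora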
Download the pretrained VAE of the CogVideoX-2B model as follows:
mkdir CogVideoX-2b-sat
cd CogVideoX-2b-sat
wget https://cloud.tsinghua.edu.cn/f/fdba7608a49c463ba754/?dl=1
mv 'index.html?dl=1' vae.zip
unzip vae.zip
cd ..
Since model weight files are large, it is recommended to use git lfs; see the git lfs documentation for installation.
git lfs install
Next, clone the T5 model, which is used as an encoder and doesn’t require training or fine-tuning.
git clone https://huggingface.co/THUDM/CogVideoX-2b.git # Download model from Huggingface
# git clone https://www.modelscope.cn/ZhipuAI/CogVideoX-2b.git # Download from Modelscope
mkdir CogVideoX-2b-sat/t5-v1_1-xxl
mv CogVideoX-2b/text_encoder/* CogVideoX-2b/tokenizer/* CogVideoX-2b-sat/t5-v1_1-xxl
You may also use the corresponding model files from ModelScope.
Arrange the above model files in the following structure:
CogVideoX-2b-sat
│
├── t5-v1_1-xxl
│ ├── added_tokens.json
│ ├── config.json
│ ├── model-00001-of-00002.safetensors
│ ├── model-00002-of-00002.safetensors
│ ├── model.safetensors.index.json
│ ├── special_tokens_map.json
│ ├── spiece.model
│ └── tokenizer_config.json
└── vae
└── 3d-vae.pt
sat
│
├── ckpts_2b_lora
│   ├── multi-layer-generation
│   │   ├── 1000
│   │   │   └── mp_rank_00_model_states.pt
│   │   └── latest
│   ├── multi-layer-decomposition
│   │   ├── 1000
│   │   │   └── mp_rank_00_model_states.pt
│   │   └── latest
│   ├── foreground-conditioned-generation
│   │   ├── 1000
│   │   │   └── mp_rank_00_model_states.pt
│   │   └── latest
│   └── background-conditioned-generation
│       ├── 1000
│       │   └── mp_rank_00_model_states.pt
│       └── latest
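A quick way to confirm the layout matches the trees above (a sketch; adjust paths if your setup differs):
# Should show the t5-v1_1-xxl files and vae/3d-vae.pt listed above.
ls CogVideoX-2b-sat/t5-v1_1-xxl CogVideoX-2b-sat/vae
# Should show the four task checkpoint folders.
ls sat/ckpts_2b_lora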
cd sat
Run multi-layer generation (RGB version)
bash 'inference_stage2_gen_rgb.sh'
Run multi-layer generation (RGBA version)
bash 'inference_stage2_gen_rgba.sh'
Run multi-layer decomposition (RGB version)
bash 'inference_stage2_seg_rgb.sh'
Run multi-layer decomposition (RGBA version)
bash 'inference_stage2_seg_rgba.sh'
Run foreground-conditioned generation (RGB version)
bash 'inference_stage2_fg2bg_rgb.sh'
Run foreground-conditioned generation (RGBA version)
bash 'inference_stage2_fg2bg_rgba.sh'
Run background-conditioned generation (RGB version)
bash 'inference_stage2_bg2fg_rgb.sh'
Run background-conditioned generation (RGBA version)
bash 'inference_stage2_bg2fg_rgba.sh'
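Since the eight scripts share a common naming scheme, a batch run over, e.g., all RGB-version tasks can be sketched as follows (assuming the scripts can run back-to-back on one machine):
# Task keys taken from the script names above: gen, seg, fg2bg, bg2fg.
for task in gen seg fg2bg bg2fg; do
  bash "inference_stage2_${task}_rgb.sh"
done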
This project is developed on the codebase of CogVideoX. We appreciate this great work!
Please leave us a star 🌟 and cite our paper if you find our work helpful.
@article{ji2025layerflow,
title={LayerFlow: A Unified Model for Layer-aware Video Generation},
author={Ji, Sihui and Luo, Hao and Chen, Xi and Tu, Yuanpeng and Wang, Yiyang and Zhao, Hengshuang},
year={2025},
journal={arXiv preprint arXiv:2506.04228},
}