This is the official implementation of the paper VGDFR: Diffusion-based Video Generation with Dynamic Latent Frame Rate. We exploit the inherent temporal non-uniformity of real-world videos and observe that videos exhibit dynamic information density, with high-motion segments demanding greater detail preservation than static scenes. We propose VGDFR, a training-free approach for diffusion-based video generation with dynamic latent frame rate. VGDFR adaptively adjusts the number of elements in the latent space based on the motion frequency of the latent content, using fewer tokens for low-frequency segments while preserving detail in high-frequency segments. Experiments show that VGDFR achieves a speedup of up to 3x for video generation.
| | A cute girl with red hair. Realistic, Natural lighting, Casual | A fluffy dog with a joyful expression bounds through a snowy landscape under a soft blue sky. Snowflakes gently fall as the dog rolls, pounces into snowdrifts, and digs energetically. Occasionally, the dog pauses, wagging its tail and looking back at the camera, inviting you to play. The surroundings feature snow-covered trees, frosted bushes, and a serene winter backdrop. The video is lighthearted, with soft, playful background music enhancing the happy and lively atmosphere. Realistic, Natural lighting |
|---|---|---|
| Raw | raw_seed3_A.cute.girl.with.red.hair.Realistic.Natural.lighting.Casual.mp4 | raw_seed3_A.fluffy.dog.with.a.joyful.expression.bounds.through.a.snowy.landscape.under.a.soft.blue.sky.Snowfl.mp4 |
| 90% tokens | 0.9_A.cute.girl.with.red.hair.Realistic.Natural.lighting.Casual.mp4 | 0.9_A.fluffy.dog.with.a.joyful.expression.bounds.through.a.snowy.landscape.under.a.soft.blue.sky.Snowfl.mp4 |
| 80% tokens | 0.8_A.cute.girl.with.red.hair.Realistic.Natural.lighting.Casual.mp4 | 0.8_A.fluffy.dog.with.a.joyful.expression.bounds.through.a.snowy.landscape.under.a.soft.blue.sky.Snowfl.mp4 |
| 70% tokens | 0.7_A.cute.girl.with.red.hair.Realistic.Natural.lighting.Casual.mp4 | 0.7_A.fluffy.dog.with.a.joyful.expression.bounds.through.a.snowy.landscape.under.a.soft.blue.sky.Snowfl.mp4 |
| 60% tokens | 0.6_A.cute.girl.with.red.hair.Realistic.Natural.lighting.Casual.mp4 | 0.6_A.fluffy.dog.with.a.joyful.expression.bounds.through.a.snowy.landscape.under.a.soft.blue.sky.Snowfl.mp4 |
```bash
# 0. Install miniconda3
# 1. Create conda environment
conda create -n vgdfr python==3.10
# 2. Activate the environment
conda activate vgdfr
# 3. Install PyTorch and other dependencies using conda
# For CUDA 11.8
conda install pytorch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 pytorch-cuda=11.8 -c pytorch -c nvidia
# For CUDA 12.4
conda install pytorch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 pytorch-cuda=12.4 -c pytorch -c nvidia
# 4. Install pip dependencies
python -m pip install -r requirements.txt
# 5. Install flash attention for acceleration
conda install cuda-nvcc
python -m pip install ninja
python -m pip install git+https://github.com/Dao-AILab/flash-attention.git@v2.7.4
# 6. Install xDiT for parallel inference (It is recommended to use torch 2.4.0 and flash-attn 2.6.3)
python -m pip install xfuser==0.4.3
# 7. Download flownet.pkl to data directory
wget https://github.com/thu-nics/VGDFR/releases/download/v0.1/flownet.pkl -P data/
# 8. Download the pre-trained HunyuanVideo model to the ckpts directory (refer to https://github.com/Tencent/HunyuanVideo/tree/main/ckpts)
huggingface-cli download tencent/HunyuanVideo --local-dir ./ckpts
cd ckpts
huggingface-cli download xtuner/llava-llama-3-8b-v1_1-transformers --local-dir ./llava-llama-3-8b-v1_1-transformers
huggingface-cli download openai/clip-vit-large-patch14 --local-dir ./text_encoder_2
cd ../
python hyvideo/utils/preprocess_text_encoder_tokenizer_utils.py --input_dir ckpts/llava-llama-3-8b-v1_1-transformers --output_dir ckpts/text_encoder
```
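As an optional sanity check after installation (a minimal sketch, not part of the official setup), you can verify that PyTorch sees the GPU and that flash attention imports cleanly:

```python
# Quick environment check. Run inside the activated "vgdfr" conda environment.
import torch

print(torch.__version__)          # expected: 2.4.0
print(torch.cuda.is_available())  # should be True on a CUDA machine

try:
    import flash_attn
    print("flash-attn:", flash_attn.__version__)
except ImportError:
    print("flash-attn is not installed; attention acceleration will be unavailable")
```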
An IPython notebook example is provided at experiments/example.ipynb; you can run it in Jupyter Notebook. The following is a simple example of how to use VGDFR to generate videos.
```python
from VGDFR.hunyuan_vgdfr import VGDFRHunyuanVideoSampler

# Load the pre-trained HunyuanVideo model with the VGDFR sampler.
# models_root_path and args come from your own argument parsing,
# as in the standard HunyuanVideo inference setup.
hunyuan_video_sampler = VGDFRHunyuanVideoSampler.from_pretrained(models_root_path, args=args)

# Configure VGDFR: keep 80% of the tokens, starting compression after the first 5 denoising steps.
hunyuan_video_sampler.pipeline.schedule_mode = "keep_token_ratio"
hunyuan_video_sampler.pipeline.before_compression_steps = 5
hunyuan_video_sampler.pipeline.keep_token_ratio = 0.8

samples = hunyuan_video_sampler.predict(
    prompt=prompt,
    height=height,
    width=width,
    video_length=video_length,
    seed=seed,
    negative_prompt=args.neg_prompt,
    infer_steps=args.infer_steps,
    guidance_scale=args.cfg_scale,
    num_videos_per_prompt=args.num_videos,
    flow_shift=args.flow_shift,
    batch_size=args.batch_size,
    embedded_guidance_scale=args.embedded_cfg_scale,
)["samples"]
```
We provide two schedule modes: `keep_token_ratio` and `similarity_threshold`. The `keep_token_ratio` mode controls the token compression ratio during generation, while the `similarity_threshold` mode sets the minimum similarity required to compress adjacent frames. The `before_compression_steps` parameter sets the number of denoising steps to run before compression starts.
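For comparison, switching to the similarity-based schedule might look like the sketch below; the `similarity_threshold` attribute name and the 0.9 value are illustrative assumptions, not values taken from the paper:

```python
# Hypothetical configuration of the similarity-based schedule
# (attribute name and threshold value are assumed for illustration).
hunyuan_video_sampler.pipeline.schedule_mode = "similarity_threshold"
hunyuan_video_sampler.pipeline.similarity_threshold = 0.9    # merge adjacent frames only if at least this similar
hunyuan_video_sampler.pipeline.before_compression_steps = 5  # denoising steps to run before compression starts
```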
This code is built upon the following open-source projects:
If you find this code useful in your research, please consider citing our paper:
```bibtex
@misc{yuan2025vgdfrdiffusionbasedvideogeneration,
      title={VGDFR: Diffusion-based Video Generation with Dynamic Latent Frame Rate},
      author={Zhihang Yuan and Rui Xie and Yuzhang Shang and Hanling Zhang and Siyuan Wang and Shengen Yan and Guohao Dai and Yu Wang},
      year={2025},
      eprint={2504.12259},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2504.12259},
}
```