VGDFR: Diffusion-based Video Generation with Dynamic Frame Rate

This is the official implementation of the paper VGDFR: Diffusion-based Video Generation with Dynamic Latent Frame Rate. We exploit the inherent temporal non-uniformity of real-world videos: videos exhibit dynamic information density, with high-motion segments demanding greater detail preservation than static scenes. We propose VGDFR, a training-free approach for diffusion-based video generation with a dynamic latent frame rate. VGDFR adaptively adjusts the number of elements in the latent space based on the motion frequency of the latent content, using fewer tokens for low-frequency segments while preserving detail in high-frequency segments. Experiments show that VGDFR achieves a speedup of up to 3× for video generation.
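To make the intuition concrete, here is a toy sketch (a minimal illustration of the idea, not the repository's actual algorithm) that merges adjacent latent frames whose cosine similarity is high, so low-motion segments end up represented by fewer latent frames than high-motion ones:

```python
import torch

def compress_latents(latents: torch.Tensor, threshold: float) -> torch.Tensor:
    """Toy illustration of dynamic latent frame rate.

    latents: [T, C, H, W] latent video. Adjacent frames with cosine
    similarity above `threshold` (low motion) are merged; dissimilar
    frames (high motion) are kept at full temporal resolution.
    """
    kept = [latents[0]]
    for t in range(1, latents.shape[0]):
        prev, cur = kept[-1].flatten(), latents[t].flatten()
        sim = torch.nn.functional.cosine_similarity(prev, cur, dim=0)
        if sim > threshold:
            # Low-motion segment: fold this frame into the previous one
            kept[-1] = (kept[-1] + latents[t]) / 2
        else:
            # High-motion segment: keep the frame as-is
            kept.append(latents[t])
    return torch.stack(kept)
```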

🎥 Demo Videos

Prompt 1: "A cute girl with red hair. Realistic, Natural lighting, Casual"

Prompt 2: "A fluffy dog with a joyful expression bounds through a snowy landscape under a soft blue sky. Snowflakes gently fall as the dog rolls, pounces into snowdrifts, and digs energetically. Occasionally, the dog pauses, wagging its tail and looking back at the camera, inviting you to play. The surroundings feature snow-covered trees, frosted bushes, and a serene winter backdrop. The video is lighthearted, with soft, playful background music enhancing the happy and lively atmosphere. Realistic, Natural lighting"

| Token budget | Prompt 1 | Prompt 2 |
| --- | --- | --- |
| Raw | raw_seed3_A.cute.girl.with.red.hair.Realistic.Natural.lighting.Casual.mp4 | raw_seed3_A.fluffy.dog.with.a.joyful.expression.bounds.through.a.snowy.landscape.under.a.soft.blue.sky.Snowfl.mp4 |
| 90% tokens | 0.9_A.cute.girl.with.red.hair.Realistic.Natural.lighting.Casual.mp4 | 0.9_A.fluffy.dog.with.a.joyful.expression.bounds.through.a.snowy.landscape.under.a.soft.blue.sky.Snowfl.mp4 |
| 80% tokens | 0.8_A.cute.girl.with.red.hair.Realistic.Natural.lighting.Casual.mp4 | 0.8_A.fluffy.dog.with.a.joyful.expression.bounds.through.a.snowy.landscape.under.a.soft.blue.sky.Snowfl.mp4 |
| 70% tokens | 0.7_A.cute.girl.with.red.hair.Realistic.Natural.lighting.Casual.mp4 | 0.7_A.fluffy.dog.with.a.joyful.expression.bounds.through.a.snowy.landscape.under.a.soft.blue.sky.Snowfl.mp4 |
| 60% tokens | 0.6_A.cute.girl.with.red.hair.Realistic.Natural.lighting.Casual.mp4 | 0.6_A.fluffy.dog.with.a.joyful.expression.bounds.through.a.snowy.landscape.under.a.soft.blue.sky.Snowfl.mp4 |

Installation

```bash
# 0. Install miniconda3

# 1. Create conda environment
conda create -n vgdfr python=3.10

# 2. Activate the environment
conda activate vgdfr

# 3. Install PyTorch and other dependencies using conda
# For CUDA 11.8
conda install pytorch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 pytorch-cuda=11.8 -c pytorch -c nvidia
# For CUDA 12.4
conda install pytorch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 pytorch-cuda=12.4 -c pytorch -c nvidia

# 4. Install pip dependencies
python -m pip install -r requirements.txt

# 5. Install flash attention for acceleration
conda install cuda-nvcc
python -m pip install ninja
python -m pip install git+https://github.com/Dao-AILab/flash-attention.git@v2.7.4

# 6. Install xDiT for parallel inference (xDiT recommends torch 2.4.0 and flash-attn 2.6.3)
python -m pip install xfuser==0.4.3

# 7. Download flownet.pkl to the data directory
wget https://github.com/thu-nics/VGDFR/releases/download/v0.1/flownet.pkl -P data/

# 8. Download the pre-trained HunyuanVideo model to the ckpts directory (see https://github.com/Tencent/HunyuanVideo/tree/main/ckpts)
huggingface-cli download tencent/HunyuanVideo --local-dir ./ckpts
cd ckpts
huggingface-cli download xtuner/llava-llama-3-8b-v1_1-transformers --local-dir ./llava-llama-3-8b-v1_1-transformers
huggingface-cli download openai/clip-vit-large-patch14 --local-dir ./text_encoder_2
cd ../
python hyvideo/utils/preprocess_text_encoder_tokenizer_utils.py --input_dir ckpts/llava-llama-3-8b-v1_1-transformers --output_dir ckpts/text_encoder
```
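An optional quick sanity check of the environment (assuming recent releases of these packages, which expose `__version__`):

```python
import torch
import flash_attn
import xfuser

print(torch.__version__)           # expect 2.4.0
print(torch.cuda.is_available())   # expect True
print(flash_attn.__version__)      # expect 2.7.4
print(xfuser.__version__)          # expect 0.4.3
```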

Generation with VGDFR

There is an IPython notebook example at experiments/example.ipynb. You can run it in Jupyter Notebook.

The following is a simple example of how to use VGDFR to generate videos.

```python
from VGDFR.hunyuan_vgdfr import VGDFRHunyuanVideoSampler

# Load the HunyuanVideo checkpoints and wrap them with the VGDFR sampler
hunyuan_video_sampler = VGDFRHunyuanVideoSampler.from_pretrained(models_root_path, args=args)

# Run 5 denoising steps at the full frame rate, then compress to keep 80% of the tokens
hunyuan_video_sampler.pipeline.schedule_mode = "keep_token_ratio"
hunyuan_video_sampler.pipeline.before_compression_steps = 5
hunyuan_video_sampler.pipeline.keep_token_ratio = 0.8

samples = hunyuan_video_sampler.predict(
    prompt=prompt,
    height=height,
    width=width,
    video_length=video_length,
    seed=seed,
    negative_prompt=args.neg_prompt,
    infer_steps=args.infer_steps,
    guidance_scale=args.cfg_scale,
    num_videos_per_prompt=args.num_videos,
    flow_shift=args.flow_shift,
    batch_size=args.batch_size,
    embedded_guidance_scale=args.embedded_cfg_scale,
)["samples"]
```

We provide two schedule methods: keep_token_ratio and similarity_threshold. The keep_token_ratio method controls the overall token compression ratio during generation, while the similarity_threshold method sets the minimum similarity at which adjacent frames are merged.

The before_compression_steps parameter sets the number of denoising steps to run at the full frame rate before compression starts.
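For the similarity-based mode, the configuration would look like the sketch below; the similarity_threshold attribute name and value are assumptions inferred from the mode name, so check the pipeline code for the exact field:

```python
# Merge adjacent latent frames whose similarity exceeds the threshold
hunyuan_video_sampler.pipeline.schedule_mode = "similarity_threshold"
hunyuan_video_sampler.pipeline.before_compression_steps = 5
hunyuan_video_sampler.pipeline.similarity_threshold = 0.9  # hypothetical attribute name and value
```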

Acknowledgement and Citation

This code is built upon open-source projects including HunyuanVideo, xDiT, and FlashAttention.

If you find this code useful in your research, please consider citing our paper:

```bibtex
@misc{yuan2025vgdfrdiffusionbasedvideogeneration,
      title={VGDFR: Diffusion-based Video Generation with Dynamic Latent Frame Rate}, 
      author={Zhihang Yuan and Rui Xie and Yuzhang Shang and Hanling Zhang and Siyuan Wang and Shengen Yan and Guohao Dai and Yu Wang},
      year={2025},
      eprint={2504.12259},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2504.12259}, 
}
```
