[Paper] | [Project Page] | [Demo]
We achieve high-quality first-frame-guided video editing given a reference image (top row), while maintaining flexibility for incorporating additional reference conditions (bottom row).

- [2025.06.15] Using additional edited frames as guidance is now available!
- [2025.06.07] The LoRA-Edit first-frame-guided editing code is now available!
- Unlike video editing techniques such as VACE, our method itself does not provide visual editing capabilities. Instead, we leverage powerful image editing models to edit the first frame (or more frames), thereby transferring image editing capabilities to video editing.
- Our project currently runs at a moderate speed, taking 30-50 minutes to edit 49 frames on an RTX 4090. We are actively working on optimizations (a faster version will be available this month).
- The ComfyUI version is currently facing challenges, as we have not yet found a suitable node for training Wan I2V LoRA in ComfyUI.
- We welcome your issues and contributions to our project!
- Upload first-frame-guided video editing code (Completed)
- Upload additional reference code (Expected before UTC 2025.06.15, Completed)
- Upload detailed memory and speed report on RTX 4090 (Expected before UTC 2025.06.15, Completed)
- Optimize memory usage and speed
- CUDA-compatible GPU with sufficient VRAM (We use a single GeForce RTX 4090 (24GB))
- Python 3.12 (recommended)
- Git
- Miniconda or Anaconda
# Clone the repository with submodules
git clone --recurse-submodules https://github.com/cjeen/LoRAEdit.git
cd LoRAEdit
# If you already cloned without submodules, run:
# git submodule init
# git submodule update
Install a PyTorch build compatible with your CUDA version. Check your CUDA version with `nvcc -V` and choose the appropriate installation command from the official PyTorch website.
Examples for common CUDA versions:
# For CUDA 11.8
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
# For CUDA 12.1
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
# For CUDA 12.4
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
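After installation, a quick check (optional; not part of the pipeline) confirms that PyTorch was built with CUDA support and can see your GPU:

```python
# Optional sanity check: confirm PyTorch detects the GPU before training
import torch

print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))  # e.g. "NVIDIA GeForce RTX 4090"
```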
# Install Python dependencies
pip install -r requirements.txt
# Install huggingface_hub if not already installed
pip install huggingface_hub
# Download the Wan2.1-I2V model
huggingface-cli download Wan-AI/Wan2.1-I2V-14B-480P --local-dir ./Wan2.1-I2V-14B-480P
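If you prefer downloading from Python rather than the CLI, `huggingface_hub.snapshot_download` fetches the same repository into the same local directory:

```python
# Python alternative to the huggingface-cli command above
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="Wan-AI/Wan2.1-I2V-14B-480P",
    local_dir="./Wan2.1-I2V-14B-480P",
)
```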
# Create models directory
mkdir -p models_sam
# Download SAM2 large model (recommended)
wget https://dl.fbaipublicfiles.com/segment_anything_2/072824/sam2_hiera_large.pt -O models_sam/sam2_hiera_large.pt
# Alternative: Download other SAM2 models if needed
# SAM2 Base+: wget https://dl.fbaipublicfiles.com/segment_anything_2/072824/sam2_hiera_base_plus.pt -O models_sam/sam2_hiera_base_plus.pt
# SAM2 Small: wget https://dl.fbaipublicfiles.com/segment_anything_2/072824/sam2_hiera_small.pt -O models_sam/sam2_hiera_small.pt
# SAM2 Tiny: wget https://dl.fbaipublicfiles.com/segment_anything_2/072824/sam2_hiera_tiny.pt -O models_sam/sam2_hiera_tiny.pt
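As a quick sanity check (a convenience sketch, not part of the repository), you can confirm the checkpoint landed where the preprocessing app expects it and is not a truncated download:

```python
# Optional: verify the SAM2 checkpoint exists and looks complete
import os

ckpt = "models_sam/sam2_hiera_large.pt"
assert os.path.isfile(ckpt), f"Missing checkpoint: {ckpt}"
size_mb = os.path.getsize(ckpt) / (1024 * 1024)
print(f"{ckpt}: {size_mb:.0f} MB")  # a truncated download shows up as an unusually small file
```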
Watch this quick tutorial to see how to use the data preprocessing interface:
tutorial_compressed.mp4
Note: A new tutorial video covering additional reference frames will be available soon.
Launch the data preprocessing interface:
python predata_app.py --port 8890 --checkpoint_dir models_sam/sam2_hiera_large.pt
After preprocessing, use the generated training command (example):
NCCL_P2P_DISABLE="1" NCCL_IB_DISABLE="1" deepspeed --num_gpus=1 train.py --deepspeed --config ./processed_data/your_sequence/configs/training.toml
The table below summarizes the training speed and memory usage for different numbers of frames at 480P (832×480) resolution on an RTX 4090, helping you estimate the resource requirements for your own experiments.
All results on our project page are obtained by training for 100 steps under the 49-frame setting at 480P (832×480).
| Number of Frames | Time per Iteration (sec) | Memory Usage (MB) |
|---|---|---|
| 5 | 7.55 | 11,086 |
| 13 | 10.81 | 12,496 |
| 21 | 14.79 | 14,456 |
| 49 | 31.88 | 21,522 |
| 65† | 45.71 | 20,416 |
† For 65 frames, `blocks_to_swap` was set to 38 instead of the default 32.
After training completes, run inference:
# Save your edited first frame as edited_image.png (or .jpg) in the data directory
# Then run inference
python inference.py --model_root_dir ./Wan2.1-I2V-14B-480P --data_dir ./processed_data/your_sequence
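If your image editor exports the edited frame at a different size, a small script like the sketch below can resize it and place it where `inference.py` looks for it. The 832×480 target follows the 480P setting used in this README, and `my_edited_first_frame.png` is a hypothetical input path; adjust both to your setup.

```python
# Sketch: resize an externally edited first frame and save it as edited_image.png.
# "my_edited_first_frame.png" is a placeholder; 832x480 matches the 480P setting in this README.
from PIL import Image

edited = Image.open("my_edited_first_frame.png").convert("RGB")
edited = edited.resize((832, 480), Image.LANCZOS)
edited.save("./processed_data/your_sequence/edited_image.png")
```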
For more precise control using multiple edited frames as reference:
# 1. Edit frames from ./processed_data/your_sequence/source_frames/
#    and save the edited frames to ./processed_data/your_sequence/additional_edited_frames/
#    Important: keep the same filenames (e.g., 00000.png, 00001.png, etc.)
# 2. Preprocess additional data
python predata_additional.py --data_dir ./processed_data/your_sequence
# 3. Train additional LoRA (much faster than previous LoRA training)
NCCL_P2P_DISABLE="1" NCCL_IB_DISABLE="1" deepspeed --num_gpus=1 train.py --deepspeed --config ./processed_data/your_sequence/configs/training_additional.toml
# 4. Run inference with additional frames guidance
python inference.py --model_root_dir ./Wan2.1-I2V-14B-480P --data_dir ./processed_data/your_sequence --additional
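Because the additional-reference step matches edited frames to source frames by filename, a quick check like the sketch below (a convenience script, not part of the repository) can catch naming mismatches before preprocessing:

```python
# Sketch: confirm every edited frame has a source frame with the same filename
import os

data_dir = "./processed_data/your_sequence"
source = set(os.listdir(os.path.join(data_dir, "source_frames")))
edited = set(os.listdir(os.path.join(data_dir, "additional_edited_frames")))

missing = edited - source
if missing:
    print("Edited frames with no matching source filename:", sorted(missing))
else:
    print(f"All {len(edited)} edited frames match source frame filenames.")
```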
project_root/
├── predata_app.py                # Data preprocessing interface
├── train.py                      # LoRA training script
├── inference.py                  # Video generation inference
├── models_sam/                   # SAM2 model checkpoints
│   └── sam2_hiera_large.pt
├── Wan2.1-I2V-14B-480P/          # Wan2.1 model directory
├── processed_data/               # Processed training data
│   └── your_sequence/
│       ├── source_frames/              # Original frames for editing
│       ├── additional_edited_frames/   # Your edited frames for additional reference
│       ├── traindata/                  # Training videos and captions
│       ├── configs/                    # Training configuration files
│       ├── lora/                       # Trained LoRA checkpoints
│       ├── inference_rgb.mp4           # Preprocessed RGB video
│       ├── inference_mask.mp4          # Mask video
│       └── edited_image.png            # Your edited first frame
└── requirements.txt
We would like to express our sincere gratitude to Wan2.1 for open-sourcing their powerful Image-to-Video model, which serves as the foundation for our work.
This project is built upon diffusion-pipe by tdrussell. We gratefully acknowledge their excellent work in providing a solid foundation for memory-efficient training of diffusion models.
The SAM2 GUI interface in this project references code from SAM2-GUI by YunxuanMao. We thank them for their contribution to the SAM2 community with their intuitive interface design.