AnchorCrafter: Animate Cyber-Anchors Selling Your Products via Human-Object Interacting Video Generation
The generation of anchor-style product promotion videos presents promising opportunities in e-commerce, advertising, and consumer engagement. Despite advancements in pose-guided human video generation, creating product promotion videos remains challenging. In addressing this challenge, we identify the integration of human-object interactions (HOI) into pose-guided human video generation as a core issue. To this end, we introduce AnchorCrafter, a novel diffusion-based system designed to generate 2D videos featuring a target human and a customized object, achieving high visual fidelity and controllable interactions. Specifically, we propose two key innovations: the HOI-appearance perception, which enhances object appearance recognition from arbitrary multi-view perspectives and disentangles object and human appearance, and the HOI-motion injection, which enables complex human-object interactions by overcoming challenges in object trajectory conditioning and inter-occlusion management. Extensive experiments show that our system improves object appearance preservation by 7.5% and doubles the object localization accuracy compared to existing state-of-the-art approaches. It also outperforms existing approaches in maintaining human motion consistency and high-quality video generation.
[2025.06.17] We have open-sourced the training and inference code, along with the test dataset. The training dataset is available upon request.
[2025.04.17] We have released the Gradio demo.
conda create -n anchorcrafter python=3.11
conda activate anchorcrafter
pip install -r requirements.txt
- Download the DWPose models and place them in ./models/DWPose.
wget https://huggingface.co/yzd-v/DWPose/resolve/main/yolox_l.onnx?download=true -O models/DWPose/yolox_l.onnx
wget https://huggingface.co/yzd-v/DWPose/resolve/main/dw-ll_ucoco_384.onnx?download=true -O models/DWPose/dw-ll_ucoco_384.onnx
- Download the DINOv2-large model and place it in ./models/dinov2_large.
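If the huggingface_hub CLI is installed (pip install "huggingface_hub[cli]"), a download sketch like the one below should work; the repo ID facebook/dinov2-large is an assumption based on the folder name and the file list further down.
# Pulls config.json, preprocessor_config.json, and pytorch_model.bin into the expected folder
huggingface-cli download facebook/dinov2-large --local-dir models/dinov2_large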
- Download the SVD model and place it in ./models/stable-video-diffusion-img2vid-xt-1-1.
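The weights at stabilityai/stable-video-diffusion-img2vid-xt-1-1 (repo ID assumed from the folder name) are gated on Hugging Face, so this sketch requires accepting the model license and logging in first:
# One-time authentication, then fetch the full pipeline into the expected folder
huggingface-cli login
huggingface-cli download stabilityai/stable-video-diffusion-img2vid-xt-1-1 --local-dir models/stable-video-diffusion-img2vid-xt-1-1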
- You need to modify the "in_channels" parameter in ./models/stable-video-diffusion-img2vid-xt-1-1/unet/config.json:
in_channels: 8 => in_channels: 12
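The edit can also be applied programmatically; this is a minimal sketch that assumes the standard diffusers layout of the SVD checkout above and rewrites the JSON in place:
python - <<'EOF'
import json

# Patch the UNet config so its first conv layer accepts the extra conditioning channels
path = "models/stable-video-diffusion-img2vid-xt-1-1/unet/config.json"
with open(path) as f:
    cfg = json.load(f)
cfg["in_channels"] = 12
with open(path, "w") as f:
    json.dump(cfg, f, indent=2)
EOF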
- You can download AnchorCrafter_1.pth and place it in ./models/. This model has been fine-tuned on the AnchorCrafter-finutune dataset (five test objects).
Finally, all the weights should be organized under ./models as follows:
models/
├── DWPose
│ ├── dw-ll_ucoco_384.onnx
│ └── yolox_l.onnx
├── dinov2_large
│ ├── pytorch_model.bin
│ ├── config.json
│ └── preprocessor_config.json
├── stable-video-diffusion-img2vid-xt-1-1
└── AnchorCrafter_1.pth
A sample configuration for testing is provided in ./config; you can easily modify it to suit your needs.
sh inference.sh
We provide training scripts. Please download the fine-tuning dataset AnchorCrafter-finutune and place it in ./dataset/tune/.
dataset/tune/
├── depth_cut
├── hand_cut
├── masked_object_cut
├── people_cut
├── video_pose
└── video_cut
Download the non-finetuned weights and place them in ./models/. Run the training code with:
sh train.sh
We use DeepSpeed for multi-GPU training, which requires at least 5 GPUs with 40GB of VRAM each. Some parameters in train.sh must be filled in with your own configuration.
We have released the test dataset AnchorCrafter-test, which includes five objects and eight human images, with each object featuring two different poses.
We have collected AnchorCrafter-400, a foundational HOI training dataset comprising 400 videos, and made it available upon application for academic research. To apply for access, please fill out the questionnaire.
@article{xu2024anchorcrafter,
title={AnchorCrafter: Animate CyberAnchors Selling Your Products via Human-Object Interacting Video Generation},
author={Xu, Ziyi and Huang, Ziyao and Cao, Juan and Zhang, Yong and Cun, Xiaodong and Shuai, Qing and Wang, Yuchen and Bao, Linchao and Li, Jintao and Tang, Fan},
journal={arXiv preprint arXiv:2411.17383},
year={2024}
}
Here are some great resources we benefit from: Diffusers, Stability-AI, MimicMotion, SVD_Xtend