# Disentangle Identity, Cooperate Emotion: Correlation-Aware Emotional Talking Portrait Generation
2025/04/29: We released the initial version of the inference code and models. Stay tuned for continuous updates!
| Input | Neutral | Happy | Angry | Surprised |
|---|---|---|---|---|
| | 1_ne.mp4 | 1_ha.mp4 | 1_an.mp4 | 1_su.mp4 |
| | 2_ne.mp4 | 2_ha.mp4 | 2_an.mp4 | 2_su.mp4 |
For more visual demos, please visit our project page.
- It is recommended to use a GPU with `20GB` or more VRAM and an independent `Python 3.10` environment.
- Tested operating system: `Linux`
- `ffmpeg` needs to be installed.
- `PyTorch`: make sure to select the appropriate CUDA version based on your hardware, for example:

  ```bash
  pip install torch==2.2.2 torchvision==0.17.2 torchaudio==2.2.2 --index-url https://download.pytorch.org/whl/cu118
  ```

- Dependencies:

  ```bash
  pip install -r requirements.txt
  ```
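
As a quick sanity check after installation, a short script like the one below (a hypothetical helper, not shipped with the repo) can verify that PyTorch sees a CUDA GPU with enough VRAM and that `ffmpeg` is on the `PATH`:

```python
# sanity_check.py -- illustrative environment check, assuming the packages above are installed
import shutil

import torch

# Verify that PyTorch can see a CUDA device.
if not torch.cuda.is_available():
    raise SystemExit("CUDA is not available; check your PyTorch/CUDA install.")

props = torch.cuda.get_device_properties(0)
vram_gb = props.total_memory / 1024**3
print(f"GPU: {props.name}, VRAM: {vram_gb:.1f} GB")
if vram_gb < 20:
    print("Warning: less than the recommended 20GB of VRAM.")

# Verify that ffmpeg is installed and on PATH.
if shutil.which("ffmpeg") is None:
    raise SystemExit("ffmpeg not found; please install it first.")
print("ffmpeg found.")
```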
- All models are stored in `checkpoints` by default, and the file structure is as follows:

```
DICE-Talk
├── checkpoints
│   ├── DICE-Talk
│   │   ├── audio_linear.pth
│   │   ├── emo_model.pth
│   │   ├── pose_guider.pth
│   │   └── unet.pth
│   ├── stable-video-diffusion-img2vid-xt
│   │   └── ...
│   ├── whisper-tiny
│   │   └── ...
│   ├── RIFE
│   │   └── flownet.pkl
│   └── yoloface_v5m.pt
└── ...
```
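
To confirm that the downloads landed in the expected layout, a small sketch like the following (a hypothetical helper, not part of the repo) can check for the files listed above:

```python
# check_checkpoints.py -- illustrative check of the expected checkpoint layout
from pathlib import Path

REQUIRED_FILES = [
    "DICE-Talk/audio_linear.pth",
    "DICE-Talk/emo_model.pth",
    "DICE-Talk/pose_guider.pth",
    "DICE-Talk/unet.pth",
    "RIFE/flownet.pkl",
    "yoloface_v5m.pt",
]

root = Path("checkpoints")
missing = [p for p in REQUIRED_FILES if not (root / p).exists()]
# svd-xt and whisper-tiny are whole directories, so just check that they exist.
for d in ("stable-video-diffusion-img2vid-xt", "whisper-tiny"):
    if not (root / d).is_dir():
        missing.append(d + "/")

if missing:
    raise SystemExit("Missing checkpoints:\n  " + "\n  ".join(missing))
print("All expected checkpoints are in place.")
```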
Download with `huggingface-cli`:

```bash
python3 -m pip install "huggingface_hub[cli]"
huggingface-cli download EEEELY/DICE-Talk --local-dir checkpoints
huggingface-cli download stabilityai/stable-video-diffusion-img2vid-xt --local-dir checkpoints/stable-video-diffusion-img2vid-xt
huggingface-cli download openai/whisper-tiny --local-dir checkpoints/whisper-tiny
```

or manually download the pretrained model, svd-xt, and whisper-tiny to `checkpoints/`.
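
Equivalently, the same three repos can be fetched from Python with `huggingface_hub`'s `snapshot_download`; the sketch below mirrors the CLI commands above:

```python
# download_models.py -- Python equivalent of the huggingface-cli commands above
from huggingface_hub import snapshot_download

snapshot_download("EEEELY/DICE-Talk", local_dir="checkpoints")
snapshot_download("stabilityai/stable-video-diffusion-img2vid-xt",
                  local_dir="checkpoints/stable-video-diffusion-img2vid-xt")
snapshot_download("openai/whisper-tiny", local_dir="checkpoints/whisper-tiny")
```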
```bash
python3 demo.py --image_path '/path/to/input_image' --audio_path '/path/to/input_audio' \
    --emotion_path '/path/to/input_emotion' --output_path '/path/to/output_video'
```
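
To render the same portrait and audio against several emotion references in one go, a thin wrapper such as the hypothetical sketch below (not part of the repo; all paths are placeholders) simply invokes `demo.py` once per emotion:

```python
# batch_demo.py -- hypothetical wrapper calling demo.py once per emotion reference
import subprocess
from pathlib import Path

IMAGE = "/path/to/input_image"
AUDIO = "/path/to/input_audio"
EMOTIONS = ["/path/to/neutral", "/path/to/happy", "/path/to/angry"]  # placeholders

for emo in EMOTIONS:
    out = f"/path/to/output_{Path(emo).stem}.mp4"
    subprocess.run(
        ["python3", "demo.py",
         "--image_path", IMAGE,
         "--audio_path", AUDIO,
         "--emotion_path", emo,
         "--output_path", out],
        check=True,  # raise immediately if demo.py fails
    )
```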
```bash
python3 gradio_app.py
```
On the left you need to:
- Upload an image or take a photo
- Upload or record an audio clip
- Select the type of emotion to generate
- Set the strength for identity preservation and emotion generation
- Choose whether to crop the input image
On the right are the generated videos.
If you find our work helpful for your research, please consider citing it.
```bibtex
@article{tan2025dicetalk,
  title={Disentangle Identity, Cooperate Emotion: Correlation-Aware Emotional Talking Portrait Generation},
  author={Tan, Weipeng and Lin, Chuming and Xu, Chengming and Xu, FeiFan and Hu, Xiaobin and Ji, Xiaozhong and Zhu, Junwei and Wang, Chengjie and Fu, Yanwei},
  journal={arXiv preprint arXiv:2504.18087},
  year={2025}
}

@article{ji2024sonic,
  title={Sonic: Shifting Focus to Global Audio Perception in Portrait Animation},
  author={Ji, Xiaozhong and Hu, Xiaobin and Xu, Zhihong and Zhu, Junwei and Lin, Chuming and He, Qingdong and Zhang, Jiangning and Luo, Donghao and Chen, Yi and Lin, Qin and others},
  journal={arXiv preprint arXiv:2411.16331},
  year={2024}
}

@article{ji2024realtalk,
  title={Realtalk: Real-time and realistic audio-driven face generation with 3d facial prior-guided identity alignment network},
  author={Ji, Xiaozhong and Lin, Chuming and Ding, Zhonggan and Tai, Ying and Zhu, Junwei and Hu, Xiaobin and Luo, Donghao and Ge, Yanhao and Wang, Chengjie},
  journal={arXiv preprint arXiv:2406.18284},
  year={2024}
}
```