- [x] Release EmoDubber's training and inference code (Basic Function). (Fixed on 5/27/2025)
- [x] Upload pre-processed dataset features to Baidu Cloud and Google Cloud. (Done 5/27/2025)
- [x] Release model checkpoint (Basic Function) for waveform inference. (Before 6/1/2025)
- [ ] Release EmoDubber's emotion control code (Emotion Function).
- [ ] Provide metrics testing scripts (LSE-C, LSE-D, SECS, WER, MCD).
- Clone this repository:
git clone https://github.com/GalaxyCong/EmoDubber.git
cd EmoDubber
- Create an environment
conda create -n emodub python=3.10 -y
conda activate emodub
- Install python requirements:
pip install -r requirements.txt
- Install monotonic_align
pip install git+https://github.com/resemble-ai/monotonic_align.git
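- (Optional) Quick sanity check. The sketch below assumes requirements.txt installs PyTorch Lightning and PyTorch, and that the monotonic_align package installs under the import name monotonic_align; adjust if your setup differs:
```bash
# Optional check that the key dependencies import cleanly inside the emodub env.
python -c "import lightning, torch, monotonic_align; print(lightning.__version__, torch.__version__)"
```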
- (Optional) Last step. Download trainer.py to replace your
anaconda3/envs/emodub/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py
, and download checkpoint_connector.py to replace your
anaconda3/envs/emodub/lib/python3.10/site-packages/lightning/pytorch/trainer/connectors/checkpoint_connector.py
(a sketch of the replacement commands is given below).
(Note: Step 5 is required only if you want to train the model from scratch; if you only want to run inference, please ignore it. Step 5 prevents the "Missing key(s) in state_dict" error (TTS_model.ckpt >> EmoDubber_all); I avoid this problem by setting "strict=False" in PyTorch Lightning.)
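As a rough sketch of step 5 (assuming the default Anaconda install location, the emodub environment created above, and that the downloaded trainer.py and checkpoint_connector.py sit in your current directory), the replacement could look like:
```bash
# Sketch only: back up and replace the two PyTorch Lightning files.
# Adjust SITE_PKGS if your conda installation lives somewhere else.
SITE_PKGS="$HOME/anaconda3/envs/emodub/lib/python3.10/site-packages"

cp "$SITE_PKGS/lightning/pytorch/trainer/trainer.py" \
   "$SITE_PKGS/lightning/pytorch/trainer/trainer.py.bak"
cp trainer.py "$SITE_PKGS/lightning/pytorch/trainer/trainer.py"

cp "$SITE_PKGS/lightning/pytorch/trainer/connectors/checkpoint_connector.py" \
   "$SITE_PKGS/lightning/pytorch/trainer/connectors/checkpoint_connector.py.bak"
cp checkpoint_connector.py "$SITE_PKGS/lightning/pytorch/trainer/connectors/checkpoint_connector.py"
```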
When performing training, both the Raw Audio and the Processed Features need to be downloaded. For inference, only the Processed Features are needed.
- Chem 16KHz Raw Audio: Google Drive || Baidu Drive (erap)
- Chem Processed Feature: Google Drive || Baidu Drive (nriv)
- GRID 16KHz Raw Audio: Google Drive || Baidu Drive (xikd)
- GRID Processed Feature: Google Drive || Baidu Drive (cbdy)
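As an illustration only (the directory names below are hypothetical and not prescribed by the repository; the authoritative locations are whatever paths you set in configs/data/*.yaml), one possible way to organize the downloads is:
```bash
# Hypothetical local layout for the downloaded archives; names are illustrative.
# Point the path entries in configs/data/Chem_dataset.yaml / GRID_dataset.yaml
# at wherever you actually unpack the features.
mkdir -p data/Chem/raw_audio data/Chem/features data/GRID/raw_audio data/GRID/features
# e.g. unzip the "Chem Processed Feature" archive into data/Chem/features/
```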
- Ensure the input paths are correct (see configs/data/Chem_dataset.yaml or configs/data/GRID_dataset.yaml); a quick path check is sketched after the training commands below.
- Download TTS_model.ckpt (pretrained on the LibriTTS-clean-100 dataset) and save it in the Pretrained_TTSmodel folder.
- Finally, please stay in the root directory and run directly:
python EmoDubber_Networks/Train_EmoDubber_Chem16K.py
or
python EmoDubber_Networks/Train_EmoDubber_GRID16K.py
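As a quick sanity check before training (a sketch only; the key names below are the ones referenced in the inference section, and your config may contain additional path entries), you can print the path fields of the dataset config and confirm they point at your downloaded features:
```bash
# Print the path entries in the Chem dataset config so you can verify them
# against the locations of your downloaded features (adjust for GRID).
grep -nE "filelist_path|lip_embedding_path|VA_path|Speaker_GE2E_ID_path" \
  configs/data/Chem_dataset.yaml
```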
We provide EmoDubber's checkpoints (Basic Function). We also provide the audio generated by these checkpoints; it was used for comparison with other SOTA dubbing baselines in the main settings (Setting 1 & Setting 2), i.e., without emotion control. We hope this facilitates future comparisons.
The links are given below:
- Checkpoint: Google Drive or Baidu Drive (sxus)
- Generated Result: Google Drive or Baidu Drive (heu2)
- Checkpoint: Google Drive or Baidu Drive (hv9t)
- Generated Result: Google Drive or Baidu Drive (2ibw)
- Download EmoDubber's 16 kHz Vocoder and save it to the ./Vocoder_16KHz folder.
- Run the inference script (stay in the root directory):
- For main Setting 1:
python EmoDubber_Networks/Inference_Chem_Unbatch_New_S1.py \
--checkpoint_path [model_dir] \
--vocoder_checkpoint_path [vocoder_dir] \
--Val_list [script_dir] \
--Silent_Lip [lip_dir] \
--Silent_Face [face_dir] \
--Refence_audio [reference_dir]
- For main Setting 2:
python EmoDubber_Networks/Inference_Chem_Unbatch_New_S2.py \
--checkpoint_path [model_dir] \
--vocoder_checkpoint_path [vocoder_dir] \
--Val_list [script_dir] \
--Silent_Lip [lip_dir] \
--Silent_Face [face_dir] \
--Refence_audio [reference_dir] \
--Set2_list [script2_dir]
- checkpoint_path: Path to the directory containing checkpoint files. We have provided our checkpoints.
- vocoder_checkpoint_path: Path to the vocoder that matches EmoDubber. Default is the Vocoder_16KHz folder.
- Val_list: Path to the txt script. Equal to valid_filelist_path in ./configs/data/*.yaml.
- Silent_Lip: Path to the lip-motion features. Equal to lip_embedding_path in ./configs/data/*.yaml.
- Silent_Face: Path to the face features. Equal to VA_path in ./configs/data/*.yaml.
- Refence_audio: Path to the reference audio features. Equal to Speaker_GE2E_ID_path in ./configs/data/*.yaml.
- Set2_list: Path to the txt script for Setting 2, required when running Inference_Chem_Unbatch_New_S2.py or Inference_GRID_Unbatch_New_S2.py. It avoids using the target audio as the reference; the reference audio should come from another clip. Set2_list can be downloaded here.
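For illustration, a filled-in Setting 1 invocation might look like the following (every path below is a hypothetical placeholder, not a location shipped with the repo; substitute the directories where you saved the checkpoint, vocoder, and processed features):
```bash
# Hypothetical example paths; replace each with your own download locations.
python EmoDubber_Networks/Inference_Chem_Unbatch_New_S1.py \
  --checkpoint_path ./EmoDubber_Checkpoint \
  --vocoder_checkpoint_path ./Vocoder_16KHz \
  --Val_list ./data/Chem/features/valid_filelist.txt \
  --Silent_Lip ./data/Chem/features/lip_embedding \
  --Silent_Face ./data/Chem/features/VA_feature \
  --Refence_audio ./data/Chem/features/GE2E_speaker_embedding
```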
Under construction
We provide all checkpoints. Below are the checkpoints of our emotional expert classifier.
- Five types of emotions (Recommended): https://drive.google.com/drive/folders/1vSVTAkZsoinSlYgeVCvBXs5V2k-gIurV?usp=sharing
- Seven types of emotions (Recommended): https://drive.google.com/drive/folders/1h0Y1TChA9vgX3_6u5GUrJK69n0rgKTRU?usp=sharing
- Seven types of emotions with emotionless data augmentation: https://drive.google.com/drive/folders/1DuhQYe5FuowHBRMOFRhthrvPJBlfK_5E?usp=sharing
Under construction
Code: MIT License
If you find this helpful for your research, please consider citing:
@article{cong2024emodubber,
title={EmoDubber: Towards High Quality and Emotion Controllable Movie Dubbing},
author={Cong, Gaoxiang and Pan, Jiadong and Li, Liang and Qi, Yuankai and Peng, Yuxin and Hengel, Anton van den and Yang, Jian and Huang, Qingming},
journal={arXiv preprint arXiv:2412.08988},
year={2024}
}
My email is gaoxiang.cong@vipl.ict.ac.cn.
Any discussions and suggestions are welcome!