HPMDubbing🎬 - PyTorch Implementation

In this paper, we propose a novel movie dubbing architecture based on hierarchical prosody modeling, which bridges visual information to the corresponding speech prosody from three aspects: lip, face, and scene. Specifically, we align lip movement to speech duration, and convey facial expression to speech energy and pitch via an attention mechanism based on valence and arousal representations inspired by findings in psychology. Moreover, we design an emotion booster to capture the atmosphere of the global video scene. All these embeddings are used together to generate the mel-spectrogram, which is then converted into a speech waveform by an existing vocoder. Extensive experimental results on the V2C and Chem benchmark datasets demonstrate the favourable performance of the proposed method.

🌟🌟🌟🥳 Here are some demo samples generated by HPMDubbing:

chem1.mp4

📝Text: That defines our equilibrium.

Chem_2.mp4

📝Text: Each gas will exert what's called a partial pressure.

V2C1.mp4

📝Text: Yes. I'm the baby Jesus.

V2C_2.mp4

📝Text: This? No! Oh, no. This is just a temporary thing.

V2C_3.mp4

📝Text: It was an accident. She was scared.

Visit our demo website or download the samples generated by the HPMDubbing model (🔊result on Chem and 🔊result on V2C) to see more results.

Dependencies

You can install the Python dependencies with

pip3 install -r requirements.txt

Dataset

1) For V2C

V2C-MovieAnimation is a multi-speaker dataset for animation movie dubbing with identity and emotion annotations. It is collected from 26 Disney cartoon movies and covers 153 diverse characters. Due to copyright restrictions, we cannot provide the dataset directly; see the V2C issue.

In this work, we release V2C-MovieAnimation2.0 to meet the requirement of dubbing specified characters. Specifically, we removed redundant character faces from the movie frames (please note that our video frames are sampled at 25 FPS by ffmpeg). You can download our preprocessed features directly through the link GoogleDrive or BaiduDrive (password: Good).
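For readers who want to sample their own clips at the same rate, the following is a minimal sketch of how a 25 FPS frame extraction command could be assembled. The function name, the `clip.mp4` and `frames` paths, and the `%05d.jpg` naming pattern are assumptions for illustration, not the authors' preprocessing script; only the 25 FPS rate and the use of ffmpeg come from the text above.

```python
from pathlib import Path


def build_frame_extraction_command(video_path, frame_dir, fps=25):
    """Return an ffmpeg argument list that samples `video_path` at `fps` frames per second."""
    # The %05d.jpg pattern is an assumed naming scheme, not the one used by the authors.
    output_pattern = str(Path(frame_dir) / "%05d.jpg")
    # "-vf fps=N" is ffmpeg's video filter for resampling to a fixed frame rate.
    return ["ffmpeg", "-i", str(video_path), "-vf", f"fps={fps}", output_pattern]


command = build_frame_extraction_command("clip.mp4", "frames")
print(" ".join(command))
```

The list can be passed to `subprocess.run(command, check=True)` once ffmpeg is installed and the output directory exists.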

2) For Chem

The Chem dataset is provided by Neural Dubber; it is drawn from the single-speaker chemistry lecture dataset of Lip2Wav.

Data Preparation

For voice preprocessing (mel-spectrograms, pitch, and energy), Montreal Forced Aligner (MFA) is used to obtain the alignments between the utterances and the phoneme sequences. Alternatively, you can skip the complicated steps below and use our extracted features directly.

Download the official Montreal Forced Aligner (MFA) package and run

./montreal-forced-aligner/bin/mfa_align /data/conggaoxiang/HPMDubbing/V2C_Data/wav16 /data/conggaoxiang/HPMDubbing/lexicon/librispeech-lexicon.txt  english /data/conggaoxiang/HPMDubbing/V2C_Code/example_V2C16/TextGrid -j

Then run the scripts below to save the .npy files of mel-spectrograms, pitch, and energy for the two datasets, respectively.

python V2C_preprocess.py config/MovieAnimation/preprocess.yaml

python Chem_preprocess.py config/MovieAnimation/preprocess.yaml
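To make the role of the forced alignment more concrete, here is a minimal sketch of how phoneme intervals (as found in an MFA TextGrid) can be turned into per-phoneme frame counts, in the style commonly used for FastSpeech 2 duration targets. It is not taken from `V2C_preprocess.py`; the function name, the 16 kHz sampling rate, and the hop length of 160 samples are assumptions, and the real values should be read from `preprocess.yaml`.

```python
def intervals_to_frame_durations(intervals, sampling_rate=16000, hop_length=160):
    """Convert (phoneme, start_sec, end_sec) intervals into per-phoneme frame counts.

    sampling_rate and hop_length are assumed defaults for illustration only.
    """
    phonemes, durations = [], []
    for phoneme, start, end in intervals:
        # Round both boundaries to frame indices first, so that the durations
        # of contiguous intervals sum exactly to the last boundary's frame index.
        start_frame = int(round(start * sampling_rate / hop_length))
        end_frame = int(round(end * sampling_rate / hop_length))
        phonemes.append(phoneme)
        durations.append(end_frame - start_frame)
    return phonemes, durations


# Hypothetical alignment for the first three phonemes of an utterance.
example = [("DH", 0.00, 0.05), ("AE1", 0.05, 0.17), ("T", 0.17, 0.23)]
phonemes, durations = intervals_to_frame_durations(example)
print(phonemes, durations)
```

Rounding the boundaries rather than the interval lengths avoids accumulating drift between the duration targets and the length of the mel-spectrogram.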

For hierarchical visual feature preprocessing (lip, face, and scenes), we detect and crop the face from the video frames using the $S^3FD$ face detection model. Then, we align the faces to generate 68 landmarks and bounding boxes (./landmarks and ./boxes). Finally, we get the mouth ROIs from all video clips, following EyeLipCropper. As before, you can skip the steps below and directly use the features we extracted.
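As an illustration of the mouth ROI step, the sketch below computes a square crop box centred on the mouth from a set of 68 facial landmarks. It is not the EyeLipCropper implementation: the function name and the 96-pixel crop size are assumptions, and the only convention relied on is that indices 48 to 67 (0-based) are the mouth points in the standard 68-point layout.

```python
MOUTH_LANDMARK_INDICES = range(48, 68)  # mouth points in the 68-point layout (0-based)


def mouth_roi(landmarks, size=96):
    """Return (left, top, right, bottom) of a square crop centred on the mouth.

    `landmarks` is a sequence of 68 (x, y) pairs; `size` is an assumed crop width.
    """
    if len(landmarks) != 68:
        raise ValueError("expected 68 (x, y) landmarks")
    mouth = [landmarks[i] for i in MOUTH_LANDMARK_INDICES]
    # The mouth centre is the mean of the mouth landmark coordinates.
    center_x = sum(x for x, _ in mouth) / len(mouth)
    center_y = sum(y for _, y in mouth) / len(mouth)
    half = size // 2
    left = int(round(center_x)) - half
    top = int(round(center_y)) - half
    return left, top, left + size, top + size


# Synthetic landmarks: every point placed at (100, 150) for a quick check.
box = mouth_roi([(100, 150)] * 68)
print(box)
```

The box is not clipped to the image bounds, so a caller would still need to pad or clamp it for faces near the frame edge.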

We use the pre-trained weights of emonet to extract affective display features, and fine-tune the arousal and valence features (dimension 256) taken from the last layer of the emonet network.

python V2C_emotion.py -c emonet_8.pth -o /data/conggaoxiang/V2C_feature/example_V2C_framelevel/MovieAnimation/VA_feature -i /data/conggaoxiang/detect_face

The lip feature is extracted by resnet18_mstcn_video, which takes the grayscale mouth ROIs of each video as input.

python lip_main.py --modality video --config-path /data/conggaoxiang/lip/Lipreading_using_Temporal_Convolutional_Networks-master/configs/lrw_resnet18_mstcn.json --model-path /data/conggaoxiang/lip/Lipreading_using_Temporal_Convolutional_Networks-master/models/lrw_resnet18_mstcn_video.pth --data-dir /data/conggaoxiang/lip/Lipreading_using_Temporal_Convolutional_Networks-master/MOUTH_processing --annonation-direc /data/conggaoxiang/lip/LRW_dataset/lipread_mp4 --test

Finally, the scene feature is extracted with the I3D model, as provided by V2C-Net.

python ./emotion_encoder/video_features/emotion_encoder.py

Vocoder

We provide the pre-trained model and implementation details of HPMDubbing_Vocoder. Please download the HPMDubbing vocoder and put it into the vocoder/HiFi_GAN_16/ or vocoder/HiFi_GAN_220/ folder. Before running, remember to check line 63 of model.yaml and change it to your own path.

vocoder:
  model: [HiFi_GAN_16] or [HiFi_GAN_220]
  speaker: "LJSpeech" 
  vocoder_checkpoint_path: [Your path]

Training

For the V2C-MovieAnimation dataset, run train.py with

python train.py -p config/MovieAnimation/preprocess.yaml -m config/MovieAnimation/model.yaml -t config/MovieAnimation/train.yaml -p2 config/MovieAnimation/preprocess.yaml

For the Chem dataset, run train.py with

python train.py -p config/Chem/preprocess.yaml -m config/Chem/model.yaml -t config/Chem/train.yaml -p2 config/Chem/preprocess.yaml
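The two training commands differ only in the dataset directory under `config/`. The following sketch shows that pattern as a small helper that assembles the argument list; the helper name is hypothetical and is not part of the repository, while the flags and file names are copied from the commands above.

```python
def build_train_command(dataset):
    """Assemble the train.py invocation for a dataset directory under config/."""
    config_dir = f"config/{dataset}"
    return [
        "python", "train.py",
        "-p", f"{config_dir}/preprocess.yaml",
        "-m", f"{config_dir}/model.yaml",
        "-t", f"{config_dir}/train.yaml",
        "-p2", f"{config_dir}/preprocess.yaml",
    ]


for name in ("MovieAnimation", "Chem"):
    print(" ".join(build_train_command(name)))
```

The resulting list can be handed to `subprocess.run(..., check=True)` from the repository root.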

Pretrained models

You can also use the pretrained models we provide. Download them through the link GoogleDrive or BaiduDrive (password: star), then synthesize speech with the following command:

python Synthesis.py --restore_step [Checkpoint] -p config/MovieAnimation/preprocess.yaml -m config/MovieAnimation/model.yaml -t config/MovieAnimation/train.yaml -p2 config/MovieAnimation/preprocess.yaml

Tensorboard

Use

tensorboard --logdir output/log/MovieAnimation --port=[Your port]

or

tensorboard --logdir output/log/Chem --port=[Your port]

to serve TensorBoard on your localhost. The loss curves, MCD (mel cepstral distortion) curves, synthesized mel-spectrograms, and audio samples are shown.
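For reference, the sketch below implements the commonly used definition of mel cepstral distortion, MCD = (10 / ln 10) * sqrt(2 * sum_d (c_d - c'_d)^2), averaged over frames. It is an assumption that the logged curves follow this definition; the function is illustrative and is not the repository's `mcd.py`. It expects two already time-aligned sequences of mel-cepstral coefficient frames, typically with the 0th (energy) coefficient excluded by the caller.

```python
import math


def mel_cepstral_distortion(reference, synthesized):
    """Mean MCD in dB over time-aligned frames of mel-cepstral coefficients."""
    if len(reference) != len(synthesized) or not reference:
        raise ValueError("sequences must be non-empty and have the same number of frames")
    # Constant factor (10 / ln 10) * sqrt(2) from the standard MCD definition.
    scale = 10.0 / math.log(10.0) * math.sqrt(2.0)
    total = 0.0
    for ref_frame, syn_frame in zip(reference, synthesized):
        squared_error = sum((r - s) ** 2 for r, s in zip(ref_frame, syn_frame))
        total += scale * math.sqrt(squared_error)
    return total / len(reference)


# Toy frames with three coefficients each (made-up numbers).
ref = [[1.0, 0.5, -0.2], [0.8, 0.4, -0.1]]
same = mel_cepstral_distortion(ref, ref)
shifted = mel_cepstral_distortion([[0.0, 0.0]], [[1.0, 0.0]])
print(same, shifted)
```

Lower values indicate that the synthesized spectral envelope is closer to the reference; sequences of different lengths would first need alignment, for example with dynamic time warping.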

References

V2C: Visual Voice Cloning, Q. Chen, et al.
Neural Dubber: Dubbing for Videos According to Scripts, C. Hu, et al.
FastSpeech 2: Fast and High-Quality End-to-End Text to Speech, Y. Ren, et al.

Citation

If our research and this repository are helpful to your work, please cite with:

@inproceedings{cong2023learning,
  title={Learning to Dub Movies via Hierarchical Prosody Models},
  author={Cong, Gaoxiang and Li, Liang and Qi, Yuankai and Zha, Zheng-Jun and Wu, Qi and Wang, Wenyu and Jiang, Bin and Yang, Ming-Hsuan and Huang, Qingming},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={14687--14697},
  year={2023}
}
