This is the official implementation for DiVISe: Direct Visual-Input Speech Synthesis Preserving Speaker Characteristics And Intelligibility.
Abstract: Video-to-speech (V2S) synthesis, the task of generating speech directly from silent video input, is inherently more challenging than other speech synthesis tasks due to the need to accurately reconstruct both speech content and speaker characteristics from visual cues alone. Recently, audio-visual pre-training has eliminated the need for additional acoustic hints in V2S, which previous methods often relied on to ensure training convergence. However, even with pre-training, existing methods continue to face challenges in achieving a balance between acoustic intelligibility and the preservation of speaker-specific characteristics. We analyzed this limitation and were motivated to introduce DiVISe (Direct Visual-Input Speech Synthesis), an end-to-end V2S model that predicts Mel-spectrograms directly from video frames alone. Despite not taking any acoustic hints, DiVISe effectively preserves speaker characteristics in the generated audio, and achieves superior performance on both objective and subjective metrics across the LRS2 and LRS3 datasets. Our results demonstrate that DiVISe not only outperforms existing V2S models in acoustic intelligibility but also scales more effectively with increased data and model parameters.
- DiVISe implementation and an unofficial ReVISE re-implementation for the video-to-speech task.
- Scripts to train vocoders on resampled LJSpeech (16 kHz). These are provided in `custom_hifigan`, a git submodule forked from bshall's implementation.
- Pre-trained parameters for DiVISe and our ReVISE implementation, as well as their vocoders.
- Python 3.8
- Install the dependencies: `pip install -r requirements.txt`
- Initialize the submodule with `git submodule update --init --recursive`.
- Replace `SPEAKER_ENCODER_PATH` with your path in `env.py`.
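As a quick reference, the whole setup can be done in a few commands (a sketch; the virtualenv is optional and the speaker-encoder path is a placeholder):

```bash
# One-off environment setup (assumes Python 3.8 is available).
python -m venv .venv && source .venv/bin/activate   # optional virtual environment
pip install -r requirements.txt                     # install dependencies
git submodule update --init --recursive             # pull the custom_hifigan submodule
# Finally, edit env.py so that SPEAKER_ENCODER_PATH points at your speaker-encoder checkpoint.
```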
- (LRS3) Please refer to AV-HuBERT.
- (LRS2) See dataset/lrs2/README.md for instructions. This is effectively identical to AV-HuBERT's preprocessing, apart from the modifications we made to suit LRS2's file structure.
If you run into any problem with the data preparation pipeline, please feel free to open an issue to let us know.
Two config files are needed to run the code. Generally, `conf/avhubert` handles the data location and the model size, while `conf/hifigan` contains training-related hyperparameters.

- `conf/avhubert`: modification needed as follows (a minimal excerpt is sketched after the table below):
  - Replace the `???` of `task:data` with your preprocessed data directory. Either the preprocessed dir of LRS3 or LRS2 will fit `task:data`.
    - For the low-resource setting, use the `30h` dir of your preprocessed dataset.
    - For the full-resource setting, use the `433h` dir / `224h` dir of your preprocessed dataset.
- `conf/hifigan`: no modification needed.

Model | Config File |
---|---|
DiVISe | conf/hifigan/video2speech_template.json |
ReVISE | conf/hifigan/video2speech_revise_original.json |
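For reference, the edited part of the avhubert config might look like the excerpt below (a sketch only; the path is a placeholder and all other keys are omitted):

```yaml
# conf/avhubert/large_avhubert_template.yaml -- illustrative excerpt
task:
  data: /path/to/preprocessed/lrs3/433h   # was `???`; point at your 30h / 433h / 224h dir
```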
For strict re-implementation, please follow the number of GPUs given below: as the number of updates is set per GPU in our setting, the number of updates will differ with a different number of GPUs. RTX 3090 or RTX 4090 GPUs will be fine in our case.

- 4 GPUs are required to run the 30h setting.
- 8 GPUs are required to run the 433h setting.
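If your machine has more GPUs than a setting calls for, one option is to restrict device visibility (a sketch, assuming the training script simply uses every visible GPU):

```bash
# Expose exactly 4 GPUs for the 30h (low-resource) setting.
CUDA_VISIBLE_DEVICES=0,1,2,3 python train.py ...   # remaining arguments as in the table below
```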
Script references are listed below. Here we assume training with the 433h full-resource setup on 8 GPUs.

- `your_ckpt_path=[REPLACE HERE]` # ckpt path
- `your_avhubert_ckpt=[REPLACE HERE]` # base_lrs3_iter5.pt for the BASE setting, large_vox_iter5.pt for the LARGE setting (default). You may find these checkpoints here.
- `your_hifigan_ckpt=[REPLACE HERE]` # You may find these checkpoints here.
Command | Model |
---|---|
`python train.py --checkpoint_path $your_ckpt_path --hifigan_config conf/hifigan/video2speech_template.json --avhubert_config conf/avhubert/large_avhubert_template.yaml --avhubert_ckpt $your_avhubert_ckpt --hifigan_ckpt $your_hifigan_ckpt --wandb` | DiVISe |
`python train.py --checkpoint_path $your_ckpt_path --hifigan_config conf/hifigan/video2speech_revise_original.json --avhubert_config conf/avhubert/large_avhubert_template.yaml --avhubert_ckpt $your_avhubert_ckpt --hifigan_ckpt $your_hifigan_ckpt --wandb` | ReVISE |
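As a concrete usage example, a full DiVISe launch for the 433h setup might look like this (the paths below are placeholders; substitute your own checkpoints and output directory):

```bash
your_ckpt_path=checkpoints/divise_lrs3_433h            # hypothetical output directory
your_avhubert_ckpt=checkpoints/large_vox_iter5.pt      # AV-HuBERT LARGE checkpoint (default setting)
your_hifigan_ckpt=checkpoints/model_hfgpretrained.pt   # pre-trained HiFi-GAN vocoder (see Released Models)

python train.py \
  --checkpoint_path $your_ckpt_path \
  --hifigan_config conf/hifigan/video2speech_template.json \
  --avhubert_config conf/avhubert/large_avhubert_template.yaml \
  --avhubert_ckpt $your_avhubert_ckpt \
  --hifigan_ckpt $your_hifigan_ckpt \
  --wandb
```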
- Evaluation can be done by simply adding an extra `--test` argument to the training commands above, with only a single GPU. `--save_samples` can be added to export the synthesized audio under `$your_ckpt_path`.
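For instance, evaluating the DiVISe run above could look like the following (a sketch: the same arguments as training, restricted to one GPU):

```bash
# Evaluate DiVISe and export synthesized audio under $your_ckpt_path.
CUDA_VISIBLE_DEVICES=0 python train.py \
  --checkpoint_path $your_ckpt_path \
  --hifigan_config conf/hifigan/video2speech_template.json \
  --avhubert_config conf/avhubert/large_avhubert_template.yaml \
  --avhubert_ckpt $your_avhubert_ckpt \
  --hifigan_ckpt $your_hifigan_ckpt \
  --test --save_samples
```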
- `hfg_ckpt_pretrained=[REPLACE HERE]` # file path to the vocoder pre-trained on the 16 kHz LJSpeech dataset.
- `data_dir=[REPLACE HERE]` # fill with task:data in your avhubert config.
- `vc_ft_ckpt_path=[REPLACE HERE]` # ckpt path for vocoder fine-tuning.
- `mel_postfix=[REPLACE HERE]` # mel export file identifier.
- Export log Mel-spectrograms with `generate_mel.py`:
  `python generate_mel.py --checkpoint_path $your_ckpt_path --avhubert_config conf/avhubert/large_avhubert_template.yaml --postfix $mel_postfix`
- `cd custom_hifigan` and fine-tune the vocoder on the generated log Mel-spectrograms with the following command on 8 GPUs:
  `python train.py $data_dir $vc_ft_ckpt_path --resume $hfg_ckpt_pretrained --finetune --npy mel_"$mel_postfix" --wandb`
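Putting the two steps together, the vocoder adaptation might look like this end to end (all paths and the postfix are placeholders):

```bash
hfg_ckpt_pretrained=checkpoints/model_hfgpretrained.pt   # vocoder pre-trained on 16 kHz LJSpeech
data_dir=/path/to/preprocessed/lrs3/433h                  # same dir as task:data in the avhubert config
vc_ft_ckpt_path=checkpoints/hifigan_ft_divise             # output dir for the fine-tuned vocoder
mel_postfix=divise_433h                                   # mel export file identifier

# 1. Export the log Mel-spectrograms predicted by the trained DiVISe model.
python generate_mel.py --checkpoint_path $your_ckpt_path \
  --avhubert_config conf/avhubert/large_avhubert_template.yaml \
  --postfix $mel_postfix

# 2. Fine-tune the HiFi-GAN vocoder on the exported spectrograms (8 GPUs).
cd custom_hifigan
python train.py $data_dir $vc_ft_ckpt_path \
  --resume $hfg_ckpt_pretrained --finetune \
  --npy mel_"$mel_postfix" --wandb
```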
We release links to the model parameters trained with the full-resource setting of LRS3 in this paper here.
Model | Name |
---|---|
DiVISe | g_45000_divse |
ReVISE (Our Implementation) | g_45000_revise |
The vocoders are pre-trained on a resampled version (16 kHz) of the LJSpeech dataset.
Models | Name |
---|---|
HiFiGAN (Fine-tuned For DiVISe) | model_hfgfinetuned.pt |
HiFiGAN (Pre-trained only) | model_hfgpretrained.pt |
Unit-HiFiGAN (For ReVISE) | model_unithfg.pt |
- Full samples for the LRS3 test set can be downloaded here. Files with the postfix `_vc` are generated with DiVISe as reported in the paper, while those with the postfix `_gf` are synthesized with Griffin-Lim for comparison.
- We also provide a simple demo page here.
Special thanks to HiFi-GAN and AV-HuBERT, on which this repository is built. We also appreciate all other works mentioned in this repository.