This is the official implementation for DiVISe: Direct Visual-Input Speech Synthesis Preserving Speaker Characteristics And Intelligibility.
Abstract: Video-to-speech (V2S) synthesis, the task of generating speech directly from silent video input, is inherently more challenging than other speech synthesis tasks due to the need to accurately reconstruct both speech content and speaker characteristics from visual cues alone. Recently, audio-visual pre-training has eliminated the need for additional acoustic hints in V2S, which previous methods often relied on to ensure training convergence. However, even with pre-training, existing methods continue to face challenges in achieving a balance between acoustic intelligibility and the preservation of speaker-specific characteristics. We analyzed this limitation and were motivated to introduce DiVISe (Direct Visual-Input Speech Synthesis), an end-to-end V2S model that predicts Mel-spectrograms directly from video frames alone. Despite not taking any acoustic hints, DiVISe effectively preserves speaker characteristics in the generated audio, and achieves superior performance on both objective and subjective metrics across the LRS2 and LRS3 datasets. Our results demonstrate that DiVISe not only outperforms existing V2S models in acoustic intelligibility but also scales more effectively with increased data and model parameters.
- DiVISe implementation and an unofficial ReVISE re-implementation for the video-to-speech task.
- Scripts to train vocoders on resampled LJSpeech (16 kHz). These are provided in `custom_hifigan`, a git submodule forked from bshall's implementation.
- Pre-trained parameters for DiVISe and our ReVISE implementation, as well as their vocoders.
- Python 3.8
- Install the dependencies: `pip install -r requirements.txt`
- Initialize the submodule with `git submodule update --init --recursive`.
- Replace `SPEAKER_ENCODER_PATH` with your path in `env.py`.
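As a quick reference, the whole setup can be done in a few commands (a sketch; the virtualenv is optional and the speaker-encoder path is a placeholder):

```bash
# One-off environment setup (assumes Python 3.8 is available).
python -m venv .venv && source .venv/bin/activate   # optional virtual environment
pip install -r requirements.txt                     # install dependencies
git submodule update --init --recursive             # pull the custom_hifigan submodule
# Finally, edit env.py so that SPEAKER_ENCODER_PATH points at your speaker-encoder checkpoint.
```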
- (LRS3) Please refer to AV-HuBERT.
- (LRS2) See dataset/lrs2/README.md for instructions. This is effectively identical to AV-HuBERT's preprocessing, apart from the modifications we made to suit LRS2's file structure.
If you run into any problem with the data preparation pipeline, please feel free to open an issue to let us know.
Two config files are needed to run the code. Generally, `conf/avhubert` handles the data location and the model size, while `conf/hifigan` contains training-related hyperparameters.

- `conf/avhubert`: modification needed as follows (a minimal excerpt is sketched after the table below):
  - Replace the `???` of `task:data` with your preprocessed data directory. Either the preprocessed dir of LRS3 or LRS2 will fit `task:data`.
    - For the low-resource setting, use the `30h` dir of your preprocessed dataset.
    - For the full-resource setting, use the `433h` dir / `224h` dir of your preprocessed dataset.
- `conf/hifigan`: no modification needed.

Model | Config File |
---|---|
DiVISe | conf/hifigan/video2speech_template.json |
ReVISE | conf/hifigan/video2speech_revise_original.json |
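For reference, the edited part of the avhubert config might look like the excerpt below (a sketch only; the path is a placeholder and all other keys are omitted):

```yaml
# conf/avhubert/large_avhubert_template.yaml -- illustrative excerpt
task:
  data: /path/to/preprocessed/lrs3/433h   # was `???`; point at your 30h / 433h / 224h dir
```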
For strict re-implementation, please follow the number of GPUs given below: as the number of updates is set per GPU in our setting, the number of updates will differ with a different number of GPUs. RTX 3090 or RTX 4090 GPUs will be fine in our case.

- 4 GPUs are required to run the 30h setting.
- 8 GPUs are required to run the 433h setting.
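If your machine has more GPUs than a setting calls for, one option is to restrict device visibility (a sketch, assuming the training script simply uses every visible GPU):

```bash
# Expose exactly 4 GPUs for the 30h (low-resource) setting.
CUDA_VISIBLE_DEVICES=0,1,2,3 python train.py ...   # remaining arguments as in the table below
```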
Script references are listed below. Here we assume training with the 433h full-resource setup on 8 GPUs.

- `your_ckpt_path=[REPLACE HERE]` # ckpt path
- `your_avhubert_ckpt=[REPLACE HERE]` # base_lrs3_iter5.pt for the BASE setting, large_vox_iter5.pt for the LARGE setting (default). You may find these checkpoints here.
- `your_hifigan_ckpt=[REPLACE HERE]` # You may find these checkpoints here.
Command | Model |
---|---|
`python train.py --checkpoint_path $your_ckpt_path --hifigan_config conf/hifigan/video2speech_template.json --avhubert_config conf/avhubert/large_avhubert_template.yaml --avhubert_ckpt $your_avhubert_ckpt --hifigan_ckpt $your_hifigan_ckpt --wandb` | DiVISe |
`python train.py --checkpoint_path $your_ckpt_path --hifigan_config conf/hifigan/video2speech_revise_original.json --avhubert_config conf/avhubert/large_avhubert_template.yaml --avhubert_ckpt $your_avhubert_ckpt --hifigan_ckpt $your_hifigan_ckpt --wandb` | ReVISE |
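As a concrete usage example, a full DiVISe launch for the 433h setup might look like this (the paths below are placeholders; substitute your own checkpoints and output directory):

```bash
your_ckpt_path=checkpoints/divise_lrs3_433h            # hypothetical output directory
your_avhubert_ckpt=checkpoints/large_vox_iter5.pt      # AV-HuBERT LARGE checkpoint (default setting)
your_hifigan_ckpt=checkpoints/model_hfgpretrained.pt   # pre-trained HiFi-GAN vocoder (see Released Models)

python train.py \
  --checkpoint_path $your_ckpt_path \
  --hifigan_config conf/hifigan/video2speech_template.json \
  --avhubert_config conf/avhubert/large_avhubert_template.yaml \
  --avhubert_ckpt $your_avhubert_ckpt \
  --hifigan_ckpt $your_hifigan_ckpt \
  --wandb
```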
- Evaluation can be done by simply adding an extra `--test` argument to the training commands above, with only a single GPU. `--save_samples` can be added to export the synthesized audio under `$your_ckpt_path`.
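For instance, evaluating the DiVISe run above could look like the following (a sketch: the same arguments as training, restricted to one GPU):

```bash
# Evaluate DiVISe and export synthesized audio under $your_ckpt_path.
CUDA_VISIBLE_DEVICES=0 python train.py \
  --checkpoint_path $your_ckpt_path \
  --hifigan_config conf/hifigan/video2speech_template.json \
  --avhubert_config conf/avhubert/large_avhubert_template.yaml \
  --avhubert_ckpt $your_avhubert_ckpt \
  --hifigan_ckpt $your_hifigan_ckpt \
  --test --save_samples
```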
- `hfg_ckpt_pretrained=[REPLACE HERE]` # file path to the vocoder pre-trained on the 16 kHz LJSpeech dataset.
- `data_dir=[REPLACE HERE]` # fill with task:data in your avhubert config.
- `vc_ft_ckpt_path=[REPLACE HERE]` # ckpt path for vocoder fine-tuning.
- `mel_postfix=[REPLACE HERE]` # mel export file identifier.
- Export log Mel-spectrograms with `generate_mel.py`:
  `python generate_mel.py --checkpoint_path $your_ckpt_path --avhubert_config conf/avhubert/large_avhubert_template.yaml --postfix $mel_postfix`
- `cd custom_hifigan` and fine-tune the vocoder on the generated log Mel-spectrograms with the following command on 8 GPUs:
  `python train.py $data_dir $vc_ft_ckpt_path --resume $hfg_ckpt_pretrained --finetune --npy mel_"$mel_postfix" --wandb`
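Putting the two steps together, the vocoder adaptation might look like this end to end (all paths and the postfix are placeholders):

```bash
hfg_ckpt_pretrained=checkpoints/model_hfgpretrained.pt   # vocoder pre-trained on 16 kHz LJSpeech
data_dir=/path/to/preprocessed/lrs3/433h                  # same dir as task:data in the avhubert config
vc_ft_ckpt_path=checkpoints/hifigan_ft_divise             # output dir for the fine-tuned vocoder
mel_postfix=divise_433h                                   # mel export file identifier

# 1. Export the log Mel-spectrograms predicted by the trained DiVISe model.
python generate_mel.py --checkpoint_path $your_ckpt_path \
  --avhubert_config conf/avhubert/large_avhubert_template.yaml \
  --postfix $mel_postfix

# 2. Fine-tune the HiFi-GAN vocoder on the exported spectrograms (8 GPUs).
cd custom_hifigan
python train.py $data_dir $vc_ft_ckpt_path \
  --resume $hfg_ckpt_pretrained --finetune \
  --npy mel_"$mel_postfix" --wandb
```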
We release links to the model parameters trained with the full-resource setting of LRS3 in this paper here.
Model | Name |
---|---|
DiVISe | g_45000_divse |
ReVISE (Our Implementation) | g_45000_revise |
The vocoders are pre-trained on a resampled version (16 kHz) of the LJSpeech dataset.
Models | Name |
---|---|
HiFiGAN (Fine-tuned For DiVISe) | model_hfgfinetuned.pt |
HiFiGAN (Pre-trained only) | model_hfgpretrained.pt |
Unit-HiFiGAN (For ReVISE) | model_unithfg.pt |
- Full samples for the LRS3 test set can be downloaded here. Files with the postfix `_vc` are generated with DiVISe as reported in the paper, while those with the postfix `_gf` are synthesized with Griffin-Lim for comparison.
- We also provide a simple demo page here.
Special thanks to HiFi-GAN and AV-HuBERT, on which this repository is built. We also appreciate all other works mentioned in this repository.