This repository contains the implementation for SESD, a latent diffusion model for text-to-speech generation. We introduced this model in the following work:
Sample-Efficient Diffusion for Text-To-Speech Synthesis
Justin Lovelace, Soham Ray, Kwangyoun Kim, Kilian Q. Weinberger, Felix Wu
Interspeech 2024
[paper]
This work introduces Sample-Efficient Speech Diffusion (SESD), an algorithm for effective speech synthesis in modest data regimes through latent diffusion. It is based on a novel diffusion architecture, the U-Audio Transformer (U-AT), which efficiently scales to long sequences and operates in the latent space of a pre-trained audio autoencoder. Conditioned on character-aware language model representations, SESD achieves impressive results despite training on less than 1k hours of speech, far less than current state-of-the-art systems. In fact, it synthesizes more intelligible speech than the state-of-the-art autoregressive model, VALL-E, while using less than 2% of the training data.
If you find this work useful, please consider citing:
@inproceedings{lovelace2024sample,
  title={Sample-Efficient Diffusion for Text-To-Speech Synthesis},
  author={Lovelace, Justin and Ray, Soham and Kim, Kwangyoun and Weinberger, Kilian Q and Wu, Felix},
  booktitle={Proc. Interspeech 2024},
  pages={4403--4407},
  year={2024}
}
Install the required dependencies:
pip install -r requirements.txt
We train our models on the LibriSpeech dataset from the Hugging Face Hub and evaluate on the standard LibriSpeech test sets.
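As a rough illustration (not the repository's own data-loading code, which lives under audio_datasets/), LibriSpeech can be pulled from the Hub with the Hugging Face datasets library:

```python
from datasets import load_dataset

# Illustrative only: load the 100-hour clean training split of LibriSpeech
# from the Hugging Face Hub. The repository's loaders may use different
# configs/splits and additional preprocessing.
train_set = load_dataset("librispeech_asr", "clean", split="train.100")

# Each example contains the raw waveform, sampling rate, and transcript.
sample = train_set[0]
print(sample["audio"]["sampling_rate"], sample["text"])
```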
For speaker-prompted generation, we use a three-second speech prompt taken from another utterance. To extract the transcript corresponding to the prompt audio, we used the Montreal Forced Aligner. An aligned version of LibriSpeech can be found at:
data/aligned_librispeech.tar.gz
After extracting the archive, update the ALIGNED_DATA_DIR path in audio_datasets/constants.py to point to your data directory.
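The constant only needs to point at the extracted directory. A minimal illustration is shown below; the actual constants.py may define additional settings, and the path here is a placeholder:

```python
# audio_datasets/constants.py (illustrative excerpt; the real file may differ)

# Directory produced by extracting data/aligned_librispeech.tar.gz,
# e.g. via `tar -xzf data/aligned_librispeech.tar.gz`.
ALIGNED_DATA_DIR = "/path/to/aligned_librispeech"
```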
Our training setup:
- Single Nvidia A6000 GPU
- BF16 mixed precision training (see the sketch after this list)
- Batch size and other parameters may need adjustment based on your hardware
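For reference, here is a minimal sketch of a BF16 mixed-precision training step in PyTorch. This is an assumed illustration, not the repository's actual training loop; `model`, `batch`, and `optimizer` are placeholders for the diffusion model, a data batch, and its optimizer.

```python
import torch

def training_step(model, batch, optimizer):
    optimizer.zero_grad()
    # Run the forward pass in bfloat16 via autocast.
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        loss = model(batch)
    # Unlike fp16, bfloat16 generally does not require gradient scaling.
    loss.backward()
    optimizer.step()
    return loss.item()
```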
To train the diffusion model:
./scripts/train/train.sh
Model checkpoint will be released soon!
To synthesize speech for the LibriSpeech test-clean set:
./scripts/sample/sample_16_ls_testclean.sh
Note: Update the --resume_dir argument with the path to your trained model.
Feel free to create an issue if you have any questions.
This work builds upon excellent open-source implementations from Phil Wang (Lucidrains). In particular, we built on his PyTorch DDPM implementation.