This repository contains the implementation for SESD, a latent diffusion model for text-to-speech generation. We introduced this model in the following work:
Sample-Efficient Diffusion for Text-To-Speech Synthesis
Justin Lovelace, Soham Ray, Kwangyoun Kim, Kilian Q. Weinberger, Felix Wu
Interspeech 2024
[paper]
This work introduces Sample-Efficient Speech Diffusion (SESD), an algorithm for effective speech synthesis in modest data regimes through latent diffusion. It is based on a novel diffusion architecture, the U-Audio Transformer (U-AT), which efficiently scales to long sequences and operates in the latent space of a pre-trained audio autoencoder. Conditioned on character-aware language model representations, SESD achieves impressive results despite training on less than 1k hours of speech, far less than current state-of-the-art systems. In fact, it synthesizes more intelligible speech than the state-of-the-art autoregressive model, VALL-E, while using less than 2% of the training data.
If you find this work useful, please consider citing:
@inproceedings{lovelace2024sample,
  title={Sample-Efficient Diffusion for Text-To-Speech Synthesis},
  author={Lovelace, Justin and Ray, Soham and Kim, Kwangyoun and Weinberger, Kilian Q and Wu, Felix},
  booktitle={Proc. Interspeech 2024},
  pages={4403--4407},
  year={2024}
}
Install the required dependencies:
pip install -r requirements.txt
We train our models on the LibriSpeech dataset from the Hugging Face Hub and evaluate on the standard LibriSpeech test sets.
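As a rough illustration (not the repository's own data-loading code, which lives under audio_datasets/), LibriSpeech can be pulled from the Hub with the Hugging Face datasets library:

```python
from datasets import load_dataset

# Illustrative only: load the 100-hour clean training split of LibriSpeech
# from the Hugging Face Hub. The repository's loaders may use different
# configs/splits and additional preprocessing.
train_set = load_dataset("librispeech_asr", "clean", split="train.100")

# Each example contains the raw waveform, sampling rate, and transcript.
sample = train_set[0]
print(sample["audio"]["sampling_rate"], sample["text"])
```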
For speaker-prompted generation, we use a three-second speech prompt taken from another utterance. To extract the transcript corresponding to the prompt audio, we used the Montreal Forced Aligner. An aligned version of LibriSpeech can be found at:
data/aligned_librispeech.tar.gz
After extracting the archive, update the ALIGNED_DATA_DIR path in audio_datasets/constants.py to point to your data directory.
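The constant only needs to point at the extracted directory. A minimal illustration is shown below; the actual constants.py may define additional settings, and the path here is a placeholder:

```python
# audio_datasets/constants.py (illustrative excerpt; the real file may differ)

# Directory produced by extracting data/aligned_librispeech.tar.gz,
# e.g. via `tar -xzf data/aligned_librispeech.tar.gz`.
ALIGNED_DATA_DIR = "/path/to/aligned_librispeech"
```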
Our training setup:
- Single Nvidia A6000 GPU
- BF16 mixed precision training (see the sketch after this list)
- Batch size and other parameters may need adjustment based on your hardware
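For reference, here is a minimal sketch of a BF16 mixed-precision training step in PyTorch. This is an assumed illustration, not the repository's actual training loop; `model`, `batch`, and `optimizer` are placeholders for the diffusion model, a data batch, and its optimizer.

```python
import torch

def training_step(model, batch, optimizer):
    optimizer.zero_grad()
    # Run the forward pass in bfloat16 via autocast.
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        loss = model(batch)
    # Unlike fp16, bfloat16 generally does not require gradient scaling.
    loss.backward()
    optimizer.step()
    return loss.item()
```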
To train the diffusion model:
./scripts/train/train.sh
Model checkpoint will be released soon!
To synthesize speech for the LibriSpeech test-clean set:
./scripts/sample/sample_16_ls_testclean.sh
Note: Update the --resume_dir argument with the path to your trained model.
Feel free to create an issue if you have any questions.
This work builds upon excellent open-source implementations from Phil Wang (Lucidrains). In particular, we built on his PyTorch DDPM implementation.