Jeongsoo Choi1*, Zhikang Niu2,3*, Ji-Hoon Kim1, Chunhui Wang4, Joon Son Chung1, Xie Chen2,3
1School of Electrical Engineering, KAIST, South Korea
2MoE Key Lab of Artificial Intelligence, X-LANCE Lab, School of Computer Science,
Shanghai Jiao Tong University, China
3Shanghai Innovation Institute, China
4Geely, China
*Equal contribution
🚀 [2025.5] We release all the code to promote research on accelerating diffusion-based TTS models.
🚀 [2025.5.19] Our paper has been accepted to Interspeech 2025. We hope to see you at the conference!
- Dual Modality Alignment: A novel training paradigm that aligns the text and audio modalities in a dual manner, enhancing the model's ability to generate fluent and faithful speech (see the illustrative sketch after this list).
- Plug and Play for diffusion-based TTS: A-DMA can be easily integrated into existing diffusion-based TTS models, providing a simple yet effective way to improve their performance.
- Accelerated Training: A-DMA significantly accelerates the convergence of diffusion-based TTS models and maintains the same inference speed as the original models.
- High-Quality Speech Generation: A-DMA achieves high-quality speech generation with improved fluency and faithfulness, making it suitable for various TTS applications.
- Open-Source: A-DMA is open-sourced to promote research in the field of TTS and to provide a baseline for future work.
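To make the plug-and-play claim concrete, below is a minimal, purely illustrative sketch of how a dual alignment objective could be attached to the base diffusion/flow-matching loss of an existing TTS model. The module names (`text_proj`, `speech_proj`), the batch keys, and the cosine-similarity losses are assumptions for illustration, not the actual A-DMA implementation.

# Illustrative sketch only: dual (text- and speech-side) alignment losses added
# to a diffusion/flow-matching TTS training step. Module names, batch keys, and
# loss forms are assumptions, not the actual A-DMA implementation.
import torch.nn.functional as F

def training_step(model, batch, align_weight=0.5):
    # Base objective of the backbone TTS model (e.g., the flow-matching loss),
    # here assumed to also return intermediate hidden states.
    base_loss, hidden = model(batch["mel"], batch["text"], return_hidden=True)

    # Text-side alignment: pull intermediate features toward text-derived targets.
    text_loss = 1 - F.cosine_similarity(
        model.text_proj(hidden), batch["text_feat"], dim=-1
    ).mean()

    # Speech-side alignment: pull intermediate features toward features from a
    # pretrained speech foundation model (extracted offline).
    speech_loss = 1 - F.cosine_similarity(
        model.speech_proj(hidden), batch["speech_feat"], dim=-1
    ).mean()

    # Only the training objective changes; the inference path is untouched.
    return base_loss + align_weight * (text_loss + speech_loss)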
# We recommend using conda to create a new environment.
conda create -n adma python=3.10
conda activate adma
git clone https://github.com/ZhikangNiu/A-DMA.git
cd A-DMA
# Install PyTorch >= 2.2.0, e.g.,
pip install torch==2.4.0+cu124 torchaudio==2.4.0+cu124 --extra-index-url https://download.pytorch.org/whl/cu124
# Install editable version of A-DMA
pip install -e .
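Optionally, you can verify the PyTorch installation and GPU visibility before moving on (a quick sanity check, not part of the official setup):

# Quick sanity check: confirm the installed PyTorch version and CUDA availability.
import torch

print(torch.__version__)          # should be >= 2.2.0
print(torch.cuda.is_available())  # True if the GPU build is set up correctly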
We provide a Docker image for easy use.
# Build from Dockerfile
docker build -t adma:v1 .
# Or run the pre-built image from the GitHub Container Registry
docker container run --rm -it --gpus=all --mount 'type=volume,source=f5-tts,target=/root/.cache/huggingface/hub/' -p 7860:7860 ghcr.io/zhikangniu/a-dma:main
Our training process is based on F5-TTS. If you have any questions, please check its issues and README first.
Follow the instructions below to prepare the datasets for training; our experiments are based on the LibriTTS dataset.
python src/f5_tts/train/datasets/prepare_libritts.py
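If you do not have LibriTTS locally yet, one option is to download a subset via torchaudio before running the preparation script; note that the download path below is an assumption, and prepare_libritts.py may expect a specific directory layout, so adjust paths accordingly.

# Optional: download a LibriTTS subset with torchaudio; adjust the root path to
# match what prepare_libritts.py expects (this download path is an assumption).
import torchaudio

dataset = torchaudio.datasets.LIBRITTS("./data", url="train-clean-100", download=True)
print(len(dataset))  # number of utterances in the downloaded subset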
We use a speech foundation model to extract features for speech alignment. After preparing the dataset, you can extract features for LibriTTS with the following command:
bash extract_features.sh
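For intuition, the feature-extraction step boils down to running audio through a pretrained speech model and caching the frame-level hidden states. The sketch below uses HuBERT via Hugging Face transformers as a stand-in; the actual model, layer, and output format are defined in extract_features.sh and may differ.

# Hypothetical sketch of speech-feature extraction; the real pipeline lives in
# extract_features.sh and may use a different model, layer, and output format.
import torch
import torchaudio
from transformers import HubertModel

model = HubertModel.from_pretrained("facebook/hubert-base-ls960").eval()

wav, sr = torchaudio.load("sample.wav")
wav = torchaudio.functional.resample(wav, sr, 16000)  # HuBERT expects 16 kHz audio

with torch.no_grad():
    feats = model(wav).last_hidden_state  # shape: (1, num_frames, hidden_dim)

torch.save(feats.squeeze(0), "sample_feat.pt")  # cache frame-level features to disk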
Once your datasets are prepared, you can start the training process.
# setup accelerate config, e.g. use multi-gpu ddp, fp16
# the config will be saved to: ~/.cache/huggingface/accelerate/default_config.yaml
accelerate config
# if you want to save the config to a specific path, you can use:
# accelerate config --config_file /path/to/config.yaml
accelerate launch src/f5_tts/train/train_adma.py --config-name F5TTS_v1_Small.yaml
# it is also possible to override the accelerate and hydra configs, e.g.
accelerate launch --mixed_precision=fp16 src/f5_tts/train/train.py --config-name F5TTS_v1_Base.yaml ++datasets.batch_size_per_gpu=19200
Read training & finetuning guidance for more instructions.
If you want to evaluate the model, please refer to the evaluation guidance for more details. Notably, A-DMA does not affect the original inference process, so the RTF and inference speed are the same as F5-TTS.
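If you want to verify this yourself, RTF is simply the wall-clock synthesis time divided by the duration of the generated audio. A minimal measurement sketch follows; the `synthesize` call and the returned sampling rate are placeholders for your actual inference entry point.

# Minimal RTF measurement sketch: RTF = synthesis time / generated audio duration.
# `synthesize` is a placeholder for your actual inference call.
import time

start = time.perf_counter()
audio, sample_rate = synthesize("Hello world")  # placeholder inference API
elapsed = time.perf_counter() - start

rtf = elapsed / (len(audio) / sample_rate)
print(f"RTF: {rtf:.3f}")  # lower is faster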
Our work is built upon the open-source project F5-TTS. Thanks to the authors for their great work; if you have any questions, please check the F5-TTS issues first.
Our code is released under the MIT License. If our work and codebase are useful to you, please cite:
@article{chen-etal-2024-f5tts,
title={F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching},
author={Yushen Chen and Zhikang Niu and Ziyang Ma and Keqi Deng and Chunhui Wang and Jian Zhao and Kai Yu and Xie Chen},
journal={arXiv preprint arXiv:2410.06885},
year={2024},
}
@article{choi2025accelerating,
title={Accelerating Diffusion-based Text-to-Speech Model Training with Dual Modality Alignment},
author={Choi, Jeongsoo and Niu, Zhikang and Kim, Ji-Hoon and Wang, Chunhui and Chung, Joon Son and Chen, Xie},
journal={arXiv preprint arXiv:2505.19595},
year={2025}
}