
A-DMA: Accelerating Diffusion-based Text-to-Speech Model Training with Dual Modality Alignment

Jeongsoo Choi1*, Zhikang Niu2,3*, Ji-Hoon Kim1, Chunhui Wang4, Joon Son Chung1, Xie Chen2,3

1 School of Electrical Engineering, KAIST, South Korea
2 MoE Key Lab of Artificial Intelligence, X-LANCE Lab, School of Computer Science, Shanghai Jiao Tong University, China
3 Shanghai Innovation Institute, China
4 Geely, China

* Equal contribution

📜 News

🚀 [2025.5] We release all the code to promote research on accelerating diffusion-based TTS models.

🚀 [2025.5.19] Our paper has been accepted to Interspeech 2025; we hope to see you at the conference!

💡 Highlights

  1. Dual Modality Alignment: A novel training paradigm that aligns the text and audio modalities in a dual manner, enhancing the model's ability to generate fluent and faithful speech.
  2. Plug and Play for diffusion-based TTS: A-DMA can be easily integrated into existing diffusion-based TTS models, providing a simple yet effective way to improve their performance.
  3. Accelerated Training: A-DMA significantly accelerates the convergence of diffusion-based TTS models and maintains the same inference speed as the original models.
  4. High-Quality Speech Generation: A-DMA achieves high-quality speech generation with improved fluency and faithfulness, making it suitable for various TTS applications.
  5. Open-Source: A-DMA is open-sourced to promote research in the field of TTS and to provide a baseline for future work.

🛠️ Usage

1. Install environment and dependencies

# We recommend using conda to create a new environment.
conda create -n adma python=3.10
conda activate adma

git clone https://github.com/ZhikangNiu/A-DMA.git
cd A-DMA

# Install PyTorch >= 2.2.0, e.g.,
pip install torch==2.4.0+cu124 torchaudio==2.4.0+cu124 --extra-index-url https://download.pytorch.org/whl/cu124

# Install editable version of A-DMA
pip install -e .
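
After installation, a quick sanity check such as the one below (a convenience snippet, not part of the repo) confirms that PyTorch >= 2.2.0 is installed and that CUDA is visible:

# quick environment sanity check (optional)
import torch, torchaudio

print("torch:", torch.__version__, "| torchaudio:", torchaudio.__version__)
print("CUDA available:", torch.cuda.is_available())
# fail early if the installed torch is older than the required 2.2.0
assert tuple(int(x) for x in torch.__version__.split("+")[0].split(".")[:2]) >= (2, 2)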

2. Docker usage (optional)

We also provide a Docker image for convenience.

# Build from Dockerfile
docker build -t adma:v1 .

# Or run the prebuilt image from the GitHub Container Registry
docker container run --rm -it --gpus=all --mount 'type=volume,source=f5-tts,target=/root/.cache/huggingface/hub/' -p 7860:7860 ghcr.io/zhikangniu/a-dma:main

3. Training

Our training process is based on F5-TTS. If you have any questions, please check its issues and README first.

Prepare datasets

You can follow the instructions to prepare datasets for training; our experiments are based on the LibriTTS dataset.

python src/f5_tts/train/datasets/prepare_libritts.py
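
If you want to verify the raw LibriTTS data before running the prepare script, an optional check using torchaudio's built-in LibriTTS dataset is sketched below. The root path and subset are illustrative assumptions and may differ from where prepare_libritts.py expects the data; with download=True the subset is fetched if it is not already present.

# optional sanity check of the raw LibriTTS download (not part of the repo's pipeline)
import os
import torchaudio

os.makedirs("data", exist_ok=True)                    # assumed data root
dataset = torchaudio.datasets.LIBRITTS("data/", url="train-clean-100", download=True)
waveform, sample_rate, text, *_ = dataset[0]          # (waveform, sr, original_text, ...)
print(sample_rate, waveform.shape, text[:60])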

Extract Speech Foundation Model's Features

We use a speech foundation model to extract features for speech alignment. After preparing the dataset, you can use the following command to extract features for the LibriTTS dataset.

bash extract_features.sh
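
For reference, the kind of extraction this script performs might look roughly like the sketch below, here using torchaudio's HuBERT-Base bundle. The actual speech foundation model, layer, and output format used by A-DMA are defined in extract_features.sh, so treat the model choice and file names here as assumptions.

# rough sketch of speech foundation model feature extraction (assumed model: HuBERT-Base)
import torch
import torchaudio

bundle = torchaudio.pipelines.HUBERT_BASE              # assumed speech foundation model
model = bundle.get_model().eval()

waveform, sr = torchaudio.load("sample.wav")           # a hypothetical LibriTTS utterance
waveform = waveform.mean(0, keepdim=True)              # force mono, shape (1, time)
if sr != bundle.sample_rate:
    waveform = torchaudio.functional.resample(waveform, sr, bundle.sample_rate)

with torch.inference_mode():
    features, _ = model.extract_features(waveform)     # list of per-layer features
torch.save(features[-1].squeeze(0), "sample.feat.pt")  # last layer, shape (frames, 768)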

Train the model

Once your datasets are prepared, you can start the training process.

# set up the accelerate config, e.g. multi-GPU DDP, fp16
# the config will be saved to: ~/.cache/huggingface/accelerate/default_config.yaml
accelerate config
# if you want to save the config to a specific path, you can use:
# accelerate config --config_file /path/to/config.yaml

accelerate launch src/f5_tts/train/train_adma.py --config-name F5TTS_v1_Small.yaml

# it is also possible to override the accelerate and hydra config
accelerate launch --mixed_precision=fp16 src/f5_tts/train/train.py --config-name F5TTS_v1_Base.yaml ++datasets.batch_size_per_gpu=19200

Read training & finetuning guidance for more instructions.

4. Evaluation

If you want to evaluate the model, please refer to the evaluation guidance for more details. Notably, A-DMA does not affect the original inference process, so it has the same RTF and inference speed as F5-TTS.
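
For context, RTF (real-time factor) is simply the wall-clock synthesis time divided by the duration of the generated audio; a value below 1 means faster than real time. A minimal sketch follows, where synthesize() is a placeholder standing in for the actual F5-TTS/A-DMA inference call and the sample rate is an assumption.

# minimal sketch of how RTF is typically computed for a TTS system
import time

def real_time_factor(synthesize, text, sample_rate=24000):
    start = time.perf_counter()
    audio = synthesize(text)              # placeholder: assumed to return a 1-D waveform
    elapsed = time.perf_counter() - start
    duration = len(audio) / sample_rate   # seconds of generated speech
    return elapsed / duration             # RTF; < 1.0 is faster than real time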

❤️ Acknowledgments

Our work is built upon the open-source project F5-TTS. Thanks to the authors for their great work; if you have any questions, please check the F5-TTS issues first.

✒️ Citation and License

Our code is released under the MIT License. If our work and codebase are useful to you, please cite:

@article{chen-etal-2024-f5tts,
  title={F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching},
  author={Yushen Chen and Zhikang Niu and Ziyang Ma and Keqi Deng and Chunhui Wang and Jian Zhao and Kai Yu and Xie Chen},
  journal={arXiv preprint arXiv:2410.06885},
  year={2024}
}
@article{choi2025accelerating,
  title={Accelerating Diffusion-based Text-to-Speech Model Training with Dual Modality Alignment},
  author={Choi, Jeongsoo and Niu, Zhikang and Kim, Ji-Hoon and Wang, Chunhui and Chung, Joon Son and Chen, Xie},
  journal={arXiv preprint arXiv:2505.19595},
  year={2025}
}
