
A-DMA: Accelerating Diffusion-based Text-to-Speech Model Training with Dual Modality Alignment

Jeongsoo Choi1*, Zhikang Niu2,3*, Ji-Hoon Kim1, Chunhui Wang4, Joon Son Chung1, Xie Chen2,3

1 School of Electrical Engineering, KAIST, South Korea
2 MoE Key Lab of Artificial Intelligence, X-LANCE Lab, School of Computer Science, Shanghai Jiao Tong University, China
3 Shanghai Innovation Institute, China
4 Geely, China

* Equal contribution

📜 News

🚀 [2025.5] We release all the code to promote research on accelerating diffusion-based TTS models.

🚀 [2025.5.19] Our paper has been accepted to Interspeech 2025; we hope to see you at the conference!

💡 Highlights

  1. Dual Modality Alignment: A novel training paradigm that aligns the text and audio modalities in a dual manner, enhancing the model's ability to generate fluent and faithful speech.
  2. Plug and Play for diffusion-based TTS: A-DMA can be easily integrated into existing diffusion-based TTS models, providing a simple yet effective way to improve their performance.
  3. Accelerated Training: A-DMA significantly accelerates the convergence of diffusion-based TTS models and maintains the same inference speed as the original models.
  4. High-Quality Speech Generation: A-DMA achieves high-quality speech generation with improved fluency and faithfulness, making it suitable for various TTS applications.
  5. Open-Source: A-DMA is open-sourced to promote research in the field of TTS and to provide a baseline for future work.

🛠️ Usage

1. Install environment and dependencies

# We recommend using conda to create a new environment.
conda create -n adma python=3.10
conda activate adma

git clone https://github.com/ZhikangNiu/A-DMA.git
cd A-DMA

# Install PyTorch >= 2.2.0, e.g.,
pip install torch==2.4.0+cu124 torchaudio==2.4.0+cu124 --extra-index-url https://download.pytorch.org/whl/cu124

# Install editable version of A-DMA
pip install -e .
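
After installation, a quick sanity check such as the one below (a convenience snippet, not part of the repo) confirms that PyTorch >= 2.2.0 is installed and that CUDA is visible:

# quick environment sanity check (optional)
import torch, torchaudio

print("torch:", torch.__version__, "| torchaudio:", torchaudio.__version__)
print("CUDA available:", torch.cuda.is_available())
# fail early if the installed torch is older than the required 2.2.0
assert tuple(int(x) for x in torch.__version__.split("+")[0].split(".")[:2]) >= (2, 2)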

2. Docker usage (optional)

We also provide a Docker image for convenience.

# Build from Dockerfile
docker build -t adma:v1 .

# Or run the prebuilt image from the GitHub Container Registry
docker container run --rm -it --gpus=all --mount 'type=volume,source=f5-tts,target=/root/.cache/huggingface/hub/' -p 7860:7860 ghcr.io/zhikangniu/a-dma:main

3. Training

Our training process is based on F5-TTS. If you have any questions, please check its issues and README first.

Prepare datasets

You can follow the instructions to prepare datasets for training; our experiments are based on the LibriTTS dataset.

python src/f5_tts/train/datasets/prepare_libritts.py
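
If you want to verify the raw LibriTTS data before running the prepare script, an optional check using torchaudio's built-in LibriTTS dataset is sketched below. The root path and subset are illustrative assumptions and may differ from where prepare_libritts.py expects the data; with download=True the subset is fetched if it is not already present.

# optional sanity check of the raw LibriTTS download (not part of the repo's pipeline)
import os
import torchaudio

os.makedirs("data", exist_ok=True)                    # assumed data root
dataset = torchaudio.datasets.LIBRITTS("data/", url="train-clean-100", download=True)
waveform, sample_rate, text, *_ = dataset[0]          # (waveform, sr, original_text, ...)
print(sample_rate, waveform.shape, text[:60])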

Extract Speech Foundation Model's Features

We use a speech foundation model to extract features for speech alignment. After preparing the dataset, you can use the following command to extract features for the LibriTTS dataset.

bash extract_features.sh
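
For reference, the kind of extraction this script performs might look roughly like the sketch below, here using torchaudio's HuBERT-Base bundle. The actual speech foundation model, layer, and output format used by A-DMA are defined in extract_features.sh, so treat the model choice and file names here as assumptions.

# rough sketch of speech foundation model feature extraction (assumed model: HuBERT-Base)
import torch
import torchaudio

bundle = torchaudio.pipelines.HUBERT_BASE              # assumed speech foundation model
model = bundle.get_model().eval()

waveform, sr = torchaudio.load("sample.wav")           # a hypothetical LibriTTS utterance
waveform = waveform.mean(0, keepdim=True)              # force mono, shape (1, time)
if sr != bundle.sample_rate:
    waveform = torchaudio.functional.resample(waveform, sr, bundle.sample_rate)

with torch.inference_mode():
    features, _ = model.extract_features(waveform)     # list of per-layer features
torch.save(features[-1].squeeze(0), "sample.feat.pt")  # last layer, shape (frames, 768)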

Train the model

Once your datasets are prepared, you can start the training process.

# set up the accelerate config, e.g. multi-GPU DDP, fp16
# the config will be saved to: ~/.cache/huggingface/accelerate/default_config.yaml
accelerate config
# if you want to save the config to a specific path, you can use:
# accelerate config --config_file /path/to/config.yaml

accelerate launch src/f5_tts/train/train_adma.py --config-name F5TTS_v1_Small.yaml

# it is also possible to override the accelerate and hydra config
accelerate launch --mixed_precision=fp16 src/f5_tts/train/train.py --config-name F5TTS_v1_Base.yaml ++datasets.batch_size_per_gpu=19200

Read training & finetuning guidance for more instructions.

4. Evaluation

If you want to evaluate the model, please refer to the evaluation guidance for more details. Notably, A-DMA does not affect the original inference process, so it has the same RTF and inference speed as F5-TTS.
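
For context, RTF (real-time factor) is simply the wall-clock synthesis time divided by the duration of the generated audio; a value below 1 means faster than real time. A minimal sketch follows, where synthesize() is a placeholder standing in for the actual F5-TTS/A-DMA inference call and the sample rate is an assumption.

# minimal sketch of how RTF is typically computed for a TTS system
import time

def real_time_factor(synthesize, text, sample_rate=24000):
    start = time.perf_counter()
    audio = synthesize(text)              # placeholder: assumed to return a 1-D waveform
    elapsed = time.perf_counter() - start
    duration = len(audio) / sample_rate   # seconds of generated speech
    return elapsed / duration             # RTF; < 1.0 is faster than real time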

❤️ Acknowledgments

Our work is built upon the open-source project F5-TTS. Thanks to the authors for their great work; if you have any questions, please check the F5-TTS issues first.

✒️ Citation and License

Our code is released under the MIT License. If our work and codebase are useful to you, please cite:

@article{chen-etal-2024-f5tts,
  title={F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching},
  author={Yushen Chen and Zhikang Niu and Ziyang Ma and Keqi Deng and Chunhui Wang and Jian Zhao and Kai Yu and Xie Chen},
  journal={arXiv preprint arXiv:2410.06885},
  year={2024}
}
@article{choi2025accelerating,
  title={Accelerating Diffusion-based Text-to-Speech Model Training with Dual Modality Alignment},
  author={Choi, Jeongsoo and Niu, Zhikang and Kim, Ji-Hoon and Wang, Chunhui and Chung, Joon Son and Chen, Xie},
  journal={arXiv preprint arXiv:2505.19595},
  year={2025}
}
