Muyan-TTS is a trainable TTS model designed for podcast applications within a $50,000 budget. It is pre-trained on over 100,000 hours of podcast audio data, enabling zero-shot TTS synthesis with high-quality voice generation. Furthermore, Muyan-TTS supports speaker adaptation with dozens of minutes of target speech, making it highly customizable for individual voices.
- April 29, 2025: We release the zero-shot TTS model weights of Muyan-TTS.
- April 29, 2025: We release the few-shot TTS model weights of Muyan-TTS-SFT, which is trained based on Muyan-TTS with dozens of minutes of a single speaker's speech.
- April 29, 2025: We release the training code from the base model to the SFT model for speaker adaptation.
- April 29, 2025: We release the technical report of Muyan-TTS.
Framework of Muyan-TTS. Left is an LLM that models the parallel corpus of text (in blue) and audio (in green) tokens. Right is a SoVITS model that decodes the generated audio tokens, as well as phonemes and speaker embeddings, into the audio waveform.
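As a rough sketch of that two-stage flow (the object names and method signatures below are illustrative placeholders, not the repository's actual API):

```python
# Illustrative only: a two-stage pipeline in the spirit of the figure above.
# `llm`, `decoder`, and their methods are hypothetical stand-ins, not the real API.

def synthesize(llm, decoder, text, phonemes, speaker_embedding):
    # Stage 1: the LLM maps input text tokens to a sequence of audio tokens,
    # learned from the parallel text/audio corpus.
    audio_tokens = llm.generate(text)
    # Stage 2: the SoVITS-style decoder turns audio tokens, phonemes, and the
    # speaker embedding into a waveform.
    return decoder.decode(audio_tokens, phonemes, speaker_embedding)
```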
Data processing pipeline. The final dataset comprises over 100,000 hours of high-quality speech and corresponding transcriptions, forming a robust parallel corpus suitable for TTS training in long-form audio scenarios such as podcasts.
| Training Cost | Data Processing | Pre-training of LLM | Training of Decoder | Total |
| --- | --- | --- | --- | --- |
| in GPU Hours | 60K (A10) | 19.2K (A100) | 1.34K (A100) | - |
| in USD | $30K | $19.2K | $1.34K | $50.54K |
Training costs of Muyan-TTS, assuming rental prices of $0.5 and $1 per GPU hour for the A10 and A100, respectively.
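As a sanity check, the dollar figures follow directly from the GPU-hour counts and the assumed rental prices:

```python
# Reproduce the cost table from the assumed rental prices ($0.5/h for A10, $1/h for A100).
a10_hours, a100_pretrain_hours, a100_decoder_hours = 60_000, 19_200, 1_340

data_processing  = a10_hours * 0.5            # $30,000
llm_pretraining  = a100_pretrain_hours * 1.0  # $19,200
decoder_training = a100_decoder_hours * 1.0   # $1,340

print(f"${data_processing + llm_pretraining + decoder_training:,.0f}")  # $50,540, i.e. ~$50.54K
```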
We denote r as the inference time needed to generate one second of audio and compare the synthesis speed with several open-source TTS models.
| Model | CosyVoice2 | Step-Audio | Spark-TTS | FireRedTTS | GPT-SoVITS v3 | Muyan-TTS |
| --- | --- | --- | --- | --- | --- | --- |
| r ↓ | 2.19 | 0.90 | 1.31 | 0.61 | 0.48 | 0.33 |
All inference was run on a single NVIDIA A100 (40G, PCIe) GPU, and the baseline models were evaluated using their official inference implementations.
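r can be reproduced for any model by timing synthesis and dividing by the duration of the audio produced. A minimal sketch, assuming a `synthesize` callable that returns a waveform array and its sample rate (not part of this repo):

```python
import time

def measure_r(synthesize, text):
    """Seconds of compute needed to generate one second of audio (lower is better)."""
    start = time.time()
    waveform, sample_rate = synthesize(text)  # assumed interface of the model under test
    elapsed = time.time() - start
    return elapsed / (len(waveform) / sample_rate)
```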
Note: Muyan-TTS only supports English input since the training data is heavily skewed toward English.
demo.mov
The three audios in the "Base model" column and the first audio in the "SFT model" column are synthesized by the open-source Muyan-TTS and Muyan-TTS-SFT models, respectively. The last two audios in the "SFT model" column are generated by SFT models trained separately on the base model, which are not publicly released.
```sh
git clone https://github.com/MYZY-AI/Muyan-TTS.git
cd Muyan-TTS

conda create -n muyan-tts python=3.10 -y
conda activate muyan-tts
make build
```
You need to install FFmpeg. If you're using Ubuntu, you can install it with the following command:
```sh
sudo apt update
sudo apt install ffmpeg
```
| Models | Links |
| --- | --- |
| Muyan-TTS | huggingface \| modelscope |
| Muyan-TTS-SFT | huggingface \| modelscope |
Additionally, you need to download the weights of chinese-hubert-base.
Place all the downloaded models in the `pretrained_models` directory. Your directory structure should look similar to the following:
```
pretrained_models
├── chinese-hubert-base
├── Muyan-TTS
└── Muyan-TTS-SFT
```
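A quick way to confirm the layout before running inference (directory names as listed above):

```python
from pathlib import Path

# Check that the three expected model directories are in place under pretrained_models/.
required = ["chinese-hubert-base", "Muyan-TTS", "Muyan-TTS-SFT"]
missing = [d for d in required if not (Path("pretrained_models") / d).is_dir()]
print("pretrained_models looks complete." if not missing else f"Missing directories: {missing}")
```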
```sh
python tts.py
```
This will synthesize speech through inference. The core code is as follows:
```python
async def main(model_type, model_path):
    tts = Inference(model_type, model_path, enable_vllm_acc=False)
    wavs = await tts.generate(
        ref_wav_path="assets/Claire.wav",
        prompt_text="Although the campaign was not a complete success, it did provide Napoleon with valuable experience and prestige.",
        text="Welcome to the captivating world of podcasts, let's embark on this exciting journey together."
    )
    output_path = "logs/tts.wav"
    with open(output_path, "wb") as f:
        f.write(next(wavs))
    print(f"Speech generated in {output_path}")
```
You need to specify the prompt speech, including the `ref_wav_path` and its `prompt_text`, and the `text` to be synthesized. The synthesized speech is saved by default to `logs/tts.wav`.
Additionally, you need to specify `model_type` as either `base` or `sft`, with the default being `base`. When you specify the `model_type` to be `base`, you can change the prompt speech to an arbitrary speaker for zero-shot TTS synthesis. When you specify the `model_type` to be `sft`, you need to keep the prompt speech unchanged, because the `sft` model is trained on Claire's voice.
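For instance, with `model_type` set to `base`, the `generate` call in `tts.py` can point at any reference clip of your own; the path and transcript below are placeholders, and `prompt_text` must match what is actually spoken in the clip:

```python
# Inside main() above, replace the generate(...) call with a prompt of your own speaker.
# The reference clip and transcript are hypothetical placeholders.
wavs = await tts.generate(
    ref_wav_path="assets/your_speaker.wav",
    prompt_text="Exact transcript of the reference clip.",
    text="Any English sentence you would like synthesized.",
)
```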
```sh
python api.py
```
Using the API mode automatically enables vLLM acceleration, and the above command will start a service on the default port `8020`. Additionally, LLM logs will be saved in `logs/llm.log`.

Similarly, you need to specify `model_type` as either `base` or `sft`, with the default being `base`. Note that the `model_path` should be consistent with your specified `model_type`.
You can send a request to the API using the example below:
```python
import time
import requests

TTS_PORT = 8020

payload = {
    "ref_wav_path": "assets/Claire.wav",
    "prompt_text": "Although the campaign was not a complete success, it did provide Napoleon with valuable experience and prestige.",
    "text": "Welcome to the captivating world of podcasts, let's embark on this exciting journey together.",
    "temperature": 0.6,
    "speed": 1.0,
}

start = time.time()

url = f"http://localhost:{TTS_PORT}/get_tts"
response = requests.post(url, json=payload)

audio_file_path = "logs/tts.wav"
with open(audio_file_path, "wb") as f:
    f.write(response.content)

print(time.time() - start)
```
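A small variation if you want HTTP errors surfaced instead of silently written into the output file (same `payload` and `TTS_PORT` as above):

```python
# Same request as above, but fail loudly on non-200 responses and bound the wait.
response = requests.post(f"http://localhost:{TTS_PORT}/get_tts", json=payload, timeout=600)
response.raise_for_status()
with open("logs/tts.wav", "wb") as f:
    f.write(response.content)
```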
By default, the synthesized speech will be saved at `logs/tts.wav`.
We use LibriSpeech as an example. You can use your own dataset instead, but you need to organize the data into the format shown in `data_process/examples`.
If you haven't downloaded LibriSpeech yet, you can download the dev-clean set using:
```sh
wget --no-check-certificate https://www.openslr.org/resources/12/dev-clean.tar.gz
```
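If you prefer unpacking the archive from Python rather than the shell, the standard-library `tarfile` module handles it directly; extraction produces a `LibriSpeech/dev-clean/...` directory tree:

```python
import tarfile

# Extract dev-clean.tar.gz into the current directory.
with tarfile.open("dev-clean.tar.gz", "r:gz") as archive:
    archive.extractall(".")
```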
After uncompressing the data, specify the `librispeech_dir` in `prepare_sft_dataset.py` to be the parent folder of your `LibriSpeech` path. Then run:
```sh
./train.sh
```
This will automatically process the data and generate `data/tts_sft_data.json`.
Note that we use a specific speaker ID of "3752" from the dev-clean set of LibriSpeech (which can be specified in `data_process/text_format_conversion.py`) as an example because its data size is relatively large. If you organize your own dataset for training, please prepare at least a dozen minutes of speech from the target speaker.
If an error occurs during the process, resolve the error, delete the existing contents of the `data` folder, and then rerun `train.sh`.
After generating `data/tts_sft_data.json`, `train.sh` will automatically copy it to `llama-factory/data` and add the following field to `dataset_info.json`:
"tts_sft_data": {
"file_name": "tts_sft_data.json"
}
Finally, it will automatically execute the `llamafactory-cli train` command to start training. You can adjust training settings using `training/sft.yaml`.

By default, the trained weights will be saved to `pretrained_models/Muyan-TTS-new-SFT`.
After training, you need to copy the `sovits.pth` of the base/sft model to your trained model path before inference:
```sh
cp pretrained_models/Muyan-TTS/sovits.pth pretrained_models/Muyan-TTS-new-SFT
```
You can directly deploy your trained model using the API tool above. During inference, you need to specify the `model_type` to be `sft` and replace the `ref_wav_path` and `prompt_text` with a sample of the voice of the speaker you trained on.
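For example, after training on your own speaker, the API payload only changes in the prompt fields; the reference clip path and its transcript below are placeholders for your target speaker's data:

```python
import requests

# Query the API served from the newly trained SFT weights (default port 8020).
payload = {
    "ref_wav_path": "assets/your_speaker.wav",            # placeholder reference clip
    "prompt_text": "Exact transcript of your_speaker.wav.",  # must match the clip
    "text": "Welcome to the captivating world of podcasts, let's embark on this exciting journey together.",
    "temperature": 0.6,
    "speed": 1.0,
}

response = requests.post("http://localhost:8020/get_tts", json=payload)
with open("logs/tts_custom_speaker.wav", "wb") as f:
    f.write(response.content)
```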
The model is trained based on Llama-3.2-3B.
We borrow a lot of code from GPT-SoVITS.
We borrow a lot of code from LLaMA-Factory.