Muyan-TTS is a trainable TTS model designed for podcast applications within a $50,000 budget. It is pre-trained on over 100,000 hours of podcast audio data, enabling zero-shot TTS synthesis with high-quality voice generation. Furthermore, Muyan-TTS supports speaker adaptation with dozens of minutes of target speech, making it highly customizable for individual voices.
- April 29, 2025: We release the zero-shot TTS model weights of Muyan-TTS.
- April 29, 2025: We release the few-shot TTS model weights of Muyan-TTS-SFT, which is trained based on Muyan-TTS with dozens of minutes of a single speaker's speech.
- April 29, 2025: We release the training code from the base model to the SFT model for speaker adaptation.
- April 29, 2025: We release the technical report of Muyan-TTS.
Framework of Muyan-TTS. Left is an LLM that models the parallel corpus of text (in blue) and audio (in green) tokens. Right is a SoVITS model that decodes the generated audio tokens, as well as phonemes and speaker embeddings, into the audio waveform.
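As a rough sketch of that two-stage flow (the object names and method signatures below are illustrative placeholders, not the repository's actual API):

```python
# Illustrative only: a two-stage pipeline in the spirit of the figure above.
# `llm`, `decoder`, and their methods are hypothetical stand-ins, not the real API.

def synthesize(llm, decoder, text, phonemes, speaker_embedding):
    # Stage 1: the LLM maps input text tokens to a sequence of audio tokens,
    # learned from the parallel text/audio corpus.
    audio_tokens = llm.generate(text)
    # Stage 2: the SoVITS-style decoder turns audio tokens, phonemes, and the
    # speaker embedding into a waveform.
    return decoder.decode(audio_tokens, phonemes, speaker_embedding)
```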
Data processing pipeline. The final dataset comprises over 100,000 hours of high-quality speech and corresponding transcriptions, forming a robust parallel corpus suitable for TTS training in long-form audio scenarios such as podcasts.
| Training Cost | Data Processing | Pre-training of LLM | Training of Decoder | Total |
| --- | --- | --- | --- | --- |
| in GPU Hours | 60K (A10) | 19.2K (A100) | 1.34K (A100) | - |
| in USD | $30K | $19.2K | $1.34K | $50.54K |
Training costs of Muyan-TTS, assuming rental prices of $0.5 and $1 per GPU hour for the A10 and A100, respectively.
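As a sanity check, the dollar figures follow directly from the GPU-hour counts and the assumed rental prices:

```python
# Reproduce the cost table from the assumed rental prices ($0.5/h for A10, $1/h for A100).
a10_hours, a100_pretrain_hours, a100_decoder_hours = 60_000, 19_200, 1_340

data_processing  = a10_hours * 0.5            # $30,000
llm_pretraining  = a100_pretrain_hours * 1.0  # $19,200
decoder_training = a100_decoder_hours * 1.0   # $1,340

print(f"${data_processing + llm_pretraining + decoder_training:,.0f}")  # $50,540, i.e. ~$50.54K
```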
We denote r as the inference time needed to generate one second of audio and compare the synthesis speed with several open-source TTS models.
| Model | CosyVoice2 | Step-Audio | Spark-TTS | FireRedTTS | GPT-SoVITS v3 | Muyan-TTS |
| --- | --- | --- | --- | --- | --- | --- |
| r ↓ | 2.19 | 0.90 | 1.31 | 0.61 | 0.48 | 0.33 |
All inference was run on a single NVIDIA A100 (40G, PCIe) GPU, and the baseline models were evaluated using their official inference implementations.
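r can be reproduced for any model by timing synthesis and dividing by the duration of the audio produced. A minimal sketch, assuming a `synthesize` callable that returns a waveform array and its sample rate (not part of this repo):

```python
import time

def measure_r(synthesize, text):
    """Seconds of compute needed to generate one second of audio (lower is better)."""
    start = time.time()
    waveform, sample_rate = synthesize(text)  # assumed interface of the model under test
    elapsed = time.time() - start
    return elapsed / (len(waveform) / sample_rate)
```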
Note: Muyan-TTS only supports English input since the training data is heavily skewed toward English.
demo.mov
The three audios in the "Base model" column and the first audio in the "SFT model" column are synthesized by the open-source Muyan-TTS and Muyan-TTS-SFT models, respectively. The last two audios in the "SFT model" column are generated by SFT models trained separately on the base model, which are not publicly released.
```sh
git clone https://github.com/MYZY-AI/Muyan-TTS.git
cd Muyan-TTS

conda create -n muyan-tts python=3.10 -y
conda activate muyan-tts
make build
```
You need to install FFmpeg. If you're using Ubuntu, you can install it with the following command:
```sh
sudo apt update
sudo apt install ffmpeg
```
| Models | Links |
| --- | --- |
| Muyan-TTS | huggingface \| modelscope |
| Muyan-TTS-SFT | huggingface \| modelscope |
Additionally, you need to download the weights of chinese-hubert-base.
Place all the downloaded models in the `pretrained_models` directory. Your directory structure should look similar to the following:
```
pretrained_models
├── chinese-hubert-base
├── Muyan-TTS
└── Muyan-TTS-SFT
```
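A quick way to confirm the layout before running inference (directory names as listed above):

```python
from pathlib import Path

# Check that the three expected model directories are in place under pretrained_models/.
required = ["chinese-hubert-base", "Muyan-TTS", "Muyan-TTS-SFT"]
missing = [d for d in required if not (Path("pretrained_models") / d).is_dir()]
print("pretrained_models looks complete." if not missing else f"Missing directories: {missing}")
```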
```sh
python tts.py
```
This will synthesize speech through inference. The core code is as follows:
```python
async def main(model_type, model_path):
    tts = Inference(model_type, model_path, enable_vllm_acc=False)
    wavs = await tts.generate(
        ref_wav_path="assets/Claire.wav",
        prompt_text="Although the campaign was not a complete success, it did provide Napoleon with valuable experience and prestige.",
        text="Welcome to the captivating world of podcasts, let's embark on this exciting journey together."
    )
    output_path = "logs/tts.wav"
    with open(output_path, "wb") as f:
        f.write(next(wavs))
    print(f"Speech generated in {output_path}")
```
You need to specify the prompt speech, including the `ref_wav_path` and its `prompt_text`, and the `text` to be synthesized. The synthesized speech is saved by default to `logs/tts.wav`.
Additionally, you need to specify `model_type` as either `base` or `sft`, with the default being `base`. When you specify the `model_type` to be `base`, you can change the prompt speech to an arbitrary speaker for zero-shot TTS synthesis. When you specify the `model_type` to be `sft`, you need to keep the prompt speech unchanged, because the `sft` model is trained on Claire's voice.
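For instance, with `model_type` set to `base`, the `generate` call in `tts.py` can point at any reference clip of your own; the path and transcript below are placeholders, and `prompt_text` must match what is actually spoken in the clip:

```python
# Inside main() above, replace the generate(...) call with a prompt of your own speaker.
# The reference clip and transcript are hypothetical placeholders.
wavs = await tts.generate(
    ref_wav_path="assets/your_speaker.wav",
    prompt_text="Exact transcript of the reference clip.",
    text="Any English sentence you would like synthesized.",
)
```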
```sh
python api.py
```
Using the API mode automatically enables vLLM acceleration, and the above command will start a service on the default port `8020`. Additionally, LLM logs will be saved in `logs/llm.log`.

Similarly, you need to specify `model_type` as either `base` or `sft`, with the default being `base`. Note that the `model_path` should be consistent with your specified `model_type`.
You can send a request to the API using the example below:
```python
import time
import requests

TTS_PORT = 8020

payload = {
    "ref_wav_path": "assets/Claire.wav",
    "prompt_text": "Although the campaign was not a complete success, it did provide Napoleon with valuable experience and prestige.",
    "text": "Welcome to the captivating world of podcasts, let's embark on this exciting journey together.",
    "temperature": 0.6,
    "speed": 1.0,
}

start = time.time()

url = f"http://localhost:{TTS_PORT}/get_tts"
response = requests.post(url, json=payload)

audio_file_path = "logs/tts.wav"
with open(audio_file_path, "wb") as f:
    f.write(response.content)

print(time.time() - start)
```
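A small variation if you want HTTP errors surfaced instead of silently written into the output file (same `payload` and `TTS_PORT` as above):

```python
# Same request as above, but fail loudly on non-200 responses and bound the wait.
response = requests.post(f"http://localhost:{TTS_PORT}/get_tts", json=payload, timeout=600)
response.raise_for_status()
with open("logs/tts.wav", "wb") as f:
    f.write(response.content)
```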
By default, the synthesized speech will be saved at `logs/tts.wav`.
We use LibriSpeech as an example. You can use your own dataset instead, but you need to organize the data into the format shown in `data_process/examples`.
If you haven't downloaded LibriSpeech yet, you can download the dev-clean set using:
```sh
wget --no-check-certificate https://www.openslr.org/resources/12/dev-clean.tar.gz
```
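If you prefer unpacking the archive from Python rather than the shell, the standard-library `tarfile` module handles it directly; extraction produces a `LibriSpeech/dev-clean/...` directory tree:

```python
import tarfile

# Extract dev-clean.tar.gz into the current directory.
with tarfile.open("dev-clean.tar.gz", "r:gz") as archive:
    archive.extractall(".")
```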
After uncompressing the data, specify the `librispeech_dir` in `prepare_sft_dataset.py` to be the parent folder of your `LibriSpeech` path. Then run:
```sh
./train.sh
```
This will automatically process the data and generate `data/tts_sft_data.json`.
Note that we use a specific speaker ID of "3752" from the dev-clean set of LibriSpeech (which can be specified in `data_process/text_format_conversion.py`) as an example because its data size is relatively large. If you organize your own dataset for training, please prepare at least a dozen minutes of speech from the target speaker.
If an error occurs during the process, resolve the error, delete the existing contents of the `data` folder, and then rerun `train.sh`.
After generating `data/tts_sft_data.json`, `train.sh` will automatically copy it to `llama-factory/data` and add the following field to `dataset_info.json`:
"tts_sft_data": {
"file_name": "tts_sft_data.json"
}
Finally, it will automatically execute the `llamafactory-cli train` command to start training. You can adjust training settings using `training/sft.yaml`.

By default, the trained weights will be saved to `pretrained_models/Muyan-TTS-new-SFT`.
After training, you need to copy the `sovits.pth` of the base/sft model to your trained model path before inference:
```sh
cp pretrained_models/Muyan-TTS/sovits.pth pretrained_models/Muyan-TTS-new-SFT
```
You can directly deploy your trained model using the API tool above. During inference, you need to specify the `model_type` to be `sft` and replace the `ref_wav_path` and `prompt_text` with a sample of the voice of the speaker you trained on.
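For example, after training on your own speaker, the API payload only changes in the prompt fields; the reference clip path and its transcript below are placeholders for your target speaker's data:

```python
import requests

# Query the API served from the newly trained SFT weights (default port 8020).
payload = {
    "ref_wav_path": "assets/your_speaker.wav",            # placeholder reference clip
    "prompt_text": "Exact transcript of your_speaker.wav.",  # must match the clip
    "text": "Welcome to the captivating world of podcasts, let's embark on this exciting journey together.",
    "temperature": 0.6,
    "speed": 1.0,
}

response = requests.post("http://localhost:8020/get_tts", json=payload)
with open("logs/tts_custom_speaker.wav", "wb") as f:
    f.write(response.content)
```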
The model is trained based on Llama-3.2-3B.
We borrow a lot of code from GPT-SoVITS.
We borrow a lot of code from LLaMA-Factory.