```sh
git clone https://github.com/afrizalhasbi/vits
cd vits
uv pip install -r requirements.txt

# build the Cython monotonic alignment search extension
cd monotonic_align
mkdir monotonic_align
python setup.py build_ext --inplace
cd ..

# convert the original MMS discriminator checkpoint for your language, then launch fine-tuning
python convert_original_discriminator_checkpoint.py --language_code ind --pytorch_dump_folder_path mms_ind
accelerate launch run.py config.json
```
You can use a finetuned model via the Text-to-Speech (TTS) pipeline in just a few lines of code! Just replace `ylacombe/vits_ljs_welsh_female_monospeaker_2` with your own model id (`hub_model_id`) or the path to your model (`output_dir`).
```python
from transformers import pipeline
import scipy.io.wavfile

model_id = "ylacombe/vits_ljs_welsh_female_monospeaker_2"
synthesiser = pipeline("text-to-speech", model_id)  # add device=0 if you want to use a GPU

speech = synthesiser("Hello, my dog is cooler than you!")

scipy.io.wavfile.write("finetuned_output.wav", rate=speech["sampling_rate"], data=speech["audio"][0])
```
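If you prefer to call the model directly instead of going through the pipeline, here is a minimal sketch using the Transformers `VitsModel` and `AutoTokenizer` classes (the model id is the same placeholder as above; adjust it to your own checkpoint):

```python
import torch
import scipy.io.wavfile
from transformers import VitsModel, AutoTokenizer

model_id = "ylacombe/vits_ljs_welsh_female_monospeaker_2"
model = VitsModel.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer("Hello, my dog is cooler than you!", return_tensors="pt")

# generate the waveform; output shape is (batch_size, num_samples)
with torch.no_grad():
    waveform = model(**inputs).waveform

scipy.io.wavfile.write("finetuned_output.wav", rate=model.config.sampling_rate, data=waveform[0].numpy())
```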
Note that if your model required `uroman` for training, you should also apply the uroman package to your text inputs before passing them to the pipeline:
```python
import os
import subprocess

from transformers import pipeline
import scipy.io.wavfile

model_id = "facebook/mms-tts-kor"
synthesiser = pipeline("text-to-speech", model_id)  # add device=0 if you want to use a GPU


def uromanize(input_string, uroman_path):
    """Convert non-Roman strings to Roman using the `uroman` perl package."""
    script_path = os.path.join(uroman_path, "bin", "uroman.pl")

    command = ["perl", script_path]

    process = subprocess.Popen(command, stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    # Execute the perl command
    stdout, stderr = process.communicate(input=input_string.encode())

    if process.returncode != 0:
        raise ValueError(f"Error {process.returncode}: {stderr.decode()}")

    # Return the output as a string and skip the new-line character at the end
    return stdout.decode()[:-1]


text = "이봐 무슨 일이야"
uromanized_text = uromanize(text, uroman_path=os.environ["UROMAN"])

speech = synthesiser(uromanized_text)

scipy.io.wavfile.write("finetuned_output.wav", rate=speech["sampling_rate"], data=speech["audio"][0])
```
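To find out whether a given checkpoint expects uroman-ized inputs in the first place, the tokenizer exposes an `is_uroman` flag in recent versions of Transformers; a short sketch using the same Korean checkpoint as above:

```python
from transformers import VitsTokenizer

tokenizer = VitsTokenizer.from_pretrained("facebook/mms-tts-kor")
# True means the checkpoint was trained on romanized text, so run uroman on your inputs first
print(tokenizer.is_uroman)
```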
- VITS was proposed in 2021 in *Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech* by Jaehyeon Kim, Jungil Kong, and Juhee Son. You can find the original codebase here.
- MMS was proposed in *Scaling Speech Technology to 1,000+ Languages* by Vineel Pratap, Andros Tjandra, Bowen Shi, and co-authors. You can find more details about the supported languages and their ISO 639-3 codes in the MMS Language Coverage Overview, and see all MMS-TTS checkpoints on the Hugging Face Hub: facebook/mms-tts.
- Hugging Face 🤗 Transformers for the model integration, Hugging Face 🤗 Accelerate for the distributed code, and Hugging Face 🤗 Datasets for facilitating dataset access.
- @nivibilla's adaptation of the HiFi-GAN discriminator, used for English VITS training.