afrizalhasbi/vits: Finetune VITS and MMS using HuggingFace's tools


Installation

git clone https://github.com/afrizalhasbi/vits
cd vits
uv pip install -r requirements.txt
cd monotonic_align
mkdir monotonic_align
python setup.py build_ext --inplace
cd ..
python convert_original_discriminator_checkpoint.py --language_code ind --pytorch_dump_folder_path mms_ind
accelerate launch run.py config.json
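Before launching training, it can be worth checking that config.json parses cleanly, since a malformed file will only fail after accelerate has spun up. A minimal stdlib sketch (the helper name and the example key are illustrative, not part of this repo's schema):

```python
import json
from pathlib import Path

def check_config(path="config.json"):
    """Parse the training config and fail fast with a readable error."""
    cfg = json.loads(Path(path).read_text())  # raises if the file is not valid JSON
    if not isinstance(cfg, dict):
        raise ValueError(f"{path} should contain a JSON object, got {type(cfg).__name__}")
    return sorted(cfg)  # top-level keys, for a quick visual sanity check

# Example: print(check_config("config.json"))
```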

Inference

You can use a finetuned model via the Text-to-Speech (TTS) pipeline in just a few lines of code. Replace ylacombe/vits_ljs_welsh_female_monospeaker_2 with your own model id (hub_model_id) or the path to your model (output_dir).

from transformers import pipeline
import scipy

model_id = "ylacombe/vits_ljs_welsh_female_monospeaker_2"
synthesiser = pipeline("text-to-speech", model_id) # add device=0 if you want to use a GPU

speech = synthesiser("Hello, my dog is cooler than you!")

scipy.io.wavfile.write("finetuned_output.wav", rate=speech["sampling_rate"], data=speech["audio"][0])

Note that if your model required uroman during training, you should also apply the uroman package to your text inputs before passing them to the pipeline:

import os
import subprocess
from transformers import pipeline
import scipy

model_id = "facebook/mms-tts-kor"
synthesiser = pipeline("text-to-speech", model_id) # add device=0 if you want to use a GPU

def uromanize(input_string, uroman_path):
    """Convert non-Roman strings to Roman using the `uroman` perl package."""
    script_path = os.path.join(uroman_path, "bin", "uroman.pl")

    command = ["perl", script_path]

    process = subprocess.Popen(command, stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    # Execute the perl command
    stdout, stderr = process.communicate(input=input_string.encode())

    if process.returncode != 0:
        raise ValueError(f"Error {process.returncode}: {stderr.decode()}")

    # Return the output as a string and skip the new-line character at the end
    return stdout.decode()[:-1]

text = "이봐 무슨 일이야"  # Korean: "Hey, what's going on?"
uromanized_text = uromanize(text, uroman_path=os.environ["UROMAN"])

speech = synthesiser(uromanized_text)

scipy.io.wavfile.write("finetuned_output.wav", rate=speech["sampling_rate"], data=speech["audio"][0])
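Whether a given input actually needs romanization can be checked up front. Below is a minimal stdlib heuristic; the helper name and logic are my own, not part of this repo (for MMS checkpoints, the transformers tokenizer also exposes an is_uroman flag you can consult instead):

```python
import unicodedata

def contains_non_roman(text: str) -> bool:
    """Heuristic: True if any alphabetic character lies outside the Latin
    script, in which case the text should be uromanized first."""
    for char in text:
        if char.isalpha() and "LATIN" not in unicodedata.name(char, ""):
            return True
    return False

print(contains_non_roman("Hello, my dog is cooler than you!"))  # False
print(contains_non_roman("이봐 무슨 일이야"))  # True
```

Accented Latin characters (e.g. "é") are still classified as Latin, so only genuinely non-Roman scripts such as Hangul or Cyrillic trigger the uroman step.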

Acknowledgements
