Description
First of all, thank you very much for open-sourcing VoiceStar; it is great work. Our own ideas for improving on VoiceCraft are very similar to what you have posted, and your improvements go beyond what we tried. Since the paper has not been published yet, I have a few questions after reading the code.
1. Is the enc-dec structure the key to the improved model stability?
The biggest architectural change is moving from a decoder-only structure to encoder-decoder, with RoPE as the positional embedding. Many other state-of-the-art TTS systems also use enc-dec + cross-attention as their main structure, so is this the key point for improving stability? And does it slow down synthesis (is it much slower than decoder-only)?
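For context on the latency question, here is a minimal, dependency-free sketch of the cross-attention pattern used in enc-dec models (toy dimensions, not VoiceStar's actual implementation): each decoder step attends over all encoder outputs, which adds one attention pass per decoder layer compared to decoder-only, but the encoder itself runs only once per utterance.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def cross_attention(queries, keys, values):
    """Single-head cross-attention: each decoder query attends over
    all encoder keys/values. queries: [Tq][d], keys/values: [Tk][d]."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)
        # Weighted sum of encoder values -> one context vector per query.
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

# Toy example: 2 decoder steps attending over 3 encoder states (d=2).
enc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
dec_q = [[1.0, 0.0], [0.0, 1.0]]
ctx = cross_attention(dec_q, enc, enc)
```

The extra cost per generated token is O(Tq × Tk) attention on top of decoder self-attention, so in principle the slowdown is modest when the text encoder output is short relative to the audio token sequence.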
2. Does the released version support multilingual synthesis?
The `./data` folder contains the Emilia dataset, and `tokenizer.py` has a `PypinyinBackend` function, but when I test the gradio script the model doesn't synthesize valid Chinese. Does the model released at this stage support languages other than English?
3. Is the model compatible with speech editing?
Since the option `ttsonly` was added to `config.py` and the editing code path hasn't been removed (just wrapped in an if/else), is the released model a pure TTS model, or does it support both TTS and speech editing?
4. Is it useful to add extra losses like `ctc_loss`, `dur_loss`, and `entropy_loss`?
I saw these names in `trainer.py` but didn't find their implementations elsewhere. I have tried CTC loss in my own experiments and didn't observe much improvement, so I'd like to ask: have these extra losses been experimented with?
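To make the question concrete, here is one common form an `entropy_loss` could take: the Shannon entropy of the predicted token distribution, added to the loss as a regularizer. This is only my guess at what the name in `trainer.py` might mean, not its actual implementation:

```python
import math

def entropy_loss(logits):
    """Shannon entropy of the softmax distribution over logits.
    Adding it with a positive weight sharpens predictions; a negative
    weight smooths them. This is a guess at what `entropy_loss` in
    trainer.py might refer to, not code from the repo."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# Uniform logits give maximal entropy log(n); a peaked distribution gives ~0.
print(entropy_loss([0.0, 0.0, 0.0, 0.0]))   # log(4) ~= 1.386
print(entropy_loss([10.0, 0.0, 0.0, 0.0]))  # close to 0
```

In my own experiments the analogous auxiliary CTC term gave little benefit, which is why I am curious whether the authors saw something different.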
5. There seems to be a bug in the `target_duration` setting in the gradio test.
When I run `inference_gradio.py`, if I don't enter `target_duration`, then no matter how long the `Target Text` is, `target_generation_length` is fixed to the duration of the prompt audio, which results in missing or duplicated content in the synthesized output.
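As a possible workaround until this is fixed, a simple fallback is to scale the prompt duration by the ratio of target-text length to prompt-text length when the field is left empty. This is only a heuristic I would suggest, not the intended fix:

```python
def estimate_target_duration(prompt_duration, prompt_text, target_text,
                             min_dur=1.0, max_dur=60.0):
    """Heuristic fallback when the user leaves target_duration empty:
    assume a roughly constant speaking rate, so duration scales with
    character count. Clamp the estimate to a sane range."""
    if not prompt_text:
        return prompt_duration
    est = prompt_duration * len(target_text) / len(prompt_text)
    return max(min_dur, min(max_dur, est))

# A 3 s prompt transcript of 30 chars with a 60-char target -> ~6 s.
print(estimate_target_duration(3.0, "a" * 30, "b" * 60))  # 6.0
```

Character count is a crude proxy for speech duration (phoneme count would be better), but it avoids the truncation/repetition caused by pinning the generation length to the prompt duration.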
(P.S. Regarding the repetition problem: in our tests, adding a repetition penalty to the sampling process helps a lot.)
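To make the repetition-penalty suggestion concrete, the standard CTRL-style formulation divides the positive logits of previously generated tokens by the penalty and multiplies the negative ones, so the score always drops. The token ids and penalty value below are illustrative:

```python
def apply_repetition_penalty(logits, generated_ids, penalty=1.2):
    """CTRL-style repetition penalty: tokens already generated get their
    logit scaled so they become less likely. Positive logits are divided
    by `penalty`, negative logits are multiplied by it."""
    out = list(logits)
    for tid in set(generated_ids):
        if out[tid] > 0:
            out[tid] /= penalty
        else:
            out[tid] *= penalty
    return out

# Tokens 1 and 2 were generated before, so their logits are pushed down.
print(apply_repetition_penalty([0.5, 2.0, -1.0], [1, 2]))
```

Applied to the audio-token logits before sampling, this noticeably reduced repeated segments in our experiments.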
6. About a few settings in `config.py`
In `config.py` there are also a few settings that are not used elsewhere, such as the musicgen, valle, and LFSC codec options. Have these baselines or modules been added to VoiceStar?
7. Will you release the test scripts for evaluation and comparison?
The code seems to contain test hooks, e.g. `parser.add_argument("--metrics", type=str, default="['spk_sim','wer','mcd','pitch','energy','pesq','utmos']")`. Does this correspond to a test script, and could it be open-sourced for comparison and experimentation?
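Since that `--metrics` default stores a Python list literal inside a string, I assume it gets parsed with something like `ast.literal_eval`; a minimal sketch of how such an argument could be consumed (the parsing code here is my assumption, not taken from the repo):

```python
import argparse
import ast

parser = argparse.ArgumentParser()
# The default mirrors the snippet above: a list literal held in a string.
parser.add_argument(
    "--metrics", type=str,
    default="['spk_sim','wer','mcd','pitch','energy','pesq','utmos']")
args = parser.parse_args([])

# Safely turn the string into an actual list of metric names.
metrics = ast.literal_eval(args.metrics)
print(metrics)
```

If the per-metric implementations (speaker similarity, WER, MCD, pitch/energy, PESQ, UTMOS) behind these names were released, it would make apples-to-apples comparison much easier.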
Looking forward to your answers, and awaiting the release of the paper!