Model Download | Quick Start | License | Citation
📄 Paper Link (UniVoice)
🚀 2025.03.30: The inference codes and checkpoints are released!
This work introduces UniVoice, a novel approach that integrates autoregression and flow matching within a transformer-based framework for unified speech understanding and generation. UniVoice is designed to achieve both speech comprehension and generation capabilities through a unified model trained in a single stage. Our experiments demonstrate that UniVoice delivers strong performance on both automatic speech recognition and zero-shot speech synthesis tasks. By combining autoregression and flow matching, UniVoice establishes a foundation for extending this paradigm to additional audio understanding and generation tasks in the future.
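The two training objectives combined here can be illustrated with a toy sketch: a next-token cross-entropy loss for the autoregressive (text) stream, and a conditional flow-matching loss that regresses the velocity along a straight path from noise to the speech latent. This is a minimal NumPy illustration under assumed shapes and a linear probability path, not the paper's implementation; all function names are hypothetical.

```python
import numpy as np

def flow_matching_loss(x0, x1, t, predict_velocity):
    """Conditional flow matching with a linear path from noise x0 to data x1.

    For the straight path x_t = (1 - t) * x0 + t * x1, the target
    velocity is constant: v = x1 - x0. The model is trained to
    predict that velocity at the interpolated point x_t.
    """
    x_t = (1.0 - t) * x0 + t * x1      # point on the path at time t
    v_target = x1 - x0                 # ground-truth velocity
    v_pred = predict_velocity(x_t, t)  # hypothetical model call
    return float(np.mean((v_pred - v_target) ** 2))

def ar_cross_entropy(logits, targets):
    """Next-token cross-entropy for the autoregressive text stream."""
    logits = logits - logits.max(axis=-1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return float(-np.mean(log_probs[np.arange(len(targets)), targets]))

# Toy data: 4 "speech latent" vectors of dimension 8.
rng = np.random.default_rng(0)
x0 = rng.normal(size=(4, 8))   # noise sample
x1 = rng.normal(size=(4, 8))   # target speech latents
t = 0.3

# An oracle that outputs the exact target velocity drives the loss to 0.
oracle = lambda x_t, t: x1 - x0
fm = flow_matching_loss(x0, x1, t, oracle)      # -> 0.0

# Uniform logits over V classes give cross-entropy log(V).
ce = ar_cross_entropy(np.zeros((3, 5)), np.array([0, 1, 2]))

# A unified single-stage objective would weight and sum the two losses.
total = ce + 1.0 * fm
```

In the unified model, both losses would be computed from the same transformer's outputs in one training stage; the weighting between them is a design choice not specified here.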
In this work, we use SmolLM2-360M-Instruct as the LLM backbone.
- Release UniVoice inference code
- Release UniVoice checkpoints
- UniVoice paper and demo
- Release UniVoice training code
| Model | Download |
|---|---|
| UniVoice-ASR | 🤗 Hugging Face |
| UniVoice-TTS | 🤗 Hugging Face |
| UniVoice-All | 🤗 Hugging Face |
NOTE: The current model is trained only on the 960-hour LibriSpeech dataset. We will release a model trained on more data in the future.
With a Python >= 3.9 environment, install the necessary dependencies by running the following commands:
git clone https://github.com/gwh22/UniVoice
cd UniVoice
# We recommend using conda to create a new environment.
conda create -n UniVoice python=3.9
conda activate UniVoice
# install cuda >= 11.8
conda install cudatoolkit=11.8 -c nvidia
pip install -r requirements.txt
cd UniVoice
# for ASR task
sh scripts/univoice/infer_asr.sh
# for TTS task
sh scripts/univoice/infer_tts.sh
This code repository is licensed under the MIT License.
This codebase borrows from DiT, SmolLM2-360M-Instruct, Monoformer, LLaVA, and Transformers. Thanks for their great work.