Model Download | Quick Start | License | Citation
📄 Paper Link (UniVoice)
🚀 2025.03.30: The inference codes and checkpoints are released!
This work introduces UniVoice, a novel approach that integrates autoregression and flow matching within a transformer-based framework for unified speech understanding and generation. UniVoice is designed to achieve both speech comprehension and generation capabilities through a unified model trained in a single stage. Our experiments demonstrate that UniVoice delivers strong performance on both automatic speech recognition and zero-shot speech synthesis tasks. By combining autoregression and flow matching, UniVoice establishes a foundation for extending this paradigm to additional audio understanding and generation tasks in the future.
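The two training objectives combined here can be illustrated with a toy sketch: a next-token cross-entropy loss for the autoregressive (text) stream, and a conditional flow-matching loss that regresses the velocity along a straight path from noise to the speech latent. This is a minimal NumPy illustration under assumed shapes and a linear probability path, not the paper's implementation; all function names are hypothetical.

```python
import numpy as np

def flow_matching_loss(x0, x1, t, predict_velocity):
    """Conditional flow matching with a linear path from noise x0 to data x1.

    For the straight path x_t = (1 - t) * x0 + t * x1, the target
    velocity is constant: v = x1 - x0. The model is trained to
    predict that velocity at the interpolated point x_t.
    """
    x_t = (1.0 - t) * x0 + t * x1      # point on the path at time t
    v_target = x1 - x0                 # ground-truth velocity
    v_pred = predict_velocity(x_t, t)  # hypothetical model call
    return float(np.mean((v_pred - v_target) ** 2))

def ar_cross_entropy(logits, targets):
    """Next-token cross-entropy for the autoregressive text stream."""
    logits = logits - logits.max(axis=-1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return float(-np.mean(log_probs[np.arange(len(targets)), targets]))

# Toy data: 4 "speech latent" vectors of dimension 8.
rng = np.random.default_rng(0)
x0 = rng.normal(size=(4, 8))   # noise sample
x1 = rng.normal(size=(4, 8))   # target speech latents
t = 0.3

# An oracle that outputs the exact target velocity drives the loss to 0.
oracle = lambda x_t, t: x1 - x0
fm = flow_matching_loss(x0, x1, t, oracle)      # -> 0.0

# Uniform logits over V classes give cross-entropy log(V).
ce = ar_cross_entropy(np.zeros((3, 5)), np.array([0, 1, 2]))

# A unified single-stage objective would weight and sum the two losses.
total = ce + 1.0 * fm
```

In the unified model, both losses would be computed from the same transformer's outputs in one training stage; the weighting between them is a design choice not specified here.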
In this work, we use SmolLM2-360M-Instruct as the LLM backbone.
- Release UniVoice inference code
- Release UniVoice checkpoints
- UniVoice paper and demo
- Release UniVoice training code
| Model | Download |
|---|---|
| UniVoice-ASR | 🤗 Hugging Face |
| UniVoice-TTS | 🤗 Hugging Face |
| UniVoice-All | 🤗 Hugging Face |
NOTE: The current model is trained only on the 960-hour LibriSpeech dataset. We will release a model trained on more data in the future.
With a Python >= 3.9 environment, install the necessary dependencies by running the following commands:
git clone https://github.com/gwh22/UniVoice
cd UniVoice
# We recommend using conda to create a new environment.
conda create -n UniVoice python=3.9
conda activate UniVoice
# install cuda >= 11.8
conda install cudatoolkit=11.8 -c nvidia
pip install -r requirements.txt
cd UniVoice
# for ASR task
sh scripts/univoice/infer_asr.sh
# for TTS task
sh scripts/univoice/infer_tts.sh
This code repository is licensed under the MIT License.
This codebase borrows from DiT, SmolLM2-360M-Instruct, Monoformer, LLaVA, and Transformers. Thanks for their great work.