gwh22/UniVoice

UniVoice: A Unified Speech Recognition and Synthesis Transformer with Autoregressive and Flow Matching Capabilities

Model Download | Quick Start | License | Citation
📄 Paper Link (UniVoice)

News

🚀 2025.03.30: The inference code and checkpoints are released!

1. Introduction

This work introduces UniVoice, a novel approach that integrates autoregression and flow matching within a transformer-based framework for unified speech understanding and generation. UniVoice is designed to achieve both speech comprehension and generation capabilities through a unified model trained in a single stage. Our experiments demonstrate that UniVoice delivers strong performance on both automatic speech recognition and zero-shot speech synthesis. By combining autoregression and flow matching, UniVoice establishes a foundation for extending this paradigm to additional audio understanding and generation tasks in the future.

In this work, we use SmolLM2-360M-Instruct as the LLM backbone.
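As a rough illustration of the flow matching idea mentioned above, the sketch below shows the commonly used linear-path variant: a noise sample `x0` is interpolated toward a data sample `x1`, and a model is trained to predict the constant velocity `x1 - x0`. This README does not state which path or objective UniVoice uses, so this is an assumption for illustration only. The names `flow_matching_pair`, `flow_matching_loss`, and `predict_velocity` are hypothetical and do not come from the UniVoice codebase.

```python
import random


def flow_matching_pair(x0, x1, t):
    """Return the point on the straight path from x0 to x1 at time t,
    together with the velocity target for that path (assumed linear path)."""
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    v_target = [b - a for a, b in zip(x0, x1)]
    return x_t, v_target


def flow_matching_loss(predict_velocity, x1, rng):
    """Mean squared error between predicted and target velocity for one sample.

    predict_velocity is a hypothetical stand-in for the network: it takes
    (x_t, t) and returns a list with the same length as x_t.
    """
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]  # Gaussian noise sample
    t = rng.random()                        # time drawn uniformly from [0, 1)
    x_t, v_target = flow_matching_pair(x0, x1, t)
    v_pred = predict_velocity(x_t, t)
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_target)) / len(x1)


if __name__ == "__main__":
    rng = random.Random(0)
    data = [0.5, -1.0, 2.0]  # toy "data" vector, not real speech features
    # A predictor that always outputs zeros, just to exercise the loss.
    print(flow_matching_loss(lambda x_t, t: [0.0] * len(x_t), data, rng))
```

In a real system the toy vectors would be replaced by speech features and `predict_velocity` by the transformer; generation would then integrate the predicted velocity from noise at `t = 0` to data at `t = 1`.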

👨‍💻 Todo

  • Release UniVoice inference code
  • Release UniVoice checkpoints
  • UniVoice paper and demo
  • Release UniVoice training code

2. Model Download

Hugging Face

Model | Download
UniVoice-ASR | 🤗 Hugging Face
UniVoice-TTS | 🤗 Hugging Face
UniVoice-All | 🤗 Hugging Face

NOTE: The current models are trained only on the 960-hour LibriSpeech dataset. We will release a model trained on more data in the future.

3. Quick Start

Installation

In a Python >= 3.9 environment, install the necessary dependencies by running the following commands:

git clone https://github.com/gwh22/UniVoice
cd UniVoice
# We recommend using conda to create a new environment.
conda create -n UniVoice python=3.9
conda activate UniVoice
# install cuda >= 11.8
conda install cudatoolkit=11.8 -c nvidia

pip install -r requirements.txt
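After installation, a quick sanity check of the environment can save a failed inference run. The sketch below verifies the Python >= 3.9 requirement stated above and, if PyTorch is present, reports whether CUDA is visible. That PyTorch is among the dependencies is an assumption, since the contents of requirements.txt are not shown here; `check_environment` is a hypothetical helper, not part of the repository.

```python
import importlib.util
import sys


def check_environment(min_python=(3, 9)):
    """Report whether the interpreter meets the minimum Python version
    and whether an (assumed) PyTorch install can see a CUDA device."""
    report = {
        "python_version": tuple(sys.version_info[:3]),
        "python_ok": tuple(sys.version_info[:2]) >= min_python,
        # Assumption: torch is one of the installed dependencies.
        "torch_installed": importlib.util.find_spec("torch") is not None,
        "cuda_available": None,  # stays None when torch cannot be imported
    }
    if report["torch_installed"]:
        try:
            import torch
            report["cuda_available"] = torch.cuda.is_available()
        except Exception:
            # A broken install should not crash the check itself.
            report["cuda_available"] = None
    return report


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```

Run it inside the activated `UniVoice` conda environment; if `python_ok` is False or `cuda_available` is not True, revisit the installation steps before running the inference scripts.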

Inference

cd UniVoice
# for ASR task
sh scripts/univoice/infer_asr.sh
# for TTS task
sh scripts/univoice/infer_tts.sh

4. License

This code repository is licensed under the MIT License.

5. Acknowledgments

This codebase borrows from DiT, SmolLM2-360M-Instruct, Monoformer, LLaVA, and Transformers. Thanks to the authors for their great work.
