Stars
Streamable Text-to-Speech model using a language modeling approach, without vector quantization
A Low-Latency, Lightweight and High-Performance Streaming VAD
Finetune Qwen3, Llama 4, TTS, DeepSeek-R1 & Gemma 3 LLMs 2x faster with 70% less memory! 🦥
Self-supervised Generative LM-based Voice Conversion
Have a natural, spoken conversation with AI!
A benchmark to evaluate full-duplex spoken dialogue models on pause handling, backchanneling, turn-taking, and user interruptions.
All generative models in one for a better TTS model
Kimi-Audio, an open-source audio foundation model excelling in audio understanding, generation, and conversation
A TTS model capable of generating ultra-realistic dialogue in one pass.
Fine-tuning Moshi/J-Moshi on your own spoken dialogue data
PyTorch implementation of AudioLCM (ACM-MM'24): efficient and high-quality text-to-audio generation with a latent consistency model.
OSUM: Open Speech Understanding Model, open-sourced by ASLP@NPU.
ZhikangNiu / LLaSA_training
Forked from zhenye234/LLaSA_training
LLaSA: Scaling Train-time and Inference-time Compute for LLaMA-based Speech Synthesis
Finetune the LLM part of the Spark-TTS model
Implementation of MEGABYTE, Predicting Million-byte Sequences with Multiscale Transformers, in Pytorch
Qwen2.5-Omni is an end-to-end multimodal model by the Qwen team at Alibaba Cloud, capable of understanding text, audio, vision, and video, and performing real-time speech generation.
PyTorch implementation of Audio Flamingo 2: An Audio-Language Model with Long-Audio Understanding and Expert Reasoning Abilities.
Use any LLM (Large Language Model) for Deep Research. Supports SSE API and MCP server.
Codebase for 'Scaling Rich Style-Prompted Text-to-Speech Datasets'