Stars
Awesome speech/audio LLMs, representation learning, and codec models
[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
✨✨Latest Advances on Multimodal Large Language Models
Multilingual large voice generation model, providing full-stack inference, training, and deployment capabilities.
Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis
A generative speech model for daily dialogue.
A high-performance Python-based I/O system for large (and small) deep learning problems, with strong support for PyTorch.
Inference and training library for high-quality TTS models.
One minute of voice data is enough to train a good TTS model (few-shot voice cloning).
Official code for "DPM-Solver: A Fast ODE Solver for Diffusion Probabilistic Model Sampling in Around 10 Steps" (NeurIPS 2022 Oral)
MELD: A Multimodal Multi-Party Dataset for Emotion Recognition in Conversation
Emu Series: Generative Multimodal Models from BAAI
Instant voice cloning by MIT and MyShell. Audio foundation model.
PyTorch 0.4.1 code for InsightFace
ACM MM 2021: 'Is Someone Speaking? Exploring Long-term Temporal Features for Audio-visual Active Speaker Detection'
StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models
The official implementation of HierSpeech++
Diffusion model papers, survey, and taxonomy
Production-First and Production-Ready End-to-End Speech Recognition Toolkit
Denoising Diffusion Probabilistic Models
🤖 Assemble, configure, and deploy autonomous AI Agents in your browser.