- Seoul, Korea
- https://avocado136.github.io/
- in/kim-nguyen136
Stars
Collection of awesome LLM apps with AI Agents and RAG using OpenAI, Anthropic, Gemini and open-source models.
Multilingual large voice generation model with full-stack support for inference, training, and deployment.
✨✨VITA-Audio: Fast Interleaved Cross-Modal Token Generation for Efficient Large Speech-Language Model
Voice Agent Framework for Conversational AI
A powerful framework for building realtime voice AI agents 🤖🎙️📹
Kimi-Audio, an open-source audio foundation model excelling in audio understanding, generation, and conversation
A TTS model capable of generating ultra-realistic dialogue in one pass.
LUCY: Linguistic Understanding and Control Yielding Early Stage of Her
MiniCPM-o 2.6: A GPT-4o Level MLLM for Vision, Speech and Multimodal Live Streaming on Your Phone
Open Source framework for voice and multimodal conversational AI
A Conversational Speech Generation Model
Unified automatic quality assessment for speech, music, and sound.
A Python package that makes it easy for developers to create AI apps powered by various AI providers.
Moshi is a speech-text foundation model and full-duplex spoken dialogue framework. It uses Mimi, a state-of-the-art streaming neural audio codec.
[CVPR 2025] MMAudio: Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis
A generative speech model for daily dialogue.
This is the code for the SpeechTokenizer presented in "SpeechTokenizer: Unified Speech Tokenizer for Speech Language Models". Samples are presented on
AutoGPT is the vision of accessible AI for everyone, to use and to build on. Our mission is to provide the tools, so that you can focus on what matters.
Code for the paper "LLark: A Multimodal Instruction-Following Language Model for Music" by Josh Gardner, Simon Durand, Daniel Stoller, and Rachel Bittner.
Awesome speech/audio LLMs, representation learning, and codec models
A Beautiful Private and Secure Desktop Investment Tracking Application
Official code for "F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching"