Stars
The Ultimate Express. Fastest HTTP server with full Express compatibility, based on µWebSockets.
Incredibly fast JavaScript runtime, bundler, test runner, and package manager – all in one
Your AI Operator for Web, Android, Automation & Testing.
Implementation of the ScreenAI model from the paper: "A Vision-Language Model for UI and Infographics Understanding"
[ICLR'25 Oral] UGround: Universal GUI Visual Grounding for GUI Agents
Enable LLMs to Program Themselves.
MiniCPM-o 2.6: A GPT-4o Level MLLM for Vision, Speech and Multimodal Live Streaming on Your Phone
🦉 OWL: Optimized Workforce Learning for General Multi-Agent Assistance in Real-World Task Automation
🐫 CAMEL: The first and the best multi-agent framework. Finding the Scaling Law of Agents. https://www.camel-ai.org
Message passing between iOS apps and extensions.
Amphion (/æmˈfaɪən/) is a toolkit for Audio, Music, and Speech Generation. Its purpose is to support reproducible research and help junior researchers and engineers get started in the field of audi…
Official code for "F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching"
A simple, high-quality voice conversion tool focused on ease of use and performance.
Moshi is a speech-text foundation model and full-duplex spoken dialogue framework. It uses Mimi, a state-of-the-art streaming neural audio codec.
Next-generation TTS model using flow-matching and DiT, inspired by Stable Diffusion 3
This program processes anime videos and subtitles to create Text-to-Speech (TTS) datasets. It extracts and cleans the audio by removing background noise, then slices it into smaller segments. Final…
Open-source multimodal large language model that can hear and talk while thinking, featuring real-time end-to-end speech input and streaming audio output conversational capabilities.
Multi-Scale Neural Audio Codec (SNAC) compresses audio into discrete codes at a low bitrate
The easiest way to deploy agents, models, RAG, pipelines and more. No MLOps. No YAML.