Starred repositories
GPT-4o-level, real-time spoken dialogue system.
Official repository for "VideoPrism: A Foundational Visual Encoder for Video Understanding" (ICML 2024)
[ICLR 2025] Repository for Show-o series, One Single Transformer to Unify Multimodal Understanding and Generation.
Stream-Omni is a GPT-4o-like language-vision-speech chatbot that simultaneously supports interaction across various modality combinations.
A lightweight LMM-based Document Parsing Model
From Images to High-Fidelity 3D Assets with Production-Ready PBR Material
Fully Local Manus AI. No APIs, No $200 monthly bills. Enjoy an autonomous agent that thinks, browses the web, and codes for the sole cost of electricity. 🔔 Official updates only via twitter @Martin9…
✨✨VITA-Audio: Fast Interleaved Cross-Modal Token Generation for Efficient Large Speech-Language Model
Qwen2.5-Omni is an end-to-end multimodal model by the Qwen team at Alibaba Cloud, capable of understanding text, audio, vision, and video, and of performing real-time speech generation.
Kimi-Audio, an open-source audio foundation model excelling in audio understanding, generation, and conversation
MiniMax-M1, the world's first open-weight, large-scale hybrid-attention reasoning model.
🚀 [Large Models] Train a 26M-parameter visual multimodal VLM from scratch in just 1 hour!
Perceive Anything: Recognize, Explain, Caption, and Segment Anything in Images and Videos
Ming - facilitating advanced multimodal understanding and generation capabilities built upon the Ling LLM.
SkyReels-A2: Compose anything in video diffusion transformers
(CVPR 2025) From Slow Bidirectional to Fast Autoregressive Video Diffusion Models
MCP: Build Rich-Context AI Apps with Anthropic
UniWorld: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation
Get started with building Fullstack Agents using Gemini 2.5 and LangGraph
Just another reasonably minimal repo for class-conditional training of pixel-space diffusion transformers.
The official repo of One RL to See Them All: Visual Triple Unified Reinforcement Learning
Claude Code is an agentic coding tool that lives in your terminal, understands your codebase, and helps you code faster by executing routine tasks, explaining complex code, and handling git workflo…
Official MiniMax Model Context Protocol (MCP) server that enables interaction with powerful Text to Speech, image generation and video generation APIs.
🤖 A visualization Model Context Protocol server for generating 25+ visual charts using @antvis.