MultiModal
[ECCV2024] This is an official implementation for "PSALM: Pixelwise SegmentAtion with Large Multi-Modal Model"
InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions
AI suite powered by state-of-the-art models, providing advanced AI/AGI functions. It features AI personas, AGI functions, multi-model chats, text-to-image, voice, response streaming, code highli…
CVPR'24, Official Codebase of our Paper: "Let's Think Outside the Box: Exploring Leap-of-Thought in Large Language Models with Creative Humor Generation".
MiniCPM-o 2.6: A GPT-4o Level MLLM for Vision, Speech and Multimodal Live Streaming on Your Phone
Code and models for the paper "One Transformer Fits All Distributions in Multi-Modal Diffusion"
Emu Series: Generative Multimodal Models from BAAI
Lumina-T2X is a unified framework for Text to Any Modality Generation
✨✨[CVPR 2025] Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
Streamline the fine-tuning process for multimodal models: PaliGemma 2, Florence-2, and Qwen2.5-VL
Must-have resource for anyone who wants to experiment with and build on the OpenAI vision API 🔥
Repository for Meta Chameleon, a mixed-modal early-fusion foundation model from FAIR.
Cambrian-1 is a family of multimodal LLMs with a vision-centric design.
Anole: An Open, Autoregressive, and Native Multimodal Model for Interleaved Image-Text Generation
🐫 CAMEL: The first and the best multi-agent framework. Finding the Scaling Law of Agents. https://www.camel-ai.org
Towards Large Multimodal Models as Visual Foundation Agents
VisionLLaMA: A Unified LLaMA Backbone for Vision Tasks
Qwen3 is the large language model series developed by the Qwen team at Alibaba Cloud.
DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding