Stars
🚀 Efficient implementations of state-of-the-art linear attention models
MMSearch-R1 is an end-to-end RL framework that enables LMMs to perform on-demand, multi-turn search with real-world multimodal search tools.
Official repository of "ScaleCap: Inference-Time Scalable Image Captioning via Dual-Modality Debiasing"
[ICLR'25] Data and code for our paper "Why Does the Effective Context Length of LLMs Fall Short?"
State-of-the-art Image & Video CLIP, Multimodal Large Language Models, and More!
[CVPR 2025] EgoLife: Towards Egocentric Life Assistant
A controlled benchmark for evaluating and studying the dynamics of long-context language models
The official repo of the paper "MMLongBench: Benchmarking Long-Context Vision-Language Models Effectively and Thoroughly"
XiaomiMiMo / lmms-eval
Forked from EvolvingLMMs-Lab/lmms-eval. Accelerating the development of large multimodal models (LMMs) with the one-click evaluation module lmms-eval.
MiniMax-M1, the world's first open-weight, large-scale hybrid-attention reasoning model.
MMaDA - Open-Source Multimodal Large Diffusion Language Models
Quantized attention that achieves speedups of 2-5x and 3-11x compared to FlashAttention and xformers, respectively, without losing end-to-end metrics across language, image, and video models.
[ACL 2024] Official GitHub repo for OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems.
Scaling Computer-Use Grounding via UI Decomposition and Synthesis
[ACL 2024 Findings & ICLR 2024 WS] An evaluator VLM that is open-source, offers reproducible evaluation, and is inexpensive to use. Specifically designed for fine-grained evaluation on customized score rubrics.
LLM/VLM gaming agents and model evaluation through games.
Benchmark environment for evaluating vision-language models (VLMs) on popular video games!
Seed1.5-VL, a vision-language foundation model designed to advance general-purpose multimodal understanding and reasoning, achieving state-of-the-art performance on 38 out of 60 public benchmarks.
[ACM MM 2025] TimeChat-Online: 80% Visual Tokens are Naturally Redundant in Streaming Videos
A native-PyTorch library for large-scale multimodal LLM (text/audio) training with tensor, context, data, and pipeline parallelism (TP/CP/DP/PP).
MobileVLM: A Vision-Language Model for Better Intra- and Inter-UI Understanding
Performance analysis of predictive (alpha) stock factors