Stars
Seed1.5-VL, a vision-language foundation model designed to advance general-purpose multimodal understanding and reasoning, achieving state-of-the-art performance on 38 out of 60 public benchmarks.
A generative world for general-purpose robotics & embodied AI learning.
A paper list of recent works on token compression for ViT and VLM
⚡FlashRAG: A Python Toolkit for Efficient RAG Research (WWW2025 Resource)
mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding
[CVPR 2024] DocRes: A Generalist Model Toward Unifying Document Image Restoration Tasks
Mobile-Agent: The Powerful Mobile Device Operation Assistant Family
mPLUG-HalOwl: Multimodal Hallucination Evaluation and Mitigation
AppAgent: Multimodal Agents as Smartphone Users, an LLM-based multimodal agent framework designed to operate smartphone apps.
Build high-quality LLM apps, from prototyping and testing to production deployment and monitoring.
Chat凉宫春日 (Chat Haruhi Suzumiya), an open-source role-playing chatbot by Cheng Li, Ziang Leng, and others.
Youku-mPLUG: A 10-Million-Scale Chinese Video-Language Pre-training Dataset and Benchmarks
mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections. (EMNLP 2022)
mPLUG-Owl: The Powerful Multi-modal Large Language Model Family
Official implementation of Think Global, Act Local: Dual-Scale Graph Transformer for Vision-and-Language Navigation (CVPR'22 Oral).
A data generation pipeline for creating semi-realistic synthetic multi-object videos with rich annotations such as instance segmentation masks, depth maps, and optical flow.
PyTorch code for BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
A programmer's guide to cooking at home (Simplified Chinese only).