Stars
This repo contains the documentation and code needed to use the PACO dataset: data loaders, training and evaluation scripts for object, part, and attribute prediction models, query evaluation scripts…
Radial Attention Official Implementation
Vision as a Dialect: Unifying Visual Understanding and Generation via Text-Aligned Representations
[CVPR 2025 Best Paper Award] VGGT: Visual Geometry Grounded Transformer
Reinforcing Spatial Reasoning in Vision-Language Models with Interwoven Thinking and Visual Drawing
TokLIP: Marry Visual Tokens to CLIP for Multimodal Comprehension and Generation
UniWorld: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation
Align Anything: Training All-modality Model with Feedback
Open-source Multi-agent Poster Generation from Papers
Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning?
Official Implementation of Diffusion Step Annealing (DiSA) in Autoregressive Image Generation
MMaDA - Open-Sourced Multimodal Large Diffusion Language Models
Janus-Series: Unified Multimodal Understanding and Generation Models
Seed1.5-VL, a vision-language foundation model designed to advance general-purpose multimodal understanding and reasoning, achieving state-of-the-art performance on 38 out of 60 public benchmarks.
The simplest, fastest repository for training/finetuning small-sized VLMs.