Stars
Human Preference Score v2: A Solid Benchmark for Evaluating Human Preferences of Text-to-Image Synthesis
[NeurIPS 2023] ImageReward: Learning and Evaluating Human Preferences for Text-to-image Generation
Official implementation of UnifiedReward & UnifiedReward-Think
A SOTA open-source image editing model that aims to provide performance comparable to closed-source models such as GPT-4o and Gemini 2 Flash.
SGLang is a fast serving framework for large language models and vision language models.
A linear estimator on top of CLIP to predict the aesthetic quality of pictures
WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation
[CVPR 2024 Oral] InternVL Family: A Pioneering Open-Source Alternative to GPT-4o. An open-source multimodal dialogue model approaching GPT-4o performance
An open-source implementation for fine-tuning the Qwen2-VL and Qwen2.5-VL series from Alibaba Cloud.
Code for paper "Super-CLEVR: A Virtual Benchmark to Diagnose Domain Robustness in Visual Reasoning"
Minimal reproduction of DeepSeek R1-Zero
Fast and memory-efficient exact attention
A high-throughput and memory-efficient inference and serving engine for LLMs
Train transformer language models with reinforcement learning.
Grounded SAM: Marrying Grounding DINO with Segment Anything & Stable Diffusion & Recognize Anything - Automatically Detect, Segment and Generate Anything
DINO-X: The World's Top-Performing Vision Model for Open-World Object Detection and Understanding
Janus-Series: Unified Multimodal Understanding and Generation Models
Autoregressive Model Beats Diffusion: 🦙 Llama for Scalable Image Generation
Implementation code of the paper MIGE: A Unified Framework for Multimodal Instruction-Based Image Generation and Editing
Benchmark for generative image models
Code release for our NeurIPS 2024 Spotlight paper "GenArtist: Multimodal LLM as an Agent for Unified Image Generation and Editing"
Pytorch implementation of Transfusion, "Predict the Next Token and Diffuse Images with One Multi-Modal Model", from MetaAI
Use PEFT or Full-parameter to CPT/SFT/DPO/GRPO 500+ LLMs (Qwen3, Qwen3-MoE, Llama4, InternLM3, DeepSeek-R1, ...) and 200+ MLLMs (Qwen2.5-VL, Qwen2.5-Omni, Qwen2-Audio, Ovis2, InternVL3, Llava, GLM4…
OmniGen: Unified Image Generation. https://arxiv.org/pdf/2409.11340
[ICLR 2025] VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation