Stars
Selftok: Discrete Visual Tokens of Autoregression, by Diffusion, and for Reasoning
Autoregressive Image Generation with Randomized Parallel Decoding
[CVPR 2025 (Oral)] Open implementation of "RandAR"
Open-Qwen2VL: Compute-Efficient Pre-Training of Fully-Open Multimodal LLMs on Academic Resources
PyTorch - FID calculation with proper image resizing and quantization steps [CVPR 2022]
HART: Efficient Visual Generation with Hybrid Autoregressive Transformer
Official repo for: SuperEdit - Rectifying and Facilitating Supervision for Instruction-Based Image Editing
Seed1.5-VL, a vision-language foundation model designed to advance general-purpose multimodal understanding and reasoning, achieving state-of-the-art performance on 38 out of 60 public benchmarks.
Official implementation of the paper: "ZClip: Adaptive Spike Mitigation for LLM Pre-Training".
DeerFlow is a community-driven Deep Research framework, combining language models with tools like web search, crawling, and Python execution, while contributing back to the open-source community.
An official implementation of Flow-GRPO: Training Flow Matching Models via Online RL
WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation
Official repository of "GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing"
The simplest, fastest repository for training/finetuning small-sized VLMs.
Official implementation of Character Region Awareness for Text Detection (CRAFT)
Ming - facilitating advanced multimodal understanding and generation capabilities built upon the Ling LLM.
[NeurIPS 2023 & TPAMI] T2I-CompBench (++) for Compositional Text-to-image Generation Evaluation
Official repository of T2I-R1: Reinforcing Image Generation with Collaborative Semantic-level and Token-level CoT
ImageGen-CoT: Enhancing Text-to-Image In-context Learning with Chain-of-Thought Reasoning
Minimalistic 4D-parallelism distributed training framework for educational purposes
Official PyTorch implementation of FlowMo.
A SOTA open-source image editing model that aims to provide performance comparable to closed-source models such as GPT-4o and Gemini 2 Flash.
MAGI-1: Autoregressive Video Generation at Scale
New repo collection for NVIDIA Cosmos: https://github.com/nvidia-cosmos