Stars
Official repo for From Intention to Execution: Probing the Generalization Boundaries of Vision-Language-Action Models
MINT-1T: A one trillion token multimodal interleaved dataset.
Modular, scalable library to train ML models
UniDisc: A discrete diffusion model for joint multimodal generation, enabling controllable and efficient text-image synthesis, editing, and inpainting.
[CVPR 2025] Science-T2I: Addressing Scientific Illusions in Image Synthesis
Code for "Scaling Language-Free Visual Representation Learning" paper (Web-SSL).
SPICE: A Synergistic, Precise, Iterative, and Customizable Image Editing Workflow
Code for Scaling Language-Free Visual Representation Learning (WebSSL)
Code for MetaMorph: Multimodal Understanding and Generation via Instruction Tuning
tulip-berkeley / open_clip
Forked from mlfoundations/open_clip. An open source implementation of CLIP (with TULIP support).
Official implementation of the Law of Vision Representation in MLLMs
Official PyTorch Implementation of "Diffusion Autoencoders are Scalable Image Tokenizers"
Easy, no-frills PyTorch implementations of common abstractions for simple diffusion models.
Official repo and evaluation implementation of VSI-Bench
[ECCV 2024] Official Release of SILC: Improving vision language pretraining with self-distillation
PyTorch implementation of MAR+DiffLoss https://arxiv.org/abs/2406.11838
PyTorch code and models for the DINOv2 self-supervised learning method.
[ECCV 2024] Code for VFusion3D: Learning Scalable 3D Generative Models from Video Diffusion Models
Official repo for the paper: VCR: Visual Caption Restoration. See arxiv.org/pdf/2406.06462 for details.
[NeurIPS'24 Spotlight] Visual CoT: Advancing Multi-Modal Language Models with a Comprehensive Dataset and Benchmark for Chain-of-Thought Reasoning
Cambrian-1 is a family of multimodal LLMs with a vision-centric design.
PaCE: Parsimonious Concept Engineering for Large Language Models (NeurIPS 2024)
Official Repo for Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning
Enabling PyTorch on XLA Devices (e.g. Google TPU)
Large-scale text-video dataset. 10 million captioned short videos.
PyTorch code and models for V-JEPA self-supervised learning from video.