Stars
[CVPR'25 - Rating 555] Official PyTorch implementation of Lumos: Learning Visual Generative Priors without Text
CVPR and NeurIPS poster examples and templates. May we have in-person poster sessions soon!
Code implementation of paper "MUSE: Mamba is Efficient Multi-scale Learner for Text-video Retrieval (AAAI2025)"
[AAAI 25] Official Implementation for "E-Bench: Subjective-Aligned Benchmark Suite for Text-Driven Video Editing Quality Assessment"
[CVPR2022] PyTorch re-implementation of Prompt Distribution Learning
[NeurIPS2023] Neural-Logic Human-Object Interaction Detection
Cambrian-1 is a family of multimodal LLMs with a vision-centric design.
Disentangled Pre-training for Human-Object Interaction Detection
PyTorch implementation of Sinusoidal Representation Networks (SIREN)
Code repository of the paper "CKConv: Continuous Kernel Convolution For Sequential Data" published at ICLR 2022. https://arxiv.org/abs/2102.02611
Official PyTorch implementation of Learning to (Learn at Test Time): RNNs with Expressive Hidden States
Repository for Meta Chameleon, a mixed-modal early-fusion foundation model from FAIR.
Official PyTorch implementation of GroupViT: Semantic Segmentation Emerges from Text Supervision, CVPR 2022.
Open source implementation of "Vision Transformers Need Registers"
[CVPR 2024 Highlight] OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs
[NeurIPS 2024] An official implementation of ShareGPT4Video: Improving Video Understanding and Generation with Better Captions
Autoregressive Model Beats Diffusion: 🦙 Llama for Scalable Image Generation
[ICLR 2025][arXiv:2406.07548] Image and Video Tokenization with Binary Spherical Quantization
[NeurIPS 2024] OmniTokenizer: one model and one weight for image-video joint tokenization.
Extend BoxDiff to SDXL (SDXL-based layout-to-image generation)
Code for the ICLR 2025 paper: Towards Semantic Equivalence of Tokenization in Multimodal LLM
GLM-4 series: Open Multilingual Multimodal Chat LMs
✨✨[CVPR 2025] Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
A collection of resources on controllable generation with text-to-image diffusion models.
Accepted as a [NeurIPS 2024] Spotlight Presentation Paper
[AAAI-25] Cobra: Extending Mamba to Multi-modal Large Language Model for Efficient Inference
[CVPR 2024 Highlight🔥] Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding