Official Repo for Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning
[CVPR 2024] DiffuScene: Denoising Diffusion Models for Generative Indoor Scene Synthesis
[CVPR2025 Highlight] Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models
Code release for 'Struct2D: A Perception-Guided Framework for Spatial Reasoning in Large Multimodal Models'
A tool for creating and running Linux containers using lightweight virtual machines on a Mac. It is written in Swift, and optimized for Apple silicon.
The mouse and trackpad utility for Mac.
SpatialScore: Towards Unified Evaluation for Multimodal Spatial Understanding
Segment Anything in Medical Images
Official repository for "AM-RADIO: Reduce All Domains Into One"
Official Repo for "TheoremExplainAgent: Towards Video-based Multimodal Explanations for LLM Theorem Understanding" [ACL 2025 oral]
The simplest, fastest repository for training/finetuning small-sized VLMs.
[ECCV 2024 Oral] Code for paper: An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models
State-of-the-art Image & Video CLIP, Multimodal Large Language Models, and More!
TinyLLaVA-Video-R1: Towards Smaller LMMs for Video Reasoning
The official repo of Qwen-VL (Tongyi Qianwen-VL), the chat & pretrained large vision-language model proposed by Alibaba Cloud.
Grounded SAM: Marrying Grounding DINO with Segment Anything & Stable Diffusion & Recognize Anything - Automatically Detect, Segment, and Generate Anything
Implementation for Describe Anything: Detailed Localized Image and Video Captioning
The official code of "VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning"
MedAgentsBench: Benchmarking Thinking Models and Agent Frameworks for Complex Medical Reasoning
[CVPR 2025] EgoLife: Towards Egocentric Life Assistant
A Simple yet Effective Pathway to Empowering LLaVA to Understand and Interact with 3D World
[CVPR 2025] The code for paper ''Video-3D LLM: Learning Position-Aware Video Representation for 3D Scene Understanding''.
Official implementation of "Ross3D: Reconstructive Visual Instruction Tuning with 3D-Awareness".
EasyR1: An Efficient, Scalable, Multi-Modality RL Training Framework based on veRL
Project Page For "Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement"
GPT4Scene: Understand 3D Scenes from Videos with Vision-Language Models