Stars
Lightweight Python library for adding real-time multi-object tracking to any detector.
COLMAP - Structure-from-Motion and Multi-View Stereo
🍽️ Annotations for the public release of the EPIC-KITCHENS-100 dataset
Enhancing Zero-shot Image Retrieval with Vision Foundation Models
[ICML 2024] Official code repository for 3D embodied generalist agent LEO
BoT-SORT: Robust Associations Multi-Pedestrian Tracking
Code release for "Omni3D: A Large Benchmark and Model for 3D Object Detection in the Wild"
[CVPR 2025 Best Paper Award Candidate] VGGT: Visual Geometry Grounded Transformer
SkyRL-v0: Train Real-World Long-Horizon Agents via Reinforcement Learning
Code for "Open Vocabulary Monocular 3D Object Detection"
[ECCV24] Keypoint Promptable Re-Identification: SOTA ReID method robust to occlusions and multi-person ambiguity
Solve Visual Understanding with Reinforced VLMs
[NeurIPS'24] This repository is the implementation of "SpatialRGPT: Grounded Spatial Reasoning in Vision Language Models"
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs
InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions
MoVQGAN - a model for image encoding and reconstruction
Video Chain of Thought: code for the ICML 2024 paper "Video-of-Thought: Step-by-Step Video Reasoning from Perception to Cognition"
Official codebase used to develop Vision Transformer, SigLIP, MLP-Mixer, LiT and more.
[CVPR2024] Code for "SAM-6D: Segment Anything Model Meets Zero-Shot 6D Object Pose Estimation".
[CVPR 2024 Highlight] FoundationPose: Unified 6D Pose Estimation and Tracking of Novel Objects
SpatialLM: Large Language Model for Spatial Understanding
[CVPR 2025] The code for "VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM"