Stars
SonicVerse: Multi-Task Learning for Music Feature-Informed Captioning
Grounded SAM 2: Ground and Track Anything in Videos with Grounding DINO, Florence-2 and SAM 2
[CVPR 2025] "A Distractor-Aware Memory for Visual Object Tracking with SAM2"
State-of-the-art Image & Video CLIP, Multimodal Large Language Models, and More!
[CVPR 2025] Official repository for the paper: "SAMWISE: Infusing Wisdom in SAM2 for Text-Driven Video Segmentation"
UFM: A Unified Dense Image Correspondence Estimator for both Optical Flow & Wide Baseline Matching Tasks. Matches any pair of images.
This repository is a curated collection of the most exciting and influential CVPR 2025 papers. 🔥 [Paper + Code + Demo]
An open-source, extensible AI agent that goes beyond code suggestions - install, execute, edit, and test with any LLM
✨ An advanced 3D Gaussian Splatting renderer for THREE.js
[CVPR 2025] Official PyTorch implementation of "EdgeTAM: On-Device Track Anything Model"
Official repository for "Chain-of-Zoom: Extreme Super-Resolution via Scale Autoregression and Preference Alignment"
The official code implementation of the paper "OmniConsistency: Learning Style-Agnostic Consistency from Paired Stylization Data."
Seed1.5-VL, a vision-language foundation model designed to advance general-purpose multimodal understanding and reasoning, achieving state-of-the-art performance on 38 out of 60 public benchmarks.
[SIGGRAPH 2025] PrimitiveAnything: Human-Crafted 3D Primitive Assembly Generation with Auto-Regressive Transformer
OpenVision: A Fully-Open, Cost-Effective Family of Advanced Vision Encoders for Multimodal Learning
Removes reflections quickly and easily.
A unified library for object tracking featuring clean room re-implementations of leading multi-object tracking algorithms
[CVPR 2025] UniK3D: Universal Camera Monocular 3D Estimation
Qwen2.5-Omni is an end-to-end multimodal model by Qwen team at Alibaba Cloud, capable of understanding text, audio, vision, video, and performing real-time speech generation.
RF-DETR is a real-time object detection model architecture developed by Roboflow, SOTA on COCO & designed for fine-tuning.
[CVPR 2025 Best Paper Award] VGGT: Visual Geometry Grounded Transformer
Streamlined interface for generating images with AI in Krita. Inpaint and outpaint with optional text prompt, no tweaking required.
The official Python SDK for Model Context Protocol servers and clients
🚀 The fast, Pythonic way to build MCP servers and clients
Official examples and tools from the JACK project
[CVPR 2025] Magma: A Foundation Model for Multimodal AI Agents