Stars
🔥🔥🔥Latest Papers, Codes and Datasets on Vid-LLMs.
[CVPR 2025] The code for "VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM"
The Cradle framework is a first attempt at General Computer Control (GCC). Cradle supports agents to ace any computer task by enabling strong reasoning abilities, self-improvement, and skill curatio…
[ICASSP 2025] "Enhancing Multimodal Sentiment Analysis for Missing Modality through Self-Distillation and Unified Modality Cross-Attention"
State-of-the-art Image & Video CLIP, Multimodal Large Language Models, and More!
Implementation for Describe Anything: Detailed Localized Image and Video Captioning
SpatialLM: Large Language Model for Spatial Understanding
🎥 Python and OpenCV-based scene cut/transition detection program & library.
TRAM: Global Trajectory and Motion of 3D Humans from in-the-wild Videos
[CVPR 2025 Best Paper Award Candidate] VGGT: Visual Geometry Grounded Transformer
[CVPR 2025 Highlight] Video Depth Anything: Consistent Depth Estimation for Super-Long Videos
Repository for our paper: FLD: Fourier Latent Dynamics for Structured Motion Representation and Learning, Proceedings of the 12th International Conference on Learning Representations (ICLR)
[NeurIPS 2024] Depth Anything V2. A More Capable Foundation Model for Monocular Depth Estimation
Fit SMPL model parameters using 3D joints
[CVPR 2024] 4D Gaussian Splatting for Real-Time Dynamic Scene Rendering
Official implementation of "ASAP: Aligning Simulation and Real-World Physics for Learning Agile Humanoid Whole-Body Skills"
Frontier Multimodal Foundation Models for Image and Video Understanding
Code release for "Learning Dexterous Manipulation from Exemplar Object Trajectories and Pre-Grasps"
The Large-scale Manipulation Platform for Scalable and Intelligent Embodied Systems
Official implementation of HuDOR: Bridging the Human to Robot Dexterity Gap through Object-Oriented Rewards. Website: https://object-rewards.github.io
[RSS 2024] 3D Diffusion Policy: Generalizable Visuomotor Policy Learning via Simple 3D Representations
The simplest, fastest repository for training/finetuning medium-sized GPTs.
GRAPE: Guided-Reinforced Vision-Language-Action Preference Optimization
Code for the paper "Learning from Massive Human Videos for Universal Humanoid Pose Control"