Horizon Robotics, Nanjing, China
Stars
Grounded SAM 2: Ground and Track Anything in Videos with Grounding DINO, Florence-2 and SAM 2
A family of versatile and state-of-the-art video tokenizers.
VideoVAE+: Large Motion Video Autoencoding with Cross-modal Video VAE
Video Diffusion Alignment via Reward Gradients. We improve a variety of video diffusion models such as VideoCrafter, OpenSora, ModelScope and StableVideoDiffusion by finetuning them using various r…
Code for "Diffusion Model Alignment Using Direct Preference Optimization"
DDPO for finetuning diffusion models, implemented in PyTorch with LoRA support
A curated list of resources on diffusion models in RL (continually updated)
Wan: Open and Advanced Large-Scale Video Generative Models
Code for Scaling Language-Free Visual Representation Learning (WebSSL)
Awesome RL Reasoning Recipes ("Triple R")
GPG: A Simple and Strong Reinforcement Learning Baseline for Model Reasoning
[NeurIPS 2024 Datasets and Benchmarks Track] Closed-Loop E2E-AD Benchmark Enhanced by World Model RL Expert
Understanding R1-Zero-Like Training: A Critical Perspective
RWKV (pronounced RwaKuv) is an RNN with great LLM performance, which can also be directly trained like a GPT transformer (parallelizable). We are at RWKV-7 "Goose". So it's combining the best of RN…
The official implementation of the paper "What Matters in Transformers? Not All Attention is Needed".
Code for generating synthetic text images as described in "Synthetic Data for Text Localisation in Natural Images", Ankush Gupta, Andrea Vedaldi, Andrew Zisserman, CVPR 2016.
An Easy-to-use, Scalable and High-performance RLHF Framework based on Ray (PPO & GRPO & REINFORCE++ & vLLM & RFT & Dynamic Sampling & Async Agent RL)
Official codebase used to develop Vision Transformer, SigLIP, MLP-Mixer, LiT and more.
Solve Visual Understanding with Reinforced VLMs
Witness the aha moment of VLM with less than $3.
LLaVA-Mini is a unified large multimodal model (LMM) that can support the understanding of images, high-resolution images, and videos in an efficient manner.
LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
Fully open reproduction of DeepSeek-R1
Official PyTorch implementation for "VidToMe: Video Token Merging for Zero-Shot Video Editing" (CVPR 2024)
Python code for ICLR 2022 spotlight paper EViT: Expediting Vision Transformers via Token Reorganizations
A method to increase the speed and lower the memory footprint of existing vision transformers.