- Seattle, Washington, United States
- weikaih04.github.io
Stars
QwQ is the reasoning model series developed by the Qwen team, Alibaba Cloud.
Qwen2.5-VL is the multimodal large language model series developed by the Qwen team, Alibaba Cloud.
Machine Mental Imagery: Empower Multimodal Reasoning with Latent Visual Tokens (arXiv 2025)
[CVPR 2024 Highlight] FoundationPose: Unified 6D Pose Estimation and Tracking of Novel Objects
[CVPR 2025] Any6D: Model-free 6D Pose Estimation of Novel Objects
[CVPR 2024] Code for "SAM-6D: Segment Anything Model Meets Zero-Shot 6D Object Pose Estimation"
A data generation pipeline for creating semi-realistic synthetic multi-object videos with rich annotations such as instance segmentation masks, depth maps, and optical flow.
Official PyTorch Implementation for "Stereo3DMOT: Stereo Vision Based 3D Multi-Object Tracking with Multimodal ReID, PRCV2023"
A curated list of awesome Deep Stereo Matching resources
The repo for "Metric3D: Towards Zero-shot Metric 3D Prediction from A Single Image" and "Metric3Dv2: A Versatile Monocular Geometric Foundation Model..."
Official implementation of Continuous 3D Perception Model with Persistent State
We extend Segment Anything to 3D perception by combining it with VoxelNeXt.
[CVPR 2025 Best Paper Award] VGGT: Visual Geometry Grounded Transformer
[CVPR2024 Oral] EscherNet: A Generative Model for Scalable View Synthesis
Transparent Image Layer Diffusion using Latent Transparency
YOLO 3D Object Detection for Autonomous Driving Vehicle
Code for "Open Vocabulary Monocular 3D Object Detection"
[ICLR'25] 3DTrajMaster: Mastering 3D Trajectory for Multi-Entity Motion in Video Generation
🚀🚀🚀A curated list of papers on controllable video generation.
Official implementation of "Generating images with 3D annotations using diffusion models".
👆 PyTorch implementation of "Ctrl-V: Higher Fidelity Video Generation with Bounding-Box Controlled Object Motion"
Stereo4D dataset and processing code
Code for "MetaMorph: Multimodal Understanding and Generation via Instruction Tuning"
A simple screen-parsing tool toward a pure-vision-based GUI agent