TokLIP: Marry Visual Tokens to CLIP for Multimodal Comprehension and Generation
Video-R1: Reinforcing Video Reasoning in MLLMs [🔥the first paper to explore R1 for video]
Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning?
Qwen2.5-Omni: an end-to-end multimodal model by the Qwen team at Alibaba Cloud, capable of understanding text, audio, images, and video, and of generating speech in real time.
DiCoDe: Diffusion-Compressed Deep Tokens for Autoregressive Video Generation with Language Models
AnimeGamer: Infinite Anime Life Simulation with Next Game State Prediction
A post-training method to enhance CLIP's fine-grained visual representations with generative models.
[arXiv'25] BlobCtrl: A Unified and Flexible Framework for Element-level Image Generation and Editing
✨First Open-Source R1-like Video-LLM [2025/02/18]
SALMONN family: A suite of advanced multi-modal LLMs
VideoVAE+: Large Motion Video Autoencoding with Cross-modal Video VAE
[CVPR 2024 Oral] InternVL Family: A Pioneering Open-Source Alternative to GPT-4o. An open-source multimodal dialogue model approaching GPT-4o-level performance.
Building Open-Ended Embodied Agents with Internet-Scale Knowledge
Video PreTraining (VPT): Learning to Act by Watching Unlabeled Online Videos
A suite of image and video neural tokenizers
Latent Motion Token as the Bridging Language for Robot Manipulation
📖 A repository for organizing papers, code, and other resources related to unified multimodal models.
Text- and image-to-video generation: CogVideoX (2024) and CogVideo (ICLR 2023)
[ICLR 2025] MLLM for On-Demand Spatial-Temporal Understanding at Arbitrary Resolution