Stars
Official implementation of our ICML 2024 paper "UP2ME: Univariate Pre-training to Multivariate Fine-tuning as a General-purpose Framework for Multivariate Time Series Analysis"
Seed1.5-VL, a vision-language foundation model designed to advance general-purpose multimodal understanding and reasoning, achieving state-of-the-art performance on 38 out of 60 public benchmarks.
Qwen3 is the large language model series developed by the Qwen team, Alibaba Cloud.
MM-IFEngine: Towards Multimodal Instruction Following
Code for "A Sober Look at Progress in Language Model Reasoning" paper
GenDoP: Auto-regressive Camera Trajectory Generation as a Director of Photography
Official implementation of HiFlow: Training-free High-Resolution Image Generation with Flow-Aligned Guidance
RelightVid: Temporal-Consistent Diffusion Model for Video Relighting
Official implementation of UnifiedReward & UnifiedReward-Think
Official repository of "Visual-RFT: Visual Reinforcement Fine-Tuning"
Official Repository of paper OmniAlign-V: Towards Enhanced Alignment of MLLMs with Human Preference
Customize your arXiv recommendation every day.
Light-A-Video: Training-free Video Relighting via Progressive Light Fusion
Witness the aha moment of VLM with less than $3.
Scalable RL solution for advanced reasoning of language models
[ICML 2025 Spotlight] An official implementation of VideoRoPE: What Makes for Good Video Rotary Position Embedding?
GPT4Scene: Understand 3D Scenes from Videos with Vision-Language Models
💻 A curated list of papers and resources for multi-modal Graphical User Interface (GUI) agents.
Official code for "BoostStep: Boosting mathematical capability of Large Language Models via improved single-step reasoning"
[CVPR 2025] Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction
Single Image Dehazing Using Scene Depth Ordering
This is the official code for the paper Tailor3D
[NeurIPS 2024 D&B Track] Implementation for "FiVA: Fine-grained Visual Attribute Dataset for Text-to-Image Diffusion Models"
Official implementation of X-Prompt: Towards Universal In-Context Image Generation in Auto-Regressive Vision Language Foundation Models