- The Chinese University of Hong Kong
- Hong Kong
- https://harryhsing.github.io/
- in/xingzhenghao
- @onehsing
Stars
Vision as a Dialect: Unifying Visual Understanding and Generation via Text-Aligned Representations
Ego-R1: Chain-of-Tool-Thought for Ultra-Long Egocentric Video Reasoning
[ICML 2025] PyTorch Implementation of "OmniAudio: Generating Spatial Audio from 360-Degree Video"
[CVPR 2025] The First Investigation of CoT Reasoning (RL, TTS, Reflection) in Image Generation
EchoInk-R1: Exploring Audio-Visual Reasoning in Multimodal LLMs via Reinforcement Learning [🔥The Exploration of R1 for General Audio-Visual Reasoning with Qwen2.5-Omni]
EchoTraffic: Enhancing Traffic Anomaly Understanding with Audio-Visual Insights (CVPR 2025)
PyTorch implementation of Audio Flamingo 2: An Audio-Language Model with Long-Audio Understanding and Expert Reasoning Abilities.
Train transformer language models with reinforcement learning.
Repository for the ACL 2025 Findings paper "From Specific-MLLMs to Omni-MLLMs: A Survey on MLLMs Aligned with Multi-modalities"
DeerFlow is a community-driven Deep Research framework, combining language models with tools like web search, crawling, and Python execution, while contributing back to the open-source community.
🔥🔥🔥 Latest Papers, Code, and Datasets on Vid-LLMs.
AudioBench: A Universal Benchmark for Audio Large Language Models
Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey
Implementation for Describe Anything: Detailed Localized Image and Video Captioning
The official repo for "Vidi: Large Multimodal Models for Video Understanding and Editing"
Let's make video diffusion practical!
SFT or RL? An Early Investigation into Training R1-Like Reasoning Large Vision-Language Models
[arXiv 2025] Efficient Reasoning Models: A Survey
Qwen2.5-Omni is an end-to-end multimodal model by Qwen team at Alibaba Cloud, capable of understanding text, audio, vision, video, and performing real-time speech generation.
An Open-source RL System from ByteDance Seed and Tsinghua AIR
The official repo of Qwen (通义千问) chat & pretrained large language model proposed by Alibaba Cloud.
A Comprehensive Survey on Evaluating Reasoning Capabilities in Multimodal Large Language Models.
This repository provides a valuable reference for researchers in the field of multimodality. Start your exploration of RL-based reasoning MLLMs here!