- Shenzhen, Guangdong, China
- https://www.tsinghua.edu.cn/
Stars
Packing Input Frame Context in Next-Frame Prediction Models for Video Generation
[ArXiv 2025] WorldMem: Long-term Consistent World Simulation with Memory
Code for: "Long-Context Autoregressive Video Modeling with Next-Frame Prediction"
Domain Generalization through Distilling CLIP with Language Guidance
Knowledge Distillation using Contrastive Language-Image Pretraining (CLIP) without a teacher model.
(TMM 2025) Official repository of the paper "A Hierarchical Semantic Distillation Framework for Open-Vocabulary Object Detection"
[NeurIPS'24] Training-Free Adaptive Diffusion with Bounded Difference Approximation Strategy
The official PyTorch implementation for Improving Long-Text Alignment for Text-to-Image Diffusion Models (LongAlign)
🤗 LeRobot: Making AI for Robotics more accessible with end-to-end learning
Simple large-scale training of stable diffusion with multi-node support.
GenEval: An object-focused framework for evaluating text-to-image alignment
Holistic Evaluation of Language Models (HELM) is an open source Python framework created by the Center for Research on Foundation Models (CRFM) at Stanford for holistic, reproducible and transparen…
Rare-to-Frequent (R2F), ICLR'25, Spotlight
Autoregressive Model Beats Diffusion: 🦙 Llama for Scalable Image Generation
Code and models for ICML 2024 paper, NExT-GPT: Any-to-Any Multimodal Large Language Model
[CVPR 2025 Oral] Infinity ∞ : Scaling Bitwise AutoRegressive Modeling for High-Resolution Image Synthesis
[ICLR 2025] Repository for Show-o series, One Single Transformer to Unify Multimodal Understanding and Generation.
This project is the official implementation of 'LLMGA: Multimodal Large Language Model based Generation Assistant', ECCV 2024 Oral
[ICCV 2025] LMM4LMM: Benchmarking and Evaluating Large-multimodal Image Generation with LMMs
[NeurIPS 2023 & TPAMI] T2I-CompBench (++) for Compositional Text-to-image Generation Evaluation
Using Low-rank adaptation to quickly fine-tune diffusion models.
PixArt-α: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis
A One-for-All Multimodal Evaluation Toolkit Across Text, Image, Video, and Audio Tasks