Stars
Analyze the inference of Large Language Models (LLMs): computation, storage, transmission, and the hardware roofline model, in a user-friendly interface.
Compare different hardware platforms via the Roofline Model for LLM inference tasks (a minimal roofline sketch appears after this list).
Virtual whiteboard for sketching hand-drawn like diagrams
KernelBench: Can LLMs Write GPU Kernels? - Benchmark with Torch -> CUDA problems
FlashInfer: Kernel Library for LLM Serving
Triton-based implementation of Sparse Mixture of Experts (a minimal routing sketch appears after this list).
Radial Attention Official Implementation
GitHub mirror of the triton-lang/triton repo.
Introduction to Parallel Programming class code
MAGI-1: Autoregressive Video Generation at Scale
A collection of memory-efficient attention operators implemented in the Triton language (a chunked-attention sketch appears after this list).
⚡️FFPA: Extends FlashAttention-2 with Split-D, achieving ~O(1) SRAM complexity for large headdim; 1.8x~3x↑ vs SDPA.🎉
FP8 flash attention implemented on the Ada architecture using the CUTLASS library.
📚LeetCUDA: Modern CUDA learning notes with PyTorch for beginners🐑; 200+ CUDA kernels, Tensor Cores, HGEMM, FA-2 MMA.🎉
Research project on scaling GPU-accelerated data management to large data volumes. Code base of two SIGMOD papers.
Distributed Compiler based on Triton for Parallel Systems
[CVPR 2025 Highlight] TinyFusion: Diffusion Transformers Learned Shallow
A fast communication-overlapping library for tensor/expert parallelism on GPUs.
[ICML2025] Sparse VideoGen: Accelerating Video Diffusion Transformers with Spatial-Temporal Sparsity
[NeurIPS 2023] Structural Pruning for Diffusion Models
Production-tested AI infrastructure tools for efficient AGI development and community-driven innovation
A high-performance distributed file system designed to address the challenges of AI training and inference workloads.
SpargeAttention: Training-free sparse attention that can accelerate inference for any model.
Analyze computation-communication overlap in DeepSeek-V3/R1.
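
The roofline entries above reduce to one formula: attainable throughput = min(peak compute, memory bandwidth × arithmetic intensity). A minimal sketch of that comparison, using placeholder hardware numbers rather than anything from those repositories:

```python
def attainable_flops(peak_flops, mem_bandwidth, arithmetic_intensity):
    """Roofline: performance is capped by min(compute roof, bandwidth * intensity)."""
    return min(peak_flops, mem_bandwidth * arithmetic_intensity)

# Placeholder hardware numbers (FLOP/s and bytes/s), purely for illustration.
platforms = {
    "accelerator_a": (312e12, 2.0e12),
    "accelerator_b": (120e12, 0.9e12),
}

# Decode-phase GEMVs in LLM inference sit near ~1 FLOP/byte (memory-bound);
# prefill GEMMs have much higher intensity and can reach the compute roof.
for name, (flops, bw) in platforms.items():
    for intensity in (1.0, 64.0, 512.0):  # FLOPs per byte moved
        perf = attainable_flops(flops, bw, intensity)
        print(f"{name}: AI={intensity:>6.1f} FLOP/B -> {perf:.2e} FLOP/s")
```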
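For the sparse Mixture-of-Experts entry, a minimal top-k routing sketch in plain PyTorch; the shapes, gate weights, and per-expert loop here are illustrative assumptions, not the repository's fused Triton kernels.

```python
import torch

def sparse_moe(x, gate_w, experts, top_k=2):
    """x: [tokens, d]; gate_w: [d, n_experts]; experts: modules mapping [n, d] -> [n, d]."""
    probs = (x @ gate_w).softmax(dim=-1)                   # routing probabilities [tokens, n_experts]
    weights, idx = torch.topk(probs, top_k, dim=-1)        # keep only the top_k experts per token
    weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize over the chosen experts
    out = torch.zeros_like(x)
    for e, expert in enumerate(experts):
        rows, slots = (idx == e).nonzero(as_tuple=True)    # tokens that routed to expert e
        if rows.numel():
            out[rows] += weights[rows, slots, None] * expert(x[rows])
    return out

# usage sketch (hypothetical sizes):
# experts = [torch.nn.Linear(512, 512) for _ in range(8)]
# y = sparse_moe(torch.randn(16, 512), torch.randn(512, 8), experts)
```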
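For the memory-efficient attention entry, a minimal chunked online-softmax sketch in plain PyTorch showing why the full [seq, seq] score matrix is never materialized; this is a reference-level illustration of the general FlashAttention-style idea, not the repository's Triton operators.

```python
import torch

def chunked_attention(q, k, v, chunk=128):
    """q, k, v: [seq, d]; returns softmax(q @ k.T / sqrt(d)) @ v without a [seq, seq] matrix."""
    scale = q.shape[-1] ** -0.5
    m = torch.full((q.shape[0], 1), float("-inf"), dtype=q.dtype, device=q.device)  # running row max
    l = torch.zeros(q.shape[0], 1, dtype=q.dtype, device=q.device)                  # running denominator
    acc = torch.zeros_like(q)                                                       # running numerator
    for start in range(0, k.shape[0], chunk):
        s = (q @ k[start:start + chunk].T) * scale             # scores against one key/value block
        m_new = torch.maximum(m, s.max(dim=-1, keepdim=True).values)
        p = (s - m_new).exp()
        rescale = (m - m_new).exp()                            # correct previously accumulated sums
        l = l * rescale + p.sum(dim=-1, keepdim=True)
        acc = acc * rescale + p @ v[start:start + chunk]
        m = m_new
    return acc / l

# sanity check against the dense reference:
# q, k, v = (torch.randn(512, 64) for _ in range(3))
# ref = torch.softmax(q @ k.T / 64 ** 0.5, dim=-1) @ v
# assert torch.allclose(chunked_attention(q, k, v), ref, atol=1e-5)
```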