Stars
FlashInfer: Kernel Library for LLM Serving
Radial Attention Official Implementation
TensorRT-LLM provides an easy-to-use Python API to define Large Language Models (LLMs) and supports state-of-the-art optimizations for efficient inference on NVIDIA GPUs.
A Fusion Code Generator for NVIDIA GPUs (commonly known as "nvFuser")
My annotated papers and meeting recordings for the EleutherAI ML Performance research paper reading group
Mirage: Automatically Generating Fast GPU Kernels without Programming in Triton/CUDA
[NeurIPS 2024] One-Step Effective Diffusion Network for Real-World Image Super-Resolution
End-to-end recipes for optimizing diffusion models with torchao and diffusers (inference and FP8 training).
(CVPR 2025) From Slow Bidirectional to Fast Autoregressive Video Diffusion Models
FastVideo is a unified framework for accelerated video generation.
From Reusing to Forecasting: Accelerating Diffusion Models with TaylorSeers
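The TaylorSeers entry above replaces naive feature reuse with forecasting. A toy sketch of that idea, assuming a first-order finite-difference Taylor extrapolation over two cached feature snapshots (names are illustrative, not the paper's implementation):

```python
# Toy sketch: forecast the next diffusion-step feature from two cached
# snapshots via a first-order Taylor expansion, instead of reusing the
# stale feature unchanged. Hypothetical helper, not the repo's API.

def taylor_forecast(f_prev, f_curr, dt=1.0):
    """First-order forecast: f(t + dt) ~= f(t) + (f(t) - f(t - 1)) * dt."""
    return [c + (c - p) * dt for p, c in zip(f_prev, f_curr)]

# Two cached feature snapshots from consecutive denoising steps:
feat_t1 = [0.10, 0.20]
feat_t2 = [0.12, 0.26]

# Extrapolate one step ahead rather than recomputing the network:
pred = taylor_forecast(feat_t1, feat_t2)
```

Higher-order variants keep more snapshots and add curvature terms; the trade-off is a little extra cache memory for a forecast that tracks fast-moving features better than plain reuse.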
A pipeline parallel training script for diffusion models.
Pocket Flow: Codebase to Tutorial
A TTS model capable of generating ultra-realistic dialogue in one pass.
🎬 3.7× faster video generation E2E 🖼️ 1.6× faster image generation E2E ⚡ ColumnSparseAttn 9.3× vs FlashAttn‑3 💨 ColumnSparseGEMM 2.5× vs cuBLAS
A unified library of state-of-the-art model optimization techniques such as quantization, pruning, distillation, and speculative decoding. It compresses deep learning models for downstream deployment.
Self-contained, minimalistic implementation of diffusion models with PyTorch.
The ultimate training toolkit for finetuning diffusion models
[NeurIPS 2024] AsyncDiff: Parallelizing Diffusion Models by Asynchronous Denoising
[CVPR 2024] DeepCache: Accelerating Diffusion Models for Free
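The DeepCache entry above accelerates diffusion inference by reusing expensive intermediate features across nearby denoising steps. A minimal sketch of that caching pattern, assuming a hypothetical `deep_fn` standing in for the costly deep layers (purely illustrative, not the repo's code):

```python
# Toy sketch of cross-step feature caching (the DeepCache idea):
# recompute the expensive "deep" features only every cache_interval
# steps and reuse the cached result in between.

def run_with_cache(steps, deep_fn, cache_interval=3):
    """Call deep_fn only every cache_interval steps; reuse otherwise."""
    cache, outputs, calls = None, [], 0
    for step in range(steps):
        if step % cache_interval == 0:
            cache = deep_fn(step)   # expensive recomputation
            calls += 1
        outputs.append(cache)       # cheap reuse on the other steps
    return outputs, calls

# 10 denoising steps, but the expensive path runs only at steps 0, 3, 6, 9:
outputs, calls = run_with_cache(10, lambda t: t * 2, cache_interval=3)
```

The real method exploits the observation that a U-Net's deepest features change slowly between adjacent steps, so only the shallow layers need to run every step.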
Accessible large language models via k-bit quantization for PyTorch.
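The entry above concerns k-bit quantization. A minimal sketch of the core absmax scheme, assuming hypothetical helper names (this is the general idea, not the library's actual API):

```python
# Minimal sketch of absmax 8-bit quantization: scale floats by the
# absolute maximum so the largest value maps to the top of the signed
# integer range, store integers plus one scale factor.

def quantize_absmax(values, bits=8):
    """Map floats to signed integers in [-2^(bits-1), 2^(bits-1) - 1]."""
    qmax = 2 ** (bits - 1) - 1                      # 127 for int8
    scale = (max(abs(v) for v in values) / qmax) or 1.0
    return [round(v / scale) for v in values], scale

def dequantize_absmax(quantized, scale):
    """Recover approximate floats from integers and the stored scale."""
    return [q * scale for q in quantized]

weights = [0.12, -0.5, 0.33, 1.0]
q, s = quantize_absmax(weights)
recovered = dequantize_absmax(q, s)
```

Per-block scales (one scale per small group of weights) limit the damage a single outlier does to everyone else's precision, which is why practical k-bit schemes quantize block-wise rather than per-tensor.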
OneDiff: An out-of-the-box acceleration library for diffusion models.
Applied AI experiments and examples for PyTorch
fanshiqing / grouped_gemm
Forked from tgale96/grouped_gemm. PyTorch bindings for CUTLASS grouped GEMM.
Combining Teacache with xDiT to Accelerate Visual Generation Models
XAttention: Block Sparse Attention with Antidiagonal Scoring
Context parallel attention that accelerates DiT model inference with dynamic caching (https://wavespeed.ai/)