Stars
A series of GPU optimization topics that explains in detail how to optimize CUDA kernels, covering several basic kernel optimizations, including: elementwise, reduce, s…
A small build system with a focus on speed
[ICLR2025 Spotlight] SVDQuant: Absorbing Outliers by Low-Rank Components for 4-Bit Diffusion Models
A curated collection of resources, tutorials, and best practices for learning and mastering NVIDIA CUTLASS
Fine-tuning & Reinforcement Learning for LLMs. 🦥 Train Qwen3, Llama 4, DeepSeek-R1, Gemma 3, TTS 2x faster with 70% less VRAM.
PyTorch compiler that accelerates training and inference. Get built-in optimizations for performance, memory, parallelism, and easily write your own.
IREE's PyTorch Frontend, based on Torch Dynamo.
A retargetable MLIR-based machine learning compiler and runtime toolkit.
KernelBench: Can LLMs Write GPU Kernels? - Benchmark with Torch -> CUDA problems
Large Language Model Text Generation Inference
The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
Official Implementation of EAGLE-1 (ICML'24), EAGLE-2 (EMNLP'24), and EAGLE-3.
An extremely fast Python package and project manager, written in Rust.
[TMLR 2024] Efficient Large Language Models: A Survey
⏩ Create, share, and use custom AI code assistants with our open-source IDE extensions and hub of models, rules, prompts, docs, and other building blocks
TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorR… (a minimal usage sketch of the Python API appears after this list)
FlashInfer: Kernel Library for LLM Serving
Tools for merging pretrained large language models.
Simple and efficient pytorch-native transformer text generation in <1000 LOC of python.
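
To make the TensorRT-LLM entry above concrete, here is a minimal sketch of its high-level Python LLM API. It assumes a recent tensorrt_llm release with the LLM API available; the model ID and sampling values are illustrative placeholders, not prescriptive settings.

    # Minimal sketch of the TensorRT-LLM high-level Python API.
    # Assumes a recent tensorrt_llm release; the model ID and sampling
    # parameters below are illustrative, not prescriptive.
    from tensorrt_llm import LLM, SamplingParams

    # Load a Hugging Face model and build a TensorRT engine for it
    # (weights are downloaded on first use).
    llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

    # Sampling settings for generation.
    params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

    # Generate completions for a batch of prompts and print the text.
    for output in llm.generate(["What is a CUDA kernel?"], params):
        print(output.outputs[0].text)

The API deliberately mirrors other Python serving libraries: define the model once, then issue batched generate calls with per-request sampling parameters.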