Stars
Unofficial description of the CUDA assembly (SASS) instruction sets.
Technically-oriented PDF Collection (Papers, Specs, Decks, Manuals, etc.)
Examples demonstrating available options to program multiple GPUs in a single node or a cluster
Domain-specific language designed to streamline the development of high-performance GPU/CPU/accelerator kernels
Distributed Compiler based on Triton for Parallel Systems
ademeure/DeeperGEMM: crazy optimized version (forked from deepseek-ai/DeepGEMM)
Summary of the specs of commonly used GPUs for training and inference of LLMs
How to optimize various algorithms in CUDA.
Machine Learning Engineering Open Book
DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling
DeepEP: an efficient expert-parallel communication library
FlashMLA: Efficient MLA decoding kernels
Production-tested AI infrastructure tools for efficient AGI development and community-driven innovation
A Flexible Framework for Experiencing Cutting-edge LLM Inference Optimizations
My learning notes and code for ML systems.
📚 LeetCUDA: modern CUDA learning notes with PyTorch for beginners, 200+ CUDA/Tensor Core kernels, HGEMM, FA-2 MMA. 🎉
Efficient Triton Kernels for LLM Training
FlashInfer: Kernel Library for LLM Serving