Stars
Examples of CUDA implementations using CUTLASS CuTe
[NeurIPS'24 Spotlight, ICLR'25, ICML'25] Speeds up long-context LLM inference with approximate, dynamic sparse attention, reducing pre-filling latency by up to 10x…
Tile primitives for speedy kernels
This project shares the technical principles behind large language models along with hands-on experience (LLM engineering, deploying LLM applications).
Flash Attention tutorial written in Python, Triton, CUDA, and CUTLASS
A minimal GPU design in Verilog to learn how GPUs work from the ground up
✨ Light and fast AI assistant. Supports: Web | iOS | macOS | Android | Linux | Windows
This is a Chinese translation of the CUDA programming guide
A high-throughput and memory-efficient inference and serving engine for LLMs
Code for the ICLR 2023 paper "GPTQ: Accurate Post-training Quantization of Generative Pretrained Transformers".
FlashInfer: Kernel Library for LLM Serving
Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads
Several optimization methods for half-precision general matrix multiplication (HGEMM) using tensor cores via the WMMA API and MMA PTX instructions.
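The WMMA API mentioned above comes from CUDA's `mma.h` header. Below is a minimal sketch (not taken from that repository) of how a single warp can compute one 16x16 tile of an FP16 GEMM with FP32 accumulation on tensor cores, assuming row-major A, column-major B, dimensions that are multiples of 16, and a compute capability 7.0+ GPU:

```cuda
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// Sketch: one warp per block computes one 16x16 tile of C = A * B.
// Launch with blockDim = 32 threads and gridDim = (N / 16, M / 16).
__global__ void hgemm_wmma_16x16(const half *A, const half *B, float *C,
                                 int M, int N, int K) {
    int tileM = blockIdx.y;   // which 16-row stripe of C this warp owns
    int tileN = blockIdx.x;   // which 16-column stripe of C this warp owns

    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> aFrag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> bFrag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> cFrag;
    wmma::fill_fragment(cFrag, 0.0f);

    // Walk the K dimension 16 elements at a time, accumulating on tensor cores.
    for (int k = 0; k < K; k += 16) {
        wmma::load_matrix_sync(aFrag, A + tileM * 16 * K + k, K);  // A is row-major, ld = K
        wmma::load_matrix_sync(bFrag, B + tileN * 16 * K + k, K);  // B is col-major, ld = K
        wmma::mma_sync(cFrag, aFrag, bFrag, cFrag);
    }

    // Write the accumulated 16x16 tile back to row-major C.
    wmma::store_matrix_sync(C + tileM * 16 * N + tileN * 16, cFrag, N,
                            wmma::mem_row_major);
}
```

Real HGEMM kernels layer shared-memory staging, double buffering, and larger per-warp tiles on top of this basic pattern.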
CUDA and Triton implementations of Flash Attention with SoftmaxN.
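SoftmaxN, as used in the "off-by-n" attention variants, adds a constant n to the softmax denominator so a row can assign less than full probability mass to its keys. A minimal sketch under that assumption (one thread per row for clarity, not performance; names and layout are illustrative, not the repository's API):

```cuda
#include <math.h>

// softmax_n(x)_i = exp(x_i) / (n + sum_j exp(x_j)), computed with the usual
// max-subtraction trick for numerical stability:
//   y_i = exp(x_i - m) / (n * exp(-m) + sum_j exp(x_j - m)), where m = max(x).
__global__ void softmax_n_rows(const float *scores, float *probs,
                               int rows, int cols, float n) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= rows) return;

    const float *x = scores + (size_t)row * cols;
    float *y = probs + (size_t)row * cols;

    // Row maximum for numerical stability.
    float m = x[0];
    for (int j = 1; j < cols; ++j) m = fmaxf(m, x[j]);

    // Denominator carries the extra "+ n" term, shifted by the same max.
    float denom = n * expf(-m);
    for (int j = 0; j < cols; ++j) denom += expf(x[j] - m);

    for (int j = 0; j < cols; ++j) y[j] = expf(x[j] - m) / denom;
}
```

With n = 0 this reduces to ordinary softmax; the flash-attention versions fold the same extra denominator term into the online-softmax running sum rather than materializing the score matrix.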
Benchmarking popular parallel programming frameworks: performance evaluation (with candid reviews from 小彭老师). Tested so far: Taichi, SyCL, C++, OpenMP, TBB, Mojo