Stars
SGLang is a fast serving framework for large language models and vision language models.
https://wavespeed.ai/: Best inference performance optimization framework for HuggingFace Diffusers on NVIDIA GPUs.
FP8 flash attention implemented on the Ada architecture using the CUTLASS library.
A latent text-to-image diffusion model
xDiT: A Scalable Inference Engine for Diffusion Transformers (DiTs) with Massive Parallelism
Optimized primitives for collective multi-GPU communication
How to optimize algorithms in CUDA.
A simplified flash-attention implementation using CUTLASS, intended for teaching purposes.
Fast and memory-efficient exact attention
Flash attention tutorial written in Python, Triton, CUDA, and CUTLASS.
Flash Attention in ~100 lines of CUDA (forward pass only)
FlashInfer: Kernel Library for LLM Serving
📚LeetCUDA: Modern CUDA learning notes with PyTorch for beginners🐑; 200+ CUDA/Tensor Core kernels, HGEMM, FA-2 MMA, etc.🔥
An easy-to-use LLM quantization package with user-friendly APIs, based on the GPTQ algorithm.
AutoAWQ implements the AWQ algorithm for 4-bit quantization with a 2x speedup during inference.
FP16xINT4 LLM inference kernel that achieves near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens.
[MLSys 2024 Best Paper Award] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
[ICLR 2024 Spotlight] OmniQuant is a simple and powerful quantization technique for LLMs.
This repository contains integer operators on GPUs for PyTorch.
Fast inference from large language models via speculative decoding.
TensorRT Examples (TensorRT, Jetson Nano, Python, C++)