Stars
Distributed Triton for Parallel Systems
Domain-specific language designed to streamline the development of high-performance GPU/CPU/accelerator kernels
DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling
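"Fine-grained scaling" here means the FP8 operands carry one FP32 scale factor per small block (e.g. per 128 elements along K) rather than one scale per tensor, and partial products are rescaled block by block as the reduction advances. A toy sketch of that idea (illustrative only, not DeepGEMM's implementation; assumes e4m3 FP8 from cuda_fp8.h and K divisible by 128):

```cuda
#include <cuda_fp8.h>

// Dot product of one K-row of A with one K-column of B, where both
// operands are FP8 (e4m3) and carry one FP32 scale per 128-wide K block.
// A toy scalar loop to show the scaling scheme, not a real GEMM.
__device__ float fp8_dot_scaled(const __nv_fp8_e4m3* a, const __nv_fp8_e4m3* b,
                                const float* sa, const float* sb, int K) {
    float acc = 0.0f;
    for (int blk = 0; blk < K / 128; ++blk) {
        float partial = 0.0f;
        for (int k = 0; k < 128; ++k) {
            int i = blk * 128 + k;
            partial += static_cast<float>(a[i]) * static_cast<float>(b[i]);
        }
        acc += partial * sa[blk] * sb[blk];  // apply both block scales once per block
    }
    return acc;
}
```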
Mihomo CLI client on Linux. Formerly `clashrup`.
Efficient Distributed GPU Programming for Exascale, an SC/ISC Tutorial
A list of tutorials, papers, talks, and open-source projects for emerging compilers and architectures
IREE's PyTorch Frontend, based on Torch Dynamo.
A fast communication-overlapping library for tensor/expert parallelism on GPUs.
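The underlying pattern in communication-overlapping libraries is to split work into chunks so the communication for chunk c runs concurrently with the compute for chunk c+1. A generic two-stream sketch of that pattern (plain CUDA streams and a device-to-host copy standing in for the library's collectives; all names are illustrative):

```cuda
#include <cuda_runtime.h>

// Stand-in compute kernel: scales one chunk in place.
__global__ void scale_chunk(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

// Copy-out of chunk c overlaps with compute of chunk c+1: compute and
// copy run on separate streams, with an event enforcing the dependency.
// h_buf should be pinned (cudaMallocHost) for truly asynchronous copies.
void pipelined(float* d_buf, float* h_buf, int chunks, int chunk_elems) {
    cudaStream_t compute, copy;
    cudaEvent_t chunk_done;
    cudaStreamCreate(&compute);
    cudaStreamCreate(&copy);
    cudaEventCreate(&chunk_done);

    for (int c = 0; c < chunks; ++c) {
        float* d_chunk = d_buf + c * chunk_elems;
        scale_chunk<<<(chunk_elems + 255) / 256, 256, 0, compute>>>(d_chunk, chunk_elems);
        cudaEventRecord(chunk_done, compute);      // chunk c is finished
        cudaStreamWaitEvent(copy, chunk_done, 0);  // copy waits only on chunk c
        cudaMemcpyAsync(h_buf + c * chunk_elems, d_chunk,
                        chunk_elems * sizeof(float),
                        cudaMemcpyDeviceToHost, copy);
    }
    cudaStreamSynchronize(copy);
    cudaStreamDestroy(compute);
    cudaStreamDestroy(copy);
    cudaEventDestroy(chunk_done);
}
```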
FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens.
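The core trick in FP16xINT4 kernels is dequantizing packed INT4 weights to FP16 on the fly inside the GEMM main loop, so the memory traffic stays 4-bit while the math runs in half precision. A minimal sketch of the unpack-and-scale step (not this repo's actual code; the symmetric zero-point of 8 and all names are illustrative assumptions):

```cuda
#include <cuda_fp16.h>

// Unpack two 4-bit weights from one byte and rescale them to FP16.
// Assumes symmetric quantization with zero-point 8: w = (q - 8) * scale.
__device__ __half2 dequant_int4x2(unsigned char packed, float scale) {
    int lo = (packed & 0x0F) - 8;         // low nibble
    int hi = ((packed >> 4) & 0x0F) - 8;  // high nibble
    return __floats2half2_rn(lo * scale, hi * scale);
}
```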
Efficient Triton Kernels for LLM Training
SGLang is a fast serving framework for large language models and vision language models.
A PyTorch native platform for training generative AI models
A collection of out-of-tree LLVM passes for teaching and learning
PyTorch native quantization and sparsity for training and inference
BitBLAS is a library to support mixed-precision matrix multiplications, especially for quantized LLM deployment.
This project shares the technical principles behind large language models along with hands-on experience (LLM engineering and production deployment of LLM applications).
The Torch-MLIR project aims to provide first class support from the PyTorch ecosystem to the MLIR ecosystem.
📚LeetCUDA: modern CUDA learning notes with PyTorch for beginners🐑; 200+ CUDA/Tensor Core kernels, HGEMM, FA-2 MMA, etc.🔥
A model compilation solution for various hardware
ncnn is a high-performance neural network inference framework optimized for the mobile platform
A simple high performance CUDA GEMM implementation.
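For reference, the core of such a repo is usually a shared-memory-tiled kernel along these lines (a minimal sketch computing C = A * B for row-major FP32 matrices, assuming M, N, K are multiples of the tile size; all names are illustrative, not taken from the repo):

```cuda
#include <cuda_runtime.h>

#define TILE 16  // tile edge; assumes M, N, K are multiples of TILE

// Each block computes one TILE x TILE tile of C, staging tiles of A and B
// through shared memory to reuse each loaded element TILE times.
__global__ void sgemm_tiled(const float* A, const float* B, float* C,
                            int M, int N, int K) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    // March a pair of tiles along the K dimension.
    for (int t = 0; t < K / TILE; ++t) {
        As[threadIdx.y][threadIdx.x] = A[row * K + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * N + col] = acc;
}
```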
Course project for the undergraduate Compiler Principles lab at the School of Computer Science, Beihang University. The source language is a Pascal-like language, the target is x86 assembly, and the compiler is implemented in C++.