Stars
Examples of CUDA implementations using CUTLASS CuTe
[NeurIPS'24 Spotlight, ICLR'25, ICML'25] Speeds up long-context LLM inference with approximate, dynamic sparse attention, reducing pre-filling latency by up to 10x…
Tile primitives for speedy kernels
This project shares the technical principles behind large language models along with hands-on experience (LLM engineering, deploying LLM applications).
Flash Attention tutorial written in Python, Triton, CUDA, and CUTLASS
A minimal GPU design in Verilog to learn how GPUs work from the ground up
✨ Light and fast AI assistant. Supports: Web | iOS | macOS | Android | Linux | Windows
This is a Chinese translation of the CUDA programming guide
A high-throughput and memory-efficient inference and serving engine for LLMs
Code for the ICLR 2023 paper "GPTQ: Accurate Post-training Quantization of Generative Pretrained Transformers".
FlashInfer: Kernel Library for LLM Serving
Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads
Several optimization methods for half-precision general matrix multiplication (HGEMM) using tensor cores via the WMMA API and MMA PTX instructions.
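The WMMA API mentioned above comes from CUDA's `mma.h` header. Below is a minimal sketch (not taken from that repository) of how a single warp can compute one 16x16 tile of an FP16 GEMM with FP32 accumulation on tensor cores, assuming row-major A, column-major B, dimensions that are multiples of 16, and a compute capability 7.0+ GPU:

```cuda
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// Sketch: one warp per block computes one 16x16 tile of C = A * B.
// Launch with blockDim = 32 threads and gridDim = (N / 16, M / 16).
__global__ void hgemm_wmma_16x16(const half *A, const half *B, float *C,
                                 int M, int N, int K) {
    int tileM = blockIdx.y;   // which 16-row stripe of C this warp owns
    int tileN = blockIdx.x;   // which 16-column stripe of C this warp owns

    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> aFrag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> bFrag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> cFrag;
    wmma::fill_fragment(cFrag, 0.0f);

    // Walk the K dimension 16 elements at a time, accumulating on tensor cores.
    for (int k = 0; k < K; k += 16) {
        wmma::load_matrix_sync(aFrag, A + tileM * 16 * K + k, K);  // A is row-major, ld = K
        wmma::load_matrix_sync(bFrag, B + tileN * 16 * K + k, K);  // B is col-major, ld = K
        wmma::mma_sync(cFrag, aFrag, bFrag, cFrag);
    }

    // Write the accumulated 16x16 tile back to row-major C.
    wmma::store_matrix_sync(C + tileM * 16 * N + tileN * 16, cFrag, N,
                            wmma::mem_row_major);
}
```

Real HGEMM kernels layer shared-memory staging, double buffering, and larger per-warp tiles on top of this basic pattern.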
CUDA and Triton implementations of Flash Attention with SoftmaxN.
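SoftmaxN, as used in the "off-by-n" attention variants, adds a constant n to the softmax denominator so a row can assign less than full probability mass to its keys. A minimal sketch under that assumption (one thread per row for clarity, not performance; names and layout are illustrative, not the repository's API):

```cuda
#include <math.h>

// softmax_n(x)_i = exp(x_i) / (n + sum_j exp(x_j)), computed with the usual
// max-subtraction trick for numerical stability:
//   y_i = exp(x_i - m) / (n * exp(-m) + sum_j exp(x_j - m)), where m = max(x).
__global__ void softmax_n_rows(const float *scores, float *probs,
                               int rows, int cols, float n) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= rows) return;

    const float *x = scores + (size_t)row * cols;
    float *y = probs + (size_t)row * cols;

    // Row maximum for numerical stability.
    float m = x[0];
    for (int j = 1; j < cols; ++j) m = fmaxf(m, x[j]);

    // Denominator carries the extra "+ n" term, shifted by the same max.
    float denom = n * expf(-m);
    for (int j = 0; j < cols; ++j) denom += expf(x[j] - m);

    for (int j = 0; j < cols; ++j) y[j] = expf(x[j] - m) / denom;
}
```

With n = 0 this reduces to ordinary softmax; the flash-attention versions fold the same extra denominator term into the online-softmax running sum rather than materializing the score matrix.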
Benchmarking popular parallel programming frameworks: performance evaluation (with candid reviews from 小彭老师). Tested so far: Taichi, SyCL, C++, OpenMP, TBB, Mojo