Tutorial on building a GPU compiler backend in LLVM
Unofficial description of the CUDA assembly (SASS) instruction sets.
AITemplate is a Python framework that renders neural networks into high-performance CUDA/HIP C++ code. Specialized for FP16 TensorCore (NVIDIA GPU) and MatrixCore (AMD GPU) inference.
Cluster-level matrix unit integration into GPUs, implemented in Chipyard SoC
A verification tool supporting many memory models
Super fast FP32 matrix multiplication on RDNA3
Tile primitives for speedy kernels
PyTorch emulation library for Microscaling (MX)-compatible data formats
Benchmarking Deep Learning operations on different hardware
High-performance GEMM implementation optimized for NVIDIA H100 GPUs, leveraging Hopper architecture's TMA, WGMMA, and Thread Block Clusters for near-peak theoretical performance.
RTL implementation of Flex-DPE.
Almost Native Graphics Layer Engine (local fork)
Comparison of the OpenGL and Vulkan APIs in terms of performance.
A Fusion Code Generator for NVIDIA GPUs (commonly known as "nvFuser")
Source code for 'Data Parallel C++: Mastering DPC++ for Programming of Heterogeneous Systems using C++ and SYCL' by James Reinders, Ben Ashbaugh, James Brodman, Michael Kinsner, John Pennycook, Xin…
Analyzes Large Language Model (LLM) inference, covering aspects such as computation, storage, transmission, and the hardware roofline model, in a user-friendly interface.