matlinsas

3 followers · 2 following

seu
china,nanjing

Lists (1)

GPU-Related

5 repositories

Stars

OSU-STARLAB / UVM_benchmark

Roff 27 7 Updated Sep 9, 2020

adamtiger / tinyGPUlang

Tutorial on building a gpu compiler backend in LLVM

C++ 30 9 Updated Jan 11, 2025

0xD0GF00D / DocumentSASS

Unofficial description of the CUDA assembly (SASS) instruction sets.

Python 105 11 Updated Mar 10, 2025

facebookincubator / AITemplate

AITemplate is a Python framework which renders neural network into high performance CUDA/HIP C++ code. Specialized for FP16 TensorCore (NVIDIA GPU) and MatrixCore (AMD GPU) inference.

Python 4,654 381 Updated Apr 1, 2025

SzymonOzog / FastSoftmax

Cuda 43 6 Updated Jan 6, 2025

fattorib / CudaSoftmax

Softmax CUDA kernel :)

Cuda 2 Updated Apr 20, 2024

mc-imperial / gpuverify

GPUVerify: a Verifier for GPU Kernels

C# 62 16 Updated Jul 28, 2022

ucb-bar / virgo

Cluster-level matrix unit integration into GPUs, implemented in Chipyard SoC

Scala 34 1 Updated Jun 12, 2025

hernanponcedeleon / Dat3M

A verification tool for many memory models

Java 96 31 Updated Jul 2, 2025

seb-v / fp32_sgemm_amd

Super fast FP32 matrix multiplication on RDNA3

Assembly 67 2 Updated Mar 30, 2025

tinygrad / 7900xtx

Python 448 31 Updated Apr 6, 2025

ChipsandCheese / gpuperftests

Derived from Nemes' gpuperftests

C++ 30 5 Updated Jul 11, 2024

reeselevine / webgpu-litmus

WGSL 14 1 Updated Dec 8, 2023

JohndeVostok / APE

A GPU FP32 computation method with Tensor Cores.

C++ 20 3 Updated Nov 20, 2022

HazyResearch / ThunderKittens

Tile primitives for speedy kernels

Cuda 2,498 159 Updated Jul 3, 2025

tue-es / gpu-cache-model

A GPU cache model for research purposes

C++ 28 7 Updated Nov 4, 2013

microsoft / microxcaling

PyTorch emulation library for Microscaling (MX)-compatible data formats

Python 255 33 Updated Jun 18, 2025

ROCm / ROCm-ComputeABI-Doc

ROCm - AMDGPU Compute Application Binary Interface

41 12 Updated Mar 19, 2022

baidu-research / DeepBench

Benchmarking Deep Learning operations on different hardware

C++ 1,089 236 Updated Apr 25, 2021

0x5ec1ab / gpu-tlb

C++ 62 17 Updated Apr 18, 2025

Faraz9877 / H100_GEMM

High-performance GEMM implementation optimized for NVIDIA H100 GPUs, leveraging Hopper architecture's TMA, WGMMA, and Thread Block Clusters for near-peak theoretical performance.

Cuda 8 Updated Dec 4, 2024

georgia-tech-synergy-lab / SIGMA

RTL implementation of Flex-DPE.

Verilog 103 29 Updated Feb 22, 2020

mikolalysenko / angle

Almost Native Graphics Layer Engine (local fork)

C++ 27 24 Updated Jul 25, 2023

PeterTh / uCLbench

Set of OpenCL microbenchmarks

C++ 29 10 Updated Feb 18, 2024

RippeR37 / GL_vs_VK

Comparison of OpenGL and Vulkan API in terms of performance.

C++ 85 10 Updated Jul 8, 2022

NVIDIA / Fuser

A Fusion Code Generator for NVIDIA GPUs (commonly known as "nvFuser")

C++ 342 61 Updated Jul 5, 2025

ColfaxResearch / cfx-article-src

C++ 123 26 Updated May 7, 2025

Ratbuyer / h100-features

Cuda 13 7 Updated Mar 12, 2025

Apress / data-parallel-CPP

Source code for 'Data Parallel C++: Mastering DPC++ for Programming of Heterogeneous Systems using C++ and SYCL' by James Reinders, Ben Ashbaugh, James Brodman, Michael Kinsner, John Pennycook, Xin…

CMake 275 87 Updated Mar 26, 2025

linebender / vello

A GPU compute-centric 2D renderer.

Rust 2,946 164 Updated Jul 4, 2025
