matlinsas (matlinsas) / Starred · GitHub

Tutorial on building a GPU compiler backend in LLVM

C++ 30 9 Updated Jan 11, 2025

Unofficial description of the CUDA assembly (SASS) instruction sets.

Python 105 11 Updated Mar 10, 2025

AITemplate is a Python framework which renders neural network into high performance CUDA/HIP C++ code. Specialized for FP16 TensorCore (NVIDIA GPU) and MatrixCore (AMD GPU) inference.

Python 4,654 381 Updated Apr 1, 2025
Cuda 43 6 Updated Jan 6, 2025

Softmax CUDA kernel :)

Cuda 2 Updated Apr 20, 2024

GPUVerify: a Verifier for GPU Kernels

C# 62 16 Updated Jul 28, 2022

Cluster-level matrix unit integration into GPUs, implemented in Chipyard SoC

Scala 34 1 Updated Jun 12, 2025

A verification tool for many memory models

Java 96 31 Updated Jul 2, 2025

Super fast FP32 matrix multiplication on RDNA3

Assembly 67 2 Updated Mar 30, 2025
Python 448 31 Updated Apr 6, 2025

Derived from Nemes' gpuperftests

C++ 30 5 Updated Jul 11, 2024

A GPU FP32 computation method with Tensor Cores.

C++ 20 3 Updated Nov 20, 2022

Tile primitives for speedy kernels

Cuda 2,498 159 Updated Jul 3, 2025

A GPU cache model for research purposes

C++ 28 7 Updated Nov 4, 2013

PyTorch emulation library for Microscaling (MX)-compatible data formats

Python 255 33 Updated Jun 18, 2025

ROCm - AMDGPU Compute Application Binary Interface

41 12 Updated Mar 19, 2022

Benchmarking Deep Learning operations on different hardware

C++ 1,089 236 Updated Apr 25, 2021
C++ 62 17 Updated Apr 18, 2025

High-performance GEMM implementation optimized for NVIDIA H100 GPUs, leveraging Hopper architecture's TMA, WGMMA, and Thread Block Clusters for near-peak theoretical performance.

Cuda 8 Updated Dec 4, 2024

RTL implementation of Flex-DPE.

Verilog 103 29 Updated Feb 22, 2020

Almost Native Graphics Layer Engine (local fork)

C++ 27 24 Updated Jul 25, 2023

Set of OpenCL microbenchmarks

C++ 29 10 Updated Feb 18, 2024

Comparison of OpenGL and Vulkan API in terms of performance.

C++ 85 10 Updated Jul 8, 2022

A Fusion Code Generator for NVIDIA GPUs (commonly known as "nvFuser")

C++ 342 61 Updated Jul 5, 2025
Cuda 13 7 Updated Mar 12, 2025

Source code for 'Data Parallel C++: Mastering DPC++ for Programming of Heterogeneous Systems using C++ and SYCL' by James Reinders, Ben Ashbaugh, James Brodman, Michael Kinsner, John Pennycook, Xin…

CMake 275 87 Updated Mar 26, 2025

A GPU compute-centric 2D renderer.

Rust 2,946 164 Updated Jul 4, 2025