matlinsas (matlinsas) / Starred · GitHub
357 results for source starred repositories
Tutorial on building a GPU compiler backend in LLVM

C++ 31 9 Updated Jan 11, 2025

Unofficial description of the CUDA assembly (SASS) instruction sets.

Python 107 12 Updated Jul 18, 2025

AITemplate is a Python framework which renders neural network into high performance CUDA/HIP C++ code. Specialized for FP16 TensorCore (NVIDIA GPU) and MatrixCore (AMD GPU) inference.

Python 4,656 383 Updated Apr 1, 2025
Cuda 47 6 Updated Jan 6, 2025

Softmax CUDA kernel :)

Cuda 3 Updated Apr 20, 2024

GPUVerify: a Verifier for GPU Kernels

C# 63 16 Updated Jul 28, 2022

Cluster-level matrix unit integration into GPUs, implemented in Chipyard SoC

Scala 35 1 Updated Jun 12, 2025

A verification tool for many memory models

Java 96 31 Updated Jul 18, 2025

Super fast FP32 matrix multiplication on RDNA3

Assembly 68 2 Updated Mar 30, 2025
Python 448 31 Updated Apr 6, 2025

Derived from Nemes' gpuperftests

C++ 30 5 Updated Jul 11, 2024

A GPU FP32 computation method with Tensor Cores.

C++ 21 3 Updated Nov 20, 2022

Tile primitives for speedy kernels

Cuda 2,523 160 Updated Jul 15, 2025

A GPU cache model for research purposes

C++ 28 7 Updated Nov 4, 2013

PyTorch emulation library for Microscaling (MX)-compatible data formats

Python 259 34 Updated Jun 18, 2025

Benchmarking Deep Learning operations on different hardware

C++ 1,091 237 Updated Apr 25, 2021
C++ 64 18 Updated Apr 18, 2025

High-performance GEMM implementation optimized for NVIDIA H100 GPUs, leveraging Hopper architecture's TMA, WGMMA, and Thread Block Clusters for near-peak theoretical performance.

Cuda 8 Updated Dec 4, 2024

RTL implementation of Flex-DPE.

Verilog 106 29 Updated Feb 22, 2020

Almost Native Graphics Layer Engine (local fork)

C++ 27 24 Updated Jul 25, 2023

Set of OpenCL microbenchmarks

C++ 29 10 Updated Feb 18, 2024

Comparison of OpenGL and Vulkan API in terms of performance.

C++ 85 10 Updated Jul 8, 2022

A Fusion Code Generator for NVIDIA GPUs (commonly known as "nvFuser")

C++ 342 61 Updated Jul 19, 2025
Cuda 13 7 Updated Mar 12, 2025

Source code for 'Data Parallel C++: Mastering DPC++ for Programming of Heterogeneous Systems using C++ and SYCL' by James Reinders, Ben Ashbaugh, James Brodman, Michael Kinsner, John Pennycook, Xin…

CMake 275 87 Updated Mar 26, 2025

A GPU compute-centric 2D renderer.

Rust 2,985 164 Updated Jul 18, 2025

Analyze the inference of Large Language Models (LLMs). Analyze aspects like computation, storage, transmission, and hardware roofline model in a user-friendly interface.

Python 511 60 Updated Sep 11, 2024