Stars
Tutorial on building a GPU compiler backend in LLVM
Unofficial description of the CUDA assembly (SASS) instruction sets.
AITemplate is a Python framework that renders neural networks into high-performance CUDA/HIP C++ code. Specialized for FP16 TensorCore (NVIDIA GPU) and MatrixCore (AMD GPU) inference.
Cluster-level matrix unit integration into GPUs, implemented in Chipyard SoC
Super fast FP32 matrix multiplication on RDNA3
PyTorch emulation library for Microscaling (MX)-compatible data formats
ROCm - AMDGPU Compute Application Binary Interface
Benchmarking Deep Learning operations on different hardware
High-performance GEMM implementation optimized for NVIDIA H100 GPUs, leveraging Hopper architecture's TMA, WGMMA, and Thread Block Clusters for near-peak theoretical performance.
RTL implementation of Flex-DPE.
Almost Native Graphics Layer Engine (local fork)
Performance comparison of the OpenGL and Vulkan APIs.
A Fusion Code Generator for NVIDIA GPUs (commonly known as "nvFuser")
Source code for 'Data Parallel C++: Mastering DPC++ for Programming of Heterogeneous Systems using C++ and SYCL' by James Reinders, Ben Ashbaugh, James Brodman, Michael Kinsner, John Pennycook, Xin…