Tutorial on building a GPU compiler backend in LLVM
Unofficial description of the CUDA assembly (SASS) instruction sets.
AITemplate is a Python framework that renders neural networks into high-performance CUDA/HIP C++ code. Specialized for FP16 TensorCore (NVIDIA GPU) and MatrixCore (AMD GPU) inference.
Cluster-level matrix unit integration into GPUs, implemented in Chipyard SoC
A verification tool supporting many memory models
Super fast FP32 matrix multiplication on RDNA3
Tile primitives for speedy kernels
PyTorch emulation library for Microscaling (MX)-compatible data formats
Benchmarking Deep Learning operations on different hardware
High-performance GEMM implementation optimized for NVIDIA H100 GPUs, leveraging Hopper architecture's TMA, WGMMA, and Thread Block Clusters for near-peak theoretical performance.
RTL implementation of Flex-DPE.
Almost Native Graphics Layer Engine (local fork)
Comparison of the OpenGL and Vulkan APIs in terms of performance.
A Fusion Code Generator for NVIDIA GPUs (commonly known as "nvFuser")
Source code for 'Data Parallel C++: Mastering DPC++ for Programming of Heterogeneous Systems using C++ and SYCL' by James Reinders, Ben Ashbaugh, James Brodman, Michael Kinsner, John Pennycook, Xin…
Analyzes Large Language Model (LLM) inference, covering aspects such as computation, storage, transmission, and the hardware roofline model, in a user-friendly interface.