Stars
TritonBench: Benchmarking Large Language Model Capabilities for Generating Triton Operators
KernelBench: Can LLMs Write GPU Kernels? - Benchmark with Torch -> CUDA problems
NVIDIA Linux open GPU kernel module source
CUDATracePreload is a dynamic tracing tool for CUDA and NCCL API calls (see the interposition sketch after this list).
An interference-aware scheduler for fine-grained GPU sharing
Online CUDA Occupancy Calculator (the underlying arithmetic is sketched after this list)
Measure and optimize the energy consumption of your AI applications! (an NVML-based measurement sketch follows the list)
LineFS: Efficient SmartNIC Offload of a Distributed File System with Pipeline Parallelism
An unnecessarily tiny implementation of GPT-2 in NumPy.
Official code repository for "CoVA: Exploiting Compressed-Domain Analysis to Accelerate Video Analytics [USENIX ATC 22]"
Cluster Far Mem: a framework for running single-job and multi-job experiments using fastswap
A recurrent (LSTM) neural network in C (a single forward step is sketched after this list)
Example C++/CUDA implementation for training a neural network on the MNIST dataset
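For the CUDATracePreload entry: the name suggests it relies on library preloading, and the generic LD_PRELOAD interposition pattern such tracers typically use is sketched below. This is an assumption-laden illustration, not the tool's actual source; the choice of cudaMalloc as the intercepted call and the int approximation of cudaError_t are mine, made to keep the example header-free.

```c
/* trace_preload.c -- sketch of LD_PRELOAD-style API interposition
 * (assumption: not CUDATracePreload's actual code).  The shim exports a
 * function with the same name as the runtime call, logs the arguments,
 * then forwards to the real library via dlsym(RTLD_NEXT, ...).  It only
 * catches calls in applications that link libcudart dynamically, and
 * cudaError_t is approximated as int so no CUDA headers are required. */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stddef.h>
#include <stdio.h>

typedef int (*real_cudaMalloc_t)(void **devPtr, size_t size);

int cudaMalloc(void **devPtr, size_t size) {
    static real_cudaMalloc_t real_call = NULL;
    if (!real_call)
        real_call = (real_cudaMalloc_t)dlsym(RTLD_NEXT, "cudaMalloc");
    if (!real_call)
        return -1;                                /* arbitrary error code */

    int status = real_call(devPtr, size);         /* forward to libcudart */
    fprintf(stderr, "[trace] cudaMalloc(%zu bytes) -> %d\n", size, status);
    return status;
}
```

Build with `gcc -shared -fPIC trace_preload.c -o libtrace.so -ldl` and run the target as `LD_PRELOAD=./libtrace.so ./app`; the same pattern extends to other CUDA runtime and NCCL entry points.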
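For the occupancy calculator entry: occupancy is active warps per SM divided by the SM's warp capacity, with the number of resident blocks bounded by threads, registers, and shared memory. The sketch below reproduces that arithmetic in plain C; the per-SM limits and the kernel's resource usage are illustrative assumptions, and the allocation-granularity rounding that real calculators apply is ignored.

```c
/* occupancy_sketch.c -- rough sketch of the arithmetic behind a CUDA
 * occupancy calculator.  Hardware limits and kernel resource usage below
 * are assumed values for illustration only. */
#include <stdio.h>

int main(void) {
    /* Assumed per-SM hardware limits (illustrative only). */
    const int max_threads_per_sm = 2048;
    const int max_warps_per_sm   = 64;
    const int max_blocks_per_sm  = 32;
    const int regs_per_sm        = 65536;
    const int smem_per_sm        = 102400;   /* bytes */
    const int warp_size          = 32;

    /* Hypothetical kernel launch configuration. */
    const int block_size      = 256;         /* threads per block */
    const int regs_per_thread = 40;
    const int smem_per_block  = 8192;        /* bytes */

    int warps_per_block = (block_size + warp_size - 1) / warp_size;

    /* Resident blocks per SM, limited by each resource in turn. */
    int lim_threads = max_threads_per_sm / block_size;
    int lim_warps   = max_warps_per_sm / warps_per_block;
    int lim_regs    = regs_per_sm / (regs_per_thread * block_size);
    int lim_smem    = smem_per_block ? smem_per_sm / smem_per_block
                                     : max_blocks_per_sm;

    int blocks = lim_threads;
    if (lim_warps < blocks) blocks = lim_warps;
    if (lim_regs  < blocks) blocks = lim_regs;
    if (lim_smem  < blocks) blocks = lim_smem;
    if (max_blocks_per_sm < blocks) blocks = max_blocks_per_sm;

    /* Occupancy = active warps per SM / maximum warps per SM. */
    double occupancy = (double)(blocks * warps_per_block) / max_warps_per_sm;
    printf("blocks/SM = %d, occupancy = %.2f\n", blocks, occupancy);
    return 0;
}
```

With these assumed numbers the register budget is the binding limit (6 blocks per SM, 0.75 occupancy), which is exactly the kind of bottleneck the calculator is meant to surface.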
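For the energy-measurement entry: GPU energy is commonly estimated by polling NVML power readings and integrating over time (whether this particular project does exactly that is an assumption). A minimal C sketch of that approach, with an arbitrary device index and sampling period:

```c
/* gpu_energy.c -- sketch: estimate GPU energy by sampling power via NVML
 * and integrating over time.  Device index 0, the 100 ms period, and the
 * 10 s window are arbitrary choices for illustration. */
#include <nvml.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    nvmlDevice_t dev;
    if (nvmlInit() != NVML_SUCCESS) return 1;
    if (nvmlDeviceGetHandleByIndex(0, &dev) != NVML_SUCCESS) return 1;

    double energy_j = 0.0;            /* accumulated energy in joules */
    const double dt = 0.1;            /* sampling period in seconds   */

    for (int i = 0; i < 100; i++) {   /* ~10 s measurement window */
        unsigned int mw = 0;          /* instantaneous power in milliwatts */
        if (nvmlDeviceGetPowerUsage(dev, &mw) == NVML_SUCCESS)
            energy_j += (mw / 1000.0) * dt;
        usleep((useconds_t)(dt * 1e6));
    }

    printf("approx. GPU energy over the window: %.1f J\n", energy_j);
    nvmlShutdown();
    return 0;
}
```

Compile with `gcc gpu_energy.c -o gpu_energy -lnvidia-ml`; a coarser sampling period trades accuracy for lower measurement overhead.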
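For the LSTM-in-C entry: the core of any such implementation is the per-timestep gate update, h' = o * tanh(c') with c' = f*c + i*g. The sketch below shows a single forward step with assumed tiny layer sizes and a concatenated [h; x] weight layout; it is not the repository's code and omits training entirely.

```c
/* lstm_cell.c -- minimal sketch of one LSTM forward step in C.
 * Sizes and weight layout are assumptions chosen to illustrate the gate
 * equations: f,i,o = sigmoid(W.[h;x]+b), g = tanh(W.[h;x]+b),
 * c' = f*c + i*g, h' = o*tanh(c').  Compile with -lm. */
#include <math.h>
#include <stdio.h>

#define IN  4   /* input size  (assumed) */
#define HID 3   /* hidden size (assumed) */

static double sigmoid(double z) { return 1.0 / (1.0 + exp(-z)); }

/* One gate pre-activation: row of W (HID x (HID+IN)) applied to [h; x]. */
static double gate(const double *W, const double *b, int row,
                   const double *h, const double *x) {
    double z = b[row];
    for (int j = 0; j < HID; j++) z += W[row * (HID + IN) + j] * h[j];
    for (int j = 0; j < IN;  j++) z += W[row * (HID + IN) + HID + j] * x[j];
    return z;
}

/* Forward step: updates hidden state h and cell state c in place. */
static void lstm_step(const double *Wf, const double *Wi, const double *Wg,
                      const double *Wo, const double *bf, const double *bi,
                      const double *bg, const double *bo,
                      const double *x, double *h, double *c) {
    double h_new[HID];
    for (int r = 0; r < HID; r++) {
        double f = sigmoid(gate(Wf, bf, r, h, x));  /* forget gate */
        double i = sigmoid(gate(Wi, bi, r, h, x));  /* input gate  */
        double g = tanh(gate(Wg, bg, r, h, x));     /* candidate   */
        double o = sigmoid(gate(Wo, bo, r, h, x));  /* output gate */
        c[r] = f * c[r] + i * g;
        h_new[r] = o * tanh(c[r]);
    }
    for (int r = 0; r < HID; r++) h[r] = h_new[r];
}

int main(void) {
    /* Tiny fixed weights just to exercise the call; no training here. */
    double Wf[HID*(HID+IN)], Wi[HID*(HID+IN)], Wg[HID*(HID+IN)], Wo[HID*(HID+IN)];
    double bf[HID] = {0}, bi[HID] = {0}, bg[HID] = {0}, bo[HID] = {0};
    for (int k = 0; k < HID*(HID+IN); k++)
        Wf[k] = Wi[k] = Wg[k] = Wo[k] = 0.1 * ((k % 7) - 3);

    double x[IN] = {1.0, 0.5, -0.5, 0.0}, h[HID] = {0}, c[HID] = {0};
    lstm_step(Wf, Wi, Wg, Wo, bf, bi, bg, bo, x, h, c);
    printf("h = [%f %f %f]\n", h[0], h[1], h[2]);
    return 0;
}
```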