Stars
Simple MPI implementation for prototyping or learning
A powerful and artistic UI library based on PyQt5: agile, elegant, and lightweight
Archived implementation of BLAS using the SYCL open standard. See oneMath for a replacement.
GPGPU-Sim provides a detailed simulation model of contemporary NVIDIA GPUs running CUDA and/or OpenCL workloads. It includes support for features such as TensorCores and CUDA Dynamic Parallelism as…
We invite you to visit and follow our new repository at https://github.com/microsoft/TileFusion. TiledCUDA is a highly efficient kernel template library designed to elevate CUDA C’s level of abstra…
An easy-to-understand TensorOp Matmul tutorial
A minimal GPU design in Verilog to learn how GPUs work from the ground up
Resources on LLM-based applications that use the RAG pattern
A comprehensive guide to building RAG-based LLM applications for production.
A simple high performance CUDA GEMM implementation.
📚 LeetCUDA: Modern CUDA Learning Notes with PyTorch for Beginners 🐑, 200+ CUDA Kernels, Tensor Cores, HGEMM, FA-2 MMA. 🎉
A tool for bandwidth measurements on NVIDIA GPUs.
Benchmark code for the "Online normalizer calculation for softmax" paper
A collection of benchmarks to measure basic GPU capabilities
C++ project template with unit-tests, documentation, ci-testing and workflows.
An extension library for the WMMA API (Tensor Core API)