Starred repositories
NVIDIA GPU metrics exporter for Prometheus leveraging DCGM
Heterogeneous AI Computing Virtualization Middleware
rocSHMEM intra-kernel networking runtime for AMD dGPUs on the ROCm platform.
Distributed Triton for Parallel Systems
An easy-to-understand TensorOp Matmul tutorial
neuralmagic / vllm
Forked from vllm-project/vllm. A high-throughput and memory-efficient inference and serving engine for LLMs
A low-level OpenQASM benchmark suite for NISQ evaluation and simulation. Please see our paper for details.
Efficient Distributed GPU Programming for Exascale, an SC/ISC Tutorial
CUDA Python: Performance meets Productivity
This package contains the original 2012 AlexNet code.
neuralmagic / nm-vllm
Forked from vllm-project/vllm. A high-throughput and memory-efficient inference and serving engine for LLMs
A fast communication-overlapping library for tensor/expert parallelism on GPUs.
A PyTorch native library for large-scale model training
[ARCHIVED] Cooperative primitives for CUDA C++. See https://github.com/NVIDIA/cccl
torch_musa is an open-source repository based on PyTorch that makes full use of the computing power of MooreThreads graphics cards.
FlashInfer: Kernel Library for LLM Serving
Monitor Linux processes without root permissions
Making large AI models cheaper, faster, and more accessible
Efficient, Flexible, and Highly Fault-Tolerant Model Service Management Based on SGLang