Stars
Cost-efficient and pluggable Infrastructure components for GenAI inference
This repository is a curated collection of resources, tutorials, and practical examples designed to guide you through the journey of mastering CUDA programming. Whether you're just starting or look…
Perforator is a cluster-wide continuous profiling tool designed for large data centers
Collection of benchmarks to measure basic GPU capabilities
🚴 Call stack profiler for Python. Shows you why your code is slow!
Dynamic Memory Management for Serving LLMs without PagedAttention
The NVIDIA® Tools Extension SDK (NVTX) is a C-based Application Programming Interface (API) for annotating events, code ranges, and resources in your applications.
Efficient platform for inference and serving local LLMs, including an OpenAI-compatible API server.
A throughput-oriented high-performance serving framework for LLMs
Nvidia Instruction Set Specification Generator
cudnn_frontend provides a C++ wrapper for the cuDNN backend API and samples showing how to use it
llama3 implementation one matrix multiplication at a time
System design patterns for machine learning
High performance server-side application framework
💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
High-performance containers and utilities for concurrent and asynchronous programming
Asynchronous Programming in Rust, published by Packt
High-level, optionally asynchronous Rust bindings to llama.cpp
“Zero setup” cross compilation and “cross testing” of Rust crates
ix package manager: builds packages statically for darwin/linux with clang