Stars
A high-throughput and memory-efficient inference and serving engine for LLMs
A Datacenter Scale Distributed Inference Serving Framework
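AISystem mainly refers to AI systems, covering full-stack low-level AI technologies such as AI chips, AI compilers, and AI inference and training frameworks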
FlashMLA: Efficient MLA decoding kernels
📚LeetCUDA: Modern CUDA Learn Notes with PyTorch for Beginners🐑, 200+ CUDA/Tensor Cores Kernels, HGEMM, FA-2 MMA.🎉
Step-by-step optimization of CUDA SGEMM
Simple tutorials on PyTorch DDP training
Collection of benchmarks to measure basic GPU capabilities
HAMi-core compiles libvgpu.so, which enforces a hard limit on GPU resources within a container
The road to hack SysML and become a system expert
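This project aims to share the technical principles behind large models and hands-on experience (large-model engineering and real-world deployment of large-model applications)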
GLake: optimizing GPU memory management and IO transmission.
Practical GPU Sharing Without Memory Size Constraints
Hooks CUDA-related dynamic libraries using automated code generation tools.
K8s-club: learn, share, and explore the K8s world :)
Open, Multi-Cloud, Multi-Cluster Kubernetes Orchestration
State-of-the-Art Deep Learning scripts organized by models - easy to train and deploy with reproducible accuracy and performance on enterprise-grade infrastructure.
Heterogeneous AI Computing Virtualization Middleware (Project under CNCF)
NVIDIA Linux open GPU kernel module source
An awesome & curated list of the best LLMOps tools for developers
A QoS-based scheduling system that brings optimal layout and status to workloads such as microservices, web services, big data jobs, AI jobs, etc.
A programmer's guide to cooking at home (Simplified Chinese only).
A Kubernetes plugin that enables dynamically adding or removing GPU resources for a running Pod