Starred repositories
Fine-tuning & Reinforcement Learning for LLMs. 🦥 Train Qwen3, Llama 4, DeepSeek-R1, Gemma 3, TTS 2x faster with 70% less VRAM.
TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorR…
Mirage: Automatically Generating Fast GPU Kernels without Programming in Triton/CUDA
DuckLake is an integrated data lake and catalog format
Analyze computation-communication overlap in DeepSeek-V3/R1.
Collection of extracted System Prompts from popular chatbots like ChatGPT, Claude & Gemini
Production-grade client-side tracing, profiling, and analysis for complex software systems.
A high-throughput and memory-efficient inference and serving engine for LLMs (a minimal usage sketch follows this list)
My learning notes and code for ML systems (ML SYS).
Official Implementation of EAGLE-1 (ICML'24), EAGLE-2 (EMNLP'24), and EAGLE-3.
DeepSeek-V3/R1 inference performance simulator
A datacenter-scale distributed inference serving framework
RTP-LLM: Alibaba's high-performance LLM inference engine for diverse applications.
Transformer-related optimization, including BERT and GPT
A high-performance distributed file system designed to address the challenges of AI training and inference workloads.
DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling
DeepEP: an efficient expert-parallel communication library
FlashMLA: Efficient MLA decoding kernels
Cost-efficient and pluggable infrastructure components for GenAI inference
A Flexible Framework for Experiencing Cutting-edge LLM Inference Optimizations
An AI-powered research assistant that performs iterative, deep research on any topic by combining search engines, web scraping, and large language models. The goal of this repo is to provide the si…
OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMa2, Qwen, GLM, Claude, etc.) over 100+ datasets.
A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper, Ada and Blackwell GPUs, to provide better performance with lower memory…
Fast and memory-efficient exact attention (see the second sketch below)
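The "high-throughput and memory-efficient inference and serving engine" entry matches vLLM's tagline, so here is a minimal offline-inference sketch against vLLM's public Python API. The model id and sampling values are illustrative placeholders, not recommendations.

```python
# Minimal vLLM offline-inference sketch (assumes the starred entry is
# vllm-project/vllm; model id and sampling settings are illustrative).
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")  # any Hugging Face model id
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# generate() takes a list of prompts and returns one RequestOutput per prompt
outputs = llm.generate(["The key idea behind paged attention is"], params)
for out in outputs:
    print(out.outputs[0].text)
```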
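"Fast and memory-efficient exact attention" is FlashAttention's tagline; the sketch below shows a direct call to its fused kernel, assuming the flash-attn package. It needs an NVIDIA GPU and fp16/bf16 inputs; tensor shapes and sizes here are arbitrary examples.

```python
# Hedged sketch of calling the FlashAttention kernel directly
# (assumes the flash-attn package; requires a CUDA GPU).
# q, k, v have shape (batch, seqlen, num_heads, head_dim) in fp16/bf16.
import torch
from flash_attn import flash_attn_func

q = torch.randn(2, 1024, 8, 64, dtype=torch.float16, device="cuda")
k = torch.randn(2, 1024, 8, 64, dtype=torch.float16, device="cuda")
v = torch.randn(2, 1024, 8, 64, dtype=torch.float16, device="cuda")

out = flash_attn_func(q, k, v, causal=True)  # output has the same shape as q
print(out.shape)  # torch.Size([2, 1024, 8, 64])
```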