Stars
🎬 3.7× faster video generation E2E · 🖼️ 1.6× faster image generation E2E · ⚡ ColumnSparseAttn 9.3× vs FlashAttn‑3 · 💨 ColumnSparseGEMM 2.5× vs cuBLAS
FlashMLA: Efficient MLA decoding kernels
Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance.
C++ implementation of a non-blocking binary search tree with insert and search
High-performance inference framework for large language models, focusing on efficiency, flexibility, and availability.
ademeure / DeeperGEMM
Forked from deepseek-ai/DeepGEMM
DeeperGEMM: crazy optimized version
Applied AI experiments and examples for PyTorch
⏩ Create, share, and use custom AI code assistants with our open-source IDE extensions and hub of models, rules, prompts, docs, and other building blocks
Faster alternative to Python's multiprocessing.Queue (IPC FIFO queue)
Distilabel is a framework for synthetic data and AI feedback for engineers who need fast, reliable and scalable pipelines based on verified research papers.
High-quality single-file implementation of Deep Reinforcement Learning algorithms with research-friendly features (PPO, DQN, C51, DDPG, TD3, SAC, PPG)
Recipes to scale inference-time compute of open models
Robust recipes to align language models with human and AI preferences
Fully open reproduction of DeepSeek-R1
Train transformer language models with reinforcement learning.
What would you do with 1000 H100s...
Understand and test language model architectures on synthetic tasks.
LightSeq: A High Performance Library for Sequence Processing and Generation