Stars
DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling
DeepEP: an efficient expert-parallel communication library
Repository hosting code for "Actions Speak Louder than Words: Trillion-Parameter Sequential Transducers for Generative Recommendations" (https://arxiv.org/abs/2402.17152).
The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal models, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
📚LeetCUDA: Modern CUDA learning notes with PyTorch for beginners🐑; 200+ CUDA/Tensor Core kernels, HGEMM, FA-2 MMA, etc.🔥
A TensorFlow Extension: GPU performance tools for TensorFlow.
BladeDISC is an end-to-end DynamIc Shape Compiler project for machine learning workloads.
A machine learning compiler for GPUs, CPUs, and ML accelerators
USP: Unified (a.k.a. Hybrid, 2D) Sequence Parallel Attention for long-context Transformer model training and inference
NVIDIA Linux open GPU kernel module source
A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper, Ada and Blackwell GPUs, to provide better performance with lower memory…
Making large AI models cheaper, faster and more accessible
AISystem primarily refers to AI systems, covering the full-stack foundational technologies of AI, including AI chips, AI compilers, and AI inference and training frameworks
TF-Agents: A reliable, scalable and easy to use TensorFlow library for Contextual Bandits and Reinforcement Learning.
MetaBalance algorithm for multi-task learning
Official PyTorch Implementation for Conflict-Averse Gradient Descent (CAGrad)
How to learn PyTorch and OneFlow
Reference implementation for DPO (Direct Preference Optimization)
DLRover: An Automatic Distributed Deep Learning System
Provides end-to-end model development pipelines for LLMs and multimodal models that can be launched on-prem or in cloud-native environments.
A permissively licensed C and C++ Task Scheduler for creating parallel programs. Requires C++11 support.
A list of awesome papers and resources of recommender system on large language model (LLM).
Hackable and optimized Transformers building blocks, supporting a composable construction.