Stars
DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling
DeepEP: an efficient expert-parallel communication library
Repository hosting code for "Actions Speak Louder than Words: Trillion-Parameter Sequential Transducers for Generative Recommendations" (https://arxiv.org/abs/2402.17152).
The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal models, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
📚LeetCUDA: Modern CUDA learning notes with PyTorch for beginners🐑; 200+ CUDA/Tensor Core kernels, HGEMM, FA-2 MMA, etc.🔥
A TensorFlow Extension: GPU performance tools for TensorFlow.
BladeDISC is an end-to-end DynamIc Shape Compiler project for machine learning workloads.
A machine learning compiler for GPUs, CPUs, and ML accelerators
USP: Unified (a.k.a. Hybrid, 2D) Sequence Parallel Attention for long-context Transformer model training and inference
NVIDIA Linux open GPU kernel module source
A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper, Ada and Blackwell GPUs, to provide better performance with lower memory…
Making large AI models cheaper, faster and more accessible
AISystem primarily refers to AI systems, covering the full-stack foundational technologies of AI, including AI chips, AI compilers, and AI inference and training frameworks
TF-Agents: A reliable, scalable and easy to use TensorFlow library for Contextual Bandits and Reinforcement Learning.
MetaBalance algorithm for multi-task learning
Official PyTorch Implementation for Conflict-Averse Gradient Descent (CAGrad)
How to learn PyTorch and OneFlow
Reference implementation for DPO (Direct Preference Optimization)
DLRover: An Automatic Distributed Deep Learning System
Provides end-to-end model development pipelines for LLMs and multimodal models that can be launched on-prem or in cloud-native environments.
A permissively licensed C and C++ Task Scheduler for creating parallel programs. Requires C++11 support.
A list of awesome papers and resources of recommender system on large language model (LLM).
Hackable and optimized Transformers building blocks, supporting a composable construction.