ReKarma (LeeHX) / Starred · GitHub
  • ByteDance Inc


Analyze the inference of Large Language Models (LLMs). Analyze aspects like computation, storage, transmission, and hardware roofline model in a user-friendly interface.

Python 507 59 Updated Sep 11, 2024
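The storage aspect such an analyzer covers reduces to a KV-cache footprint estimate. A minimal sketch, assuming the standard formula (2 tensors, K and V, per layer); the model dimensions below are illustrative, not taken from the repo:

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch, dtype_bytes=2):
    """Bytes needed to store K and V for every layer at a given sequence length."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch * dtype_bytes

# Example: a 7B-class model (32 layers, 32 KV heads, head_dim 128) at 4k context, fp16.
size = kv_cache_bytes(32, 32, 128, 4096, batch=1)
print(size / 2**30)  # → 2.0 (GiB)
```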

Compare different hardware platforms via the Roofline Model for LLM inference tasks.

Jupyter Notebook 107 4 Updated Mar 13, 2024
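The roofline comparison these notebooks perform boils down to taking the minimum of a compute roof and a memory roof (bandwidth × arithmetic intensity). A minimal sketch; the peak-FLOP and bandwidth numbers are illustrative hardware specs, not values from the repo:

```python
def roofline_gflops(peak_gflops, bandwidth_gbs, arithmetic_intensity):
    """Attainable throughput: min(compute roof, memory roof = BW * intensity)."""
    return min(peak_gflops, bandwidth_gbs * arithmetic_intensity)

# A decode-phase GEMV has intensity ~1 FLOP/byte, so the memory roof
# dominates even on a compute-rich GPU (312 TFLOPS, ~2 TB/s here).
print(roofline_gflops(peak_gflops=312_000, bandwidth_gbs=2_039, arithmetic_intensity=1.0))
# → 2039.0 (memory-bound)
```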

Virtual whiteboard for sketching hand-drawn like diagrams

TypeScript 103,491 10,256 Updated Jul 13, 2025

KernelBench: Can LLMs Write GPU Kernels? - Benchmark with Torch -> CUDA problems

Python 474 48 Updated Jul 11, 2025

Material for gpu-mode lectures

Jupyter Notebook 4,718 475 Updated Jun 18, 2025

FlashInfer: Kernel Library for LLM Serving

Cuda 3,354 375 Updated Jul 13, 2025

CUTLASS and CuTe Examples

Cuda 63 9 Updated Jan 4, 2025

Triton-based implementation of Sparse Mixture of Experts.

Python 225 18 Updated Nov 28, 2024
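The routing step that a sparse-MoE kernel accelerates can be sketched in plain Python: softmax the gate logits per token, keep the top-k experts, and renormalize their weights. A minimal sketch of the idea only, not the repo's Triton implementation:

```python
import math

def route_tokens(gate_logits, k=2):
    """Pick top-k experts per token and renormalize their gate weights."""
    routed = []
    for logits in gate_logits:
        # Numerically stable softmax over experts for one token.
        m = max(logits)
        exps = [math.exp(x - m) for x in logits]
        total = sum(exps)
        probs = [e / total for e in exps]
        # Keep the k largest gates, renormalized so they sum to 1.
        topk = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
        norm = sum(probs[i] for i in topk)
        routed.append([(i, probs[i] / norm) for i in topk])
    return routed

# One token, three experts: experts 0 and 2 are selected.
print(route_tokens([[2.0, 0.5, 1.0]], k=2))
```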

Radial Attention Official Implementation

Python 348 16 Updated Jul 6, 2025

GitHub mirror of the triton-lang/triton repo.

MLIR 48 16 Updated Jul 12, 2025

Introduction to Parallel Programming class code

Cuda 1,322 1,140 Updated Jun 27, 2022
Python 2,239 148 Updated Jul 11, 2025

MAGI-1: Autoregressive Video Generation at Scale

Python 3,372 196 Updated Jun 17, 2025

A collection of memory efficient attention operators implemented in the Triton language.

Python 272 18 Updated Jun 5, 2024

⚡️FFPA: extends FlashAttention-2 with Split-D, achieving ~O(1) SRAM complexity for large headdim; 1.8x–3x↑ vs SDPA.🎉

Cuda 191 8 Updated May 10, 2025
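FlashAttention-style kernels like this hinge on the online (streaming) softmax, which rescales a running sum whenever a new maximum appears so the normalizer is built in one pass. A minimal sketch of that normalization trick in plain Python, not the repo's CUDA implementation:

```python
import math

def online_softmax(xs):
    """One-pass softmax: maintain a running max m and a rescaled sum s."""
    m, s = float("-inf"), 0.0
    for x in xs:
        m_new = max(m, x)
        # Rescale the accumulated sum to the new max before adding the new term.
        s = s * math.exp(m - m_new) + math.exp(x - m_new)
        m = m_new
    return [math.exp(x - m) / s for x in xs]
```

The result matches the two-pass softmax exactly, but the running statistics are what let attention kernels process K/V in tiles without materializing the full score row.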

FP8 flash attention implemented on the Ada architecture using the CUTLASS library.

Cuda 72 6 Updated Aug 12, 2024

Open-source unified multimodal model

Python 4,542 383 Updated Jul 2, 2025

📚LeetCUDA: modern CUDA learning notes with PyTorch for beginners🐑; 200+ CUDA kernels, Tensor Cores, HGEMM, FA-2 MMA.🎉

Cuda 5,484 576 Updated Jul 13, 2025

Research project on scaling GPU-accelerated data management to large data volumes. Code base of two SIGMOD papers.

Rust 17 5 Updated Jun 14, 2022

Distributed Compiler based on Triton for Parallel Systems

Python 880 70 Updated Jul 11, 2025

[CVPR 2025 Highlight] TinyFusion: Diffusion Transformers Learned Shallow

Python 130 1 Updated Apr 5, 2025

A fast communication-overlapping library for tensor/expert parallelism on GPUs.

C++ 1,008 68 Updated Jul 8, 2025

[ICML2025] Sparse VideoGen: Accelerating Video Diffusion Transformers with Spatial-Temporal Sparsity

Python 365 15 Updated Jun 6, 2025

[NeurIPS 2023] Structural Pruning for Diffusion Models

Python 198 14 Updated Jul 8, 2024

Production-tested AI infrastructure tools for efficient AGI development and community-driven innovation

7,862 280 Updated May 15, 2025

A high-performance distributed file system designed to address the challenges of AI training and inference workloads.

C++ 9,136 920 Updated Jun 17, 2025

SpargeAttention: A training-free sparse attention that can accelerate any model inference.

Cuda 649 48 Updated Jun 19, 2025

Analyze computation-communication overlap in V3/R1.

1,076 144 Updated Mar 21, 2025

Expert Parallelism Load Balancer

Python 1,231 194 Updated Mar 24, 2025
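The core idea behind an expert-parallelism load balancer can be sketched with a greedy bin-packing pass: place the heaviest experts first, always onto the currently lightest GPU. A minimal sketch of the concept, not DeepSeek's EPLB algorithm (which also handles replication and hierarchical placement); the loads below are illustrative:

```python
def balance_experts(expert_loads, num_gpus):
    """Greedily place the heaviest experts on the currently lightest GPU."""
    gpus = [[] for _ in range(num_gpus)]
    totals = [0.0] * num_gpus
    for expert, load in sorted(enumerate(expert_loads), key=lambda p: p[1], reverse=True):
        g = totals.index(min(totals))  # lightest GPU so far
        gpus[g].append(expert)
        totals[g] += load
    return gpus, totals

placement, totals = balance_experts([9, 7, 6, 5, 4, 3], num_gpus=2)
print(totals)  # → [17.0, 17.0], a perfectly even split for this input
```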