weishengying (weishengying) / Starred · GitHub
Jupyter Notebook 133 16 Updated Mar 4, 2025

SGLang is a fast serving framework for large language models and vision language models.

Python 14,384 1,766 Updated May 16, 2025
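As a usage sketch: SGLang exposes an OpenAI-compatible HTTP API once a server is launched; the model name and port below are placeholder assumptions.

# Start a server first (shell):
#   python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --port 30000
import requests

resp = requests.post(
    "http://localhost:30000/v1/chat/completions",
    json={
        "model": "meta-llama/Llama-3.1-8B-Instruct",  # assumption: any served model
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    },
)
print(resp.json()["choices"][0]["message"]["content"])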

https://wavespeed.ai/ An inference performance optimization framework for HuggingFace Diffusers on NVIDIA GPUs.

Python 1,257 79 Updated Mar 27, 2025

FP8 flash attention for the NVIDIA Ada architecture, implemented with the CUTLASS library.

Cuda 65 4 Updated Aug 12, 2024

A latent text-to-image diffusion model

Jupyter Notebook 70,642 10,435 Updated Jun 18, 2024
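A minimal sketch of running such a model through the HuggingFace diffusers library; the checkpoint id is an assumption.

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # assumption: substitute your checkpoint
    torch_dtype=torch.float16,
).to("cuda")
image = pipe("a photograph of an astronaut riding a horse").images[0]
image.save("astronaut.png")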

xDiT: A Scalable Inference Engine for Diffusion Transformers (DiTs) with Massive Parallelism

Python 1,936 208 Updated May 15, 2025

Optimized primitives for collective multi-GPU communication

C++ 3,718 916 Updated Apr 29, 2025
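NCCL is usually consumed through a framework rather than called directly; a minimal sketch of an all-reduce over PyTorch's NCCL backend:

# Launch with: torchrun --nproc_per_node=2 allreduce_demo.py
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")  # NCCL handles the GPU-to-GPU transfers
rank = dist.get_rank()
torch.cuda.set_device(rank)
x = torch.full((4,), float(rank + 1), device="cuda")
dist.all_reduce(x, op=dist.ReduceOp.SUM)  # every rank ends up with the sum
print(f"rank {rank}: {x.tolist()}")
dist.destroy_process_group()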
Cuda 12 2 Updated Aug 14, 2024

Material for gpu-mode lectures

Jupyter Notebook 4,435 448 Updated Feb 9, 2025

How to optimize common algorithms in CUDA.

Cuda 2,171 190 Updated May 15, 2025

A stripped-down flash-attention implementation built with CUTLASS, intended as a teaching resource.

Cuda 41 5 Updated Aug 12, 2024

Fast and memory-efficient exact attention

Python 17,371 1,687 Updated May 8, 2025
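A minimal usage sketch, assuming the flash-attn Python package is installed on a supported GPU:

import torch
from flash_attn import flash_attn_func

# q, k, v: (batch, seqlen, nheads, headdim), fp16/bf16, on GPU
q = torch.randn(2, 1024, 8, 64, dtype=torch.float16, device="cuda")
k = torch.randn(2, 1024, 8, 64, dtype=torch.float16, device="cuda")
v = torch.randn(2, 1024, 8, 64, dtype=torch.float16, device="cuda")
out = flash_attn_func(q, k, v, causal=True)  # exact attention, computed tile-by-tile
print(out.shape)  # (2, 1024, 8, 64)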

Flash attention tutorials written in Python, Triton, CUDA, and CUTLASS.

Cuda 351 35 Updated May 14, 2025

Flash Attention in ~100 lines of CUDA (forward pass only)

Cuda 814 79 Updated Dec 30, 2024
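For reference, the forward pass such a kernel reproduces is just softmax(QK^T / sqrt(d)) V; a naive plain-PyTorch version:

import math
import torch

def attention_forward_ref(q, k, v):
    # q, k, v: (batch, heads, seqlen, headdim); O(n^2)-memory reference,
    # numerically equivalent to what the fused kernel computes.
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])
    return torch.softmax(scores, dim=-1) @ v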

FlashInfer: Kernel Library for LLM Serving

Cuda 2,950 304 Updated May 15, 2025

📚 LeetCUDA: modern CUDA learning notes with PyTorch for beginners 🐑; 200+ CUDA/Tensor Core kernels, HGEMM, FA-2 MMA, etc. 🔥

Cuda 4,221 451 Updated May 12, 2025

An easy-to-use LLM quantization package with user-friendly APIs, based on the GPTQ algorithm.

Python 4,845 515 Updated Apr 11, 2025
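A hedged quantization sketch following the package's documented pattern; the model id and calibration text are assumptions.

from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer

model_id = "facebook/opt-125m"  # assumption: any causal LM works here
tokenizer = AutoTokenizer.from_pretrained(model_id)
examples = [tokenizer("GPTQ calibrates on a handful of sample texts.")]

quantize_config = BaseQuantizeConfig(bits=4, group_size=128)
model = AutoGPTQForCausalLM.from_pretrained(model_id, quantize_config)
model.quantize(examples)               # one-shot, calibration-based quantization
model.save_quantized("opt-125m-4bit")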

AutoAWQ implements the AWQ algorithm for 4-bit quantization, achieving about a 2x speedup during inference.

Python 2,161 270 Updated May 11, 2025
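A sketch of the AutoAWQ quantize-and-save flow; the model path and output directory are assumptions, and the quant_config values mirror its documented defaults.

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "facebook/opt-125m"  # assumption: substitute your model
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)
model.quantize(tokenizer, quant_config=quant_config)  # activation-aware 4-bit
model.save_quantized("opt-125m-awq")
tokenizer.save_pretrained("opt-125m-awq")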

FP16xINT4 LLM inference kernel that achieves near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens.

Python 819 66 Updated Sep 4, 2024

[MLSys 2024 Best Paper Award] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

Python 3,003 247 Updated May 9, 2025

[ICLR 2024 spotlight] OmniQuant is a simple and powerful quantization technique for LLMs.

Python 805 62 Updated Oct 8, 2024

My TV: live TV streaming software, ready to use right after installation.

C 32,015 3,583 Updated Jun 20, 2024

This repository contains integer operators on GPUs for PyTorch.

Python 204 54 Updated Sep 29, 2023
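The core pattern such libraries accelerate is an int8 GEMM with int32 accumulation; a dequantized reference in plain PyTorch (the helper name is hypothetical, not this repo's API):

import torch

def int8_matmul_ref(a_fp: torch.Tensor, b_fp: torch.Tensor) -> torch.Tensor:
    # Per-tensor symmetric quantization to int8, integer matmul
    # accumulated in int32, then dequantized back to float.
    a_scale = a_fp.abs().amax() / 127
    b_scale = b_fp.abs().amax() / 127
    a_q = (a_fp / a_scale).round().clamp(-128, 127).to(torch.int8)
    b_q = (b_fp / b_scale).round().clamp(-128, 127).to(torch.int8)
    c_i32 = a_q.to(torch.int32) @ b_q.to(torch.int32)
    return c_i32.float() * (a_scale * b_scale)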

Fast inference from large language models via speculative decoding.

Python 726 68 Updated Aug 22, 2024
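The idea in a greedy-only sketch: a cheap draft model proposes several tokens, and the large target model verifies them all in one forward pass. Here draft_next and target_argmax are hypothetical stand-ins for the two models, not this repo's API.

from typing import Callable, List

def speculative_decode_greedy(prefix: List[int],
                              draft_next: Callable[[List[int]], int],
                              target_argmax: Callable[[List[int]], List[int]],
                              gamma: int = 4, max_new: int = 32) -> List[int]:
    tokens = list(prefix)
    while len(tokens) - len(prefix) < max_new:
        # 1) Draft gamma tokens cheaply, one at a time.
        draft = []
        for _ in range(gamma):
            draft.append(draft_next(tokens + draft))
        # 2) Verify them with ONE target pass: target_argmax returns, for
        #    each position, the target's next-token choice.
        preds = target_argmax(tokens + draft)
        n = 0
        for i, tok in enumerate(draft):
            n = i + 1
            if preds[len(tokens) + i - 1] != tok:
                draft[i] = preds[len(tokens) + i - 1]  # take target's token, stop
                break
        tokens += draft[:n]
    return tokens[: len(prefix) + max_new]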

CUDA Core Compute Libraries

C++ 1,636 215 Updated May 16, 2025
C++ 537 94 Updated May 15, 2025

MoE layer for PyTorch.

C++ 3 Updated Jan 16, 2024
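A minimal top-1 routed MoE layer in plain PyTorch, to illustrate the idea (an assumption-laden sketch, not this repo's implementation):

import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    def __init__(self, d_model: int, num_experts: int = 4):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model); send each token to its highest-scoring expert.
        scores = F.softmax(self.gate(x), dim=-1)  # (tokens, num_experts)
        weight, idx = scores.max(dim=-1)          # top-1 routing
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = idx == e
            if mask.any():
                out[mask] = weight[mask, None] * expert(x[mask])
        return out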

TensorRT Examples (TensorRT, Jetson Nano, Python, C++)

Jupyter Notebook 94 24 Updated Nov 10, 2023