Starred repositories
MiniCPM4: Ultra-Efficient LLMs on End Devices, achieving 5x+ speedup on typical end-side chips
A PyTorch native platform for training generative AI models
LLM Transparency Tool (LLM-TT), an open-source interactive toolkit for analyzing the internal workings of Transformer-based language models. Check out the demo at https://huggingface.co/spaces/facebook/l…
Pretraining code for a large-scale depth-recurrent language model
SGLang is a fast serving framework for large language models and vision language models.
A PyTorch quantization backend for Optimum
Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM
An interpreter for RASP as described in the ICML 2021 paper "Thinking Like Transformers"
Efficient 2:4 sparse training algorithms and implementations (see the 2:4 sparsity sketch after this list)
Code for the ICML 2023 paper "SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot".
[NeurIPS 24 Spotlight] MaskLLM: Learnable Semi-structured Sparsity for Large Language Models
Towards Economical Inference: Enabling DeepSeek's Multi-Head Latent Attention in Any Transformer-based LLMs
The official GitHub page for the survey paper "A Survey on Mixture of Experts in Large Language Models".
A series of technical reports on Slow Thinking with LLMs
🐳 Efficient Triton implementations for "Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention"
A bidirectional pipeline parallelism algorithm for computation-communication overlap in V3/R1 training.
DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling
Efficient Triton implementation of Native Sparse Attention.
Production-tested AI infrastructure tools for efficient AGI development and community-driven innovation
DeepEP: an efficient expert-parallel communication library
Some common Hugging Face Transformers models in maximal update parametrization (µP)
Lightning Attention-2: A Free Lunch for Handling Unlimited Sequence Lengths in Large Language Models
[ACM CoNEXT22 Best Paper Award] NTSocks: An ultra-low latency and compatible PCIe interconnect for rack-scale disaggregation.
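Several of the starred projects above (2:4 sparse training, SparseGPT, MaskLLM) revolve around 2:4 semi-structured sparsity. The snippet below is only a minimal PyTorch sketch of what the pattern itself means, keeping the two largest-magnitude weights in every contiguous group of four; it is not the pruning or training algorithm any of those repositories actually implements, and the function name `prune_2_to_4` is made up for illustration.

```python
import torch


def prune_2_to_4(weight: torch.Tensor) -> torch.Tensor:
    """Zero out the 2 smallest-magnitude values in every contiguous group of 4
    along the last dimension, producing a 2:4 semi-structured sparse matrix."""
    out_features, in_features = weight.shape
    assert in_features % 4 == 0, "2:4 pruning needs the inner dim to be a multiple of 4"
    groups = weight.reshape(out_features, in_features // 4, 4)
    # Indices of the 2 smallest-magnitude entries in each group of 4.
    _, drop_idx = groups.abs().topk(2, dim=-1, largest=False)
    mask = torch.ones_like(groups, dtype=torch.bool)
    mask.scatter_(-1, drop_idx, False)
    return (groups * mask).reshape(out_features, in_features)


if __name__ == "__main__":
    w = torch.randn(8, 16)
    w_sparse = prune_2_to_4(w)
    # Every group of 4 now holds at most 2 non-zeros, which is the pattern
    # that GPU sparse tensor cores (and the 2:4 / MaskLLM work above) exploit.
    nnz_per_group = (w_sparse.reshape(8, -1, 4) != 0).sum(dim=-1)
    assert (nnz_per_group <= 2).all()
```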