- USTC
- Hefei, China
- https://guopeng-gpli.github.io/
Stars
[ICML 2025🔥] ParallelComp: Parallel Long-Context Compressor for Length Extrapolation
Framework for running AI locally on mobile devices and wearables. Hardware-aware C/C++ backend with wrappers for Flutter & React Native. Kotlin & Swift coming soon.
A high-performance distributed file system designed to address the challenges of AI training and inference workloads.
A Flexible Framework for Experiencing Cutting-edge LLM Inference Optimizations
FlashMLA: Efficient MLA decoding kernels
TensorZero is an open-source stack for industrial-grade LLM applications. It unifies an LLM gateway, observability, optimization, evaluation, and experimentation.
Medusa: Accelerating Serverless LLM Inference with Materialization [ASPLOS'25]
A GPU/CUDA implementation of the Hungarian algorithm
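The Hungarian algorithm solves the linear assignment problem (minimum-cost one-to-one matching between rows and columns of a cost matrix) in O(n^3); the repo above accelerates it on the GPU. As a minimal, hedged reference for what "solving the assignment problem" means, here is a brute-force sketch for tiny inputs (the function name and cost matrix are illustrative, not the repo's API):

```python
from itertools import permutations

def assignment_brute_force(cost):
    """Reference solver for the linear assignment problem.

    The Hungarian algorithm computes the same result in O(n^3);
    this O(n!) enumeration is only for clarity on small matrices.
    """
    n = len(cost)
    best_perm, best_cost = None, float("inf")
    for perm in permutations(range(n)):
        # perm[i] = column assigned to row i
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best_cost:
            best_cost, best_perm = total, perm
    return best_perm, best_cost

cost = [[4, 1, 3],
        [2, 0, 5],
        [3, 2, 2]]
perm, total = assignment_brute_force(cost)
print(perm, total)  # -> (1, 0, 2) 5
```

A production CPU baseline would be `scipy.optimize.linear_sum_assignment`; the CUDA version parallelizes the augmenting-path search across threads.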
Beginner-friendly serverless LLM deployment with Replicate & fly.io
Caribou is a framework for geo-distributed deployment of serverless workflows to save carbon emissions.
A LaTeX template for the USTC (University of Science and Technology of China) thesis proposal (开题报告)
Code for reproducing results for SOSP paper Bagpipe
Efficient and easy multi-instance LLM serving
📚A curated list of Awesome LLM Inference Papers with Codes: Flash-Attention, Paged-Attention, WINT8/4, Parallelism, etc.
Gorilla: Training and Evaluating LLMs for Function Calls (Tool Calls)
PyTorch-based Chinese intent recognition and slot filling
Production-ready platform for agentic workflow development.
BERT-based intent and slots detector for chatbots.
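Slot detectors like the two repos above typically tag each token with BIO labels (B-slot, I-slot, O) and then decode those labels into slot spans. A minimal sketch of that decoding step, under the assumption of standard BIO tags (function name and example tags are illustrative, not either repo's API):

```python
def bio_to_spans(tokens, tags):
    """Decode per-token BIO tags into (slot_type, text) spans."""
    spans, cur_type, cur_toks = [], None, []
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # a new slot begins; flush any open span first
            if cur_type:
                spans.append((cur_type, " ".join(cur_toks)))
            cur_type, cur_toks = tag[2:], [tok]
        elif tag.startswith("I-") and cur_type == tag[2:]:
            # continuation of the current slot
            cur_toks.append(tok)
        else:
            # O tag (or stray I-): close any open span
            if cur_type:
                spans.append((cur_type, " ".join(cur_toks)))
            cur_type, cur_toks = None, []
    if cur_type:
        spans.append((cur_type, " ".join(cur_toks)))
    return spans

print(bio_to_spans(["book", "a", "flight", "to", "new", "york"],
                   ["O", "O", "O", "O", "B-city", "I-city"]))
# -> [('city', 'new york')]
```

The intent label is usually predicted separately from the [CLS] representation, while the slot tags come from the per-token outputs.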
A ChatGPT (GPT-3.5) & GPT-4 workload trace to optimize LLM serving systems
A curated list of high-quality papers on resource-efficient LLMs 🌱
Serverless LLM Serving for Everyone.
Large Language Model (LLM) Systems Paper List
A curated list for Efficient Large Language Models
Semantic Kernel (SK) is a lightweight SDK enabling integration of AI Large Language Models (LLMs) with conventional programming languages.
🚀 Docker image proxy: uses GitHub Actions to mirror images from docker.io, gcr.io, registry.k8s.io, k8s.gcr.io, quay.io, ghcr.io, and other overseas registries to registries in China for faster downloads
Secure Transformer Inference is a protocol for serving Transformer-based models securely.