Stars
Accelerate inference without tears
📰 Must-read papers and blogs on Speculative Decoding ⚡️
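A toy sketch of the draft-and-verify loop these papers study, with hypothetical `draft_model`, `target_model`, and `accept` stand-ins (real schemes accept a drafted token by comparing target and draft probabilities):

```python
import random

random.seed(0)
VOCAB = list(range(8))

def draft_model(ctx):
    # Cheap proposal model (hypothetical stand-in for a small LM).
    return random.choice(VOCAB)

def target_model(ctx):
    # Expensive target model (hypothetical stand-in for the large LM).
    return random.choice(VOCAB)

def accept(ctx, token):
    # Stand-in for the probabilistic accept test; real implementations
    # compare p_target(token) / p_draft(token) given the context.
    return random.random() < 0.7

def speculative_step(prefix, k=4):
    """Draft k tokens cheaply, verify left to right, keep the accepted
    run, then let the target model emit one token itself."""
    drafted = []
    for _ in range(k):
        drafted.append(draft_model(list(prefix) + drafted))
    accepted = []
    for t in drafted:
        if not accept(list(prefix) + accepted, t):
            break  # first rejection discards the rest of the draft
        accepted.append(t)
    accepted.append(target_model(list(prefix) + accepted))
    return accepted

print(speculative_step([1, 2, 3]))  # several tokens per target-model step
```

The payoff is that one expensive target-model step can commit multiple tokens whenever the cheap drafts are accepted.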
Simple and efficient PyTorch-native transformer text generation in <1000 LOC of Python.
[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
[ICLR 2024] Efficient Streaming Language Models with Attention Sinks
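A minimal sketch of the cache policy the paper is named for, assuming a flat list stands in for per-token KV entries (not the paper's implementation): keep the first few "attention sink" tokens plus a recent window, evicting everything in between.

```python
def evict(cache, n_sink=4, window=1020):
    """Keep the first n_sink entries (the attention sinks) plus a recent
    window of entries, dropping the middle of the sequence."""
    if len(cache) <= n_sink + window:
        return cache
    return cache[:n_sink] + cache[-window:]

cache = list(range(2000))      # stand-in for per-token KV cache entries
cache = evict(cache)
print(len(cache), cache[:6])   # 1024 [0, 1, 2, 3, 980, 981]
```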
Official code implementation of Vary-toy (Small Language Model Meets with Reinforced Vision Vocabulary)
TinyGPT-V: Efficient Multimodal Large Language Model via Small Backbones
Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads
Official Implementation of EAGLE-1 (ICML'24), EAGLE-2 (EMNLP'24), and EAGLE-3.
📚A curated list of Awesome LLM/VLM Inference Papers with codes: WINT8/4, FlashAttention, PagedAttention, Parallelism, MLA, etc.
Official inference library for Mistral models
High-speed Large Language Model Serving for Local Deployment
Optimized BERT transformer inference on NVIDIA GPUs. https://arxiv.org/abs/2210.03052
[ICML 2024] Break the Sequential Dependency of LLM Inference Using Lookahead Decoding
A series of large language models developed by Baichuan Intelligent Technology
TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs.
Code and documents of LongLoRA and LongAlpaca (ICLR 2024 Oral)
The official Python library for the OpenAI API
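Basic usage of the library; the model name here is an assumption, substitute any model you have access to:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

resp = client.chat.completions.create(
    model="gpt-4o-mini",  # assumed model name; use any available model
    messages=[{"role": "user", "content": "Say hello in five words."}],
)
print(resp.choices[0].message.content)
```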
Simple, safe way to store and distribute tensors
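A small round-trip with the PyTorch bindings, which is the common way safetensors is used:

```python
import torch
from safetensors.torch import save_file, load_file

tensors = {"weight": torch.randn(4, 4), "bias": torch.zeros(4)}
save_file(tensors, "toy.safetensors")    # one flat, mmap-friendly file
restored = load_file("toy.safetensors")  # loads to CPU by default
assert torch.equal(tensors["bias"], restored["bias"])
```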
[MLSys 2024 Best Paper Award] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
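For orientation, a plain group-wise symmetric quantizer; this is the generic step only, not AWQ's method, which additionally searches per-channel scales from activation statistics before quantizing:

```python
import numpy as np

def quantize_groupwise(w, bits=4, group=8):
    # Generic group-wise symmetric quantization sketch (not AWQ itself).
    qmax = 2 ** (bits - 1) - 1
    g = w.reshape(-1, group)
    scale = np.abs(g).max(axis=1, keepdims=True) / qmax
    q = np.clip(np.round(g / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale, shape):
    return (q.astype(np.float32) * scale).reshape(shape)

w = np.random.randn(4, 16).astype(np.float32)
q, s = quantize_groupwise(w)
print(np.abs(w - dequantize(q, s, w.shape)).max())  # small reconstruction error
```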
LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
LightLLM is a Python-based LLM (Large Language Model) inference and serving framework, notable for its lightweight design, easy scalability, and high-speed performance.