Stars
Fast and memory-efficient exact attention
Fast Hadamard transform in CUDA, with a PyTorch interface
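The fast Hadamard transform such a CUDA kernel accelerates is the O(n log n) butterfly recursion; a minimal pure-Python reference sketch (unnormalized convention, length assumed to be a power of two) looks like:

```python
def hadamard_transform(x):
    # Iterative fast Walsh-Hadamard transform (butterfly passes).
    # Assumes len(x) is a power of two; unnormalized convention,
    # so applying it twice scales the input by len(x).
    x = list(x)
    n = len(x)
    h = 1
    while h < n:
        for i in range(0, n, h * 2):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    return x

print(hadamard_transform([1.0, 1.0, 1.0, 1.0]))  # → [4.0, 0.0, 0.0, 0.0]
```

A CUDA implementation fuses these butterfly passes into a single kernel, keeping the intermediate values in registers/shared memory instead of round-tripping through global memory.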
A minimal implementation of Direct Preference Optimization (DPO), with documentation in Chinese
Just a few lines to combine 🤗 Transformers, Flash Attention 2, and torch.compile — simple, clean, fast ⚡
High-performance inference framework for large language models, focusing on efficiency, flexibility, and availability.
Mirage: Automatically Generating Fast GPU Kernels without Programming in Triton/CUDA
SGLang is a fast serving framework for large language models and vision language models.
LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
A high-throughput and memory-efficient inference and serving engine for LLMs
GLake: optimizing GPU memory management and IO transmission.
Tensors and Dynamic neural networks in Python with strong GPU acceleration
The official repository for our EMNLP 2024 paper, Themis: A Reference-free NLG Evaluation Language Model with Flexibility and Interpretability.
🤗 Transformers: the model-definition framework for state-of-the-art machine learning models spanning text, vision, audio, and multimodal tasks, for both inference and training.
An automatic evaluator for instruction-following language models. Human-validated, high-quality, cheap, and fast.