LLMC: Towards Accurate and Efficient LLM Compression

[EMNLP 2024 Industry Track] This is the official PyTorch implementation of "LLMC: Benchmarking Large Language Model Quantization with a Versatile Compression Toolkit".

[ English | δΈ­ζ–‡ | ζ—₯本θͺž ]

LLMC is an off-the-shelf tool designed for compressing LLMs, leveraging state-of-the-art compression algorithms to enhance efficiency and reduce model size without compromising performance.

The English documentation is here.

The Chinese documentation is here.

The Docker Hub page is here.

Aliyun docker: registry.cn-hangzhou.aliyuncs.com/yongyang/llmcompression:[tag]

You can download a Docker image capable of running llmc with either of the following commands. Users in mainland China are advised to use the Alibaba Cloud registry.

Docker Hub:

docker pull llmcompression/llmc:pure-latest

Aliyun Docker:

docker pull registry.cn-hangzhou.aliyuncs.com/yongyang/llmcompression:pure-latest


Latest News

  • May 12, 2025: πŸ”₯ We now fully support quantization for the Wan2.1 series of video generation models and provide export of truly quantized INT8/FP8 weights, compatible with the lightx2v inference framework. For details, please refer to the lightx2v documentation.

  • Feb 7, 2025: πŸ”₯ We now fully support quantization of large-scale MOE models like DeepSeekv3, DeepSeek-R1, and DeepSeek-R1-Zero with 671B parameters. You can now directly load FP8 weights without any extra conversion. AWQ and RTN quantization can run on a single 80GB GPU, and we also support the export of truly quantized INT4/INT8 weights.

  • Nov 20, 2024: πŸ”₯ We now fully support the quantization of ✨DeepSeekv2(2.5) and other MOE models, as well as ✨Qwen2VL, Llama3.2, and other VLM models. Supported quantization methods include βœ…integer quantization, βœ…floating-point quantization, and advanced algorithms like βœ…AWQ, βœ…GPTQ, βœ…SmoothQuant, and βœ…Quarot.

  • Nov 12, 2024: πŸ”₯ We have added support for πŸ’₯static per-tensor activation quantization across various models and algorithms, covering βœ…integer quantization and βœ…floating-point quantization to further optimize performance and efficiency. Additionally, we now support exporting ✨real quantized models and using the VLLM and SGLang backends for inference acceleration. For more details, refer to the VLLM documentation and SGLang documentation.

  • Sep 26, 2024: πŸ”₯ We now support exporting πŸ’₯FP8-quantized (E4M3, E5M2) models from πŸš€LLMC to advanced inference backends such as VLLM and SGLang. For detailed usage, please refer to the VLLM documentation and SGLang documentation.

  • Sep 24, 2024: πŸ”₯ We have officially released βœ…INT4 and βœ…INT8 models of ✨Llama-3.1-405B, quantized using πŸš€LLMC in save_lightllm mode. You can download the model parameters here.

  • Sep 23, 2024: πŸ”₯ We now support exporting ✨real quantized (INT4, INT8) models from πŸš€LLMC to advanced inference backends such as VLLM, SGLang, AutoAWQ, and MLC-LLM for quantized inference deployment, enabling ✨reduced memory usage and ✨faster inference speeds. For detailed usage, please refer to the VLLM documentation, SGLang documentation, AutoAWQ documentation, and MLC-LLM documentation.

  • Sep 9, 2024: πŸ”₯ We provide configs for our best practices toward superior performance (see Best Practice here).

Previous News

  • Jul 16, 2024: πŸ”₯ We now support Wanda/Naive (Magnitude) for LLM sparsification and layer-wise mixed-bit quantization!

  • Jul 14, 2024: πŸ”₯ We now support the rotation-based quantization method QuaRot!

  • May 17, 2024: πŸš€ We now support advanced large models such as LLaVA, Mixtral, LLaMA V3, and Qwen V2. Give them a try!

  • May 13, 2024: 🍺🍺🍺 We release our quantization benchmark paper:

    LLM-QBench: A Benchmark Towards the Best Practice for Post-training Quantization of Large Language Models.

    Ruihao Gong*, Yang Yong*, Shiqiao Gu*, Yushi Huang*, Yunchen Zhang, Xianglong LiuπŸ“§, Dacheng Tao

    (* denotes equal contribution, πŸ“§ denotes corresponding author.)


    We modularly and fairly benchmark quantization techniques, considering calibration cost, inference efficiency, and quantized accuracy. Nearly 600 experiments on diverse models and datasets yield three insightful takeaways on calibration data, the algorithm pipeline, and quantization configuration selection. Based on these takeaways, a best practice for the LLM PTQ pipeline is designed to achieve the best balance of accuracy and efficiency across various scenarios.

  • Mar 7, 2024: πŸš€ We release the quantization part of a powerful and efficient LLM compression tool. Notably, our benchmark paper is coming soon😊.

Highlight Features

  • πŸ’₯Comprehensive Algorithm Support: Provides a broad range of ✨SOTA compression algorithms, including βœ…quantization, βœ…mixed-precision quantization, and βœ…sparsity, while maintaining accuracy consistent with the original repositories. ✨Quantization best practices (see πŸš€Best Practices here) are also available to ensure optimal performance and efficiency.

  • πŸ’₯Supported Formats: Supports both ✨quantization (integer and floating-point) and ✨sparsity, specifically including βœ…weight-activation, βœ…weight-only, βœ…mixed-precision quantization, as well as βœ…structured and βœ…unstructured sparsity.

  • πŸ’₯Wide Model Support: Offers support for a diverse array of ✨LLM models, including βœ…LLaMA, βœ…Mistral, βœ…InternLM2, βœ…Qwen2, among others, as well as βœ…MOE (DeepSeekv2, DeepSeek-R1) and βœ…VLM (Llama3.2-vision, Qwen2-VL) models (see Supported Model List).

  • πŸ’₯Multi-backend Compatibility: Seamlessly integrates with various backends for enhanced deployment flexibility. Multiple quantization settings and model formats are compatible with a wide range of backends and hardware platforms, such as βœ…VLLM, βœ…Sglang, βœ…LightLLM, βœ…MLC-LLM, and βœ…AutoAWQ, making it highly versatile (see Section Backend here).

  • πŸ’₯Performance Efficiency: Enables quantization of large LLMs, such as ✨Llama3.1-405B and ✨DeepSeek-R1-671B, with PPL evaluation on a single A100/H100/H800 GPU.

Usage

Please refer to the πŸš€Quick Start section in the documentation.

Supported Model List

βœ… BLOOM

βœ… LLaMA

βœ… LLaMA V2

βœ… StarCoder

βœ… OPT

βœ… Falcon

βœ… InternLM2

βœ… Mistral

βœ… LLaMA V3

βœ… Mixtral

βœ… Qwen V2

βœ… LLaVA

βœ… InternLM2.5

βœ… StableLM

βœ… Gemma2

βœ… Phi2

βœ… Phi 1.5

βœ… MiniCPM

βœ… SmolLM

βœ… DeepSeekv2.5

βœ… LLaMA V3.2 Vision

βœ… Qwen MOE

βœ… Qwen2-VL

βœ… InternVL2

You can add your own model type by referring to the files under llmc/models/*.py.
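
As a rough illustration of that extension point, the sketch below shows the general shape of a model wrapper. The import paths, registry decorator, and hook names are assumptions made for this sketch; copy an existing wrapper under llmc/models/ (for example the LLaMA one) to get the real required interface.

# Hypothetical sketch of a new model wrapper for llmc (names are assumptions).
from llmc.models.base_model import BaseModel             # assumed import path
from llmc.utils.registry_factory import MODEL_REGISTRY   # assumed registry


@MODEL_REGISTRY
class MyCustomLLM(BaseModel):
    def find_blocks(self):
        # Point llmc at the transformer blocks it should walk through
        # during layer-wise calibration and quantization.
        self.blocks = self.model.model.layers

    def find_embed_layers(self):
        # Layers handled outside the per-block loop (embeddings, etc.).
        self.embed_tokens = self.model.model.embed_tokens

    def get_subsets_in_block(self, block):
        # Group the linear layers of one block so that algorithms such as
        # AWQ/GPTQ know which projections share the same input activation.
        return [
            {
                'layers': {
                    'self_attn.q_proj': block.self_attn.q_proj,
                    'self_attn.k_proj': block.self_attn.k_proj,
                    'self_attn.v_proj': block.self_attn.v_proj,
                },
                'prev_op': [block.input_layernorm],
                'input': ['self_attn.q_proj'],
                'inspect': block.self_attn,
                'has_kwargs': True,
            },
            # ...further subsets for o_proj and the MLP projections.
        ]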

Supported Backend List

βœ… VLLM

βœ… LightLLM

βœ… Sglang

βœ… MLC-LLM

βœ… AutoAWQ
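
A real-quantized model exported by llmc can typically be loaded directly by one of these backends. Below is a minimal sketch using vLLM's offline API; the model path is a placeholder, and the exact export and load flow is described in the VLLM documentation referenced above.

# Minimal sketch: offline inference on an llmc-exported model with vLLM.
# The path is a placeholder; whether extra arguments are needed depends on
# the exported format -- follow the VLLM documentation for llmc.
from vllm import LLM, SamplingParams

llm = LLM(model="/path/to/llmc_exported_model")            # placeholder path
sampling = SamplingParams(temperature=0.8, max_tokens=64)

outputs = llm.generate(["What is post-training quantization?"], sampling)
print(outputs[0].outputs[0].text)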

Supported Algorithm List

Quantization

βœ… Naive

βœ… AWQ

βœ… GPTQ

βœ… SmoothQuant

βœ… OS+

βœ… OmniQuant

βœ… NormTweaking

βœ… AdaDim

βœ… QUIK

βœ… SpQR

βœ… DGQ

βœ… OWQ

βœ… LLM.int8()

βœ… HQQ

βœ… QuaRot

βœ… SpinQuant (See this branch)

βœ… TesseraQ

Pruning

βœ… Naive (Magnitude)

βœ… Wanda

βœ… ShortGPT

Acknowledgments

We developed our code with reference to the following repositories:


Citation

If you find our LLM-QBench paper or the llmc toolkit useful or relevant to your research, please kindly cite our papers:

@misc{llmc,
   author = {llmc contributors},
   title = {llmc: Towards Accurate and Efficient LLM Compression},
   year = {2024},
   publisher = {GitHub},
   journal = {GitHub repository},
   howpublished = {\url{https://github.com/ModelTC/llmc}},
}

@misc{gong2024llmqbench,
      title={LLM-QBench: A Benchmark Towards the Best Practice for Post-training Quantization of Large Language Models},
      author={Ruihao Gong and Yang Yong and Shiqiao Gu and Yushi Huang and Yunchen Zhang and Xianglong Liu and Dacheng Tao},
      year={2024},
      eprint={2405.06001},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}

@misc{gong2024llmcbenchmarkinglargelanguage,
      title={LLMC: Benchmarking Large Language Model Quantization with a Versatile Compression Toolkit},
      author={Ruihao Gong and Yang Yong and Shiqiao Gu and Yushi Huang and Chentao Lv and Yunchen Zhang and Xianglong Liu and Dacheng Tao},
      year={2024},
      eprint={2405.06001},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2405.06001},
}