
Ditto - Direct Torch to TensorRT-LLM Optimizer

Ditto is an open-source framework that enables direct conversion of HuggingFace PreTrainedModels into TensorRT-LLM engines. Normally, building a TensorRT-LLM engine consists of two steps - checkpoint conversion and trtllm-build - both of which rely on pre-defined model architectures. As a result, converting a novel model requires porting the model with TensorRT-LLM's Python API and writing a custom checkpoint conversion script. By automating these dull procedures, Ditto aims to make TensorRT-LLM more accessible to the broader AI community.
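
For a concrete comparison, the sketch below contrasts the two flows. The checkpoint-conversion script path, directories, and flags are illustrative only; they follow the pattern of TensorRT-LLM's per-model example scripts and differ across model families and versions.

    # Conventional TensorRT-LLM flow (illustrative paths and flags)
    # 1) Convert the HuggingFace checkpoint into TensorRT-LLM's checkpoint format
    python examples/llama/convert_checkpoint.py \
        --model_dir ./Llama-3.1-8B-Instruct \
        --output_dir ./trtllm_ckpt \
        --dtype float16
    # 2) Build the engine from the converted checkpoint
    trtllm-build --checkpoint_dir ./trtllm_ckpt --output_dir ./engine

    # Ditto: a single command, directly from the HuggingFace model
    ditto build meta-llama/Llama-3.1-8B-Instruct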


Latest News

  • [2025/02] Blog post introducing Ditto is published! [Blog]
  • [2025/02] Ditto 0.1.0 released!
  • [2025/04] Ditto 0.2.0 released with new features - MoE, Quantization

Getting Started

Key Advantages

  • Ease-of-use: Ditto enables users to convert models with a single command.
    ditto build <huggingface-model-name>
    
  • Enables conversion of novel model architectures into TensorRT engines, including models that are not supported in TensorRT-LLM due to the absence of checkpoint conversion scripts.
    • For example, as of the publication date of this document (February 10, 2025), Helium is supported in Ditto but not in TensorRT-LLM. (Note that you need to reinstall the transformers nightly build after installing Ditto: pip install git+https://github.com/huggingface/transformers.git. See the sketch after this list.)
  • Directly converts quantized HuggingFace models. (Future Work)
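
As a concrete sketch of the Helium example above (the HuggingFace model identifier kyutai/helium-1-preview-2b is an assumption; substitute the checkpoint you actually use):

    # After installing Ditto, reinstall the transformers nightly build
    pip install git+https://github.com/huggingface/transformers.git
    # Then convert the Helium checkpoint directly into a TensorRT-LLM engine
    ditto build kyutai/helium-1-preview-2b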

Benchmarks

We have conducted comprehensive benchmarks for both output quality and inference performance to validate Ditto's conversion process. Llama3.3-70B-Instruct, Llama3.1-8B-Instruct, and Helium1-preview-2B were used for the benchmarks, and all benchmarks were performed with both the GEMM and GPT attention plugins enabled.
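
For reference, in stock TensorRT-LLM these two plugins are switched on at engine-build time with trtllm-build options like the ones below (the dtype values are illustrative, and this README does not state how ditto build exposes the equivalent switches):

    # Enabling the GEMM and GPT attention plugins in a plain trtllm-build invocation
    trtllm-build --checkpoint_dir ./trtllm_ckpt --output_dir ./engine \
        --gemm_plugin float16 --gpt_attention_plugin float16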

Quality

We used the TensorRT-LLM llmapi integrated with lm-evaluation-harness for quality evaluation. For the Helium model, the ifeval task was excluded since it is not an instruction-tuned model.

| Model | | MMLU (Accuracy) | wikitext2 (PPL) | gpqa_main_zeroshot (Accuracy) | arc_challenge (Accuracy) | ifeval (Accuracy) |
|---|---|---|---|---|---|---|
| Llama3.3-70B-Instruct | Ditto | 0.819 | 3.96 | 0.507 | 0.928 | 0.915 |
| | TRT-LLM | 0.819 | 3.96 | 0.507 | 0.928 | 0.915 |
| Llama3.1-8B-Instruct | Ditto | 0.680 | 8.64 | 0.350 | 0.823 | 0.815 |
| | TRT-LLM | 0.680 | 8.64 | 0.350 | 0.823 | 0.815 |
| Helium1-preview-2B | Ditto | 0.486 | 11.37 | 0.263 | 0.578 | - |
| | TRT-LLM | Not Supported | | | | |

NOTE: All tasks were tested as 0-shot.

Throughput

Performance benchmarks were conducted using TensorRT-LLM gptManagerBenchmark. A100 in the table represents A100-SXM4-80GB.

| Model | | TP | A100 (token/sec) | A6000 (token/sec) | L40 (token/sec) |
|---|---|---|---|---|---|
| Llama3.3-70B-Instruct | Ditto | 4 | 1759.2 | - | - |
| | TRT-LLM | 4 | 1751.6 | - | - |
| Llama3.1-8B-Instruct | Ditto | 1 | 3357.9 | 1479.8 | 1085.2 |
| | TRT-LLM | 1 | 3318.0 | 1508.6 | 1086.5 |
| Helium1-preview-2B | Ditto | 1 | - | 1439.5 | 1340.5 |
| | TRT-LLM | 1 | Not Supported | | |
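
A rough shape of a gptManagerBenchmark run is sketched below. The flags follow TensorRT-LLM's C++ benchmark documentation but may differ between versions, and the engine and dataset paths are placeholders.

    # Single-GPU throughput run against a built engine (flags and paths illustrative)
    ./benchmarks/gptManagerBenchmark \
        --engine_dir ./engine \
        --dataset ./preprocessed_dataset.json \
        --max_num_samples 500
    # Multi-GPU runs (e.g. TP=4) are launched under mpirun -n 4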

Support Matrix

Models

  • Llama2-7B
  • Llama3-8B
  • Llama3.1-8B
  • Llama3.2
  • Llama3.3-70B
  • Mistral-7B
  • Gemma2-9B
  • Phi4
  • Phi3.5-mini
  • Qwen2-7B
  • Codellama
  • Codestral
  • ExaOne3.5-8B
  • aya-expanse-8B
  • Llama-DNA-1.0-8B
  • SOLAR-10.7B
  • Falcon
  • Nemotron
  • 42dot_LLM-SFT-1.3B
  • Helium1-2B
  • Sky-T1-32B
  • SmolLM2-1.7B
  • Mixtral-8x7B
  • Qwen-MoE
  • DeepSeek-V1, V2
  • and many others that we haven't tested yet

Features

What's Next?

The features below are planned to be supported in Ditto in the near future. Feel free to reach out if you have any questions or suggestions.

  • Additional Quantization Support
  • Expert Parallelism
  • Multimodal
  • Speculative Decoding
  • Prefix Caching
  • State Space Model
  • Encoder-Decoder Models

References
