Ditto is an open-source framework that enables direct conversion of HuggingFace `PreTrainedModel`s into TensorRT-LLM engines. Normally, building a TensorRT-LLM engine consists of two steps - checkpoint conversion and `trtllm-build` - both of which rely on pre-defined model architectures. As a result, converting a novel model requires porting the model with TensorRT-LLM's Python API and writing a custom checkpoint conversion script. By automating these tedious procedures, Ditto aims to make TensorRT-LLM more accessible to the broader AI community.
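For comparison, below is a minimal sketch of both workflows. The checkpoint-conversion script path, flags, and model ID are illustrative assumptions that vary by model family and TensorRT-LLM version; only `ditto build <huggingface-model-name>` is Ditto's documented interface.

```
# Conventional TensorRT-LLM workflow (illustrative; script path and flags vary
# by model family and TensorRT-LLM version)
python examples/llama/convert_checkpoint.py \
    --model_dir ./Llama-3.1-8B-Instruct \
    --output_dir ./tllm_checkpoint \
    --dtype float16
trtllm-build --checkpoint_dir ./tllm_checkpoint --output_dir ./tllm_engine

# Ditto: a single command straight from the HuggingFace model name
ditto build meta-llama/Llama-3.1-8B-Instruct
```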
- [2025/02] Blog post introducing Ditto is published! [Blog]
- [2025/02] Ditto 0.1.0 released!
- [2025/04] Ditto 0.2.0 released with new features - MoE, Quantization
- Ease-of-use: Ditto enables users to convert models with a single command.
  ```
  ditto build <huggingface-model-name>
  ```
- Enables conversion of novel model architectures into TensorRT engines, including models that are not supported in TensorRT-LLM due to the absence of checkpoint conversion scripts.
  - For example, as of the publication date of this document (February 10, 2025), Helium is supported in Ditto, while it is not in TensorRT-LLM. (Note that you need to re-install the transformers nightly build after installing Ditto with `pip install git+https://github.com/huggingface/transformers.git`; see the example after this list.)
- Directly converts quantized HuggingFace models. (Future Work)
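As a concrete end-to-end example of the points above, the sketch below builds an engine for Helium. The HuggingFace model ID is an assumption for illustration; the `pip install` command and the `ditto build` form come from the notes above.

```
# Re-install the transformers nightly build after installing Ditto
pip install git+https://github.com/huggingface/transformers.git

# Build a TensorRT-LLM engine directly from the HuggingFace model
# (model ID assumed for illustration)
ditto build kyutai/helium-1-preview-2b
```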
We have conducted comprehensive benchmarks for both output quality and inference performance to validate the conversion process of Ditto. Llama3.3-70B-Instruct, Llama3.1-8B-Instruct, and Helium1-preview-2B were used for the benchmarks, and all benchmarks were performed with both the GEMM and GPT attention plugins enabled.
We used the TensorRT-LLM llmapi integrated with lm-evaluation-harness for quality evaluation. For the Helium model, the ifeval task was excluded since it is not an instruction-tuned model.
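For reference, a hedged sketch of an equivalent lm-evaluation-harness invocation over the tasks in the table below is shown here. It uses the stock `hf` backend and an assumed model ID purely for illustration; the results reported in this section were obtained through a TensorRT-LLM llmapi integration instead.

```
# Illustrative lm-evaluation-harness run over the benchmark tasks (0-shot);
# the reported numbers were produced via a TensorRT-LLM llmapi integration,
# not the stock `hf` backend shown here.
lm_eval --model hf \
    --model_args pretrained=meta-llama/Llama-3.1-8B-Instruct \
    --tasks mmlu,wikitext,gpqa_main_zeroshot,arc_challenge,ifeval \
    --num_fewshot 0
```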
| | | MMLU (Accuracy) | wikitext2 (PPL) | gpqa_main_zeroshot (Accuracy) | arc_challenge (Accuracy) | ifeval (Accuracy) |
|---|---|---|---|---|---|---|
| Llama3.3-70B-Instruct | Ditto | 0.819 | 3.96 | 0.507 | 0.928 | 0.915 |
| | TRT-LLM | 0.819 | 3.96 | 0.507 | 0.928 | 0.915 |
| Llama3.1-8B-Instruct | Ditto | 0.680 | 8.64 | 0.350 | 0.823 | 0.815 |
| | TRT-LLM | 0.680 | 8.64 | 0.350 | 0.823 | 0.815 |
| Helium1-preview-2B | Ditto | 0.486 | 11.37 | 0.263 | 0.578 | - |
| | TRT-LLM | Not Supported | | | | |
NOTE: All tasks were evaluated 0-shot.
Performance benchmarks were conducted using TensorRT-LLM `gptManagerBenchmark`. A100 in the table refers to the A100-SXM4-80GB.
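A hedged sketch of a typical `gptManagerBenchmark` invocation is shown below; the flags and the dataset path are assumptions based on older TensorRT-LLM releases and may differ in your version.

```
# Illustrative throughput benchmark (flags assumed; check your TensorRT-LLM
# version's benchmarks/cpp documentation). The dataset JSON is produced with
# TensorRT-LLM's dataset preparation utilities.
./benchmarks/cpp/gptManagerBenchmark \
    --engine_dir ./tllm_engine \
    --dataset ./dataset.json \
    --max_num_samples 500
```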
| | | TP | A100 (token/sec) | A6000 (token/sec) | L40 (token/sec) |
|---|---|---|---|---|---|
| Llama3.3-70B-Instruct | Ditto | 4 | 1759.2 | - | - |
| | TRT-LLM | 4 | 1751.6 | - | - |
| Llama3.1-8B-Instruct | Ditto | 1 | 3357.9 | 1479.8 | 1085.2 |
| | TRT-LLM | 1 | 3318.0 | 1508.6 | 1086.5 |
| Helium1-preview-2B | Ditto | 1 | - | 1439.5 | 1340.5 |
| | TRT-LLM | 1 | Not Supported | | |
- Llama2-7B
- Llama3-8B
- Llama3.1-8B
- Llama3.2
- Llama3.3-70B
- Mistral-7B
- Gemma2-9B
- Phi4
- Phi3.5-mini
- Qwen2-7B
- Codellama
- Codestral
- ExaOne3.5-8B
- aya-expanse-8B
- Llama-DNA-1.0-8B
- SOLAR-10.7B
- Falcon
- Nemotron
- 42dot_LLM-SFT-1.3B
- Helium1-2B
- Sky-T1-32B
- SmolLM2-1.7B
- Mixtral-8x7B
- Qwen-MoE
- DeepSeek-V1, V2
- and many others that we haven't tested yet
- Multi LoRA
- Tensor Parallelism / Pipeline Parallelism
- Mixture of Experts
- Quantization - Weight-only & FP8 (AutoAWQ, AutoGPTQ, Compressed Tensors)
The features below are planned to be supported in Ditto in the near future. Feel free to reach out if you have any questions or suggestions.
- Additional Quantization Support
- Expert Parallelism
- Multimodal
- Speculative Decoding
- Prefix Caching
- State Space Model
- Encoder-Decoder Model