A comprehensive command-line tool for quantizing Hugging Face models, supporting multiple quantization strategies for both inference-only and training-based workflows.
- Multiple quantization modes:
  - 8-bit and 4-bit quantization (bitsandbytes)
  - Dynamic quantization (PyTorch)
  - Quantization-aware training (QAT)
  - Weight-only quantization
- Automatic model size reduction
- Hugging Face Hub integration
- Mixed precision support
- Detailed progress logging
- Tokenizer handling
Install the dependencies:

```bash
pip install -r requirements.txt
```
Quantize a model to 8-bit:

```bash
python main.py --model_name "bert-base-uncased" --quant_mode "8bit"
```
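Under the hood, 8-bit loading is typically done through transformers' bitsandbytes integration. A minimal sketch of that pattern, assuming a CUDA-capable GPU is available (the tool's exact internals may differ):

```python
from transformers import AutoModelForSequenceClassification, BitsAndBytesConfig

# Load weights in 8-bit via bitsandbytes (requires a CUDA-capable GPU).
bnb_config = BitsAndBytesConfig(load_in_8bit=True)

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    quantization_config=bnb_config,
    device_map="auto",  # place the quantized modules on available devices
)
```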
Apply dynamic quantization (CPU-friendly, no training data needed):

```bash
python main.py \
  --model_name "bert-base-uncased" \
  --quant_mode "dynamic"
```
Run quantization-aware training (QAT) with fine-tuning on a dataset:

```bash
python main.py \
  --model_name "bert-base-uncased" \
  --quant_mode "qat" \
  --dataset_name "sst2" \
  --num_train_epochs 3 \
  --train_batch_size 8 \
  --learning_rate 2e-5
```
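QAT inserts fake-quantization observers into the model so fine-tuning learns weights that tolerate reduced precision; after training, the observers are folded into real int8 ops. A generic PyTorch sketch of the prepare/train/convert cycle, using a toy model as a stand-in (the tool's wiring around training may differ):

```python
import torch.nn as nn
import torch.ao.quantization as tq

# Toy stand-in for the loaded transformer model.
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 2))
model.train()

# Attach a QAT config and insert fake-quantization modules.
model.qconfig = tq.get_default_qat_qconfig("fbgemm")
tq.prepare_qat(model, inplace=True)

# ... fine-tune as usual here: forward, loss, backward, optimizer step ...

# Fold the observers into real int8 modules for inference.
model.eval()
quantized = tq.convert(model)
```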
Apply weight-only quantization:

```bash
python main.py \
  --model_name "bert-base-uncased" \
  --quant_mode "weight_only"
```
Quantize and push the result to the Hugging Face Hub:

```bash
python main.py \
  --model_name "bert-base-uncased" \
  --quant_mode "8bit" \
  --push_to_hub \
  --repo_name "my-quantized-model" \
  --use_auth_token
```
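Pushing uses the standard transformers Hub flow; the tokenizer is uploaded alongside the model (see Notes below). A sketch of the equivalent manual steps, assuming you are already logged in via `huggingface-cli login` and that the saved checkpoint can be reloaded with `from_pretrained`:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load the locally saved quantized artifacts.
model = AutoModelForSequenceClassification.from_pretrained("./quantized_model")
tokenizer = AutoTokenizer.from_pretrained("./quantized_model")

# Create (or reuse) the repo and upload weights plus tokenizer files.
model.push_to_hub("my-quantized-model")
tokenizer.push_to_hub("my-quantized-model")
```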
The full set of command-line options:

General:

- `--model_name`: Name or path of the pretrained model on Hugging Face
- `--quant_mode`: Type of quantization (`"8bit"`, `"4bit"`, `"dynamic"`, `"qat"`, `"weight_only"`)
- `--push_to_hub`: Upload the quantized model to the Hugging Face Hub
- `--repo_name`: Name of the Hub repo (default: `{model_name}-quantized`)
- `--use_auth_token`: Use a Hugging Face auth token for uploading
- `--output_dir`: Path to save the quantized model locally (default: `./quantized_model`)
- `--verbose`: Enable verbose logging
4-bit specific (bitsandbytes):

- `--double_quant`: Enable double quantization (default: False)
- `--quant_type_4bit`: Type of 4-bit quantization (`"fp4"` or `"nf4"`, default: `"nf4"`)
- `--compute_dtype`: Compute dtype (`"float16"`, `"bfloat16"`, `"float32"`, default: `"float16"`)
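These flags map naturally onto transformers' `BitsAndBytesConfig` for 4-bit loading; a plausible sketch of the mapping (the tool's actual wiring is an assumption here):

```python
import torch
from transformers import BitsAndBytesConfig

DTYPES = {"float16": torch.float16, "bfloat16": torch.bfloat16, "float32": torch.float32}

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,            # --double_quant
    bnb_4bit_quant_type="nf4",                 # --quant_type_4bit
    bnb_4bit_compute_dtype=DTYPES["float16"],  # --compute_dtype
)
```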
QAT specific:

- `--dataset_name`: Hugging Face dataset name for fine-tuning (required for QAT)
- `--num_train_epochs`: Number of training epochs (default: 3)
- `--train_batch_size`: Training batch size (default: 8)
- `--learning_rate`: Learning rate for fine-tuning (default: 2e-5)
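The QAT flags are ordinary fine-tuning hyperparameters; a plausible sketch of how they could feed `datasets` and a transformers `TrainingArguments` (hypothetical wiring, shown for orientation):

```python
from datasets import load_dataset
from transformers import TrainingArguments

dataset = load_dataset("sst2")      # --dataset_name

training_args = TrainingArguments(
    output_dir="./quantized_model",     # --output_dir
    num_train_epochs=3,                 # --num_train_epochs
    per_device_train_batch_size=8,      # --train_batch_size
    learning_rate=2e-5,                 # --learning_rate
)
```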
Dynamic quantization of GPT-2 saved to a custom output directory:

```bash
python main.py \
  --model_name "gpt2" \
  --quant_mode "dynamic" \
  --output_dir "./my_quantized_gpt2"
```
QAT on GPT-2 followed by a Hub upload:

```bash
python main.py \
  --model_name "gpt2" \
  --quant_mode "qat" \
  --dataset_name "sst2" \
  --num_train_epochs 5 \
  --push_to_hub \
  --repo_name "gpt2-qat-quantized"
```
Notes:

- Uploading to the Hub requires being logged in to Hugging Face (e.g. via `huggingface-cli login`).
- The tool automatically handles tokenizer saving and uploading.
- Model size reduction information is displayed during quantization (a simple way to measure it yourself is sketched below).
- QAT mode requires a dataset for fine-tuning.
- Dynamic quantization is applied to Linear and LSTM layers.
- Weight-only quantization preserves activation precision.
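The reported size reduction can be reproduced by serializing the state dict and checking the file size, as in the standard PyTorch recipe; a minimal sketch (not necessarily the tool's exact accounting):

```python
import os
import torch

def model_size_mb(model: torch.nn.Module, path: str = "size_check.pt") -> float:
    # Serialize the state dict and measure the on-disk size; this works
    # for both float and quantized models.
    torch.save(model.state_dict(), path)
    size_mb = os.path.getsize(path) / 2**20
    os.remove(path)
    return size_mb

# Example: compare a float model against its dynamically quantized version.
model = torch.nn.Sequential(torch.nn.Linear(512, 512), torch.nn.Linear(512, 10))
quantized = torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)
print(f"{model_size_mb(model):.2f} MB -> {model_size_mb(quantized):.2f} MB")
```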