This is the official PyTorch implementation of **FLAT-LLM: Fine-grained Low-rank Activation Space Transformation for Large Language Model Compression** (arXiv).
Installation instructions can be found in `INSTALL.md`.
All scripts for reproducing our main results (Table 1) are available in the `scripts` directory.
- Run `llama_bi.sh` to compute decoder-wise importance scores.
- Run `compute_rank.py` to:
  - allocate ranks with our IPRS algorithm according to the importance scores, and
  - compute the compression ratio for the V, O, and MLP layers (Q and K are not pruned) according to the required total compression ratio (see the sketch below).
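The sketch below shows how this first stage might be invoked directly. The entry-point name `main.py` is an assumption (the provided `.sh` scripts wrap the actual commands), and the flag values are illustrative; only the flags themselves come from the argument list documented below.

```bash
# Stage 1: decoder-wise importance scores (what llama_bi.sh wraps).
# NOTE: main.py is a hypothetical entry-point name; check scripts/ for
# the actual script and values.
python main.py \
    --model meta-llama/Llama-2-7b-hf \
    --dataset wikitext2 \
    --prune_method bi \
    --bi_score scores/llama2_7b_bi.pt

# Rank allocation with IPRS from the saved importance scores
# (arguments, if any, are repo-specific).
python compute_rank.py
```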
Run one of the following scripts to prune and evaluate the corresponding model:

```bash
bash llama_7b.sh   # use 1 A100 40GB
bash llama_13b.sh  # use 1 A100 40GB
bash llama_70b.sh  # use 4 A100 40GB
bash mistral.sh    # use 1 A100 40GB
```
These scripts reproduce the perplexity results reported in Table 1 of the paper when `wikitext2` is used for calibration. The main command-line arguments are:
- `--model`: Name or path of the LLM to prune. Choices: `meta-llama/Llama-2-7b-hf`, `meta-llama/Llama-2-13b-hf`, `meta-llama/Llama-2-70b-hf`, `mistralai/Mistral-7B-v0.1`.
- `--dataset`: Calibration dataset. Choices: `wikitext2`, `c4`, `alpaca`.
- `--cache_dir`: Directory to cache model weights.
- `--prune_method`: Pruning stage. Options:
  - `bi`: rank allocation via importance scores.
  - `flatllm`: final pruning using head-wise PCA.
- `--sparsity_ratio`: Target sparsity level (as an integer percentage).
- `--tol`: Tolerance threshold on cumulative eigenvalues. Default: `0.96`. (This hyperparameter is only used to monitor calibration; it is not used in the algorithm.)
- `--bi_score`: Path to save/load the importance scores and allocated ranks.
- `--seed`: Random seed for reproducibility.
- `--nsamples`: Number of calibration samples.
- `--save`: Path to save logs.
- `--save_model`: Path to save the pruned model.
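Putting the flags together, a hedged example of the final pruning stage might look like the following; again, `main.py` and all concrete values are assumptions for illustration, not the repository's exact command:

```bash
# Stage 2: final pruning with head-wise PCA at the ranks allocated above.
# NOTE: main.py and the values below are illustrative placeholders.
python main.py \
    --model meta-llama/Llama-2-7b-hf \
    --dataset wikitext2 \
    --prune_method flatllm \
    --sparsity_ratio 20 \
    --tol 0.96 \
    --nsamples 128 \
    --seed 0 \
    --bi_score scores/llama2_7b_bi.pt \
    --save logs/ \
    --save_model pruned_models/llama2_7b_flatllm
```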
We evaluate zero-shot downstream task performance using the EleutherAI LM Evaluation Harness. Please use the modified code in the `lm_eval` repo for zero-shot/few-shot evaluation.
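A typical harness invocation is sketched below; this assumes the upstream `lm_eval` CLI, and the task list and model path are illustrative placeholders, so the modified copy in this repo may expect a different interface:

```bash
# Assumes the standard lm-evaluation-harness CLI; the pruned-model path
# and task list are placeholders.
lm_eval --model hf \
    --model_args pretrained=pruned_models/llama2_7b_flatllm \
    --tasks arc_easy,arc_challenge,hellaswag,winogrande \
    --batch_size 8
```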
To benchmark inference speedup, we build upon the evaluation framework from SliceGPT.
This project is licensed under the MIT License.