We introduce CopySpec, an innovative technique designed to tackle the inefficiencies large language models (LLMs) face when generating responses that closely resemble previous outputs. CopySpec identifies repeated sequences in the model’s chat history and speculates that the same tokens will follow, enabling seamless copying without compromising output quality or requiring additional GPU memory.
To evaluate the effectiveness of our approach, we conducted experiments with five LLMs on five datasets: MT-Bench, CNN/DailyMail, GSM-8K, HumanEval, and our newly created dataset, MT-Redundant. MT-Redundant transforms the second turn of MT-Bench into a request for variations of the first turn's answer, simulating real-world scenarios where users request modifications to prior responses.
Our results demonstrate significant speed-ups:
- Up to 2.35× on CNN/DailyMail.
- 3.08× on the second turn of select MT-Redundant categories.
- 2.66× on the third turn of GSM-8K’s self-correction tasks.
- 49% additional speed-up over speculative decoding on the second turn of MT-Redundant across all eight categories.
While LLM inference, even with speculative decoding, slows down as the context grows, CopySpec leverages the expanded context as a source of draft tokens, so it becomes faster as the context size increases.
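The core idea can be sketched in a few lines: when the most recently generated tokens also occur earlier in the context, the tokens that followed that earlier occurrence are proposed as a draft, to be verified by the model in a single forward pass. The sketch below is our own illustration, not the repository's implementation; the function name, the `match_len` parameter, and the defaults are assumptions made for clarity.

```python
def propose_copy(tokens, gamma=3, match_len=3):
    """Propose up to `gamma` draft tokens by copying from earlier context.

    If the last `match_len` tokens also appeared earlier in `tokens`,
    speculate that the tokens which followed that earlier occurrence
    will be generated again. Illustrative sketch only: in CopySpec the
    drafted tokens are then verified against the target model, so
    output quality is unaffected.
    """
    if len(tokens) <= match_len:
        return []
    suffix = tokens[-match_len:]
    # Search backwards for the most recent earlier occurrence of the suffix.
    for start in range(len(tokens) - match_len - 1, -1, -1):
        if tokens[start:start + match_len] == suffix:
            return tokens[start + match_len:start + match_len + gamma]
    return []
```

When a response repeats material from the chat history (e.g., a code block being lightly revised), the copied draft is usually accepted wholesale, emitting many tokens per model call instead of one.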
The MT-Redundant dataset can be downloaded from:
To use it, replace the questions file in MT-Bench with these new questions.
Clone the repository and install dependencies:
```bash
git clone https://github.com/RazvanDu/SpeculativeCopying.git
cd SpeculativeCopying
pip install -r requirements.txt
```
Set up the environment by specifying the model cache directory and the main repository path:
```bash
export cache_dir="<path-to-cache-dir>"
export copyspec_path="<path-to-copyspec-repo>"
```
For example:
```bash
export cache_dir="/mnt/razvandu/speculative_decoding/models_cache"
export copyspec_path="/mnt/razvandu/speculative_decoding/"
```
To evaluate CopySpec on the CNN/DailyMail dataset, navigate to the `CNNDM` directory and run:
```bash
python evaluate_cnn.py --model-path "<model-name>" [--use-copy] --gamma <integer>
```
Example (with CopySpec enabled):
```bash
python evaluate_cnn.py --model-path "meta-llama/Llama-3.1-8B-Instruct" --use-copy --gamma 3
```
- `--model-path`: (Required) The Hugging Face model identifier.
- `--use-copy`: (Optional) Include this flag to enable speculative copying. Omit to disable.
- `--gamma`: (Optional, default: `3`) Sets the number of tokens searched for speculative copying (required if `--use-copy` is set).
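The `--gamma` flag bounds how many copied tokens are drafted per step. After drafting, the target model checks the draft in one forward pass, and tokens are accepted only up to the first disagreement. The following accept/reject rule is our own minimal illustration of that idea, not the repository's code:

```python
def accept_prefix(draft, model_tokens):
    """Keep draft tokens only while they agree with the target model.

    `model_tokens` stands in for the tokens the target model would emit
    at each drafted position (obtained from a single forward pass over
    the draft). Accepting the longest agreeing prefix keeps the output
    identical to ordinary greedy decoding.
    """
    accepted = []
    for drafted, expected in zip(draft, model_tokens):
        if drafted != expected:
            break
        accepted.append(drafted)
    return accepted
```

A larger `--gamma` lets more tokens through per model call when the copy guess holds, but wastes verification compute on tokens past the first mismatch, so moderate values such as the default `3` are a reasonable starting point.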
To evaluate CopySpec on EvalPlus, navigate to `src/evalplus` and run:
```bash
python evalplus/evaluate.py --model "<model-name>" --dataset <dataset-name> --backend <hf/spec> --greedy --device_map "auto" --trust_remote_code true --gamma <integer>
```
Example:
```bash
python evalplus/evaluate.py --model "meta-llama/Llama-3.1-8B-Instruct" --dataset humaneval --backend spec --greedy --device_map "auto" --trust_remote_code true --gamma 3
```
- `--model`: (Required) The Hugging Face model identifier.
- `--dataset`: (Required) The dataset to evaluate (e.g., `humaneval`).
- `--backend`: (Required) Can be `hf` (Hugging Face base model) or `spec` (speculative copying).
- `--gamma`: (Optional, default: `3`) Sets the number of tokens searched for speculative copying (required if `--backend spec` is used).
- `--greedy`: (Optional) Enables greedy decoding.
- `--device_map`: (Optional, default: `"auto"`) Sets device mapping for execution.
- `--trust_remote_code`: (Optional, default: `true`) Allows loading external code.
To evaluate CopySpec on MT-Bench and MT-Redundant, navigate to the `FastChat` directory and first install the necessary dependencies:
```bash
cd FastChat
pip install -e .
```
Then, run the evaluation command:
```bash
python fastchat/llm_judge/gen_model_answer.py --model-path "<model-name>" --model-id <model-id> [--use-copy] --gamma <integer> [--use-redundant]
```
Example (with CopySpec on MT-Redundant):
```bash
python fastchat/llm_judge/gen_model_answer.py \
  --model-path "meta-llama/Llama-3.1-8B-Instruct" \
  --model-id "llama3-8B-experiments-redundant-copy" \
  --use-copy \
  --gamma 3 \
  --use-redundant
```
- `--model-path`: (Required) The Hugging Face model identifier.
- `--model-id`: (Required) The identifier used for experiments.
- `--use-copy`: (Optional) Include this flag to enable speculative copying. Omit to disable.
- `--gamma`: (Optional, default: `3`) Sets the number of tokens searched for speculative copying (required if `--use-copy` is set).
- `--use-redundant`: (Optional, default: `False`) If set, uses the MT-Redundant dataset instead of MT-Bench.
To evaluate CopySpec on the GSM-8K dataset, navigate to the `GSM` directory and run:
```bash
python cp_gsm.py --model-path "<model-name>" [--use-copy] --gamma <integer>
```
Example:
```bash
python cp_gsm.py --model-path "Qwen/Qwen2.5-7B-Instruct" --use-copy --gamma 3
```
- `--model-path`: (Required) The Hugging Face model identifier.
- `--use-copy`: (Optional) Include this flag to enable speculative copying. Omit to disable.
- `--gamma`: (Optional, default: `3`) Sets the number of tokens searched for speculative copying (required if `--use-copy` is set).
If you find this work useful, please cite our paper:
```bibtex
@misc{dumitru2025copyspecacceleratingllmsspeculative,
  title={CopySpec: Accelerating LLMs with Speculative Copy-and-Paste Without Compromising Quality},
  author={Razvan-Gabriel Dumitru and Minglai Yang and Vikas Yadav and Mihai Surdeanu},
  year={2025},
  eprint={2502.08923},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2502.08923},
}
```