We introduce CopySpec, an innovative technique designed to tackle the inefficiencies large language models (LLMs) face when generating responses that closely resemble previous outputs. CopySpec identifies repeated sequences in the model’s chat history and speculates that the same tokens will follow, enabling seamless copying without compromising output quality or requiring additional GPU memory.
To evaluate the effectiveness of our approach, we conducted experiments with five LLMs on five datasets: MT-Bench, CNN/DailyMail, GSM-8K, HumanEval, and our newly created dataset, MT-Redundant. MT-Redundant transforms the second turn of MT-Bench into a request for variations of the first turn's answer, simulating real-world scenarios where users request modifications to prior responses.
Our results demonstrate significant speed-ups:
- Up to 2.35× on CNN/DailyMail.
- 3.08× on the second turn of select MT-Redundant categories.
- 2.66× on the third turn of GSM-8K’s self-correction tasks.
- 49% additional speed-up over speculative decoding on the second turn of MT-Redundant across all eight categories.
While LLM inference, even with speculative decoding, slows down as the context grows, CopySpec leverages the expanded context as a source of draft tokens, so it becomes faster as the context size increases.
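The core idea can be sketched in a few lines: when the most recently generated tokens also occur earlier in the context, the tokens that followed that earlier occurrence are proposed as a draft, to be verified by the model in a single forward pass. The sketch below is our own illustration, not the repository's implementation; the function name, the `match_len` parameter, and the defaults are assumptions made for clarity.

```python
def propose_copy(tokens, gamma=3, match_len=3):
    """Propose up to `gamma` draft tokens by copying from earlier context.

    If the last `match_len` tokens also appeared earlier in `tokens`,
    speculate that the tokens which followed that earlier occurrence
    will be generated again. Illustrative sketch only: in CopySpec the
    drafted tokens are then verified against the target model, so
    output quality is unaffected.
    """
    if len(tokens) <= match_len:
        return []
    suffix = tokens[-match_len:]
    # Search backwards for the most recent earlier occurrence of the suffix.
    for start in range(len(tokens) - match_len - 1, -1, -1):
        if tokens[start:start + match_len] == suffix:
            return tokens[start + match_len:start + match_len + gamma]
    return []
```

When a response repeats material from the chat history (e.g., a code block being lightly revised), the copied draft is usually accepted wholesale, emitting many tokens per model call instead of one.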
The MT-Redundant dataset can be downloaded from:
To use it, replace the questions file in MT-Bench with these new questions.
Clone the repository and install dependencies:
```bash
git clone https://github.com/RazvanDu/SpeculativeCopying.git
cd SpeculativeCopying
pip install -r requirements.txt
```
Set up the environment by specifying the model cache directory and the main repository path:
```bash
export cache_dir="<path-to-cache-dir>"
export copyspec_path="<path-to-copyspec-repo>"
```
For example:
```bash
export cache_dir="/mnt/razvandu/speculative_decoding/models_cache"
export copyspec_path="/mnt/razvandu/speculative_decoding/"
```
To evaluate CopySpec on the CNN/DailyMail dataset, navigate to the `CNNDM` directory and run:
```bash
python evaluate_cnn.py --model-path "<model-name>" [--use-copy] --gamma <integer>
```
Example (with CopySpec enabled):
```bash
python evaluate_cnn.py --model-path "meta-llama/Llama-3.1-8B-Instruct" --use-copy --gamma 3
```
- `--model-path`: (Required) The Hugging Face model identifier.
- `--use-copy`: (Optional) Include this flag to enable speculative copying. Omit to disable.
- `--gamma`: (Optional, default: `3`) Sets the number of tokens searched for speculative copying (required if `--use-copy` is set).
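The `--gamma` flag bounds how many copied tokens are drafted per step. After drafting, the target model checks the draft in one forward pass, and tokens are accepted only up to the first disagreement. The following accept/reject rule is our own minimal illustration of that idea, not the repository's code:

```python
def accept_prefix(draft, model_tokens):
    """Keep draft tokens only while they agree with the target model.

    `model_tokens` stands in for the tokens the target model would emit
    at each drafted position (obtained from a single forward pass over
    the draft). Accepting the longest agreeing prefix keeps the output
    identical to ordinary greedy decoding.
    """
    accepted = []
    for drafted, expected in zip(draft, model_tokens):
        if drafted != expected:
            break
        accepted.append(drafted)
    return accepted
```

A larger `--gamma` lets more tokens through per model call when the copy guess holds, but wastes verification compute on tokens past the first mismatch, so moderate values such as the default `3` are a reasonable starting point.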
To evaluate CopySpec on EvalPlus, navigate to `src/evalplus` and run:
```bash
python evalplus/evaluate.py --model "<model-name>" --dataset <dataset-name> --backend <hf/spec> --greedy --device_map "auto" --trust_remote_code true --gamma <integer>
```
Example:
```bash
python evalplus/evaluate.py --model "meta-llama/Llama-3.1-8B-Instruct" --dataset humaneval --backend spec --greedy --device_map "auto" --trust_remote_code true --gamma 3
```
- `--model`: (Required) The Hugging Face model identifier.
- `--dataset`: (Required) The dataset to evaluate (e.g., `humaneval`).
- `--backend`: (Required) Can be `hf` (Hugging Face base model) or `spec` (speculative copying).
- `--gamma`: (Optional, default: `3`) Sets the number of tokens searched for speculative copying (required if `--backend spec` is used).
- `--greedy`: (Optional) Enables greedy decoding.
- `--device_map`: (Optional, default: `"auto"`) Sets device mapping for execution.
- `--trust_remote_code`: (Optional, default: `true`) Allows loading external code.
To evaluate CopySpec on MT-Bench and MT-Redundant, navigate to the `FastChat` directory and first install the necessary dependencies:
```bash
cd FastChat
pip install -e .
```
Then, run the evaluation command:
```bash
python fastchat/llm_judge/gen_model_answer.py --model-path "<model-name>" --model-id <model-id> [--use-copy] --gamma <integer> [--use-redundant]
```
Example (with CopySpec on MT-Redundant):
```bash
python fastchat/llm_judge/gen_model_answer.py \
  --model-path "meta-llama/Llama-3.1-8B-Instruct" \
  --model-id "llama3-8B-experiments-redundant-copy" \
  --use-copy \
  --gamma 3 \
  --use-redundant
```
- `--model-path`: (Required) The Hugging Face model identifier.
- `--model-id`: (Required) The identifier used for experiments.
- `--use-copy`: (Optional) Include this flag to enable speculative copying. Omit to disable.
- `--gamma`: (Optional, default: `3`) Sets the number of tokens searched for speculative copying (required if `--use-copy` is set).
- `--use-redundant`: (Optional, default: `False`) If set, uses the MT-Redundant dataset instead of MT-Bench.
To evaluate CopySpec on the GSM-8K dataset, navigate to the `GSM` directory and run:
```bash
python cp_gsm.py --model-path "<model-name>" [--use-copy] --gamma <integer>
```
Example:
```bash
python cp_gsm.py --model-path "Qwen/Qwen2.5-7B-Instruct" --use-copy --gamma 3
```
- `--model-path`: (Required) The Hugging Face model identifier.
- `--use-copy`: (Optional) Include this flag to enable speculative copying. Omit to disable.
- `--gamma`: (Optional, default: `3`) Sets the number of tokens searched for speculative copying (required if `--use-copy` is set).
If you find this work useful, please cite our paper:
```bibtex
@misc{dumitru2025copyspecacceleratingllmsspeculative,
  title={CopySpec: Accelerating LLMs with Speculative Copy-and-Paste Without Compromising Quality},
  author={Razvan-Gabriel Dumitru and Minglai Yang and Vikas Yadav and Mihai Surdeanu},
  year={2025},
  eprint={2502.08923},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2502.08923},
}
```