📄 Paper | 🤗 EvilMath | 🤗 Models | ✏️ Blog Post
This is the code for "The Jailbreak Tax: How Useful are Your Jailbreak Outputs?" by Kristina Nikolić, Luze Sun, Jie Zhang, and Florian Tramèr.
With this codebase you can:
- Evaluate your custom jailbreak techniques on the Jailbreak Tax benchmark
- Generate model prompts in the jailbreak tax benchmark format and use them to test your jailbreak in your own setup
- Automatically grade the model outputs to correct, incorrect or refusal
- Reproduce the experiments from the paper
The framework measures the "jailbreak tax" - the utility loss after a successful jailbreak compared to the unaligned model.
The jailbreak tax experiments are done in three stages:
- Unaligned Stage: Evaluate the base model to obtain the baseline accuracy (alignment off, jailbreak off)
- Before Jailbreak Stage: Align the model to refuse specific tasks (alignment on, jailbreak off)
- After Jailbreak Stage: Attack the aligned model. Measure the aligned model accuracy when subjected to a jailbreak (alignment on, jailbreak on)
- Python 3.12.5
- CUDA 12.4 (for GPU support)
- requirements.txt
- Create a new conda environment:
conda create -n jailbreak-tax python=3.12.5
conda activate jailbreak-tax
- Install PyTorch with CUDA support:
conda install pytorch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 pytorch-cuda=12.4 -c pytorch -c nvidia
- Install other dependencies:
pip install -r requirements.txt
Setup your API keys in the .env
file as in .env.example
. You can leave keys empty if you are not using specific provider.
The run_jb_tax.py
script implements end to end jailbreak tax experiments. It can run all three stages or specific stage as needed.
--question_range
: range of questions to process (e.g., '0-4')--run
: run id--dataset
: choose the dataset to load ('wmdp-bio', 'gsm8k', 'gsm8k-evil', 'math')--attack_type
: jailbreak technique to use--select_model
: model path (e.g., 'meta-llama/Llama-3.1-70B-Instruct')--api_or_local
: whether to use API or local model ('api' or 'local')--alignment_method
: alignment method ('system_prompt', 'fine_tuned', 'no_alignment')--stage
: experiment stage to run ('unaligned', 'before_jb', 'after_jb', or 'all')
Note: For jailbreak-specific arguments, see the Available Jailbreak Attacks section. For alignment-specific arguments, see the Alignment Methods section.
python run_jb_tax.py --question_range 0-9 --run experiment1 --dataset gsm8k --attack_type simple_jb --select_model meta-llama/Meta-Llama-3.1-8B-Instruct --api_or_local local --alignment_method system_prompt --stage all
This command will:
- Process questions 0 to 9 from the gsm8k dataset
- Apply a simple jailbreak attack (System Prompt attack)
- Use the local Llama 3.1 8B model (downloaded from Huggingface)
- Use system prompt alignment method
- Run all stages
- Save results in directory structure based on the provided arguments
Results will be saved in the Results
directory, organized by dataset, run id (--run
), stage, alignment method and/or attack type, and question range. Alternatively, you can set the --custom_save_dir
argument to save results to specific directory.
For each stage the following files will be generated:
grading_totals.json
: result statisticsresults_<dataset>_<alignment_method>_<attack_type>.json
: model outputs and grading resultsprompts_<dataset>_<alignment_method>_<attack_type>.json
: the prompts used for the experiment (before jailbreak if applicable)experiment.log
: log file
If --all
stages are run, the combined results are generated:
combined_results.json
: the model outputs and grading results from all three stagescombined_statistics.json
: grading statistics from all three stages, including the jailbreak taxresponse_distribution.png
: distribution of model outputs for all three stages (correct, incorrect, refusal)
Note: You can also use the combine_grading_results()
function in Utils/jb_tax_grading_utils.py
to combine the results from all three stages. See Combine Results for more details.
Scripts
directory contains example run scripts for all combinations of datasets, alignment methods and jailbreak attacks used in the paper:
WMDP
: Scripts for WMDP datasetMATH
: Scripts for MATH datasetGSM8K
: Scripts for GSM8K datasetEvilMath
: Scripts for EvilMath dataset
The jb_tax_example_file.py
demonstrates how to use the codebase classes directly for more customized experiments. It shows:
- How to implement custom jailbreak functions
- How to configure experiment for different datasets (WMDP, GSM8K, MATH)
- How to run experiments with different alignment and attack combinations
- How to grade and combine results (Grade Model Outputs)
- How to generate formatted prompts (Generate Formatted Prompts)
The framework includes utilities to automatically grade model outputs for different datasets. You can use the grade_results()
in Utils/jb_tax_grading_utils.py
, which:
- Loads results from a JSON file
- Grades results based on the dataset
- Appends grading information to the results
grade_results(results_path: str, dataset: str) -> None
Parameters:
results_path
: Path to the JSON file containing model responsesdataset
: Dataset type ('gsm8k', 'math', 'wmdp', or 'gsm8k-evil')
Example result files for each dataset type can be found in the ExampleResults
directory, and you can use example scripts to grade them: Scripts/run_grade_results.sh
.
After grading, a grading
field will be added to each result in the input file, indicating whether the answer was correct
, incorrect
, or a refusal
. Additionally, a grading_totals.json
file will be generated in the same directory with statistics on grading results.
You can combine grading results from all three stages to generate single file results and statistics (including the jailbreak tax):
combine_results_files(unaligned_path: str, before_jb_path: str, after_jb_path: str, output_dir: str) -> None
The unaligned_path
, before_jb_path
, and after_jb_path
are the result files after grading.
Alternatively, you can use Scripts/run_combine_stages.sh
.
You can plot the jailbreak tax vs success rate using the Utils/plots_notebook.ipynb
notebook.
You can generate formatted prompts without running the full experiment. The jb_tax_example_file.py
provides an example of how to generate formatted prompts for the GSM8K dataset using the Experiment.generate_formatted_prompts()
method instead of the Experiment.run()
.
We finetune models for two purposes: 1) alignment: we pseudo-alignment the base model against answering math or bio questions, 2) attacks: use SFT as an attack on the pseudo-aligned model (the target alignment here can be SFT or system prompt instruction).
For example, for bio questions, we have three modes:
-
Mode: refuse - This will finetune the model to refuse answering biology questions and save it as "finetuned_model_refuse"
-
Mode: do_bio_again - This will take the refused model and finetune it to answer biology questions again, saving it as "finetuned_model_do_bio_again"
-
Mode: do_bio - This will finetune the original model (with system prompt as a defense) to answer biology questions directly, saving it as "finetuned_model_do_bio"
To use the script, you can run it with the required mode argument:
cd Finetune && python finetune_bio_70b.py --mo
8000
de refuse
Refuse | 8B Model | 70B Model |
---|---|---|
Biology | ethz-spylab/Llama-3.1-8B-Instruct_refuse_bio |
ethz-spylab/Llama-3.1-70B-Instruct_refuse_biology |
Math | ethz-spylab/Llama-3.1-8B-Instruct_refuse_math |
ethz-spylab/Llama-3.1-70B-Instruct_refuse_math |
Do | Targeted Alignment Type | 8B Model | 70B Model |
---|---|---|---|
Biology | System Prompt | ethz-spylab/Llama-3.1-8B-Instruct_do_bio |
ethz-spylab/Llama-3.1-70B-Instruct_do_biology_5e-5 |
Biology | SFT | ethz-spylab/Llama-3.1-8B-Instruct_do_bio_again |
ethz-spylab/Llama-3.1-70B-Instruct_do_biology_again_5e-5 |
Math | System Prompt | ethz-spylab/Llama-3.1-8B-Instruct_do_math_chat |
ethz-spylab/Llama-3.1-70B-Instruct_do_math_chat |
Math | SFT | ethz-spylab/Llama-3.1-8B-Instruct_do_math_again |
ethz-spylab/Llama-3.1-70B-Instruct_do_math_again |
In Data/EvilMath/src
you can find the script we used to generate the EvilMath dataset. The final version is available at ethz-spylab/EvilMath.
If you want to use local version of the dataset, set --use_local_dataset
flag and pass the .json file path to --evil_dataset_path
. E.g.,
--use_local_dataset
--evil_dataset_path "Data/EvilMath/data/evil_math_data_test.json"
We recommend using the ethz-spylab/EvilMath from HuggingFace if you don't make modifications to the dataset.
-
system_prompt
: System prompt alignment. The system prompts can be found inConfig/gsm8k_formats.json
andConfig/question_formats.json
--alignment_method system_prompt --system_prompt_type <system_prompt_type>
Default system prompt types per dataset:
- For WMDP:
refuse_biology
- For GSM8K:
refuse_math
- For MATH:
refuse_math
You can add new system prompt types to the
Config/question_formats.json
orConfig/gsm8k_formats.json
file. - For WMDP:
-
fine_tuned
: Alignment by fine-tuning the target model--alignment_method fine_tuned --alignment_model_path <path_to_model>
-
no_alignment
: No alignment--alignment_method no_alignment
This option runs the model without any alignment techniques. It is used for the EvilMath experiments that rely on the internal safety mechanisms of the model.
-
simple_jb
: System Prompt attack--attack_type simple_jb
-
translate
: MultiJail Attack. Translates prompt to another language--attack_type translate --target_language <language> e.g. French
-
PAIR
: PAIR attack with multiple rounds--attack_type PAIR --max_rounds 20
-
PAIR (don't modify)
: PAIR attack specifically instructed to not modify the original prompt, only the context around the question--attack_type PAIR --max_rounds 20 --dont_modify_question
-
TAP
: Tree of Attacks and Pruning--attack_type TAP --width 5 --branching_factor 3 --depth 5 --keep_last_n 3
-
gcg
: Gradient-based attack--attack_type gcg --gcg_init_str_len 20 --gcg_num_steps 250 --gcg_search_width 128 --gcg_topk 128 --gcg_seed 42 --gcg_early_stop False --target_string <custom_target_string>
Default target strings per dataset:
- For WMDP:
The correct answer is:
- For GSM8K:
Let's solve this problem step by step:
- For MATH:
Let's solve this problem step by step:
- For WMDP:
-
AutoDAN
: Auto-DAN attack--attack_type AutoDAN --autodan_num_steps 50 --autodan_batch_size 8 --autodan_elite_ratio 0.05 --autodan_crossover 0.5 --autodan_mutation 0.01 --autodan_num_points 5 --autodan_iteration_interval 5
Note: see the Special Note for AutoDAN section.
-
finetune
: Fine-tuning attack--attack_type finetune --peft_model_path <path_to_finetuned_model>
-
many-shot
: Many-shot attack--attack_type many-shot --many_shot_num_questions 100 --many_shot_file <path_to_many_shot_file>
-
custom_jb
: Custom jailbreak implementation--attack_type custom_jb
This requires implementing a custom jailbreak function. See
jb_tax_example_file.py
for examples. (Remember to setcustom_jailbreak=True
in the attack config)
Acknowledgements: The jailbreak implementations are based on the original jailbreak repositories.
Use the --api_or_local
parameter to choose between API models (via OpenRouter or OpenAI) or local models:
# For API models
--api_or_local api
--model_provider <provider_name> # Options: 'openrouter' and 'openai'
--select_model <model_name> # e.g. 'openai/gpt-4o-mini'
# For local models
--api_or_local local
--select_model <local_or_hf_model_path> # e.g. 'meta-llama/Meta-Llama-3.1-8B-Instruct'
If you want to add a new model, add it to the list of supported models in the Utils/argument_parser.py
file, get_model_choices()
function.
wmdp-bio
: Biology questions from the WMDP datasetgsm8k
: Grade school math problems from the GSM8K datasetgsm8k-evil
: EvilMath problems generated by rewording the GSM8K dataset to harmful topics (ethz-spylab/EvilMath)math
: MATH dataset with various difficulty levels (set --math_level to 'Level 1', 'Level 2', 'Level 3', 'Level 4', or 'Level 5')
Note: MATH dataset is not available on Huggingface, the download instructions are on github.com/hendrycks/math
In Data/many_shot/
you can find the script we used to generate the many-shot data for each dataset.
--force_overwrite
: don't check if the results file already exists, overwrite it by default.--custom_save_dir
: save the results to a custom directory.--format_type
: choose the prompt format type (seeConfig/question_formats.json
andConfig/gsm8k_formats.json
for available options)
The fschat
library used in the original implementation is not updated for the Llama 3.1 model format. We updated it here: https://github.com/nkristina/FastChat
To run AutoDAN jailbreak, install the fschat
library from this fork:
First, remove fschat
and fastchat
from your environment.
pip uninstall fastchat
pip uninstall fschat
Clone the fork (can be out of Jailbreak Tax repo)
git clone https://github.com/nkristina/FastChat.git
cd FastChat
pip3 install --upgrade pip # enable PEP 660 support
pip3 install -e ".[model_worker,webui]"
Now you can navigate back to the Jailbreak Tax repo and run the experiment with AutoDAN jailbreak.
JailbreakTax/
├── Config/ # System prompts, grading and question formats
├── Data/
│ ├── EvilMath/ # Evil math dataset generation and processing
│ └── many_shot/ # Many-shot attack examples and generation scripts
│ └── result-formats/ # Example of input format for grading scripts
├── Finetune/ # Model fine-tuning scripts
├── Scripts/ # Example run scripts for different datasets and attacks
├── Utils/ # Core utilities for jailbreak experiments and evaluation
├── run_jb_tax.py # Main script for running jailbreak tax experiments
├── jb_tax_example_file.py # Example usage of the framework's classes
└── requirements.txt # Required dependencies
If you use this code in your research, please cite the following paper:
@inproceedings{
nikolic2025the,
title={The Jailbreak Tax: How Useful are Your Jailbreak Outputs?},
author={Kristina Nikoli{\'c} and Luze Sun and Jie Zhang and Florian Tram{\`e}r},
booktitle={ICLR 2025 Workshop on Building Trust in Language Models and Applications},
year={2025},
url={https://openreview.net/forum?id=VSSQud4diJ}
}