This repository contains the code, datasets, and instructions for the training and evaluation in the work *How Should We Enhance the Safety of Large Reasoning Models: An Empirical Study*.
- Environment
- Codebase Directory
- Training Dataset Preparation
- Safety Fine-tuning
- ASR and Over-Refusal Evaluation
- Reasoning Performance Evaluation
## Environment

```bash
pip install -r requirements.txt
```
## Codebase Directory

| Directory | Description |
|---|---|
| `./attack_refusal_evaluation` | Scripts to evaluate model safety via refusal behavior on the PAIR, PAP, None, and XSTest benchmarks. |
| `./reasoning_evaluation` | Evaluation scripts for reasoning performance on math and code benchmarks: MATH-500, AIME 2024, and LiveCodeBench. |
| `./dataset` | Contains raw data and preprocessing scripts for generating SFT-ready datasets. |
| `./safety_train` | Training code and datasets for SFT across different safety reasoning settings. |
## Training Dataset Preparation

The safety and math reasoning samples are stored in `./dataset/data`. Each safety CoT data file contains 1,000 samples, and the math CoT data file contains 4,000 samples. You can also download the raw dataset from Hugging Face.
- **MATH**: 4,000 mathematical reasoning examples.
- **Default CoT**: 1,000 safety-related examples using the default CoT prompting.
- **RealSafe CoT**: 1,000 safety-related examples with RealSafe-style CoT prompting.
- **Improved CoT**: 1,000 safety-related examples with Improved CoT prompting.
- **Short CoT**: 1,000 safety-related examples with Short CoT prompting.
- **Template CoT**: 1,000 safety-related examples using Template CoT prompting.
- **No CoT**: 1,000 safety-related examples without any CoT.
Each data sample has the following fields:

- `instruction` (optional): The system instruction or instruction prepended to the prompt.
- `prompt`: The prompt.
- `response`: The response from the model.
- `solution` (optional): The ground-truth solution for the question.
- `data_type`: The category of the data, either "safety" for safety-related prompts or "math" for mathematical queries.
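To make the schema concrete, here is an invented example record in this format; the field values (and the CoT delimiters inside `response`) are illustrative only, not taken from the actual dataset:

```json
{
  "instruction": "You are a helpful assistant.",
  "prompt": "Solve x^2 - 5x + 6 = 0.",
  "response": "<think>Factor: (x - 2)(x - 3) = 0.</think> The solutions are x = 2 and x = 3.",
  "solution": "x = 2, x = 3",
  "data_type": "math"
}
```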
Use `proc.py` (in `./dataset`) to process the dataset:
- Set `mode = 0` to generate SFT-formatted data for a specific setting. There are 6 safety reasoning settings, each with 2 variants based on the number of safety samples:
  - 1,000 safety + 4,000 math samples
  - 400 safety + 4,000 math samples

  ```python
  mode = 0
  safety_type = <1 to 6>  # Choose which of the 6 safety settings
  safety_count = 1000  # or 400; number of safety samples
  ```
- Set `mode = 1` to check the lengths of the data for a specific setting.
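Then run the script (assuming, as the snippet above suggests, that `mode`, `safety_type`, and `safety_count` are edited directly in `proc.py` rather than passed as flags):

```bash
cd dataset
python proc.py
```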
All SFT datasets generated using `./dataset/proc.py` are located in `./safety_train/data/sft`.
## Safety Fine-tuning

- Set the training data path, model path, and `save_path` in `./safety_train/trl_train_code/run_sft.sh` (see the sketch below).
- Launch training with:

```bash
bash run_sft.sh
```
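As a hypothetical sketch of what these settings might look like inside `run_sft.sh` (the actual variable names in the script may differ; these are placeholders, not the repository's exact configuration):

```bash
# Hypothetical placeholders -- check run_sft.sh for the actual variable names.
train_data=../data/sft/<your_sft_dataset>   # SFT data generated by proc.py
model_path=/path/to/base_model              # base model to fine-tune
save_path=/path/to/output_checkpoints       # where checkpoints are written
```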
## ASR and Over-Refusal Evaluation

This section details the evaluation process for the three Safety Performance Evaluation tests (PAIR, PAP, and None) and the one Over-Refusal Evaluation test (XSTest).
To perform the PAIR evaluation, first set the model path and GPU ID of the model you want to evaluate in `run_pair.sh`, e.g.:

```bash
# model_path GPU_ID
"checkpoint_path 7"
```
Then, run the evaluation script:

```bash
cd attack_refusal_evaluation
bash run_pair.sh
```
After the run completes, set the model path in `pair_res.sh`. Finally, run the script to process the results:

```bash
cd attack_refusal_evaluation
bash pair_res.sh
```
To run these two attack tests, set the `model_paths` and `out_names` in `gen_pap_no.sh` (see the sketch after the commands below), then run it:

```bash
cd attack_refusal_evaluation
bash gen_pap_no.sh
```
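As a hypothetical sketch of these settings (the exact variable syntax in `gen_pap_no.sh` is an assumption; check the script itself):

```bash
# Hypothetical example -- the real syntax in gen_pap_no.sh may differ.
model_paths=("/path/to/checkpoint_a" "/path/to/checkpoint_b")  # models to evaluate
out_names=("model_a" "model_b")                                # names for output files
```

`run_xstest.sh` below follows the same pattern.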
To run the over-refusal evaluation, set the `model_path` and `out_names` in `run_xstest.sh`, then run it:

```bash
cd attack_refusal_evaluation
bash run_xstest.sh
```
## Reasoning Performance Evaluation

This directory contains the evaluation scripts for the three benchmarks used in the Reasoning Performance Evaluation: MATH-500, AIME 2024, and LiveCodeBench.
To evaluate on MATH-500 and AIME 2024:

- Set the `model_name_or_paths` in `eval_math.sh` (see the note below).
- Run the following commands:

```bash
cd reasoning_evaluation
bash eval_math.sh
```
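The plural name suggests `model_name_or_paths` accepts several checkpoints; a hypothetical example (the exact syntax in `eval_math.sh` is an assumption):

```bash
# Hypothetical -- confirm the exact syntax in eval_math.sh.
model_name_or_paths="/path/to/checkpoint_a /path/to/checkpoint_b"
```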
To evaluate on LiveCodeBench:

- Clone the dataset repository into the target directory:

```bash
git clone https://huggingface.co/datasets/livecodebench/code_generation_lite reasoning_safety/reasoning_evaluation/LiveCodeBench
```
- Set the GPUs, `model_paths`, and `model_names` in `eval_code.sh` (see the sketch below).
- Run the evaluation script:

```bash
cd reasoning_evaluation
bash eval_code.sh
```
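As with the other evaluation scripts, a hypothetical sketch of the variables to edit (the actual names in `eval_code.sh` may differ):

```bash
# Hypothetical placeholders -- check eval_code.sh for the actual variable names.
GPUS="0,1"                                                     # GPU IDs to use
model_paths=("/path/to/checkpoint_a" "/path/to/checkpoint_b")  # checkpoints to evaluate
model_names=("model_a" "model_b")                              # names used for result files
```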