This repository contains the code, datasets, and instructions for the training and evaluation in the work *How Should We Enhance the Safety of Large Reasoning Models: An Empirical Study*.
- Environment
- Codebase Directory
- Training Dataset Preparation
- Safety Fine-tuning
- ASR and Over-Refusal Evaluation
- Reasoning Performance Evaluation
## Environment

```bash
pip install -r requirements.txt
```
## Codebase Directory

| Directory | Description |
|---|---|
| `./attack_refusal_evaluation` | Scripts to evaluate model safety via refusal behavior on the PAIR, PAP, None, and XSTest benchmarks. |
| `./reasoning_evaluation` | Evaluation scripts for reasoning performance on math and code benchmarks: MATH-500, AIME 2024, and LiveCodeBench. |
| `./dataset` | Contains raw data and preprocessing scripts for generating SFT-ready datasets. |
| `./safety_train` | Training code and datasets for SFT across different safety reasoning settings. |
## Training Dataset Preparation

The safety and math reasoning samples are stored in `./dataset/data`. Each safety CoT data file contains 1,000 samples, and the math CoT data file contains 4,000 samples. You can also download the raw dataset from Hugging Face.
- **MATH**: 4,000 mathematical reasoning examples.
- **Default CoT**: 1,000 safety-related examples using the default CoT prompting.
- **RealSafe CoT**: 1,000 safety-related examples with RealSafe-style CoT prompting.
- **Improved CoT**: 1,000 safety-related examples with Improved CoT prompting.
- **Short CoT**: 1,000 safety-related examples with Short CoT prompting.
- **Template CoT**: 1,000 safety-related examples using Template CoT prompting.
- **No CoT**: 1,000 safety-related examples without any CoT.
Each data sample has the following fields:

- `instruction` (optional): The system instruction or instruction prepended to the prompt.
- `prompt`: The prompt.
- `response`: The response from the model.
- `solution` (optional): The ground-truth solution for the question.
- `data_type`: The category of the data, either "safety" for safety-related prompts or "math" for mathematical queries.
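To make the schema concrete, here is an invented example record in this format; the field values (and the CoT delimiters inside `response`) are illustrative only, not taken from the actual dataset:

```json
{
  "instruction": "You are a helpful assistant.",
  "prompt": "Solve x^2 - 5x + 6 = 0.",
  "response": "<think>Factor: (x - 2)(x - 3) = 0.</think> The solutions are x = 2 and x = 3.",
  "solution": "x = 2, x = 3",
  "data_type": "math"
}
```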
Use `proc.py` (in `./dataset`) to process the dataset:
- Set `mode = 0` to generate SFT-formatted data for a specific setting. There are 6 safety reasoning settings, each with 2 variants based on the number of safety samples:
  - 1,000 safety + 4,000 math samples
  - 400 safety + 4,000 math samples

  ```python
  mode = 0
  safety_type = <1 to 6>  # Choose which of the 6 safety settings
  safety_count = 1000  # or 400; number of safety samples
  ```
- Set `mode = 1` to check the lengths of the data for a specific setting.
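Then run the script (assuming, as the snippet above suggests, that `mode`, `safety_type`, and `safety_count` are edited directly in `proc.py` rather than passed as flags):

```bash
cd dataset
python proc.py
```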
All SFT datasets generated using `./dataset/proc.py` are located in `./safety_train/data/sft`.
## Safety Fine-tuning

- Set the training data path, model path, and `save_path` in `./safety_train/trl_train_code/run_sft.sh` (see the sketch below).
- Launch training with:

```bash
bash run_sft.sh
```
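As a hypothetical sketch of what these settings might look like inside `run_sft.sh` (the actual variable names in the script may differ; these are placeholders, not the repository's exact configuration):

```bash
# Hypothetical placeholders -- check run_sft.sh for the actual variable names.
train_data=../data/sft/<your_sft_dataset>   # SFT data generated by proc.py
model_path=/path/to/base_model              # base model to fine-tune
save_path=/path/to/output_checkpoints       # where checkpoints are written
```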
## ASR and Over-Refusal Evaluation

This section details the evaluation process for the three Safety Performance Evaluation tests (PAIR, PAP, and None) and the one Over-Refusal Evaluation test (XSTest).
To perform the PAIR evaluation, first set the model path and GPU ID of the model you want to evaluate in `run_pair.sh`, e.g.:

```bash
# model_path GPU_ID
"checkpoint_path 7"
```
Then, run the evaluation script:

```bash
cd attack_refusal_evaluation
bash run_pair.sh
```
After the run completes, set the model path in `pair_res.sh`. Finally, run the script to process the results:

```bash
cd attack_refusal_evaluation
bash pair_res.sh
```
To run these two attack tests, set the `model_paths` and `out_names` in `gen_pap_no.sh` (see the sketch after the commands below), then run it:

```bash
cd attack_refusal_evaluation
bash gen_pap_no.sh
```
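As a hypothetical sketch of these settings (the exact variable syntax in `gen_pap_no.sh` is an assumption; check the script itself):

```bash
# Hypothetical example -- the real syntax in gen_pap_no.sh may differ.
model_paths=("/path/to/checkpoint_a" "/path/to/checkpoint_b")  # models to evaluate
out_names=("model_a" "model_b")                                # names for output files
```

`run_xstest.sh` below follows the same pattern.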
To run the over-refusal evaluation, set the `model_path` and `out_names` in `run_xstest.sh`, then run it:

```bash
cd attack_refusal_evaluation
bash run_xstest.sh
```
## Reasoning Performance Evaluation

This directory contains the evaluation scripts for the three benchmarks used in the Reasoning Performance Evaluation: MATH-500, AIME 2024, and LiveCodeBench.
To evaluate on MATH-500 and AIME 2024:

- Set the `model_name_or_paths` in `eval_math.sh` (see the note below).
- Run the following commands:

```bash
cd reasoning_evaluation
bash eval_math.sh
```
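The plural name suggests `model_name_or_paths` accepts several checkpoints; a hypothetical example (the exact syntax in `eval_math.sh` is an assumption):

```bash
# Hypothetical -- confirm the exact syntax in eval_math.sh.
model_name_or_paths="/path/to/checkpoint_a /path/to/checkpoint_b"
```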
To evaluate on LiveCodeBench:

- Clone the dataset repository into the target directory:

```bash
git clone https://huggingface.co/datasets/livecodebench/code_generation_lite reasoning_safety/reasoning_evaluation/LiveCodeBench
```
- Set the GPUs, `model_paths`, and `model_names` in `eval_code.sh` (see the sketch below).
- Run the evaluation script:

```bash
cd reasoning_evaluation
bash eval_code.sh
```
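As with the other evaluation scripts, a hypothetical sketch of the variables to edit (the actual names in `eval_code.sh` may differ):

```bash
# Hypothetical placeholders -- check eval_code.sh for the actual variable names.
GPUS="0,1"                                                     # GPU IDs to use
model_paths=("/path/to/checkpoint_a" "/path/to/checkpoint_b")  # checkpoints to evaluate
model_names=("model_a" "model_b")                              # names used for result files
```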