genglinliu/safety-eval: A simple evaluation of generative language models and safety classifiers.

Genglin's Notes

Basically, you need to follow the setup instructions first:

conda create -n safety-eval python=3.10 && conda activate safety-eval
pip install -e .
pip install -r requirements.txt
pip install vllm==0.4.2
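
If you want a quick, optional sanity check (not part of the original instructions) that the pinned vLLM version installed cleanly, you can print its version:

python -c "import vllm; print(vllm.__version__)"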

The eval has two parts:

  1. inference with SGLang
  2. evaluation with a classifier

For the first part, we need to install SGLang:

pip install --upgrade pip
pip install uv
uv pip install "sglang[all]>=0.4.4.post1" --find-links https://flashinfer.ai/whl/cu124/torch2.5/flashinfer-python

Then you can launch the server with:

CUDA_VISIBLE_DEVICES=2 python -m sglang.launch_server --model-path Qwen/Qwen2.5-32B-Instruct --port=PORT_NUMBER
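
SGLang's launch_server exposes an OpenAI-compatible API, so a quick way to confirm the server is up (an optional check, not in the original notes; 30010 here just matches the BASE_URL used below) is to list the served models:

curl http://localhost:30010/v1/models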

Okay. Now we do the actual evaluation.

In src/dispatch_openai_requests.py, you need to change BASE_URL so that it points to the port of the server you launched.

BASE_URL = "http://localhost:30010/v1"

aclient = AsyncOpenAI(base_url=BASE_URL, api_key="dummy")
client = OpenAI(base_url=BASE_URL, api_key="dummy")

So now the script will still "think" it is calling OpenAI, but we are actually using the local model served by SGLang.
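
As an optional sanity check (not part of the original notes), you can send one request through the same client configuration and confirm that the local model answers; the model name below is simply the one loaded by the example server command above:

from openai import OpenAI

BASE_URL = "http://localhost:30010/v1"  # must match the --port you gave sglang.launch_server
client = OpenAI(base_url=BASE_URL, api_key="dummy")  # any non-empty key works for a local server

resp = client.chat.completions.create(
    model="Qwen/Qwen2.5-32B-Instruct",  # the model loaded by the server above
    messages=[{"role": "user", "content": "Reply with one word."}],
    max_tokens=8,
)
print(resp.choices[0].message.content)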

This is the command I use to run an evaluation (change the task and the output paths; the other args can stay unchanged):

(Remember that the actual model is set in the SGLang server, not here. Here we just say openai:gpt-4 so that the script treats it as an OpenAI model.)

A bigger batch size can speed things up, but make sure your GPU memory is sufficient.

python evaluation/eval.py generators  \
 --model_name_or_path openai:gpt-4 \
 --model_input_template_path_or_name tulu2  \
 --tasks bbh:no_cot  \
 --report_output_path ./generation_results/llama3.1-8b-base-tulu-10008-previous-data/metrics.json \
 --save_individual_results_path ./generation_results/llama3.1-8b-base-tulu-10008-previous-data/all.json  \
 --batch_size 128

Another note:

By default, the script will use the HarmBench classifier. If we want to use another classifier, we can change the classifier_model_name in the yaml file.

For the following safety evals (do_anything_now, harmbench, wildguardtest, xstest, trustllm_jailbreaktrigger), I have already changed the classifier to the keyword-based KeywordBasedRefusalClassifier, e.g. in:

evaluation/tasks/generation/harmbench/default.yaml

I did this because an LLM-based classifier requires another GPU to load. If compute permits, we can switch back to the HarmBench classifier (but we should keep it consistent across all tasks and models).
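
For reference, a sketch of what that change looks like in the yaml (assuming the same two-field layout shown in the Ai2 section below; the real file may contain additional fields):

# evaluation/tasks/generation/harmbench/default.yaml (after the switch)
task_class: HarmbenchVanilla
classifier_model_name: KeywordBasedRefusalClassifier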

[end of Genglin's notes]

Ai2 Safety Tool 🧰 (Evaluation Suite)

This repository contains code for easy and comprehensive safety evaluation of generative LMs and safety moderation tools. This evaluation framework is used in safety projects at Ai2, including:

  • WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models
  • WildGuard: Open One-stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs

This lets you display how the performance changes across more than 14 benchmarks, as below 🔥

[image: benchmark results]

Updates

  • 2024-07-06: Added support for MMLU, TruthfulQA, and the sorrybench classifier.

Features

  • Easy evaluation of generative language models on a list of safety benchmarks and general capabilities benchmarks such as MT-Bench and AlpacaEval2.
  • Easy evaluation of safety classifiers on three tasks (detection of prompt harmfulness, response harmfulness, and response refusal) across 10+ benchmarks.
  • Easy addition of new benchmarks and models to the evaluation suite.

Installation

conda create -n safety-eval python=3.10 && conda activate safety-eval
pip install -e .
pip install -r requirements.txt
pip install vllm==0.4.2

Generative LM Evaluation

Safety benchmarks

For all benchmarks requiring safety classification, unless noted otherwise, we use the WildGuard classifier by default to evaluate the safety of model outputs.

Changing classifiers for safety benchmarks:

You can change the safety classifier used for evaluation by specifying the classifier_model_name in the yaml file. For example, if you want to use HarmBench's classifier for evaluation on HarmBench, set HarmbenchClassifier as the classifier_model_name. Please check out evaluation/tasks/generation/harmbench/default.yaml and evaluation/tasks/classification/harmbench/harmbench_classsifier.yaml to see the classifier specification.

# evaluation/tasks/classification/harmbench/harmbench_classsifier.yaml
task_class: HarmbenchVanilla
classifier_model_name: HarmbenchClassifier

# evaluation/tasks/generation/harmbench/default.yaml
task_class: HarmbenchVanilla
classifier_model_name: WildGuard

Please refer to the src/classifier_models/ directory to explore the classifier implementations.

General capabilities benchmarks

Optimal safety training maintains or even improves models' general capabilities. We include general capability evaluation for monitoring this dimension of safety training.

Support for additional benchmarks, including IFEval and TyDiQA, is in progress. For TyDiQA, please use open-instruct to evaluate the models for now.

How-to-use

Below are commands to run safety and general capability benchmarking for generative LMs. The first command can be used to run all included benchmarks for models which support vLLM. The second command can be used to select individual benchmarks for evaluation. To specify a task, the syntax is <folder>:<config_yaml>, where folder is a folder under tasks/generation and config_yaml is the name of the configuration yaml file excluding .yaml.

# run all generation benchmarks by a single command. assume you are using vllm.
# note that you should add OPENAI_API_KEY to your environment variables when you use mtbench and alpacaeval2.
export CUDA_VISIBLE_DEVICES={NUM_GPUS};
python evaluation/run_all_generation_benchmarks.py \
    --model_name_or_path allenai/tulu-2-dpo-7b \
    --model_input_template_path_or_name tulu2 \
    --report_output_path ./generation_results/metrics.json \
    --save_individual_results_path ./generation_results/all.json
    
# run specific generation benchmarks by a single command. here, we use three benchmarks.
python evaluation/eval.py generators \
  --use_vllm \
  --model_name_or_path allenai/tulu-2-dpo-7b \
  --model_input_template_path_or_name tulu2 \
  --tasks wildguardtest,harmbench,toxigen:tiny \
  --report_output_path ./generation_results/metrics.json \
  --save_individual_results_path ./generation_results/all.json
# run specific generation benchmarks for an OpenAI API model by a single command. here, we use three benchmarks.
python evaluation/eval.py generators \
  --model_name_or_path openai:gpt-4 \
  --model_input_template_path_or_name None \
  --tasks wildguardtest,harmbench,toxigen:tiny \
  --report_output_path ./generation_results/metrics.json \
  --save_individual_results_path ./generation_results/all.json

Safety Classifier Evaluation

Prompt harmfulness benchmarks

Response harmfulness benchmarks

Response refusal benchmarks

How-to-use

The commands below allow for running benchmarks to evaluate the quality of safety classifiers such as WildGuard and LlamaGuard. The first command can be used to run all included benchmarks, while the second can be used to run select benchmarks. Similar to the generation evals, to specify a task, the syntax is <folder>:<config_yaml>, where folder is a folder under tasks/classification and config_yaml is the name of the configuration yaml file excluding .yaml.


# run all classification benchmarks by a single command

export CUDA_VISIBLE_DEVICES={NUM_GPUS};
python evaluation/run_all_classification_benchmarks.py \
    --model_name WildGuard \
    --report_output_path ./classification_results/metrics.json \
    --save_individual_results_path ./classification_results/all.json

# run specific classification benchmarks by a single command. here, we use four benchmarks

python evaluation/eval.py classifiers \
  --model_name WildGuard \
  --tasks wildguardtest_prompt,wildguardtest_response,wildguardtest_refusal,openai_mod \
  --report_output_path ./classification_results/metrics.json \
  --save_individual_results_path ./classification_results/all.json

Acknowledgements

This repository uses some code from:

  • Harmbench -- in particular, code for model input templates,
  • Open-instruct -- in particular, code for model generation (general capabilities) benchmarks.

Citation

@misc{wildteaming2024,
      title={WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models}, 
      author={Liwei Jiang and Kavel Rao and Seungju Han and Allyson Ettinger and Faeze Brahman and Sachin Kumar and Niloofar Mireshghallah and Ximing Lu and Maarten Sap and Yejin Choi and Nouha Dziri},
      year={2024},
      eprint={2406.18510},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2406.18510}, 
}
@misc{wildguard2024,
      title={WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs}, 
      author={Seungju Han and Kavel Rao and Allyson Ettinger and Liwei Jiang and Bill Yuchen Lin and Nathan Lambert and Yejin Choi and Nouha Dziri},
      year={2024},
      eprint={2406.18495},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2406.18495}, 
}
