This repository is based on the codebase of the paper and extends it to investigate reward hacking in RLAIF.
- Visit https://huggingface.co/meta-llama/Llama-2-13b-hf
- Log in to your Hugging Face account
- Request access to the model. This step might take ~2 hours.
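Once access is granted, you can verify it from Python. This is a minimal sketch, assuming the `huggingface_hub` package is installed; `model_info` raises an error if your token does not have access to the gated repo.

```python
# Minimal access check, assuming the `huggingface_hub` package is installed.
from huggingface_hub import login, model_info

login()  # prompts for your Hugging Face access token
info = model_info("meta-llama/Llama-2-13b-hf")  # raises an error if access was not granted
print(info.id)
```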
git lfs install # Install git lfs
git lfs fetch --all # Fetch all large files
git lfs checkout # Replace the pointer files with the actual large files
conda create -n mislead python=3.10
conda activate mislead
pip install -e .
Weights & Biases:
- Follow steps 1 and 2 of the Weights & Biases quickstart.
- Check that the wandb API key is set as an environment variable by running:
echo $WANDB_API_KEY
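As an additional sanity check, you can confirm that the key actually authenticates. A minimal sketch, assuming the `wandb` package is installed:

```python
import wandb

# Uses $WANDB_API_KEY if it is set; returns True when the key authenticates.
wandb.login()
```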
Hugging Face:
- Follow the "Download files" section of this tutorial.
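For reference, a minimal download sketch using `huggingface_hub`; the `local_dir` path is an arbitrary example, not a path the repository expects:

```python
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="meta-llama/Llama-2-13b-hf",
    local_dir="models/llama-2-13b-hf",  # example path; adjust to your setup
)
```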
# 0. Go to the scripts folder
cd experiments/qa/scripts
# 1. Train a reward model
bash reward_model_general_train.sh # general reward training
# (Optional) Edit the variable `REWARD_MODELS` in the file `experiments/qa/reward_model_general_server.py` to add the path to your newly trained reward model (see the sketch after these steps)
# 2. Fine-tune a pre-trained model
bash agent_train.sh
# 3. Start the reward API server and run it as a background process (see the example request after these steps)
bash reward_model_general_server.sh
# 4. Start the RL agent training
bash agent_train.sh
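For the optional `REWARD_MODELS` edit in step 1, the sketch below is illustrative only: the actual structure of the variable in `experiments/qa/reward_model_general_server.py` may differ, and both paths shown are placeholders.

```python
# Illustrative sketch; the real REWARD_MODELS structure may differ.
REWARD_MODELS = {
    "general": "checkpoints/reward_model_general",     # existing entry (assumed)
    "my_reward_model": "path/to/your/new/checkpoint",  # your newly trained model
}
```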
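To check that the reward API server from step 3 is up, you can send it a test request. Everything in this sketch is an assumption, not the repository's documented API: the port, the route, and the payload schema all need to be adapted to what `reward_model_general_server.py` actually exposes.

```python
import requests

response = requests.post(
    "http://localhost:8000/reward",  # assumed host, port, and route
    json={
        "question": "What is the capital of France?",  # assumed payload schema
        "answer": "Paris.",
    },
)
print(response.json())
```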
TODO