GitHub - luk-s/MisleadLM: A re-implementation and extension of the code of the paper: "Language Models Learn to Mislead Humans via RLHF"

Language Models Learn to Mislead Humans via RLHF

This repository is based on the codebase of the paper "Language Models Learn to Mislead Humans via RLHF" and extends it to investigate reward hacking in RLAIF.

1. Setup

1.1 Request access to the gated Llama models

1.2 Download all files stored with Git Large File Storage

git lfs install      # Install git lfs
git lfs fetch --all  # Fetch all large files
git lfs checkout     # Replace the pointer files with the actual file contents
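
To confirm that the checkout step above actually replaced the pointer files, a small check along the following lines may help. It is only a sketch: the function names `is_lfs_pointer` and `list_unreplaced_pointers` are hypothetical and not part of this repository. It relies on the fact that a Git LFS pointer file is a small text file whose first line starts with `version https://git-lfs.github.com/spec/v1`.

```shell
# Hypothetical helper: succeeds if FILE still looks like a Git LFS pointer,
# i.e. it was not replaced by the real file contents.
is_lfs_pointer() {
  head -c 100 "$1" 2>/dev/null | grep -q '^version https://git-lfs.github.com/spec/v1'
}

# Hypothetical helper: prints every tracked file that still looks like a pointer.
# Run it from the repository root; no output means nothing is left to replace.
list_unreplaced_pointers() {
  git ls-files | while IFS= read -r f; do
    if is_lfs_pointer "$f"; then
      printf '%s\n' "$f"
    fi
  done
}
```

If `list_unreplaced_pointers` prints any paths, rerunning `git lfs fetch --all` and `git lfs checkout` is a reasonable first thing to try.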

1.3 Set up the Python environment

conda create -n mislead python=3.10
conda activate mislead
pip install -e .
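
Before starting a long training run, it can be worth checking that the `mislead` environment created above is the active one. The sketch below assumes that `conda activate` has exported `CONDA_DEFAULT_ENV`, which conda does for the active environment. The function name `check_mislead_env` is hypothetical.

```shell
# Hypothetical helper: succeeds only if the conda environment "mislead" is active.
check_mislead_env() {
  if [ "${CONDA_DEFAULT_ENV:-}" = "mislead" ]; then
    echo "conda environment 'mislead' is active"
  else
    echo "expected conda environment 'mislead', found '${CONDA_DEFAULT_ENV:-none}'" >&2
    return 1
  fi
}
```

Calling `check_mislead_env` at the top of a wrapper script is one way to fail early instead of installing or training in the wrong environment.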

1.4 Log in to necessary services

WeightsAndBiases:

  • Follow steps 1 and 2 of this quickstart.
  • Check that the wandb API key is set as an environment variable by running echo $WANDB_API_KEY
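
Since `echo $WANDB_API_KEY` prints the secret to the terminal, a variant that only reports whether the variable is set may be preferable on shared machines. This is a minimal sketch, and the function name `check_wandb_key` is hypothetical.

```shell
# Hypothetical helper: reports whether WANDB_API_KEY is set without printing its value.
check_wandb_key() {
  if [ -n "${WANDB_API_KEY:-}" ]; then
    # Only the length is shown, never the key itself.
    echo "WANDB_API_KEY is set (${#WANDB_API_KEY} characters)"
  else
    echo "WANDB_API_KEY is not set" >&2
    return 1
  fi
}
```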

Hugging Face:

2. RLHF Training

2.1 Question Answering

# 0. Go to the scripts folder
cd experiments/qa/scripts

# 1. Train a reward model
bash reward_model_general_train.sh # general reward training
# (Optional) Edit the variable 'REWARD_MODELS' in the file `experiments/qa/reward_model_general_server.py` to add the path to your newly trained reward model

# 2. Fine-tune a pre-trained model
bash agent_train.sh

# 3. Start the reward API server and run it as a background process
bash reward_model_general_server.sh

# 4. Start the RL agent training
bash agent_train.sh
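
Step 3 above asks for the reward API server to run as a background process. One possible way to do that from a single shell is sketched below. The helper `start_in_background` and the log file name `reward_server.log` are assumptions for illustration, not part of this repository. The sketch does not wait for the server to become ready, because its address and startup time are not stated here.

```shell
# Hypothetical helper: runs a command in the background, sends its output
# to LOG_FILE, and prints the PID of the background process.
start_in_background() {
  log_file=$1
  shift
  "$@" > "$log_file" 2>&1 &
  echo $!
}

# Example usage from experiments/qa/scripts (commented out so that sourcing
# this snippet does not start anything):
#   server_pid=$(start_in_background reward_server.log bash reward_model_general_server.sh)
#   bash agent_train.sh      # step 4: RL agent training
#   kill "$server_pid"       # stop the reward server afterwards
```

Keeping the PID makes it possible to stop the server once training has finished, and the log file is the first place to look if the agent cannot reach the reward API.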

2.2 Programming

TODO
