The official implementation of <Adversarial Policy Optimization for Offline Preference-based Reinforcement Learning> (APPO).
Install packages with the environment.yml file:
conda env create -f environment.yml
pip install git+https://github.com/Farama-Foundation/Metaworld.git@master#egg=metaworld
To install packages manually:
conda create -n appo python=3.8
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
conda install tensorboard ipykernel matplotlib seaborn
pip install "gym[mujoco_py,classic_control]==0.23.0"
pip install pyrallis tqdm
pip install git+https://github.com/Farama-Foundation/Metaworld.git@master#egg=metaworld
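A quick way to verify the installation (a minimal sketch, not part of this repository; it only checks that the packages import and that the dial-turn-v2 task is registered):

# sanity_check.py -- a minimal sketch to confirm the installation (not part of this repo).
import gym
import torch
import metaworld

print("gym version:", gym.__version__)        # expect 0.23.0
print("CUDA available:", torch.cuda.is_available())
print("dial-turn-v2 registered:", "dial-turn-v2" in metaworld.ML1.ENV_NAMES)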
The Meta-world medium-replay dataset is available in the official repository of LiRE. The Meta-world medium-expert dataset was collected with the code provided in the official repository of IPL.
The parameters are specified in the configuration files under configs/. Set learning rates, network architectures, batch sizes, and other algorithmic hyperparameters by modifying the config files.
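For illustration, a config can be inspected programmatically before launching a run (a minimal sketch; the actual field names are whatever the YAML files under configs/ define, and the commented-out key below is hypothetical). PyYAML should already be available as a dependency of pyrallis.

# inspect_config.py -- a minimal sketch for viewing a config (not part of this repo).
import yaml

with open("configs/dial-turn-v2/appo.yaml") as f:
    cfg = yaml.safe_load(f)

# Print every hyperparameter defined in the file.
for key, value in cfg.items():
    print(f"{key}: {value}")

# To change a hyperparameter, edit the YAML file directly, e.g.
# cfg["batch_size"] = 512   # hypothetical key name; check the actual file
# with open("configs/dial-turn-v2/appo_custom.yaml", "w") as f:
#     yaml.safe_dump(cfg, f)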
To train the reward model on the dial-turn task:
python reward_learning/learn_reward.py --config=configs/dial-turn-v2/reward.yaml
To train APPO on the dial-turn task:
python appo.py --config=configs/dial-turn-v2/appo.yaml
To train MR on the dial-turn task:
python mr.py --config=configs/dial-turn-v2/mr.yaml
The training results are stored in log/.
All experiments were run with 5 random seeds, and learning curves are smoothed by exponential averaging with a factor of 0.5.
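For reference, we read "exponential averaging with factor 0.5" as the rule s_t = 0.5 * s_{t-1} + 0.5 * x_t; a minimal sketch of that smoothing is below (plotter.ipynb is the authoritative implementation):

# ema_smooth.py -- a minimal sketch of the exponential smoothing used for learning curves.
import numpy as np

def exponential_smooth(values, weight=0.5):
    # s_t = weight * s_{t-1} + (1 - weight) * x_t, initialized at the first value.
    smoothed = np.empty(len(values), dtype=float)
    running = float(values[0])
    for i, x in enumerate(values):
        running = weight * running + (1.0 - weight) * x
        smoothed[i] = running
    return smoothed

# Example: smooth a noisy curve of 100 evaluation points.
curve = np.random.rand(100)
print(exponential_smooth(curve, weight=0.5)[:5])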
Plots are created with plotter.ipynb.
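A rough equivalent of the plotting step is sketched below, assuming the runs under log/ are TensorBoard event files and that a scalar tag such as "eval/success_rate" exists (both are assumptions; check the tags actually written by the training scripts):

# plot_sketch.py -- a rough sketch of the plotting step (plotter.ipynb is the authoritative version).
import glob
import matplotlib.pyplot as plt
import numpy as np
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

TAG = "eval/success_rate"   # hypothetical tag name; inspect the logs for the real one

for run_dir in sorted(glob.glob("log/*")):
    acc = EventAccumulator(run_dir)
    acc.Reload()
    if TAG not in acc.Tags().get("scalars", []):
        continue
    events = acc.Scalars(TAG)
    steps = np.array([e.step for e in events])
    values = np.array([e.value for e in events])
    plt.plot(steps, values, label=run_dir)

plt.xlabel("environment steps")
plt.ylabel(TAG)
plt.legend()
plt.show()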
Our code is based on the official implementation of <Listwise Reward Estimation for Offline Preference-based Reinforcement Learning> (Choi et al., 2024): https://github.com/chwoong/LiRE