APPO: Adversarial Preference-based Policy Optimization

The official implementation of <Adversarial Policy Optimization for Offline Preference-based Reinforcement Learning> (ICLR 2025)

Dependencies

Install packages with the environment.yml file,

conda env create -f environment.yml
pip install git+https://github.com/Farama-Foundation/Metaworld.git@master#egg=metaworld

To install packages manually,

conda create -n appo python=3.8
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
conda install tensorboard ipykernel matplotlib seaborn
pip install "gym[mujoco_py,classic_control]==0.23.0"
pip install pyrallis tqdm
pip install git+https://github.com/Farama-Foundation/Metaworld.git@master#egg=metaworld
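
After either installation route, you can sanity-check the environment with a short script like the one below. This is only an illustrative snippet (not part of this repository); it assumes the standard Meta-world ML1 benchmark API and the dial-turn-v2 task name that appears under configs/.

# check_env.py -- illustrative sanity check, not included in this repository
import torch
import gym
import metaworld

print("CUDA available:", torch.cuda.is_available())
print("gym version:", gym.__version__)

# Meta-world's ML1 benchmark constructs a single-task benchmark by name
ml1 = metaworld.ML1("dial-turn-v2")
print("train classes:", list(ml1.train_classes))
print("train tasks:", len(ml1.train_tasks))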

Datasets

The Meta-world medium-replay dataset is available in the official repository of LiRE. The Meta-world medium-expert dataset was collected with the code provided in the official repository of IPL.

Training

All training parameters are specified in the configuration files under configs/. Set learning rates, network architectures, batch sizes, and other algorithmic hyperparameters by modifying these config files.
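
The training scripts take --config arguments and pyrallis is listed as a dependency, so a config can presumably be loaded into a typed dataclass along the lines of the sketch below. The field names shown here (learning_rate, batch_size, hidden_dim, n_epochs) are illustrative placeholders, not the repository's actual schema; the dataclass must mirror the fields of the YAML file being loaded.

# illustrative config-loading sketch; field names are placeholders, not the repo's schema
from dataclasses import dataclass
import pyrallis

@dataclass
class TrainConfig:
    learning_rate: float = 3e-4   # optimizer step size
    batch_size: int = 256         # samples per gradient step
    hidden_dim: int = 256         # width of policy/value networks
    n_epochs: int = 100           # number of training epochs

# pyrallis loads defaults from the YAML file and applies any command-line overrides
cfg = pyrallis.parse(config_class=TrainConfig, config_path="configs/dial-turn-v2/appo.yaml")
print(cfg)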

To train the reward model on the dial-turn task,

python reward_learning/learn_reward.py --config=configs/dial-turn-v2/reward.yaml

To train APPO on the dial-turn task,

python appo.py --config=configs/dial-turn-v2/appo.yaml

To train MR on the dial-turn task,

python mr.py --config=configs/dial-turn-v2/mr.yaml

Results

Training results are stored in log/. All experiments were run with 5 random seeds each, and learning curves are smoothed by exponential averaging with a smoothing factor of 0.5. Plots are created with plotter.ipynb.
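
For reference, exponential averaging with a factor of 0.5 corresponds to a smoothing rule like the one sketched below; this illustrates the smoothing step only, and plotter.ipynb may implement it differently.

# illustrative exponential-averaging smoother (factor 0.5); not taken from plotter.ipynb
def smooth(values, factor=0.5):
    # s_t = factor * s_{t-1} + (1 - factor) * x_t, initialized at the first value
    smoothed, last = [], values[0]
    for x in values:
        last = factor * last + (1.0 - factor) * x
        smoothed.append(last)
    return smoothed

print(smooth([0.0, 1.0, 1.0, 1.0]))  # -> [0.0, 0.5, 0.75, 0.875]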

Reference

Our code is based on the official implementation of <Listwise Reward Estimation for Offline Preference-based Reinforcement Learning> (Choi et al., 2024): https://github.com/chwoong/LiRE
