[RLlib; Offline RL] Implement Offline Policy Evaluation (OPE) via Importance Sampling. #53702
base: master
Conversation
Pull Request Overview
This pull request implements Offline Policy Evaluation (OPE) via Importance Sampling for the Offline RL API. Key changes include the introduction of new runner and pre-evaluator classes for OPE, updates to the evaluation configuration and processing in the algorithm logic, and adjustments to logging and state management for offline evaluation.
Reviewed Changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.
Summary per file:

| File | Description |
| --- | --- |
| `rllib/utils/runners/runner_group.py` | Added forwarding of `kwargs` in runner creation. |
| `rllib/tuned_examples/bc/cartpole_bc_with_offline_evaluation.py` | Configured the offline evaluation type for the CartPole example. |
| `rllib/offline/offline_evaluation_runner_group.py` | Updated runner-class selection logic and introduced pre-learner/evaluator assignment. |
| `rllib/offline/offline_evaluation_runner.py` | Applied override annotations and removed state updates for deprecated connectors. |
| `rllib/env/single_agent_env_runner.py` | Renamed metric keys for per-agent and per-module returns. |
| `rllib/algorithms/algorithm.py` | Updated the offline evaluation runner setup and return-value structure while introducing a local-runner fallback. |
| `rllib/algorithms/algorithm_config.py` | Added and validated new offline evaluation configuration attributes. |
Comments suppressed due to low confidence (2)
rllib/algorithms/algorithm_config.py:2992
- The attribute name used here ('offline_eval_runner_cls') is inconsistent with the previously defined 'offline_eval_runner_class'. Consider using the same attribute name for consistency.
self.offline_eval_runner_cls = offline_eval_runner_class
rllib/algorithms/algorithm.py:1159
- The return structure of 'evaluate_offline' has changed compared to the previous nested format. Please ensure that downstream consumers are updated to handle the new dictionary structure.
return {OFFLINE_EVAL_RUNNER_RESULTS: eval_results}
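For illustration, here is a minimal consumer-side sketch of the new flat result format. The import location of `OFFLINE_EVAL_RUNNER_RESULTS` and the helper name are assumptions, not taken from this PR's diff.

```python
# Assumption: the result key lives next to the other result-key constants in
# ray.rllib.utils.metrics; adjust the import to wherever this PR defines it.
from ray.rllib.utils.metrics import OFFLINE_EVAL_RUNNER_RESULTS


def read_offline_eval_results(algo):
    """Unpacks Algorithm.evaluate_offline() under the new flat return format."""
    results = algo.evaluate_offline()
    # New structure: {OFFLINE_EVAL_RUNNER_RESULTS: eval_results}. Callers that
    # previously walked a nested result dict now index one top-level key.
    return results[OFFLINE_EVAL_RUNNER_RESULTS]
```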
@@ -2829,6 +2833,13 @@ def evaluation(
                for parallel evaluation. Setting this to 0 forces sampling to be done in the
                local OfflineEvaluationRunner (main process or the Algorithm's actor when
                using Tune).
            offline_evaluation_type: Type of offline evaluation to run. Either `"eval_loss"`
Question: So, if a user provides `offline_eval_runner_class`, then the value of this field is ignored?
For more explicitness, should we not provide these 3 built-ins ("eval_loss", "is", "pdis") as classes as well and show users where to find them in the repo? Then this config setting would be superfluous. Or do you think it's too complicated to explain?
This is a good one. Let me think about this. Both solutions have their advantages.
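To make the two options in this thread concrete, here is a hedged sketch of both configuration paths. Only `offline_evaluation_type` is confirmed by the diff hunk above; passing `offline_eval_runner_class` through `evaluation()` and the `MyOPERunner` class are assumptions for illustration.

```python
from ray.rllib.algorithms.bc import BCConfig

# Option A (string-based, as implemented here): select a built-in evaluation
# type via the new `offline_evaluation_type` argument.
config = BCConfig().evaluation(offline_evaluation_type="pdis")  # or "is", "eval_loss"

# Option B (class-based, as raised in the question): hand RLlib a runner class
# directly. `MyOPERunner` is a hypothetical custom subclass; providing it would
# presumably make the string setting above superfluous.
# config = BCConfig().evaluation(offline_eval_runner_class=MyOPERunner)
```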
@@ -1363,6 +1366,38 @@ def _evaluate_with_custom_eval_function(self) -> Tuple[ResultDict, int, int]:

        return eval_results, env_steps, agent_steps

    def _evaluate_offline_on_local_runner(self):
        # if hasattr(env_runner, "input_reader") and env_runner.input_reader is None:
remove this comment?
Oh yeah! How did this even get in there?
Approved with one question. Thanks @simonsays1980 !
Why are these changes needed?
The Offline RL API in the new API stack still lacks offline policy evaluation, although it already offers a validation loss. This PR introduces OPE and implements the following:
- An `OfflinePolicyEvaluationRunner` that derives from our `OfflineEvaluationRunner` and can be scheduled by our `OfflineEvaluationRunnerGroup` (users can also implement their own runner class for custom evaluation).
- An `OfflinePolicyPreEvaluator` that preprocesses data for OPE.
- New attributes in the `AlgorithmConfig` to control offline evaluation:
  - `offline_evaluation_type`: the evaluation type. Can be either `"eval_loss"`, `"pdis"`, or `"is"`.
  - `offline_eval_runner_class`: the runner class to use for offline evaluation. This can be a custom class. If no class is given, the standard classes are used for the different evaluation types.
  - `"is"`: ordinary importance sampling.
  - `"pdis"`: per-decision importance sampling, which usually exhibits a lower variance than simple IS (a standalone sketch of both estimators follows below).

What is still missing:
- `SingleAgentEpisode` data.
- The `EnvToModule` pipeline inside of the `OfflinePolicyPreEvaluator`.
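For reference, here is a small standalone sketch (not RLlib code) of the two estimators named above, ordinary IS and per-decision IS, on a single episode. It assumes per-step action log-probabilities of the target and behavior policies are available, which is roughly the information the pre-evaluator would have to provide; names and signature are illustrative only.

```python
import numpy as np


def is_and_pdis_estimates(target_logp, behavior_logp, rewards, gamma=0.99):
    """Ordinary IS and per-decision IS return estimates for one episode.

    target_logp / behavior_logp: per-step log-probs log pi(a_t|s_t) under the
    target and behavior policies; rewards: per-step rewards. Standalone
    illustration, not part of this PR.
    """
    target_logp = np.asarray(target_logp, dtype=np.float64)
    behavior_logp = np.asarray(behavior_logp, dtype=np.float64)
    rewards = np.asarray(rewards, dtype=np.float64)

    # Per-step importance ratios rho_t = pi(a_t|s_t) / mu(a_t|s_t).
    step_ratios = np.exp(target_logp - behavior_logp)
    # Cumulative products rho_{0:t} for every step t.
    cum_ratios = np.cumprod(step_ratios)
    discounts = gamma ** np.arange(len(rewards))

    # Ordinary IS: the full-trajectory ratio weights the entire return.
    is_estimate = cum_ratios[-1] * np.sum(discounts * rewards)
    # Per-decision IS: the reward at step t is weighted only by rho_{0:t},
    # which usually gives a lower-variance estimate than ordinary IS.
    pdis_estimate = np.sum(discounts * cum_ratios * rewards)
    return is_estimate, pdis_estimate
```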
Related issue number
Checks
- I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- If I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.