[RLlib] Examples folder do-over (vol 53): Learning 2-agent cartpole with global observation, 1 policy outputting all agents' actions, and individual rewards. #53697
base: master
Conversation
…icy (outputting multi-discrete actions), and separate reward streams and value heads. Signed-off-by: sven1977 <svenmika1977@gmail.com>
Pull Request Overview
This PR overhauls the examples folder for multi-agent learning with global observations, integrating a new environment, a customized RLModule with multiple value function heads, and corresponding learner changes. Key changes include (a rough sketch of the resulting network layout follows this list):
- A new RLModule handling global observations and multiple VF heads.
- A multi-agent CartPole environment with global observations and per-agent rewards.
- Updates to learner connectors and loss computations to support the new global and multi-head design.
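For orientation, the layout described above can be pictured roughly as follows. This is a plain-PyTorch sketch with made-up names and dimensions, not the code in `global_observations_many_vf_heads_rlm.py`: a shared encoder consumes the global observation, a single policy head emits MultiDiscrete logits (one slot per agent), and one value head per agent backs the separate per-agent reward streams.

```python
import torch
from torch import nn


class GlobalObsMultiHeadNet(nn.Module):
    """Shared encoder + MultiDiscrete policy head + one value head per agent (illustrative)."""

    def __init__(self, global_obs_dim=8, num_agents=2, actions_per_agent=2, hidden=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(global_obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # Concatenated logits parameterize a MultiDiscrete([actions_per_agent] * num_agents) distribution.
        self.pi_head = nn.Linear(hidden, num_agents * actions_per_agent)
        # One scalar value head per agent, enabling per-agent value targets/GAE.
        self.vf_heads = nn.ModuleList([nn.Linear(hidden, 1) for _ in range(num_agents)])

    def forward(self, global_obs):
        features = self.encoder(global_obs)
        logits = self.pi_head(features)  # [B, num_agents * actions_per_agent]
        values = torch.cat([vf(features) for vf in self.vf_heads], dim=-1)  # [B, num_agents]
        return logits, values


# Quick shape check with a batch of 4 global observations.
logits, values = GlobalObsMultiHeadNet()(torch.randn(4, 8))
assert logits.shape == (4, 4) and values.shape == (4, 2)
```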
Reviewed Changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 3 comments.
File | Description |
---|---|
rllib/examples/rl_modules/classes/global_observations_many_vf_heads_rlm.py | Introduces the new RLModule with an encoder, a multi-discrete (global) policy head, and multiple value function heads. |
rllib/examples/multi_agent/global_observations_and_central_model.py | Sets up the multi-agent experiment with a global policy and a rewards connector constructing per-agent rewards. |
rllib/examples/learners/classes/ppo_global_observations_many_vf_learner.py | Implements a custom PPO learner to compute loss and GAE for multiple value heads (see the GAE sketch after this table). |
rllib/examples/envs/classes/multi_agent/two_agent_cartpole_with_global_observations.py | Defines the custom two-agent CartPole environment with global observations and individual reward streams. |
rllib/evaluation/postprocessing.py | Updates postprocessing to use the centralized Columns constants. |
rllib/core/rl_module/torch/torch_rl_module.py | Adds support for MultiDiscrete action spaces via TorchMultiCategorical. |
rllib/core/columns.py | Refines and clarifies column constant definitions. |
rllib/connectors/module_to_env/get_actions.py | Comments out the action log probability assignment, potentially for debugging or design reasons. |
rllib/connectors/learner/general_advantage_estimation.py | Refactors GAE computations to use updated Columns constants. |
rllib/algorithms/ppo/torch/ppo_torch_learner.py | Adjusts loss computation to reference updated column names. |
rllib/algorithms/marwil/torch/marwil_torch_learner.py | Applies similar column name updates in loss logging. |
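Since `ppo_global_observations_many_vf_learner.py` computes loss and GAE for multiple value heads, the following plain-NumPy sketch shows what per-agent GAE amounts to. It is illustrative only (made-up agent names and numbers); the actual learner operates on batched tensors through RLlib's learner connector pipeline.

```python
import numpy as np


def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimation over one terminated episode (plain NumPy)."""
    advantages = np.zeros(len(rewards), dtype=np.float32)
    last_adv = 0.0
    for t in reversed(range(len(rewards))):
        # Bootstrap with 0.0 at the end, i.e. assume the episode terminated.
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]
        last_adv = delta + gamma * lam * last_adv
        advantages[t] = last_adv
    return advantages


# With one value head per agent, GAE simply runs once per (reward stream, value head) pair.
rewards = {"agent_left": np.array([1.0, 1.0, 1.0]), "agent_right": np.array([1.0, 0.0, 1.0])}
vf_preds = {"agent_left": np.array([0.5, 0.4, 0.3]), "agent_right": np.array([0.6, 0.2, 0.1])}
advantages = {agent_id: gae(rewards[agent_id], vf_preds[agent_id]) for agent_id in rewards}
```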
# If you need more granularity between the different forward behaviors during
# the different phases of the module's lifecycle, implement three different
# forward methods. When doing so, it is recommended to put the inference and
# exploration versions inside a `with torch.no_grad()` context for better
# performance.
# def _forward_train(self, batch):
#     ...
#
# def _forward_inference(self, batch):
#     with torch.no_grad():
#         return self._forward_train(batch)
#
# def _forward_exploration(self, batch):
#     with torch.no_grad():
#         return self._forward_train(batch)
The commented-out alternative `_forward` methods sit after the return statement, where they read like dead code. Consider moving these notes to the module's docstring or removing them to improve code clarity.
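For context, here is what the pattern those comments describe looks like when actually implemented. This is a minimal sketch assuming RLlib's new API stack (`TorchRLModule`, `Columns`); the class name, layer sizes, and returned columns are illustrative and not taken from the PR's RLModule.

```python
import torch
from torch import nn

from ray.rllib.core.columns import Columns
from ray.rllib.core.rl_module.torch.torch_rl_module import TorchRLModule


class ThreeForwardsRLModule(TorchRLModule):
    """Only _forward_train carries gradients; inference/exploration wrap it in no_grad."""

    def setup(self):
        # Sizes hardcoded to keep the sketch self-contained (8-dim global obs,
        # MultiDiscrete([2, 2]) policy logits, one value output per agent).
        self._encoder = nn.Sequential(nn.Linear(8, 64), nn.ReLU())
        self._pi_head = nn.Linear(64, 4)
        self._vf_heads = nn.Linear(64, 2)

    def _forward_train(self, batch, **kwargs):
        features = self._encoder(batch[Columns.OBS])
        return {
            Columns.ACTION_DIST_INPUTS: self._pi_head(features),
            Columns.VF_PREDS: self._vf_heads(features),
        }

    def _forward_inference(self, batch, **kwargs):
        # No gradients needed when only computing actions.
        with torch.no_grad():
            return self._forward_train(batch, **kwargs)

    def _forward_exploration(self, batch, **kwargs):
        with torch.no_grad():
            return self._forward_train(batch, **kwargs)
```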
# if Columns.ACTION_LOGP not in batch:
#     batch[Columns.ACTION_LOGP] = action_dist.logp(actions)
[nitpick] If the removal of the action log probability assignment is intentional, please remove the commented-out code to avoid confusion.
Suggested change: replace

# if Columns.ACTION_LOGP not in batch:
#     batch[Columns.ACTION_LOGP] = action_dist.logp(actions)

with

# Removed commented-out code to avoid confusion.
# reward_agent_right += -0.1
# reward_agent_left += -0.1
[nitpick] Consider either removing or clarifying the commented-out negative reward adjustments if they are no longer needed, to reduce potential confusion.
Suggested change: replace

# reward_agent_right += -0.1
# reward_agent_left += -0.1

with

# Removed commented-out negative reward adjustments to reduce confusion.
This pull request has been automatically marked as stale because it has not had recent activity. You can always ask for help on our discussion forum or Ray's public slack channel. If you'd like to keep this open, just leave any comment, and the stale label will be removed.
Examples folder do-over (vol 53): Learning 2-agent cartpole with global observation, 1 policy outputting all agents' actions, and individual rewards.
The new example trains a 2-agent CartPole env with a single global observation and one central policy that outputs all agents' actions (through a `MultiDiscrete` action space, where each slot represents the individual action for each of the 2 agents), and individual rewards published through the `infos` dicts. Note that the "global" rewards in the env are dummy rewards and should not be used for training; the actual per-agent rewards are found in the `infos`.
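To make that env design concrete, here is a small runnable sketch in plain gymnasium. It is not the PR's actual `two_agent_cartpole_with_global_observations.py`; the class name, the `agent_rewards` info key, and the wrap-two-CartPoles approach are all illustrative. The global observation is the concatenation of both agents' states, the single `MultiDiscrete` action carries one slot per agent, the env-level reward is a dummy, and the real per-agent rewards travel in the info dict.

```python
import gymnasium as gym
import numpy as np
from gymnasium import spaces


class TwoAgentCartPoleGlobalObs(gym.Env):
    """Two CartPole-v1 copies behind one 'central' interface (illustrative only)."""

    def __init__(self):
        self._envs = [gym.make("CartPole-v1") for _ in range(2)]
        single = self._envs[0].observation_space
        # Global observation = concatenation of both agents' 4D cartpole states.
        self.observation_space = spaces.Box(
            low=np.concatenate([single.low, single.low]),
            high=np.concatenate([single.high, single.high]),
            dtype=np.float32,
        )
        # One action slot per agent.
        self.action_space = spaces.MultiDiscrete([2, 2])

    def reset(self, *, seed=None, options=None):
        obs = [
            env.reset(seed=None if seed is None else seed + i)[0]
            for i, env in enumerate(self._envs)
        ]
        return np.concatenate(obs).astype(np.float32), {}

    def step(self, action):
        obs, per_agent_rewards, dones = [], {}, []
        for i, (env, a) in enumerate(zip(self._envs, action)):
            o, r, terminated, truncated, _ = env.step(int(a))
            obs.append(o)
            per_agent_rewards[f"agent_{i}"] = r
            dones.append(terminated or truncated)
        # The env-level reward is a dummy and must not be trained on; the real
        # per-agent rewards are published through the info dict instead.
        return (
            np.concatenate(obs).astype(np.float32),
            0.0,
            any(dones),  # end the whole episode as soon as either pole falls
            False,
            {"agent_rewards": per_agent_rewards},
        )
```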
Why are these changes needed?
Related issue number
Checks
- I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- …method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.