[RLlib] Examples folder do-over (vol 53): Learning 2-agent cartpole with global observation, 1 policy outputting all agents' actions, and individual rewards. by sven1977 · Pull Request #53697 · ray-project/ray

[RLlib] Examples folder do-over (vol 53): Learning 2-agent cartpole with global observation, 1 policy outputting all agents' actions, and individual rewards. #53697

Status: Open · sven1977 wants to merge 3 commits into base: master
Conversation

sven1977 (Contributor)


Examples folder do-over (vol 53): Learning 2-agent cartpole with global observation, 1 policy outputting all agents' actions, and individual rewards.

  • Main example script.
  • New custom env that internally runs a 2-player multi-agent CartPole but exposes only a single "global" agent: it returns a global observation, accepts a global combined action (a MultiDiscrete space with one slot per agent), and publishes the individual per-agent rewards through the infos dicts. Note that the "global" rewards returned by the env are dummy values and must not be used for training. (A minimal sketch of this env idea follows this list.)
  • Custom LearnerConnector that creates the new batch columns "rewards_agent0" and "rewards_agent1" from these infos (see the second sketch after this list).
  • Custom RLModule that handles the combined (MultiDiscrete) action space and has 2 value heads (one per agent).
  • Custom GAE connector that runs one GAE pass per reward stream (and hence per value head) for a single module.
  • Custom Learner whose loss is the sum of the individual agents' loss terms, each derived from GAE on that agent's reward stream and value function head.
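
For orientation, here is a minimal, hypothetical sketch of the env idea described above: two CartPole-v1 instances hidden behind a single "global" agent, a MultiDiscrete action with one slot per agent, a dummy global reward, and per-agent rewards published via the infos dict. The class name and the exact info keys are assumptions, not the PR's actual implementation.

import gymnasium as gym
import numpy as np
from gymnasium.spaces import Box, MultiDiscrete


class TwoAgentCartPoleGlobalObs(gym.Env):
    """Two CartPole copies exposed as one "global" single-agent env (sketch only)."""

    def __init__(self, config=None):
        self._envs = [gym.make("CartPole-v1"), gym.make("CartPole-v1")]
        # Global observation: both agents' observations, concatenated.
        low = np.concatenate([e.observation_space.low for e in self._envs])
        high = np.concatenate([e.observation_space.high for e in self._envs])
        self.observation_space = Box(low, high, dtype=np.float32)
        # Global action: one Discrete(2) slot per agent.
        self.action_space = MultiDiscrete([2, 2])

    def reset(self, *, seed=None, options=None):
        obs = [e.reset(seed=seed)[0] for e in self._envs]
        return np.concatenate(obs).astype(np.float32), {}

    def step(self, action):
        obs, rewards, dones = [], [], []
        for env, a in zip(self._envs, action):
            o, r, terminated, truncated, _ = env.step(int(a))
            obs.append(o)
            rewards.append(r)
            dones.append(terminated or truncated)
        # Individual rewards go into infos; the "global" reward is a dummy.
        infos = {"rewards_agent0": rewards[0], "rewards_agent1": rewards[1]}
        return np.concatenate(obs).astype(np.float32), 0.0, any(dones), False, infos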
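
And the gist of what the custom LearnerConnector does when it builds the two new reward columns, shown here as a plain helper over a list of per-timestep infos dicts rather than through RLlib's ConnectorV2 API (info key names as in the env sketch above):

import numpy as np


def build_per_agent_reward_columns(infos):
    """Collect per-agent rewards from infos into two flat batch columns (sketch)."""
    return {
        "rewards_agent0": np.array([i["rewards_agent0"] for i in infos], dtype=np.float32),
        "rewards_agent1": np.array([i["rewards_agent1"] for i in infos], dtype=np.float32),
    }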

Why are these changes needed?

Related issue number

Checks

  • I've signed off every commit (using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

…icy (outputting multi-discrete actions), and separate reward streams and value heads.

Signed-off-by: sven1977 <svenmika1977@gmail.com>
@Copilot Copilot AI review requested due to automatic review settings June 10, 2025 12:14
@sven1977 sven1977 requested a review from a team as a code owner June 10, 2025 12:14
@Copilot Copilot AI (Contributor) left a comment


Pull Request Overview

This PR overhauls the examples folder for multi-agent learning with global observations, integrating a new environment, a customized RLModule with multiple value function heads, and corresponding learner changes. Key changes include:

  • A new RLModule handling global observations and multiple VF heads.
  • A multi-agent CartPole environment with global observations and per-agent rewards.
  • Updates to learner connectors and loss computations to support the new global and multi-head design.

Reviewed Changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 3 comments.

Summary per file:

  • rllib/examples/rl_modules/classes/global_observations_many_vf_heads_rlm.py: Introduces the new RLModule with an encoder, a multi-discrete (global) policy head, and multiple value function heads (see the model sketch below this list).
  • rllib/examples/multi_agent/global_observations_and_central_model.py: Sets up the multi-agent experiment with a global policy and a rewards connector constructing per-agent rewards.
  • rllib/examples/learners/classes/ppo_global_observations_many_vf_learner.py: Implements a custom PPO learner to compute loss and GAE for multiple value heads (see the GAE sketch below this list).
  • rllib/examples/envs/classes/multi_agent/two_agent_cartpole_with_global_observations.py: Defines the custom two-agent CartPole environment with global observations and individual reward streams.
  • rllib/evaluation/postprocessing.py: Updates postprocessing to use the centralized Columns constants.
  • rllib/core/rl_module/torch/torch_rl_module.py: Adds support for MultiDiscrete action spaces via TorchMultiCategorical.
  • rllib/core/columns.py: Refines and clarifies column constant definitions.
  • rllib/connectors/module_to_env/get_actions.py: Comments out the action log probability assignment, potentially for debugging or design reasons.
  • rllib/connectors/learner/general_advantage_estimation.py: Refactors GAE computations to use updated Columns constants.
  • rllib/algorithms/ppo/torch/ppo_torch_learner.py: Adjusts loss computation to reference updated column names.
  • rllib/algorithms/marwil/torch/marwil_torch_learner.py: Applies similar column name updates in loss logging.
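
For orientation, a minimal plain-torch sketch of the model structure the first file describes: shared encoder, one MultiDiscrete policy head emitting 2 + 2 logits, and one value head per agent. Layer sizes and output key names are assumptions, not the actual RLModule code.

import torch
from torch import nn


class GlobalObsTwoValueHeadsNet(nn.Module):
    """Shared encoder, MultiDiscrete([2, 2]) policy head, two value heads (sketch)."""

    def __init__(self, obs_dim=8, hidden=256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        # 2 + 2 logits: one Categorical per MultiDiscrete slot (one slot per agent).
        self.pi_head = nn.Linear(hidden, 4)
        self.vf_head_agent0 = nn.Linear(hidden, 1)
        self.vf_head_agent1 = nn.Linear(hidden, 1)

    def forward(self, obs):
        z = self.encoder(obs)
        return {
            "action_dist_inputs": self.pi_head(z),
            "vf_preds_agent0": self.vf_head_agent0(z).squeeze(-1),
            "vf_preds_agent1": self.vf_head_agent1(z).squeeze(-1),
        }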
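
Similarly, the custom learner's "loss as a sum of per-agent terms" boils down to running standard GAE once per reward stream against the matching value head and adding up the resulting PPO losses. A small numpy sketch of the per-stream advantage pass (assuming the episode ends at the last timestep):

import numpy as np


def gae(rewards, values, gamma=0.99, lambda_=0.95):
    """Standard GAE over one reward stream and its matching value head (sketch)."""
    advantages = np.zeros_like(rewards, dtype=np.float32)
    last_adv = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]
        last_adv = delta + gamma * lambda_ * last_adv
        advantages[t] = last_adv
    return advantages


# Total loss = sum of the individual agents' PPO losses, each built from its own
# advantages (GAE over that agent's reward stream) and its own value head, e.g.:
# total_loss = ppo_loss(adv_agent0, vf_preds_agent0) + ppo_loss(adv_agent1, vf_preds_agent1)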

Comment on lines 74 to 89
# If you need more granularity between the different forward behaviors during
# the different phases of the module's lifecycle, implement three different
# forward methods. Thereby, it is recommended to put the inference and
# exploration versions inside a `with torch.no_grad()` context for better
# performance.
# def _forward_train(self, batch):
#     ...
#
# def _forward_inference(self, batch):
#     with torch.no_grad():
#         return self._forward_train(batch)
#
# def _forward_exploration(self, batch):
#     with torch.no_grad():
#         return self._forward_train(batch)

Copilot AI · Jun 10, 2025

The commented-out alternative _forward methods appearing after the return statement are unreachable. Consider moving these notes to the module's docstring or removing them to improve code clarity.

Suggested change: delete the commented-out block quoted above (or move the note into the class docstring).


Comment on lines +90 to +91
#if Columns.ACTION_LOGP not in batch:
# batch[Columns.ACTION_LOGP] = action_dist.logp(actions)
Copilot AI · Jun 10, 2025

[nitpick] If the removal of the action log probability assignment is intentional, please remove the commented-out code to avoid confusion.

Suggested change
#if Columns.ACTION_LOGP not in batch:
# batch[Columns.ACTION_LOGP] = action_dist.logp(actions)
# Removed commented-out code to avoid confusion.


Comment on lines 48 to 49
#reward_agent_right += -0.1
#reward_agent_left += -0.1
Copilot AI · Jun 10, 2025

[nitpick] Consider either removing or clarifying the commented-out negative reward adjustments if they are no longer needed, to reduce potential confusion.

Suggested change
#reward_agent_right += -0.1
#reward_agent_left += -0.1
# Removed commented-out negative reward adjustments to reduce confusion.


sven1977 added 2 commits June 10, 2025 14:20
…icy (outputting multi-discrete actions), and separate reward streams and value heads.

Signed-off-by: sven1977 <svenmika1977@gmail.com>
…icy (outputting multi-discrete actions), and separate reward streams and value heads.

Signed-off-by: sven1977 <svenmika1977@gmail.com>

This pull request has been automatically marked as stale because it has not had any activity for 14 days. It will be closed in another 14 days if no further activity occurs. Thank you for your contributions.

You can always ask for help on our discussion forum or Ray's public slack channel.

If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@github-actions github-actions bot added the "stale" label on Jun 26, 2025
Labels: stale (the PR will be closed within 7 days unless there is further conversation)
Projects: none yet
Participants: 1