[RLlib] Examples folder do-over (vol 53): Learning 2-agent cartpole with global observation, 1 policy outputting all agents' actions, and individual rewards. #53697
base: master
Conversation
…icy (outputting multi-discrete actions), and separate reward streams and value heads. Signed-off-by: sven1977 <svenmika1977@gmail.com>
Pull Request Overview
This PR overhauls the examples folder for multi-agent learning with global observations, integrating a new environment, a customized RLModule with multiple value function heads, and corresponding learner changes. Key changes include (a rough sketch of the resulting network layout follows this list):
- A new RLModule handling global observations and multiple VF heads.
- A multi-agent CartPole environment with global observations and per-agent rewards.
- Updates to learner connectors and loss computations to support the new global and multi-head design.
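For orientation, the layout described above can be pictured roughly as follows. This is a plain-PyTorch sketch with made-up names and dimensions, not the code in `global_observations_many_vf_heads_rlm.py`: a shared encoder consumes the global observation, a single policy head emits MultiDiscrete logits (one slot per agent), and one value head per agent backs the separate per-agent reward streams.

```python
import torch
from torch import nn


class GlobalObsMultiHeadNet(nn.Module):
    """Shared encoder + MultiDiscrete policy head + one value head per agent (illustrative)."""

    def __init__(self, global_obs_dim=8, num_agents=2, actions_per_agent=2, hidden=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(global_obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # Concatenated logits parameterize a MultiDiscrete([actions_per_agent] * num_agents) distribution.
        self.pi_head = nn.Linear(hidden, num_agents * actions_per_agent)
        # One scalar value head per agent, enabling per-agent value targets/GAE.
        self.vf_heads = nn.ModuleList([nn.Linear(hidden, 1) for _ in range(num_agents)])

    def forward(self, global_obs):
        features = self.encoder(global_obs)
        logits = self.pi_head(features)  # [B, num_agents * actions_per_agent]
        values = torch.cat([vf(features) for vf in self.vf_heads], dim=-1)  # [B, num_agents]
        return logits, values


# Quick shape check with a batch of 4 global observations.
logits, values = GlobalObsMultiHeadNet()(torch.randn(4, 8))
assert logits.shape == (4, 4) and values.shape == (4, 2)
```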
Reviewed Changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 3 comments.
File | Description |
---|---|
rllib/examples/rl_modules/classes/global_observations_many_vf_heads_rlm.py | Introduces the new RLModule with an encoder, a multi-discrete (global) policy head, and multiple value function heads. |
rllib/examples/multi_agent/global_observations_and_central_model.py | Sets up the multi-agent experiment with a global policy and a rewards connector constructing per-agent rewards. |
rllib/examples/learners/classes/ppo_global_observations_many_vf_learner.py | Implements a custom PPO learner to compute loss and GAE for multiple value heads (see the GAE sketch after this table). |
rllib/examples/envs/classes/multi_agent/two_agent_cartpole_with_global_observations.py | Defines the custom two-agent CartPole environment with global observations and individual reward streams. |
rllib/evaluation/postprocessing.py | Updates postprocessing to use the centralized Columns constants. |
rllib/core/rl_module/torch/torch_rl_module.py | Adds support for MultiDiscrete action spaces via TorchMultiCategorical. |
rllib/core/columns.py | Refines and clarifies column constant definitions. |
rllib/connectors/module_to_env/get_actions.py | Comments out the action log probability assignment, potentially for debugging or design reasons. |
rllib/connectors/learner/general_advantage_estimation.py | Refactors GAE computations to use updated Columns constants. |
rllib/algorithms/ppo/torch/ppo_torch_learner.py | Adjusts loss computation to reference updated column names. |
rllib/algorithms/marwil/torch/marwil_torch_learner.py | Applies similar column name updates in loss logging. |
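Since `ppo_global_observations_many_vf_learner.py` computes loss and GAE for multiple value heads, the following plain-NumPy sketch shows what per-agent GAE amounts to. It is illustrative only (made-up agent names and numbers); the actual learner operates on batched tensors through RLlib's learner connector pipeline.

```python
import numpy as np


def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimation over one terminated episode (plain NumPy)."""
    advantages = np.zeros(len(rewards), dtype=np.float32)
    last_adv = 0.0
    for t in reversed(range(len(rewards))):
        # Bootstrap with 0.0 at the end, i.e. assume the episode terminated.
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]
        last_adv = delta + gamma * lam * last_adv
        advantages[t] = last_adv
    return advantages


# With one value head per agent, GAE simply runs once per (reward stream, value head) pair.
rewards = {"agent_left": np.array([1.0, 1.0, 1.0]), "agent_right": np.array([1.0, 0.0, 1.0])}
vf_preds = {"agent_left": np.array([0.5, 0.4, 0.3]), "agent_right": np.array([0.6, 0.2, 0.1])}
advantages = {agent_id: gae(rewards[agent_id], vf_preds[agent_id]) for agent_id in rewards}
```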
# If you need more granularity between the different forward behaviors during
# the different phases of the module's lifecycle, implement three different
# forward methods. When doing so, it is recommended to put the inference and
# exploration versions inside a `with torch.no_grad()` context for better
# performance.
# def _forward_train(self, batch):
#     ...
#
# def _forward_inference(self, batch):
#     with torch.no_grad():
#         return self._forward_train(batch)
#
# def _forward_exploration(self, batch):
#     with torch.no_grad():
#         return self._forward_train(batch)
The commented-out alternative `_forward` methods sit after the return statement, where they read like dead code. Consider moving these notes to the module's docstring or removing them to improve code clarity.
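For context, here is what the pattern those comments describe looks like when actually implemented. This is a minimal sketch assuming RLlib's new API stack (`TorchRLModule`, `Columns`); the class name, layer sizes, and returned columns are illustrative and not taken from the PR's RLModule.

```python
import torch
from torch import nn

from ray.rllib.core.columns import Columns
from ray.rllib.core.rl_module.torch.torch_rl_module import TorchRLModule


class ThreeForwardsRLModule(TorchRLModule):
    """Only _forward_train carries gradients; inference/exploration wrap it in no_grad."""

    def setup(self):
        # Sizes hardcoded to keep the sketch self-contained (8-dim global obs,
        # MultiDiscrete([2, 2]) policy logits, one value output per agent).
        self._encoder = nn.Sequential(nn.Linear(8, 64), nn.ReLU())
        self._pi_head = nn.Linear(64, 4)
        self._vf_heads = nn.Linear(64, 2)

    def _forward_train(self, batch, **kwargs):
        features = self._encoder(batch[Columns.OBS])
        return {
            Columns.ACTION_DIST_INPUTS: self._pi_head(features),
            Columns.VF_PREDS: self._vf_heads(features),
        }

    def _forward_inference(self, batch, **kwargs):
        # No gradients needed when only computing actions.
        with torch.no_grad():
            return self._forward_train(batch, **kwargs)

    def _forward_exploration(self, batch, **kwargs):
        with torch.no_grad():
            return self._forward_train(batch, **kwargs)
```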
# if Columns.ACTION_LOGP not in batch:
#     batch[Columns.ACTION_LOGP] = action_dist.logp(actions)
[nitpick] If the removal of the action log probability assignment is intentional, please remove the commented-out code to avoid confusion.
Suggested change: replace

# if Columns.ACTION_LOGP not in batch:
#     batch[Columns.ACTION_LOGP] = action_dist.logp(actions)

with

# Removed commented-out code to avoid confusion.
# reward_agent_right += -0.1
# reward_agent_left += -0.1
[nitpick] Consider either removing or clarifying the commented-out negative reward adjustments if they are no longer needed, to reduce potential confusion.
Suggested change: replace

# reward_agent_right += -0.1
# reward_agent_left += -0.1

with

# Removed commented-out negative reward adjustments to reduce confusion.
This pull request has been automatically marked as stale because it has not had recent activity. You can always ask for help on our discussion forum or Ray's public slack channel. If you'd like to keep this open, just leave any comment, and the stale label will be removed.
Examples folder do-over (vol 53): Learning 2-agent cartpole with global observation, 1 policy outputting all agents' actions, and individual rewards.
The new example trains a 2-agent CartPole env with a single global observation and one central policy that outputs all agents' actions (through a `MultiDiscrete` action space, where each slot represents the individual action for each of the 2 agents), and individual rewards published through the `infos` dicts. Note that the "global" rewards in the env are dummy rewards and should not be used for training; the actual per-agent rewards are found in the `infos`.
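To make that env design concrete, here is a small runnable sketch in plain gymnasium. It is not the PR's actual `two_agent_cartpole_with_global_observations.py`; the class name, the `agent_rewards` info key, and the wrap-two-CartPoles approach are all illustrative. The global observation is the concatenation of both agents' states, the single `MultiDiscrete` action carries one slot per agent, the env-level reward is a dummy, and the real per-agent rewards travel in the info dict.

```python
import gymnasium as gym
import numpy as np
from gymnasium import spaces


class TwoAgentCartPoleGlobalObs(gym.Env):
    """Two CartPole-v1 copies behind one 'central' interface (illustrative only)."""

    def __init__(self):
        self._envs = [gym.make("CartPole-v1") for _ in range(2)]
        single = self._envs[0].observation_space
        # Global observation = concatenation of both agents' 4D cartpole states.
        self.observation_space = spaces.Box(
            low=np.concatenate([single.low, single.low]),
            high=np.concatenate([single.high, single.high]),
            dtype=np.float32,
        )
        # One action slot per agent.
        self.action_space = spaces.MultiDiscrete([2, 2])

    def reset(self, *, seed=None, options=None):
        obs = [
            env.reset(seed=None if seed is None else seed + i)[0]
            for i, env in enumerate(self._envs)
        ]
        return np.concatenate(obs).astype(np.float32), {}

    def step(self, action):
        obs, per_agent_rewards, dones = [], {}, []
        for i, (env, a) in enumerate(zip(self._envs, action)):
            o, r, terminated, truncated, _ = env.step(int(a))
            obs.append(o)
            per_agent_rewards[f"agent_{i}"] = r
            dones.append(terminated or truncated)
        # The env-level reward is a dummy and must not be trained on; the real
        # per-agent rewards are published through the info dict instead.
        return (
            np.concatenate(obs).astype(np.float32),
            0.0,
            any(dones),  # end the whole episode as soon as either pole falls
            False,
            {"agent_rewards": per_agent_rewards},
        )
```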
Why are these changes needed?
Related issue number
Checks
- I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- …method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.