Description
I was unable to verify the results reported for the MAA2C_NS algorithm on the Tag task, even after setting `add_value_last_step=False` as per issue #43.
Upon cross-validation I found evidence suggesting that the maximum returns may have been swapped between Table 3 (shared parameters) and Table 7 (non-shared parameters).
Reproduce:
- Commit reference: 3d1463d
- Divide the rewards by a factor of 3.0, as per issue #29 ("Cannot obtain the reported results on MPE:SimpleSpread task"); see the sketch after this list.
- Set `maa2c_ns.yaml` according to Section C.1, subsection MPE Predator-Prey, and Table 23 of the Supplemental material.
- Set `time_limit=25` in `gymma.yaml`.
- Set `add_value_last_step=False`.
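For the reward-scaling step, I did something along the lines of the following (a minimal sketch for illustration only; `ScaledRewardWrapper` is my own hypothetical helper, not code from this repository):

```python
# Hypothetical helper that divides environment rewards by a constant factor,
# mirroring the reward / 3.0 correction discussed in issue #29.
import gym


class ScaledRewardWrapper(gym.RewardWrapper):
    """Divide every reward by `scale` before it reaches the learner."""

    def __init__(self, env, scale=3.0):
        super().__init__(env)
        self.scale = scale

    def reward(self, reward):
        # Multi-agent environments may return one reward per agent.
        if isinstance(reward, (list, tuple)):
            return [r / self.scale for r in reward]
        return reward / self.scale
```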
Config:
```json
{
"action_selector": "soft_policies",
"add_value_last_step": false,
"agent": "rnn_ns",
"agent_output_type": "pi_logits",
"batch_size": 10,
"batch_size_run": 10,
"buffer_cpu_only": true,
"buffer_size": 10,
"checkpoint_path": "",
"critic_type": "cv_critic_ns",
"entropy_coef": 0.01,
"env": "gymma",
"env_args": { "key": "mpe:SimpleTag-v0",
"pretrained_wrapper": "PretrainedTag",
"seed": 343532797,
"state_last_action": false,
"time_limit": 25},
"evaluate": false,
"gamma": 0.99,
"grad_norm_clip": 10,
"hidden_dim": 128,
"hypergroup": null,
"label": "default_label",
"learner": "actor_critic_learner",
"learner_log_interval": 10000,
"load_step": 0,
"local_results_path": "results",
"log_interval": 250000,
"lr": 0.0003,
"mac": "non_shared_mac",
"mask_before_softmax": true,
"name": "maa2c_ns",
"obs_agent_id": false,
"obs_individual_obs": false,
"obs_last_action": false,
"optim_alpha": 0.99,
"optim_eps": 1e-05,
"q_nstep": 5,
"repeat_id": 1,
"runner": "parallel",
"runner_log_interval": 10000,
"save_model": false,
"save_model_interval": 500000,
"save_replay": false,
"seed": 343532797,
"standardise_returns": false,
"standardise_rewards": true,
"t_max": 20050000,
"target_update_interval_or_tau": 0.01,
"test_greedy": true,
"test_interval": 500000,
"test_nepisode": 100,
"use_cuda": false,
"use_rnn": true,
"use_tensorboard": true
}
```
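For context on why `add_value_last_step` matters here (issue #43): my understanding, sketched below in generic form rather than as the actual EPyMARL learner code, is that it controls whether the critic's estimate of the final state value is used to bootstrap the return when an episode is truncated by the time limit.

```python
def discounted_returns(rewards, last_value, gamma, add_value_last_step):
    """Discounted returns over a truncated rollout (generic illustration).

    rewards:    per-step rewards r_0 .. r_{T-1}
    last_value: critic estimate V(s_T) at the truncation step
    """
    # Bootstrap from V(s_T) only when the flag is enabled.
    ret = last_value if add_value_last_step else 0.0
    returns = []
    for r in reversed(rewards):
        ret = r + gamma * ret
        returns.append(ret)
    return list(reversed(returns))


# With time_limit=25 every Tag episode is truncated, so the flag changes
# every return target, which is presumably why it affects reproduction.
print(discounted_returns([1.0, 1.0, 1.0], last_value=5.0, gamma=0.99, add_value_last_step=True))
print(discounted_returns([1.0, 1.0, 1.0], last_value=5.0, gamma=0.99, add_value_last_step=False))
```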
Considerations
The first consideration is that I ran experiments for both MAA2C and MAA2C_NS and obtained better results with MAA2C.
The second consideration is the consistency of the reported results on the Tag task, as stated in the paper:

> We observe that in all environments except the matrix games, parameter sharing improves the returns over no parameter sharing. While the average values presented in Figure 3 do not seem statistically significant, by looking closer in Tables 3 and 7 we observe that in several cases of algorithm-task pairs the improvement due to parameter sharing seems significant. Such improvements can be observed for most algorithms in MPE tasks, especially in Speaker-Listener and Tag.

Table A groups the results for all algorithms except COMA, for both modalities, on the MPE tasks and shows the relative change, computed as (PS - NS) / |PS|. A positive change means that the parameter-sharing variant achieves a higher maximum return than the non-shared variant.
Table A: Maximum returns over five seeds for eight algorithms with
parameter sharing (PS) and without parameter sharing (NS), and the relative
change, across the MPE tasks.
| Algorithm | Task | PS | NS | Change (%) |
|---|---|---|---|---|
| IQL | Speaker-Listener | -18.36 | -18.61 | 1.36% |
| IQL | Spread | -132.63 | -141.87 | 6.97% |
| IQL | Adversary | 9.38 | 9.09 | 3.09% |
| IQL | Tag | 22.18 | 19.18 | 13.53% |
| IA2C | Speaker-Listener | -12.6 | -17.08 | 35.56% |
| IA2C | Spread | -134.43 | -131.74 | -2.00% |
| IA2C | Adversary | 12.12 | 10.8 | 10.89% |
| IA2C | Tag | 17.44 | 16.04 | 8.03% |
| IPPO | Speaker-Listener | -13.1 | -15.56 | 18.78% |
| IPPO | Spread | -133.86 | -132.46 | -1.05% |
| IPPO | Adversary | 12.17 | 11.17 | 8.22% |
| IPPO | Tag | 19.44 | 18.46 | 5.04% |
| MADDPG | Speaker-Listener | -13.56 | -12.73 | -6.12% |
| MADDPG | Spread | -141.7 | -136.73 | -3.51% |
| MADDPG | Adversary | 8.97 | 8.81 | 1.78% |
| MADDPG | Tag | 12.5 | 2.82 | 77.44% |
| MAA2C | Speaker-Listener | -10.71 | -13.66 | 27.54% |
| MAA2C | Spread | -129.9 | -130.88 | 0.75% |
| MAA2C | Adversary | 12.06 | 10.88 | 9.78% |
| MAA2C | Tag | 19.95 | 26.5 | -32.83% |
| MAPPO | Speaker-Listener | -10.68 | -14.35 | 34.36% |
| MAPPO | Spread | -133.54 | -128.64 | -3.67% |
| MAPPO | Adversary | 11.3 | 12.04 | -6.55% |
| MAPPO | Tag | 18.52 | 17.96 | 3.02% |
| VDN | Speaker-Listener | -15.95 | -15.47 | -3.01% |
| VDN | Spread | -131.03 | -142.13 | 8.47% |
| VDN | Adversary | 9.28 | 9.34 | -0.65% |
| VDN | Tag | 24.5 | 18.44 | 24.73% |
| QMIX | Speaker-Listener | -11.56 | -11.59 | 0.26% |
| QMIX | Spread | -126.62 | -130.97 | 3.44% |
| QMIX | Adversary | 9.67 | 11.32 | -17.06% |
| QMIX | Tag | 31.18 | 26.88 | 13.79% |
- Average change: 7.51%
- Total change: 240.40%
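For transparency, the change column is computed as (PS - NS) / |PS| * 100; this formula reproduces every row of Table A. A minimal check:

```python
# Minimal sketch of how the "Change (%)" column was computed,
# assuming change = (PS - NS) / |PS| * 100.
def change_pct(ps, ns):
    return (ps - ns) / abs(ps) * 100.0


rows = [
    ("IQL", "Tag", 22.18, 19.18),
    ("MAA2C", "Tag", 19.95, 26.5),   # the suspicious pair: NS > PS
    ("QMIX", "Tag", 31.18, 26.88),
]

for algo, task, ps, ns in rows:
    print(f"{algo:6s} {task}: {change_pct(ps, ns):+.2f}%")
# IQL    Tag: +13.53%
# MAA2C  Tag: -32.83%
# QMIX   Tag: +13.79%
```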
The discrepancy is even more pronounced when considering only the Tag task.
Table B: Maximum returns over five seeds on the Tag task with parameter sharing (PS)
and without parameter sharing (NS), the excess of returns of PS over NS, and the
relative change, for the eight algorithms.
| Algorithm | PS | NS | Excess of Returns | Change (%) |
|---|---|---|---|---|
| IQL | 22.18 | 19.18 | 3 | 13.53% |
| IA2C | 17.44 | 16.04 | 1.4 | 8.03% |
| IPPO | 19.44 | 18.46 | 0.98 | 5.04% |
| MADDPG | 12.5 | 2.82 | 9.68 | 77.44% |
| MAA2C | 19.95 | 26.5 | -6.55 | -32.83% |
| MAPPO | 18.52 | 17.96 | 0.56 | 3.02% |
| VDN | 24.5 | 18.44 | 6.06 | 24.73% |
| QMIX | 31.18 | 26.88 | 4.3 | 13.79% |
- Average excess of returns: 2.42875 (average change: 14.09%)
- Total excess of returns: 19.43 (total change: 112.75%)
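The two aggregate lines above can be reproduced under the same assumption (excess = PS - NS, change = (PS - NS) / |PS| * 100):

```python
# Reproducing the Table B aggregates from the per-algorithm Tag values.
tag = {
    "IQL": (22.18, 19.18), "IA2C": (17.44, 16.04), "IPPO": (19.44, 18.46),
    "MADDPG": (12.5, 2.82), "MAA2C": (19.95, 26.5), "MAPPO": (18.52, 17.96),
    "VDN": (24.5, 18.44), "QMIX": (31.18, 26.88),
}

excess = [ps - ns for ps, ns in tag.values()]
change = [(ps - ns) / abs(ps) * 100.0 for ps, ns in tag.values()]

print(sum(excess) / len(excess), sum(excess))   # ~2.43 and ~19.43
print(sum(change) / len(change), sum(change))   # ~14.09% and ~112.75%
```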
Can you confirm whether this is indeed the case, or point me in the right direction?

Thanks,