Description
I was unable to verify the results reported for the MAA2C_NS algorithm on the Tag task, even after setting `add_value_last_step=False` as per issue #43.
Upon cross-validation I found evidence suggesting that the maximum returns may have been swapped between Table 3 (shared parameters) and Table 7 (non-shared parameters).
Reproduce:
- Commit reference: 3d1463d
- Divide the rewards by a factor of 3.0, as per issue #29 ("Cannot obtain the reported results on MPE:SimpleSpread task"); see the sketch after this list.
- Set `maa2c_ns.yaml` according to Section C.1, subsection MPE Predator-Prey, and Table 23 of the Supplemental material.
- Set `time_limit=25` in `gymma.yaml`.
- Set `add_value_last_step=False`.
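For the reward-scaling step, I did something along the lines of the following (a minimal sketch for illustration only; `ScaledRewardWrapper` is my own hypothetical helper, not code from this repository):

```python
# Hypothetical helper that divides environment rewards by a constant factor,
# mirroring the reward / 3.0 correction discussed in issue #29.
import gym


class ScaledRewardWrapper(gym.RewardWrapper):
    """Divide every reward by `scale` before it reaches the learner."""

    def __init__(self, env, scale=3.0):
        super().__init__(env)
        self.scale = scale

    def reward(self, reward):
        # Multi-agent environments may return one reward per agent.
        if isinstance(reward, (list, tuple)):
            return [r / self.scale for r in reward]
        return reward / self.scale
```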
Config:
```json
{
"action_selector": "soft_policies",
"add_value_last_step": false,
"agent": "rnn_ns",
"agent_output_type": "pi_logits",
"batch_size": 10,
"batch_size_run": 10,
"buffer_cpu_only": true,
"buffer_size": 10,
"checkpoint_path": "",
"critic_type": "cv_critic_ns",
"entropy_coef": 0.01,
"env": "gymma",
"env_args": { "key": "mpe:SimpleTag-v0",
"pretrained_wrapper": "PretrainedTag",
"seed": 343532797,
"state_last_action": false,
"time_limit": 25},
"evaluate": false,
"gamma": 0.99,
"grad_norm_clip": 10,
"hidden_dim": 128,
"hypergroup": null,
"label": "default_label",
"learner": "actor_critic_learner",
"learner_log_interval": 10000,
"load_step": 0,
"local_results_path": "results",
"log_interval": 250000,
"lr": 0.0003,
"mac": "non_shared_mac",
"mask_before_softmax": true,
"name": "maa2c_ns",
"obs_agent_id": false,
"obs_individual_obs": false,
"obs_last_action": false,
"optim_alpha": 0.99,
"optim_eps": 1e-05,
"q_nstep": 5,
"repeat_id": 1,
"runner": "parallel",
"runner_log_interval": 10000,
"save_model": false,
"save_model_interval": 500000,
"save_replay": false,
"seed": 343532797,
"standardise_returns": false,
"standardise_rewards": true,
"t_max": 20050000,
"target_update_interval_or_tau": 0.01,
"test_greedy": true,
"test_interval": 500000,
"test_nepisode": 100,
"use_cuda": false,
"use_rnn": true,
"use_tensorboard": true
}
```
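For context on why `add_value_last_step` matters here (issue #43): my understanding, sketched below in generic form rather than as the actual EPyMARL learner code, is that it controls whether the critic's estimate of the final state value is used to bootstrap the return when an episode is truncated by the time limit.

```python
def discounted_returns(rewards, last_value, gamma, add_value_last_step):
    """Discounted returns over a truncated rollout (generic illustration).

    rewards:    per-step rewards r_0 .. r_{T-1}
    last_value: critic estimate V(s_T) at the truncation step
    """
    # Bootstrap from V(s_T) only when the flag is enabled.
    ret = last_value if add_value_last_step else 0.0
    returns = []
    for r in reversed(rewards):
        ret = r + gamma * ret
        returns.append(ret)
    return list(reversed(returns))


# With time_limit=25 every Tag episode is truncated, so the flag changes
# every return target, which is presumably why it affects reproduction.
print(discounted_returns([1.0, 1.0, 1.0], last_value=5.0, gamma=0.99, add_value_last_step=True))
print(discounted_returns([1.0, 1.0, 1.0], last_value=5.0, gamma=0.99, add_value_last_step=False))
```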
Considerations
The first consideration is that I ran experiments for both MAA2C and MAA2C_NS and obtained better results with MAA2C.
The second consideration is the consistency of the reported results on the Tag task, as stated in the paper:

> We observe that in all environments except the matrix games, parameter sharing improves the returns over no parameter sharing. While the average values presented in Figure 3 do not seem statistically significant, by looking closer in Tables 3 and 7 we observe that in several cases of algorithm-task pairs the improvement due to parameter sharing seems significant. Such improvements can be observed for most algorithms in MPE tasks, especially in Speaker-Listener and Tag.

Table A groups the results for all algorithms except COMA, for both modalities, on the MPE tasks and shows the relative change, computed as (PS - NS) / |PS|. A positive change means that the parameter-sharing variant achieves a higher maximum return than the non-shared variant.
Table A: Maximum returns over five seeds for eight algorithms with
parameter sharing (PS) and without parameter sharing (NS), and the relative
change, across the MPE tasks.
| Algorithm | Task | PS | NS | Change (%) |
|---|---|---|---|---|
| IQL | Speaker-Listener | -18.36 | -18.61 | 1.36% |
| IQL | Spread | -132.63 | -141.87 | 6.97% |
| IQL | Adversary | 9.38 | 9.09 | 3.09% |
| IQL | Tag | 22.18 | 19.18 | 13.53% |
| IA2C | Speaker-Listener | -12.6 | -17.08 | 35.56% |
| IA2C | Spread | -134.43 | -131.74 | -2.00% |
| IA2C | Adversary | 12.12 | 10.8 | 10.89% |
| IA2C | Tag | 17.44 | 16.04 | 8.03% |
| IPPO | Speaker-Listener | -13.1 | -15.56 | 18.78% |
| IPPO | Spread | -133.86 | -132.46 | -1.05% |
| IPPO | Adversary | 12.17 | 11.17 | 8.22% |
| IPPO | Tag | 19.44 | 18.46 | 5.04% |
| MADDPG | Speaker-Listener | -13.56 | -12.73 | -6.12% |
| MADDPG | Spread | -141.7 | -136.73 | -3.51% |
| MADDPG | Adversary | 8.97 | 8.81 | 1.78% |
| MADDPG | Tag | 12.5 | 2.82 | 77.44% |
| MAA2C | Speaker-Listener | -10.71 | -13.66 | 27.54% |
| MAA2C | Spread | -129.9 | -130.88 | 0.75% |
| MAA2C | Adversary | 12.06 | 10.88 | 9.78% |
| MAA2C | Tag | 19.95 | 26.5 | -32.83% |
| MAPPO | Speaker-Listener | -10.68 | -14.35 | 34.36% |
| MAPPO | Spread | -133.54 | -128.64 | -3.67% |
| MAPPO | Adversary | 11.3 | 12.04 | -6.55% |
| MAPPO | Tag | 18.52 | 17.96 | 3.02% |
| VDN | Speaker-Listener | -15.95 | -15.47 | -3.01% |
| VDN | Spread | -131.03 | -142.13 | 8.47% |
| VDN | Adversary | 9.28 | 9.34 | -0.65% |
| VDN | Tag | 24.5 | 18.44 | 24.73% |
| QMIX | Speaker-Listener | -11.56 | -11.59 | 0.26% |
| QMIX | Spread | -126.62 | -130.97 | 3.44% |
| QMIX | Adversary | 9.67 | 11.32 | -17.06% |
| QMIX | Tag | 31.18 | 26.88 | 13.79% |
- Average change: 7.51%
- Total change: 240.40%
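For transparency, the change column is computed as (PS - NS) / |PS| * 100; this formula reproduces every row of Table A. A minimal check:

```python
# Minimal sketch of how the "Change (%)" column was computed,
# assuming change = (PS - NS) / |PS| * 100.
def change_pct(ps, ns):
    return (ps - ns) / abs(ps) * 100.0


rows = [
    ("IQL", "Tag", 22.18, 19.18),
    ("MAA2C", "Tag", 19.95, 26.5),   # the suspicious pair: NS > PS
    ("QMIX", "Tag", 31.18, 26.88),
]

for algo, task, ps, ns in rows:
    print(f"{algo:6s} {task}: {change_pct(ps, ns):+.2f}%")
# IQL    Tag: +13.53%
# MAA2C  Tag: -32.83%
# QMIX   Tag: +13.79%
```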
The discrepancy is even more pronounced when considering only the Tag task.
Table B: Maximum returns over five seeds on the Tag task with parameter sharing (PS)
and without parameter sharing (NS), the excess of returns of PS over NS, and the
relative change, for the eight algorithms.
| Algorithm | PS | NS | Excess of Returns | Change (%) |
|---|---|---|---|---|
| IQL | 22.18 | 19.18 | 3 | 13.53% |
| IA2C | 17.44 | 16.04 | 1.4 | 8.03% |
| IPPO | 19.44 | 18.46 | 0.98 | 5.04% |
| MADDPG | 12.5 | 2.82 | 9.68 | 77.44% |
| MAA2C | 19.95 | 26.5 | -6.55 | -32.83% |
| MAPPO | 18.52 | 17.96 | 0.56 | 3.02% |
| VDN | 24.5 | 18.44 | 6.06 | 24.73% |
| QMIX | 31.18 | 26.88 | 4.3 | 13.79% |
- Average excess of returns: 2.42875 (average change: 14.09%)
- Total excess of returns: 19.43 (total change: 112.75%)
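The two aggregate lines above can be reproduced under the same assumption (excess = PS - NS, change = (PS - NS) / |PS| * 100):

```python
# Reproducing the Table B aggregates from the per-algorithm Tag values.
tag = {
    "IQL": (22.18, 19.18), "IA2C": (17.44, 16.04), "IPPO": (19.44, 18.46),
    "MADDPG": (12.5, 2.82), "MAA2C": (19.95, 26.5), "MAPPO": (18.52, 17.96),
    "VDN": (24.5, 18.44), "QMIX": (31.18, 26.88),
}

excess = [ps - ns for ps, ns in tag.values()]
change = [(ps - ns) / abs(ps) * 100.0 for ps, ns in tag.values()]

print(sum(excess) / len(excess), sum(excess))   # ~2.43 and ~19.43
print(sum(change) / len(change), sum(change))   # ~14.09% and ~112.75%
```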
Can you confirm whether this is indeed the case, or point me in the right direction?

Thanks,