Description
I'm trying to train the MADDPG algorithm on the SMAC environment with the command below:
python src/main.py --config=maddpg --env-config=sc2 with env_args.map_name="corridor"
but I get this error (a single allocation of roughly 75 GB):
RuntimeError: [enforce fail at alloc_cpu.cpp:75] err == 0. DefaultCPUAllocator: can't allocate memory: you tried to allocate 75067200000 bytes. Error code 12 (Cannot allocate memory)
My system:
ubuntu: 20.04
python: 3.7.12
torch: 1.13.1
Other algorithms run fine; I only hit this problem with MADDPG, and it happens with both the main epymarl code and the epymarl_addtional_algo code.
The full error log is below:
[DEBUG 10:22:34] git.cmd Popen(['git', 'version'], cwd=/home/abdulghani/epymarl_addtional_algo, universal_newlines=False, shell=None, istream=None)
[DEBUG 10:22:34] git.cmd Popen(['git', 'version'], cwd=/home/abdulghani/epymarl_addtional_algo, universal_newlines=False, shell=None, istream=None)
src/main.py:81: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
config_dict = yaml.load(f)
src/main.py:50: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
config_dict = yaml.load(f)
src/main.py:58: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated since Python 3.3,and in 3.9 it will stop working
if isinstance(v, collections.Mapping):
[INFO 10:22:34] root Saving to FileStorageObserver in results/sacred.
[DEBUG 10:22:34] pymarl Using capture mode "fd"
[INFO 10:22:34] pymarl Running command 'my_main'
[INFO 10:22:34] pymarl Started run with ID "4"
[DEBUG 10:22:34] pymarl Starting Heartbeat
[DEBUG 10:22:34] my_main Started
[INFO 10:22:34] my_main Experiment Parameters:
[INFO 10:22:34] my_main
{ 'add_value_last_step': True,
'agent': 'rnn',
'agent_output_type': 'pi_logits',
'batch_size': 32,
'batch_size_run': 1,
'buffer_cpu_only': True,
'buffer_size': 50000,
'checkpoint_path': '',
'critic_type': 'maddpg_critic',
'env': 'sc2',
'env_args': { 'continuing_episode': False,
'debug': False,
'difficulty': '7',
'game_version': None,
'heuristic_ai': False,
'heuristic_rest': False,
'map_name': 'corridor',
'move_amount': 2,
'obs_all_health': True,
'obs_instead_of_state': False,
'obs_last_action': False,
'obs_own_health': True,
'obs_pathing_grid': False,
'obs_terrain_height': False,
'obs_timestep_number': False,
'replay_dir': '',
'replay_prefix': '',
'reward_death_value': 10,
'reward_defeat': 0,
'reward_negative_scale': 0.5,
'reward_only_positive': True,
'reward_scale': True,
'reward_scale_rate': 20,
'reward_sparse': False,
'reward_win': 200,
'seed': 36613826,
'state_last_action': True,
'state_timestep_number': False,
'step_mul': 8},
'evaluate': False,
'gamma': 0.99,
'grad_norm_clip': 10,
'hidden_dim': 128,
'hypergroup': None,
'label': 'default_label',
'learner': 'maddpg_learner',
'learner_log_interval': 10000,
'load_step': 0,
'local_results_path': 'results',
'log_interval': 50000,
'lr': 0.0005,
'mac': 'maddpg_mac',
'name': 'maddpg',
'obs_agent_id': True,
'obs_individual_obs': False,
'obs_last_action': False,
'optim_alpha': 0.99,
'optim_eps': 1e-05,
'reg': 0.001,
'repeat_id': 1,
'runner': 'episode',
'runner_log_interval': 10000,
'save_model': True,
'save_model_interval': 50000,
'save_replay': True,
'seed': 36613826,
'standardise_returns': False,
'standardise_rewards': True,
't_max': 2050000,
'target_update_interval_or_tau': 200,
'test_greedy': True,
'test_interval': 50000,
'test_nepisode': 100,
'use_cuda': True,
'use_rnn': True,
'use_tensorboard': True}
2023-04-28 10:22:34.886446: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-04-28 10:22:34.972517: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
2023-04-28 10:22:35.382670: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda-11.7/lib64:
2023-04-28 10:22:35.382712: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda-11.7/lib64:
2023-04-28 10:22:35.382715: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
[DEBUG 10:22:35] h5py._conv Creating converter from 7 to 5
[DEBUG 10:22:35] h5py._conv Creating converter from 5 to 7
[DEBUG 10:22:35] h5py._conv Creating converter from 7 to 5
[DEBUG 10:22:35] h5py._conv Creating converter from 5 to 7
[DEBUG 10:22:39] pymarl Stopping Heartbeat
[ERROR 10:22:39] pymarl Failed after 0:00:05!
Traceback (most recent calls WITHOUT Sacred internals):
File "src/main.py", line 36, in my_main
run(_run, config, _log)
File "/home/abdulghani/epymarl_addtional_algo/src/run.py", line 55, in run
run_sequential(args=args, logger=logger)
File "/home/abdulghani/epymarl_addtional_algo/src/run.py", line 117, in run_sequential
device="cpu" if args.buffer_cpu_only else args.device,
File "/home/abdulghani/epymarl_addtional_algo/src/components/episode_buffer.py", line 209, in init
super(ReplayBuffer, self).init(scheme, groups, buffer_size, max_seq_length, preprocess=preprocess, device=device)
File "/home/abdulghani/epymarl_addtional_algo/src/components/episode_buffer.py", line 28, in init
self._setup_data(self.scheme, self.groups, batch_size, max_seq_length, self.preprocess)
File "/home/abdulghani/epymarl_addtional_algo/src/components/episode_buffer.py", line 75, in _setup_data
self.data.transition_data[field_key] = th.zeros((batch_size, max_seq_length, *shape), dtype=dtype, device=self.device)
RuntimeError: [enforce fail at alloc_cpu.cpp:75] err == 0. DefaultCPUAllocator: can't allocate memory: you tried to allocate 75067200000 bytes. Error code 12 (Cannot allocate memory)
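For reference, the failed allocation matches a back-of-the-envelope calculation from the config above: the replay buffer preallocates one CPU tensor per field of shape (buffer_size, max_seq_length, *shape), as shown in the last traceback frame. Below is a minimal sketch of that arithmetic; the per-step field size of 936 and corridor's episode_limit of 400 (so max_seq_length = 401) are my assumptions, not values printed in the log.

# Minimal sketch of the allocation in episode_buffer.py's _setup_data:
# th.zeros((batch_size, max_seq_length, *shape)) with batch_size = buffer_size.
# Assumed (not in the log): a float32 field with a flattened per-step
# size of 936, and corridor's episode_limit of 400.
buffer_size = 50000        # 'buffer_size' from the config dump above
max_seq_length = 400 + 1   # assumed episode_limit + 1
per_step_elems = 936       # hypothetical flattened field size
bytes_per_elem = 4         # float32

total_bytes = buffer_size * max_seq_length * per_step_elems * bytes_per_elem
print(total_bytes)  # 75067200000 -- exactly the number in the RuntimeError

If that arithmetic is right, the 50000-episode off-policy buffer is what distinguishes MADDPG from the algorithms that run fine for me.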