Description
What happened + What you expected to happen
This concerns these lines of code:
ray/rllib/models/torch/misc.py
Lines 13 to 19 in a4434fa
The issues with this code are:
- for a small tensor, a division by zero can occur
- torch.sqrt() is rarely used and causes an MKL compute issue with the current configuration.
The MKL issue arises on roughly 15% of the nodes in our cluster.
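A contrived illustration of the first point (not taken from the report): if a row of the weight tensor ends up with a (near-)zero norm, the scaling factor blows up unless an epsilon guard is in place.

```python
import torch

w = torch.zeros(1, 4)  # degenerate row with zero norm
norm = w.pow(2).sum(1, keepdim=True).pow(0.5)
print(1.0 / norm)            # tensor([[inf]]) -- division by zero
print(1.0 / (1e-10 + norm))  # finite thanks to the epsilon guard
```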
Replacing normc_initializer(1.0) with torch.nn.init.xavier_uniform_ resolves the issue.
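For reference, a minimal sketch of that workaround which keeps normc_initializer's call signature, so it can be passed wherever RLlib expects an initializer (e.g. SlimFC's initializer argument); xavier_initializer is a hypothetical helper name, not part of RLlib.

```python
import torch.nn as nn


def xavier_initializer(std: float = 1.0):
    # Hypothetical drop-in replacement: same std -> initializer(tensor)
    # pattern as normc_initializer, but delegates to Xavier init.
    # `std` is accepted only for signature compatibility and is ignored.
    def initializer(tensor):
        nn.init.xavier_uniform_(tensor)

    return initializer
```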
This should, however, be solved by PyTorch rather than Ray.
I expect torch.sqrt to produce the same results as torch.pow(tensor, 0.5).
def normc_initializer(std: float = 1.0) -> Any:
    def initializer(tensor):
        # Fill with N(0, 1), then rescale so each row (reduced over dim 1)
        # has norm `std`; the 1e-10 guards against a zero denominator.
        tensor.data.normal_(0, 1)
        tensor.data *= std / (1e-10 + tensor.data.pow(2).sum(1, keepdim=True).pow(0.5))

    return initializer
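A quick sanity check of that expectation (my sketch, not from the report): element-wise, torch.sqrt(x) and x.pow(0.5) agree for non-negative inputs; the practical difference is which kernel gets dispatched, and the backtrace below shows torch.sqrt landing in MKL's vmsSqrt on the affected nodes.

```python
import torch

x = torch.rand(1000, 16)  # non-negative inputs, like the summed squares above
assert torch.allclose(torch.sqrt(x), x.pow(0.5))
```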
\tmp\worker-b00efb5[...]018.err
:actor_name:RolloutWorker
INFO:root:model: model init called..
*** SIGFPE received at time=1667782622 on cpu 10 ***
PC: @ 0x14f13b815d47 (unknown) mkl_vml_serv_GetMinN
@ 0x14f1783d93c0 (unknown) (unknown)
@ 0x14f1409d5752 (unknown) vmsSqrt
@ 0x3f2c2d8f3f4faf39 (unknown) (unknown)
[2022-11-07 01:57:02,754 E 4022018 4022018] logging.cc:325: *** SIGFPE received at time=1667782622 on cpu 10 ***
[2022-11-07 01:57:02,754 E 4022018 4022018] logging.cc:325: PC: @ 0x14f13b815d47 (unknown) mkl_vml_serv_GetMinN
[2022-11-07 01:57:02,755 E 4022018 4022018] logging.cc:325: @ 0x14f1783d93c0 (unknown) (unknown)
[2022-11-07 01:57:02,755 E 4022018 4022018] logging.cc:325: @ 0x14f1409d5752 (unknown) vmsSqrt
[2022-11-07 01:57:02,757 E 4022018 4022018] logging.cc:325: @ 0x3f2c2d8f3f4faf39 (unknown) (unknown)
Fatal Python error: Floating point exception
Stack (most recent call first):
File "/rrl_app/miniconda/lib/python3.9/site-packages/ray/rllib/models/torch/misc.py", line 15 in initializer
File "/rrl_app/miniconda/lib/python3.9/site-packages/ray/rllib/models/torch/misc.py", line 151 in __init__
File "/rrl_app/miniconda/lib/python3.9/site-packages/ray/rllib/models/torch/fcnet.py", line 57 in __init__
File "/home/mfeil/csw_rrl/object-centric-rl/rrl/models/rllib/causalworld_model.py", line 79 in __init__
File "/rrl_app/miniconda/lib/python3.9/site-packages/ray/rllib/models/catalog.py", line 609 in get_model_v2
File "/rrl_app/miniconda/lib/python3.9/site-packages/ray/rllib/policy/torch_policy.py", line 158 in __init__
File "/rrl_app/miniconda/lib/python3.9/site-packages/ray/rllib/agents/ppo/ppo_torch_policy.py", line 40 in __init__
File "/rrl_app/miniconda/lib/python3.9/site-packages/ray/rllib/policy/policy_map.py", line 152 in create_policy
File "/rrl_app/miniconda/lib/python3.9/site-packages/ray/rllib/evaluation/rollout_worker.py", line 1722 in _build_policy_map
File "/rrl_app/miniconda/lib/python3.9/site-packages/ray/util/tracing/tracing_helper.py", line 462 in _resume_span
File "/rrl_app/miniconda/lib/python3.9/site-packages/ray/rllib/evaluation/rollout_worker.py", line 626 in __init__
File "/rrl_app/miniconda/lib/python3.9/site-packages/ray/util/tracing/tracing_helper.py", line 462 in _resume_span
File "/rrl_app/miniconda/lib/python3.9/site-packages/ray/_private/function_manager.py", line 701 in actor_method_executor
File "/rrl_app/miniconda/lib/python3.9/site-packages/ray/worker.py", line 449 in main_loop
File "/rrl_app/miniconda/lib/python3.9/site-packages/ray/workers/default_worker.py", line 235 in <module>
Versions / Dependencies
Docker with nvidia/cuda:11.6.1-cudnn8-devel-ubuntu20.04
Python 3.9.7
Ray versions 1.9.0, 1.11.0, 1.12.0, 1.13.0
PyTorch 1.10.2; also tested with 1.8.2 and 1.12.1+cu116
Reproduction script
No standalone reproduction script yet; the failure only appears on nodes with the affected MKL configuration (roughly 15% of our cluster).
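Below is a hypothetical sketch of what a minimal repro could look like, assuming the sqrt inside normc_initializer is what reaches MKL's vmsSqrt (per the backtrace above). It is only expected to SIGFPE on nodes with the affected MKL build.

```python
import torch
from ray.rllib.models.torch.misc import normc_initializer

# Initialize a small Linear layer's weight with normc_initializer(1.0).
# On affected nodes this is where the SIGFPE is expected; elsewhere it
# completes normally.
layer = torch.nn.Linear(4, 4)
normc_initializer(1.0)(layer.weight)
print("initialized without SIGFPE; weight norm:", layer.weight.norm().item())
```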
Issue Severity
Medium: It is a significant difficulty but I can work around it.