[RLLIB][Torch] numerically unstable + mkl issue in torch.sqrt normc_initializer #30191
Open

Description

@michaelfeil

What happened + What you expected to happen

This concerns the following lines of code in ray/rllib/models/torch/misc.py:

@DeveloperAPI
def normc_initializer(std: float = 1.0) -> Any:
    def initializer(tensor):
        tensor.data.normal_(0, 1)
        tensor.data *= std / torch.sqrt(tensor.data.pow(2).sum(1, keepdim=True))

    return initializer
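For context (my own sketch, not part of the issue): the initializer fills the tensor with N(0, 1) samples and then rescales it so that each row has L2 norm std (the sum runs over dim 1). When that norm is close to zero, the division blows up.

    import torch

    # What normc_initializer(1.0) effectively does to a weight matrix:
    w = torch.empty(256, 64)
    w.normal_(0, 1)
    w *= 1.0 / torch.sqrt(w.pow(2).sum(1, keepdim=True))
    print(w.norm(dim=1)[:3])  # every row now has norm ~1.0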

The issues with this code are:

  1. In the case of a small tensor, a divide-by-zero may happen.
  2. torch.sqrt() is rarely used and causes an MKL compute issue with the current configuration.
     The MKL issue arises for roughly 15% of the nodes in our cluster.
     Replacing normc_initializer(1.0) with torch.nn.init.xavier_uniform_ resolves the issue (see the sketch after this list).
     This should, however, be solved by PyTorch and not Ray.
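A minimal sketch of the workaround from point 2 (mine, not from the issue; it assumes the layer is built through RLlib's SlimFC helper, whose initializer argument accepts any callable that initializes the weight tensor in place):

    import torch
    from ray.rllib.models.torch.misc import SlimFC

    # Instead of initializer=normc_initializer(1.0), pass xavier_uniform_ directly;
    # it has the same "callable taking the weight tensor" shape and sidesteps the
    # torch.sqrt call that triggers the MKL crash on the affected nodes.
    layer = SlimFC(
        in_size=64,
        out_size=64,
        initializer=torch.nn.init.xavier_uniform_,
        activation_fn="tanh",
    )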

I expect torch.sqrt to produce the same results as torch.pow(tensor, 0.5), so the initializer could be rewritten as:

def normc_initializer(std: float = 1.0) -> Any:
    def initializer(tensor):
        tensor.data.normal_(0, 1)
        tensor.data *= std / (1e-10 + tensor.data.pow(2).sum(1, keepdim=True).pow(0.5))

    return initializer
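As a quick sanity check (again my own sketch, not from the issue), the snippet below shows the failure mode the 1e-10 guard protects against, and that torch.sqrt and pow(0.5) agree numerically:

    import torch

    # A row whose squared sum underflows to 0: the unguarded division yields inf,
    # while the epsilon-guarded variant stays finite.
    row_sq_sum = torch.zeros(1)
    print(1.0 / torch.sqrt(row_sq_sum))          # tensor([inf])
    print(1.0 / (1e-10 + row_sq_sum.pow(0.5)))   # tensor([1.0000e+10]), finite

    # torch.sqrt(x) and x.pow(0.5) match numerically, so the rewrite only adds the
    # guard (and, per the crash log below, avoids the vmsSqrt code path).
    x = torch.rand(1000)
    print(torch.allclose(torch.sqrt(x), x.pow(0.5)))  # True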

Worker log \tmp\worker-b00efb5[...]018.err:

:actor_name:RolloutWorker
INFO:root:model: model init called..
*** SIGFPE received at time=1667782622 on cpu 10 ***
PC: @     0x14f13b815d47  (unknown)  mkl_vml_serv_GetMinN
    @     0x14f1783d93c0  (unknown)  (unknown)
    @     0x14f1409d5752  (unknown)  vmsSqrt
    @ 0x3f2c2d8f3f4faf39  (unknown)  (unknown)
[2022-11-07 01:57:02,754 E 4022018 4022018] logging.cc:325: *** SIGFPE received at time=1667782622 on cpu 10 ***
[2022-11-07 01:57:02,754 E 4022018 4022018] logging.cc:325: PC: @     0x14f13b815d47  (unknown)  mkl_vml_serv_GetMinN
[2022-11-07 01:57:02,755 E 4022018 4022018] logging.cc:325:     @     0x14f1783d93c0  (unknown)  (unknown)
[2022-11-07 01:57:02,755 E 4022018 4022018] logging.cc:325:     @     0x14f1409d5752  (unknown)  vmsSqrt
[2022-11-07 01:57:02,757 E 4022018 4022018] logging.cc:325:     @ 0x3f2c2d8f3f4faf39  (unknown)  (unknown)
Fatal Python error: Floating point exception

Stack (most recent call first):
  File "/rrl_app/miniconda/lib/python3.9/site-packages/ray/rllib/models/torch/misc.py", line 15 in initializer
  File "/rrl_app/miniconda/lib/python3.9/site-packages/ray/rllib/models/torch/misc.py", line 151 in __init__
  File "/rrl_app/miniconda/lib/python3.9/site-packages/ray/rllib/models/torch/fcnet.py", line 57 in __init__
  File "/home/mfeil/csw_rrl/object-centric-rl/rrl/models/rllib/causalworld_model.py", line 79 in __init__
  File "/rrl_app/miniconda/lib/python3.9/site-packages/ray/rllib/models/catalog.py", line 609 in get_model_v2
  File "/rrl_app/miniconda/lib/python3.9/site-packages/ray/rllib/policy/torch_policy.py", line 158 in __init__
  File "/rrl_app/miniconda/lib/python3.9/site-packages/ray/rllib/agents/ppo/ppo_torch_policy.py", line 40 in __init__
  File "/rrl_app/miniconda/lib/python3.9/site-packages/ray/rllib/policy/policy_map.py", line 152 in create_policy
  File "/rrl_app/miniconda/lib/python3.9/site-packages/ray/rllib/evaluation/rollout_worker.py", line 1722 in _build_policy_map
  File "/rrl_app/miniconda/lib/python3.9/site-packages/ray/util/tracing/tracing_helper.py", line 462 in _resume_span
  File "/rrl_app/miniconda/lib/python3.9/site-packages/ray/rllib/evaluation/rollout_worker.py", line 626 in __init__
  File "/rrl_app/miniconda/lib/python3.9/site-packages/ray/util/tracing/tracing_helper.py", line 462 in _resume_span
  File "/rrl_app/miniconda/lib/python3.9/site-packages/ray/_private/function_manager.py", line 701 in actor_method_executor
  File "/rrl_app/miniconda/lib/python3.9/site-packages/ray/worker.py", line 449 in main_loop
  File "/rrl_app/miniconda/lib/python3.9/site-packages/ray/workers/default_worker.py", line 235 in <module>

Versions / Dependencies

Docker with nvidia/cuda:11.6.1-cudnn8-devel-ubuntu20.04
Python 3.9.7
Ray versions 1.9.0, 1.11.0, 1.12.0, 1.13.0
PyTorch 1.10.2; also tested on 1.8.2 and 1.12.1+cu116

Reproduction script

needs-repro-script

Issue Severity

Medium: It is a significant difficulty but I can work around it.

Metadata

Assignees

No one assigned

Labels

P2: Important issue, but not time-critical
bug: Something that is supposed to be working; but isn't
pending-cleanup: This issue is pending cleanup. It will be removed in 2 weeks after being assigned.
rllib: RLlib related issues
