[Core] Transient network failure on RPC `MarkJobFinished` causes node crash · Issue #53645 · ray-project/ray · GitHub
[Core] Transient network failure on RPC MarkJobFinished causes node crash #53645
Closed
@qts0312

Description

What happened + What you expected to happen

When running Ray in our cluster, we observed a bug where the head node crashed unexpectedly and triggered a cluster restart. The cause is that a transient network failure occurred while the MarkJobFinished RPC response was being returned to the raylet process, so the client (raylet) received an “UNAVAILABLE” status instead of a confirmation.

Such transient errors are expected to be handled gracefully. However, this one caused an unexpected head node crash.
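To illustrate what handling this gracefully could look like, here is a minimal, self-contained sketch in plain Python. It is not Ray's actual code: `FakeGcs`, `RpcUnavailable`, and `call_with_retry` are hypothetical names, and the sketch assumes the request is idempotent, so repeating it after a lost reply is safe.

```python
import time


class RpcUnavailable(Exception):
    """Hypothetical stand-in for a gRPC UNAVAILABLE status."""


class FakeGcs:
    """Toy server (assumption, not Ray code) that can lose replies in transit."""

    def __init__(self, drop_replies=0):
        self.finished = set()
        self.drop_replies = drop_replies
        self.calls = 0

    def mark_job_finished(self, job_id):
        self.calls += 1
        # The server-side state change happens before the reply is sent.
        # Using a set keeps the operation idempotent.
        self.finished.add(job_id)
        if self.drop_replies > 0:
            self.drop_replies -= 1
            # Simulate the reply being lost on the way back to the client.
            raise RpcUnavailable("reply lost in transit")
        return "OK"


def call_with_retry(rpc, *args, max_attempts=3, base_delay_s=0.01):
    """Retry on UNAVAILABLE with exponential backoff instead of crashing."""
    for attempt in range(1, max_attempts + 1):
        try:
            return rpc(*args)
        except RpcUnavailable:
            if attempt == max_attempts:
                raise  # give up only after all attempts are exhausted
            time.sleep(base_delay_s * 2 ** (attempt - 1))


gcs = FakeGcs(drop_replies=1)
status = call_with_retry(gcs.mark_job_finished, "job-1")
```

In this sketch the first attempt changes server state but its reply is lost; the second attempt succeeds and the caller sees a normal confirmation.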

We collected the concrete RPC call site and attached it here: call_site.log

Versions / Dependencies

Ray 3.0.0.dev, KubeRay 1.3.0

Reproduction script

Start a RayCluster using KubeRay. Then run the following script via the Ray job submission SDK.

import ray


ray.init()
    
@ray.remote
class Foo:
    def __init__(self, val):
        self.x = val

    def get(self):
        return self.x

x = 1
f = Foo.remote(x)
assert ray.get(f.get.remote()) == x

The transient network failure can be reproduced with a gRPC interceptor.
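The idea behind such an interceptor can be sketched without gRPC itself. The snippet below is a simplified simulation in plain Python, not the interceptor used for this report: `ReplyDropInterceptor`, `InjectedUnavailable`, and the toy `mark_job_finished` handler are hypothetical. It lets the handler run to completion and then replaces the reply with an UNAVAILABLE-style error, which matches the failure described above (the request is processed, but the confirmation never arrives).

```python
class InjectedUnavailable(Exception):
    """Hypothetical stand-in for a gRPC UNAVAILABLE status."""


class ReplyDropInterceptor:
    """Drops the reply of the first `drop_count` calls to `target_method`."""

    def __init__(self, target_method, drop_count=1):
        self.target_method = target_method
        self.remaining = drop_count

    def intercept(self, method_name, handler, request):
        # Run the real handler first, so server-side effects take place.
        reply = handler(request)
        if method_name == self.target_method and self.remaining > 0:
            self.remaining -= 1
            # Discard the reply and surface a transient failure instead.
            raise InjectedUnavailable(f"{method_name}: reply dropped")
        return reply


finished_jobs = []


def mark_job_finished(job_id):
    # Toy handler (assumption): records the job and confirms.
    finished_jobs.append(job_id)
    return "OK"


interceptor = ReplyDropInterceptor("MarkJobFinished", drop_count=1)

try:
    first_result = interceptor.intercept(
        "MarkJobFinished", mark_job_finished, "job-1"
    )
except InjectedUnavailable:
    first_result = "UNAVAILABLE"

# A second call passes through untouched, because the drop budget is used up.
second_result = interceptor.intercept(
    "MarkJobFinished", mark_job_finished, "job-1"
)
```

Note that the handler ran on both calls, so the injected fault only affects what the caller observes, not the server-side state.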

Issue Severity

Medium: It is a significant difficulty but I can work around it.

Labels

P0: Issues that should be fixed in short order
bug: Something that is supposed to be working; but isn't
core: Issues that should be addressed in Ray Core
stability
