[Core] Transient network failure on RPC MarkJobFinished causes node crash

What happened + What you expected to happen

When running Ray in our cluster, we observed a bug in which the head node crashed unexpectedly and triggered a cluster restart. The cause: while the MarkJobFinished RPC response was being returned to the raylet process, a transient network failure occurred, so the client (raylet) received an “UNAVAILABLE” status instead of a confirmation.

Such transient errors are expected to be handled gracefully. However, this one caused an unexpected head node crash.
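
To make "handled gracefully" concrete, here is a minimal sketch of the behavior we would expect from a caller: retry a bounded number of times on a transient error, and surface the error to the caller instead of aborting the process. This is not Ray's implementation. TransientUnavailable, call_with_retry, and flaky_mark_job_finished are hypothetical names used only for illustration, and the exception class is a stand-in for a gRPC UNAVAILABLE status.

```python
import time


class TransientUnavailable(Exception):
    """Hypothetical stand-in for a gRPC UNAVAILABLE status."""


def call_with_retry(rpc, max_attempts=3, base_delay_s=0.01):
    """Call rpc(); retry with exponential backoff on TransientUnavailable."""
    for attempt in range(1, max_attempts + 1):
        try:
            return rpc()
        except TransientUnavailable:
            if attempt == max_attempts:
                # Out of attempts: re-raise so the caller can decide what to do.
                raise
            time.sleep(base_delay_s * 2 ** (attempt - 1))


# Example: a fake RPC (hypothetical) that fails once, then succeeds.
attempts = {"count": 0}


def flaky_mark_job_finished():
    attempts["count"] += 1
    if attempts["count"] == 1:
        raise TransientUnavailable("response lost in transit")
    return "OK"


result = call_with_retry(flaky_mark_job_finished)
```

Retrying is only safe when the RPC is idempotent on the server side, which seems plausible for marking a job finished but is an assumption here.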

We collected the concrete RPC call site and attached it here: call_site.log

Versions / Dependencies

Ray 3.0.0.dev, KubeRay 1.3.0

Reproduction script

Start a RayCluster using KubeRay. Then run the following script using the Ray job submission SDK.

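import ray

# Connect to the Ray cluster.
ray.init()


@ray.remote
class Foo:
    """Minimal actor that stores a value and returns it."""

    def __init__(self, val):
        self.x = val

    def get(self):
        return self.x


x = 1
f = Foo.remote(x)
# Round trip through the actor; the script exits right after this check.
assert ray.get(f.get.remote()) == x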

A transient network failure can be reproduced with a gRPC interceptor.
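
The failure pattern such an interceptor needs to inject can be sketched in plain Python: the handler runs to completion on the server side, but the reply is replaced by an UNAVAILABLE error before it reaches the client. This is a simplified simulation, not the gRPC interceptor API and not the setup we used. Unavailable, FakeJobService, and drop_first_responses are hypothetical names for illustration only.

```python
class Unavailable(Exception):
    """Hypothetical stand-in for a gRPC UNAVAILABLE status."""


class FakeJobService:
    """Hypothetical server whose mark_job_finished is idempotent."""

    def __init__(self):
        self.finished_jobs = set()

    def mark_job_finished(self, job_id):
        self.finished_jobs.add(job_id)
        return "OK"


def drop_first_responses(handler, drops=1):
    """Wrap handler so it still runs, but the first `drops` replies are lost."""
    remaining = {"n": drops}

    def wrapped(*args, **kwargs):
        reply = handler(*args, **kwargs)  # server-side effect has already happened
        if remaining["n"] > 0:
            remaining["n"] -= 1
            raise Unavailable("injected: response dropped on the way back")
        return reply

    return wrapped


service = FakeJobService()
rpc = drop_first_responses(service.mark_job_finished)

# First call: the server records the job, but the client only sees an error.
try:
    rpc("job-1")
    first_outcome = "OK"
except Unavailable:
    first_outcome = "UNAVAILABLE"

# Second call: the reply gets through; server state is unchanged by the repeat.
second_outcome = rpc("job-1")
```

The key property is that the server state has already changed when the client sees the error, which matches the case described above where the failure occurs on the response path.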

Issue Severity

Medium: It is a significant difficulty but I can work around it.
