Description
What happened + What you expected to happen
When running Ray on our cluster, we observed a bug in which the head node crashed and unexpectedly triggered a cluster restart. The cause: while the MarkJobFinished
RPC response was being returned to the raylet process, a transient network failure occurred, so the client (the raylet) received an "UNAVAILABLE" status instead of a confirmation.
Such transient errors are expected to be handled gracefully. However, this one caused an unexpected head node crash.
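To make the failure mode concrete, here is a minimal stdlib-only sketch of a "lost response": the server applies the request, the reply is dropped, and the client sees an error for work that already succeeded. This is not Ray's code. `FakeGcs`, `Unavailable`, `lossy_call`, and `mark_finished_with_retry` are hypothetical names, and the sketch assumes that marking a job finished can safely be made idempotent so that a retry is harmless.

```python
class Unavailable(Exception):
    """Stand-in for an UNAVAILABLE status (hypothetical, not a gRPC or Ray type)."""


class FakeGcs:
    """Hypothetical server that records finished jobs; not Ray's GCS."""

    def __init__(self):
        self.finished = set()
        self.calls = 0

    def mark_job_finished(self, job_id):
        self.calls += 1
        # Idempotent: marking an already-finished job again is a no-op.
        self.finished.add(job_id)
        return "OK"


def lossy_call(server, job_id, drop):
    """Run the handler, then optionally lose the reply on the way back."""
    reply = server.mark_job_finished(job_id)  # server-side state changes here
    if drop:
        raise Unavailable("response lost in transit")
    return reply


def mark_finished_with_retry(server, job_id, drops, max_attempts=3):
    """Retry on Unavailable; the first `drops` replies are lost."""
    for attempt in range(1, max_attempts + 1):
        try:
            return lossy_call(server, job_id, drop=attempt <= drops), attempt
        except Unavailable:
            if attempt == max_attempts:
                raise


gcs = FakeGcs()
reply, attempts = mark_finished_with_retry(gcs, "job-1", drops=1)
```

In this sketch the handler runs twice but the job is recorded once, which is the behavior we would expect from graceful handling of a dropped reply.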
We captured the exact RPC call site and attached it here: call_site.log
Versions / Dependencies
Ray 3.0.0.dev, KubeRay 1.3.0
Reproduction script
Start a RayCluster using KubeRay. Then run the following script with the ray job submit SDK.
import ray

ray.init()


@ray.remote
class Foo:
    def __init__(self, val):
        self.x = val

    def get(self):
        return self.x


x = 1
f = Foo.remote(x)
assert ray.get(f.get.remote()) == x
The transient network failure can be reproduced with a gRPC interceptor.
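The important detail of the injected fault is that the request must reach the server and be handled before the error is raised, so only the response is lost. The sketch below shows that interceptor pattern in plain Python, without gRPC or Ray. `ResponseDropInterceptor`, `InjectedUnavailable`, and the `mark_job_finished` handler are hypothetical names for illustration; a real reproduction would hook the same logic into the gRPC interceptor on the relevant channel.

```python
class InjectedUnavailable(Exception):
    """Stand-in for an injected UNAVAILABLE status (hypothetical type)."""


class ResponseDropInterceptor:
    """Let the handler run, then fail the response path for one method."""

    def __init__(self, target_method, failures=1):
        self.target_method = target_method
        self.remaining = failures  # how many responses are still to be dropped

    def intercept(self, method, handler, request):
        # The request is delivered and processed first.
        response = handler(request)
        # Only afterwards is the reply "lost" for the targeted method.
        if method == self.target_method and self.remaining > 0:
            self.remaining -= 1
            raise InjectedUnavailable(f"{method}: response dropped")
        return response


processed = []  # server-side record of handled requests


def mark_job_finished(job_id):
    """Hypothetical handler standing in for the real RPC handler."""
    processed.append(job_id)
    return "OK"


interceptor = ResponseDropInterceptor("MarkJobFinished", failures=1)

try:
    interceptor.intercept("MarkJobFinished", mark_job_finished, "job-1")
    first_outcome = "OK"
except InjectedUnavailable:
    first_outcome = "UNAVAILABLE"

second_outcome = interceptor.intercept("MarkJobFinished", mark_job_finished, "job-1")
```

The first call fails from the client's point of view even though the handler already ran, which matches the condition described in this report.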
Issue Severity
Medium: It is a significant difficulty but I can work around it.