Description
What happened + What you expected to happen
When running Ray in our cluster, we observed a bug where a transient network failure on the RPC WaitForActorRefDeleted caused actor registration to fail. Here is the concrete call site of the RPC that encountered the transient error: call_site.log
Here is the error output:
```
E class_name: test_actor_api.<locals>.Foo
E actor_id: 9ba41b4ddd98b0554ef05f0204000000
E namespace: 345c7114-a740-49cd-9c7f-af89bb3dea24
E The actor is dead because all references to the actor were removed including lineage ref count.
E The actor never ran - it was cancelled before it started running.
```
By analyzing the call site, we noticed that the callback of WaitForActorRefDeleted calls DestroyActor without checking the returned status. The root cause may be that the transient error made the GCS destroy the actor instead of retrying the wait, so the actor was cancelled before it ever started running.
See ray/src/ray/gcs/gcs_server/gcs_actor_manager.cc, lines 1060 to 1066 at commit 7337f2a.
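To make the suspected fix concrete, here is a minimal Python sketch of the status-checking pattern the callback could follow. The actual code is C++ in gcs_actor_manager.cc; `Status`, `destroy_actor`, and `retry` below are hypothetical stand-ins, not Ray's real API.
```python
from dataclasses import dataclass

TRANSIENT_CODES = {"UNAVAILABLE", "DEADLINE_EXCEEDED"}  # assumed transient gRPC codes

@dataclass
class Status:
    code: str  # e.g. "OK", "UNAVAILABLE"

    def ok(self):
        return self.code == "OK"

    def is_transient(self):
        return self.code in TRANSIENT_CODES

def on_wait_for_actor_ref_deleted(status, actor_id, destroy_actor, retry):
    """Only destroy the actor when the RPC actually succeeded."""
    if status.ok():
        destroy_actor(actor_id)   # safe: all references are really gone
    elif status.is_transient():
        retry(actor_id)           # transient failure: retry the wait, don't destroy
    else:
        raise RuntimeError(f"WaitForActorRefDeleted failed: {status.code}")
```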
The expected behavior is that the transient network failure is handled (e.g., the RPC is retried) and the job completes successfully.
Versions / Dependencies
Ray 3.0.0.dev, KubeRay 1.3.0
Reproduction script
Start a RayCluster using KubeRay, then submit the following script with the Ray Job Submission SDK (a submission example follows the script below).
```python
import ray

ray.init()

@ray.remote
class Foo:
    def __init__(self, val):
        self.x = val

    def get(self):
        return self.x

x = 1
f = Foo.remote(x)
assert ray.get(f.get.remote()) == x
```
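For reference, a minimal sketch of submitting the script with the Ray Jobs SDK; the dashboard address and the repro.py filename are placeholders for your cluster:
```python
from ray.job_submission import JobSubmissionClient

client = JobSubmissionClient("http://<head-node-ip>:8265")  # RayCluster dashboard
job_id = client.submit_job(
    entrypoint="python repro.py",        # the script above, saved as repro.py
    runtime_env={"working_dir": "."},
)
print(client.get_job_status(job_id))
```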
The transient network failure can be injected with a gRPC interceptor.
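For illustration, here is a generic Python sketch of injecting a transient failure with a gRPC client interceptor. Note that WaitForActorRefDeleted is issued from Ray's C++ core, so the actual reproduction has to hook in at that layer; the `InjectedUnavailable` class, method substring matching, and channel address below are illustrative assumptions.
```python
import grpc

class InjectedUnavailable(grpc.RpcError):
    """Injected error mimicking a transient UNAVAILABLE status."""
    def code(self):
        return grpc.StatusCode.UNAVAILABLE

    def details(self):
        return "injected transient network failure"

class TransientFailureInterceptor(grpc.UnaryUnaryClientInterceptor):
    """Fail the first `failures` calls whose method path matches,
    then pass traffic through, simulating a transient network error."""

    def __init__(self, method_substring, failures=1):
        self._method_substring = method_substring
        self._remaining = failures

    def intercept_unary_unary(self, continuation, client_call_details, request):
        if self._remaining > 0 and self._method_substring in client_call_details.method:
            self._remaining -= 1
            raise InjectedUnavailable()
        return continuation(client_call_details, request)

# Usage: wrap a channel so the first matching RPC fails once.
channel = grpc.intercept_channel(
    grpc.insecure_channel("127.0.0.1:6379"),  # placeholder GCS address
    TransientFailureInterceptor("WaitForActorRefDeleted"),
)
```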
Issue Severity
Medium: It is a significant difficulty but I can work around it.