[Core] Transient network failure on RPC `WaitForActorRefDeleted` causes actor registration fail · Issue #53797 · ray-project/ray · GitHub
[Core] Transient network failure on RPC WaitForActorRefDeleted causes actor registration fail #53797
Open
@qts0312

What happened + What you expected to happen

When running Ray in our cluster, we observed a bug in which a transient network failure on the RPC WaitForActorRefDeleted caused actor registration to fail. The concrete call site of the RPC that encountered the transient error is attached: call_site.log

Here is the resulting error message.

E                               class_name: test_actor_api.<locals>.Foo
E                               actor_id: 9ba41b4ddd98b0554ef05f0204000000
E                               namespace: 345c7114-a740-49cd-9c7f-af89bb3dea24
E                       The actor is dead because all references to the actor were removed including lineage ref count.
E                       The actor never ran - it was cancelled before it started running.

By analyzing the call site, we noticed that the callback of WaitForActorRefDeleted calls DestroyActor without checking the returned status. The root cause may be that the transient error made GCS destroy the actor without actually waiting for its references to be deleted, so the actor was cancelled before it started running.

if (node_it != owners_.end() && node_it->second.count(owner_id)) {
  // Only destroy the actor if its owner is still alive. The actor may
  // have already been destroyed if the owner died.
  DestroyActor(actor_id,
               GenActorRefDeletedCause(GetActor(actor_id)),
               /*force_kill=*/true);
}
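To make the suspected problem concrete, here is a minimal Python model of a reply callback that inspects the RPC status before destroying anything. It is only a sketch of the idea, not Ray's C++ code: `Status`, `ActorManagerModel`, `pending_retries`, and the decision to retry on failure are all hypothetical names and choices made for illustration, and the issue does not say how an actual fix should behave.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    """Hypothetical RPC outcome; UNAVAILABLE stands in for a transient network error."""
    OK = "ok"
    UNAVAILABLE = "unavailable"


@dataclass
class ActorManagerModel:
    """Toy model of GCS-side bookkeeping; not Ray's real class."""
    alive_actors: set = field(default_factory=set)
    pending_retries: list = field(default_factory=list)

    def on_wait_for_actor_ref_deleted_reply(self, actor_id: str, status: Status) -> None:
        if status is not Status.OK:
            # The RPC itself failed, so nothing was learned about the
            # reference count. Record a retry instead of destroying the actor.
            self.pending_retries.append(actor_id)
            return
        # A successful reply means the owner reports that all references are gone.
        self.alive_actors.discard(actor_id)


# Usage: a failed reply leaves the actor alive, a successful one removes it.
manager = ActorManagerModel(alive_actors={"actor-1"})
manager.on_wait_for_actor_ref_deleted_reply("actor-1", Status.UNAVAILABLE)
survived_failure = "actor-1" in manager.alive_actors
manager.on_wait_for_actor_ref_deleted_reply("actor-1", Status.OK)
destroyed_after_ok = "actor-1" not in manager.alive_actors
```

The point of the model is the early return: a failed reply is treated as "no information" rather than as confirmation that the references were deleted.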

The expected behavior is that the transient network failure is handled and the job executes properly.

Versions / Dependencies

Ray 3.0.0.dev, KubeRay 1.3.0

Reproduction script

Start a RayCluster using KubeRay. Then submit the following script using the ray job submit SDK.

import ray


ray.init()
    
@ray.remote
class Foo:
    def __init__(self, val):
        self.x = val

    def get(self):
        return self.x

x = 1
f = Foo.remote(x)
assert ray.get(f.get.remote()) == x

A transient network failure can be reproduced with a gRPC interceptor.
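The interceptor used for the reproduction is not included in this report. As a rough illustration of the fault-injection idea only, the following pure-Python stand-in fails the first call to a named method and then passes later calls through. It does not use the gRPC library, and `TransientError`, `FaultInjector`, and `call_with_retry` are hypothetical names introduced here.

```python
class TransientError(Exception):
    """Stands in for a transient RPC failure such as a gRPC UNAVAILABLE status."""


class FaultInjector:
    """Fails the first `failures` calls to `target_method`, then passes through."""

    def __init__(self, target_method: str, failures: int = 1):
        self.target_method = target_method
        self.remaining = failures

    def intercept(self, method: str, call):
        if method == self.target_method and self.remaining > 0:
            self.remaining -= 1
            raise TransientError(f"injected failure for {method}")
        return call()


def call_with_retry(injector: FaultInjector, method: str, call, attempts: int = 3):
    """Retry on TransientError, re-raising if the last attempt also fails."""
    for attempt in range(attempts):
        try:
            return injector.intercept(method, call)
        except TransientError:
            if attempt == attempts - 1:
                raise


# Usage: one injected failure, then the retried call succeeds.
injector = FaultInjector("WaitForActorRefDeleted", failures=1)
result = call_with_retry(injector, "WaitForActorRefDeleted", lambda: "reply")
```

A real reproduction would hook into the gRPC layer instead of wrapping a plain callable, but the shape is the same: drop or fail a specific RPC a bounded number of times and observe how the caller reacts.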

Issue Severity

Medium: It is a significant difficulty but I can work around it.

Metadata

Labels

P1: Issue that should be fixed within a few weeks
bug: Something that is supposed to be working; but isn't
core: Issues that should be addressed in Ray Core
stability
