Description
What happened + What you expected to happen
When running Ray in our cluster, we observed a bug where a transient network failure on the RPC WaitForActorRefDeleted caused actor registration to fail. Here is the concrete call site of the RPC that encountered the transient error: call_site.log
Here is the error output:
```
E class_name: test_actor_api.<locals>.Foo
E actor_id: 9ba41b4ddd98b0554ef05f0204000000
E namespace: 345c7114-a740-49cd-9c7f-af89bb3dea24
E The actor is dead because all references to the actor were removed including lineage ref count.
E The actor never ran - it was cancelled before it started running.
```
By analyzing the call site, we noticed that the callback of WaitForActorRefDeleted calls DestroyActor without checking the returned status. The root cause may be that the transient error made the GCS destroy the actor instead of retrying the wait, so the actor was cancelled before it ever started running.
See ray/src/ray/gcs/gcs_server/gcs_actor_manager.cc, lines 1060 to 1066 at commit 7337f2a.
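To make the suspected fix concrete, here is a minimal Python sketch of the status-checking pattern the callback could follow. The actual code is C++ in gcs_actor_manager.cc; `Status`, `destroy_actor`, and `retry` below are hypothetical stand-ins, not Ray's real API.
```python
from dataclasses import dataclass

TRANSIENT_CODES = {"UNAVAILABLE", "DEADLINE_EXCEEDED"}  # assumed transient gRPC codes

@dataclass
class Status:
    code: str  # e.g. "OK", "UNAVAILABLE"

    def ok(self):
        return self.code == "OK"

    def is_transient(self):
        return self.code in TRANSIENT_CODES

def on_wait_for_actor_ref_deleted(status, actor_id, destroy_actor, retry):
    """Only destroy the actor when the RPC actually succeeded."""
    if status.ok():
        destroy_actor(actor_id)   # safe: all references are really gone
    elif status.is_transient():
        retry(actor_id)           # transient failure: retry the wait, don't destroy
    else:
        raise RuntimeError(f"WaitForActorRefDeleted failed: {status.code}")
```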
The expected behavior is that the transient network failure is handled (e.g., the RPC is retried) and the job completes successfully.
Versions / Dependencies
Ray 3.0.0.dev, KubeRay 1.3.0
Reproduction script
Start a RayCluster using KubeRay, then submit the following script with the Ray Job Submission SDK (a submission example follows the script below).
```python
import ray

ray.init()

@ray.remote
class Foo:
    def __init__(self, val):
        self.x = val

    def get(self):
        return self.x

x = 1
f = Foo.remote(x)
assert ray.get(f.get.remote()) == x
```
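For reference, a minimal sketch of submitting the script with the Ray Jobs SDK; the dashboard address and the repro.py filename are placeholders for your cluster:
```python
from ray.job_submission import JobSubmissionClient

client = JobSubmissionClient("http://<head-node-ip>:8265")  # RayCluster dashboard
job_id = client.submit_job(
    entrypoint="python repro.py",        # the script above, saved as repro.py
    runtime_env={"working_dir": "."},
)
print(client.get_job_status(job_id))
```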
The transient network failure can be injected with a gRPC interceptor.
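For illustration, here is a generic Python sketch of injecting a transient failure with a gRPC client interceptor. Note that WaitForActorRefDeleted is issued from Ray's C++ core, so the actual reproduction has to hook in at that layer; the `InjectedUnavailable` class, method substring matching, and channel address below are illustrative assumptions.
```python
import grpc

class InjectedUnavailable(grpc.RpcError):
    """Injected error mimicking a transient UNAVAILABLE status."""
    def code(self):
        return grpc.StatusCode.UNAVAILABLE

    def details(self):
        return "injected transient network failure"

class TransientFailureInterceptor(grpc.UnaryUnaryClientInterceptor):
    """Fail the first `failures` calls whose method path matches,
    then pass traffic through, simulating a transient network error."""

    def __init__(self, method_substring, failures=1):
        self._method_substring = method_substring
        self._remaining = failures

    def intercept_unary_unary(self, continuation, client_call_details, request):
        if self._remaining > 0 and self._method_substring in client_call_details.method:
            self._remaining -= 1
            raise InjectedUnavailable()
        return continuation(client_call_details, request)

# Usage: wrap a channel so the first matching RPC fails once.
channel = grpc.intercept_channel(
    grpc.insecure_channel("127.0.0.1:6379"),  # placeholder GCS address
    TransientFailureInterceptor("WaitForActorRefDeleted"),
)
```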
Issue Severity
Medium: It is a significant difficulty but I can work around it.