Description
This issue follows up on the comment here: https://github.com/ray-project/ray/pull/14122/files#diff-10f3fda5ddb0ff3dbb8f347dd7fc53101d2dd140585e72f2d55be831bd5455dbR134
What is the problem?
In most cases, a client object's lifetime matches the lifetime of its ID, but this is not the case for named actors. Reverting the call referenced above to non-blocking would improve performance.
Reproduction outline
Here's how named actors fail:
import ray

@ray.remote
class Actor:
    def do_it(self):
        pass
a = Actor.options(name="my_actor", lifetime="detached").remote()
# a has ActorRef 123, which is held in the server
del a
# We will (non-blocking) send a message to release 123 on the server side... sometime
b = ray.get_actor("my_actor")
# The server marks 123 as held in the set, which it already is!
# Now the non-blocking release comes in! It releases 123 on the server side, but we still have b as a reference on the client side
b.do_it.remote()
# Crashes here because now the server side doesn't have a reference to 123.
Potential fixes include: reattaching actor references if they have been cleaned up; better client logic around when and how to release objects; or finishing all queued releases, under mutual exclusion, before get_actor() runs (so the reference is removed and then recreated, rather than the release landing on top of the new reference).
Of these, the last is probably the most flexible. On client release (__del__), hold a lock that is released once the release message finishes (soft-blocking anything that needs all releases flushed, including other release messages), and have get_actor require that lock. All other messages proceed as usual, and in the common case the release lock does not block execution.
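
The lock-based idea above can be sketched roughly as follows. This is a minimal simulation, not Ray's client code: FakeServer, Client, and every method name here are hypothetical stand-ins invented for illustration, and the sleep merely imitates a slow non-blocking release message. It relies on the fact that a plain threading.Lock may be released by a thread other than the one that acquired it.

```python
import threading
import time


class FakeServer:
    """Hypothetical stand-in for the server-side set of held references."""

    def __init__(self):
        self._mutex = threading.Lock()
        self.held = set()
        self.named = {}

    def create(self, name, ref_id):
        with self._mutex:
            self.named[name] = ref_id
            self.held.add(ref_id)

    def release(self, ref_id):
        time.sleep(0.05)  # imitate a release message that arrives late
        with self._mutex:
            self.held.discard(ref_id)

    def get_actor(self, name):
        with self._mutex:
            ref_id = self.named[name]
            self.held.add(ref_id)  # no-op if the ID is already held
            return ref_id

    def call(self, ref_id):
        with self._mutex:
            if ref_id not in self.held:
                raise RuntimeError(f"no server-side reference to {ref_id}")
            return "ok"


class Client:
    """Hypothetical client that serializes releases against get_actor."""

    def __init__(self, server):
        self._server = server
        self._release_lock = threading.Lock()

    def release(self, ref_id):
        # Taken synchronously (this is what __del__ would do in the proposal),
        # then handed to the background sender, which frees it when done.
        self._release_lock.acquire()

        def _send():
            try:
                self._server.release(ref_id)
            finally:
                self._release_lock.release()

        threading.Thread(target=_send, daemon=True).start()

    def get_actor(self, name):
        # Waits for any in-flight release, so the reference is removed
        # and then recreated instead of being released underneath us.
        with self._release_lock:
            return self._server.get_actor(name)


if __name__ == "__main__":
    server = FakeServer()
    client = Client(server)
    server.create("my_actor", 123)
    client.release(123)                # like `del a`
    b = client.get_actor("my_actor")   # blocks until the release lands
    print(server.call(b))
```

In this sketch only release and get_actor touch the release lock, so other calls are unaffected; a second release issued while one is still in flight waits briefly, which mirrors the soft-blocking behavior described above.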