[Ray Core] Detached actor doesn't finish method after the client disconnects · Issue #53665 · ray-project/ray · GitHub
Open
@pinduzera

Description

What happened + What you expected to happen

Observed at least since Ray 2.34.

To reproduce, connect to a remote cluster. Do not use a local cluster started by ray.init(), otherwise the error does not show up on the dashboard.

The goal is to launch a job without the Job API, because the dashboard does not offer TLS by default, and to rely on a detached actor instead.

Versions / Dependencies

From Ray 2.34 up to at least 2.45.

Reproduction script

I spawn a detached actor after ray.init() and kick off a method on it that should keep running after my local connection exits, but the method fails if I drop the connection:

import ray
from time import sleep

address = "ray://localhost:10001"

@ray.remote
class myactor:
    def __init__(self):
        self.state = 0

    def increment(self):
        self.state += 1
        sleep(5)
        with open("/tmp/simple_test.txt", "w") as f:
            f.write(f"done {self.state}\n")
        return self.state

conn = ray.init(address = address)

acthdl = myactor.options(name= "testact", lifetime="detached").remote()

# We are not waiting for this future; the call should keep running after we disconnect.
fut = acthdl.increment.remote()

# sleep(6) ## if you uncomment, it works as intended

conn.disconnect()

If you run this and open the dashboard, you will see that acthdl.increment.remote() never finishes (the file is never written). The task fails with a complaint about the driver exiting. I expected it to keep running, since this is a "detached" actor.

Error Type: WORKER_DIED

Job finishes (07000000) as driver exits. Marking all non-terminal tasks as failed.
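
To check from a fresh connection whether the detached actor itself survived the disconnect, something like the sketch below may help. It assumes the actor was created without an explicit namespace (as in the script above), so it searches all namespaces with ray.util.list_named_actors before calling ray.get_actor. The helper names find_namespace and inspect_detached_actor are hypothetical, and the final call to increment is blocking and changes the actor's state, so it is only meant for debugging.

```python
def find_namespace(entries, actor_name):
    """Return the namespace of the first entry whose name matches, else None."""
    for entry in entries:
        if entry.get("name") == actor_name:
            return entry.get("namespace")
    return None


def inspect_detached_actor(address, actor_name="testact"):
    """Reconnect, look up the named actor in any namespace, and call increment()."""
    import ray  # imported here so find_namespace is usable without Ray installed

    ray.init(address=address)
    try:
        # With all_namespaces=True each entry is a dict with "name" and "namespace".
        entries = ray.util.list_named_actors(all_namespaces=True)
        namespace = find_namespace(entries, actor_name)
        if namespace is None:
            return None  # the actor is not registered (anymore)
        handle = ray.get_actor(actor_name, namespace=namespace)
        # Blocking call: waits for increment() and returns the actor's counter.
        return ray.get(handle.increment.remote())
    finally:
        ray.shutdown()


if __name__ == "__main__":
    print(inspect_detached_actor("ray://localhost:10001"))
```

If the lookup returns a handle, the actor process outlived the client and only the in-flight method call was marked as failed.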

If you rerun with the sleep enabled (a ray.get or ray.wait would also do, but I don't need the result), the method completes. That defeats the purpose, though, because the call becomes blocking, which I don't want.
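
One pattern that might shorten the blocking window is to have the actor method only start the slow work on a background thread and return immediately, so the client waits for a quick acknowledgement rather than for the whole job. This is a minimal sketch under that assumption, not a confirmed fix: whether it avoids the WORKER_DIED failure in the affected Ray versions has not been verified here. BackgroundWorker, start, is_done and submit are hypothetical names.

```python
import threading
import time


class BackgroundWorker:
    """Plain class; wrap it with ray.remote(BackgroundWorker) to use it as an actor."""

    def __init__(self):
        self.state = 0
        self._thread = None

    def _work(self, path, delay):
        # The slow part of the job, executed on a thread inside the actor process.
        time.sleep(delay)
        self.state += 1
        with open(path, "w") as f:
            f.write(f"done {self.state}\n")

    def start(self, path, delay=5.0):
        # Returns right away; the caller only waits for this acknowledgement.
        self._thread = threading.Thread(
            target=self._work, args=(path, delay), daemon=True
        )
        self._thread.start()
        return True

    def is_done(self):
        return self._thread is not None and not self._thread.is_alive()


def submit(address, path="/tmp/simple_test.txt"):
    """Create the detached actor, start the work, and disconnect."""
    import ray  # imported here so the class above can be tried without a cluster

    ray.init(address=address)
    actor = ray.remote(BackgroundWorker).options(
        name="testact", lifetime="detached"
    ).remote()
    ray.get(actor.start.remote(path))  # short wait: start() returns immediately
    ray.shutdown()
```

The client still calls ray.get once, but only on start(), which returns as soon as the thread is launched. A later connection could poll is_done() to see whether the work finished.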

A possible workaround is a two-level approach:

import ray
from time import sleep
address = "ray://localhost:10001"

@ray.remote
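class launcher:
    def __init__(self):
        # Fan out to 10 worker actors and start one increment() call on each.
        hdls = [myactor.options(num_cpus=0).remote() for _ in range(10)]
        states = [hdl.increment.remote() for hdl in hdls]

        # Block inside the detached launcher (not the client) until all calls finish.
        st = sum(ray.get(states))
        with open("/tmp/test.txt", "w") as f:
            f.write(f"done {st}\n")
        # Clean up the worker actors once the result is written.
        for act in hdls:
            ray.kill(act)
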
@ray.remote
class myactor():
    def __init__(self):
        self.state = 0

    def increment(self):
        self.state += 1
        sleep(5)
        return self.state
    

cli = ray.init(address = address)

acthdl = launcher.options(name= "jobLauncher", lifetime="detached").remote()

cli.disconnect()

This kicks off the 10 myactor instances and runs them. When the launcher actor finishes instantiating itself, it reports the same WORKER_DIED error, but by then all the branched calls have executed. The launcher actor stays in place, since it is detached. Still, it does not seem able to finish its method without the active connection that requested it.

The odd part is that I can't leave a method running in a detached actor (a "detached method", perhaps?). You would expect that to work, but the call appears to be bound to the connection. This implies that we always have to wait for the actor method to finish and can't move on.

Issue Severity

High: It blocks me from completing my task.

Metadata

Assignees

No one assigned

Labels

P1, core, question, stability
