Description
What happened + What you expected to happen
This has happened since at least Ray 2.34.
To reproduce, connect to a remote cluster rather than a local one started by ray.init(); otherwise the error does not show up on the dashboard.
The goal is to launch a job without using the Job API, since the dashboard does not offer TLS by default, by relying on a detached actor instead.
Versions / Dependencies
Ray 2.34 through at least 2.45.
Reproduction script
I am spawning a detached actor after connecting with ray.init and kicking off a method on it that I want to keep running after my local connection exits, but the method fails if I drop the connection right away:
import ray
from time import sleep

address = "ray://localhost:10001"

@ray.remote
class myactor:
    def __init__(self):
        self.state = 0

    def increment(self):
        self.state += 1
        sleep(5)
        # marker file: only written if the method runs to completion
        with open("/tmp/simple_test.txt", "w") as f:
            f.write(f"done {self.state}\n")
        return self.state

conn = ray.init(address=address)
acthdl = myactor.options(name="testact", lifetime="detached").remote()
# we are not waiting for this future, we just want it to keep doing its thing after we disconnect
fut = acthdl.increment.remote()
# sleep(6)  # if you uncomment this, it works as intended
conn.disconnect()
If you run this and open the dashboard, you will notice that acthdl.increment.remote() never finishes (you can see that the file is never written); it complains about the driver exiting and fails. I expected it to keep running, since this is a "detached" actor.
Error Type: WORKER_DIED
Job finishes (07000000) as driver exits. Marking all non-terminal tasks as failed.
If you rerun with the sleep enabled (it could equally be a ray.get or ray.wait, but I don't need the result), the method completes. That defeats the purpose, though, because the call becomes blocking, which I don't want.
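For reference, here is a minimal sketch of what the blocking variant could look like with ray.wait and a timeout instead of a fixed sleep. It assumes the same Ray Client address as the reproduction script. The helper name run_and_wait, its timeout_s parameter, and the actor name "testact_wait" are hypothetical and only used for illustration. It still blocks the caller, so it is not a fix for the problem described here.

```python
from time import sleep


def run_and_wait(address: str, timeout_s: float) -> bool:
    """Start increment() on a detached actor and wait up to timeout_s before disconnecting.

    Returns True if the call finished within the timeout, False otherwise.
    """
    import ray  # imported here so the helper can be defined without Ray being connected

    @ray.remote
    class myactor:
        def __init__(self):
            self.state = 0

        def increment(self):
            self.state += 1
            sleep(5)
            return self.state

    conn = ray.init(address=address)
    try:
        # "testact_wait" is a hypothetical name, chosen to avoid clashing with "testact"
        acthdl = myactor.options(name="testact_wait", lifetime="detached").remote()
        fut = acthdl.increment.remote()
        # ray.wait returns (ready, not_ready); it blocks until fut is ready
        # or the timeout expires, whichever comes first
        ready, _ = ray.wait([fut], timeout=timeout_s)
        return len(ready) == 1
    finally:
        # disconnect only after the wait, which is exactly the blocking step
        # this issue is trying to avoid
        conn.disconnect()
```

Calling run_and_wait("ray://localhost:10001", timeout_s=10) should behave like the sleep(6) variant: the caller stays connected until the method returns or the timeout expires.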
A possible workaround I found is a two-level approach:
import ray
from time import sleep

address = "ray://localhost:10001"

@ray.remote
class launcher():
    def __init__(self):
        hdls = [myactor.options(num_cpus=0).remote() for i in range(10)]
        states = [hdl.increment.remote() for hdl in hdls]
        st = sum(ray.get(states))
        with open("/tmp/test.txt", "w") as f:
            f.write(f"done {st}\n")
        allkilled = [ray.kill(act) for act in hdls]

@ray.remote
class myactor():
    def __init__(self):
        self.state = 0

    def increment(self):
        self.state += 1
        sleep(5)
        return self.state

cli = ray.init(address=address)
acthdl = launcher.options(name="jobLauncher", lifetime="detached").remote()
cli.disconnect()
This kicks off the 10 myactor instances and runs them, but when the launcher actor finishes instantiating itself it eventually reports the same WORKER_DIED error. At least this time all the fanned-out calls were executed, and the actor is kept alive since it is detached. Still, it does not seem to be able to finish its method without the active connection that requested it.
The odd part is that I can't leave a method running in a detached actor (maybe a "detached method" is needed?). You would expect that to work, but the method call appears to be bound to the connection that issued it. This implies that we always have to wait for the actor method to finish before disconnecting and can't proceed in the meantime.
Issue Severity
High: It blocks me from completing my task.