Description
Search before asking
- I searched the issues and found no similar issues.
Ray Component
Ray Core
What happened + What you expected to happen
Currently, Ray's error handling works as follows:
- If a remote call fails, the returned object ref contains the exception.
- If the object ref is never passed to ray.get and goes out of scope, the error message is printed to the caller.
- If the object ref is passed to ray.get, ray.get raises the exception.
Note that in the past, we handled this by always logging exceptions to log files. We removed that behavior to get a better error handling model (we don't want to print errors before ray.get is called).
The problem is that this model doesn't work well with detached actors. For example:
- The driver creates a detached actor
- A method on the detached actor raises an exception
- The driver exits before the method raises the exception
In this case, the detached actor raises an exception, but there is no way to find out because the exception is not logged. From the user's perspective, everything appears to have succeeded, when in fact the actor method failed. I think this can be problematic for some detached-actor workloads, especially when detached actors are used for "service"-like applications.
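Until exceptions like this are logged by Ray itself, one possible user-side workaround is to log inside the actor method before re-raising. The following is a minimal sketch in plain Python, not part of Ray's API: the decorator name `log_exceptions` and the logger name `actor_errors` are hypothetical, and where the log output ends up depends on how logging is configured in the worker process.

```python
import functools
import logging

# Hypothetical logger name; not something Ray defines.
logger = logging.getLogger("actor_errors")


def log_exceptions(method):
    """Log any exception raised by `method` in-process, then re-raise it."""

    @functools.wraps(method)
    def wrapper(*args, **kwargs):
        try:
            return method(*args, **kwargs)
        except Exception:
            # Records the message and traceback where the method runs,
            # so the failure does not depend on a caller being alive.
            logger.exception("Unhandled exception in %s", method.__qualname__)
            raise

    return wrapper


class A:
    @log_exceptions
    def s(self):
        raise ValueError("abc")
```

The assumption is that the same decorator would be applied to the methods of a class decorated with @ray.remote. The exception is still re-raised, so a caller that does call ray.get sees the same failure as before.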
Versions / Dependencies
master
Reproduction script
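import time

import ray

ray.init("auto")


@ray.remote
class A:
    def r(self):
        pass

    def s(self):
        # Fail only after the driver below has already exited.
        time.sleep(10)
        raise ValueError("abc")


a = A.options(lifetime="detached").remote()
# Wait until the actor is up.
ray.get(a.r.remote())
# Fire and forget: the returned object ref is dropped.
a.s.remote()
# The driver exits after 2 seconds, before s() raises.
time.sleep(2)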
After 10 seconds, the detached actor's s method fails with ValueError, but there is no way to know this because the exception is not logged.
Anything else
No response
Are you willing to submit a PR?
- Yes I am willing to submit a PR!