[core] Recover intermediate objects if needed while generator running by dayshah · Pull Request #53999 · ray-project/ray · GitHub

[core] Recover intermediate objects if needed while generator running #53999


Open
wants to merge 10 commits into master

Conversation

@dayshah (Contributor) commented Jun 22, 2025

Problem

Consider this sequence of events:

  1. A node goes down and an object is lost. The reconstruction of this object depends on a streaming generator output. That generator output was already used and destroyed because its (non-lineage) ref count dropped to 0, but the streaming generator is still running to produce more objects.
  2. We now try to recover the streaming generator output by telling the task manager to resubmit the task.
  3. The task manager will see that the task is not finished / failed and will assume that the object is currently being recovered.
  4. The streaming generator will finish and never be rerun. The object will never be recovered and the retry that depended on that object will never move past PENDING_ARGS_AVAILABLE.

For concrete examples, see the Python tests added here. They would hang forever without this fix.
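To make the timeline concrete, here is a minimal sketch of the kind of scenario the tests exercise, not the tests themselves. The task names, resource labels, and sizes are made up, and it assumes a recent Ray where a remote generator function returns a streaming ObjectRefGenerator by default:

import time
import numpy as np
import ray
from ray.cluster_utils import Cluster

cluster = Cluster()
cluster.add_node(num_cpus=0)  # head node, runs no tasks
ray.init(address=cluster.address)
cluster.add_node(num_cpus=1, resources={"gen": 1})           # survives
dying = cluster.add_node(num_cpus=1, resources={"sink": 1})  # will be killed

@ray.remote(max_retries=-1, resources={"gen": 1})
def slow_generator():
    for _ in range(5):
        yield np.zeros(10 * 1024 * 1024, dtype=np.uint8)  # plasma-backed output
        time.sleep(5)  # keep the generator running while recovery kicks in

@ray.remote(max_retries=-1, resources={"sink": 1})
def consumer(arr):
    return arr + 1  # large output, stored on the consumer's (dying) node

gen = slow_generator.remote()
first = next(gen)                    # first streaming output
result_ref = consumer.remote(first)
ray.wait([result_ref])               # consumer finished on the dying node
del first                            # non-lineage ref count drops to 0; the output is freed

# Lose the node holding the consumer's output. Recovering result_ref means
# re-running consumer, which needs `first`, which can only come from the
# still-running generator. Before this fix, this ray.get could hang forever.
cluster.remove_node(dying)
cluster.add_node(num_cpus=1, resources={"sink": 1})
print(ray.get(result_ref).shape)

The key points are that the generator is still running when the node dies and that the lost argument can only be reproduced by that generator.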

Solution

The solution is to cancel the running generator and resubmit it. We have to cancel rather than just wait for regular completion before resubmitting, because deadlock is possible due to generator backpressure, e.g. if calling ray.get(next(generator_ref)) depends on the completion of another task that needs the previous output of the generator. Note: cancelling in the backpressure waiter is a follow-up.
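To illustrate the backpressure hazard, here is a hedged sketch of the consumption pattern described above. The task names are made up, and the _generator_backpressure_num_objects option is assumed as a way to bound how far the generator can run ahead:

import ray

# At most one unconsumed output may be outstanding at a time (assumed option).
@ray.remote(_generator_backpressure_num_objects=1)
def backpressured_gen():
    for i in range(10):
        yield i

@ray.remote
def stage(x):
    return x * 2

gen = backpressured_gen.remote()
out = next(gen)
ray.get(stage.remote(out))  # downstream task that needs the previous output
# If the first output were lost here and recovery only waited for the generator
# to finish before resubmitting, the generator could stay blocked on backpressure
# (its outputs are not being consumed) while this ray.get waits on it, i.e. a
# deadlock. Cancelling the running generator first breaks that cycle.
ray.get(next(gen))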

Now, if we find out that a streaming generator output is lost and we can't pin a secondary location, this is what will happen based on the current generator task status (a sketch of this decision logic follows the list):

  1. If the task has been pushed to the worker and not completed → Cancel and Resubmit
  2. If the task is done (finished/failed) → Resubmit the task
  3. If the task is submitted, but hasn’t been pushed to the worker → Do nothing
  4. If the task is already in the submitter’s generators_to_resubmit_ map → Do nothing
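For reference, a minimal Python sketch of that decision table; it is purely illustrative, and TaskStatus, generators_to_resubmit, cancel_and_resubmit, and resubmit are hypothetical names rather than the actual C++ identifiers:

from enum import Enum, auto

class TaskStatus(Enum):
    PUSHED_TO_WORKER = auto()  # running on the executing worker
    FINISHED = auto()
    FAILED = auto()
    SUBMITTED = auto()         # queued, not yet pushed to a worker

def handle_lost_generator_output(status, task_id, generators_to_resubmit,
                                 cancel_and_resubmit, resubmit):
    # 4. Already queued for cancel + resubmit: do nothing.
    if task_id in generators_to_resubmit:
        return
    # 1. Still running on the worker: cancel it, then resubmit.
    if status is TaskStatus.PUSHED_TO_WORKER:
        cancel_and_resubmit(task_id)
    # 2. Finished or failed: just resubmit.
    elif status in (TaskStatus.FINISHED, TaskStatus.FAILED):
        resubmit(task_id)
    # 3. Submitted but not yet pushed to a worker: do nothing, it will
    #    run and produce the outputs anyway.
    else:
        return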

Follow-ups

We own the backpressure waiter; it's not user-defined code. If the generator is blocked on the executor due to backpressure, we can signal it when we get a cancel request to make it exit there. This ensures deadlock is impossible.
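A hedged sketch of the idea, using a hypothetical class rather than Ray's actual waiter: the executor blocks on a waiter object that core code owns, so a cancel request can set a flag and wake it instead of leaving it blocked forever.

import threading

class BackpressureWaiter:
    """Hypothetical sketch of a cancellable backpressure waiter."""

    def __init__(self):
        self._event = threading.Event()
        self._cancelled = False

    def notify_consumed(self):
        # The caller consumed an output; the generator may produce the next one.
        self._event.set()

    def cancel(self):
        # A cancel request arrived; wake the blocked generator so it can exit.
        self._cancelled = True
        self._event.set()

    def wait_for_consumption(self):
        self._event.wait()
        self._event.clear()
        if self._cancelled:
            raise KeyboardInterrupt("generator cancelled while backpressured")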

CancelTask currently doesn't handle transient network failures. That is a change with a larger scope than this PR.

Currently, if multiple objects from the same generator are queued up to be recovered when the recovery periodical runner runs, we could resubmit for the first object and then cancel and resubmit for the second if argument resolution and sequence numbering line up. Since this doesn't actually affect correctness and requires a bit of refactoring, it'll be in a follow-up PR.

dayshah added 3 commits June 22, 2025 15:27
Signed-off-by: dayshah <dhyey2019@gmail.com>
Signed-off-by: dayshah <dhyey2019@gmail.com>
Signed-off-by: dayshah <dhyey2019@gmail.com>
@dayshah added the go label (add ONLY when ready to merge, run all tests) on Jun 23, 2025
if (!status.ok() ||
(!reply.attempt_succeeded() && reply.requested_task_running())) {
RAY_LOG(INFO) << "Failed to cancel generator " << task_id << " with status "
<< status.ToString();
@dayshah (Contributor, Author) commented Jun 23, 2025

Actor task cancellation is just not in a great spot, so proper cancellation failure handling is generally rough right now.

if (cancel_retry_timer->expiry().time_since_epoch() <=
std::chrono::high_resolution_clock::now().time_since_epoch()) {
cancel_retry_timer->expires_after(boost::asio::chrono::milliseconds(
RayConfig::instance().cancellation_retry_ms()));
@dayshah (Contributor, Author)

Maybe this shouldn't be a config variable and should use exponential backoff instead? But that's a separate change to existing behavior.

/* include_task_info */ true,
task_entry.spec.AttemptNumber() + 1);
}

@dayshah (Contributor, Author)

I just inlined these two. They were only called in one spot each and are relatively small. Inlining makes it easier to reason about the whole thing, IMO.

@dayshah dayshah marked this pull request as ready for review June 23, 2025 05:53
@dayshah dayshah requested review from pcmoritz, raulchen and a team as code owners June 23, 2025 05:53
@dayshah (Contributor, Author) commented Jun 23, 2025

Note: still need unit tests but want a green light on logic.

@edoakes (Collaborator) commented Jun 23, 2025

Defer to @israbbani for first pass

@israbbani (Contributor) left a comment

The overall approach looks reasonable AFAICS. I think we need another pair of eyes from @jjyao.

Is actor task cancellation best-effort, since we don't retry on failures? If so, how do we guarantee there will not be deadlock?

Comment on lines 357 to 360
// Resubmit was queued up.
if (still_executing) {
return true;
}
Contributor

Does this mean we've already resubmitted the task?

@dayshah (Contributor, Author)

No, still_executing means we are now cancelling and resubmitting because the task is still on the executing worker. If still_executing is false, it means the task completed by the time we set generator_to_queue_for_resubmit, so we should resubmit now instead of cancelling and resubmitting.

I'll make the variable names more clear and add a comment.

Comment on lines 379 to 380
// We should actually detect if the actor for this task is dead, but let's just assume
// it's not for now.
Contributor

Is this meant to be in here? Should it at least be a TODO?

@dayshah (Contributor, Author)

I didn't write this; it was here before. I don't think it's a problem today though: if the actor is permanently dead and we try resubmitting, we'll just fail the task and fail the corresponding object.

@dayshah (Contributor, Author)

Confirmed: we just fail the task and mark the task's object as failed if we try submitting and the actor is permanently dead.

The only issue is that the other objects needed will still be reconstructed. They'll still be released, though, because the task manager's fail path is still called.

So I'm removing the comment.

[task_id = spec.TaskId()](const Status &status, const rpc::CancelTaskReply &reply) {
if (!status.ok() ||
(!reply.attempt_succeeded() && reply.requested_task_running())) {
RAY_LOG(INFO) << "Failed to cancel generator " << task_id << " with status "
Contributor

It looks like we'll retry in HandlePushTaskReply if necessary. Does this mean cancellation is best-effort? If so, we can't guarantee that there won't be deadlock.

@dayshah (Contributor, Author) commented Jun 24, 2025

Yeah, no guarantee right now. We can guarantee no deadlock with the follow-up: if the executor side received the cancel and we ever get blocked due to backpressure, we cancel there.

Comment on lines +847 to +849
RAY_LOG(INFO) << "Failed to cancel generator " << task_id << " with status "
<< status.ToString();
return;
Contributor

Same question as the actor case. Is cancellation best-effort?

@dayshah (Contributor, Author)

More effort than the single-threaded actor case: we'll call kill_main_task to interrupt the thread.

Collaborator

Note this is still best-effort; users can intentionally or unintentionally catch the SIGINT and ignore it.

@dayshah (Contributor, Author) commented Jun 24, 2025

The overall approach looks reasonable AFAICS. I think we need another pair of eyes from @jjyao.

Is actor task cancellation best-effort, since we don't retry on failures? If so, how do we guarantee there will not be deadlock?

Single-threaded actor task cancellation is very low-effort; if the task is already running, it won't do anything. We can't guarantee no deadlock until I make the follow-up fix to propagate cancellation to the backpressure waiter.

@edoakes (Collaborator) commented Jun 24, 2025

ping me for review once @israbbani's comments are addressed

dayshah added 4 commits June 25, 2025 13:18
Signed-off-by: dayshah <dhyey2019@gmail.com>
Signed-off-by: dayshah <dhyey2019@gmail.com>
Signed-off-by: dayshah <dhyey2019@gmail.com>
Signed-off-by: dayshah <dhyey2019@gmail.com>
@dayshah (Contributor, Author) commented Jun 25, 2025

ping me for review once @israbbani's comments are addressed

addressed / responded to comments

@dayshah dayshah requested a review from israbbani June 25, 2025 21:08
@edoakes (Collaborator) commented Jun 26, 2025

Meta note before I forget: all three documented follow-ups here are critical. @dayshah please make sure we file and follow up on all of them (the rest of the team can help take some of the work).

@edoakes (Collaborator) left a comment

If I'm understanding correctly, the current implementation can cancel the task and resubmit the generator concurrently. This means there can be two invocations of the same streaming generator running at the same time (the one we're cancelling and the one we're resubmitting).

We document actor methods as having "at least once" semantics from the user's perspective, and I believe this can already happen under RPC failure conditions, so it isn't new but worth thinking about. For example, do we gracefully handle the case where the cancelled generator reports its results after the new generator has begun running and reported results earlier in the stream?

for i in range(3):
    yield np.zeros(10 * 1024 * 1024, dtype=np.uint8)
ray.get(signal_actor.wait.remote())
time.sleep(10)
Collaborator

Don't we SIGINT the task in the case of cancellation? If that's the case, we can catch the SIGINT and explicitly record it (e.g., ping a signal actor). That would be much more obvious/self-documenting test behavior.
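A hedged sketch of that test pattern; the SignalActor helper from Ray's test utils is assumed, and the generator body and names are hypothetical:

import time
import numpy as np
import ray
from ray._private.test_utils import SignalActor  # assumed test helper

cancel_seen = SignalActor.remote()

@ray.remote(max_retries=-1)
def generator(cancel_seen):
    try:
        for _ in range(3):
            yield np.zeros(10 * 1024 * 1024, dtype=np.uint8)
        time.sleep(600)  # stay running until recovery cancels the task
    except KeyboardInterrupt:
        # Cancellation is delivered as SIGINT (KeyboardInterrupt); record it
        # so the test can assert the cancel actually landed.
        cancel_seen.send.remote()
        raise

The test could then ray.get(cancel_seen.wait.remote()) before asserting that the lost outputs were recovered by the resubmitted generator.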

Comment on lines 157 to 158
# Recovery periodical runner runs every 100ms
time.sleep(0.1)
Collaborator

What exactly are we waiting for here, and why?

Comment on lines 888 to 889
CancelGenerator(cancel_retry_timer_, client, spec.TaskId(), spec.CallerWorkerId());
return true;
Collaborator

Why do we call CancelGenerator outside of the mutex?

@dayshah (Contributor, Author) commented Jun 26, 2025

Changed to call it inside the mutex. I expected everything left to be thread-safe, but the timer actually isn't, so I'm putting the timer behind an absl guard too, so we can't make this mistake in the future.

dayshah added 2 commits June 26, 2025 15:27
Signed-off-by: dayshah <dhyey2019@gmail.com>
Signed-off-by: dayshah <dhyey2019@gmail.com>
@dayshah (Contributor, Author) commented Jun 27, 2025

If I'm understanding correctly, the current implementation can cancel the task and resubmit the generator concurrently. This means there can be two invocations of the same streaming generator running at the same time (the one we're cancelling and the one we're resubmitting).

It waits for the PushTask RPC to come back before resubmitting. If the PushTask RPC fails due to a network failure, we'll go to the raylet to ask what's happening with the worker, and we still go down that path even with this change. So it should never be possible for two invocations of the same generator to be running at the same time; that could lead to a host of other problems.

For example, do we gracefully handle the case where the cancelled generator reports its results after the new generator has begun running and reported results earlier in the stream?

This is strictly an RPC ordering question, right: the ReportGeneratorItem comes in after the PushTaskReply is handled (and the resubmit happens)? I'll look into this, but I'm assuming it's handled somehow; streaming generators are used pretty heavily, and if ordering being off were enough to cause a bug, I'd assume we'd have seen it...

@dayshah dayshah requested a review from edoakes June 27, 2025 04:51
Signed-off-by: dayshah <dhyey2019@gmail.com>