[core] Fix race condition in raylet graceful shutdown #53762
Conversation
Pull Request Overview
This PR fixes a race condition during raylet graceful shutdown by marking intentional shutdowns and enforcing immediate exit on agent hard failures.
- Wraps the graceful shutdown callback to set `is_shutdown_request_received_` before invoking it, preventing unintended fatal crashes.
- Updates `AgentManager` to distinguish zero and non-zero agent exits, logging a fatal error and exiting immediately on hard failures.
- Clarifies log messages to separate successful agent exits from unexpected crashes.
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
File | Description
---|---
`src/ray/raylet/node_manager.cc` | Wraps `shutdown_raylet_gracefully_` in a lambda that sets `is_shutdown_request_received_`.
`src/ray/raylet/agent_manager.cc` | Adds a branch for non-zero agent exits that logs `RAY_LOG(FATAL)`, and refines the zero-exit log message and reason.
Comments suppressed due to low confidence (3)
src/ray/raylet/agent_manager.cc:87
- The message 'failed with exit code 0' can be confusing since exit code 0 usually means success; consider rephrasing to 'exited with code 0' for clarity.
RAY_LOG(ERROR) << "The raylet exited immediately because one Ray agent failed with exit code 0"
src/ray/raylet/agent_manager.cc:111
- [nitpick] Include the non-zero `exit_code` in the FATAL log message to provide more diagnostic context, matching the detailed `reason_message` in the graceful path.
RAY_LOG(FATAL) << "The raylet exited immediately because agent '" << options_.agent_name
src/ray/raylet/agent_manager.cc:109
- Add a unit or integration test that simulates a non-zero agent exit to verify that the FATAL logging and immediate shutdown behavior work as expected.
// Agent exited with a non-zero code, indicating a hard failure.
src/ray/raylet/node_manager.cc
Outdated
is_shutdown_request_received_ = true;
shutdown_raylet_gracefully_(death_info);
[nitpick] The lambda wrapping shutdown_raylet_gracefully_
is duplicated in two agent manager factories. Consider extracting it into a shared helper to reduce code duplication and improve readability.
Suggested change:
-  is_shutdown_request_received_ = true;
-  shutdown_raylet_gracefully_(death_info);
+  HandleNodeDeath(death_info);
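A minimal sketch of what the suggested helper could look like, assuming it lives on `NodeManager` and reuses the members already shown in this PR. The name `HandleNodeDeath` is taken from the suggestion above; the signature and body here are an assumption for illustration, not the PR's code:

```cpp
// Hypothetical shared helper (name from the suggestion above; body is a sketch).
// Both agent manager factories could call this instead of repeating the lambda.
void NodeManager::HandleNodeDeath(const rpc::NodeDeathInfo &death_info) {
  // Record that this shutdown is intentional so the NodeRemoved path does not
  // treat the eventual GCS DEAD notification as a fatal error.
  is_shutdown_request_received_ = true;
  shutdown_raylet_gracefully_(death_info);
}
```

Each factory would then pass `[this](const rpc::NodeDeathInfo &death_info) { HandleNodeDeath(death_info); }`, keeping the flag-setting logic in one place.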
As discussed offline:
So let's split the changes into three PRs:
As far as I can tell, your solution will work. Good work on the find and root cause!
I have a few followups that should be created and prioritized. I think you and @edoakes have covered them already, but I'll add them here for completeness:
- During graceful shutdown, the raylet should (ideally) transition to a DRAIN state and keep heartbeating until the last step of the shutdown process. The principle being that the data plane shuts down before the control plane.
- The GCS cannot reliably send a KILL signal to a node that it thinks is already dead. It's okay as best-effort STONITH, but if a raylet discovers that its heartbeat connection has been terminated by the GCS, it should commit suicide.
🚢 @jjyao should still review since he's most familiar with the GCS -> NodeManager shutdown path.
src/ray/raylet/agent_manager.cc
Outdated
RAY_LOG(ERROR)
    << "The raylet exited immediately because one Ray agent failed with exit code 0"
    << ", agent_name = " << options_.agent_name << ".\n"
    << "The raylet fate shares with the agent. This can happen because\n"
       "- The version of `grpcio` doesn't follow Ray's requirement. "
       "Agent can segfault with the incorrect `grpcio` version. "
       "Check the grpcio version `pip freeze | grep grpcio`.\n"
       "- The agent failed to start because of unexpected error or port "
       "conflict. Read the log `cat "
       "/tmp/ray/session_latest/logs/{dashboard_agent|runtime_env_agent}.log`. "
       "You can find the log file structure here "
       "https://docs.ray.io/en/master/ray-observability/user-guides/"
       "configure-logging.html#logging-directory-structure.\n"
       "- The agent is killed by the OS (e.g., out of memory).";
This log line has a few problems:
- @edoakes are grpcio version issues and port conflicts still common root causes for why the dashboard can't start up? There was a customer issue where the user was confused about why they were seeing grpcio in the logs.
- At least a few of the causes here (segfault, OOMed) look like they should not exit with code 0.
I'd suggest removing the misleading suggestions and adding the ungraceful ones to the FATAL error.
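For illustration, a hedged sketch of the split this comment suggests. The exact wording and branch placement in `agent_manager.cc` are assumptions rather than the PR's actual diff; the idea is to keep the debugging hints where they apply and surface the non-zero exit code on the hard-failure path:

```cpp
// Sketch only (not the PR's exact diff): separate the clean-exit and
// hard-failure paths, and include the non-zero exit code in the FATAL log.
if (exit_code == 0) {
  RAY_LOG(ERROR) << "Agent '" << options_.agent_name
                 << "' exited with code 0. The raylet fate-shares with the agent, "
                    "so a graceful raylet shutdown is triggered.";
  // ... trigger graceful shutdown; QuickExit() stays scheduled as a fallback ...
} else {
  RAY_LOG(FATAL) << "The raylet exited immediately because agent '"
                 << options_.agent_name << "' failed with exit code " << exit_code
                 << ". Check the agent logs under /tmp/ray/session_latest/logs/ "
                    "(dashboard_agent or runtime_env_agent) for the root cause, "
                    "e.g. a segfault or the OS killing the agent (out of memory).";
}
```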
src/ray/raylet/node_manager.cc
Outdated
[this](const rpc::NodeDeathInfo &death_info) {
  is_shutdown_request_received_ = true;
  shutdown_raylet_gracefully_(death_info);
Can you use `this->is_shutdown_request_received_` and `this->shutdown_raylet_gracefully_` to make the capture more readable? I think binding lambdas to `this` is a code smell and this at least makes it explicit (see #53654 (comment)).
src/ray/raylet/agent_manager.cc
Outdated
// If the process is not terminated within 10 seconds, forcefully kill raylet
// itself.
delay_executor_([]() { QuickExit(); }, /*ms*/ 10000);
if (exit_code == 0) {
I glanced at the code inside process.cc
and it's not obvious to me that waitpid is actually called correctly. If you haven't already investigated it, it might be worth glancing over.
I did look into that. `waitpid` is not called in most cases because we launch the agent with a pipe to stdin. Unless the pipe creation failed, `Process::Wait()` will return status based on pipe-based waiting (`fd != -1`).
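For readers unfamiliar with the technique, here is a standalone POSIX sketch of pipe-based child-exit detection in general. It is not Ray's `Process::Wait()` implementation, and the pipe direction and bookkeeping in Ray may differ; it mainly shows that the numeric exit status is only recoverable once `waitpid()` is actually called:

```cpp
// Generic illustration of pipe-based child-exit detection (NOT Ray code):
// the parent holds the read end of a pipe whose other end lives in the child,
// so read() returning 0 (EOF) signals that the child has exited.
#include <sys/wait.h>
#include <unistd.h>

#include <cstdio>

int main() {
  int fds[2];
  if (pipe(fds) != 0) {
    return 1;
  }
  pid_t pid = fork();
  if (pid == 0) {
    // Child: keep only the write end; exiting closes it, which the parent
    // observes as EOF.
    close(fds[0]);
    sleep(1);
    _exit(7);
  }
  // Parent: close the write end and wait for EOF instead of blocking in waitpid().
  close(fds[1]);
  char buf;
  while (read(fds[0], &buf, 1) > 0) {
  }
  std::printf("child exited: EOF observed on the pipe\n");
  // The exit code is only available via waitpid(); EOF alone says nothing about it.
  int status = 0;
  waitpid(pid, &status, 0);
  if (WIFEXITED(status)) {
    std::printf("exit code = %d\n", WEXITSTATUS(status));
  }
  return 0;
}
```

This is relevant to the discussion above about the `exit_code == 0` branch: without a real `waitpid` status, a crashed agent could in principle be reported with a code that does not reflect how it actually died.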
LGTM.
It makes sense to me that we don't need to gracefully shut down the raylet when the agent dies ungracefully.
## Why are these changes needed?
See #53739 for more details. Here's the sequence of events that leads to the flaky behavior:
1. **Agent is Killed:** The test sends a `SIGKILL` to an agent process (e.g., `dashboard_agent`).
2. **Graceful Shutdown Initiated:** Because the raylet and the agent share fate, irrespective of the `waitpid` status, the `AgentManager` triggers a [graceful shutdown](https://github.com/ray-project/ray/blob/d3fd0d255c00755b4eb2e6e2cd5a8f764e6898aa/src/ray/raylet/agent_manager.cc#L105) of the raylet. This is a slow, multi-step process. This can be seen in the raylet logs: `Raylet graceful shutdown triggered, reason = UNEXPECTED_TERMINATION`.
3. **Race Condition:** While the raylet is busy with its slow, graceful shutdown, it may stop responding to heartbeats from the GCS in a timely manner. The GCS eventually times out, marks the raylet node as `DEAD`, and notifies the raylet of this status change.
4. **Fatal Crash:** The raylet, upon receiving the `DEAD` status from the GCS for itself, interprets this as an unexpected and fatal error (since it's not aware it's in a planned shutdown). It then crashes immediately, which can be seen in the raylet logs: `[Timeout] Exiting because this node manager has mistakenly been marked as dead by the GCS`.

### Proposed Fix
In `node_manager.cc`, when creating the `AgentManager` instances, the `shutdown_raylet_gracefully_` callback should be wrapped in a lambda that first sets `is_shutdown_request_received_ = true`. This will prevent the `NodeRemoved` callback from incorrectly triggering a fatal crash during an intentional graceful shutdown.

## Related issue number
#53739
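To tie the description back to the code, here is a simplified sketch of the consumer side of the flag, i.e. why setting `is_shutdown_request_received_` before the graceful path starts prevents the fatal crash. The handler shape and the `self_node_id_` member name are assumptions for illustration, not the actual `node_manager.cc` source:

```cpp
// Simplified sketch of the NodeRemoved handling described above (not the real
// implementation). Only an unexpected DEAD notification is treated as fatal.
void NodeManager::NodeRemoved(const NodeID &node_id) {
  if (node_id != self_node_id_) {
    // Another node died; nothing fatal for this raylet.
    return;
  }
  if (is_shutdown_request_received_) {
    // The shutdown was requested intentionally (e.g., by the wrapped
    // AgentManager callback), so the GCS marking us DEAD is expected.
    return;
  }
  RAY_LOG(FATAL) << "[Timeout] Exiting because this node manager has mistakenly "
                    "been marked as dead by the GCS.";
}
```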