[core] Fix race condition in raylet graceful shutdown by codope · Pull Request #53762 · ray-project/ray · GitHub

[core] Fix race condition in raylet graceful shutdown #53762


Merged: 3 commits merged into ray-project:master on Jun 16, 2025

Conversation

@codope (Contributor) commented Jun 12, 2025

Why are these changes needed?

See #53739 for more details. Here's the sequence of events that leads to the flaky behavior:

  1. Agent is Killed: The test sends a SIGKILL to an agent process (e.g., dashboard_agent).
  2. Graceful Shutdown Initiated: Because the raylet and the agent share fate, irrespective of the waitpid status, the AgentManager triggers a graceful shutdown of the raylet. This is a slow, multi-step process. This can be seen in the raylet logs: Raylet graceful shutdown triggered, reason = UNEXPECTED_TERMINATION.
  3. Race Condition: While the raylet is busy with its slow, graceful shutdown, it may stop responding to heartbeats from the GCS in a timely manner. The GCS eventually times out, marks the raylet node as DEAD, and notifies the raylet of this status change.
  4. Fatal Crash: The raylet, upon receiving the DEAD status from the GCS for itself, interprets this as an unexpected and fatal error (since it's not aware it's in a planned shutdown). It then crashes immediately, which can be seen in the raylet logs: [Timeout] Exiting because this node manager has mistakenly been marked as dead by the GCS.

Proposed Fix

In node_manager.cc, when creating the AgentManager instances, the shutdown_raylet_gracefully_ callback should be wrapped in a lambda that first sets is_shutdown_request_received_ = true. This will prevent the NodeRemoved callback from incorrectly triggering a fatal crash during an intentional graceful shutdown.
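
For illustration, here is a minimal, self-contained sketch of the shape of this fix. The member names mirror the PR (is_shutdown_request_received_, shutdown_raylet_gracefully_), but the classes below are simplified stand-ins rather than Ray's real NodeManager and rpc::NodeDeathInfo types:

```cpp
#include <iostream>

struct NodeDeathInfo {};  // stand-in for rpc::NodeDeathInfo

class NodeManagerSketch {
 public:
  // Analogue of the wrapping added in node_manager.cc when constructing the
  // AgentManager instances: the callback first marks the shutdown as
  // intentional, then triggers the graceful shutdown.
  void SimulateAgentExit() {
    auto shutdown_for_agent = [this](const NodeDeathInfo &death_info) {
      this->is_shutdown_request_received_ = true;   // mark as intentional
      this->ShutdownRayletGracefully(death_info);   // then shut down
    };
    shutdown_for_agent(NodeDeathInfo{});
  }

  // Analogue of the NodeRemoved check: a DEAD notification from the GCS is
  // only fatal if no intentional shutdown is in progress.
  void OnSelfMarkedDeadByGcs() {
    if (!is_shutdown_request_received_) {
      std::cerr << "unexpected DEAD notification from GCS: would crash\n";
    } else {
      std::cerr << "DEAD notification during planned shutdown: ignored\n";
    }
  }

 private:
  void ShutdownRayletGracefully(const NodeDeathInfo &) {
    std::cerr << "graceful shutdown triggered\n";
  }

  bool is_shutdown_request_received_ = false;
};

int main() {
  NodeManagerSketch nm;
  nm.SimulateAgentExit();      // agent died: flag is set, shutdown begins
  nm.OnSelfMarkedDeadByGcs();  // the later GCS notification is now benign
}
```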

Related issue number

#53739

@Copilot Copilot AI review requested due to automatic review settings June 12, 2025 06:52
@codope added the `go` (add ONLY when ready to merge, run all tests) label Jun 12, 2025
@Copilot (Copilot AI) left a comment

Pull Request Overview

This PR fixes a race condition during raylet graceful shutdown by marking intentional shutdowns and enforcing immediate exit on agent hard failures.

  • Wraps the graceful shutdown callback to set is_shutdown_request_received_ before invoking it, preventing unintended fatal crashes.
  • Updates AgentManager to distinguish zero and non-zero agent exits, logging a fatal error and exiting immediately on hard failures.
  • Clarifies log messages to separate successful agent exits from unexpected crashes.

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

Files changed:
  • src/ray/raylet/node_manager.cc: Wraps shutdown_raylet_gracefully_ in a lambda that sets is_shutdown_request_received_.
  • src/ray/raylet/agent_manager.cc: Adds a branch for non-zero agent exit to RAY_LOG(FATAL) and refines the zero-exit log and reason.
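
For context, here is a self-contained sketch of the zero vs. non-zero exit-code branching described for agent_manager.cc just above. RAY_LOG, QuickExit, and delay_executor_ are Ray internals, so trivial stand-ins are used; this illustrates the control flow only and is not the merged code:

```cpp
#include <cstdlib>
#include <functional>
#include <iostream>

// Stand-in for the delayed-execution facility used by AgentManager.
using DelayExecutor = std::function<void(std::function<void()>, int /*ms*/)>;

void OnAgentExit(int exit_code,
                 const std::function<void()> &shutdown_raylet_gracefully,
                 const DelayExecutor &delay_executor) {
  // Backstop: force-kill the raylet if it has not terminated within 10 s.
  delay_executor([]() { std::_Exit(1); }, /*ms=*/10000);
  if (exit_code == 0) {
    // Clean agent exit: fate-share by requesting a graceful raylet shutdown.
    // In the PR, this callback is the wrapped lambda that also sets
    // is_shutdown_request_received_.
    std::cerr << "agent exited with code 0; graceful raylet shutdown\n";
    shutdown_raylet_gracefully();
  } else {
    // Hard failure: the PR logs RAY_LOG(FATAL) and exits immediately.
    std::cerr << "agent exited with code " << exit_code << "; fatal\n";
    std::abort();
  }
}

int main() {
  DelayExecutor noop_delay = [](std::function<void()>, int) {};  // test stub
  OnAgentExit(/*exit_code=*/0,
              []() { std::cerr << "(graceful shutdown requested)\n"; },
              noop_delay);
}
```
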
Comments suppressed due to low confidence (3)

src/ray/raylet/agent_manager.cc:87

  • The message 'failed with exit code 0' can be confusing since exit code 0 usually means success; consider rephrasing to 'exited with code 0' for clarity.
RAY_LOG(ERROR) << "The raylet exited immediately because one Ray agent failed with exit code 0"

src/ray/raylet/agent_manager.cc:111

  • [nitpick] Include the non-zero exit_code in the FATAL log message to provide more diagnostic context, matching the detailed reason_message in the graceful path.
RAY_LOG(FATAL) << "The raylet exited immediately because agent '" << options_.agent_name

src/ray/raylet/agent_manager.cc:109

  • Add a unit or integration test that simulates a non-zero agent exit to verify that the FATAL logging and immediate shutdown behavior work as expected.
// Agent exited with a non-zero code, indicating a hard failure.

Comment on lines 3253 to 3254
is_shutdown_request_received_ = true;
shutdown_raylet_gracefully_(death_info);
Copilot AI commented Jun 12, 2025

[nitpick] The lambda wrapping shutdown_raylet_gracefully_ is duplicated in two agent manager factories. Consider extracting it into a shared helper to reduce code duplication and improve readability.

Suggested change:
-     is_shutdown_request_received_ = true;
-     shutdown_raylet_gracefully_(death_info);
+     HandleNodeDeath(death_info);


@edoakes edoakes requested a review from a team June 12, 2025 16:56
@edoakes (Collaborator) commented Jun 12, 2025

As discussed offline:

  • The existing logic to fail health checks during graceful shutdown and then ignore the GCS notification looks like a hack. We should not be failing health checks until the Raylet is fully shut down. The way it's currently written is error-prone and also likely to be misleading in terms of observability (the GCS thinks the Raylet is unhealthy when it's just shutting down).
  • This change looks fine to me as a patch fix before we fix the above. But we should rename the flag, as it's no longer set only when the RPC is received, but also on the agent exit path. In general we should unify all shutdown paths, similar to core worker :)
  • The change to un-gracefully exit if the agent dies unexpectedly also makes sense to me, but it's an independent behavior change.

So let's split the changes into three PRs:

  1. Patch fix for the bug described in the description, reusing and renaming the flag.
  2. Change the behavior to crash on unexpected agent death.
  3. Properly fix the shutdown sequence to not fail health checks when draining.

@israbbani (Contributor) left a comment

As far as I can tell, your solution will work. Good work on the find and root cause!

I have a few followups that should be created and prioritized. I think @edoakes and you've covered them already, but I'll add them here for completeness:

  1. During graceful shutdown, the raylet should (ideally) transition to a DRAIN state and keep heartbeating until the last step of the shutdown process. The principle being that the data plane shuts down before the control plane.

  2. The GCS cannot reliably send a KILL signal to a node that it thinks is already dead. It's okay as best-effort STONITH, but if a raylet discovers that its heartbeat connection has been terminated by the GCS, it should commit suicide.

🚢 @jjyao should still review since he's most familiar with the GCS -> NodeManager shutdown path.

Comment on lines 87 to 100
RAY_LOG(ERROR)
    << "The raylet exited immediately because one Ray agent failed with exit code 0"
    << ", agent_name = " << options_.agent_name << ".\n"
    << "The raylet fate shares with the agent. This can happen because\n"
       "- The version of `grpcio` doesn't follow Ray's requirement. "
       "Agent can segfault with the incorrect `grpcio` version. "
       "Check the grpcio version `pip freeze | grep grpcio`.\n"
       "- The agent failed to start because of unexpected error or port "
       "conflict. Read the log `cat "
       "/tmp/ray/session_latest/logs/{dashboard_agent|runtime_env_agent}.log`. "
       "You can find the log file structure here "
       "https://docs.ray.io/en/master/ray-observability/user-guides/"
       "configure-logging.html#logging-directory-structure.\n"
       "- The agent is killed by the OS (e.g., out of memory).";
Contributor:

This log line has a few problems

  1. @edoakes are grpcio version issues and port conflicts still common root causes for why the dashboard can't start up? There was a customer issue where the user was confused about why they were seeing grpcio in the logs.
  2. At least a few of the causes here (segfault, OOMed) look like they should not exit with code 0.

I'd suggest removing misleading suggestions and adding the ungraceful ones to the FATAL error.

Comment on lines 3252 to 3254
[this](const rpc::NodeDeathInfo &death_info) {
  is_shutdown_request_received_ = true;
  shutdown_raylet_gracefully_(death_info);
Contributor:

Can you use this->is_shutdown_request_received_ and this->shutdown_raylet_gracefully_ to make the capture more readable? I think binding lambdas to this is a code smell and this at least makes it explicit (see #53654 (comment))
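
A tiny, generic C++ illustration of that style point (not Ray code): member accesses inside a [this]-capturing lambda compile either way, but spelling out this-> makes the dependence on the enclosing object explicit.

```cpp
#include <functional>
#include <iostream>

struct Widget {
  bool flag_ = false;

  std::function<void()> MakeCallback() {
    return [this]() {
      // Implicit form would be:  flag_ = true;
      // Explicit form (the style asked for in the comment above):
      this->flag_ = true;
      std::cout << "flag_ set via explicit this-> access\n";
    };
  }
};

int main() {
  Widget w;
  w.MakeCallback()();                              // invoke the callback
  std::cout << std::boolalpha << w.flag_ << "\n";  // prints: true
}
```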

// If the process is not terminated within 10 seconds, forcefully kill raylet
// itself.
delay_executor_([]() { QuickExit(); }, /*ms*/ 10000);
if (exit_code == 0) {
Contributor:

I glanced at the code inside process.cc and it's not obvious to me that waitpid is actually called correctly. If you haven't already investigated it, it might be worth glancing over.

@codope (Contributor, Author):

I did look into that. waitpid is not called in most cases because we launch the agent with a pipe to stdin. Unless the pipe creation failed, Process::Wait() will return status based on pipe-based waiting (fd != -1).
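
For readers unfamiliar with the pattern, here is a generic, self-contained POSIX sketch of pipe-based child-exit detection; it illustrates the general mechanism only and is not Ray's process.cc. The parent keeps the write end of a pipe wired to the child's stdin, and poll() reports the pipe as broken once the child exits, so the exit is observed without calling waitpid first.

```cpp
#include <poll.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <cstdio>

int main() {
  int fds[2];
  if (pipe(fds) != 0) return 1;

  pid_t pid = fork();
  if (pid == 0) {
    // Child: its stdin becomes the read end of the pipe; exit after a moment.
    dup2(fds[0], STDIN_FILENO);
    close(fds[0]);
    close(fds[1]);
    sleep(1);
    _exit(0);  // all child fds close here, including the duplicated read end
  }

  // Parent: keep only the write end and wait for the pipe to "break".
  close(fds[0]);
  struct pollfd pfd;
  pfd.fd = fds[1];
  pfd.events = 0;     // POLLERR/POLLHUP are reported even when not requested
  pfd.revents = 0;
  poll(&pfd, 1, -1);  // returns once the child (the only reader) has exited
  if (pfd.revents & (POLLERR | POLLHUP)) {
    std::printf("agent exit detected via the pipe, before any waitpid call\n");
  }
  waitpid(pid, nullptr, 0);  // reap the child; detection above did not need it
  return 0;
}
```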

codope added 3 commits June 16, 2025 08:03 (each signed off by Sagar Sumit <sagarsumit09@gmail.com>)
@edoakes edoakes merged commit 9e25884 into ray-project:master Jun 16, 2025
5 checks passed
@jjyao (Collaborator) left a comment

LGTM.

It makes sense to me that we don't need to gracefully shut down the raylet when the agent dies ungracefully.

elliot-barn pushed a commit that referenced this pull request Jun 18, 2025
rebel-scottlee pushed a commit to rebellions-sw/ray that referenced this pull request Jun 21, 2025
minerharry pushed a commit to minerharry/ray that referenced this pull request Jun 27, 2025