[core] Fix race condition in raylet graceful shutdown #53762
Conversation
Pull Request Overview
This PR fixes a race condition during raylet graceful shutdown by marking intentional shutdowns and enforcing immediate exit on agent hard failures.
- Wraps the graceful shutdown callback to set `is_shutdown_request_received_` before invoking it, preventing unintended fatal crashes.
- Updates `AgentManager` to distinguish zero and non-zero agent exits, logging a fatal error and exiting immediately on hard failures.
- Clarifies log messages to separate successful agent exits from unexpected crashes.
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
File | Description
---|---
`src/ray/raylet/node_manager.cc` | Wraps `shutdown_raylet_gracefully_` in a lambda that sets `is_shutdown_request_received_`.
`src/ray/raylet/agent_manager.cc` | Adds a branch for non-zero agent exits that logs `RAY_LOG(FATAL)`, and refines the zero-exit log message and reason.
Comments suppressed due to low confidence (3)
src/ray/raylet/agent_manager.cc:87
- The message 'failed with exit code 0' can be confusing since exit code 0 usually means success; consider rephrasing to 'exited with code 0' for clarity.
RAY_LOG(ERROR) << "The raylet exited immediately because one Ray agent failed with exit code 0"
src/ray/raylet/agent_manager.cc:111
- [nitpick] Include the non-zero `exit_code` in the FATAL log message to provide more diagnostic context, matching the detailed `reason_message` in the graceful path.
RAY_LOG(FATAL) << "The raylet exited immediately because agent '" << options_.agent_name
src/ray/raylet/agent_manager.cc:109
- Add a unit or integration test that simulates a non-zero agent exit to verify that the FATAL logging and immediate shutdown behavior work as expected.
// Agent exited with a non-zero code, indicating a hard failure.
src/ray/raylet/node_manager.cc
Outdated
is_shutdown_request_received_ = true;
shutdown_raylet_gracefully_(death_info);
[nitpick] The lambda wrapping shutdown_raylet_gracefully_
is duplicated in two agent manager factories. Consider extracting it into a shared helper to reduce code duplication and improve readability.
Suggested change:
-  is_shutdown_request_received_ = true;
-  shutdown_raylet_gracefully_(death_info);
+  HandleNodeDeath(death_info);
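A minimal sketch of what the suggested helper could look like, assuming it lives on `NodeManager` and reuses the members already shown in this PR. The name `HandleNodeDeath` is taken from the suggestion above; the signature and body here are an assumption for illustration, not the PR's code:

```cpp
// Hypothetical shared helper (name from the suggestion above; body is a sketch).
// Both agent manager factories could call this instead of repeating the lambda.
void NodeManager::HandleNodeDeath(const rpc::NodeDeathInfo &death_info) {
  // Record that this shutdown is intentional so the NodeRemoved path does not
  // treat the eventual GCS DEAD notification as a fatal error.
  is_shutdown_request_received_ = true;
  shutdown_raylet_gracefully_(death_info);
}
```

Each factory would then pass `[this](const rpc::NodeDeathInfo &death_info) { HandleNodeDeath(death_info); }`, keeping the flag-setting logic in one place.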
As discussed offline:
So let's split the changes into three PRs:
As far as I can tell, your solution will work. Good work on the find and root cause!
I have a few followups that should be created and prioritized. I think you and @edoakes have covered them already, but I'll add them here for completeness:
- During graceful shutdown, the raylet should (ideally) transition to a DRAIN state and keep heartbeating until the last step of the shutdown process. The principle being that the data plane shuts down before the control plane.
- The GCS cannot reliably send a KILL signal to a node that it thinks is already dead. It's okay as best-effort STONITH, but if a raylet discovers that its heartbeat connection has been terminated by the GCS, it should commit suicide.
🚢 @jjyao should still review since he's most familiar with the GCS -> NodeManager shutdown path.
src/ray/raylet/agent_manager.cc
Outdated
RAY_LOG(ERROR)
    << "The raylet exited immediately because one Ray agent failed with exit code 0"
    << ", agent_name = " << options_.agent_name << ".\n"
    << "The raylet fate shares with the agent. This can happen because\n"
       "- The version of `grpcio` doesn't follow Ray's requirement. "
       "Agent can segfault with the incorrect `grpcio` version. "
       "Check the grpcio version `pip freeze | grep grpcio`.\n"
       "- The agent failed to start because of unexpected error or port "
       "conflict. Read the log `cat "
       "/tmp/ray/session_latest/logs/{dashboard_agent|runtime_env_agent}.log`. "
       "You can find the log file structure here "
       "https://docs.ray.io/en/master/ray-observability/user-guides/"
       "configure-logging.html#logging-directory-structure.\n"
       "- The agent is killed by the OS (e.g., out of memory).";
This log line has a few problems:
- @edoakes are grpcio version issues and port conflicts still common root causes for why the dashboard can't start up? There was a customer issue where the user was confused about why they were seeing grpcio in the logs.
- At least a few of the causes here (segfault, OOMed) look like they should not exit with code 0.
I'd suggest removing the misleading suggestions and adding the ungraceful ones to the FATAL error.
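For illustration, a hedged sketch of the split this comment suggests. The exact wording and branch placement in `agent_manager.cc` are assumptions rather than the PR's actual diff; the idea is to keep the debugging hints where they apply and surface the non-zero exit code on the hard-failure path:

```cpp
// Sketch only (not the PR's exact diff): separate the clean-exit and
// hard-failure paths, and include the non-zero exit code in the FATAL log.
if (exit_code == 0) {
  RAY_LOG(ERROR) << "Agent '" << options_.agent_name
                 << "' exited with code 0. The raylet fate-shares with the agent, "
                    "so a graceful raylet shutdown is triggered.";
  // ... trigger graceful shutdown; QuickExit() stays scheduled as a fallback ...
} else {
  RAY_LOG(FATAL) << "The raylet exited immediately because agent '"
                 << options_.agent_name << "' failed with exit code " << exit_code
                 << ". Check the agent logs under /tmp/ray/session_latest/logs/ "
                    "(dashboard_agent or runtime_env_agent) for the root cause, "
                    "e.g. a segfault or the OS killing the agent (out of memory).";
}
```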
src/ray/raylet/node_manager.cc
Outdated
[this](const rpc::NodeDeathInfo &death_info) {
  is_shutdown_request_received_ = true;
  shutdown_raylet_gracefully_(death_info);
Can you use `this->is_shutdown_request_received_` and `this->shutdown_raylet_gracefully_` to make the capture more readable? I think binding lambdas to `this` is a code smell and this at least makes it explicit (see #53654 (comment)).
src/ray/raylet/agent_manager.cc
Outdated
// If the process is not terminated within 10 seconds, forcefully kill raylet
// itself.
delay_executor_([]() { QuickExit(); }, /*ms*/ 10000);
if (exit_code == 0) {
I glanced at the code inside process.cc
and it's not obvious to me that waitpid is actually called correctly. If you haven't already investigated it, it might be worth glancing over.
I did look into that. `waitpid` is not called in most cases because we launch the agent with a pipe to stdin. Unless the pipe creation failed, `Process::Wait()` will return status based on pipe-based waiting (`fd != -1`).
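For readers unfamiliar with the technique, here is a standalone POSIX sketch of pipe-based child-exit detection in general. It is not Ray's `Process::Wait()` implementation, and the pipe direction and bookkeeping in Ray may differ; it mainly shows that the numeric exit status is only recoverable once `waitpid()` is actually called:

```cpp
// Generic illustration of pipe-based child-exit detection (NOT Ray code):
// the parent holds the read end of a pipe whose other end lives in the child,
// so read() returning 0 (EOF) signals that the child has exited.
#include <sys/wait.h>
#include <unistd.h>

#include <cstdio>

int main() {
  int fds[2];
  if (pipe(fds) != 0) {
    return 1;
  }
  pid_t pid = fork();
  if (pid == 0) {
    // Child: keep only the write end; exiting closes it, which the parent
    // observes as EOF.
    close(fds[0]);
    sleep(1);
    _exit(7);
  }
  // Parent: close the write end and wait for EOF instead of blocking in waitpid().
  close(fds[1]);
  char buf;
  while (read(fds[0], &buf, 1) > 0) {
  }
  std::printf("child exited: EOF observed on the pipe\n");
  // The exit code is only available via waitpid(); EOF alone says nothing about it.
  int status = 0;
  waitpid(pid, &status, 0);
  if (WIFEXITED(status)) {
    std::printf("exit code = %d\n", WEXITSTATUS(status));
  }
  return 0;
}
```

This is relevant to the discussion above about the `exit_code == 0` branch: without a real `waitpid` status, a crashed agent could in principle be reported with a code that does not reflect how it actually died.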
LGTM.
It makes sense to me that we don't need to gracefully shut down the raylet when the agent dies ungracefully.
## Why are these changes needed?
See #53739 for more details. Here's the sequence of events that leads to the flaky behavior:
1. **Agent is Killed:** The test sends a `SIGKILL` to an agent process (e.g., `dashboard_agent`).
2. **Graceful Shutdown Initiated:** Because the raylet and the agent share fate, irrespective of the `waitpid` status, the `AgentManager` triggers a [graceful shutdown](https://github.com/ray-project/ray/blob/d3fd0d255c00755b4eb2e6e2cd5a8f764e6898aa/src/ray/raylet/agent_manager.cc#L105) of the raylet. This is a slow, multi-step process. This can be seen in the raylet logs: `Raylet graceful shutdown triggered, reason = UNEXPECTED_TERMINATION`.
3. **Race Condition:** While the raylet is busy with its slow, graceful shutdown, it may stop responding to heartbeats from the GCS in a timely manner. The GCS eventually times out, marks the raylet node as `DEAD`, and notifies the raylet of this status change.
4. **Fatal Crash:** The raylet, upon receiving the `DEAD` status from the GCS for itself, interprets this as an unexpected and fatal error (since it's not aware it's in a planned shutdown). It then crashes immediately, which can be seen in the raylet logs: `[Timeout] Exiting because this node manager has mistakenly been marked as dead by the GCS`.

### Proposed Fix
In `node_manager.cc`, when creating the `AgentManager` instances, the `shutdown_raylet_gracefully_` callback should be wrapped in a lambda that first sets `is_shutdown_request_received_ = true`. This will prevent the `NodeRemoved` callback from incorrectly triggering a fatal crash during an intentional graceful shutdown.

## Related issue number
#53739
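To tie the description back to the code, here is a simplified sketch of the consumer side of the flag, i.e. why setting `is_shutdown_request_received_` before the graceful path starts prevents the fatal crash. The handler shape and the `self_node_id_` member name are assumptions for illustration, not the actual `node_manager.cc` source:

```cpp
// Simplified sketch of the NodeRemoved handling described above (not the real
// implementation). Only an unexpected DEAD notification is treated as fatal.
void NodeManager::NodeRemoved(const NodeID &node_id) {
  if (node_id != self_node_id_) {
    // Another node died; nothing fatal for this raylet.
    return;
  }
  if (is_shutdown_request_received_) {
    // The shutdown was requested intentionally (e.g., by the wrapped
    // AgentManager callback), so the GCS marking us DEAD is expected.
    return;
  }
  RAY_LOG(FATAL) << "[Timeout] Exiting because this node manager has mistakenly "
                    "been marked as dead by the GCS.";
}
```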