[core] Race condition between raylet graceful shutdown and GCS health checks · Issue #53739 · ray-project/ray · GitHub
Open

Description

@codope

What happened + What you expected to happen

The tests `test_raylet_graceful_exit_upon_agent_exit` and `test_raylet_graceful_exit_upon_runtime_env_agent_exit` are flaky. These tests kill a raylet agent process with `agent.kill()` and expect the raylet to exit with a zero status code, indicating a graceful shutdown. However, the tests sometimes observe a non-zero exit code. The raylet and agent logs show:

```
[2025-06-07 00:40:11,362 I 85797 85835] (raylet) agent_manager.cc:82: Agent process with name dashboard_agent exited, exit code 0.
[2025-06-07 00:40:11,362 E 85797 85835] (raylet) agent_manager.cc:86: The raylet exited immediately because one Ray agent failed, agent_name = dashboard_agent.
The raylet fate shares with the agent. This can happen because
- The version of `grpcio` doesn't follow Ray's requirement. Agent can segfault with the incorrect `grpcio` version. Check the grpcio version `pip freeze | grep grpcio`.
- The agent failed to start because of unexpected error or port conflict. Read the log `cat /tmp/ray/session_latest/logs/{dashboard_agent|runtime_env_agent}.log`. You can find the log file structure here https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#logging-directory-structure.
- The agent is killed by the OS (e.g., out of memory).
[2025-06-07 00:40:11,362 I 85797 85797] (raylet) main.cc:307: Raylet graceful shutdown triggered, reason = UNEXPECTED_TERMINATION, reason message = dashboard_agent failed and raylet fate-shares with it.
[2025-06-07 00:40:11,362 I 85797 85797] (raylet) main.cc:310: Shutting down...
[2025-06-07 00:40:11,362 I 85797 85797] (raylet) accessor.cc:518: Unregistering node node_id=d024fa45163f0344671b58fbaf8072f38716aa65ec748a25cf362d58
[2025-06-07 00:40:11,364 I 85797 85797] (raylet) accessor.cc:768: Received notification for node, IsAlive = 0 node_id=d024fa45163f0344671b58fbaf8072f38716aa65ec748a25cf362d58
[2025-06-07 00:40:11,401 C 85797 85797] (raylet) node_manager.cc:953: [Timeout] Exiting because this node manager has mistakenly been marked as dead by the GCS: GCS failed to check the health of this node for 5 times. This is likely because the machine or raylet has become overloaded.
*** StackTrace Information ***
/rayci/python/ray/core/src/ray/raylet/raylet(+0xec23e8) [0x5618f11e23e8] ray::operator<<()
/rayci/python/ray/core/src/ray/raylet/raylet(+0xec5657) [0x5618f11e5657] ray::RayLog::~RayLog()
/rayci/python/ray/core/src/ray/raylet/raylet(+0x251914) [0x5618f0571914] ray::raylet::NodeManager::NodeRemoved()
/rayci/python/ray/core/src/ray/raylet/raylet(+0x2ec59a) [0x5618f060c59a] std::_Function_handler<>::_M_invoke()
/rayci/python/ray/core/src/ray/raylet/raylet(+0x5a93fc) [0x5618f08c93fc] ray::gcs::NodeInfoAccessor::HandleNotification()
/rayci/python/ray/core/src/ray/raylet/raylet(+0x6b3b56) [0x5618f09d3b56] std::_Function_handler<>::_M_invoke()
/rayci/python/ray/core/src/ray/raylet/raylet(+0x7db8e6) [0x5618f0afb8e6] EventTracker::RecordExecution()
/rayci/python/ray/core/src/ray/raylet/raylet(+0x7d9cc7) [0x5618f0af9cc7] std::_Function_handler<>::_M_invoke()
/rayci/python/ray/core/src/ray/raylet/raylet(+0x7d95b7) [0x5618f0af95b7] boost::asio::detail::completion_handler<>::do_complete()
/rayci/python/ray/core/src/ray/raylet/raylet(+0xe9ad89) [0x5618f11bad89] boost::asio::detail::scheduler::do_run_one()
/rayci/python/ray/core/src/ray/raylet/raylet(+0xe954e1) [0x5618f11b54e1] boost::asio::detail::scheduler::run()
/rayci/python/ray/core/src/ray/raylet/raylet(+0xe95394) [0x5618f11b5394] boost::asio::io_context::run()
/rayci/python/ray/core/src/ray/raylet/raylet(+0x224431) [0x5618f0544431] main
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3) [0x7f1846e23083] __libc_start_main
/rayci/python/ray/core/src/ray/raylet/raylet(+0x2213be) [0x5618f05413be] _start
```

Based on the logs, here is what's happening:

  1. Agent exits with code 0: The log shows `Agent process with name dashboard_agent exited, exit code 0`. This is the first critical point. The test kills the agent with SIGKILL, which should produce a non-zero exit status; an exit code of 0 means the process terminated normally. This is a contradiction, but let's follow the consequences for now.
  2. Graceful shutdown initiated: Because the agent exited with what appears to be a normal exit code (0), the `agent_manager` in the raylet triggers a graceful shutdown, confirmed by the log `Raylet graceful shutdown triggered, reason = UNEXPECTED_TERMINATION`. A "graceful" shutdown is a slow process: it involves cleaning up resources, terminating worker processes, and deregistering from the GCS.
  3. GCS timeout and fatal crash: Here is the race condition. While the raylet is busy with its slow, graceful shutdown, it stops responding to health checks from the GCS (Global Control Service), so the checks time out. After enough failed checks, the GCS concludes the raylet node is dead and notifies the raylet. This is what the log means by `[Timeout] Exiting because this node manager has mistakenly been marked as dead by the GCS`. This is a fatal, unrecoverable error for the raylet.
  4. Non-zero exit: The raylet receives this "you are dead" notification from the GCS and immediately crashes, producing the stack trace. A crash always results in a non-zero exit code.
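The four steps above amount to two timers racing each other. A minimal, illustrative Python sketch (this is not Ray code; the check period, the 5-failure threshold, and the shutdown durations are made-up parameters that only mirror the log's "failed to check the health of this node for 5 times"):

```python
import threading
import time

# Illustrative model of the race, not Ray code. Parameters are invented.
HEALTH_CHECK_PERIOD_S = 0.01
MAX_FAILED_CHECKS = 5  # mirrors "failed to check the health ... for 5 times"

def run_race(shutdown_duration_s):
    """Simulate one run; return the raylet's exit code."""
    exit_code = None
    lock = threading.Lock()

    def graceful_shutdown():
        nonlocal exit_code
        time.sleep(shutdown_duration_s)  # cleanup, worker teardown, deregistration
        with lock:
            if exit_code is None:
                exit_code = 0  # graceful shutdown won the race

    def gcs_health_checker():
        nonlocal exit_code
        failures = 0
        while failures < MAX_FAILED_CHECKS:
            time.sleep(HEALTH_CHECK_PERIOD_S)
            failures += 1  # the shutting-down raylet never answers the check
        with lock:
            if exit_code is None:
                exit_code = 1  # fatal "marked dead by the GCS" crash won

    t1 = threading.Thread(target=graceful_shutdown)
    t2 = threading.Thread(target=gcs_health_checker)
    t1.start(); t2.start()
    t1.join(); t2.join()
    return exit_code

# A shutdown faster than 5 check periods exits cleanly; a slower one crashes.
print(run_race(0.001))  # -> 0
print(run_race(0.2))    # -> 1
```

The sketch makes the flakiness mechanical: nothing orders the shutdown ahead of the health-check deadline, so the observed exit code depends only on which side finishes first.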

Why This Is Flaky
The test's outcome depends on which of these two processes wins the race:

Test fails (exit code != 0): When the GCS timeout fires before the graceful shutdown completes, the raylet crashes and exits with a non-zero code, and the `assert exit_code == 0` fails. The logs above are from such a run.

Test passes (exit code == 0): When the graceful shutdown finishes before the GCS timeout hits, the raylet exits cleanly with code 0, and the `assert exit_code == 0` passes.
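As a side note on the contradiction flagged in step 1: on POSIX, a child killed by SIGKILL does not report exit code 0 to its parent; the parent sees termination by signal 9. A quick self-contained check of that general semantics (plain Python, not the Ray test code):

```python
import signal
import subprocess
import sys

# General POSIX behavior, not Ray-specific: a SIGKILLed child is reported
# as terminated-by-signal, which subprocess exposes as a negative returncode.
proc = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
proc.kill()  # sends SIGKILL on POSIX
ret = proc.wait()
print(ret)                     # -> -9 (negative of the signal number)
print(ret == -signal.SIGKILL)  # -> True
```

So if the raylet's `agent_manager` really observed exit code 0 for a SIGKILLed agent, the exit status is being collected or translated somewhere along the way, which is worth investigating separately from the shutdown race.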

Versions / Dependencies

master

Reproduction script

https://gist.github.com/codope/a229027d649ad80a9f0aae97be114125

Issue Severity

Medium: It is a significant difficulty but I can work around it.

Labels

P1 (issue that should be fixed within a few weeks), bug (something that is supposed to be working, but isn't), ci-test, core (issues that should be addressed in Ray Core), stability
