8000 [Minor][Fix][Core/Test] Fix test_actor_restart_on_node_failure wrong test logic without waiting by MortalHappiness · Pull Request #54088 · ray-project/ray · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

[Minor][Fix][Core/Test] Fix test_actor_restart_on_node_failure wrong test logic without waiting #54088

New issue 8000

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

MortalHappiness
Copy link
Member
@MortalHappiness MortalHappiness commented Jun 25, 2025

Why are these changes needed?

The comment here says "kill node while tasks are still running":

# Kill actor node, while the above task is still being executed.

However, the execution time of the echo task is very short, so the test doesn't reliably verify what it intends to. Specifically, if we set @ray.remote(num_cpus=1, max_restarts=0, max_task_retries=0) here:

@ray.remote(num_cpus=1, max_restarts=1, max_task_retries=-1)

The test should always fail, since both restarts and task retries are disabled. But without a sleep in the echo function, the test sometimes passes by chance.

To fix this, I made half of the tasks block until receiving a signal from SignalActor.

Added one more node dedicated for the SignalActor so that the node where the SignalActor is located won't be removed during the test.

Related issue number

N/A

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

…gic without waiting

Signed-off-by: Chi-Sheng Liu <chishengliu@chishengliu.com>
@Copilot Copilot AI review requested due to automatic review settings June 25, 2025 15:29
Copy link
Contributor
@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR adds a brief delay to the echo method in RestartableActor so that the test test_actor_restart_on_node_failure reliably kills a running task before it returns, addressing flakiness when task execution is too fast.

  • Introduces time.sleep(0.1) in the echo method to lengthen task runtime
  • Ensures that task is still in-flight when the node is killed, so failures under no-retry settings consistently trigger
Comments suppressed due to low confidence (1)

python/ray/tests/test_actor_failures.py:379

  • Verify that the time module is imported at the top of this test file; if it's missing, add import time to avoid a NameError.
            time.sleep(0.1)

Comment on lines 378 to 379
def echo(self, value):
time.sleep(0.1)
Copy link
Preview
Copilot AI Jun 25, 2025

[nitpick] Consider extracting the sleep duration into a named constant (e.g., ECHO_DELAY_SEC = 0.1) and adding a brief comment explaining why the delay is needed to improve clarity and avoid a magic number.

Suggested change
def echo(self, value):
time.sleep(0.1)
# Delay duration for simulating processing time in the echo method.
ECHO_DELAY_SEC = 0.1
def echo(self, value):
time.sleep(self.ECHO_DELAY_SEC)

Copilot uses AI. Check for mistakes.

@MortalHappiness MortalHappiness requested a review from a team June 25, 2025 15:30
@MortalHappiness MortalHappiness changed the title [Fix][Core/Test] Fix test_actor_restart_on_node_failure wrong test logic without waiting [Minor][Fix][Core/Test] Fix test_actor_restart_on_node_failure wrong test logic without waiting Jun 25, 2025
@@ -376,6 +376,7 @@ class RestartableActor:
"""An actor that will be reconstructed at most once."""

def echo(self, value):
time.sleep(0.1)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's make this deterministic using a SignalActor instead

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be useful to have an example to model these after that uses wait_for_condition and SignalActor because the pattern is so common. Do you have a canonical one in mind?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe

def test_actor_unavailable_restarting(ray_start_regular, caller):

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reimplemented with SignalActor. PR description also updated.

Signed-off-by: Chi-Sheng Liu <chishengliu@chishengliu.com>
@MortalHappiness MortalHappiness force-pushed the bugfix/test-actor-restart-node-fail-wrong-no-wait branch from e86e988 to e4b4c2c Compare June 26, 2025 14:16
@edoakes edoakes added the go add ONLY when ready to merge, run all tests label Jun 26, 2025
@edoakes edoakes enabled auto-merge (squash) June 26, 2025 18:04
@edoakes edoakes merged commit 66d124f into ray-project:master Jun 26, 2025
6 of 7 checks passed
minerharry pushed a commit to minerharry/ray that referenced this pull request Jun 27, 2025
…test logic without waiting (ray-project#54088)

Th
8F6C
e comment here says "kill node while tasks are still running":

https://github.com/ray-project/ray/blob/d949d69ebcd90a6890892909aade7b1e076dcbfc/python/ray/tests/test_actor_failures.py#L384

However, the execution time of the `echo` task is very short, so the
test doesn't reliably verify what it intends to. Specifically, if we set
`@ray.remote(num_cpus=1, max_restarts=0, max_task_retries=0)` here:


https://github.com/ray-project/ray/blob/d949d69ebcd90a6890892909aade7b1e076dcbfc/python/ray/tests/test_actor_failures.py#L374

The test should always fail, since both restarts and task retries are
disabled. But without a sleep in the `echo` function, the test sometimes
passes by chance.

To fix this, I made half of the tasks block until receiving a signal
from `SignalActor`.

Added one more node dedicated for the `SignalActor` so that the node
where the `SignalActor` is located won't be removed during the test.

---------

Signed-off-by: Chi-Sheng Liu <chishengliu@chishengliu.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
go add ONLY when ready to merge, run all tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants
0