[Minor][Fix][Core/Test] Fix test_actor_restart_on_node_failure wrong test logic without waiting #54088

MortalHappiness · 2025-06-25T15:29:45Z

Why are these changes needed?

The comment here says "kill node while tasks are still running":

ray/python/ray/tests/test_actor_failures.py

Line 384 in d949d69

# Kill actor node, while the above task is still being executed.

However, the execution time of the echo task is very short, so the test doesn't reliably verify what it intends to. Specifically, if we set @ray.remote(num_cpus=1, max_restarts=0, max_task_retries=0) here:

ray/python/ray/tests/test_actor_failures.py

Line 374 in d949d69

@ray.remote(num_cpus=1, max_restarts=1, max_task_retries=-1)

The test should always fail, since both restarts and task retries are disabled. But without a sleep in the echo function, the test sometimes passes by chance.

To fix this, I made half of the tasks block until receiving a signal from SignalActor.

Added one more node dedicated for the SignalActor so that the node where the SignalActor is located won't be removed during the test.

Related issue number

N/A

Checks

I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://docs.ray.io/en/master/.
- I've added any new APIs to the API Reference. For example, if I added a
  method in Tune, I've added it in doc/source/tune/api/ under the
  corresponding .rst file.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- Unit tests
- Release tests
- This PR is not tested :(

…gic without waiting Signed-off-by: Chi-Sheng Liu <chishengliu@chishengliu.com>

Copilot

Pull Request Overview

This PR adds a brief delay to the echo method in RestartableActor so that the test test_actor_restart_on_node_failure reliably kills a running task before it returns, addressing flakiness when task execution is too fast.

Introduces time.sleep(0.1) in the echo method to lengthen task runtime
Ensures that task is still in-flight when the node is killed, so failures under no-retry settings consistently trigger

Comments suppressed due to low confidence (1)

python/ray/tests/test_actor_failures.py:379

Verify that the time module is imported at the top of this test file; if it's missing, add import time to avoid a NameError.

            time.sleep(0.1)

Copilot · 2025-06-25T15:30:34Z

python/ray/tests/test_actor_failures.py

        def echo(self, value):
+            time.sleep(0.1)


[nitpick] Consider extracting the sleep duration into a named constant (e.g., ECHO_DELAY_SEC = 0.1) and adding a brief comment explaining why the delay is needed to improve clarity and avoid a magic number.

Suggested change

def echo(self, value):

time.sleep(0.1)

# Delay duration for simulating processing time in the echo method.

ECHO_DELAY_SEC = 0.1

def echo(self, value):

time.sleep(self.ECHO_DELAY_SEC)

edoakes · 2025-06-25T17:37:27Z

python/ray/tests/test_actor_failures.py

@@ -376,6 +376,7 @@ class RestartableActor:
        """An actor that will be reconstructed at most once."""

        def echo(self, value):
+            time.sleep(0.1)


let's make this deterministic using a SignalActor instead

It might be useful to have an example to model these after that uses wait_for_condition and SignalActor because the pattern is so common. Do you have a canonical one in mind?

maybe

ray/python/ray/tests/test_unavailable_actors.py

Line 137 in 8af8dae

def test_actor_unavailable_restarting(ray_start_regular, caller):

Reimplemented with SignalActor. PR description also updated.

Signed-off-by: Chi-Sheng Liu <chishengliu@chishengliu.com>

…test logic without waiting (ray-project#54088) Th 8F6C e comment here says "kill node while tasks are still running": https://github.com/ray-project/ray/blob/d949d69ebcd90a6890892909aade7b1e076dcbfc/python/ray/tests/test_actor_failures.py#L384 However, the execution time of the `echo` task is very short, so the test doesn't reliably verify what it intends to. Specifically, if we set `@ray.remote(num_cpus=1, max_restarts=0, max_task_retries=0)` here: https://github.com/ray-project/ray/blob/d949d69ebcd90a6890892909aade7b1e076dcbfc/python/ray/tests/test_actor_failures.py#L374 The test should always fail, since both restarts and task retries are disabled. But without a sleep in the `echo` function, the test sometimes passes by chance. To fix this, I made half of the tasks block until receiving a signal from `SignalActor`. Added one more node dedicated for the `SignalActor` so that the node where the `SignalActor` is located won't be removed during the test. --------- Signed-off-by: Chi-Sheng Liu <chishengliu@chishengliu.com>

[Fix][Core/Test] Fix test_actor_restart_on_node_failure wrong test lo…

14e46ce

…gic without waiting Signed-off-by: Chi-Sheng Liu <chishengliu@chishengliu.com>

Copilot AI review requested due to automatic review settings June 25, 2025 15:29

Copilot AI reviewed Jun 25, 2025

View reviewed changes

MortalHappiness assigned jjyao, edoakes and kevin85421 Jun 25, 2025

MortalHappiness requested a review from a team June 25, 2025 15:30

MortalHappiness changed the title ~~[Fix][Core/Test] Fix test_actor_restart_on_node_failure wrong test logic without waiting~~ [Minor][Fix][Core/Test] Fix test_actor_restart_on_node_failure wrong test logic without waiting Jun 25, 2025

edoakes reviewed Jun 25, 2025

View reviewed changes

Reimplement with signal actor

e4b4c2c

Signed-off-by: Chi-Sheng Liu <chishengliu@chishengliu.com>

MortalHappiness force-pushed the bugfix/test-actor-restart-node-fail-wrong-no-wait branch from e86e988 to e4b4c2c Compare June 26, 2025 14:16

edoakes approved these changes Jun 26, 2025

View reviewed changes

edoakes added the go add ONLY when ready to merge, run all tests label Jun 26, 2025

edoakes enabled auto-merge (squash) June 26, 2025 18:04

edoakes merged commit 66d124f into ray-project:master Jun 26, 2025
6 of 7 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[Minor][Fix][Core/Test] Fix test_actor_restart_on_node_failure wrong test logic without waiting #54088

[Minor][Fix][Core/Test] Fix test_actor_restart_on_node_failure wrong test logic without waiting #54088

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

[Minor][Fix][Core/Test] Fix test_actor_restart_on_node_failure wrong test logic without waiting #54088

[Minor][Fix][Core/Test] Fix test_actor_restart_on_node_failure wrong test logic without waiting #54088

Uh oh!

Conversation

Uh oh!

Why are these changes needed?

Related issue number

Checks

Uh oh!

Choose a reason for hiding this comment

Pull Request Overview

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!