-
Notifications
You must be signed in to change notification settings - Fork 6.5k
[Minor][Fix][Core/Test] Fix test_actor_restart_on_node_failure wrong test logic without waiting #54088
New issue
8000Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Minor][Fix][Core/Test] Fix test_actor_restart_on_node_failure wrong test logic without waiting #54088
Conversation
…gic without waiting Signed-off-by: Chi-Sheng Liu <chishengliu@chishengliu.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR adds a brief delay to the echo
method in RestartableActor
so that the test test_actor_restart_on_node_failure
reliably kills a running task before it returns, addressing flakiness when task execution is too fast.
- Introduces
time.sleep(0.1)
in theecho
method to lengthen task runtime - Ensures that task is still in-flight when the node is killed, so failures under no-retry settings consistently trigger
Comments suppressed due to low confidence (1)
python/ray/tests/test_actor_failures.py:379
- Verify that the
time
module is imported at the top of this test file; if it's missing, addimport time
to avoid aNameError
.
time.sleep(0.1)
def echo(self, value): | ||
time.sleep(0.1) |
There was a problem hiding this comment.
[nitpick] Consider extracting the sleep duration into a named constant (e.g., ECHO_DELAY_SEC = 0.1
) and adding a brief comment explaining why the delay is needed to improve clarity and avoid a magic number.
def echo(self, value): | |
time.sleep(0.1) | |
# Delay duration for simulating processing time in the echo method. | |
ECHO_DELAY_SEC = 0.1 | |
def echo(self, value): | |
time.sleep(self.ECHO_DELAY_SEC) |
Copilot uses AI. Check for mistakes.
@@ -376,6 +376,7 @@ class RestartableActor: | |||
"""An actor that will be reconstructed at most once.""" | |||
|
|||
def echo(self, value): | |||
time.sleep(0.1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's make this deterministic using a SignalActor
instead
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be useful to have an example to model these after that uses wait_for_condition
and SignalActor
because the pattern is so common. Do you have a canonical one in mind?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe
def test_actor_unavailable_restarting(ray_start_regular, caller): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reimplemented with SignalActor
. PR description also updated.
Signed-off-by: Chi-Sheng Liu <chishengliu@chishengliu.com>
e86e988
to
e4b4c2c
Compare
…test logic without waiting (ray-project#54088) Th 8F6C e comment here says "kill node while tasks are still running": https://github.com/ray-project/ray/blob/d949d69ebcd90a6890892909aade7b1e076dcbfc/python/ray/tests/test_actor_failures.py#L384 However, the execution time of the `echo` task is very short, so the test doesn't reliably verify what it intends to. Specifically, if we set `@ray.remote(num_cpus=1, max_restarts=0, max_task_retries=0)` here: https://github.com/ray-project/ray/blob/d949d69ebcd90a6890892909aade7b1e076dcbfc/python/ray/tests/test_actor_failures.py#L374 The test should always fail, since both restarts and task retries are disabled. But without a sleep in the `echo` function, the test sometimes passes by chance. To fix this, I made half of the tasks block until receiving a signal from `SignalActor`. Added one more node dedicated for the `SignalActor` so that the node where the `SignalActor` is located won't be removed during the test. --------- Signed-off-by: Chi-Sheng Liu <chishengliu@chishengliu.com>
Why are these changes needed?
The comment here says "kill node while tasks are still running":
ray/python/ray/tests/test_actor_failures.py
Line 384 in d949d69
However, the execution time of the
echo
task is very short, so the test doesn't reliably verify what it intends to. Specifically, if we set@ray.remote(num_cpus=1, max_restarts=0, max_task_retries=0)
here:ray/python/ray/tests/test_actor_failures.py
Line 374 in d949d69
The test should always fail, since both restarts and task retries are disabled. But without a sleep in the
echo
function, the test sometimes passes by chance.To fix this, I made half of the tasks block until receiving a signal from
SignalActor
.Added one more node dedicated for the
SignalActor
so that the node where theSignalActor
is located won't be removed during the test.Related issue number
N/A
Checks
git commit -s
) in this PR.scripts/format.sh
to lint the changes in this PR.method in Tune, I've added it in
doc/source/tune/api/
under thecorresponding
.rst
file.