[core] Deflake test_scheduling.py in client mode (#54137)
Conversation
The test failed ~50% of the time locally without the change (as expected). With the change, I ran it 50 times in a row without failure.
@@ -390,6 +390,8 @@ def h(x, y):


def test_locality_aware_leasing_borrowed_objects(ray_start_cluster):
    """Test that a task runs where its dependencies are located for borrowed objects."""
    is_ray_client_test = ray._private.client_mode_hook.is_client_mode_enabled
Is this test run with and without ray client separately?
Yes, it's part of the build that sets RAY_CLIENT_MODE=1 to run all tests with Ray Client.
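For reference, a sketch of how that looks locally. The environment variable comes from the reply above; the exact test-file path is an assumption about the Ray repo layout, not something stated in this thread:

```shell
# Run the scheduling tests normally, then again with Ray Client forced on
# via RAY_CLIENT_MODE (the CI build referenced above sets this variable).
# NOTE: the path below is assumed, not confirmed by this PR.
pytest python/ray/tests/test_scheduling.py

RAY_CLIENT_MODE=1 pytest python/ray/tests/test_scheduling.py
```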
(as well as the normal build that doesn't)
# NOTE(edoakes): Ray Client does not respect `fetch_local=False`, so this pulls the
# object to the head node. We instead test a slightly weaker condition by moving the
# `ray.wait` call to resolve the location inside of the borrower task (see above).
Is this worth creating a follow up for and eventually fixing in ray client?
There is an existing ticket for it: #52401
# NOTE(edoakes): Ray Client does not respect `fetch_local=False`, so this pulls the
# object to the head node. We instead test a slightly weaker condition by moving the
# `ray.wait` call to resolve the location inside of the borrower task (see above).
This implies that we don't break ties deterministically in locality scheduling? Is it because we sort by NodeIDs or something and they can be arbitrary?
This is a good point. I would actually expect that the task would always be scheduled to the head node, as our default scheduling policy is supposed to prefer local scheduling before spilling back (when not accounting for locality).
My guess is that the default policy is not respected when breaking locality ties, and only applies when no nodes are candidates for locality scheduling.
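To make the tie-breaking question concrete, here is a minimal toy sketch of a locality-aware node choice with a deterministic tie-break that prefers the submitting (head) node. This is purely illustrative and is not Ray's actual scheduling implementation; all names (`pick_node`, `locality_score`) are hypothetical:

```python
# Toy sketch (NOT Ray's real scheduler): pick the node holding the most of a
# task's dependencies, breaking ties deterministically in favor of the local
# node rather than an arbitrary ordering such as raw node-ID sort order.

def pick_node(nodes, object_locations, deps, local_node):
    """Choose a node for a task given where its dependency objects live.

    nodes: list of node IDs
    object_locations: dep -> set of node IDs holding a copy of it
    deps: the task's dependency object IDs
    local_node: the node submitting the task (e.g. the head node)
    """
    def locality_score(node):
        # Count how many of the task's dependencies are local to `node`.
        return sum(1 for d in deps if node in object_locations.get(d, set()))

    best = max(locality_score(n) for n in nodes)
    candidates = [n for n in nodes if locality_score(n) == best]
    # Deterministic tie-break: prefer the local node when it is tied.
    if local_node in candidates:
        return local_node
    # Otherwise fall back to a stable order so the choice never flips.
    return sorted(candidates)[0]

# The flaky case from this thread: the object ends up on both the head and
# the worker node, so both tie on locality. This policy always picks "head".
nodes = ["head", "worker"]
locations = {"obj": {"head", "worker"}}
print(pick_node(nodes, locations, ["obj"], local_node="head"))  # -> head
```

If ties were instead broken by an arbitrary ordering (e.g. random node IDs), the chosen node would vary between runs, which matches the flakiness observed when the object was present on both nodes.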
`test_locality_aware_borrowed_objects` is flaking in postmerge: https://buildkite.com/ray-project/postmerge/builds/11069#0197aaa0-d2d1-4a49-b505-8af58a44a57b/6-143

After some local testing/digging, I found that this is caused by the fact that `fetch_local=False` is not respected for Ray Client, meaning that the object was available on both the head and worker nodes. I modified the test under Ray Client conditions to instead call `ray.wait` to resolve object locations inside of the borrower task.

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>