8000 [core] Deflake `test_scheduling.py` in client mode by edoakes · Pull Request #54137 · ray-project/ray · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

[core] Deflake test_scheduling.py in client mode #54137

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

edoakes
Copy link
Collaborator
@edoakes edoakes commented Jun 26, 2025

test_locality_aware_borrowed_objects is flaking in postmerge: https://buildkite.com/ray-project/postmerge/builds/11069#0197aaa0-d2d1-4a49-b505-8af58a44a57b/6-143

After some local testing/digging, I found that this is caused by the fact that fetch_local=False is not respected for Ray Client, meaning that the object was available on both the head and worker nodes. I modified the test under Ray Client conditions to instead call ray.wait to resolve object locations inside of the borrower task.

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
@edoakes edoakes added the go add ONLY when ready to merge, run all tests label Jun 26, 2025
@edoakes edoakes requested a review from a team June 26, 2025 14:37
@edoakes
Copy link
Collaborator Author
edoakes commented Jun 26, 2025

The test failed ~50% of the time locally without the change (as expected). With the change, I ran it 50 times in a row without failure.

@edoakes edoakes enabled auto-merge (squash) June 26, 2025 14:38
@@ -390,6 +390,8 @@ def h(x, y):

def test_locality_aware_leasing_borrowed_objects(ray_start_cluster):
"""Test that a task runs where its dependencies are located for borrowed objects."""
is_ray_client_test = ray._private.client_mode_hook.is_client_mode_enabled
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this test run with and without ray client separately?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes it's part of the build that sets RAY_CLIENT_MODE=1 to run all tests w/ client

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(as well as the normal build that doesn't)

Comment on lines +430 to +432
# NOTE(edoakes): Ray Client does not respect `fetch_local=False`, so this pulls the
# object to the head node. We instead test a slightly weaker condition by moving the
# `ray.wait` call to resolve the location inside of the borrower task (see above).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this worth creating a follow up for and eventually fixing in ray client?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is an existing ticket for it: #52401

Comment on lines +430 to +432
# NOTE(edoakes): Ray Client does not respect `fetch_local=False`, so this pulls the
# object to the head node. We instead test a slightly weaker condition by moving the
# `ray.wait` call to resolve the location inside of the borrower task (see above).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This implies that we don't break ties deterministically in locality scheduling? Is it because we sort by NodeIDs or something and they can be arbitrary?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good point. I would actually expect that the task would always be scheduled to the head node, as our default scheduling policy is supposed to prefer local scheduling before spilling back (when not accounting for locality).

My guess is the default policy is not respected when breaking locality ties, but only works in the case where there are no nodes available for locality scheduling.

@israbbani israbbani self-assigned this Jun 26, 2025
@edoakes edoakes merged commit f602b50 into ray-project:master Jun 26, 2025
5 of 6 checks passed
minerharry pushed a commit to minerharry/ray that referenced this pull request Jun 27, 2025
`test_locality_aware_borrowed_objects` is flaking in postmerge:
https://buildkite.com/ray-project/postmerge/builds/11069#0197aaa0-d2d1-4a49-b505-8af58a44a57b/6-143

After some local testing/digging, I found that this is caused by the
fact that `fetch_local=False` is not respected for Ray Client, meaning
that the object was available on both the head and worker nodes. I
modified the test under Ray Client conditions to instead call `ray.wait`
to resolve object locations inside of the borrower task.

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
go add ONLY when ready to merge, run all tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants
0