-
Notifications
You must be signed in to change notification settings - Fork 6.5k
[core][collective] Avoid creation of gloo_queue
in race condition
#50132
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
@ruisearch42 do you have a chance to review this one? |
Hi @HollowMan6 , thanks for the fix! Can you add a unit test for this? |
@ruisearch42 Added |
When there are 2 or more gloo communication groups get created at the same time, the current code which checks `gloo_queue` already exists will experience a race condition. This PR handles this by do a try expect when creating `gloo_queue`, and ignore the expection since it will return back to the loop: ``` Traceback (most recent call last): File runpy.py", line 196, in _run_module_as_main return _run_code(code, main_globals, None, File runpy.py", line 86, in _run_code exec(code, run_globals) File site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper return fn(*args, **kwargs) File site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper return func(*args, **kwargs) File site-packages/ray/_private/worker.py", line 2772, in get values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout) File site-packages/ray/_private/worker.py", line 919, in get_objects raise value.as_instanceof_cause() ray.exceptions.RayTaskError(ValueError): ray::ActorModelRayActor.fit() File "python/ray/includes/common.pxi", line 87, in ray._raylet.check_status ValueError: Failed to look up actor with name 'gloo_queue'. This could because 1. You are trying to look up a named actor you didn't create. 2. The named actor died. 3. You did not use a namespace matching the namespace of the actor. During handling of the above exception, another exception occurred: ray::ActorModelRayActor.fit() File site-packages/ray/util/collective/collective.py", line 148, in init_collective_group _group_mgr.create_collective_group(backend, world_size, rank, group_name) File site-packages/ray/util/collective/collective.py", line 63, in create_collective_group g = GLOOGroup( File site-packages/ray/util/collective/collective_group/gloo_collective_group.py", line 209, in __init__ self._rendezvous.meet() File site-packages/ray/util/collective/collective_group/gloo_collective_group.py", line 135, in meet ).remote(1000) File site-packages/ray/actor.py", line 869, in remote return actor_cls._remote(args=args, kwargs=kwargs, **updated_options) File site-packages/ray/actor.py", line 1202, in _remote actor_id = worker.core_worker.create_actor( File "python/ray/includes/common.pxi", line 77, in ray._raylet.check_status ValueError: Actor with name 'gloo_queue' already exists in the namespace d6ea1122-1997-4005-be1e-f37de912a68d ``` Signed-off-by: Hollow Man <hollowman@opensuse.org>
Signed-off-by: Hollow Man <hollowman@opensuse.org>
@@ -40,6 +40,13 @@ def test_two_groups_in_one_cluster(ray_start_regular_shared): | |||
assert ray.get(w2.get_gloo_timeout.remote(name2)) == time2 | |||
|
|||
|
|||
def test_multi_simultaneous_groups_creation_in_one_cluster(ray_start_regular_shared): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, when I tried to run this all tests in this file failed
Looks like this test file is not included in CI
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the late reply, I didn't notice the CI part, but I did run this test file locally with pytest on my side, and it should be effective (of course, I need to remove fixtures such as ray_start_regular_shared
to get them tested at my side)
This pull request has been automatically marked as stale because it has not had You can always ask for help on our discussion forum or Ray's public slack channel. If you'd like to keep this open, just leave any comment, and the stale label will be removed. |
Keep this active |
This pull request has been automatically marked as stale because it has not had You can always ask for help on our discussion forum or Ray's public slack channel. If you'd like to keep this open, just leave any comment, and the stale label will be removed. |
Why are these changes needed?
When there are 2 or more gloo communication groups get created at the same time, 8000 the current code which checks
gloo_queue
already exists will experience a race condition. This PR handles this by do a try expect when creatinggloo_queue
, and ignore the expection since it will return back to the loop:Checks
git commit -s
) in this PR.scripts/format.sh
to lint the changes in this PR.method in Tune, I've added it in
doc/source/tune/api/
under thecorresponding
.rst
file.