[core][collective] Avoid creation of `gloo_queue` in race condition by HollowMan6 · Pull Request #50132 · ray-project/ray


Open · wants to merge 2 commits into master

Conversation

HollowMan6 (Contributor)

Why are these changes needed?

When two or more Gloo communication groups are created at the same time, the current code, which checks whether `gloo_queue` already exists before creating it, can hit a race condition. This PR handles that by wrapping the creation of `gloo_queue` in a try/except and ignoring the exception, since control returns to the lookup loop (a minimal sketch of the pattern follows the traceback below):

Traceback (most recent call last):
  File runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
    return fn(*args, **kwargs)
  File site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File site-packages/ray/_private/worker.py", line 2772, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
  File site-packages/ray/_private/worker.py", line 919, in get_objects
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(ValueError): ray::ActorModelRayActor.fit()
  File "python/ray/includes/common.pxi", line 87, in ray._raylet.check_status
ValueError: Failed to look up actor with name 'gloo_queue'. This could because 1. You are trying to look up a named actor you didn't create. 2. The named actor died. 3. You did not use a namespace matching the namespace of the actor.

During handling of the above exception, another exception occurred:

ray::ActorModelRayActor.fit()
  File site-packages/ray/util/collective/collective.py", line 148, in init_collective_group
    _group_mgr.create_collective_group(backend, world_size, rank, group_name)
  File site-packages/ray/util/collective/collective.py", line 63, in create_collective_group
    g = GLOOGroup(
  File site-packages/ray/util/collective/collective_group/gloo_collective_group.py", line 209, in __init__
    self._rendezvous.meet()
  File site-packages/ray/util/collective/collective_group/gloo_collective_group.py", line 135, in meet
    ).remote(1000)
  File site-packages/ray/actor.py", line 869, in remote
    return actor_cls._remote(args=args, kwargs=kwargs, **updated_options)
  File site-packages/ray/actor.py", line 1202, in _remote
    actor_id = worker.core_worker.create_actor(
  File "python/ray/includes/common.pxi", line 77, in ray._raylet.check_status
ValueError: Actor with name 'gloo_queue' already exists in the namespace d6ea1122-1997-4005-be1e-f37de912a68d
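For readers unfamiliar with the failure mode, here is a minimal, self-contained sketch of the get-or-create pattern the description refers to. The `GlooQueue` class, the actor options, and the retry loop below are illustrative assumptions by the editor, not the PR's actual diff; Ray's `gloo_collective_group.py` uses its own internal queue actor.

```python
import ray


@ray.remote
class GlooQueue:
    """Illustrative stand-in for the named rendezvous queue actor."""

    def __init__(self, capacity: int):
        self._capacity = capacity


def get_or_create_gloo_queue(capacity: int = 1000):
    """Return a handle to the 'gloo_queue' named actor, creating it if needed.

    Several groups may race here: the lookup can fail because the actor does
    not exist yet, and the subsequent creation can fail because another group
    created it first. Both failures raise ValueError, so both are ignored and
    the loop retries.
    """
    while True:
        try:
            # Fast path: another group already created the named actor.
            return ray.get_actor("gloo_queue")
        except ValueError:
            pass
        try:
            # Slow path: create it ourselves. If another group wins the race,
            # Ray raises ValueError ("already exists") and we retry the lookup.
            return GlooQueue.options(
                name="gloo_queue", lifetime="detached"
            ).remote(capacity)
        except ValueError:
            continue
```

The essential point is that both the failed lookup and the failed creation are expected under contention and are safe to swallow, because every contender eventually converges on the single shared named actor.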

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests

jcotant1 added the core (Issues that should be addressed in Ray Core) label on Jan 30, 2025
ruisearch42 self-assigned this on Mar 24, 2025
hainesmichaelc added the community-contribution (Contributed by the community) label on Apr 4, 2025
jjyao (Collaborator) commented Apr 29, 2025

@ruisearch42 do you have a chance to review this one?

ruisearch42 (Contributor)

Hi @HollowMan6 , thanks for the fix! Can you add a unit test for this?

HollowMan6 (Contributor, Author)

@ruisearch42 Added

@@ -40,6 +40,13 @@ def test_two_groups_in_one_cluster(ray_start_regular_shared):
assert ray.get(w2.get_gloo_timeout.remote(name2)) == time2


def test_multi_simultaneous_groups_creation_in_one_cluster(ray_start_regular_shared):
Contributor

Hmm, when I tried to run this, all tests in this file failed. Looks like this test file is not included in CI.

Contributor Author

Sorry for the late reply, I didn't notice the CI part, but I did run this test file locally with pytest on my side, and it should be effective (of course, I needed to remove fixtures such as ray_start_regular_shared to get the tests to run on my side).
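For readers without the full diff, here is an illustrative sketch of what a test like test_multi_simultaneous_groups_creation_in_one_cluster could look like. The Worker actor, group sizes, and group names below are the editor's assumptions, not the PR's actual test body.

```python
import ray
import ray.util.collective as collective


@ray.remote
class Worker:
    def init_group(self, world_size: int, rank: int, group_name: str) -> bool:
        # Each call triggers the gloo rendezvous, which looks up or creates
        # the shared "gloo_queue" named actor.
        collective.init_collective_group(
            world_size, rank, backend="gloo", group_name=group_name
        )
        return True


def test_multi_simultaneous_groups_creation_in_one_cluster(ray_start_regular_shared):
    # Initialize two independent groups concurrently so that both may race on
    # creating "gloo_queue"; before this PR, one of them could crash with
    # "Actor with name 'gloo_queue' already exists".
    w1, w2 = Worker.remote(), Worker.remote()
    refs = [
        w1.init_group.remote(1, 0, "group_a"),
        w2.init_group.remote(1, 0, "group_b"),
    ]
    assert all(ray.get(refs))
```

Running the two initializations concurrently is what makes the old check-then-create window observable; with the fix, whichever worker loses the race simply looks up the existing actor again.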

github-actions bot commented Jun 6, 2025

This pull request has been automatically marked as stale because it has not had
any activity for 14 days. It will be closed in another 14 days if no further activity occurs.
Thank you for your contributions.

You can always ask for help on our discussion forum or Ray's public slack channel.

If you'd like to keep this open, just leave any comment, and the stale label will be removed.

github-actions bot added the stale (The issue is stale. It will be closed within 7 days unless there are further conversation) label on Jun 6, 2025
HollowMan6 (Contributor, Author)

Keep this active

github-actions bot removed the stale label on Jun 7, 2025
github-actions bot commented Jun 22, 2025

This pull request has been automatically marked as stale because it has not had
any activity for 14 days. It will be closed in another 14 days if no further activity occurs.
Thank you for your contributions.

You can always ask for help on our discussion forum or Ray's public slack channel.

If you'd like to keep this open, just leave any comment, and the stale label will be removed.

github-actions bot added the stale label on Jun 22, 2025