fix: reinitialize ray cluster if required by parthchadha · Pull Request #341 · NVIDIA-NeMo/RL · GitHub

fix: reinitialize ray cluster if required #341

Merged
merged 4 commits into from
May 9, 2025

Conversation

parthchadha
Contributor

What does this PR do ?

CUDA_VISIBLE_DEVICES=0 uv run python3 examples/...
CUDA_VISIBLE_DEVICES=1 uv run python3 examples/...

The second run reused the Ray cluster started by the first run, so it could not find the second GPU. With the fix in this PR, we reinitialize the cluster if the CUDA_VISIBLE_DEVICES tag has changed.
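
The check described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the helper names `device_tag` and `needs_reinit` and the `cvd_` tag format are hypothetical, and it assumes the running cluster advertises the tag as a custom resource key. The Ray calls that would shut down and restart the cluster are deliberately left out.

```python
import os


def device_tag(env=None):
    """Build a tag from CUDA_VISIBLE_DEVICES (hypothetical tag format)."""
    env = os.environ if env is None else env
    raw = env.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        # Unset means every GPU is visible to the process.
        return "cvd_ALL"
    return "cvd_" + raw.replace(",", "_")


def needs_reinit(cluster_resources, current_tag):
    """Return True when the running cluster was not started with current_tag.

    cluster_resources is assumed to be a dict of resource names to amounts,
    with the device tag registered as a custom resource at cluster start.
    """
    return current_tag not in cluster_resources


# First run started the cluster with GPU 0 visible.
existing = {"CPU": 8.0, "GPU": 1.0, device_tag({"CUDA_VISIBLE_DEVICES": "0"}): 1.0}

# Second run sees GPU 1 instead, so the tags differ and the cluster
# should be torn down and started again rather than reused.
second_run_tag = device_tag({"CUDA_VISIBLE_DEVICES": "1"})
print(needs_reinit(existing, second_run_tag))
```

Comparing a tag rather than querying GPUs directly keeps the decision cheap: a run only has to look at what the existing cluster was started with.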

Issues

List issues that this PR closes (syntax):

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this 

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • ...

@parthchadha parthchadha requested a review from terrykong May 8, 2025 20:38
@parthchadha parthchadha added the CI:L0 Run doctests and unit tests label May 8, 2025
@parthchadha parthchadha force-pushed the pchadha/ray-custer-reinit branch from fc3b87e to c241704 Compare May 8, 2025 20:40
…ame CUDA_VISIBLE_DEVICES tag

Signed-off-by: Parth Chadha <pchadha@nvidia.com>
@parthchadha parthchadha force-pushed the pchadha/ray-custer-reinit branch from c241704 to 20ea640 Compare May 8, 2025 20:42
@parthchadha parthchadha added CI:L0 Run doctests and unit tests and removed CI:L0 Run doctests and unit tests labels May 8, 2025
…cluster

Signed-off-by: Parth Chadha <pchadha@nvidia.com>
@parthchadha parthchadha added CI:L0 Run doctests and unit tests and removed CI:L0 Run doctests and unit tests labels May 8, 2025
Signed-off-by: Parth Chadha <pchadha@nvidia.com>
@parthchadha parthchadha added CI:L0 Run doctests and unit tests and removed CI:L0 Run doctests and unit tests labels May 8, 2025
Signed-off-by: Parth Chadha <pchadha@nvidia.com>
@parthchadha parthchadha added CI:L0 Run doctests and unit tests and removed CI:L0 Run doctests and unit tests labels May 8, 2025
@parthchadha parthchadha requested a review from terrykong May 8, 2025 22:23
@parthchadha parthchadha enabled auto-merge May 9, 2025 05:06
@parthchadha parthchadha added this pull request to the merge queue May 9, 2025
Merged via the queue into main with commit 3021098 May 9, 2025
21 checks passed
@parthchadha parthchadha deleted the pchadha/ray-custer-reinit branch May 9, 2025 08:51
YzjiaoNvd pushed a commit to YzjiaoNvd/NeMo-RL that referenced this pull request Jun 10, 2025
Signed-off-by: Parth Chadha <pchadha@nvidia.com>
Labels
CI:L0 Run doctests and unit tests
Projects
None yet
2 participants