fix: load HF model only on rank 0 #544

parthchadha · 2025-06-24T17:26:40Z

What does this PR do?

Loads the Hugging Face model only on rank 0, then uses the FSDP2 set_model_state_dict API to load the weights onto the other ranks after the model has been parallelized.
Note that this PR still runs into GPU OOM for a 70B model on one node with DTensor; that will be fixed in a separate PR.
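A minimal sketch of the pattern described above, not the code from this PR. It assumes torch.distributed is already initialized, that the model has already been parallelized with FSDP2 so its parameters are DTensors, and that the installed PyTorch supports the broadcast_from_rank0 option. The helper names full_state_dict_for_rank and load_sharded_weights, and the load_full_state_dict callable, are hypothetical and used only for illustration.

```python
from typing import Any, Callable, Dict


def full_state_dict_for_rank(
    rank: int, load_full_state_dict: Callable[[], Dict[str, Any]]
) -> Dict[str, Any]:
    """Return the full checkpoint on rank 0 and an empty dict elsewhere."""
    # Only rank 0 calls the loader, so only rank 0 pays the host-memory cost.
    return load_full_state_dict() if rank == 0 else {}


def load_sharded_weights(model, rank, load_full_state_dict) -> None:
    """Broadcast rank 0's full weights into an already-parallelized model."""
    # Imported lazily so this file can be imported without torch installed.
    from torch.distributed.checkpoint.state_dict import (
        StateDictOptions,
        set_model_state_dict,
    )

    set_model_state_dict(
        model,
        model_state_dict=full_state_dict_for_rank(rank, load_full_state_dict),
        options=StateDictOptions(
            full_state_dict=True,       # rank 0 passes unsharded tensors
            broadcast_from_rank0=True,  # other ranks receive and shard them
        ),
    )
```

In this sketch, load_full_state_dict could be something like a lambda that loads the Hugging Face checkpoint on CPU and returns its state_dict(); the other ranks never call it, which is where the host-memory saving comes from.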

Signed-off-by: Parth Chadha <pchadha@nvidia.com>

parthchadha added 2 commits June 18, 2025 23:15

Use set_model_state_dict and load model on rank 0

9c04fe4

Signed-off-by: Parth Chadha <pchadha@nvidia.com>

Merge remote-tracking branch 'origin/main' into pchadha/large-model-s…

0887483

…tate-dict-load

parthchadha marked this pull request as ready for review June 24, 2025 17:26

parthchadha requested review from terrykong and SahilJain314 June 24, 2025 17:26

parthchadha added the CI:L0 Run doctests and unit tests label Jun 24, 2025

parthchadha had a problem deploying to nemo-ci June 24, 2025 17:27 — with GitHub Actions Error

terrykong reviewed Jun 24, 2025

nemo_rl/models/policy/dtensor_policy_worker.py (outdated, resolved)

parthchadha force-pushed the pchadha/large-model-state-dict-load branch from aae142f to ebaeb99 Compare June 24, 2025 17:42

parthchadha added CI:L0 Run doctests and unit tests and removed CI:L0 Run doctests and unit tests labels Jun 24, 2025

parthchadha had a problem deploying to nemo-ci June 24, 2025 17:43 — with GitHub Actions Error

terrykong reviewed Jun 24, 2025

nemo_rl/models/policy/dtensor_policy_worker.py (resolved)

Fix use of model_config and remove duplicate args

fcec2db

Signed-off-by: Parth Chadha <pchadha@nvidia.com>

parthchadha force-pushed the pchadha/large-model-state-dict-load branch from ebaeb99 to fcec2db Compare June 24, 2025 18:11

parthchadha added CI:L0 Run doctests and unit tests and removed CI:L0 Run doctests and unit tests labels Jun 24, 2025

parthchadha temporarily deployed to nemo-ci June 24, 2025 18:14 — with GitHub Actions Inactive

terrykong previously approved these changes Jun 24, 2025

parthchadha enabled auto-merge June 24, 2025 18:42

Merge remote-tracking branch 'origin/main' into pchadha/large-model-s…

9b7bd0f

…tate-dict-load

parthchadha added this pull request to the merge queue Jun 24, 2025

Any commits made after this event will not be merged.

github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jun 24, 2025

parthchadha added this pull request to the merge queue Jun 25, 2025

github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jun 25, 2025

Disable nccl shm to fix #564

7235370

Signed-off-by: Parth Chadha <pchadha@nvidia.com>

parthchadha dismissed terrykong’s stale review via 7235370 June 26, 2025 18:19

Merge remote-tracking branch 'origin/main' into pchadha/large-model-s…

dadea4a

…tate-dict-load

parthchadha added CI:L0 Run doctests and unit tests and removed CI:L0 Run doctests and unit tests labels Jun 26, 2025

parthchadha temporarily deployed to nemo-ci June 26, 2025 18:20 — with GitHub Actions Inactive

parthchadha added 3 commits June 26, 2025 20:48

Merge remote-tracking branch 'origin/main' into pchadha/large-model-s…

7871f65

…tate-dict-load

Manually broadcast buffers; disable nccl shm conditionally

8c0e061

Signed-off-by: Parth Chadha <pchadha@nvidia.com>

Merge remote-tracking branch 'origin/main' into pchadha/large-model-s…

b38c2ae

…tate-dict-load

terrykong enabled auto-merge June 27, 2025 22:47

terrykong approved these changes Jun 27, 2025

terrykong added this pull request to the merge queue Jun 27, 2025

auto-merge was automatically disabled June 28, 2025 02:16
Pull Request is not mergeable

github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jun 28, 2025

chtruong814 added this pull request to the merge queue Jun 28, 2025
