Use `torch.distributed.checkpoint.state_dict.set_model_state_dict` in `load_checkpoint_in_model` by ringohoffman · Pull Request #3432 · huggingface/accelerate

Conversation

@ringohoffman ringohoffman commented Mar 10, 2025

What does this PR do?

`load_checkpoint_in_model` now supports loading into FSDP2-wrapped or tensor-parallelized models when using `device_map=None`.

For large models in a distributed setting, leveraging `broadcast_from_rank0` reduces file system reads and results in much faster loading (60 seconds vs. 90 seconds when loading a 70B model on a single node of 8 GPUs).
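
A rough usage sketch of the new path (the checkpoint path is a placeholder and `model` is assumed to be an already FSDP2-wrapped module; `full_state_dict` and `broadcast_from_rank0` are the options added by this PR, passed explicitly here):

from accelerate.utils import load_checkpoint_in_model

# Every rank calls this; with broadcast_from_rank0=True only rank 0 reads the
# checkpoint from disk, and the tensors are broadcast to the other ranks.
load_checkpoint_in_model(
    model,
    "/path/to/checkpoint",  # placeholder
    device_map=None,
    full_state_dict=True,
    broadcast_from_rank0=True,
)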

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@SunMarc @muellerzr @BenjaminBossan


Member
@SunMarc SunMarc left a comment


Thanks for the PR! This is a nice functionality! Left a few comments. Can you have a second look, @muellerzr? Could you also fix the CI? There are a lot of failing tests currently due to this PR. Also, maybe we can move the tests to the test_fsdp.py test file?

import torch.nn as nn
from torch import distributed as dist
from torch import nn
from torch.distributed.checkpoint.state_dict import StateDictOptions, set_model_state_dict
Member


Is this available starting from torch 2.0?

Author


Yeah, torch.distributed.checkpoint is new in 2.0.0.

Author

I'll move this down into the function and guard with is_torch_version(">=", "2.2.0")
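
A minimal sketch of what that guard could look like, assuming the import is moved inside the function body as described (`is_torch_version` is accelerate's existing version helper):

from accelerate.utils import is_torch_version

# Import lazily so accelerate still imports on older torch versions that lack these APIs.
if is_torch_version(">=", "2.2.0"):
    from torch.distributed.checkpoint.state_dict import StateDictOptions, set_model_state_dict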


Comment on lines 1823 to 1828
full_state_dict (`bool`, *optional*, defaults to `True`): if this is set to `True`, all the tensors in the
    returned state_dict will be gathered. No ShardedTensor and DTensor will be in the returned state_dict.
broadcast_from_rank0 (`bool`, *optional*, defaults to `True`): when the option is `True`, rank0 should receive
    a full state_dict and will broadcast the tensors in the state_dict one by one to other ranks. Other ranks
    will receive the tensors and shard according to the local shards in the model. `full_state_dict` must be
    set to `True` when using this option.
Member

Specify that these are for FSDP/TP only.

Author

set_model_state_dict isn't only for FSDP and TP. It also handles non-distributed loading. You can even use it with DDP. I'll add a test demonstrating this.

I will update the note to mention that a ProcessGroup must be initialized if broadcast_from_rank0=True, and I will change its default to False.
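
A minimal single-process sketch of that point, assuming a toy model and a hand-built state dict (no process group initialized, `broadcast_from_rank0` left at `False`):

import torch
from torch import nn
from torch.distributed.checkpoint.state_dict import StateDictOptions, set_model_state_dict

# Plain, non-distributed loading through the same API path.
model = nn.Linear(4, 4)
full_state_dict = {"weight": torch.zeros(4, 4), "bias": torch.zeros(4)}
set_model_state_dict(
    model,
    full_state_dict,
    options=StateDictOptions(full_state_dict=True, broadcast_from_rank0=False),
)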


Comment on lines 1911 to 1924
loaded_checkpoint = (
    load_state_dict(checkpoint_file, device_map=device_map)
    if (not broadcast_from_rank0 or dist.is_initialized() and dist.get_rank() == 0)
    else {}
)
set_model_state_dict(
    model,
    loaded_checkpoint,
    options=StateDictOptions(
        full_state_dict=full_state_dict,
        strict=strict,
        broadcast_from_rank0=broadcast_from_rank0,
    ),
)
Member

Let's only do the following if distributed is initialized! I'm not sure set_model_state_dict will work if it is not initialized, especially since we set broadcast_from_rank0 to True by default. If a device_map is passed when loading in a distributed env, we could raise a warning/error, for instance.

Also, we can maybe initialize a PartialState here instead of calling dist.is_initialized() and dist.get_rank().
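
For context, a sketch of that alternative, assuming `PartialState` from accelerate is used in place of the raw `dist` calls:

from accelerate import PartialState

state = PartialState()
# state.is_main_process plays the same role as dist.get_rank() == 0,
# and it also works when no process group has been launched.
load_from_disk = state.is_main_process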

Author
@ringohoffman ringohoffman Mar 11, 2025


set_model_state_dict does work when distributed is not initialized, but broadcast_from_rank0=True doesn't work when distributed isn't initialized. To your point, I think the safest thing to do may be to default broadcast_from_rank0 to False instead.

Also, I'll add a test demonstrating that set_model_state_dict does work in a non-distributed context (when broadcast_from_rank0=False).
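
And a sketch of the distributed case described above, assuming a single node launched with torchrun on GPUs with the NCCL backend and a toy model standing in for a real sharded one; broadcast_from_rank0=True needs an initialized process group, with rank 0 supplying the full state dict and the other ranks passing an empty dict:

import torch
import torch.distributed as dist
from torch import nn
from torch.distributed.checkpoint.state_dict import StateDictOptions, set_model_state_dict

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank())

model = nn.Linear(4, 4, device="cuda")
# Only rank 0 provides the tensors; they are broadcast to the other ranks.
state_dict = {"weight": torch.ones(4, 4), "bias": torch.ones(4)} if dist.get_rank() == 0 else {}
set_model_state_dict(
    model,
    state_dict,
    options=StateDictOptions(full_state_dict=True, broadcast_from_rank0=True),
)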

Comment on lines +644 to +665
def pytest_xdist_worker_id():
    """
    Returns an int value of worker's numerical id under `pytest-xdist`'s concurrent workers `pytest -n N` regime, or 0
    if `-n 1` or `pytest-xdist` isn't being used.
    """
    worker = os.environ.get("PYTEST_XDIST_WORKER", "gw0")
    worker = re.sub(r"^gw", "", worker, 0, re.M)
    return int(worker)


def get_torch_dist_unique_port():
    """
    Returns a port number that can be fed to `torch.distributed.launch`'s `--master_port` argument.

    Under `pytest-xdist` it adds a delta number based on a worker id so that concurrent tests don't try to use the same
    port at once.
    """
    port = 29500
    uniq_delta = pytest_xdist_worker_id()
    return port + uniq_delta
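
A usage sketch for the port helper (the script path and process count are placeholders): each pytest-xdist worker gets its own rendezvous port so concurrently running distributed tests don't collide on port 29500:

cmd = [
    "torchrun",
    f"--master_port={get_torch_dist_unique_port()}",
    "--nproc_per_node=2",  # placeholder process count
    "tests/some_distributed_test.py",  # placeholder script
]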


Member

Is there a way to not have this, @muellerzr?


Comment on lines 1 to 29
# Copyright 2024 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import argparse
import functools
import itertools
import unittest
from typing import Any, Callable

import torch
from huggingface_hub import hf_hub_download
from torch import distributed as dist
from torch import nn
from torch.distributed._composable.fsdp import fully_shard
from torch.distributed._tensor import DTensor
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.fsdp.wrap import _recursive_wrap, transformer_auto_wrap_policy
from transformers import AutoConfig, AutoModel
Member

Can we also add a tensor parallel test, since you talk about it in the PR?


@ringohoffman
Author

I think everything should be passing now if you want to give it another go! @SunMarc


Member
@SunMarc SunMarc left a comment


Thanks for iterating! Just a few nits! Can you have a second look, @muellerzr, for the tests in particular? Also, the CI is red because of your changes; can you have a quick look at these tests?

FAILED tests/test_accelerator.py::AcceleratorTester::test_save_model_offload_use_pytorch - AssertionError
FAILED tests/test_accelerator.py::AcceleratorTester::test_save_model_offload_use_safetensors - AssertionError

Comment on lines 201 to 204

class TestLoadCheckpointAndDispatchWithBroadcast(unittest.TestCase):
    @require_transformers
    @require_multi_gpu
Member

You can put the decorators above the class name!


            else:
                torch.testing.assert_close(tensor, tp_tensor, msg=tp_name)

    @require_torch_min_version(version="2.4.0")
Member

Let's put the require_torch_min_version decorator above TestLoadCheckpointAndDispatchWithBroadcast.


Comment on lines 38 to 58
if is_transformers_available():
    from transformers import AutoConfig, AutoModel, PreTrainedModel
    from transformers.models.gpt2.modeling_gpt2 import GPT2Block

    def manage_process_group(func: Callable[..., Any]) -> Callable[..., Any]:
        """Manage the creation and destruction of the distributed process group for the wrapped function."""

        def wrapped(*args: Any, **kwargs: Any) -> Any:
            dist.init_process_group(world_size=torch.cuda.device_count())
            try:
                return func(*args, **kwargs)
            finally:
                dist.destroy_process_group()

        return wrapped

    @require_torch_min_version(version="2.4.0")
    @manage_process_group
    def load_checkpoint_and_dispatch_fsdp2():
        torch.cuda.set_device(device := torch.device(dist.get_rank()))

Member

You don't need to put all these functions inside the `if is_transformers_available():` condition.


@SunMarc SunMarc requested a review from muellerzr March 13, 2025 11:06
@ringohoffman
Author

> Thanks for iterating! Just a few nits! Can you have a second look, @muellerzr, for the tests in particular? Also, the CI is red because of your changes; can you have a quick look at these tests?
>
> FAILED tests/test_accelerator.py::AcceleratorTester::test_save_model_offload_use_pytorch - AssertionError
> FAILED tests/test_accelerator.py::AcceleratorTester::test_save_model_offload_use_safetensors - AssertionError

So weird, I remember solving these failures earlier but I guess I didn't push it...

cdf321b

Member
@SunMarc SunMarc left a comment


Thanks a lot!

if device_map is None:
    model.load_state_dict(loaded_checkpoint, strict=strict)
    if is_torch_version(">=", "2.2.0") and len(model_devices) <= 1:
Member

Why is len(model_devices) <= 1 needed? Usually the model is on the meta device, but the non-persistent buffers are usually not on the meta device.

Author

I added a comment explaining this. This is just an explicit restriction that set_model_state_dict has. Starting in v2.7.0 (not yet released), they actually do support one physical device + the meta device.

I've updated the condition to account for this.

https://github.com/pytorch/pytorch/blob/v2.6.0/torch/distributed/checkpoint/state_dict.py#L557-L563
https://github.com/pytorch/pytorch/blob/v2.7.0-rc2/torch/distributed/checkpoint/state_dict.py#L575-L587

Check it out: f5555fb!
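
A sketch of what such a condition could look like, assuming `model_devices` is the set of `torch.device`s the model's tensors live on, as in the snippet above (illustrative only, not necessarily the PR's exact code; `single_device_or_meta` is a hypothetical helper name):

import torch
from accelerate.utils import is_torch_version

def single_device_or_meta(model_devices: set[torch.device]) -> bool:
    # torch >= 2.7 also accepts one physical device plus the meta device;
    # earlier versions require all tensors on a single device.
    if is_torch_version(">=", "2.7.0"):
        return len(model_devices - {torch.device("meta")}) <= 1
    return len(model_devices) <= 1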

Author

Also, non-persistent buffers are not included in the state_dict by definition, so they wouldn't affect this check.
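
A quick illustration of that point with a toy module (standard PyTorch behavior):

import torch
from torch import nn

m = nn.Module()
# Buffers registered with persistent=False never appear in the state_dict.
m.register_buffer("running_stat", torch.zeros(3), persistent=False)
assert "running_stat" not in m.state_dict()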

@ringohoffman
Author

@muellerzr @SunMarc any progress here? How is this looking?

Member
@SunMarc SunMarc left a comment


The tensor parallel example shouldn't work anymore, unfortunately, due to changes in transformers.

Comment on lines 146 to 155
with device, init_empty_weights():
    config = AutoConfig.from_pretrained(pretrained_model_name_or_path)
    tp_model = AutoModel.from_config(config)
    tp_model.tie_weights()
    assert isinstance(tp_model, nn.Module)

mesh = init_device_mesh(device.type, (dist.get_world_size(),))
assert tp_model.supports_tp_plan
assert callable(tp_model.tensor_parallel)
tp_model.tensor_parallel(mesh)
Member

The API for tensor_parallel changed a bit in transformers. Not sure we need this example anymore.

Author

Sure, I'll remove this test.


@SunMarc
Member
SunMarc commented Apr 1, 2025

I will merge it soon, but after today's release, as I prefer not to include it in that release yet.

@SunMarc SunMarc merged commit 73c2378 into huggingface:main Apr 11, 2025
25 checks passed
@SunMarc
Member
SunMarc commented Apr 11, 2025

Sorry for the wait, merging.
