DeepSpeed Revamp #405
Conversation
The documentation is not available anymore as the PR was closed or merged.
1. Saving the 16-bit model in ZeRO Stage-3. 2. ZeRO init support in Stage-3 using HFDeepSpeedConfig.
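For context, a minimal sketch of what saving the consolidated 16-bit weights under ZeRO Stage-3 can look like with Accelerate. The `zero3_save_16bit_model` plugin flag and the `get_state_dict` call are assumptions based on current Accelerate releases, not necessarily the exact API introduced in this commit:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator, DeepSpeedPlugin

# Toy model/optimizer/data so the sketch is self-contained.
model = torch.nn.Linear(4, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
dataloader = DataLoader(TensorDataset(torch.randn(16, 4), torch.randn(16, 2)), batch_size=4)

# Assumed plugin flag: ask DeepSpeed to gather the sharded 16-bit weights on save.
deepspeed_plugin = DeepSpeedPlugin(zero_stage=3, zero3_save_16bit_model=True)
accelerator = Accelerator(mixed_precision="fp16", deepspeed_plugin=deepspeed_plugin)
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

# ... training loop ...

# get_state_dict consolidates the ZeRO-3 partitioned parameters into a full
# state dict (on the main process), which can then be saved normally.
state_dict = accelerator.get_state_dict(model)
accelerator.save(state_dict, "pytorch_model.bin")
```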
Thanks a lot for all the work and very nice new tests!
Can you confirm whether this is fully backward compatible? If there are any breaking changes, could you document them?
"When `zero3_init_flag` is set, it requires Transformers to be installed. " | ||
"Please run `pip3 install transformers`." | ||
) | ||
from transformers.deepspeed import HfDeepSpeedConfig |
Ultimately, we will want this object to live in Accelerate, not in Transformers. I don't know when the best time to move it is, just putting this here as a general comment :-)
I thought about this. The thing is, a weakref to HFDeepSpeedConfig is created in that file (transformers.deepspeed). This matters only with ZeRO Stage-3, when we don't want to load Transformers models fully on CPU/GPU and instead want to directly partition the model parameters across GPUs. This weakref `_hf_deepspeed_config_weak_ref` is used in transformers' modeling_utils.py to check whether DeepSpeed ZeRO Stage-3 is enabled. If it is, DeepSpeed's zero.Init is used to directly partition model parameters across GPUs. It is used by the from_config (when training from scratch) and from_pretrained (when finetuning) methods.

Snippet in modeling_utils.py:
```python
if is_deepspeed_zero3_enabled():
    import deepspeed

    logger.info("Detected DeepSpeed ZeRO-3: activating zero.init() for this model")
    # this immediately partitions the model across all gpus, to avoid the overhead in time
    # and memory copying it on CPU or each GPU first
    with deepspeed.zero.Init(config_dict_or_path=deepspeed_config()):
        model = cls(config, **kwargs)
else:
    model = cls(config, **kwargs)
```
`is_deepspeed_zero3_enabled` in the above snippet refers directly to the weakref in transformers.deepspeed:
```python
def is_deepspeed_zero3_enabled():
    if _hf_deepspeed_config_weak_ref is not None and _hf_deepspeed_config_weak_ref() is not None:
        return _hf_deepspeed_config_weak_ref().is_zero3()
    else:
        return False
```
For the above reasons, I thought it would be good to keep this in the transformers repo, since it is specifically used only in ZeRO Stage-3 for efficiently loading models that are part of the transformers repo.
With the move, the weakref will disappear and we will rely on the AcceleratorState to know whether ZeRO-3 is enabled inside Transformers. Again, I'm not sure when the right point to do the move is (as it will make Accelerate a hard dependency of Transformers), but I want to flag that this is the final destination :-)
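For illustration, a rough sketch of what the Transformers-side check could look like once the weakref is gone and the AcceleratorState is the source of truth; the `deepspeed_plugin` / `zero_stage` attributes are assumptions based on Accelerate's current plugin, not a settled design:

```python
from accelerate.state import AcceleratorState

def is_deepspeed_zero3_enabled():
    # Hypothetical replacement for the weakref-based check: the shared
    # AcceleratorState knows about the DeepSpeedPlugin when the program was
    # launched with DeepSpeed, so we can ask it for the configured ZeRO stage.
    plugin = getattr(AcceleratorState(), "deepspeed_plugin", None)
    return plugin is not None and plugin.zero_stage == 3
```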
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
1. Add example to show the usage of a config file with the revamped DeepSpeed support. 2. Update required DeepSpeed version to 0.6.5. 3. Revert the `reinit` change as it is not required. 4. Raise an exception when using `clip_grad_value` with DeepSpeed/FSDP.
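As a hedged illustration of the config-file usage mentioned in point 1: current Accelerate lets you hand an existing DeepSpeed JSON config to the plugin. The `hf_ds_config` argument name is an assumption based on present-day Accelerate, not necessarily the exact PR-time API, and `ds_config.json` is a hypothetical path:

```python
from accelerate import Accelerator, DeepSpeedPlugin

# Reuse an existing DeepSpeed config file instead of building one from
# individual plugin arguments.
deepspeed_plugin = DeepSpeedPlugin(hf_ds_config="ds_config.json")
accelerator = Accelerator(deepspeed_plugin=deepspeed_plugin)
```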
1. Changes to support ZeRO Stage-3 inference. 2. Minor bug fixes. 3. Documentation.
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
1. Update tests and add a new one testing the autofill functionality of the `prepare` method. 2. Fix a ZeRO-3 init bug related to HFDeepSpeedConfig. 3. Update documentation addressing comments.
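To make the autofill behaviour tested in point 1 concrete, here is a hedged sketch of the pattern: DeepSpeed config fields left as "auto" are filled in by `prepare` from the objects passed to it. The exact set of autofilled keys shown is an assumption based on current Accelerate behaviour:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator, DeepSpeedPlugin

ds_config = {
    "train_micro_batch_size_per_gpu": "auto",  # filled from the prepared dataloader's batch size
    "gradient_accumulation_steps": "auto",     # filled from the Accelerator/plugin setting
    "zero_optimization": {"stage": 3},
    "fp16": {"enabled": "auto"},               # follows the Accelerator's mixed-precision choice
}

model = torch.nn.Linear(4, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
dataloader = DataLoader(TensorDataset(torch.randn(16, 4), torch.randn(16, 2)), batch_size=4)

accelerator = Accelerator(
    mixed_precision="fp16",
    deepspeed_plugin=DeepSpeedPlugin(hf_ds_config=ds_config),
)
# prepare() replaces the "auto" values above with concrete ones before
# handing the config to deepspeed.initialize.
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)
```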
LGTM, thanks a lot for working on this revamp!
What does this PR do?
from_pretrained call
ToDo: