DeepSpeed Revamp #405
Conversation
The documentation is not available anymore as the PR was closed or merged.
1. Saving the 16-bit model in ZeRO Stage-3. 2. ZeRO init support in Stage-3 using HFDeepSpeedConfig.
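For context, a minimal sketch of what saving the consolidated 16-bit weights under ZeRO Stage-3 can look like with Accelerate. The `zero3_save_16bit_model` plugin flag and the `get_state_dict` call are assumptions based on current Accelerate releases, not necessarily the exact API introduced in this commit:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator, DeepSpeedPlugin

# Toy model/optimizer/data so the sketch is self-contained.
model = torch.nn.Linear(4, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
dataloader = DataLoader(TensorDataset(torch.randn(16, 4), torch.randn(16, 2)), batch_size=4)

# Assumed plugin flag: ask DeepSpeed to gather the sharded 16-bit weights on save.
deepspeed_plugin = DeepSpeedPlugin(zero_stage=3, zero3_save_16bit_model=True)
accelerator = Accelerator(mixed_precision="fp16", deepspeed_plugin=deepspeed_plugin)
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

# ... training loop ...

# get_state_dict consolidates the ZeRO-3 partitioned parameters into a full
# state dict (on the main process), which can then be saved normally.
state_dict = accelerator.get_state_dict(model)
accelerator.save(state_dict, "pytorch_model.bin")
```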
Thanks a lot for all the work and very nice new tests!
Can you confirm whether this is fully backward compatible? If there are any breaking changes, could you document them?
"When `zero3_init_flag` is set, it requires Transformers to be installed. " | ||
"Please run `pip3 install transformers`." | ||
) | ||
from transformers.deepspeed import HfDeepSpeedConfig |
Ultimately, we will want this object to live in Accelerate, not in Transformers. I don't know when the best time to move it is, just putting this here as a general comment :-)
I thought about this. The thing is, a weakref to HFDeepSpeedConfig is created in that file (transformers.deepspeed). This matters only with ZeRO Stage-3, when we don't want to load Transformers models fully on CPU/GPU and instead want to directly partition the model parameters across GPUs. This weakref `_hf_deepspeed_config_weak_ref` is used in transformers' modeling_utils.py to check whether DeepSpeed ZeRO Stage-3 is enabled. If it is, DeepSpeed's zero.Init is used to directly partition model parameters across GPUs. It is used by the from_config (when training from scratch) and from_pretrained (when finetuning) methods.

Snippet in modeling_utils.py:
```python
if is_deepspeed_zero3_enabled():
    import deepspeed

    logger.info("Detected DeepSpeed ZeRO-3: activating zero.init() for this model")
    # this immediately partitions the model across all gpus, to avoid the overhead in time
    # and memory copying it on CPU or each GPU first
    with deepspeed.zero.Init(config_dict_or_path=deepspeed_config()):
        model = cls(config, **kwargs)
else:
    model = cls(config, **kwargs)
```
`is_deepspeed_zero3_enabled` in the above snippet refers directly to the weakref in transformers.deepspeed:
```python
def is_deepspeed_zero3_enabled():
    if _hf_deepspeed_config_weak_ref is not None and _hf_deepspeed_config_weak_ref() is not None:
        return _hf_deepspeed_config_weak_ref().is_zero3()
    else:
        return False
```
For the above reasons, I thought it would be good to keep this in the transformers repo, since it is specifically used only in ZeRO Stage-3 for efficiently loading models that are part of the transformers repo.
With the move, the weakref will disappear and we will rely on the AcceleratorState to know whether ZeRO-3 is enabled inside Transformers. Again, I'm not sure when the right point to do the move is (as it will make Accelerate a hard dependency of Transformers), but I want to flag that this is the final destination :-)
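For illustration, a rough sketch of what the Transformers-side check could look like once the weakref is gone and the AcceleratorState is the source of truth; the `deepspeed_plugin` / `zero_stage` attributes are assumptions based on Accelerate's current plugin, not a settled design:

```python
from accelerate.state import AcceleratorState

def is_deepspeed_zero3_enabled():
    # Hypothetical replacement for the weakref-based check: the shared
    # AcceleratorState knows about the DeepSpeedPlugin when the program was
    # launched with DeepSpeed, so we can ask it for the configured ZeRO stage.
    plugin = getattr(AcceleratorState(), "deepspeed_plugin", None)
    return plugin is not None and plugin.zero_stage == 3
```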
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
1. Add example to show the usage of a config file with the revamped DeepSpeed support. 2. Update required DeepSpeed version to 0.6.5. 3. Revert the `reinit` change as it is not required. 4. Raise an exception when using `clip_grad_value` with DeepSpeed/FSDP.
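As a hedged illustration of the config-file usage mentioned in point 1: current Accelerate lets you hand an existing DeepSpeed JSON config to the plugin. The `hf_ds_config` argument name is an assumption based on present-day Accelerate, not necessarily the exact PR-time API, and `ds_config.json` is a hypothetical path:

```python
from accelerate import Accelerator, DeepSpeedPlugin

# Reuse an existing DeepSpeed config file instead of building one from
# individual plugin arguments.
deepspeed_plugin = DeepSpeedPlugin(hf_ds_config="ds_config.json")
accelerator = Accelerator(deepspeed_plugin=deepspeed_plugin)
```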
1. Changes to support ZeRO Stage-3 inference. 2. Minor bug fixes. 3. Documentation.
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
1. Update tests and add a new one testing the autofill functionality of the `prepare` method. 2. Fix a ZeRO-3 init bug related to HFDeepSpeedConfig. 3. Update documentation addressing comments.
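To make the autofill behaviour tested in point 1 concrete, here is a hedged sketch of the pattern: DeepSpeed config fields left as "auto" are filled in by `prepare` from the objects passed to it. The exact set of autofilled keys shown is an assumption based on current Accelerate behaviour:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator, DeepSpeedPlugin

ds_config = {
    "train_micro_batch_size_per_gpu": "auto",  # filled from the prepared dataloader's batch size
    "gradient_accumulation_steps": "auto",     # filled from the Accelerator/plugin setting
    "zero_optimization": {"stage": 3},
    "fp16": {"enabled": "auto"},               # follows the Accelerator's mixed-precision choice
}

model = torch.nn.Linear(4, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
dataloader = DataLoader(TensorDataset(torch.randn(16, 4), torch.randn(16, 2)), batch_size=4)

accelerator = Accelerator(
    mixed_precision="fp16",
    deepspeed_plugin=DeepSpeedPlugin(hf_ds_config=ds_config),
)
# prepare() replaces the "auto" values above with concrete ones before
# handing the config to deepspeed.initialize.
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)
```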
LGTM, thanks a lot for working on this revamp!
What does this PR do?
from_pretrained call
ToDo: