Fix DataLoader sharding for deepspeed in accelerate #315
Summary
I noticed some strange behavior when training with the deepspeed integration in accelerate: the number of iterations per epoch didn't go down as I increased the number of GPUs, and the loss wasn't converging as expected.
After a bunch of debugging, I found that `_prepare_deepspeed(...)` doesn't appear to call `_prepare_one(...)` properly. It calls it without setting `first_pass=True`, which means that `_prepare_one(...)` skips wrapping the DataLoaders, defeating the whole point.
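To make the `first_pass` behaviour concrete, here is a toy sketch. This is not accelerate's real code: `shard_dataloader` and the subset-based "sharding" are made-up stand-ins, just to show why calling the helper without `first_pass=True` leaves the DataLoader untouched.

```python
import torch
from torch.utils.data import DataLoader, Subset, TensorDataset


def shard_dataloader(dl, num_processes=2):
    # Stand-in for real sharding: keep only this "process's" slice of the data.
    per_process = len(dl.dataset) // num_processes
    return DataLoader(Subset(dl.dataset, range(per_process)), batch_size=dl.batch_size)


def prepare_one(obj, first_pass=False):
    # DataLoaders are only wrapped on the first pass; anything else (models,
    # optimizers, ...) is returned unchanged here.
    if isinstance(obj, DataLoader) and first_pass:
        return shard_dataloader(obj)
    return obj


dataset = TensorDataset(torch.arange(8).float())
dataloader = DataLoader(dataset, batch_size=1)

print(len(prepare_one(dataloader)))                   # 8 -> untouched (the bug)
print(len(prepare_one(dataloader, first_pass=True)))  # 4 -> sharded (with the fix)
```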
How I tested
I added logging to my training flow to print out `len(data_loader)` after `accelerator.prepare(...)` is called. I validated that with this fix, the length is divided by the number of processes, as expected.