Add register_local_source and use_generic_names functionality to DistributedGradientTape for TF. #3628
Conversation
@romerojosh we're using pretty much the same patch for our model parallel training, though we call it […]. Also, regarding registering local variables, not sure if there will be use cases that need variable-level granularity as opposed to layer-level granularity, and hence the […].

Hey @MrAta, thanks for the comment! Adding an optional list argument to […]. I think adding a […].

Thanks for the insight, @romerojosh. Great, I'll raise a PR for […].
Changes look good to me and the new functionality seems very useful! The one suggestion that came to mind would be to add a small unit test for `register_local_source()`. However, as it stands we have quite little test coverage for the various existing options of `DistributedGradientTape` and `DistributedOptimizer`, so that feels optional to me. 🙂
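For reference, a rough sketch of what such a test could look like. The test harness, variable setup, and the `register_local_source(var)` signature are assumptions for illustration; Horovod's actual TF test layout may differ.

```python
import tensorflow as tf
import horovod.tensorflow as hvd


class LocalSourceTests(tf.test.TestCase):
    def test_register_local_source_skips_averaging(self):
        hvd.init()
        rank = hvd.rank()

        local_var = tf.Variable([1.0])
        global_var = tf.Variable([1.0])

        with tf.GradientTape() as tape:
            # Make the gradients rank-dependent so averaging would be visible.
            loss = tf.reduce_sum(local_var * float(rank + 1)
                                 + global_var * float(rank + 1))

        tape = hvd.DistributedGradientTape(tape)
        tape.register_local_source(local_var)  # signature assumed

        grads = tape.gradient(loss, [local_var, global_var])

        avg = (hvd.size() + 1) / 2.0  # average of (rank + 1) over all ranks
        # Local gradient should come back unmodified; the global gradient
        # should still be allreduced/averaged across workers.
        self.assertAllClose(grads[0], [float(rank + 1)])
        self.assertAllClose(grads[1], [avg])


if __name__ == '__main__':
    tf.test.main()
```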
Description
This PR adds a couple of additional functionalities to `DistributedGradientTape` to enable better gradient handling for model parallel cases:

1. Add `register_local_source` to `DistributedGradientTape`: This enables users to mark a source/variable as worker-local, so Horovod will skip performing any global averaging on gradients associated with that variable during a call to `gradients` and will instead return the unmodified local gradient. For example, see the first sketch below.
2. Add a `use_generic_names` option to the `gradients` method: In model parallel scenarios, it is often the case that common gradients on workers that need to be allreduced might be generated on different workers using different logical paths/graphs. TF's naming scheme depends on the code paths that create the variables, and this can lead to naming mismatches and deadlocks. This issue was resolved for Horovod operations on training variables in Add option to strip outer name scope from Horovod ops in TF. #2328, but not for the allreduce on gradients. The solution for training variables was to use custom names and also strip the outer name scope applied by TF using the introduced `ignore_name_scope` option. The `use_generic_names` option applies the same fix to gradients by supplying custom generic names (e.g. `grad_0`, `grad_1`, ...) and the `ignore_name_scope` option to the underlying allreduce calls on the gradients. See the second sketch below.
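First sketch: a minimal illustration of how `register_local_source` might be used to keep a model-parallel layer's gradients local. The model, variable setup, and the exact `register_local_source(var)` signature are assumptions for illustration, not the PR's verbatim example.

```python
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()

# Hypothetical setup: one data-parallel (globally replicated) layer and one
# model-parallel layer whose weights are intentionally different per worker.
shared_layer = tf.keras.layers.Dense(16)
local_layer = tf.keras.layers.Dense(4)
shared_layer.build((None, 8))
local_layer.build((None, 16))

optimizer = tf.keras.optimizers.SGD(0.01)

def train_step(x, y):
    with tf.GradientTape() as tape:
        loss = tf.reduce_mean(tf.square(local_layer(shared_layer(x)) - y))

    tape = hvd.DistributedGradientTape(tape)

    # Mark the model-parallel variables as worker-local: their gradients are
    # returned unmodified instead of being allreduced/averaged across workers.
    for var in local_layer.trainable_variables:
        tape.register_local_source(var)  # signature assumed

    variables = shared_layer.trainable_variables + local_layer.trainable_variables
    grads = tape.gradient(loss, variables)
    optimizer.apply_gradients(zip(grads, variables))
    return loss
```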
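Second sketch: the `use_generic_names` option applied to the gradient computation. Treating it as a keyword argument on the tape's gradient call is an assumption for illustration; check the merged API for its exact placement and spelling.

```python
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()

model = tf.keras.Sequential([tf.keras.layers.Dense(4)])
model.build((None, 8))

def compute_grads(x, y):
    # In a model-parallel setup, this code path (and hence TF's auto-generated
    # op names) may differ across workers.
    with tf.GradientTape() as tape:
        loss = tf.reduce_mean(tf.square(model(x) - y))

    tape = hvd.DistributedGradientTape(tape)

    # With generic names, the gradient allreduces are named grad_0, grad_1, ...
    # and the outer name scope is stripped (ignore_name_scope), so workers that
    # built the graph through different paths still agree on tensor names and
    # avoid allreduce deadlocks.
    return tape.gradient(loss, model.trainable_variables,
                         use_generic_names=True)  # kwarg placement assumed
```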