Add register_local_source and use_generic_names functionality to DistributedGradientTape for TF. by romerojosh · Pull Request #3628 · horovod/horovod · GitHub

Add register_local_source and use_generic_names functionality to DistributedGradientTape for TF. #3628


Merged · 6 commits · Aug 11, 2022

Conversation

romerojosh
Collaborator
@romerojosh romerojosh commented Aug 2, 2022

Checklist before submitting

  • Did you read the contributor guide?
  • Did you update the docs?
  • Did you write any tests to validate this change?
  • Did you update the CHANGELOG, if this change affects users?

Description

This PR adds two pieces of additional functionality to DistributedGradientTape to enable better gradient handling for model parallel cases.

  1. Added a new method register_local_source to DistributedGradientTape: This enables users to mark a source/variable as worker local, so that Horovod will skip any global averaging of gradients associated with that variable during a call to gradients and will instead return the unmodified local gradient. For example:
    tape = hvd.DistributedGradientTape(tape)

    # Register worker local variables (i.e. local source)
    for var in model.trainable_variables:
      if is_worker_local(var):  # hypothetical user-defined predicate
        tape.register_local_source(var)

    # Compute gradients. Any gradient associated with a var passed to register_local_source will not be modified by Horovod.
    gradients = tape.gradient(loss, model.trainable_variables)

  2. Added a new option use_generic_names to the gradients method: In model parallel scenarios, gradients that are common across workers and need to be allreduced are often generated on different workers via different logical paths/graphs. TF's naming scheme depends on the code path that creates each variable, which can lead to naming mismatches and deadlocks. This issue was resolved for Horovod operations on training variables in Add option to strip outer name scope from Horovod ops in TF. #2328, but not for the allreduce on gradients. The fix for training variables was to use custom names and to strip the outer name scope applied by TF via the introduced ignore_name_scope option. The use_generic_names option applies the same fix to gradients by supplying custom generic names (e.g. grad_0, grad_1, ...) and the ignore_name_scope option to the underlying allreduce calls on the gradients. A hedged usage sketch follows this list.
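A minimal sketch of how this could look in user code, following the description above; the exact placement of the use_generic_names argument is an assumption here and may differ slightly from the merged API:

    tape = hvd.DistributedGradientTape(tape)

    # With use_generic_names=True, the underlying allreduce calls use
    # position-based names (grad_0, grad_1, ...) together with
    # ignore_name_scope, so workers that produce the same gradients via
    # different code paths still agree on the tensor names.
    gradients = tape.gradient(loss, model.trainable_variables,
                              use_generic_names=True)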

@github-actions
github-actions bot commented Aug 2, 2022

Unit Test Results

  1 113 files +145 · 1 113 suites +145 · 11h 23m 47s ⏱️ +1h 23m 15s
  814 tests ±0 · 764 ✔️ ±0 · 50 💤 ±0 · 0 ±0
  22 543 runs +2 922 · 16 145 ✔️ +2 258 · 6 398 💤 +664 · 0 ±0

Results for commit 8df094c. ± Comparison against base commit 94ec15c.

♻️ This comment has been updated with latest results.

@github-actions
github-actions bot commented Aug 2, 2022

Unit Test Results (with flaky tests)

  1 221 files +132 · 1 221 suites +132 · 12h 3m 3s ⏱️ +1h 23m 55s
  814 tests ±0 · 764 ✔️ +1 · 50 💤 ±0 · 0 -1
  24 979 runs +2 753 · 17 589 ✔️ +2 097 · 7 390 💤 +657 · 0 -1

Results for commit 8df094c. ± Comparison against base commit 94ec15c.

♻️ This comment has been updated with latest results.

…ibutedGradientTape for TF.

Signed-off-by: Josh Romero <joshr@nvidia.com>
Signed-off-by: Josh Romero <joshr@nvidia.com>
@MrAta
Contributor
MrAta commented Aug 3, 2022

@romerojosh we're using pretty much the same patch for our model parallel training, though we call it PartialDistributedGradientTape since it serves a different use case than the default data parallel one.
What do you think about a PartialDistributedGradientTape(tape, [local_layers]) API that wraps DistributedGradientTape with a list of local layers? I'd be happy to raise a PR after this one is merged.

Also, regarding registering local variables: I'm not sure there will be use cases that need variable-level granularity as opposed to layer-level granularity, hence the [local_layers] in PartialDistributedGradientTape(tape, [local_layers]). Any thoughts on that?

@romerojosh
Collaborator Author

Hey @MrAta, thanks for the comment!

Adding an optional list argument to DistributedGradientTape like DistributedGradientTape(tape, local_vars=[...]) was an alternative approach to register_local_variables that I considered, but I thought it would be more flexible to allow users to add variables to this list after creating the tape (perhaps in situations where the variables aren't known at tape creation time).
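To illustrate that flexibility, a minimal sketch in which a worker-local variable only becomes known after the tape has been wrapped; the variable name here is hypothetical:

    import tensorflow as tf
    import horovod.tensorflow as hvd

    tape = hvd.DistributedGradientTape(tape)

    # A variable created only after the tape is wrapped (e.g. built lazily
    # on the first forward pass) can still be marked as worker local.
    embedding_shard = tf.Variable(tf.zeros([1024, 64]), name="embedding_shard")  # hypothetical
    tape.register_local_source(embedding_shard)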

I think adding a PartialDistributedGradientTape(tape, [local_layers]) wrapper in a follow-up PR would be great! In terms of using layers vs. variables, at least for this PR I wanted to use the same level of granularity that Horovod uses internally to schedule the communication operations (so, variables). For the higher-level wrapper you mention, I think layers could make more sense from a user convenience standpoint.
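To make the layer-level proposal concrete, a rough sketch of what such a wrapper could look like; this class does not exist in Horovod at this point, and the name and signature simply follow the discussion above:

    import horovod.tensorflow as hvd

    def PartialDistributedGradientTape(tape, local_layers):
        # Hypothetical convenience wrapper: wraps the tape and marks every
        # trainable variable of the given layers as worker local.
        dtape = hvd.DistributedGradientTape(tape)
        for layer in local_layers:
            for var in layer.trainable_variables:
                dtape.register_local_source(var)
        return dtape

Internally this would still operate at variable granularity, matching how Horovod schedules its communication operations; the layer list would just be a convenience for users.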

@MrAta
Contributor
MrAta commented Aug 3, 2022

Thanks for the insight, @romerojosh. Great, I'll raise a PR for PartialDistributedGradientTape once this PR is merged then.

Signed-off-by: Josh Romero <joshr@nvidia.com>
Signed-off-by: Josh Romero <joshr@nvidia.com>
Signed-off-by: Josh Romero <joshr@nvidia.com>
Signed-off-by: Josh Romero <joshr@nvidia.com>
@romerojosh romerojosh requested a review from maxhgerlach August 8, 2022 17:42
@MrAta MrAta mentioned this pull request Aug 9, 2022
Collaborator
@maxhgerlach maxhgerlach left a comment


Changes look good to me and the new functionality seems very useful!

The one suggestion that came to mind would be to add a small unit test for register_local_source(). However, as it stands we have quite little test coverage for the various existing options of DistributedGradientTape and DistributedOptimizer, so that feels optional to me. 🙂
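For reference, a rough sketch of what such a test could look like, assuming it is launched with two or more workers (e.g. via horovodrun); the variables, loss, and tolerance are purely illustrative and not taken from the Horovod test suite:

    import tensorflow as tf
    import horovod.tensorflow as hvd

    def test_register_local_source():
        hvd.init()
        # Rank-dependent variable, so the unreduced local gradient differs per worker.
        v_local = tf.Variable([float(hvd.rank() + 1)])
        v_global = tf.Variable([1.0])

        with tf.GradientTape() as tape:
            loss = tf.reduce_sum(v_local * v_local) + tf.reduce_sum(v_global)

        tape = hvd.DistributedGradientTape(tape)
        tape.register_local_source(v_local)
        grads = tape.gradient(loss, [v_local, v_global])

        # The gradient of the registered local source should come back
        # unmodified: 2 * (rank + 1), not an average across workers.
        assert abs(float(grads[0][0]) - 2.0 * (hvd.rank() + 1)) < 1e-6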
