Add register_local_source and use_generic_names functionality to DistributedGradientTape for TF. by romerojosh · Pull Request #3628 · horovod/horovod · GitHub

Add register_local_source and use_generic_names functionality to DistributedGradientTape for TF. #3628


Merged · 6 commits · Aug 11, 2022

Conversation

romerojosh
Collaborator
@romerojosh romerojosh commented Aug 2, 2022

Checklist before submitting

  • Did you read the contributor guide?
  • Did you update the docs?
  • Did you write any tests to validate this change?
  • Did you update the CHANGELOG, if this change affects users?

Description

This PR adds two pieces of additional functionality to DistributedGradientTape to enable better gradient handling for model parallel cases.

  1. Added a new method register_local_source to DistributedGradientTape: This enables users to mark a source/variable as worker local, so that Horovod will skip any global averaging of gradients associated with that variable during a call to gradients and will instead return the unmodified local gradient. For example:
    tape = hvd.DistributedGradientTape(tape)

    # Register worker local variables (i.e. local source)
    for var in model.trainable_variables:
      if is_worker_local(var):  # hypothetical user-defined predicate
        tape.register_local_source(var)

    # Compute gradients. Any gradient associated with a var passed to register_local_source will not be modified by Horovod.
    gradients = tape.gradient(loss, model.trainable_variables)

  2. Added a new option use_generic_names to the gradients method: In model parallel scenarios, gradients that are common across workers and need to be allreduced are often generated on different workers via different logical paths/graphs. TF's naming scheme depends on the code path that creates each variable, which can lead to naming mismatches and deadlocks. This issue was resolved for Horovod operations on training variables in Add option to strip outer name scope from Horovod ops in TF. #2328, but not for the allreduce on gradients. The fix for training variables was to use custom names and to strip the outer name scope applied by TF via the introduced ignore_name_scope option. The use_generic_names option applies the same fix to gradients by supplying custom generic names (e.g. grad_0, grad_1, ...) and the ignore_name_scope option to the underlying allreduce calls on the gradients. A hedged usage sketch follows this list.
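A minimal sketch of how this could look in user code, following the description above; the exact placement of the use_generic_names argument is an assumption here and may differ slightly from the merged API:

    tape = hvd.DistributedGradientTape(tape)

    # With use_generic_names=True, the underlying allreduce calls use
    # position-based names (grad_0, grad_1, ...) together with
    # ignore_name_scope, so workers that produce the same gradients via
    # different code paths still agree on the tensor names.
    gradients = tape.gradient(loss, model.trainable_variables,
                              use_generic_names=True)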

@github-actions
github-actions bot commented Aug 2, 2022

Unit Test Results

  1 113 files +145 · 1 113 suites +145 · 11h 23m 47s ⏱️ +1h 23m 15s
  814 tests ±0 · 764 ✔️ ±0 · 50 💤 ±0 · 0 ±0
  22 543 runs +2 922 · 16 145 ✔️ +2 258 · 6 398 💤 +664 · 0 ±0

Results for commit 8df094c. ± Comparison against base commit 94ec15c.

♻️ This comment has been updated with latest results.

@github-actions
github-actions bot commented Aug 2, 2022

Unit Test Results (with flaky tests)

  1 221 files +132 · 1 221 suites +132 · 12h 3m 3s ⏱️ +1h 23m 55s
  814 tests ±0 · 764 ✔️ +1 · 50 💤 ±0 · 0 -1
  24 979 runs +2 753 · 17 589 ✔️ +2 097 · 7 390 💤 +657 · 0 -1

Results for commit 8df094c. ± Comparison against base commit 94ec15c.

♻️ This comment has been updated with latest results.

…ibutedGradientTape for TF.

Signed-off-by: Josh Romero <joshr@nvidia.com>
Signed-off-by: Josh Romero <joshr@nvidia.com>
@MrAta
Contributor
MrAta commented Aug 3, 2022

@romerojosh we're using pretty much the same patch for our model parallel training, though we call it PartialDistributedGradientTape since it serves a different use case than the default data parallel one.
What do you think about a PartialDistributedGradientTape(tape, [local_layers]) API that wraps DistributedGradientTape with a list of local layers? I'd be happy to raise a PR after this one is merged.

Also, regarding registering local variables: I'm not sure there will be use cases that need variable-level granularity as opposed to layer-level granularity, hence the [local_layers] in PartialDistributedGradientTape(tape, [local_layers]). Any thoughts on that?

@romerojosh
Collaborator Author

Hey @MrAta, thanks for the comment!

Adding an optional list argument to DistributedGradientTape like DistributedGradientTape(tape, local_vars=[...]) was an alternative approach to register_local_variables that I considered, but I thought it would be more flexible to allow users to add variables to this list after creating the tape (perhaps in situations where the variables aren't known at tape creation time).
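To illustrate that flexibility, a minimal sketch in which a worker-local variable only becomes known after the tape has been wrapped; the variable name here is hypothetical:

    import tensorflow as tf
    import horovod.tensorflow as hvd

    tape = hvd.DistributedGradientTape(tape)

    # A variable created only after the tape is wrapped (e.g. built lazily
    # on the first forward pass) can still be marked as worker local.
    embedding_shard = tf.Variable(tf.zeros([1024, 64]), name="embedding_shard")  # hypothetical
    tape.register_local_source(embedding_shard)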

I think adding a PartialDistributedGradientTape(tape, [local_layers]) wrapper in a follow-up PR would be great! In terms of using layers vs. variables, at least for this PR I wanted to use the same level of granularity that Horovod uses internally to schedule the communication operations (so, variables). For the higher-level wrapper you mention, I think layers could make more sense from a user convenience standpoint.
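To make the layer-level proposal concrete, a rough sketch of what such a wrapper could look like; this class does not exist in Horovod at this point, and the name and signature simply follow the discussion above:

    import horovod.tensorflow as hvd

    def PartialDistributedGradientTape(tape, local_layers):
        # Hypothetical convenience wrapper: wraps the tape and marks every
        # trainable variable of the given layers as worker local.
        dtape = hvd.DistributedGradientTape(tape)
        for layer in local_layers:
            for var in layer.trainable_variables:
                dtape.register_local_source(var)
        return dtape

Internally this would still operate at variable granularity, matching how Horovod schedules its communication operations; the layer list would just be a convenience for users.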

@MrAta
Contributor
MrAta commented Aug 3, 2022

Thanks for the insight, @romerojosh. Great, I'll raise a PR for PartialDistributedGradientTape once this PR is merged then.

Signed-off-by: Josh Romero <joshr@nvidia.com>
Signed-off-by: Josh Romero <joshr@nvidia.com>
Signed-off-by: Josh Romero <joshr@nvidia.com>
Signed-off-by: Josh Romero <joshr@nvidia.com>
@romerojosh romerojosh requested a review from maxhgerlach August 8, 2022 17:42
@MrAta MrAta mentioned this pull request Aug 9, 2022
Collaborator
@maxhgerlach maxhgerlach left a comment


Changes look good to me and the new functionality seems very useful!

The one suggestion that came to mind would be to add a small unit test for register_local_source(). However, as it stands we have quite little test coverage for the various existing options of DistributedGradientTape and DistributedOptimizer, so that feels optional to me. 🙂
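For reference, a rough sketch of what such a test could look like, assuming it is launched with two or more workers (e.g. via horovodrun); the variables, loss, and tolerance are purely illustrative and not taken from the Horovod test suite:

    import tensorflow as tf
    import horovod.tensorflow as hvd

    def test_register_local_source():
        hvd.init()
        # Rank-dependent variable, so the unreduced local gradient differs per worker.
        v_local = tf.Variable([float(hvd.rank() + 1)])
        v_global = tf.Variable([1.0])

        with tf.GradientTape() as tape:
            loss = tf.reduce_sum(v_local * v_local) + tf.reduce_sum(v_global)

        tape = hvd.DistributedGradientTape(tape)
        tape.register_local_source(v_local)
        grads = tape.gradient(loss, [v_local, v_global])

        # The gradient of the registered local source should come back
        # unmodified: 2 * (rank + 1), not an average across workers.
        assert abs(float(grads[0][0]) - 2.0 * (hvd.rank() + 1)) < 1e-6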
