TensorFlow: scale the gradients of local variables by MrAta · Pull Request #3719 · horovod/horovod · GitHub

TensorFlow: scale the gradients of local variables #3719


Merged
11 commits merged on Oct 11, 2022

Conversation

MrAta
Contributor
@MrAta MrAta commented Sep 27, 2022

Checklist before submitting

  • Did you read the contributor guide?
  • Did you update the docs?
  • Did you write any tests to validate this change?
  • Did you update the CHANGELOG, if this change affects users?

Description

This PR is a follow-up to #3695 and the discussion made there, scaling down the gradients of local variables in the partial distributed gradient tape, the distributed optimizer, and the local gradient aggregators.

Fixes #3705
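The core idea of the change can be sketched in plain Python (this is an illustrative toy, not Horovod's actual implementation): gradients of globally shared variables are averaged across workers via allreduce, while gradients of worker-local variables never leave the worker, so to keep the two groups at a consistent effective magnitude, local gradients are divided by the number of workers.

```python
def scale_gradients(grads, local_vars, num_workers):
    """Divide the gradient of every worker-local variable by num_workers.

    grads: dict mapping variable name -> gradient value
    local_vars: set of variable names registered as local (not allreduced)
    num_workers: size of the process set doing the training
    """
    return {
        name: g / num_workers if name in local_vars else g
        for name, g in grads.items()
    }

# "embedding" is registered as local, so its gradient is scaled down by the
# worker count; "dense_kernel" is allreduced elsewhere and left untouched.
grads = {"embedding": 4.0, "dense_kernel": 2.0}
scaled = scale_gradients(grads, local_vars={"embedding"}, num_workers=4)
```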

@MrAta
Contributor Author
MrAta commented Sep 27, 2022

cc @skyw @plliao; would appreciate your review here.

@romerojosh romerojosh added this to the v0.26.0 milestone Sep 27, 2022
@github-actions
github-actions bot commented Sep 28, 2022

Unit Test Results

792 files (-437), 792 suites (-437), 9h 8m 23s ⏱️ (-3h 33m 38s)
840 tests (±0): 717 ✔️ (-69), 123 💤 (+70), 0 failed (-1)
16 685 runs (-8 807): 11 222 ✔️ (-6 763), 5 463 💤 (-2 043), 0 failed (-1)

Results for commit 7d6ef2d. Comparison against base commit 6954391.

♻️ This comment has been updated with latest results.

@github-actions
github-actions bot commented Sep 28, 2022

Unit Test Results (with flaky tests)

894 files (-530), 894 suites (-530), 9h 53m 52s ⏱️ (-4h 10m 26s)
840 tests (±0): 717 ✔️ (-68), 123 💤 (+70), 0 failed (-2)
18 607 runs (-10 898): 12 250 ✔️ (-7 887), 6 357 💤 (-3 009), 0 failed (-2)

Results for commit 7d6ef2d. Comparison against base commit 6954391.

♻️ This comment has been updated with latest results.

@skyw
skyw commented Sep 28, 2022

Should we make it optional? I'd like it to be the default behavior because that is the use case I care about.

But still, there are model-parallel cases in which local gradients don't need to be scaled. Do we need to care about those cases right now?

@MrAta
Contributor Author
MrAta commented Sep 29, 2022

> Should we make it optional? I'd like it to be the default behavior because that is the use case I care about.
>
> But still, there are model-parallel cases in which local gradients don't need to be scaled. Do we need to care about those cases right now?

@skyw are there any use cases that strictly require not scaling the local gradients?
I'm asking because making it optional requires adding a new argument/flag to the current DistributedOptimizer(), and I'm not sure whether I should do that given there's no plan for a major version upgrade right now. @romerojosh what do you think?

@romerojosh
Collaborator

I think adding a new flag with a default to DistributedOptimizer is a good idea. I have marked this PR to be included in our v0.26.0 release so it is ok to add an argument. The default can be to scale the local gradients by the number of workers.

@romerojosh
Collaborator

@MrAta It seems the test failures are unrelated to this PR (Spark and Elastic tests). Also, the docs are failing to build, but I can't quite tell whether anything is wrong with those changes either.

@EnricoMi @maxhgerlach Any ideas?

@MrAta
Contributor Author
MrAta commented Oct 4, 2022

Hi @romerojosh, yes, it seems that in the Spark tests Horovod is not initialized:

/usr/local/lib/python3.8/dist-packages/horovod/tensorflow/keras/__init__.py:136: in DistributedOptimizer
    horovod_size = size_op(process_set_id=process_set.process_set_id) if int(os.environ.get("HOROVOD_ELASTIC", 0)) else process_set.size()
/usr/local/lib/python3.8/dist-packages/horovod/common/process_sets.py:56: in size
    return _basics._process_set_size(self.process_set_id)
...
if result == self.HOROVOD_PROCESS_SET_ERROR_INIT:
>           raise ValueError('Horovod has not been initialized; use hvd.init().')
E           ValueError: Horovod has not been initialized; use hvd.init().

https://github.com/horovod/horovod/actions/runs/3161773548/jobs/5150494743
I'll take a closer look later today.

@EnricoMi
Collaborator
EnricoMi commented Oct 4, 2022

Follow the check link in the test results above.

See the "Raw Output" of the first failure:

        if self.even_set.included():
            self.assertAlmostEqual(computed_value,
                                   sum(range(0, size, 2)) / self.even_set.size())
        else:
>           self.assertAlmostEqual(computed_value, float(hvd.rank()))
E           TypeError: type numpy.ndarray doesn't define __round__ method

test_tensorflow2_keras_process_sets.py:93: TypeError

I think this is the best place to start.
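For context on that TypeError: `assertAlmostEqual` internally calls `round()` on the difference of its arguments, and `round()` requires a `__round__` method, which `numpy.ndarray` does not implement (numpy scalars do, arrays don't). The stand-in class below reproduces the failure mode without depending on numpy, and shows the usual fix of converting to a plain Python float first:

```python
class NdarrayLike:
    """Toy stand-in for a 0-d numpy array: it wraps a number but, like
    numpy.ndarray, does not implement __round__."""

    def __init__(self, value):
        self.value = value

    def __float__(self):
        return float(self.value)

computed_value = NdarrayLike(3.5)  # stand-in for the tensor result in the test

# round() needs __round__, which this object (like ndarray) lacks -> TypeError,
# matching "type numpy.ndarray doesn't define __round__ method" in the log.
try:
    round(computed_value, 7)
    round_failed = False
except TypeError:
    round_failed = True

# A common fix: convert to a plain Python float before the assertion.
fixed = round(float(computed_value), 7)
```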

@maxhgerlach
Collaborator

> Also, the docs are failing to build, but I can't quite tell if there is anything wrong with those changes either.

There are a couple of warning messages (https://readthedocs.org/projects/horovod/builds/18218861/):

/home/docs/checkouts/readthedocs.org/user_builds/horovod/checkouts/3719/horovod/tensorflow/keras/__init__.py:docstring of horovod.tensorflow.keras.DistributedOptimizer:46: WARNING: Definition list ends without a blank line; unexpected unindent.
/home/docs/checkouts/readthedocs.org/user_builds/horovod/checkouts/3719/docs/tensorflow.rst:55: WARNING: Enumerated list ends without a blank line; unexpected unindent.

Sounds like the doc string formatting for horovod.tensorflow.keras.DistributedOptimizer throws off the build.

@MrAta, if you want to try building the API docs locally, there are some pointers here: https://horovod.readthedocs.io/en/stable/contributors_include.html#documentation

@MrAta MrAta force-pushed the gradientscaling branch 2 times, most recently from a96ab85 to fcc31be Compare October 10, 2022 03:10
@maxhgerlach
Collaborator

There's a Spark test failure now that seems to stem from calling process_set.size() although Horovod hasn't been initialized, via

horovod_size = size_op(process_set_id=process_set.process_set_id) if int(os.environ.get("HOROVOD_ELASTIC", 0)) else process_set.size()

Extract from log:

    def _process_set_size(self, process_set_id: int) -> int:
        """ Return size of the process set with the given id. """
        assert isinstance(process_set_id, int)
        result = int(self.MPI_LIB_CTYPES.horovod_process_set_size(
            ctypes.c_int(process_set_id)))
        if result == self.HOROVOD_PROCESS_SET_ERROR_INIT:
>           raise ValueError('Horovod has not been initialized; use hvd.init().')
E           ValueError: Horovod has not been initialized; use hvd.init().

So I think the call to process_set.size() should be moved to the point where the gradient is actually scaled. I'd suggest going with the boolean option scale_local_gradients=True to control this (as in an earlier form of this PR).

Scaling by an arbitrary, fixed float factor (independent of the process set size) is really a separate concern IMHO.
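The deferral suggested above can be sketched as follows (names are illustrative, not Horovod's exact internals): instead of querying the process-set size at optimizer-construction time, when Horovod may not be initialized yet, capture a zero-argument lookup in a closure and evaluate it only when the gradient is actually scaled.

```python
class ProcessSet:
    """Toy process set that, like Horovod's, raises until initialized."""

    def __init__(self):
        self._size = None

    def init(self, size):
        self._size = size  # stands in for hvd.init()

    def size(self):
        if self._size is None:
            raise ValueError('Horovod has not been initialized; use hvd.init().')
        return self._size

def make_local_scaler(process_set):
    # No call to process_set.size() here, so constructing the optimizer is
    # safe even before initialization (unlike the failing code in the log).
    def scale(grad):
        return grad / process_set.size()  # evaluated lazily, at apply time
    return scale

ps = ProcessSet()
scale = make_local_scaler(ps)  # fine: size() has not been called yet
ps.init(4)                     # initialization happens later
```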

MrAta added 11 commits October 10, 2022 09:56
All commits signed off by: Ata FatahiBaarzi <afatahibaarzi@linkedin.com>
@MrAta
Contributor Author
MrAta commented Oct 10, 2022

Thanks for the input, @maxhgerlach @EnricoMi! It seems all the tests pass now.

6 participants