TensorFlow: scale the gradients of local variables by MrAta · Pull Request #3719 · horovod/horovod · GitHub

TensorFlow: scale the gradients of local variables #3719


Merged
11 commits merged on Oct 11, 2022

Conversation

MrAta
Contributor
@MrAta MrAta commented Sep 27, 2022

Checklist before submitting

  • Did you read the contributor guide?
  • Did you update the docs?
  • Did you write any tests to validate this change?
  • Did you update the CHANGELOG, if this change affects users?

Description

This PR is a follow-up to #3695 and the discussion made there, scaling down the gradients of local variables in the partial distributed gradient tape, the distributed optimizer, and the local gradient aggregators.

Fixes #3705
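The core idea of the change can be sketched in plain Python (this is an illustrative toy, not Horovod's actual implementation): gradients of globally shared variables are averaged across workers via allreduce, while gradients of worker-local variables never leave the worker, so to keep the two groups at a consistent effective magnitude, local gradients are divided by the number of workers.

```python
def scale_gradients(grads, local_vars, num_workers):
    """Divide the gradient of every worker-local variable by num_workers.

    grads: dict mapping variable name -> gradient value
    local_vars: set of variable names registered as local (not allreduced)
    num_workers: size of the process set doing the training
    """
    return {
        name: g / num_workers if name in local_vars else g
        for name, g in grads.items()
    }

# "embedding" is registered as local, so its gradient is scaled down by the
# worker count; "dense_kernel" is allreduced elsewhere and left untouched.
grads = {"embedding": 4.0, "dense_kernel": 2.0}
scaled = scale_gradients(grads, local_vars={"embedding"}, num_workers=4)
```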

@MrAta
Contributor Author
MrAta commented Sep 27, 2022

cc @skyw @plliao; would appreciate your review here.

@romerojosh romerojosh added this to the v0.26.0 milestone Sep 27, 2022
@github-actions
github-actions bot commented Sep 28, 2022

Unit Test Results

792 files (-437), 792 suites (-437), 9h 8m 23s ⏱️ (-3h 33m 38s)
840 tests (±0): 717 ✔️ (-69), 123 💤 (+70), 0 failed (-1)
16 685 runs (-8 807): 11 222 ✔️ (-6 763), 5 463 💤 (-2 043), 0 failed (-1)

Results for commit 7d6ef2d. Comparison against base commit 6954391.

♻️ This comment has been updated with latest results.

@github-actions
github-actions bot commented Sep 28, 2022

Unit Test Results (with flaky tests)

894 files (-530), 894 suites (-530), 9h 53m 52s ⏱️ (-4h 10m 26s)
840 tests (±0): 717 ✔️ (-68), 123 💤 (+70), 0 failed (-2)
18 607 runs (-10 898): 12 250 ✔️ (-7 887), 6 357 💤 (-3 009), 0 failed (-2)

Results for commit 7d6ef2d. Comparison against base commit 6954391.

♻️ This comment has been updated with latest results.

@skyw
skyw commented Sep 28, 2022

Should we make it optional? I'd like it to be the default behavior because that is the use case I care about.

But still, there are model-parallel cases in which local gradients don't need to be scaled. Do we need to care about those cases right now?

@MrAta
Contributor Author
MrAta commented Sep 29, 2022

> Should we make it optional? I'd like it to be the default behavior because that is the use case I care about.
>
> But still, there are model-parallel cases in which local gradients don't need to be scaled. Do we need to care about those cases right now?

@skyw are there any use cases that strictly require not scaling the local gradients?
I'm asking because making it optional requires adding a new argument/flag to the current DistributedOptimizer(), and I'm not sure whether I should do that given there's no plan for a major version upgrade right now. @romerojosh what do you think?

@romerojosh
Collaborator

I think adding a new flag with a default to DistributedOptimizer is a good idea. I have marked this PR to be included in our v0.26.0 release so it is ok to add an argument. The default can be to scale the local gradients by the number of workers.

@romerojosh
Collaborator

@MrAta It seems the test failures are unrelated to this PR (Spark and Elastic tests). Also, the docs are failing to build, but I can't quite tell whether anything is wrong with those changes either.

@EnricoMi @maxhgerlach Any ideas?

@MrAta
Contributor Author
MrAta commented Oct 4, 2022

Hi @romerojosh, yes, it seems that in the Spark tests Horovod is not initialized:

/usr/local/lib/python3.8/dist-packages/horovod/tensorflow/keras/__init__.py:136: in DistributedOptimizer
    horovod_size = size_op(process_set_id=process_set.process_set_id) if int(os.environ.get("HOROVOD_ELASTIC", 0)) else process_set.size()
/usr/local/lib/python3.8/dist-packages/horovod/common/process_sets.py:56: in size
    return _basics._process_set_size(self.process_set_id)
...
if result == self.HOROVOD_PROCESS_SET_ERROR_INIT:
>           raise ValueError('Horovod has not been initialized; use hvd.init().')
E           ValueError: Horovod has not been initialized; use hvd.init().

https://github.com/horovod/horovod/actions/runs/3161773548/jobs/5150494743
I'll take a closer look later today.

@EnricoMi
Collaborator
EnricoMi commented Oct 4, 2022

Follow the check link in the test results above.

See the "Raw Output" of the first failure:

        if self.even_set.included():
            self.assertAlmostEqual(computed_value,
                                   sum(range(0, size, 2)) / self.even_set.size())
        else:
>           self.assertAlmostEqual(computed_value, float(hvd.rank()))
E           TypeError: type numpy.ndarray doesn't define __round__ method

test_tensorflow2_keras_process_sets.py:93: TypeError

I think this is the best place to start.
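For context on that TypeError: `assertAlmostEqual` internally calls `round()` on the difference of its arguments, and `round()` requires a `__round__` method, which `numpy.ndarray` does not implement (numpy scalars do, arrays don't). The stand-in class below reproduces the failure mode without depending on numpy, and shows the usual fix of converting to a plain Python float first:

```python
class NdarrayLike:
    """Toy stand-in for a 0-d numpy array: it wraps a number but, like
    numpy.ndarray, does not implement __round__."""

    def __init__(self, value):
        self.value = value

    def __float__(self):
        return float(self.value)

computed_value = NdarrayLike(3.5)  # stand-in for the tensor result in the test

# round() needs __round__, which this object (like ndarray) lacks -> TypeError,
# matching "type numpy.ndarray doesn't define __round__ method" in the log.
try:
    round(computed_value, 7)
    round_failed = False
except TypeError:
    round_failed = True

# A common fix: convert to a plain Python float before the assertion.
fixed = round(float(computed_value), 7)
```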

@maxhgerlach
Collaborator

> Also, the docs are failing to build, but I can't quite tell if there is anything wrong with those changes either.

There are a couple of warning messages (https://readthedocs.org/projects/horovod/builds/18218861/):

/home/docs/checkouts/readthedocs.org/user_builds/horovod/checkouts/3719/horovod/tensorflow/keras/__init__.py:docstring of horovod.tensorflow.keras.DistributedOptimizer:46: WARNING: Definition list ends without a blank line; unexpected unindent.
/home/docs/checkouts/readthedocs.org/user_builds/horovod/checkouts/3719/docs/tensorflow.rst:55: WARNING: Enumerated list ends without a blank line; unexpected unindent.

Sounds like the doc string formatting for horovod.tensorflow.keras.DistributedOptimizer throws off the build.

@MrAta, if you want to try building the API docs locally, there are some pointers here: https://horovod.readthedocs.io/en/stable/contributors_include.html#documentation

@MrAta MrAta force-pushed the gradientscaling branch 2 times, most recently from a96ab85 to fcc31be Compare October 10, 2022 03:10
@maxhgerlach
Collaborator

There's a Spark test failure now that seems to stem from calling process_set.size() although Horovod hasn't been initialized, via

horovod_size = size_op(process_set_id=process_set.process_set_id) if int(os.environ.get("HOROVOD_ELASTIC", 0)) else process_set.size()

Extract from log:

    def _process_set_size(self, process_set_id: int) -> int:
        """ Return size of the process set with the given id. """
        assert isinstance(process_set_id, int)
        result = int(self.MPI_LIB_CTYPES.horovod_process_set_size(
            ctypes.c_int(process_set_id)))
        if result == self.HOROVOD_PROCESS_SET_ERROR_INIT:
>           raise ValueError('Horovod has not been initialized; use hvd.init().')
E           ValueError: Horovod has not been initialized; use hvd.init().

So I think the call to process_set.size() should be moved to the point where the gradient is actually scaled. I'd suggest going with the boolean option scale_local_gradients=True to control this (as in an earlier form of this PR).

Scaling by an arbitrary, fixed float factor (independent of the process set size) is really a separate concern IMHO.
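The deferral suggested above can be sketched as follows (names are illustrative, not Horovod's exact internals): instead of querying the process-set size at optimizer-construction time, when Horovod may not be initialized yet, capture a zero-argument lookup in a closure and evaluate it only when the gradient is actually scaled.

```python
class ProcessSet:
    """Toy process set that, like Horovod's, raises until initialized."""

    def __init__(self):
        self._size = None

    def init(self, size):
        self._size = size  # stands in for hvd.init()

    def size(self):
        if self._size is None:
            raise ValueError('Horovod has not been initialized; use hvd.init().')
        return self._size

def make_local_scaler(process_set):
    # No call to process_set.size() here, so constructing the optimizer is
    # safe even before initialization (unlike the failing code in the log).
    def scale(grad):
        return grad / process_set.size()  # evaluated lazily, at apply time
    return scale

ps = ProcessSet()
scale = make_local_scaler(ps)  # fine: size() has not been called yet
ps.init(4)                     # initialization happens later
```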

MrAta added 11 commits October 10, 2022 09:56
All commits signed off by: Ata FatahiBaarzi <afatahibaarzi@linkedin.com>
@MrAta
Contributor Author
MrAta commented Oct 10, 2022

Thanks for the input, @maxhgerlach @EnricoMi! It seems all the tests pass now.

6 participants