-
Notifications
You must be signed in to change notification settings - Fork 2.3k
TensorFlow: scale the gradients of local variables #3719
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
cc7abab
to
ff0b953
Compare
Unit Test Results 792 files - 437 792 suites - 437 9h 8m 23s ⏱️ - 3h 33m 38s Results for commit 7d6ef2d. ± Comparison against base commit 6954391. ♻️ This comment has been updated with latest results. |
Unit Test Results (with flaky tests) 894 files - 530 894 suites - 530 9h 53m 52s ⏱️ - 4h 10m 26s Results for commit 7d6ef2d. ± Comparison against base commit 6954391. ♻️ This comment has been updated with latest results. |
Should we make it optional? I like it to be the default behavior because that is the use case I care about. But still, there are model parallel cases in which local gradient doesn't need to be scaled. Do we need to care about those case right now? |
@skyw are there any usecases that strictly require not scaling the local gradients? |
I think adding a new flag with a default to |
eae8f46
to
9ac1e47
Compare
@MrAta It seems the test failures are unrelated to this PR (spark and Elastic tests). Also, the docs are failing to build, but I can't quite tell if there is anything wrong with those changes either. @EnricoMi @maxhgerlach Any ideas? |
Hi @romerojosh, yes seems in spark tests, horovod is not initialized :-? /usr/local/lib/python3.8/dist-packages/horovod/tensorflow/keras/__init__.py:136: in DistributedOptimizer
horovod_size = size_op(process_set_id=process_set.process_set_id) if int(os.environ.get("HOROVOD_ELASTIC", 0)) else process_set.size()
/usr/local/lib/python3.8/dist-packages/horovod/common/process_sets.py:56: in size
return _basics._process_set_size(self.process_set_id)
...
if result == self.HOROVOD_PROCESS_SET_ERROR_INIT:
> raise ValueError('Horovod has not been initialized; use hvd.init().')
E ValueError: Horovod has not been initialized; use hvd.init(). https://github.com/horovod/horovod/actions/runs/3161773548/jobs/5150494743 |
Follow the this check link in the test results above: See the "Raw Output" of the first failure:
I think this is the best place to start. |
There are a couple of warning messages (https://readthedocs.org/projects/horovod/builds/18218861/):
Sounds like the doc string formatting for @MrAta, if you want to try building the API docs locally, there are some pointers here: https://horovod.readthedocs.io/en/stable/contributors_include.html#documentation |
a96ab85
to
fcc31be
Compare
There's a Spark test failure now that seem to stem from calling
Extract from log:
So I think the call to Scaling by an arbitrary, fixed float factor (independent of the process set size) is really a separate concern IMHO. |
Signed-off-by: Ata FatahiBaarzi <afatahibaarzi@linkedin.com>
Signed-off-by: Ata FatahiBaarzi <afatahibaarzi@linkedin.com>
Signed-off-by: Ata FatahiBaarzi <afatahibaarzi@linkedin.com>
Signed-off-by: Ata FatahiBaarzi <afatahibaarzi@linkedin.com>
Signed-off-by: Ata FatahiBaarzi <afatahibaarzi@linkedin.com>
Signed-off-by: Ata FatahiBaarzi <afatahibaarzi@linkedin.com>
Signed-off-by: Ata FatahiBaarzi <afatahibaarzi@linkedin.com>
Signed-off-by: Ata FatahiBaarzi <afatahibaarzi@linkedin.com>
Signed-off-by: Ata FatahiBaarzi <afatahibaarzi@linkedin.com>
Signed-off-by: Ata FatahiBaarzi <afatahibaarzi@linkedin.com>
fcc31be
to
7d6ef2d
Compare
Thanks for the inputs @maxhgerlach @EnricoMi! Seems all the tests pass now. |
Checklist before submitting
Description
This PR is a followup to #3695 and the discussion made here to scale down the gradients of local variables in partial distributed gradient tape, distributed optimizer, and the local gradient aggregators.
Fixes # (issue).
#3705