clear locally accumulated gradient by assigning with zeros_like to avoid infinite gradient not correctly cleared up #3505


Merged: 1 commit merged into horovod:master on Apr 15, 2022

Conversation

@yundai424 (Contributor) commented Apr 6, 2022

Checklist before submitting

  • Did you read the contributor guide?
  • Did you update the docs?
  • Did you write any tests to validate this change?
  • Did you update the CHANGELOG, if this change affects users?

Description

Fixes #3504

When clearing the locally aggregated gradient, we should assign with tf.zeros_like instead, because subtracting the aggregated value from itself doesn't work for infinite gradients (inf - inf is NaN).

Added a unit test for this; without the change it fails.
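To illustrate the failure mode outside Horovod (a minimal standalone sketch; the variable name agg is made up for illustration):

```python
import tensorflow as tf

# A locally aggregated gradient that ended up holding an infinite value.
agg = tf.Variable([1.0, float("inf"), 3.0])

# Old approach: "clear" the variable by subtracting it from itself.
agg.assign_add(-1 * agg)
print(agg.numpy())  # [ 0. nan  0.] -- inf - inf is NaN, so the slot is not cleared

# New approach: assign zeros directly, which also clears inf/NaN entries.
agg.assign([1.0, float("inf"), 3.0])  # restore the problematic value
agg.assign(tf.zeros_like(agg))
print(agg.numpy())  # [0. 0. 0.]
```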

Review process to land

  1. All tests and other checks must succeed.
  2. At least one member of the technical steering committee must review and approve.
  3. If any member of the technical steering committee requests changes, they must be addressed.

Commit: …oid infinite gradient not correctly cleared up
Signed-off-by: Yun Dai <yudai@yudai-ld2.linkedin.biz>
@chongxiaoc (Collaborator) left a comment:
LGTM

@chongxiaoc requested review from Tixxx and tgaddair on April 7, 2022 at 20:11
github-actions bot commented Apr 8, 2022:

Unit Test Results

821 files ±0, 821 suites ±0, 9h 19m 7s ⏱️ (-5m 24s)
768 tests +1: 725 ✔️ passed +1, 43 💤 skipped ±0, 0 failed ±0
18 950 runs +38: 13 655 ✔️ passed +34, 5 295 💤 skipped +4, 0 failed ±0

Results for commit c2e8d20. Comparison against base commit 133ef07.

github-actions bot commented Apr 8, 2022:

Unit Test Results (with flaky tests)

913 files +7, 913 suites +7, 10h 23m 43s ⏱️ (+35m 33s)
768 tests +1: 724 ✔️ passed +1, 43 💤 skipped ±0, 1 failed ±0
21 392 runs +256: 15 209 ✔️ passed +239, 6 181 💤 skipped +16, 2 failed +1

For more details on these failures, see this check.

Results for commit c2e8d20. Comparison against base commit 133ef07.

@chongxiaoc requested a review from EnricoMi on April 8, 2022 at 02:53
@chongxiaoc (Collaborator) commented:
@EnricoMi Mind having a second pair of eyes on this minor change to catch inf?

@EnricoMi requested a review from maxhgerlach on April 8, 2022 at 10:09
@EnricoMi (Collaborator) commented Apr 8, 2022:

I have no experience with that part of the code; I'd rather hear @maxhgerlach's opinion.

```diff
@@ -119,8 +119,8 @@ def _allreduce_helper(self, grads, vars):
     def _clear_vars(self):
         self.counter.assign(0)
         for idx in self.locally_aggregated_grads.keys():
-            self.locally_aggregated_grads[idx].assign_add(
-                -1 * self.locally_aggregated_grads[idx])
```
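For context, a sketch of what the fixed method presumably looks like after this change (the added lines are not visible in the hunk above; this simply follows the PR title and the naming of the removed code):

```python
import tensorflow as tf

def _clear_vars(self):
    """Reset the aggregation counter and zero out every locally aggregated gradient."""
    self.counter.assign(0)
    for idx in self.locally_aggregated_grads.keys():
        # Assign zeros instead of subtracting the variable from itself, so that
        # inf entries are actually cleared (inf - inf would be NaN).
        self.locally_aggregated_grads[idx].assign(
            tf.zeros_like(self.locally_aggregated_grads[idx]))
```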
A Collaborator commented:
Is this a breaking change for existing user code?

A Collaborator replied:
inf - inf is NaN, so semantically it would definitely be an improvement to assign zero here.

However, I suspect that tf.assign_add() was originally used here to avoid an extra intermediate memory allocation for the result of tf.zeros_like(). In the past I've seen a similar effect cause perceivable memory waste.

I wonder if this is still the case, or if recent releases of TensorFlow can optimize x.assign(tf.zeros_like(x)) appropriately.
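One way to probe that suspicion empirically (a rough sketch, assuming TF 2.5 or newer and a visible GPU, since the experimental memory-stats API needs a concrete device; the helper peak_extra is made up for illustration):

```python
import tensorflow as tf

dev = "GPU:0"  # assumption: a GPU is available
with tf.device(dev):
    x = tf.Variable(tf.ones([1024, 1024, 64]))  # ~256 MB of float32

def peak_extra(fn):
    """Peak memory allocated above the steady state while running fn."""
    tf.config.experimental.reset_memory_stats(dev)
    fn()
    info = tf.config.experimental.get_memory_info(dev)
    return info["peak"] - info["current"]

# If both numbers come out at roughly one variable-sized buffer, neither form
# saves memory over the other; both materialize a temporary before the assignment.
print("assign(zeros_like):", peak_extra(lambda: x.assign(tf.zeros_like(x))))
print("assign_add(-1 * x):", peak_extra(lambda: x.assign_add(-1 * x)))
```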

A Collaborator asked:
@tgaddair Is there a reason tf.assign_add() was introduced here originally?

A Collaborator replied:
@maxhgerlach I had the same doubt initially, but according to this, it seems like even the old assign_add created a new buffer. I guess the proper way would ultimately be to use the in-place C++ API, which is a bit beyond the scope of this PR.

A Collaborator replied:
I would assume that a.assign_add(-1 * a) consumes the same amount of memory as a.assign(tf.zeros_like(a)), since both likely create temporaries, as mentioned by @Tixxx.

Is there a TF API to just set all values of a tensor to a scalar value?
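For reference, the usual candidates in the public Python API all build a temporary value tensor of the variable's shape before assigning, so none of them is a true in-place scalar fill (a sketch under that assumption, not an exhaustive list):

```python
import tensorflow as tf

v = tf.Variable(tf.random.normal([4, 4]))

# Each of these resets every element, but constructs (or broadcasts) a
# temporary tensor of v's shape before the assignment.
v.assign(tf.zeros_like(v))                                     # zeros, same dtype/shape
v.assign(tf.fill(v.shape, tf.constant(0.0, v.dtype)))          # arbitrary scalar fill
v.assign(tf.broadcast_to(tf.constant(0.0, v.dtype), v.shape))  # broadcast a scalar
```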

@yundai424 (Contributor, PR author) replied on Apr 8, 2022:
When assigning zero to the local gradient we should go through this route. What happens afterwards is the same for Assign and AssignAdd; the only difference is that for Assign the update is params.device(d) = update; while for AssignAdd it's params.device(d) += update;. In that sense it looks pretty much in-place (with a buffer for the update). My understanding could be wrong :)

@tgaddair (Collaborator) left a comment:
cc @rb-determined-ai, can you also review? I believe Determined is using this feature, right?

@rb-determined-ai (Contributor) left a comment:
This all seems very reasonable to me. I dug up the original PR where we landed this line of code, back before we upstreamed the feature. I read through the comments there and didn't see any indication of why it was originally written this way. I even pinged @aaron276h who reviewed it, but he hasn't responded yet and I'm about 10 minutes from going on vacation.

@chongxiaoc merged commit 98db066 into horovod:master on Apr 15, 2022
Labels: None yet

Development: Successfully merging this pull request may close this issue: hvd.DistributedOptimizer gradient accumulation doesn't clean up infinite gradient correctly (#3504)
8 participants