clear locally accumulated gradient by assigning with zeros_like to avoid infinite gradient not correctly cleared up #3505
Conversation
Force-pushed a70f51d to 91598bc (Compare)
…oid infinite gradient not correctly cleared up
Signed-off-by: Yun Dai <yudai@yudai-ld2.linkedin.biz>
Force-pushed 91598bc to c2e8d20 (Compare)
LGTM
Unit Test Results (with flaky tests): 913 files (+7), 913 suites (+7), 10h 23m 43s ⏱️ (+35m 33s). For more details on these failures, see this check. Results for commit c2e8d20. ± Comparison against base commit 133ef07.
@EnricoMi Mind having a second pair of eyes on this minor change to catch inf?
I have no experience with that part of the code; I'd rather hear @maxhgerlach's opinion.
@@ -119,8 +119,8 @@ def _allreduce_helper(self, grads, vars):

    def _clear_vars(self):
        self.counter.assign(0)
        for idx in self.locally_aggregated_grads.keys():
            self.locally_aggregated_grads[idx].assign_add(
                -1 * self.locally_aggregated_grads[idx])
Is this a breaking change for existing user code?
inf - inf is NaN, so semantically it would definitely be an improvement to assign zero here.
However, I suspect that tf.assign_add() was originally used here to avoid an intermediate extra memory allocation for the result of tf.zeros_like(). In the past I've seen a similar effect actually cause perceivable memory waste. I wonder if this is still the case, or if recent releases of TensorFlow can optimize x.assign(tf.zeros_like(x)) appropriately.
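A minimal sketch of the semantic difference, assuming TF 2.x eager execution (the variable here is hypothetical and not part of the PR):

```python
import numpy as np
import tensorflow as tf

# A locally aggregated gradient where one entry has overflowed to +inf.
v = tf.Variable([1.0, np.inf])

# Clearing by subtraction: inf + (-inf) = nan, so that slot is not actually cleared.
v.assign_add(-1 * v)
print(v.numpy())  # [ 0. nan]

# Clearing by assignment: yields exact zeros regardless of inf/nan entries.
v.assign(tf.zeros_like(v))
print(v.numpy())  # [0. 0.]
```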
@tgaddair Is there a reason for introducing tf.assign_add() originally?
@maxhgerlach I had the same doubt initially, but according to this, it seems like even the old assign_add created a new buffer. I guess ultimately the proper way is to use the in-place C++ API, which is a bit more involved than the scope of this PR.
I would assume a.assign_add(-1 * a) consumes the same amount of memory as a.assign(tf.zeros_like(a)), since both likely create temporaries, as mentioned by @Tixxx.
Is there a TF API to just set all values of a tensor to a scalar value?
We would go through this route when assigning zero to the local gradient. What happens afterwards is the same for Assign and AssignAdd; the only difference is that for Assign the update is params.device(d) = update; while for AssignAdd it's params.device(d) += update;. In that sense it looks pretty much in place (with a buffer for the update). My understanding could be wrong :)
cc @rb-determined-ai, can you also review? I believe Determined is using this feature, right?
This all seems very reasonable to me. I dug up the original PR where we landed this line of code, back before we upstreamed the feature. I read through the comments there and didn't see any indication of why it was originally written this way. I even pinged @aaron276h who reviewed it, but he hasn't responded yet and I'm about 10 minutes from going on vacation.
Checklist before submitting
Description
Fixes #3504
When clearing the locally aggregated gradient, we should instead assign with tf.zeros_like, because subtracting the locally aggregated value doesn't work for infinite gradients. Added a unit test for this; without the change it will fail.
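Roughly, the updated _clear_vars assigns zeros instead of subtracting the accumulated value (a sketch based on the diff quoted above; attribute names are taken from that snippet):

```python
def _clear_vars(self):
    self.counter.assign(0)
    for idx in self.locally_aggregated_grads.keys():
        # Assign zeros rather than subtracting the accumulated value,
        # so inf/nan entries are cleared instead of becoming nan.
        self.locally_aggregated_grads[idx].assign(
            tf.zeros_like(self.locally_aggregated_grads[idx]))
```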
Review process to land