clear locally accumulated gradient by assigning with zeros_like to avoid infinite gradient not correctly cleared up #3505
Conversation
Force-pushed a70f51d to 91598bc (Compare)
…oid infinite gradient not correctly cleared up
Signed-off-by: Yun Dai <yudai@yudai-ld2.linkedin.biz>
Force-pushed 91598bc to c2e8d20 (Compare)
LGTM
Unit Test Results (with flaky tests): 913 files (+7), 913 suites (+7), 10h 23m 43s ⏱️ (+35m 33s). For more details on these failures, see this check. Results for commit c2e8d20. ± Comparison against base commit 133ef07.
@EnricoMi Mind having a second pair of eyes on this minor change to catch inf?
I have no experience with that part of the code; I'd rather hear @maxhgerlach's opinion.
@@ -119,8 +119,8 @@ def _allreduce_helper(self, grads, vars):

    def _clear_vars(self):
        self.counter.assign(0)
        for idx in self.locally_aggregated_grads.keys():
            self.locally_aggregated_grads[idx].assign_add(
                -1 * self.locally_aggregated_grads[idx])
Is this a breaking change for existing user code?
inf - inf is NaN, so semantically it would definitely be an improvement to assign zero here.
However, I suspect that tf.assign_add() was originally used here to avoid an intermediate extra memory allocation for the result of tf.zeros_like(). In the past I've seen a similar effect actually cause perceivable memory waste. I wonder if this is still the case, or if recent releases of TensorFlow can optimize x.assign(tf.zeros_like(x)) appropriately.
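A minimal sketch of the semantic difference, assuming TF 2.x eager execution (the variable here is hypothetical and not part of the PR):

```python
import numpy as np
import tensorflow as tf

# A locally aggregated gradient where one entry has overflowed to +inf.
v = tf.Variable([1.0, np.inf])

# Clearing by subtraction: inf + (-inf) = nan, so that slot is not actually cleared.
v.assign_add(-1 * v)
print(v.numpy())  # [ 0. nan]

# Clearing by assignment: yields exact zeros regardless of inf/nan entries.
v.assign(tf.zeros_like(v))
print(v.numpy())  # [0. 0.]
```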
@tgaddair Is there a reason for introducing tf.assign_add() originally?
@maxhgerlach I had the same doubt initially, but according to this, it seems like even the old assign_add created a new buffer. I guess ultimately the proper way is to use the in-place C++ API, which is a bit more involved than the scope of this PR.
I would assume a.assign_add(-1 * a) consumes the same amount of memory as a.assign(tf.zeros_like(a)), since both likely create temporaries, as mentioned by @Tixxx.
Is there a TF API to just set all values of a tensor to a scalar value?
We would go through this route when assigning zero to the local gradient. What happens afterwards is the same for Assign and AssignAdd; the only difference is that for Assign the update is params.device(d) = update; while for AssignAdd it's params.device(d) += update;. In that sense it looks pretty much in place (with a buffer for the update). My understanding could be wrong :)
cc @rb-determined-ai, can you also review? I believe Determined is using this feature, right?
This all seems very reasonable to me. I dug up the original PR where we landed this line of code, back before we upstreamed the feature. I read through the comments there and didn't see any indication of why it was originally written this way. I even pinged @aaron276h who reviewed it, but he hasn't responded yet and I'm about 10 minutes from going on vacation.
Checklist before submitting
Description
Fixes #3504
When clearing the locally aggregated gradient, we should instead assign with tf.zeros_like, because subtracting the locally aggregated value doesn't work for infinite gradients. Added a unit test for this; without the change it will fail.
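Roughly, the updated _clear_vars assigns zeros instead of subtracting the accumulated value (a sketch based on the diff quoted above; attribute names are taken from that snippet):

```python
def _clear_vars(self):
    self.counter.assign(0)
    for idx in self.locally_aggregated_grads.keys():
        # Assign zeros rather than subtracting the accumulated value,
        # so inf/nan entries are cleared instead of becoming nan.
        self.locally_aggregated_grads[idx].assign(
            tf.zeros_like(self.locally_aggregated_grads[idx]))
```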
Review process to land