Optimizing the memory usage in multi_head_self_attention and masked_softmax #2405
Conversation
Looks great! One minor wording fix, and this is good to merge.
self._combined_projection = Linear(input_dim, 2 * attention_dim + values_dim)

self._scale = (input_dim // num_heads) ** 0.5
self._output_projection = Linear(values_dim, self._output_dim)
self._attention_dropout = Dropout(attention_dropout_prob)
if attention_dropout_prob > 0:
What did you decide here? This uses less memory if you specify no dropout?
I ran an experiment just now to compare lambda x: x, dropout(0), and dropout(0.5). It turns out that dropout(0) will not consume more memory than lambda x: x. So, I will change this back.
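For reference, a comparison along those lines could be run with something like the minimal sketch below. The tensor shape, the CUDA device, and the use of torch.cuda.max_memory_allocated are assumptions made for illustration; the thread does not show the actual experiment.

import torch
from torch.nn import Dropout

# Hypothetical input: shape and device are assumptions, not taken from the PR.
x = torch.randn(64, 512, 512, device="cuda", requires_grad=True)

for name, layer in [("lambda x: x", lambda t: t),
                    ("dropout(0)", Dropout(0.0)),
                    ("dropout(0.5)", Dropout(0.5))]:
    torch.cuda.reset_peak_memory_stats()
    out = layer(x)           # freshly built Dropout modules default to training mode
    out.sum().backward()     # include backward-pass allocations in the peak
    x.grad = None
    peak_mib = torch.cuda.max_memory_allocated() / (1024 ** 2)
    print(f"{name}: peak {peak_mib:.1f} MiB")
    del out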
allennlp/nn/util.py (Outdated)
of ``0.0``. This behavior may cause ``NaN`` if this is used as the last layer of a model
that uses categorical cross-entropy loss.
If ``memory_efficient`` is set to true, we will simply use a very large negative number for those
masked positions so that the probabilities of them positions would be approximately 0.
s/of them positions/of those positions/
Do we need to keep the other implementation so things don't break? I'd be in favour of just replacing the current one.
Yeah, that's a good point. I was wary of changing the behavior, which is why I recommended that @yizhongw do it this way in the first place. The only place this changes things is when the vector is entirely masked. Before, you would get a vector of all zeros. Now, you will get a uniform distribution. What do you think?
Ah that's annoying. I'm not sure if there's anywhere in the code that we make that assumption, but if there is and we break it by accident, it will be nearly impossible to find... Probably we should keep it.
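To make that behavioural difference concrete, here is a minimal sketch of the two code paths the docstring describes, including the fully-masked edge case. This is an illustration of the idea, not the exact allennlp/nn/util.py implementation, and the signature here is assumed for the example.

import torch
import torch.nn.functional as F

def masked_softmax(vector: torch.Tensor,
                   mask: torch.Tensor,
                   dim: int = -1,
                   memory_efficient: bool = False) -> torch.Tensor:
    if not memory_efficient:
        # Original behaviour: zero out masked positions after the softmax and
        # renormalize. A row whose mask is all zeros comes out as all zeros.
        result = F.softmax(vector, dim=dim)
        result = result * mask
        return result / (result.sum(dim=dim, keepdim=True) + 1e-13)
    # Memory-efficient behaviour: fill masked positions with a very large
    # negative number before the softmax, so their probabilities are ~0.
    # A row whose mask is all zeros comes out as a uniform distribution.
    masked_vector = vector.masked_fill(mask == 0, -1e32)
    return F.softmax(masked_vector, dim=dim)

scores = torch.randn(1, 4)
fully_masked = torch.zeros(1, 4)
print(masked_softmax(scores, fully_masked))                          # all zeros
print(masked_softmax(scores, fully_masked, memory_efficient=True))   # ~0.25 each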
Optimizing the memory usage in multi_head_self_attention and masked_softmax. Fixes Reduce memory use of multi-head self-attention #2185.