Optimizing the memory usage in multi_head_self_attention and masked_softmax #2405
Conversation
Looks great! One minor wording fix, and this is good to merge.
self._combined_projection = Linear(input_dim, 2 * attention_dim + values_dim)

self._scale = (input_dim // num_heads) ** 0.5
self._output_projection = Linear(values_dim, self._output_dim)
self._attention_dropout = Dropout(attention_dropout_prob)
if attention_dropout_prob > 0:
What did you decide here? This uses less memory if you specify no dropout?
I ran an experiment just now to compare lambda x: x, dropout(0), and dropout(0.5). It turns out that dropout(0) will not consume more memory than lambda x: x. So, I will change this back.
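For reference, a comparison along those lines could be run with something like the minimal sketch below. The tensor shape, the CUDA device, and the use of torch.cuda.max_memory_allocated are assumptions made for illustration; the thread does not show the actual experiment.

import torch
from torch.nn import Dropout

# Hypothetical input: shape and device are assumptions, not taken from the PR.
x = torch.randn(64, 512, 512, device="cuda", requires_grad=True)

for name, layer in [("lambda x: x", lambda t: t),
                    ("dropout(0)", Dropout(0.0)),
                    ("dropout(0.5)", Dropout(0.5))]:
    torch.cuda.reset_peak_memory_stats()
    out = layer(x)           # freshly built Dropout modules default to training mode
    out.sum().backward()     # include backward-pass allocations in the peak
    x.grad = None
    peak_mib = torch.cuda.max_memory_allocated() / (1024 ** 2)
    print(f"{name}: peak {peak_mib:.1f} MiB")
    del out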
allennlp/nn/util.py (Outdated)
of ``0.0``. This behavior may cause ``NaN`` if this is used as the last layer of a model
that uses categorical cross-entropy loss.
If ``memory_efficient`` is set to true, we will simply use a very large negative number for those
masked positions so that the probabilities of them positions would be approximately 0.
s/of them positions/of those positions/
Do we need to keep the other implementation so things don't break? I'd be in favour of just replacing the current one.
Yeah, that's a good point. I was wary of changing the behavior, which is why I recommended that @yizhongw do it this way in the first place. The only place this changes things is when the vector is entirely masked. Before, you would get a vector of all zeros. Now, you will get a uniform distribution. What do you think?
Ah that's annoying. I'm not sure if there's anywhere in the code that we make that assumption, but if there is and we break it by accident, it will be nearly impossible to find... Probably we should keep it.
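To make that behavioural difference concrete, here is a minimal sketch of the two code paths the docstring describes, including the fully-masked edge case. This is an illustration of the idea, not the exact allennlp/nn/util.py implementation, and the signature here is assumed for the example.

import torch
import torch.nn.functional as F

def masked_softmax(vector: torch.Tensor,
                   mask: torch.Tensor,
                   dim: int = -1,
                   memory_efficient: bool = False) -> torch.Tensor:
    if not memory_efficient:
        # Original behaviour: zero out masked positions after the softmax and
        # renormalize. A row whose mask is all zeros comes out as all zeros.
        result = F.softmax(vector, dim=dim)
        result = result * mask
        return result / (result.sum(dim=dim, keepdim=True) + 1e-13)
    # Memory-efficient behaviour: fill masked positions with a very large
    # negative number before the softmax, so their probabilities are ~0.
    # A row whose mask is all zeros comes out as a uniform distribution.
    masked_vector = vector.masked_fill(mask == 0, -1e32)
    return F.softmax(masked_vector, dim=dim)

scores = torch.randn(1, 4)
fully_masked = torch.zeros(1, 4)
print(masked_softmax(scores, fully_masked))                          # all zeros
print(masked_softmax(scores, fully_masked, memory_efficient=True))   # ~0.25 each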
Optimizing the memory usage in multi_head_self_attention and masked_softmax. Fixes Reduce memory use of multi-head self-attention #2185.