Optimizing the memory usage in `multi_head_self_attention` and `masked_softmax` by yizhongw · Pull Request #2405 · allenai/allennlp
This repository was archived by the owner on Dec 16, 2022. It is now read-only.

Optimizing the memory usage in multi_head_self_attention and masked_softmax #2405

Merged
merged 4 commits into allenai:master from yizhongw:optimizing-memory on Jan 24, 2019

Conversation

yizhongw
Contributor
@yizhongw yizhongw commented Jan 20, 2019

@yizhongw yizhongw changed the title Optimizing the memory usage in multi_head_self_attention and maske_softmax Optimizing the memory usage in multi_head_self_attention and masked_softmax Jan 20, 2019
Contributor
@matt-gardner matt-gardner left a comment


Looks great! One minor wording fix, and this is good to merge.

self._combined_projection = Linear(input_dim, 2 * attention_dim + values_dim)

self._scale = (input_dim // num_heads) ** 0.5
self._output_projection = Linear(values_dim, self._output_dim)
self._attention_dropout = Dropout(attention_dropout_prob)
if attention_dropout_prob > 0:
Contributor

What did you decide here? This uses less memory if you specify no dropout?

Contributor Author

I ran an experiment just now to compare lambda x: x, dropout(0), and dropout(0.5). It turns out that dropout(0) does not consume more memory than lambda x: x, so I will change this back.
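A rough sketch of the kind of comparison described above (assuming PyTorch with a CUDA device; the tensor shape and the torch.cuda.reset_peak_memory_stats / torch.cuda.max_memory_allocated calls are illustrative assumptions, not the author's actual experiment):

import torch
from torch.nn import Dropout

def peak_memory_mib(fn, shape=(32, 8, 256, 256)):
    # Apply ``fn`` to a random tensor, run backward, and report peak GPU memory in MiB.
    torch.cuda.reset_peak_memory_stats()
    x = torch.randn(*shape, device="cuda", requires_grad=True)
    fn(x).sum().backward()
    return torch.cuda.max_memory_allocated() / 1024 ** 2

for name, fn in [("identity", lambda x: x),
                 ("dropout(0.0)", Dropout(0.0)),
                 ("dropout(0.5)", Dropout(0.5))]:
    print(f"{name}: {peak_memory_mib(fn):.1f} MiB")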

of ``0.0``. This behavior may cause ``NaN`` if this is used as the last layer of a model
that uses categorical cross-entropy loss.
If ``memory_efficient`` is set to true, we will simply use a very large negative number for those
masked positions so that the probabilities of them positions would be approximately 0.
Contributor

s/of them positions/of those positions/

@DeNeutoy
Contributor

Do we need to keep the other implementation so things don't break? I'd be in favour of just replacing the current masked_softmax if it's considerably more memory efficient.

@matt-gardner
Contributor

Yeah, that's a good point. I was wary of changing the behavior, which is why I recommended that @yizhongw do it this way in the first place. The only place this changes things is when the vector is entirely masked. Before, you would get a vector of all zeros. Now, you will get a uniform distribution. What do you think?
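To make the difference concrete, here is a minimal PyTorch sketch of the two behaviors on an entirely masked row; it is an illustration under simplified assumptions, not the exact allennlp masked_softmax implementation:

import torch
import torch.nn.functional as F

def masked_softmax(vector: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    # Current-style behavior (sketch): zero out masked positions after the
    # softmax and re-normalize. An entirely masked row comes out as all zeros.
    result = F.softmax(vector, dim=-1) * mask
    return result / (result.sum(dim=-1, keepdim=True) + 1e-13)

def masked_softmax_memory_efficient(vector: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    # Memory-efficient behavior (sketch): replace masked positions with a very
    # large negative number before the softmax, so their probabilities come out
    # approximately 0. An entirely masked row comes out as a uniform distribution.
    return F.softmax(vector.masked_fill(mask == 0, -1e32), dim=-1)

vector = torch.randn(1, 4)
mask = torch.zeros(1, 4)  # every position is masked
print(masked_softmax(vector, mask))                   # tensor([[0., 0., 0., 0.]])
print(masked_softmax_memory_efficient(vector, mask))  # tensor([[0.2500, 0.2500, 0.2500, 0.2500]])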

@DeNeutoy
Copy link
Contributor

Ah, that's annoying. I'm not sure if there's anywhere in the code where we make that assumption, but if there is and we break it by accident, it will be nearly impossible to find. Probably we should keep it.

@matt-gardner matt-gardner merged commit b6cc9d3 into allenai:master Jan 24, 2019
@yizhongw yizhongw deleted the optimizing-memory branch on January 24, 2019 22:18