calypso transformer by joelgrus · Pull Request #2049 · allenai/allennlp · GitHub
This repository was archived by the owner on Dec 16, 2022. It is now read-only.

calypso transformer #2049

Merged
merged 4 commits into allenai:master
Nov 21, 2018

Conversation

joelgrus
Contributor

there's still some potential duplicate code, but it's not exact duplication, so it would take some work to factor out the common parts and I don't know that that's a great use of anyone's time.

for the most part this is just a straight port + adding type annotations and minor cleanup

@joelgrus joelgrus requested a review from brendan-ai2 November 13, 2018 18:58
@brendan-ai2
Contributor

Process question: Given that this is mostly raw Calypso code, should this be more on the stamp end of the code review spectrum or should I be more detailed?

@joelgrus
Contributor Author

if you see something that you think could be improved, then you should point it out.

but if you have questions like "why did you implement X this way?" the answer is likely going to be "that's how it was implemented in calypso and I just copied the code over" 🤷‍♀️

@@ -0,0 +1,265 @@
"""
The BidirectionalTransformerEncoder from calypso
Contributor

I think we should link the annotated transformer tutorial (http://nlp.seas.harvard.edu/2018/04/03/attention.html) here, as this implementation appears to be based on it. Conveniently, it's MIT licensed: https://github.com/harvardnlp/annotated-transformer/blob/master/LICENSE

class SublayerConnection(torch.nn.Module):
"""
A residual connection followed by a layer norm.
Note for code simplicity the norm is first as opposed to last.
Contributor

Huh. This is different from the original transformer described in "Attention Is All You Need", which uses self.norm(x + self.dropout(sublayer(x))). Not sure why that's the more complicated option. Regardless, since this is how Calypso/the Annotated Transformer do it, looks good...

Contributor

Ah, it's simply that they apply the norm later. Not that it's left out.
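For reference, a minimal sketch of the two orderings being discussed. This is illustrative only: it uses torch.nn.LayerNorm and made-up class names, not the exact Calypso/Annotated Transformer code.

```python
import torch


class PreNormSublayerConnection(torch.nn.Module):
    """The ordering in this PR: normalize first, apply the sublayer, then add the residual."""
    def __init__(self, size: int, dropout: float) -> None:
        super().__init__()
        self.norm = torch.nn.LayerNorm(size)
        self.dropout = torch.nn.Dropout(dropout)

    def forward(self, x: torch.Tensor, sublayer) -> torch.Tensor:
        return x + self.dropout(sublayer(self.norm(x)))


class PostNormSublayerConnection(torch.nn.Module):
    """The "Attention Is All You Need" ordering: add the residual first, then normalize."""
    def __init__(self, size: int, dropout: float) -> None:
        super().__init__()
        self.norm = torch.nn.LayerNorm(size)
        self.dropout = torch.nn.Dropout(dropout)

    def forward(self, x: torch.Tensor, sublayer) -> torch.Tensor:
        return self.norm(x + self.dropout(sublayer(x)))
```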

self.d_k = input_dim // num_heads
self.num_heads = num_heads
self.linears = util.clone(torch.nn.Linear(input_dim, input_dim), 4)
self.attn = None
Contributor

Appears to be dead code.

# We assume d_v always equals d_k
self.d_k = input_dim // num_heads
self.num_heads = num_heads
self.linears = util.clone(torch.nn.Linear(input_dim, input_dim), 4)
Contributor

It's a bit scary to have these stored as a list because fundamentally they're heterogeneous: the first three layers are for projecting the query, key, and value respectively, and the final layer is for projecting the concatenated heads. Since you're just porting, I don't want to request any big changes, but a warning comment here would be great.
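To make the heterogeneity concrete, here is a rough sketch of how four cloned linears are typically used in this style of multi-head attention. The function and variable names are illustrative, not taken from the ported code.

```python
import torch


def multi_head_projections(linears: torch.nn.ModuleList,
                           query: torch.Tensor,
                           key: torch.Tensor,
                           value: torch.Tensor,
                           num_heads: int,
                           d_k: int) -> torch.Tensor:
    # linears[0], linears[1], linears[2] project the query, key, and value;
    # linears[3] projects the concatenated heads back to the model dimension.
    batch_size = query.size(0)

    # Project each input and split it into heads: (batch, num_heads, seq_len, d_k).
    query, key, value = [
        layer(x).view(batch_size, -1, num_heads, d_k).transpose(1, 2)
        for layer, x in zip(linears[:3], (query, key, value))
    ]

    # Scaled dot-product attention over the heads would go here; the values are
    # passed through unchanged as a placeholder so the reshaping below still
    # demonstrates the shapes involved.
    attended = value

    # Concatenate the heads and apply the final projection.
    attended = attended.transpose(1, 2).contiguous().view(batch_size, -1, num_heads * d_k)
    return linears[3](attended)
```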

if input_dropout:
self._dropout = torch.nn.Dropout(input_dropout)
else:
self._dropout = lambda x: x
Contributor

For my edification, have we found there to be substantial overhead to torch.nn.Dropout(0.0)?
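For what it's worth, a quick sanity check (assuming nothing about the ported code beyond the snippet above) shows the two branches produce identical outputs; the lambda only skips the module call overhead:

```python
import torch

x = torch.randn(4, 8)

dropout = torch.nn.Dropout(0.0)  # p=0.0 never zeroes anything and scales by 1/(1-0) = 1
identity = lambda t: t           # the fallback used when input_dropout is falsy

assert torch.allclose(dropout(x), identity(x))  # same result; the question is purely about overhead
```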

Contributor
@brendan-ai2 brendan-ai2 left a comment

The only critical comment is the licensing one. The rest are minor/for my understanding. Thanks for the PR, Joel!

@brendan-ai2
Contributor

Per Matt's suggestion on Slack, we should probably mark this as AllenNLP-internal somehow, in the hopes of consolidating to one transformer implementation soon.

Also, it would be great to get this merged before the break if you have time. :)

@joelgrus
Contributor Author

comments addressed, sorry about the delay

@joelgrus joelgrus merged commit 6b16222 into allenai:master Nov 21, 2018
@brendan-ai2
Contributor

Thanks, Joel! No worries at all.
