Conversation
Process question: Given that this is mostly raw Calypso code, should this be more on the stamp end of the code-review spectrum, or should I be more detailed?
If you see something that you think could be improved, then you should point it out. But if you have questions like "why did you implement X this way?", the answer is likely going to be "that's how it was implemented in Calypso and I just copied the code over" 🤷♀️
@@ -0,0 +1,265 @@
"""
The BidirectionalTransformerEncoder from calypso
I think we should link the annotated transformer tutorial (http://nlp.seas.harvard.edu/2018/04/03/attention.html) here, as this implementation appears to be based on it. Conveniently, it's MIT-licensed: https://github.com/harvardnlp/annotated-transformer/blob/master/LICENSE
class SublayerConnection(torch.nn.Module):
    """
    A residual connection followed by a layer norm.
    Note for code simplicity the norm is first as opposed to last.
Huh. This is different from the original transformer described in "Attention Is All You Need", which uses `self.norm(x + self.dropout(sublayer(x)))`. Not sure why that would be more complicated. Regardless, since this is how Calypso/the Annotated Transformer do it, looks good...
Ah, it's simply that they apply the norm later. Not that it's left out.
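For reference, here is a minimal sketch of the two orderings being discussed. The class names, constructor arguments, and use of `torch.nn.LayerNorm` are illustrative assumptions following the Annotated Transformer convention, not the PR's actual code.

```python
import torch


class PreNormSublayerConnection(torch.nn.Module):
    """Pre-norm residual block (Annotated Transformer / this port):
    the layer norm is applied to the sublayer's input, not to the sum."""

    def __init__(self, size: int, dropout: float) -> None:
        super().__init__()
        self.norm = torch.nn.LayerNorm(size)
        self.dropout = torch.nn.Dropout(dropout)

    def forward(self, x: torch.Tensor, sublayer) -> torch.Tensor:
        # x + dropout(sublayer(norm(x)))
        return x + self.dropout(sublayer(self.norm(x)))


class PostNormSublayerConnection(torch.nn.Module):
    """The "Attention Is All You Need" ordering quoted above:
    norm(x + dropout(sublayer(x)))."""

    def __init__(self, size: int, dropout: float) -> None:
        super().__init__()
        self.norm = torch.nn.LayerNorm(size)
        self.dropout = torch.nn.Dropout(dropout)

    def forward(self, x: torch.Tensor, sublayer) -> torch.Tensor:
        return self.norm(x + self.dropout(sublayer(x)))
```

Both blocks take `sublayer` as a callable (e.g. an attention or feed-forward module); only the placement of the norm differs.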
        self.d_k = input_dim // num_heads
        self.num_heads = num_heads
        self.linears = util.clone(torch.nn.Linear(input_dim, input_dim), 4)
        self.attn = None
Appears to be dead code.
        # We assume d_v always equals d_k
        self.d_k = input_dim // num_heads
        self.num_heads = num_heads
        self.linears = util.clone(torch.nn.Linear(input_dim, input_dim), 4)
It's a bit scary to have these stored as a list because fundamentally they're heterogeneous. The first three layers are for projecting the query, key, and value respectively, and the final layer is for projecting the concatenated heads. Since you're just porting I don't want to request any big changes, but a warning comment here would be great.
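To illustrate the concern, here is a sketch of how the four cloned linears are typically consumed in an Annotated Transformer-style multi-head attention, with the kind of warning comment suggested above. The class name, signature, and use of `ModuleList` are assumptions for the example; the PR's actual forward pass may differ (e.g. masking and attention dropout are omitted here).

```python
import math

import torch


class MultiHeadedAttention(torch.nn.Module):
    """Sketch of how the four identically shaped linears are used."""

    def __init__(self, num_heads: int, input_dim: int) -> None:
        super().__init__()
        assert input_dim % num_heads == 0
        self.d_k = input_dim // num_heads
        self.num_heads = num_heads
        # WARNING: heterogeneous despite identical shapes.
        # linears[0..2] project the query, key, and value respectively;
        # linears[3] projects the concatenated heads back to input_dim.
        self.linears = torch.nn.ModuleList(
            [torch.nn.Linear(input_dim, input_dim) for _ in range(4)])

    def forward(self, query, key, value):
        batch_size = query.size(0)
        # Project the inputs and split into heads: (batch, num_heads, seq_len, d_k).
        query, key, value = [
            layer(x).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
            for layer, x in zip(self.linears[:3], (query, key, value))
        ]
        # Scaled dot-product attention (masking and dropout omitted).
        scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(self.d_k)
        context = torch.matmul(torch.softmax(scores, dim=-1), value)
        # Re-concatenate the heads and apply the final output projection.
        context = context.transpose(1, 2).contiguous().view(
            batch_size, -1, self.num_heads * self.d_k)
        return self.linears[-1](context)
```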
        if input_dropout:
            self._dropout = torch.nn.Dropout(input_dropout)
        else:
            self._dropout = lambda x: x
For my edification, have we found there to be substantial overhead to `torch.nn.Dropout(0.0)`?
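For context on the question, a small sketch of the two patterns (purely illustrative; no claim is made here about measured overhead):

```python
import torch

input_dropout = 0.0  # e.g. the constructor argument

# Pattern in the PR: skip the module entirely when dropout is disabled.
if input_dropout:
    dropout = torch.nn.Dropout(input_dropout)
else:
    dropout = lambda x: x  # plain identity function

# Alternative being asked about: always construct the module.
# With p=0.0, torch.nn.Dropout leaves the values unchanged; any overhead
# is just the extra module/functional dispatch during training.
always_dropout = torch.nn.Dropout(p=0.0)

x = torch.randn(2, 3)
assert torch.equal(dropout(x), x)
assert torch.equal(always_dropout(x), x)
```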
The only critical comment is the licensing one. The rest are minor/for my understanding. Thanks for the PR, Joel!
Per Matt's suggestion on Slack, we should probably mark this as AllenNLP-internal somehow, in the hopes of consolidating to one transformer implementation soon. Also, it would be great to get this merged before the break if you have time. :)
Comments addressed, sorry about the delay.
Thanks, Joel! No worries at all.
There's still some potential duplicate code, but it's not exact duplication, so it would take some work to factor out the common parts, and I don't know that that's a great use of anyone's time.
For the most part this is just a straight port plus type annotations and minor cleanup.