Allow embedding extension to load from pre-trained embeddings file. #2387
Conversation
Looks great, thanks! Just a few minor things, then this is good to merge.
def extend_vocab(self,  # pylint: disable=arguments-differ
                 extended_vocab: Vocabulary,
                 vocab_namespace: str = None,
                 pretrained_file: str = None):
Can you annotate the return type here (-> None)?
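For reference, a minimal sketch of what that would look like (parameter list taken from the diff above; the body is elided):

def extend_vocab(self,  # pylint: disable=arguments-differ
                 extended_vocab: Vocabulary,
                 vocab_namespace: str = None,
                 pretrained_file: str = None) -> None:
    ...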
""" | ||
Extends the embedding matrix according to the extended vocabulary. | ||
Extended weight would be initialized with xavier uniform. | ||
If pretrained_file is available, it will be used for extented weight |
Wording tweak: "If pretrained_file is available, it will be used for initializing the new words in the extended vocabulary; otherwise they will be initialized with xavier uniform."
    extra_weight = torch.FloatTensor(extra_num_embeddings, embedding_dim)
    torch.nn.init.xavier_uniform_(extra_weight)
else:
    whole_weight = _read_pretrained_embeddings_file(pretrained_file, embedding_dim,
This was not what I expected, but it should work. I was imagining some kind of loop over the file just looking for the new vocab entries. But, this is much simpler logic, and the computation saved by the more complex logic is probably negligible. This could at least use a comment, I think, like # It's easiest to just reload the embeddings for the entire vocab, then only keep the ones we need.
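A rough sketch of that approach with the suggested comment in place; the slicing line and the assumption that the original size is stored as self.num_embeddings are illustrative, not the actual diff:

# It's easiest to just reload the embeddings for the entire vocab,
# then only keep the ones we need.
whole_weight = _read_pretrained_embeddings_file(pretrained_file, embedding_dim,
                                                extended_vocab, vocab_namespace)
# Keep only the rows for tokens that were not in the original vocabulary.
extra_weight = whole_weight[self.num_embeddings:, :]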
def extend_vocab(self,  # pylint: disable=arguments-differ
                 extended_vocab: Vocabulary,
                 vocab_namespace: str = "token_characters",
                 pretrained_file: str = None):
Same two comments on this class (return type annotation and docstring wording tweak).
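For illustration, those two changes applied here might look like the following sketch (docstring wording taken from the suggestion above; not the final code):

def extend_vocab(self,  # pylint: disable=arguments-differ
                 extended_vocab: Vocabulary,
                 vocab_namespace: str = "token_characters",
                 pretrained_file: str = None) -> None:
    """
    Extends the embedding matrix according to the extended vocabulary.
    If pretrained_file is available, it will be used for initializing the new words
    in the extended vocabulary; otherwise they will be initialized with xavier uniform.
    """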
assert tuple(original_weight.size()) == (4, 3)  # 4 because of padding and OOV

vocab.add_token_to_namespace('word3')
Can you change the embeddings for the original vocab here (simulating training), just as an additional sanity check that they don't get modified by extend_vocab() with a pre-trained file?
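One possible shape for that check; embedder, pretrained_file, and the use of .weight are assumptions based on the snippet above, not the actual test code:

# Simulate training by perturbing the original weights before extending.
embedder.weight.data += 1.0
snapshot = embedder.weight.data.clone()

vocab.add_token_to_namespace('word3')
embedder.extend_vocab(vocab, pretrained_file=pretrained_file)

# Rows for the original vocabulary should be unchanged by extend_vocab().
assert torch.equal(embedder.weight.data[:snapshot.size(0), :], snapshot)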
This should be good to merge now.
@matt-gardner If this looks good now, can you merge it? I am waiting on this to open a follow-up.
Looks good, thanks!
This is a follow-up to #2374. @matt-gardner, after #2374 is merged, can you please review this?