Allow embedding extension to load from pre-trained embeddings file. by HarshTrivedi · Pull Request #2387 · allenai/allennlp
This repository was archived by the owner on Dec 16, 2022. It is now read-only.

Allow embedding extension to load from pre-trained embeddings file. #2387

Merged

Conversation

HarshTrivedi (Contributor)

This is a follow-up to #2374. @matt-gardner, after #2374 is merged, can you please review this?

@matt-gardner (Contributor) left a comment


Looks great, thanks! Just a few minor things, then this is good to merge.

def extend_vocab(self,  # pylint: disable=arguments-differ
                 extended_vocab: Vocabulary,
                 vocab_namespace: str = None,
                 pretrained_file: str = None):

Can you annotate the return type here ( -> None)?

"""
Extends the embedding matrix according to the extended vocabulary.
Extended weight would be initialized with xavier uniform.
If pretrained_file is available, it will be used for extended weight

Wording tweak: "If pretrained_file is available, it will be used for initializing the new words in the extended vocabulary; otherwise they will be initialized with xavier uniform."
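
For context, a minimal sketch of the fallback behaviour that wording describes. The helper name _initialize_extra_rows is hypothetical; in the PR the logic lives inline in Embedding.extend_vocab, and only extra_num_embeddings and embedding_dim come from the diff below.

import torch

def _initialize_extra_rows(extra_num_embeddings: int,
                           embedding_dim: int,
                           pretrained_rows: torch.Tensor = None) -> torch.Tensor:
    # New words in the extended vocabulary take their rows from the
    # pre-trained file when one is available...
    if pretrained_rows is not None:
        return pretrained_rows
    # ...otherwise they are initialized with xavier uniform.
    extra_weight = torch.FloatTensor(extra_num_embeddings, embedding_dim)
    torch.nn.init.xavier_uniform_(extra_weight)
    return extra_weight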

    extra_weight = torch.FloatTensor(extra_num_embeddings, embedding_dim)
    torch.nn.init.xavier_uniform_(extra_weight)
else:
    whole_weight = _read_pretrained_embeddings_file(pretrained_file, embedding_dim,

This was not what I expected, but it should work. I was imagining some kind of loop over the file just looking for the new vocab entries. But, this is much simpler logic, and the computation saved by the more complex logic is probably negligible. This could at least use a comment, I think, like # It's easiest to just reload the embeddings for the entire vocab, then only keep the ones we need.
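
For reference, a rough sketch (not the merged code) of the "reload everything, then slice off the new rows" approach described here. It assumes AllenNLP's Vocabulary and the private _read_pretrained_embeddings_file helper that the diff above already calls; extra_rows_from_pretrained_file and original_num_embeddings are illustrative names, not part of the PR.

import torch
from allennlp.data import Vocabulary
from allennlp.modules.token_embedders.embedding import _read_pretrained_embeddings_file

def extra_rows_from_pretrained_file(pretrained_file: str,
                                    embedding_dim: int,
                                    extended_vocab: Vocabulary,
                                    vocab_namespace: str,
                                    original_num_embeddings: int) -> torch.Tensor:
    # It's easiest to just reload the embeddings for the entire (extended) vocab,
    # then only keep the rows that correspond to the newly added tokens.
    whole_weight = _read_pretrained_embeddings_file(pretrained_file, embedding_dim,
                                                    extended_vocab, vocab_namespace)
    return whole_weight[original_num_embeddings:, :]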

def extend_vocab(self,  # pylint: disable=arguments-differ
                 extended_vocab: Vocabulary,
                 vocab_namespace: str = "token_characters",
                 pretrained_file: str = None):

Same two comments on this class (return type annotation and docstring wording tweak).


assert tuple(original_weight.size()) == (4, 3) # 4 because of padding and OOV

vocab.add_token_to_namespace('word3')

Can you change the embeddings for the original vocab here (simulating training), just as an additional sanity check that they don't get modified by extend_vocab() with a pre-trained file?
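
A hedged sketch of the extra check being requested, not the PR's actual test; the pre-trained file path and the exact setup are illustrative. The idea is to perturb the original rows to simulate training, then confirm that extend_vocab with a pre-trained file leaves them untouched.

import torch
from allennlp.data import Vocabulary
from allennlp.modules.token_embedders import Embedding

vocab = Vocabulary()
vocab.add_token_to_namespace('word1')
vocab.add_token_to_namespace('word2')
embedder = Embedding(num_embeddings=vocab.get_vocab_size(), embedding_dim=3)

# Simulate training by overwriting the original weights with random values.
torch.nn.init.normal_(embedder.weight)
original_weight = embedder.weight.data.clone()

vocab.add_token_to_namespace('word3')
embedder.extend_vocab(vocab, vocab_namespace='tokens',
                      pretrained_file='path/to/embeddings_file.txt')  # illustrative path

# Rows for the original vocabulary must be unchanged; only 'word3' gets a new row.
assert torch.equal(embedder.weight.data[:original_weight.size(0), :], original_weight)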

@HarshTrivedi (Contributor, Author)

This should be good to merge now.

@HarshTrivedi (Contributor, Author)

@matt-gardner If this looks good now, can you merge it? I am waiting on it before opening a follow-up PR.

@matt-gardner (Contributor)

Looks good, thanks!
