Allow embedding extension to load from pre-trained embeddings file. #2387
Conversation
Looks great, thanks! Just a few minor things, then this is good to merge.
def extend_vocab(self,  # pylint: disable=arguments-differ
                 extended_vocab: Vocabulary,
                 vocab_namespace: str = None,
                 pretrained_file: str = None):
Can you annotate the return type here (-> None)?
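For reference, a minimal sketch of what that would look like (parameter list taken from the diff above; the body is elided):

def extend_vocab(self,  # pylint: disable=arguments-differ
                 extended_vocab: Vocabulary,
                 vocab_namespace: str = None,
                 pretrained_file: str = None) -> None:
    ...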
""" | ||
Extends the embedding matrix according to the extended vocabulary. | ||
Extended weight would be initialized with xavier uniform. | ||
If pretrained_file is available, it will be used for extented weight |
Wording tweak: "If pretrained_file is available, it will be used for initializing the new words in the extended vocabulary; otherwise they will be initialized with xavier uniform."
    extra_weight = torch.FloatTensor(extra_num_embeddings, embedding_dim)
    torch.nn.init.xavier_uniform_(extra_weight)
else:
    whole_weight = _read_pretrained_embeddings_file(pretrained_file, embedding_dim,
This was not what I expected, but it should work. I was imagining some kind of loop over the file just looking for the new vocab entries. But, this is much simpler logic, and the computation saved by the more complex logic is probably negligible. This could at least use a comment, I think, like # It's easiest to just reload the embeddings for the entire vocab, then only keep the ones we need.
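A rough sketch of that approach with the suggested comment in place; the slicing line and the assumption that the original size is stored as self.num_embeddings are illustrative, not the actual diff:

# It's easiest to just reload the embeddings for the entire vocab,
# then only keep the ones we need.
whole_weight = _read_pretrained_embeddings_file(pretrained_file, embedding_dim,
                                                extended_vocab, vocab_namespace)
# Keep only the rows for tokens that were not in the original vocabulary.
extra_weight = whole_weight[self.num_embeddings:, :]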
def extend_vocab(self,  # pylint: disable=arguments-differ
                 extended_vocab: Vocabulary,
                 vocab_namespace: str = "token_characters",
                 pretrained_file: str = None):
Same two comments on this class (return type annotation and docstring wording tweak).
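For illustration, those two changes applied here might look like the following sketch (docstring wording taken from the suggestion above; not the final code):

def extend_vocab(self,  # pylint: disable=arguments-differ
                 extended_vocab: Vocabulary,
                 vocab_namespace: str = "token_characters",
                 pretrained_file: str = None) -> None:
    """
    Extends the embedding matrix according to the extended vocabulary.
    If pretrained_file is available, it will be used for initializing the new words
    in the extended vocabulary; otherwise they will be initialized with xavier uniform.
    """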
assert tuple(original_weight.size()) == (4, 3)  # 4 because of padding and OOV

vocab.add_token_to_namespace('word3')
Can you change the embeddings for the original vocab here (simulating training), just as an additional sanity check that they don't get modified by extend_vocab() with a pre-trained file?
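One possible shape for that check; embedder, pretrained_file, and the use of .weight are assumptions based on the snippet above, not the actual test code:

# Simulate training by perturbing the original weights before extending.
embedder.weight.data += 1.0
snapshot = embedder.weight.data.clone()

vocab.add_token_to_namespace('word3')
embedder.extend_vocab(vocab, pretrained_file=pretrained_file)

# Rows for the original vocabulary should be unchanged by extend_vocab().
assert torch.equal(embedder.weight.data[:snapshot.size(0), :], snapshot)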
This should be good to merge now.
@matt-gardner If this looks good now, can you merge it? I am waiting on this to open a follow-up.
Looks good, thanks!
This is a follow-up to #2374. @matt-gardner, after #2374 is merged, can you please review this?