Add necessary implicit embedding extension for transfer-modules api and vocab extension #2431
Conversation
Okay, this is up for review. @joelgrus, @matt-gardner Two main things changed here:
Other change: we need to make sure embedding extension is a no-op unless we are sure of the namespace (e.g. it's incorrect to default to the "tokens" namespace). An incorrect implicit assumption combined with the above changes can make some tests fail.
A couple of minor questions; otherwise looks very good, thanks for the PR!
logging.warning("No vocab_namespace provided to Embedder.extend_vocab. Defaulting to 'tokens'.")
# It's not safe to default to 'tokens' when we aren't sure that 'tokens'
# need to be extended. (Without this, several tests fail.)
logging.warning("No vocab_namespace provided to Embedder.extend_vocab. Extension will be no-op'.")
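The no-op behavior this diff settles on can be sketched as a small standalone helper. This is toy code, not AllenNLP's actual `Embedding` class; the stored-namespace fallback it assumes comes from the `_vocab_namespace` attribute added in #2374.

```python
import logging

def resolve_extension_namespace(stored_namespace, explicit_namespace):
    """Decide which vocab namespace to extend, or None for a no-op.

    Mirrors the PR's logic: prefer an explicitly passed namespace, fall
    back to the namespace stored at training time, and refuse to guess
    otherwise (defaulting to 'tokens' is not safe).
    """
    if explicit_namespace is not None:
        return explicit_namespace
    if stored_namespace is not None:
        return stored_namespace
    # Models trained before the namespace was stored land here.
    logging.info("No vocab_namespace provided to extend_vocab; "
                 "extension will be a no-op.")
    return None
```

A caller that gets `None` back simply skips extension instead of resizing the wrong embedding matrix.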
In what circumstances will this actually emit a warning? Will almost everyone that loads or trains a model in practice see this warning? If so, it should be at the info level (or even debug).
The warning won't be seen for any models trained after #2374 because _vocab_namespace is stored. For previously trained models, it would almost always be seen because the extend_vocab call is implicit now. Will change it to info level.
return

extended_num_embeddings = extended_vocab.get_vocab_size(vocab_namespace)
if extended_num_embeddings <= self.num_embeddings:
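To make the concern with this check concrete, here is a toy illustration (the namespace names and sizes are hypothetical) of how `<=` can quietly swallow a mistaken namespace that `==` plus an explicit error for `<` would surface:

```python
# The embedder was built with 100 embeddings, sized from the 'tokens' namespace.
num_embeddings = 100

# Hypothetical extended-vocab sizes per namespace.
extended_sizes = {"tokens": 120, "pos_tags": 17}

# Correct namespace: 120 > 100, so the matrix genuinely needs extending.
needs_extension = extended_sizes["tokens"] > num_embeddings

# Wrong namespace: 17 <= 100, so a `<=` guard skips extension silently
# instead of reporting the caller's mistake.
silently_skipped = extended_sizes["pos_tags"] <= num_embeddings
```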
Shouldn't this be ==, not <=?
Yes, if vocab and embedding are already in sync it should be ==. I had kept <= as a precaution against an incorrect vocab namespace. But on second thought, if the user passed an incorrect vocab namespace, it's better to raise an error than a silent no-op. Will change it.
Worth explicitly raising an error for the < case? It's only possible when the user explicitly passed an incorrect vocab namespace that's smaller than the embedding itself.
Okay, I have separated the == and < cases, making the first a no-op and raising a configuration error in the second.
@matt-gardner Correcting this revealed a subtle issue: defaulting to the tokens and token_characters namespaces can be problematic when num_embeddings was used instead of a vocab namespace to decide the embedding size. Fixed this in the last commit.
This is up for another look now.
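One way to picture the problem: if the embedder was sized by an explicit num_embeddings rather than by a vocab namespace, a guessed default namespace whose size disagrees cannot be trusted. The helper below is a hypothetical reconstruction of that guard, not the PR's actual code.

```python
def safe_default_namespace(stored_namespace, num_embeddings, vocab_sizes):
    """Pick a namespace to extend, or None to make extension a no-op.

    Hypothetical sketch: only fall back to a guessed default namespace
    when its vocab size matches num_embeddings, i.e. when the embedding
    plausibly was sized from that namespace in the first place.
    """
    if stored_namespace is not None:
        return stored_namespace
    for guess in ("tokens", "token_characters"):
        if vocab_sizes.get(guess) == num_embeddings:
            return guess
    # Sizes disagree (e.g. num_embeddings was set explicitly):
    # refuse to guess, so extension becomes a no-op.
    return None
```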
Oops, didn't realize you already reviewed again!
A few minor wording tweaks, and this is good to merge. Thanks for the PR, for this and all of the related functionality! I think it's turned out quite nicely.
logging.warning("No vocab_namespace provided to Embedder.extend_vocab. Defaulting to 'tokens'.")
# It's not safe to default to 'tokens' when we aren't sure that 'tokens'
# need to be extended. (Without this, several tests fail.)
logging.info("No vocab_namespace provided to Embedder.extend_vocab. Extension will be no-op'.")
To make this more obvious, I'd recommend a message like "Loading a model trained before embedding extension was implemented; pass an explicit vocab namespace if you want to extend the vocabulary."
vocab_namespace = "tokens"
logging.warning("No vocab_namespace provided to Embedder.extend_vocab. Defaulting to 'tokens'.")
# It's not safe to default to 'tokens' when we aren't sure that 'tokens'
# need to be extended. (Without this, several tests fail.)
No need to reference failing tests in comments in the code - just give the justification (that it's not safe to default to "tokens").
@matt-gardner, Thanks for the review! This should be good to merge now.
Please don't review this yet; I still need to change some code and add tests. This is a follow-up on #2374 and #2387, but I will probably get back after #2395 gets finalized.