Vocabulary docstring #5564

Description
I'm trying to build a model using pre-trained gensim word2vec vectors. I think I have it working now, but it was quite confusing, in particular the docstring for the Vocabulary parameter `only_include_pretrained_words`. It states:
only_include_pretrained_words : `bool`, optional (default=`False`)
    This defines the strategy for using any pretrained embedding files which may have been
    specified in `pretrained_files`. If False, an inclusive strategy is used: any words
    which are in the `counter` and in the pretrained file are added to the `Vocabulary`,
    regardless of whether their count exceeds `min_count` or not. If True, we use an
    exclusive strategy: words are only included in the Vocabulary if they are in the pretrained
    embedding file (their count must still be at least `min_count`).
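To make the two strategies concrete, here is a minimal sketch of one way to read that docstring (my paraphrase, not the actual AllenNLP code; `counter` and `pretrained_tokens` stand in for the real data structures):

```python
from typing import Dict, List, Set

def select_tokens(
    counter: Dict[str, int],
    pretrained_tokens: Set[str],
    min_count: int,
    only_include_pretrained_words: bool,
) -> List[str]:
    vocab = []
    for token, count in counter.items():
        if only_include_pretrained_words:
            # Exclusive: token must be in the pretrained file AND meet min_count.
            if token in pretrained_tokens and count >= min_count:
                vocab.append(token)
        else:
            # Inclusive: pretrained tokens bypass the min_count threshold.
            if token in pretrained_tokens or count >= min_count:
                vocab.append(token)
    return vocab
```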
What this implies to me is that you either get the intersection of the pre-trained vocab and the tokens discovered from the instances, OR you get just the pre-trained vocabulary. However, there's an undocumented and (IMO) confusing interaction with the `min_pretrained_embeddings` parameter, whose docstring is:
min_pretrained_embeddings : `Dict[str, int]`, optional
    If provided, specifies for each namespace a minimum number of lines (typically the
    most common words) to keep from pretrained embedding files, even for words not
    appearing in the data.
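Again paraphrasing (not the real implementation), my reading is that the first N lines of the embedding file are force-added, independent of the instance counts:

```python
from typing import List

def forced_pretrained_tokens(pretrained_list: List[str], min_embeddings: int) -> List[str]:
    # `pretrained_list`: tokens from the embedding file in file order.
    # These files are typically sorted most-frequent-first, so taking a
    # prefix keeps the most common words even if they never occur in the data.
    return pretrained_list[:min_embeddings]
```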
Given that it's optional, my expectation is that if you set `only_include_pretrained_words=True` and don't set `min_pretrained_embeddings`, the resulting vocabulary should contain 100% of the tokens from the pre-trained file.
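Concretely, I expected a call like this (the path is a placeholder) to give me the full pretrained vocabulary:

```python
from allennlp.data import Vocabulary

vocab = Vocabulary.from_instances(
    instances,  # my dataset's instances
    pretrained_files={"tokens": "/path/to/word2vec.txt"},  # placeholder path
    only_include_pretrained_words=True,
)
```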
This isn't the case from what I gather: it appears you must set `only_include_pretrained_words=True` *and* pass the number of tokens in the pre-trained file's namespace to `min_pretrained_embeddings`, due to this line:
allennlp/allennlp/data/vocabulary.py, line 669 (commit 1caf0da)
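which, if I'm reading the code right, defaults the per-namespace minimum to zero:

```python
min_embeddings = min_pretrained_embeddings.get(namespace, 0)
```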
That is, a missing `min_pretrained_embeddings` entry coalesces to 0 rather than to the length of `pretrained_list`, so by default you only ever get vocabulary from the instances. Perhaps it should be:
```python
pretrained_list = _read_pretrained_tokens(pretrained_files[namespace])
min_embeddings = min_pretrained_embeddings.get(namespace, len(pretrained_list))
```
Although I'm not sure that also works in the case where `only_include_pretrained_words=False` (or at least I'm not sure it's consistent with the docstring).
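For anyone hitting the same thing, this is the workaround I'm using in the meantime (a sketch; the path is a placeholder, and if your embedding file has a word2vec-style header line you'd subtract one from the count):

```python
from allennlp.data import Vocabulary

embedding_path = "/path/to/word2vec.txt"  # placeholder

# Force every line of the embedding file into the vocabulary by passing
# the file's token count as the per-namespace minimum.
with open(embedding_path) as f:
    num_pretrained = sum(1 for _ in f)

vocab = Vocabulary.from_instances(
    instances,
    pretrained_files={"tokens": embedding_path},
    only_include_pretrained_words=True,
    min_pretrained_embeddings={"tokens": num_pretrained},
)
```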