Vocabulary docstring #5564

Description
I'm trying to build a model using pre-trained gensim word2vec vectors. I think I have it working now, but it was quite confusing, in particular the docstring for the Vocabulary parameter `only_include_pretrained_words`. It states:
only_include_pretrained_words : `bool`, optional (default=`False`)
    This defines the strategy for using any pretrained embedding files which may have been
    specified in `pretrained_files`. If False, an inclusive strategy is used: any words
    which are in the `counter` and in the pretrained file are added to the `Vocabulary`,
    regardless of whether their count exceeds `min_count` or not. If True, we use an
    exclusive strategy: words are only included in the Vocabulary if they are in the pretrained
    embedding file (their count must still be at least `min_count`).
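To make the two strategies concrete, here is a minimal sketch of one way to read that docstring (my paraphrase, not the actual AllenNLP code; `counter` and `pretrained_tokens` stand in for the real data structures):

```python
from typing import Dict, List, Set

def select_tokens(
    counter: Dict[str, int],
    pretrained_tokens: Set[str],
    min_count: int,
    only_include_pretrained_words: bool,
) -> List[str]:
    vocab = []
    for token, count in counter.items():
        if only_include_pretrained_words:
            # Exclusive: token must be in the pretrained file AND meet min_count.
            if token in pretrained_tokens and count >= min_count:
                vocab.append(token)
        else:
            # Inclusive: pretrained tokens bypass the min_count threshold.
            if token in pretrained_tokens or count >= min_count:
                vocab.append(token)
    return vocab
```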
What this implies to me is that you either get the intersection of the pre-trained vocab and the tokens discovered from the instances, OR you get just the pre-trained vocabulary. However, there's an undocumented and (IMO) confusing interaction with the `min_pretrained_embeddings` parameter, whose docstring is:
min_pretrained_embeddings : `Dict[str, int]`, optional
    If provided, specifies for each namespace a minimum number of lines (typically the
    most common words) to keep from pretrained embedding files, even for words not
    appearing in the data.
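Again paraphrasing (not the real implementation), my reading is that the first N lines of the embedding file are force-added, independent of the instance counts:

```python
from typing import List

def forced_pretrained_tokens(pretrained_list: List[str], min_embeddings: int) -> List[str]:
    # `pretrained_list`: tokens from the embedding file in file order.
    # These files are typically sorted most-frequent-first, so taking a
    # prefix keeps the most common words even if they never occur in the data.
    return pretrained_list[:min_embeddings]
```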
Given that it's optional, my expectation is that if you set `only_include_pretrained_words=True` and don't set `min_pretrained_embeddings`, the resulting vocabulary should contain 100% of the tokens from the pre-trained file.
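Concretely, I expected a call like this (the path is a placeholder) to give me the full pretrained vocabulary:

```python
from allennlp.data import Vocabulary

vocab = Vocabulary.from_instances(
    instances,  # my dataset's instances
    pretrained_files={"tokens": "/path/to/word2vec.txt"},  # placeholder path
    only_include_pretrained_words=True,
)
```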
This isn't the case from what I gather: it appears you must set `only_include_pretrained_words=True` *and* pass the number of tokens in the pre-trained file's namespace to `min_pretrained_embeddings`, due to this line:
allennlp/allennlp/data/vocabulary.py, line 669 (commit 1caf0da)
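which, if I'm reading the code right, defaults the per-namespace minimum to zero:

```python
min_embeddings = min_pretrained_embeddings.get(namespace, 0)
```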
That is, a missing `min_pretrained_embeddings` entry coalesces to 0 rather than to the length of `pretrained_list`, so by default you only ever get vocabulary from the instances. Perhaps it should be:
```python
pretrained_list = _read_pretrained_tokens(pretrained_files[namespace])
min_embeddings = min_pretrained_embeddings.get(namespace, len(pretrained_list))
```
Although I'm not sure that also works in the case where `only_include_pretrained_words=False` (or at least I'm not sure it's consistent with the docstring).
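For anyone hitting the same thing, this is the workaround I'm using in the meantime (a sketch; the path is a placeholder, and if your embedding file has a word2vec-style header line you'd subtract one from the count):

```python
from allennlp.data import Vocabulary

embedding_path = "/path/to/word2vec.txt"  # placeholder

# Force every line of the embedding file into the vocabulary by passing
# the file's token count as the per-namespace minimum.
with open(embedding_path) as f:
    num_pretrained = sum(1 for _ in f)

vocab = Vocabulary.from_instances(
    instances,
    pretrained_files={"tokens": embedding_path},
    only_include_pretrained_words=True,
    min_pretrained_embeddings={"tokens": num_pretrained},
)
```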