Confusing `Vocabulary` docstring · Issue #5564 · allenai/allennlp · GitHub
This repository was archived by the owner on Dec 16, 2022. It is now read-only.
Confusing Vocabulary docstring #5564
Closed
@david-waterworth

Description


I'm trying to build a model using pre-trained gensim word2vec vectors. I think I have it working now, but it was quite confusing to set up; in particular, the docstring for `Vocabulary`'s `only_include_pretrained_words` parameter. It states:

only_include_pretrained_words : `bool`, optional (default=`False`)
    This defines the strategy for using any pretrained embedding files which may have been
    specified in `pretrained_files`. If False, an inclusive strategy is used: and words
    which are in the `counter` and in the pretrained file are added to the `Vocabulary`,
    regardless of whether their count exceeds `min_count` or not. If True, we use an
    exclusive strategy: words are only included in the Vocabulary if they are in the pretrained
    embedding file (their count must still be at least `min_count`).

What this implies to me is that you either get the intersection of the pre-trained vocab and the tokens discovered from the instances, OR you get just the pre-trained vocabulary. However, there's an undocumented and (IMO) confusing interaction with the `min_pretrained_embeddings` parameter, whose docstring is:

min_pretrained_embeddings : `Dict[str, int]`, optional
    If provided, specifies for each namespace a minimum number of lines (typically the
    most common words) to keep from pretrained embedding files, even for words not
    appearing in the data.

Given that it's optional, my expectation is that if you set `only_include_pretrained_words=True` and don't set `min_pretrained_embeddings`, the resulting vocabulary should contain 100% of the tokens from the pre-trained file.
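
In other words, I expected a call roughly like the following to produce a vocabulary whose `tokens` namespace contains every token in the pretrained file (the toy instance and the file path below are placeholders, not my actual setup):

    from allennlp.data import Instance, Vocabulary
    from allennlp.data.fields import TextField
    from allennlp.data.token_indexers import SingleIdTokenIndexer
    from allennlp.data.tokenizers import Token

    # A toy instance standing in for my real dataset reader's output.
    instances = [
        Instance({"text": TextField([Token("hello"), Token("world")],
                                    {"tokens": SingleIdTokenIndexer()})})
    ]

    # Expectation: with only_include_pretrained_words=True and no
    # min_pretrained_embeddings, the "tokens" namespace would end up with
    # every token from the pretrained file. The path is a placeholder.
    vocab = Vocabulary.from_instances(
        instances,
        pretrained_files={"tokens": "/path/to/word2vec.txt"},
        only_include_pretrained_words=True,
    )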

This isn't the case from what I gather; it appears you must set `only_include_pretrained_words=True` and also pass the number of tokens in the pre-trained file's namespace to `min_pretrained_embeddings`, due to this line:

min_embeddings = min_pretrained_embeddings.get(namespace, 0)
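
For context, here is a rough paraphrase of the surrounding logic in `Vocabulary._extend` as I read it (names are approximate and other details are elided):

    # Only the first `min_embeddings` pretrained tokens are force-added, and
    # min_embeddings falls back to 0 when the namespace isn't present in
    # min_pretrained_embeddings.
    pretrained_list = _read_pretrained_tokens(pretrained_files[namespace])
    min_embeddings = min_pretrained_embeddings.get(namespace, 0)
    if min_embeddings > 0:
        tokens_to_add[namespace] = (
            tokens_to_add.get(namespace, []) + pretrained_list[:min_embeddings]
        )
    pretrained_set = set(pretrained_list)

    # After that, the pretrained set only *filters* tokens counted from the
    # instances; with min_embeddings == 0, nothing comes from the file itself.
    for token, count in counter[namespace].items():
        if only_include_pretrained_words:
            if token in pretrained_set and count >= minimum_count:
                self.add_token_to_namespace(token, namespace)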

The `.get(namespace, 0)` call coalesces a missing `min_pretrained_embeddings` entry to 0 rather than to the length of `pretrained_list`, so you always only get the vocab from the instances. Perhaps it should be:

            pretrained_list = _read_pretrained_tokens(pretrained_files[namespace])
            min_embeddings = min_pretrained_embeddings.get(namespace, len(pretrained_list))

Although I'm not sure if that also works in the case where `only_include_pretrained_words=False` (or at least I'm not sure it's consistent with the docstring).
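
For completeness, the workaround in the meantime seems to be to spell the count out yourself, roughly like this (the path and the token count are placeholders, and `instances` is as in the earlier snippet):

    from allennlp.data import Vocabulary

    # Workaround: explicitly pass the number of tokens in the pretrained file
    # so that all of its lines are added to the "tokens" namespace.
    num_pretrained_tokens = 400_000  # i.e. the number of lines in the file
    vocab = Vocabulary.from_instances(
        instances,
        pretrained_files={"tokens": "/path/to/word2vec.txt"},
        only_include_pretrained_words=True,
        min_pretrained_embeddings={"tokens": num_pretrained_tokens},
    )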
