8000 allow TransformerTextField to take input directly from HF tokenizer by epwalsh · Pull Request #5329 · allenai/allennlp · GitHub
This repository was archived by the owner on Dec 16, 2022. It is now read-only.

allow TransformerTextField to take input directly from HF tokenizer #5329

Merged
merged 2 commits into main on Jul 23, 2021

Conversation

epwalsh
Member
@epwalsh epwalsh commented Jul 22, 2021

Currently you can't do

TransformerTextField(**tokenizer("Hello, World!", return_tensors="pt"))

because the tensors returned from the tokenizer have a batch dimension. This fixes that.
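To illustrate the shape mismatch the PR fixes: with `return_tensors="pt"`, HF tokenizers return tensors shaped `(1, seq_len)`, while `TransformerTextField` stores per-instance (unbatched) tensors. A minimal sketch of the squeeze the fix performs, using plain lists in place of tensors (this is an illustration of the idea, not the actual allennlp code):

```python
def squeeze_batch_dim(encoded):
    """Drop a leading batch dimension of size 1 from each tokenizer output.

    `encoded` stands in for a tokenizer result like
    {"input_ids": [[101, ...]], "attention_mask": [[1, ...]]}.
    """
    out = {}
    for key, value in encoded.items():
        # [[101, 7592, 102]] -> [101, 7592, 102]
        if isinstance(value, list) and len(value) == 1 and isinstance(value[0], list):
            out[key] = value[0]
        else:
            out[key] = value
    return out


# Hypothetical single-instance tokenizer output with a batch dimension of 1:
encoded = {
    "input_ids": [[101, 7592, 1010, 2088, 999, 102]],
    "attention_mask": [[1, 1, 1, 1, 1, 1]],
}
fields = squeeze_batch_dim(encoded)
# fields["input_ids"] is now a flat sequence suitable for a single field
```

With this squeeze in place, `TransformerTextField(**tokenizer(...))` works on a single instance even when the tokenizer adds the batch dimension.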

@epwalsh epwalsh requested a review from dirkgr July 22, 2021 22:20
Member
@dirkgr dirkgr left a comment


I'm OK with this, but there is a risk here. When you ask the HuggingFace tokenizer to make tensors, it also wants to add padding. We can't have padding in TransformerTextField.

It's really annoying that HF tokenizers are basically all-inclusive preprocessing machines. They tokenize, they index, they batch, they concatenate, all in one. If you want to do only one of these things, tough luck.
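The padding risk above could be guarded against with a small defensive check: if the tokenizer was asked to pad, the attention mask will contain zeros, and those padded positions would be stored in the field and then padded again at batch time. A sketch of such a check (a hypothetical helper, not part of allennlp):

```python
def assert_unpadded(attention_mask):
    """Reject tokenizer output that already contains padding.

    `attention_mask` is the per-instance mask from an HF tokenizer;
    a 0 entry means the tokenizer padded that position.
    """
    if 0 in attention_mask:
        raise ValueError(
            "TransformerTextField input must not contain padding; "
            "call the tokenizer without padding=... for single instances"
        )
    return attention_mask
```

In practice this just means calling the tokenizer on one text at a time without `padding`, so the mask is all ones and batching (with padding) is left to the data loader.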

@epwalsh epwalsh merged commit 42d8529 into main Jul 23, 2021
@epwalsh epwalsh deleted the transformer-text-field-fix branch July 23, 2021 00:07