Add TokenClassificationEvaluator
#167
Conversation
Thanks for this @fxmarty! Indeed, tokenization of languages that don't break on spaces is a challenging task of its own; it comes up in Chinese, Japanese, and Korean, but also Thai and to some extent Arabic (and probably many other languages). Do you know how this issue is handled by models which do token classification for these specific languages? Do we have examples of such models? I think it might be helpful to see how it's typically handled before we decide on a solution.
Hi @fxmarty, thanks so much for working on this. To make it a bit easier to review this PR and discuss options, could you provide a few minimal examples for each case where the current evaluator works and does not work? In general it is ok if the user needs to preprocess the data into the right format, but it would also be great if some of the most popular formats kind of worked out of the box (e.g. people will try conll). Maybe the NER evaluator should have a config option for this.
The first format shouldn't pose an issue with the pipeline, right? For the second case, can't we use the tokenizer, similar to how it's done in the transformers NER example? For the cases where this fails, we could expect the user to be able to transform the data into either of those two formats, and it should also work. We can document how to use the evaluator with other formats in the documentation. What do you think?
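For concreteness, the two kinds of inputs discussed in this thread (pre-split words with word-level tags, and raw text with character-offset spans) might look roughly like this; the field names are only illustrative, not a fixed schema:

```python
# Illustrative only: field names and values are hypothetical examples, not a required schema.

# Pre-split words with one tag per word (conll2003-style)
example_word_level = {
    "tokens": ["EU", "rejects", "German", "call"],
    "ner_tags": ["B-ORG", "O", "B-MISC", "O"],
}

# Raw text with character-offset spans (common in labeling-tool exports)
example_span_level = {
    "text": "EU rejects German call",
    "spans": [
        {"start": 0, "end": 2, "label": "ORG"},
        {"start": 11, "end": 17, "label": "MISC"},
    ],
}
```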
Thank you for your feedback!
So from what I looked up, I found some models for POS tagging in Japanese/Chinese, for example https://huggingface.co/KoichiYasuoka/bert-base-japanese-upos trained on https://huggingface.co/datasets/universal_dependencies . It's actually not an issue because the tokenizer has no multi-ideogram tokens. See:

```python
from datasets import load_dataset
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

rawdata = load_dataset("universal_dependencies", "ja_modern")

tokenizer = AutoTokenizer.from_pretrained("KoichiYasuoka/bert-base-japanese-upos")
model = AutoModelForTokenClassification.from_pretrained("KoichiYasuoka/bert-base-japanese-upos")

print(rawdata["test"][1])
print("----------")

# encode the pre-split words
inp = rawdata["test"][1]["tokens"]
encoded = tokenizer.encode(inp, return_tensors="pt", is_split_into_words=True)
print(encoded)

# encode the raw text
inp = rawdata["test"][1]["text"]
encoded = tokenizer.encode(inp, return_tensors="pt")
print(encoded)
```

For other datasets (e.g. https://huggingface.co/datasets/xtreme/ or https://huggingface.co/datasets/wikiann/), there seem to be no models on the Hub for the Chinese/Japanese splits. I guess there are plenty of models at https://github.com/PaddlePaddle/PaddleNLP though.
Looking at the datasets for token classification on the Hub, all of them on the first page have inputs as lists of words, and labels as well. Do you know of a popular dataset where labels are a list of offset/span tuples? I mentioned this case not being supported, but I could actually not find a relevant dataset for it. Maybe we don't need to support this case?
Yes, this is probably the simplest actually. Avoid the …
I could be mistaken, but I don't think so (see huggingface/transformers#16438 (comment)). It probably does not matter if we use the workflow I suggest above. The only case left which could be an issue is if the model uses a tokenizer that may treat several ideograms as a single token. What do you think?
I think most labeling tools use this format for their exports, so it would be nice if we could add this format at some point. If it requires a ton of work, we can address this in a subsequent PR.
If possible I would like to avoid calling … See here: https://huggingface.co/docs/evaluate/main/en/custom_evaluator
I don't think we need to worry about these cases too much now. Let's make sure the limitations are well documented.
Understood. Although I think it is not too critical to call …
So I thought more about this later, and contrary to what I said above, I think what you suggest is very fine. The message I linked just stated that you cannot retrieve the exact input text with … For example:

```python
from transformers import pipeline
from datasets import load_dataset
from transformers import AutoModelForTokenClassification, AutoTokenizer

data = load_dataset("conll2003")

model_name = "elastic/distilbert-base-uncased-finetuned-conll03-english"
model = AutoModelForTokenClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

kwargs = {"ignore_labels": []}
pipe = pipeline(task="token-classification", model=model, tokenizer=tokenizer, **kwargs)

inp = data["validation"][15]

print("Original input:", inp["tokens"])
print("Length original input:", len(inp["tokens"]))

tokenized_inp = tokenizer(inp["tokens"], is_split_into_words=True)
print("Length tokenized input:", len(tokenized_inp["input_ids"]))

word_ids = tokenized_inp.word_ids(0)
print("word_ids:", word_ids)

# the pipeline gives no output for tokens with `None` word ids
word_ids_no_none = [word_id for word_id in word_ids if word_id is not None]

# the pipeline may give as output labeled tokens that are part of the same word; keep track
# of the indexing to match the true labels on words
index_tokens_word_start = []
for j, word_index in enumerate(word_ids_no_none):
    if j == 0:
        index_tokens_word_start.append(j)
    elif word_index != word_ids_no_none[j - 1]:
        index_tokens_word_start.append(j)

print("Length index_tokens_word_start:", len(index_tokens_word_start))

# we remove tokens corresponding to `None` word ids, i.e. special tokens added by the tokenizer,
# since we will tokenize again later on
token_indexes = [i for i in range(len(word_ids)) if word_ids[i] is not None]
to_decode = [tokenized_inp["input_ids"][i] for i in token_indexes]

decoded = tokenizer.decode(to_decode)
print("Input to pipeline:", decoded)

tokenized_decoded = tokenizer(decoded)
assert tokenized_decoded["input_ids"] == tokenized_inp["input_ids"]

res = pipe(decoded)
print("Length pipeline output:", len(res))
```
What do you think @lvwerra? Here we do not call any methods from … To get the labels, we keep the same logic as in this PR, i.e. https://github.com/fxmarty/evaluate/blob/6ebf2bcaf3f8a8de2f027b122652681a3b3ac60b/src/evaluate/evaluator/token_classification.py#L212-L237
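To make that last step concrete, here is a rough sketch (not the exact code from this PR) of how the pipeline output above could be reduced to one predicted label per word via `index_tokens_word_start`, and compared against the word-level references:

```python
# Sketch only: assumes `res`, `index_tokens_word_start` and `inp` from the snippet above.
# Keep the prediction for the first token of each word as the word-level prediction.
predictions_per_word = [res[i]["entity"] for i in index_tokens_word_start]

# conll2003 references are word-level tags (class ids), one per original word
references_per_word = inp["ner_tags"]

assert len(predictions_per_word) == len(references_per_word)
```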
Awesome work - just a few minor comments. To be sure everything works as expected you could run the parity test on a few datasets and on larger subsets.
PS: The `pipeline_call` should now also return the performance metrics, and thus you need to also update the tests, since the performance metrics are then also in the returned dict (aa81085).
Hi @fxmarty, thanks for the changes - just a few nits. Did you run the parity tests on 2-3 datasets with larger sample sizes, as well as checking that the metric matches what some models report? Just to double check that it not only works correctly on conll.
Just added three suggestions to make it compatible with the now merged device placement.
Thanks a lot for the review! I ran the parity test as well on philschmid/distilroberta-base-ner-wikiann with https://huggingface.co/datasets/wikiann/viewer/en/validation , and we do match. I wanted to run sachaarbonel/bert-italian-cased-finetuned-pos with https://huggingface.co/datasets/xtreme/viewer/udpos.Italian/test , but it is trickier as there is no run_pos.py script to compare to, and this requires some changes to run_ner.py. Looking for models to try out, I realized many have a bad label2id mapping (like "LABEL_0", etc.).
This PR adds a `TokenClassificationEvaluator`, and refactors `evaluator.py` into an `evaluator/` subdirectory to avoid too long files, following https://github.com/huggingface/transformers/tree/main/src/transformers/pipelines . There are no changes to the `Evaluator` base class and `TextClassificationEvaluator`, just refactoring.

This evaluator will fail for the following cases:

- tokenizers where several words may form a single token (e.g. `"he oh"` can be a token);
- `TokenClassificationPipeline` with labels (e.g. `ner_tags`), see transformers#17139.

I am very much happy if you have propositions on how to improve this. Basically, most datasets on the Hub have the input column split by word, while the `TokenClassificationPipeline` expects a single string as input (no `is_split_into_words` option).

The hack I use as a workaround is to join all words into a single string, by default with `" ".join(data["tokens"])`, where `data["tokens"]` is something like `["I", "am", "an", "example", "."]`.

The problem with this approach is that `BatchEncoding.word_ids` from the tokenizer output may yield word indexes that do not match the true word indexes, hence making the mapping with the reference `ner_tags` impossible. For example, tokenizing `Germany 's representative to the European Union` with `pipe.preprocess()`, we may have `'` and `s` having different word ids, as in the issue I linked above.

The solution chosen is to tokenize each word (i.e. each item from the dataset input column) separately, in order to retrieve the index of the first token of every word. This approach, as stated above, may well break for languages such as Chinese/Japanese where a token may be made of several words (in the sense of several items from the dataset input column, e.g. in https://huggingface.co/datasets/msra_ner ).
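As a rough sketch of this per-word tokenization idea (illustrative only, not the PR's actual implementation; the checkpoint name is just an example):

```python
# Illustrative sketch of the approach described above, not the PR's actual implementation.
from transformers import AutoTokenizer

# the checkpoint here is only an example
tokenizer = AutoTokenizer.from_pretrained("elastic/distilbert-base-uncased-finetuned-conll03-english")

words = ["Germany", "'s", "representative", "to", "the", "European", "Union"]
joined_input = " ".join(words)  # what is fed to the pipeline

# tokenize each word separately to find the index of its first token in the joined input
index_first_token = []
n_tokens = 0
for word in words:
    index_first_token.append(n_tokens)
    n_tokens += len(tokenizer(word, add_special_tokens=False)["input_ids"])

print(joined_input)
print(index_first_token)
# note: as described above, this can break if the tokenizer merges several words into one token
```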