Add TokenClassificationEvaluator
#167
Conversation
Thanks for this @fxmarty! Indeed, tokenization of languages that don't break on spaces is a challenging task of its own; it comes up in Chinese, Japanese, and Korean, but also Thai and to some extent Arabic (and probably many other languages). Do you know how this issue is handled by models which do token classification for these specific languages? Do we have examples of such models? I think it might be helpful to see how it's typically handled before we decide on a solution.
Hi @fxmarty, thanks so much for working on this. To make it a bit easier to review this PR and discuss options, could you provide a few minimal examples for each case where the current evaluator works and does not work? In general it is ok if the user needs to preprocess the data into the right format, but it would also be great if some of the most popular formats kind of worked out of the box (e.g. people will try conll). Maybe the NER evaluator should have a config option for this.
The first format shouldn't pose an issue with the pipeline, right? For the second case, can't we use the tokenizer, similar to how it's done in the transformers NER example? For the cases where this fails, we could expect the user to be able to transform the data into either of those two formats, and it should also work. We can document how to use the evaluator with other formats in the documentation. What do you think?
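For concreteness, the two kinds of inputs discussed in this thread (pre-split words with word-level tags, and raw text with character-offset spans) might look roughly like this; the field names are only illustrative, not a fixed schema:

```python
# Illustrative only: field names and values are hypothetical examples, not a required schema.

# Pre-split words with one tag per word (conll2003-style)
example_word_level = {
    "tokens": ["EU", "rejects", "German", "call"],
    "ner_tags": ["B-ORG", "O", "B-MISC", "O"],
}

# Raw text with character-offset spans (common in labeling-tool exports)
example_span_level = {
    "text": "EU rejects German call",
    "spans": [
        {"start": 0, "end": 2, "label": "ORG"},
        {"start": 11, "end": 17, "label": "MISC"},
    ],
}
```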
Thank you for your feedback!
So from what I looked up, I found some models for POS tagging in Japanese/Chinese, for example https://huggingface.co/KoichiYasuoka/bert-base-japanese-upos trained on https://huggingface.co/datasets/universal_dependencies . It's actually not an issue because the tokenizer has no multi-ideogram tokens. See:

```python
from datasets import load_dataset
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

rawdata = load_dataset("universal_dependencies", "ja_modern")

tokenizer = AutoTokenizer.from_pretrained("KoichiYasuoka/bert-base-japanese-upos")
model = AutoModelForTokenClassification.from_pretrained("KoichiYasuoka/bert-base-japanese-upos")

print(rawdata["test"][1])
print("----------")

# encode the pre-split words
inp = rawdata["test"][1]["tokens"]
encoded = tokenizer.encode(inp, return_tensors="pt", is_split_into_words=True)
print(encoded)

# encode the raw text
inp = rawdata["test"][1]["text"]
encoded = tokenizer.encode(inp, return_tensors="pt")
print(encoded)
```

For other datasets (e.g. https://huggingface.co/datasets/xtreme/ or https://huggingface.co/datasets/wikiann/), there seem to be no models on the Hub for the Chinese/Japanese splits. I guess there are plenty of models at https://github.com/PaddlePaddle/PaddleNLP though.
Looking at the datasets for token classification on the Hub, all of them on the first page have inputs as lists of words, and labels as well. Do you know of a popular dataset where labels are a list of offset/span tuples? I mentioned this case not being supported, but I could actually not find a relevant dataset for it. Maybe we don't need to support this case?
Yes, this is probably the simplest actually. Avoid the …
I could be mistaken, but I don't think so (see huggingface/transformers#16438 (comment)). It probably does not matter if we use the workflow I suggest above. The only case left which could be an issue is if the model uses a tokenizer that may treat several ideograms as a single token. What do you think?
I think most labeling tools use this format for their exports, so it would be nice if we could add this format at some point. If it requires a ton of work, we can address this in a subsequent PR.
If possible I would like to avoid calling … See here: https://huggingface.co/docs/evaluate/main/en/custom_evaluator
I don't think we need to worry about these cases too much now. Let's make sure the limitations are well documented.
Understood. Although I think it is not too critical to call …
So I thought more about this later, and contrary to what I said above, I think what you suggest is very fine. The message I linked just stated that you cannot retrieve the exact input text with … For example:

```python
from transformers import pipeline
from datasets import load_dataset
from transformers import AutoModelForTokenClassification, AutoTokenizer

data = load_dataset("conll2003")

model_name = "elastic/distilbert-base-uncased-finetuned-conll03-english"
model = AutoModelForTokenClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

kwargs = {"ignore_labels": []}
pipe = pipeline(task="token-classification", model=model, tokenizer=tokenizer, **kwargs)

inp = data["validation"][15]

print("Original input:", inp["tokens"])
print("Length original input:", len(inp["tokens"]))

tokenized_inp = tokenizer(inp["tokens"], is_split_into_words=True)
print("Length tokenized input:", len(tokenized_inp["input_ids"]))

word_ids = tokenized_inp.word_ids(0)
print("word_ids:", word_ids)

# the pipeline gives no output for tokens with `None` word ids
word_ids_no_none = [word_id for word_id in word_ids if word_id is not None]

# the pipeline may give as output labeled tokens that are part of the same word; keep track
# of the indexing to match the true labels on words
index_tokens_word_start = []
for j, word_index in enumerate(word_ids_no_none):
    if j == 0:
        index_tokens_word_start.append(j)
    elif word_index != word_ids_no_none[j - 1]:
        index_tokens_word_start.append(j)

print("Length index_tokens_word_start:", len(index_tokens_word_start))

# we remove tokens corresponding to `None` word ids, i.e. special tokens added by the tokenizer,
# since we will tokenize again later on
token_indexes = [i for i in range(len(word_ids)) if word_ids[i] is not None]
to_decode = [tokenized_inp["input_ids"][i] for i in token_indexes]

decoded = tokenizer.decode(to_decode)
print("Input to pipeline:", decoded)

tokenized_decoded = tokenizer(decoded)
assert tokenized_decoded["input_ids"] == tokenized_inp["input_ids"]

res = pipe(decoded)
print("Length pipeline output:", len(res))
```
What do you think @lvwerra? Here we do not call any methods from … To get the labels, we keep the same logic as in this PR, i.e. https://github.com/fxmarty/evaluate/blob/6ebf2bcaf3f8a8de2f027b122652681a3b3ac60b/src/evaluate/evaluator/token_classification.py#L212-L237
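To make that last step concrete, here is a rough sketch (not the exact code from this PR) of how the pipeline output above could be reduced to one predicted label per word via `index_tokens_word_start`, and compared against the word-level references:

```python
# Sketch only: assumes `res`, `index_tokens_word_start` and `inp` from the snippet above.
# Keep the prediction for the first token of each word as the word-level prediction.
predictions_per_word = [res[i]["entity"] for i in index_tokens_word_start]

# conll2003 references are word-level tags (class ids), one per original word
references_per_word = inp["ner_tags"]

assert len(predictions_per_word) == len(references_per_word)
```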
Awesome work - just a few minor comments. To be sure everything works as expected you could run the parity test on a few datasets and on larger subsets.
PS: The `pipeline_call` should now also return the performance metrics, and thus you need to also update the tests, since the performance metrics are then also in the returned dict (aa81085).
Hi @fxmarty, thanks for the changes - just a few nits. Did you run the parity tests on 2-3 datasets with larger sample sizes, as well as checking that the metric matches what some models report? Just to double check that it not only works correctly on conll.
Just added three suggestions to make it compatible with the now merged device placement.
Thanks a lot for the review! I ran the parity test as well on philschmid/distilroberta-base-ner-wikiann with https://huggingface.co/datasets/wikiann/viewer/en/validation , and we do match. I wanted to run sachaarbonel/bert-italian-cased-finetuned-pos with https://huggingface.co/datasets/xtreme/viewer/udpos.Italian/test , but it is trickier as there is no run_pos.py script to compare to, and this requires some changes to run_ner.py. Looking for models to try out, I realized many have a bad label2id mapping (like "LABEL_0", etc.).
This PR adds a `TokenClassificationEvaluator`, and refactors `evaluator.py` into an `evaluator/` subdirectory to avoid too long files, following https://github.com/huggingface/transformers/tree/main/src/transformers/pipelines . There are no changes to the `Evaluator` base class and `TextClassificationEvaluator`, just refactoring.

This evaluator will fail for the following cases:

- tokenizers where several words may form a single token (e.g. `"he oh"` can be a token);
- `TokenClassificationPipeline` with labels (e.g. `ner_tags`), see transformers#17139.

I am very much happy if you have propositions on how to improve this. Basically, most datasets on the Hub have the input column split by word, while the `TokenClassificationPipeline` expects a single string as input (no `is_split_into_words` option).

The hack I use as a workaround is to join all words into a single string, by default with `" ".join(data["tokens"])`, where `data["tokens"]` is something like `["I", "am", "an", "example", "."]`.

The problem with this approach is that `BatchEncoding.word_ids` from the tokenizer output may yield word indexes that do not match the true word indexes, hence making the mapping with the reference `ner_tags` impossible. For example, tokenizing `Germany 's representative to the European Union` with `pipe.preprocess()`, we may have `'` and `s` having different word ids, as in the issue I linked above.

The solution chosen is to tokenize each word (i.e. each item from the dataset input column) separately, in order to retrieve the index of the first token of every word. This approach, as stated above, may well break for languages such as Chinese/Japanese where a token may be made of several words (in the sense of several items from the dataset input column, e.g. in https://huggingface.co/datasets/msra_ner ).
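As a rough sketch of this per-word tokenization idea (illustrative only, not the PR's actual implementation; the checkpoint name is just an example):

```python
# Illustrative sketch of the approach described above, not the PR's actual implementation.
from transformers import AutoTokenizer

# the checkpoint here is only an example
tokenizer = AutoTokenizer.from_pretrained("elastic/distilbert-base-uncased-finetuned-conll03-english")

words = ["Germany", "'s", "representative", "to", "the", "European", "Union"]
joined_input = " ".join(words)  # what is fed to the pipeline

# tokenize each word separately to find the index of its first token in the joined input
index_first_token = []
n_tokens = 0
for word in words:
    index_first_token.append(n_tokens)
    n_tokens += len(tokenizer(word, add_special_tokens=False)["input_ids"])

print(joined_input)
print(index_first_token)
# note: as described above, this can break if the tokenizer merges several words into one token
```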