Tokenization Consistency Matters for Generative Models on Extractive NLP Tasks

Kaiser Sun, Peng Qi, Yuhao Zhang, Lan Liu, William Wang, Zhiheng Huang

Abstract

Generative models have been widely applied to extractive tasks, where parts of the input are extracted to form the desired output, and have achieved significant success. For example, in extractive question answering (QA), generative models have consistently yielded state-of-the-art results. In this work, we study the issue of tokenization inconsistency, which is commonly neglected when training these models. When the tokenizer tokenizes the input and the output inconsistently, the extractive nature of these tasks is damaged, which leads to a performance drop as well as hallucination. We propose a simple yet effective fix for this issue and conduct a case study on extractive QA. We show that, with consistent tokenization, the model performs better on both in-domain and out-of-domain datasets, with a notable average gain of +1.7 F1 when a BART model is trained on SQuAD and evaluated on 8 QA datasets. Further, the model converges faster and becomes less likely to generate out-of-context answers. Our results demonstrate the need for increased scrutiny of how tokenization is done in extractive tasks, and the benefits of consistent tokenization during training.
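
The inconsistency described in the abstract can be illustrated with a small sketch. Everything here is an assumption made for illustration: `toy_tokenize` is a hypothetical stand-in for a byte-level BPE tokenizer that folds a leading space into the following token (marked `Ġ`), and `consistent_answer_tokens` shows only one possible way to obtain consistent tokens. The abstract does not spell out the fix, so this is not necessarily the authors' method.

```python
import re


def toy_tokenize(text):
    """Hypothetical tokenizer: a leading space is folded into the next
    word or punctuation mark and rendered as 'Ġ'."""
    pieces = re.findall(r" ?\w+| ?[^\w\s]", text)
    return [p.replace(" ", "Ġ") for p in pieces]


def is_sublist(needle, haystack):
    """Return True if `needle` occurs as a contiguous run inside `haystack`."""
    n = len(needle)
    return any(haystack[i:i + n] == needle for i in range(len(haystack) - n + 1))


def consistent_answer_tokens(context, answer_start, answer):
    """Assumed fix: tokenize the answer together with the space that precedes
    it in the context, so it is split the same way as inside the context."""
    start = answer_start
    if start > 0 and context[start - 1] == " ":
        start -= 1
    return toy_tokenize(context[start:answer_start + len(answer)])


context = "The capital of France is Paris."
answer = "Paris"
answer_start = context.index(answer)

context_tokens = toy_tokenize(context)
naive_tokens = toy_tokenize(answer)  # answer tokenized in isolation
consistent_tokens = consistent_answer_tokens(context, answer_start, answer)

# The isolated answer is not a token-level span of the context ...
print(is_sublist(naive_tokens, context_tokens))
# ... while the answer tokenized in context is.
print(is_sublist(consistent_tokens, context_tokens))
```

In this toy setting the naively tokenized target cannot be copied token by token from the input, so the task is no longer strictly extractive at the token level; the context-aware variant restores that property.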

Anthology ID: 2023.findings-emnlp.887
Volume: Findings of the Association for Computational Linguistics: EMNLP 2023
Month: December
Year: 2023
Address: Singapore
Editors: Houda Bouamor, Juan Pino, Kalika Bali
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 13300–13310
URL: https://aclanthology.org/2023.findings-emnlp.887/
DOI: 10.18653/v1/2023.findings-emnlp.887
Cite (ACL): Kaiser Sun, Peng Qi, Yuhao Zhang, Lan Liu, William Wang, and Zhiheng Huang. 2023. Tokenization Consistency Matters for Generative Models on Extractive NLP Tasks. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 13300–13310, Singapore. Association for Computational Linguistics.
Cite (Informal): Tokenization Consistency Matters for Generative Models on Extractive NLP Tasks (Sun et al., Findings 2023)
PDF: https://aclanthology.org/2023.findings-emnlp.887.pdf
