If you use the dataset, please cite our paper "PDF-VQA: A New Dataset for Real-World VQA on PDF Documents", accepted at ECML PKDD 2023.
@InProceedings{10.1007/978-3-031-43427-3_35,
author="Ding, Yihao
and Luo, Siwen
and Chung, Hyunsuk
and Han, Soyeon Caren",
editor="De Francisci Morales, Gianmarco
and Perlich, Claudia
and Ruchansky, Natali
and Kourtellis, Nicolas
and Baralis, Elena
and Bonchi, Francesco",
title="PDF-VQA: A New Dataset for Real-World VQA on PDF Documents",
booktitle="Machine Learning and Knowledge Discovery in Databases: Applied Data Science and Demo Track",
year="2023",
publisher="Springer Nature Switzerland",
address="Cham",
pages="585--601",
abstract="Document-based Visual Question Answering examines the document understanding of document images in conditions of natural language questions. We proposed a new document-based VQA dataset, PDF-VQA, to comprehensively examine the document understanding from various aspects, including document element recognition, document layout structural understanding as well as contextual understanding and key information extraction. Our PDF-VQA dataset extends the current scale of document understanding that limits on the single document page to the new scale that asks questions over the full document of multiple pages. We also propose a new graph-based VQA model that explicitly integrates the spatial and hierarchically structural relationships between different document elements to boost the document structural understanding. The performances are compared with several baselines over different question types and tasks (The full dataset is released in https://github.com/adlnlp/pdfvqa).",
isbn="978-3-031-43427-3"
}
Accepted by the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD 2023).
We introduce PDF-VQA, a new document-based Visual Question Answering dataset that examines document understanding from multiple aspects: document element recognition, document layout structure understanding, contextual understanding, and key information extraction. PDF-VQA extends document-based VQA beyond the single-page scale of current datasets: its questions range over full documents of multiple pages.
The dataset contains three tasks:
- Task A: document element recognition and understanding of spatial relationships among document elements at the page level. Answers are predicted from a fixed answer space.
- Task B: structural understanding of document elements and answer extraction at the page level.
- Task C: document understanding at the whole-document level, over multiple consecutive pages. Question-answer pairs for Task C are provided as pkl files.
Data Statistics of Tasks A, B, and C:
Our PDF-VQA dataset additionally annotates a hierarchical, logical relational graph among the document layout elements: based on the hierarchical structure of each document layout, it identifies parent objects and their child objects to capture the affiliations between elements.
Future work can directly use this relational information as additional features for document understanding.
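As an illustration only, the sketch below builds such a parent-child graph with networkx; the record layout and field names (`object_id`, `label`, `parent_id`) are assumptions made for the example, not the released schema.

```python
import networkx as nx

# Hypothetical layout records with parent-child annotations; the actual field
# names and structure come from the released layout-structure pkl files.
layout_records = [
    {"object_id": 0, "label": "section_title", "parent_id": None},
    {"object_id": 1, "label": "paragraph", "parent_id": 0},
    {"object_id": 2, "label": "paragraph", "parent_id": 0},
]

# Directed edges parent -> child encode the logical hierarchy of the layout.
graph = nx.DiGraph()
for rec in layout_records:
    graph.add_node(rec["object_id"], label=rec["label"])
    if rec["parent_id"] is not None:
        graph.add_edge(rec["parent_id"], rec["object_id"])

# All elements affiliated with the section title (its children):
print(list(graph.successors(0)))  # -> [1, 2]
```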
The dataset consists of three main splits: training, validation, and testing. For each split, we provide two types of files:
- question-answer pkl files: contain the questions and their ground-truth answers.
- document layout structure pkl files: contain essential document information, including the bounding box coordinates of each document layout component, the textual contents inside each bounding box, and parent-child relationships (the relational graph annotation). This information is crucial for understanding the structure of visually rich documents and for identifying relevant Regions of Interest (RoIs) that might contain the answers to the questions (a minimal loading sketch follows this list).
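For a first look at these files, the minimal loading sketch below may help; the file names (`train_qa.pkl`, `train_layout.pkl`) are placeholders for whichever files you downloaded, and the stored object types should be checked against the official tutorial.

```python
import pickle

import pandas as pd

# Placeholder file names -- substitute the actual pkl files from the release.
with open("train_qa.pkl", "rb") as f:
    qa_data = pickle.load(f)
with open("train_layout.pkl", "rb") as f:
    layout_data = pickle.load(f)

# The stored objects may be DataFrames, dicts, or lists depending on the file;
# inspecting the type and the available fields shows how to iterate over them.
for name, obj in [("question-answer", qa_data), ("layout structure", layout_data)]:
    print(name, type(obj))
    if isinstance(obj, pd.DataFrame):
        print(obj.columns.tolist())
        print(obj.head())
```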
We also provide the document images for each data split. The pkl files contain the information to locate the document images of each target document.
Please refer to this link to get the document images of each split.
Layout structure information of each data split: Training, Validation, Testing
Please refer to these links to get the question-answer pairs for the training, validation, and testing splits. We also provide an official tutorial on dataset loading and a baseline model.
More tutorials and baseline code will be released gradually.
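Until then, the rough sketch below shows one way the two file types might be paired into per-question examples; the paths, key names (`doc_id`, `question`, `answer`, `bbox`, `text`), and record structure are all hypothetical placeholders, not the released schema.

```python
import pickle


def load_pkl(path):
    """Load one of the released pkl files."""
    with open(path, "rb") as f:
        return pickle.load(f)


def build_example(qa, layout_by_doc):
    """Pair one question with the layout objects of its target document.

    All key names used here (doc_id, question, answer, bbox, text) are
    illustrative placeholders; consult the official tutorial for the actual fields.
    """
    doc_objects = layout_by_doc[qa["doc_id"]]
    return {
        "question": qa["question"],
        "answer": qa["answer"],
        "boxes": [obj["bbox"] for obj in doc_objects],  # candidate RoIs
        "texts": [obj["text"] for obj in doc_objects],  # their textual contents
    }


if __name__ == "__main__":
    qa_records = load_pkl("train_qa.pkl")         # placeholder path
    layout_by_doc = load_pkl("train_layout.pkl")  # placeholder path
    example = build_example(qa_records[0], layout_by_doc)
    print(example["question"], "->", example["answer"])
```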
We experimented with several baselines on our PDF-VQA dataset to provide a preliminary view of different models’ performances.
- Acronyms of feature aspects: Q--Question features; B--Bounding box coordinates; V--Visual appearance features; C--Contextual features; R--Relational information.
- VisualBERT outperforms ViLT, which indicates that object-level visual features are more effective than image patch representations.
- LayoutLMv2 uses token-level visual and bounding box features, which proves less effective for identifying whole document elements.
- Our proposed graph-based model, LoSpa, achieves the highest performance among all baselines, which indicates the effectiveness of relational information on this task.
- The comparatively low performance of all models on Task C indicates the difficulty of document-level questions and leaves considerable room for improvement for future research on this task.
- Note: the detailed baseline model setup can be found in Appendix C.
The contributors to this work are:
- Yihao Ding (Ph.D. candidate at the University of Sydney)
- Siwen Luo (Ph.D. candidate at the University of Sydney)
- Hyunsuk Chung (FortifyEdge, Sydney, Australia)
- Soyeon Caren Han (Senior lecturer at the University of Western Australia, Honorary senior lecturer at the University of Sydney)