GDP: Generic Document Pretraining to Improve Document Understanding
ICDAR 2024 — 18th International Conference on Document Analysis and Recognition, Athens, Greece · Lecture Notes in Computer Science, Vol. 14804, pp. 208–226
We propose a novel pretraining approach for document analysis that advances beyond conventional methods. The approach, called GDPerformer, trains a suite of architectures to predict both masked OCR tokens and masked OCR bounding boxes, encouraging the network to learn document semantics such as structure and language. Our experiments with GDPerformerv1 and GDPerformerv2 show improved performance on downstream tasks, including Semantic Entity Recognition and Extraction and Multi-Modal Document Classification, with minimal task-specific data and generalization across a wide range of document types. Furthermore, the pretrained features are robust to noisy documents and extend readily to multiple languages. Our experiments indicate that the proposed pretraining strategy requires only 50K document images, making it particularly beneficial for low-resource languages.
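To make the pretraining objective concrete, the sketch below shows one plausible way to jointly corrupt OCR tokens and their bounding boxes so a model can be trained to reconstruct both. This is a hypothetical illustration, not the authors' implementation: the mask-token id, the placeholder box, and the masking probability are all assumptions.

```python
import random

MASK_TOKEN_ID = 103       # assumed mask-token id (hypothetical)
MASK_BBOX = (0, 0, 0, 0)  # assumed placeholder box for masked positions

def mask_ocr_sequence(token_ids, bboxes, mask_prob=0.15, rng=None):
    """Jointly mask OCR tokens and their bounding boxes.

    Returns the corrupted token/box sequences plus the original
    values at each masked position, which serve as the prediction
    targets for the two pretraining losses (token prediction and
    bounding-box regression).
    """
    rng = rng or random.Random(0)
    corrupt_tokens, corrupt_boxes, targets = [], [], []
    for i, (tok, box) in enumerate(zip(token_ids, bboxes)):
        if rng.random() < mask_prob:
            # Hide both the word and its spatial location.
            corrupt_tokens.append(MASK_TOKEN_ID)
            corrupt_boxes.append(MASK_BBOX)
            targets.append((i, tok, box))
        else:
            corrupt_tokens.append(tok)
            corrupt_boxes.append(box)
    return corrupt_tokens, corrupt_boxes, targets
```

A model pretrained on such corrupted inputs would then be penalized with a classification loss on the masked tokens and a regression loss on the masked boxes, which is one common way to encourage it to capture both linguistic and layout structure.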