Document Layout Understanding

Multi-modal transformer for Visual Document Understanding in English and multilingual settings

Guided by: Dr. Santanu Chaudhary, IIT Jodhpur     Duration: July 2021 – May 2022

Overview

This project explored Visual Document Understanding (VDU) — the task of comprehending the structure and semantics of documents that combine text, images, and layout information. We worked with DocFormer, a multi-modal transformer-based architecture that jointly models textual, visual, and spatial features of document images.

Key Contributions

  • Applied DocFormer for document understanding tasks in English and multilingual settings, extending its applicability beyond English-only benchmarks.
  • Investigated how spatial (layout) information fuses with textual tokens in transformer attention to improve downstream task performance.
  • Evaluated on standard VDU benchmarks including form understanding and document classification tasks.
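One way layout information can fuse with textual tokens in transformer attention, as investigated above, is through a spatial bias added to the attention scores. The sketch below is illustrative only (tensor shapes, names, and the bias construction are assumptions, not DocFormer's exact formulation): geometry shifts the attention weights while queries and keys remain textual.

```python
import torch

def spatial_attention_scores(q, k, spatial_bias):
    """Scaled dot-product attention scores plus a learned spatial bias.

    q, k: (batch, heads, seq, dim) projections of text tokens
    spatial_bias: (heads, seq, seq) bias derived from relative
                  bounding-box positions (illustrative placeholder)
    """
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5
    # Layout enters additively: nearby boxes can raise attention
    # between their tokens without mixing into the token embeddings.
    return scores + spatial_bias

# Toy shapes to show the mechanics
batch, heads, seq, dim = 2, 4, 16, 32
q = torch.randn(batch, heads, seq, dim)
k = torch.randn(batch, heads, seq, dim)
bias = torch.randn(heads, seq, seq)  # broadcast over the batch dim
attn = torch.softmax(spatial_attention_scores(q, k, bias), dim=-1)
print(attn.shape)  # torch.Size([2, 4, 16, 16])
```

Keeping the bias additive (rather than concatenating coordinates into the token embeddings) lets each attention head learn its own sensitivity to document geometry.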

What is DocFormer?

DocFormer is a multi-modal transformer that encodes:

  • Text tokens (from OCR output)
  • Visual features (from a CNN backbone over the document image)
  • Spatial/layout features (bounding box coordinates)

These modalities are jointly processed through a unified transformer, enabling the model to capture cross-modal dependencies unique to document understanding.
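The three modalities above can be combined into a single per-token representation before the shared transformer. The following is a minimal sketch under stated assumptions (hidden sizes, module names, and the sum-based fusion are illustrative choices, not DocFormer's actual implementation):

```python
import torch
import torch.nn as nn

class MultiModalEmbedding(nn.Module):
    """Illustrative fusion of text, layout, and visual features.

    Assumptions: coordinates are integers normalized to [0, max_coord],
    and visual region features (e.g. from a CNN backbone) are already
    aligned one-to-one with OCR tokens.
    """

    def __init__(self, vocab_size=30522, hidden=256, max_coord=1000):
        super().__init__()
        self.text = nn.Embedding(vocab_size, hidden)       # OCR token ids
        self.x_emb = nn.Embedding(max_coord + 1, hidden)   # bbox x0
        self.y_emb = nn.Embedding(max_coord + 1, hidden)   # bbox y0
        self.visual_proj = nn.Linear(2048, hidden)         # CNN features

    def forward(self, token_ids, bboxes, visual_feats):
        # token_ids: (B, T); bboxes: (B, T, 4); visual_feats: (B, T, 2048)
        t = self.text(token_ids)
        s = self.x_emb(bboxes[..., 0]) + self.y_emb(bboxes[..., 1])
        v = self.visual_proj(visual_feats)
        # Summing keeps one sequence length, so a standard transformer
        # encoder can attend across all modalities jointly.
        return t + s + v

emb = MultiModalEmbedding()
tokens = torch.randint(0, 30522, (2, 8))
boxes = torch.randint(0, 1001, (2, 8, 4))
feats = torch.randn(2, 8, 2048)
out = emb(tokens, boxes, feats)
print(out.shape)  # torch.Size([2, 8, 256])
```

Because the fused sequence has the same length as the text sequence, cross-modal dependencies are captured by ordinary self-attention over these enriched tokens.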

Skills & Tools

Python · PyTorch · Transformers (HuggingFace) · DocFormer · OCR · Multi-modal Learning
