This repository contains an implementation of the paper *Going Full-TILT Boogie on Document Understanding with Text-Image-Layout Transformer*. Note that the authors have not released the original implementation of the paper.
Abstract: We address the challenging problem of Natural Language Comprehension beyond plain-text documents by introducing the TILT neural network architecture which simultaneously learns layout information, visual features, and textual semantics. Contrary to previous approaches, we rely on a decoder capable of unifying a variety of problems involving natural language. The layout is represented as an attention bias and complemented with contextualized visual information, while the core of our model is a pretrained encoder-decoder Transformer. Our novel approach achieves state-of-the-art results in extracting information from documents and answering questions which demand layout understanding (DocVQA, CORD, SROIE). At the same time, we simplify the process by employing an end-to-end model.
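To illustrate the abstract's claim that "the layout is represented as an attention bias", here is a minimal sketch of the idea. This is not the repository's actual code; the function name, argument names, and tensor shapes are illustrative assumptions. Horizontal and vertical layout biases are simply added to the attention logits alongside T5's usual sequential relative-position bias:

```python
import torch
import torch.nn.functional as F

def attention_with_layout_bias(q, k, v, seq_bias, horiz_bias, vert_bias):
    """Dot-product attention with layout injected as an additive attention bias.

    q, k, v:    (batch, heads, seq_len, head_dim)
    seq_bias:   T5-style 1D relative-position bias, (heads, seq_len, seq_len)
    horiz_bias: bias from relative x-distances between token bounding boxes
    vert_bias:  bias from relative y-distances between token bounding boxes
    """
    # T5-style attention omits the 1/sqrt(d) scaling factor.
    scores = torch.matmul(q, k.transpose(-2, -1))
    # TILT's core idea: layout enters as extra additive bias terms on the
    # attention logits, rather than as extra input embeddings.
    scores = scores + seq_bias + horiz_bias + vert_bias
    return torch.matmul(F.softmax(scores, dim=-1), v)
```

Because layout enters only through these additive bias terms, the pretrained T5 encoder-decoder weights can largely be reused unchanged, which is consistent with initializing from Hugging Face's T5 weights as noted below.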
The model implementation is in the `src/t5.py` file. For finetuning TILT on CORD, an example along with the results is present here.
Similarly, for finetuning TILT on FUNSD, an example along with the results is present here.
| Model Name      | Dataset Name | Number of Parameters | Overall Precision | Overall Recall | Overall F1 Score | Overall Accuracy |
|-----------------|--------------|----------------------|-------------------|----------------|------------------|------------------|
| TILT            | FUNSD        | 225M                 | 57.58             | 42.25          | 48.87            | 83.60            |
| TILT            | CORD         | 225M                 | 64.81             | 62.64          | 63.71            | 80.52            |
| TILT (Original) | CORD         | 230M                 | —                 | —              | 95.11            | —                |
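For orientation, below is a minimal sketch of the kind of finetuning loop involved. It is a simplified stand-in, not this repository's training script: it uses plain Hugging Face T5, omits the layout and image inputs entirely, and the source/target strings are made-up CORD-style examples framed as sequence generation in the spirit of the paper's decoder-based approach.

```python
import torch
from transformers import T5TokenizerFast, T5ForConditionalGeneration

tokenizer = T5TokenizerFast.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")  # the weights TILT starts from
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Hypothetical CORD-style example: OCR text in, labeled fields out as text,
# since TILT's decoder unifies extraction tasks as sequence generation.
source = "extract: 1x Americano 4,500 ... TOTAL 4,500"
target = "menu.nm: Americano | total.total_price: 4,500"

inputs = tokenizer(source, return_tensors="pt", truncation=True)
labels = tokenizer(target, return_tensors="pt", truncation=True).input_ids

model.train()
loss = model(**inputs, labels=labels).loss  # cross-entropy over target tokens
loss.backward()
optimizer.step()
optimizer.zero_grad()
```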
Note that, in the case of my results on CORD, the model has not been pre-trained (the weights are initialized from Hugging Face's implementation) and was finetuned for 30 epochs, while in the original paper the authors trained for 360,000 steps, which is roughly equivalent to 360,000 / 100 = 3,600 epochs. (The 100 comes from 800 / 8: the CORD dataset has 800 training examples and the paper uses a batch size of 8, so one epoch is about 100 steps.)
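Spelled out as a quick sanity check:

```python
train_examples = 800   # CORD training set size
batch_size = 8         # batch size reported in the paper
steps_per_epoch = train_examples // batch_size  # = 100
epochs = 360_000 / steps_per_epoch              # = 3600.0
```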
If you find this repository useful, please cite the following paper:
```bibtex
@inproceedings{powalski2021going,
  title={Going full-tilt boogie on document understanding with text-image-layout transformer},
  author={Powalski, Rafa{\l} and Borchmann, {\L}ukasz and Jurkiewicz, Dawid and Dwojak, Tomasz and Pietruszka, Micha{\l} and Pa{\l}ka, Gabriela},
  booktitle={Document Analysis and Recognition--ICDAR 2021: 16th International Conference, Lausanne, Switzerland, September 5--10, 2021, Proceedings, Part II 16},
  pages={732--747},
  year={2021},
  organization={Springer}
}
```
This project is licensed under the MIT License - see the LICENSE file for details.