Projects
Document Layout Understanding — DocFormer
2021 – 2022
Multi-modal document understanding using the DocFormer architecture, a transformer that jointly encodes text, visual features, and spatial layout for Visual Document Understanding tasks. Conducted under Dr. Santanu Chaudhary at IIT Jodhpur. Extended the original English-only model to multilingual document settings.
GitHub ↗
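A minimal sketch of the fusion idea described above: each token carries a text embedding, a visual-feature embedding, and a spatial embedding derived from its bounding box, and the three are combined additively before the transformer layers. All names, dimensions, and the random projection are illustrative assumptions, not the actual DocFormer implementation.

```python
import numpy as np

# Toy sketch of DocFormer-style multimodal input fusion (illustrative only).
D = 16          # embedding size (assumed for illustration)
num_tokens = 4  # number of document tokens

rng = np.random.default_rng(0)
text_emb = rng.normal(size=(num_tokens, D))    # stand-in for text-encoder output
visual_emb = rng.normal(size=(num_tokens, D))  # stand-in for image-region features

# Spatial embedding projected from normalized bounding boxes (x0, y0, x1, y1);
# W_spatial would be learned in practice, random here.
boxes = rng.uniform(size=(num_tokens, 4))
W_spatial = rng.normal(size=(4, D))
spatial_emb = boxes @ W_spatial

# Additive fusion: one joint token sequence fed to the transformer encoder.
fused = text_emb + visual_emb + spatial_emb
print(fused.shape)  # (4, 16)
```

Additive fusion keeps the sequence length fixed while letting every transformer layer see all three modalities at once, which is what enables layout-aware reasoning over the document.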
Medical Visual Question Answering
2021
Medical VQA system that answers natural language questions about clinical images by jointly reasoning over vision and language. Built CNN-based visual encoders with cross-modal attention, achieving 90% accuracy on benchmark datasets.
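The cross-modal attention step mentioned above can be sketched as question tokens attending over CNN feature-map regions. This is a generic scaled-dot-product sketch under assumed shapes, not the project's actual model.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values):
    # Question-token embeddings (queries) attend over image-region
    # features (keys_values); returns one attended visual vector per token.
    scores = queries @ keys_values.T / np.sqrt(queries.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ keys_values

rng = np.random.default_rng(1)
q = rng.normal(size=(5, 32))   # 5 question-token embeddings (illustrative)
v = rng.normal(size=(9, 32))   # 9 CNN feature-map regions (illustrative)
attended = cross_attention(q, v)
print(attended.shape)  # (5, 32)
```

Each question token ends up with a weighted summary of the image regions most relevant to it, which the answer head can then classify.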