Bayesian Learning of 2D Document Layout Models for Automated Preservation Metadata Extraction

Mao S, Thoma G
Proc. of the Fourth IASTED International Conference on Visualization, Imaging, and Image Processing (VIIP). 2004 Sept.:329-34.


Digital preservation addresses the storage, maintenance, accessibility, and technical integrity of digital materials over the long term. Preservation metadata is the information required to perform these tasks. Given the volume of journals to be preserved and the high labor cost of manual metadata entry, automated metadata extraction is necessary. Document layout analysis is the process of partitioning document images into hierarchically structured, labeled, homogeneous physical regions. Descriptive metadata such as bibliographic information can then be extracted from these segmented and labeled regions using OCR. While numerous algorithms have been proposed for document layout analysis, most of them require manually specified rules or models. In this paper, we first define the hierarchical 2D layout model of document pages as a set of attributed hidden semi-Markov models (HSMMs). Each attributed HSMM represents the projection profile of the character bounding boxes in a physical region on either the X or Y axis. We then describe a Bayesian method to learn 2D layout models from the unstructured and labeled physical regions in a set of training pages. We compare the zoning and labeling performance of the learned HSMM-based model, a learned baseline model, and two rule-based systems on 69 test pages. The HSMM-based model has the best overall performance, and comparable or better performance on individual fields.
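To make the notion of a projection profile concrete, the following is a minimal sketch of how character bounding boxes in a region might be projected onto the X or Y axis and then collapsed into runs with explicit durations, which is the kind of sequence a semi-Markov model operates on. It is an illustration only, not the authors' implementation: the box format `(x0, y0, x1, y1)` with exclusive upper bounds, the function names `projection_profile` and `runs`, and the sample boxes are all assumptions introduced here.

```python
from typing import List, Tuple

# Hypothetical box format: (x0, y0, x1, y1), upper bounds exclusive.
Box = Tuple[int, int, int, int]


def projection_profile(boxes: List[Box], axis: str, length: int) -> List[int]:
    """Count how many boxes cover each pixel position along the given axis."""
    profile = [0] * length
    for x0, y0, x1, y1 in boxes:
        lo, hi = (x0, x1) if axis == "x" else (y0, y1)
        # Clip to the region so stray boxes cannot index out of range.
        for i in range(max(lo, 0), min(hi, length)):
            profile[i] += 1
    return profile


def runs(profile: List[int]) -> List[Tuple[bool, int]]:
    """Collapse a profile into (has_text, duration) runs.

    Each run is a candidate state segment with an explicit duration,
    e.g. a text line followed by an inter-line gap.
    """
    out: List[Tuple[bool, int]] = []
    for value in profile:
        has_text = value > 0
        if out and out[-1][0] == has_text:
            out[-1] = (has_text, out[-1][1] + 1)
        else:
            out.append((has_text, 1))
    return out


# Illustrative data: two characters on one line, one wider box on a lower line.
sample_boxes = [(0, 0, 3, 2), (4, 0, 6, 2), (0, 5, 4, 7)]
y_profile = projection_profile(sample_boxes, "y", 8)
print(y_profile)        # [2, 2, 0, 0, 0, 1, 1, 0]
print(runs(y_profile))  # [(True, 2), (False, 3), (True, 2), (False, 1)]
```

In this sketch the Y-axis profile separates the two text lines by a gap of three rows. A learned model would assign labels and duration distributions to such segments rather than relying on a fixed threshold.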