Le DX, Thoma GR
Proc. World Multiconference on Systems, Cybernetics and Informatics (SCI). 2000 Jul.;X: 348-52.
The structural layout information of scanned document pages is valuable for a wide range of document processing applications such as automatic document searching, document delivery and automated data entry. This paper describes the classification of scanned document pages into different classes of physical layout structures. The proposed page layout classification technique uses a combination of geometry-based and content-based zone features calculated from optical character recognition (OCR) output. Geometry-based features are derived from geometric zone information, and content-based features from zone contents. A new feature called “single and multiple column zone vertical area string pattern” is also proposed to normalize document page images. After normalization, a template matching algorithm calculates similarity classification features by matching the vertical area string patterns of document pages to those of predefined document layout structures. The similarity classification features, together with the geometry-based and content-based zone features, are then input into a rule-based learning system that makes the final decision on the page layout class. The performance of our document page layout classification scheme has been evaluated on a sample of several hundred images of biomedical journal pages. Preliminary evaluation results show that our approach is capable of classifying journal pages into different classes of physical layout structures with an accuracy of more than 96%.
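The abstract does not define the vertical area string pattern or the matching algorithm, so the following is only an illustrative sketch of the general idea, not the authors' method. It assumes each OCR zone is a hypothetical `(top, bottom, columns)` tuple, encodes a page top to bottom as a fixed-length string of `S` (single-column zone), `M` (multiple-column zone) and `-` (no zone), and substitutes the `difflib.SequenceMatcher` ratio from Python's standard library for the paper's template matching. The function names, the template strings and the bin count are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def vertical_pattern(zones, page_height, bins=10):
    """Encode a page as a fixed-length string, read top to bottom.

    zones: list of (top, bottom, columns) tuples; a hypothetical
    simplification of OCR zone output, not the paper's data format.
    Using a fixed number of bins makes pages of different heights
    comparable (an assumed form of normalization).
    """
    symbols = []
    for i in range(bins):
        # Sample the page at the vertical center of each bin.
        center = (i + 0.5) * page_height / bins
        symbol = "-"  # no zone covers this bin
        for top, bottom, columns in zones:
            if top <= center < bottom:
                symbol = "S" if columns == 1 else "M"
                break
        symbols.append(symbol)
    return "".join(symbols)


def similarity_features(pattern, templates):
    """Return one similarity score in [0, 1] per predefined layout template."""
    return {
        name: SequenceMatcher(None, pattern, template).ratio()
        for name, template in templates.items()
    }


def best_layout(pattern, templates):
    """Pick the template with the highest similarity score."""
    scores = similarity_features(pattern, templates)
    return max(scores, key=scores.get)


# Hypothetical templates: a title block over two columns, and a plain page.
TEMPLATES = {
    "title_then_two_column": "SSMMMMMMMM",
    "single_column": "SSSSSSSSSS",
}

# A page with a single-column zone on top and a two-column zone below it.
page_zones = [(0, 200, 1), (200, 1000, 2)]
pattern = vertical_pattern(page_zones, page_height=1000)
print(pattern, best_layout(pattern, TEMPLATES))
```

In the scheme described above, scores like these would not decide the class on their own: they would be passed, along with the geometry-based and content-based zone features, to the rule-based learning system that makes the final decision.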