Combining Static Classifiers and Class Syntax Models for Logical Entity Recognition in Scanned Historical Documents

Mao S, Mansukhani P, Thoma GR
Proc IEEE CVPR, Minneapolis, Minnesota, June 2007, pp. 1-8.

Class syntax can be used to 1) model the temporal or locational evolution of class labels across feature observation sequences, 2) correct classification errors of static classifiers when feature observations from different classes overlap in feature space, and 3) eliminate redundant features whose discriminative information is already represented in the class syntax. In this paper, we describe a novel method that combines static classifiers with class syntax models for supervised feature subset selection and classification in unified algorithms. Posterior class probabilities given feature observations are first estimated from the output of static classifiers, and then integrated into a parsing algorithm to find an optimal class label sequence for the given feature observation sequence. Finally, both static classifiers and class syntax models are used to search for an optimal subset of features. The optimal feature subset, the associated static classifiers, and the class syntax models are all learned from training data. We apply this method to logical entity recognition in scanned historical U.S. Food and Drug Administration (FDA) documents containing court case Notices of Judgment (NJs) of different layout styles, and show that the use of class syntax models not only corrects most classification errors of static classifiers, but also significantly reduces the dimensionality of feature observations with negligible impact on classification performance.
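
The step in which classifier posteriors are combined with class syntax to find an optimal label sequence can be illustrated with a small sketch. The abstract does not specify the form of the class syntax model or the parsing algorithm, so this example assumes a first-order label transition model and substitutes a Viterbi-style dynamic program. It also multiplies posteriors directly with transition probabilities, which is a simplification. The function name `decode_labels`, the class names, and all probabilities are hypothetical and chosen only for illustration.

```python
import math


def decode_labels(posteriors, transitions, start):
    """Viterbi-style search for the highest-scoring label sequence.

    posteriors:  list of dicts, posteriors[t][c] ~ P(class c | observation t),
                 e.g. taken from a static classifier's output (assumed input).
    transitions: dict of dicts, transitions[a][b] ~ P(next label b | label a),
                 a stand-in for the class syntax model (assumption).
    start:       dict, start[c] ~ P(first label is c).
    """
    def log(p):
        # Zero probability marks a label or transition the syntax forbids.
        return math.log(p) if p > 0 else float("-inf")

    classes = list(start)
    # best[c] = best log score of any label sequence ending in class c.
    best = {c: log(start[c]) + log(posteriors[0][c]) for c in classes}
    back = []  # back[t-1][c] = best predecessor of class c at position t
    for obs in posteriors[1:]:
        prev_best, best, pointers = best, {}, {}
        for c in classes:
            # Pick the predecessor label that maximizes the running score.
            p = max(classes, key=lambda a: prev_best[a] + log(transitions[a][c]))
            best[c] = prev_best[p] + log(transitions[p][c]) + log(obs[c])
            pointers[c] = p
        back.append(pointers)

    # Trace back from the best final label to recover the full sequence.
    path = [max(classes, key=lambda c: best[c])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Hypothetical two-class layout: a "title" zone followed by "body" zones.
# The assumed syntax forbids returning to "title" once "body" has started.
start = {"title": 1.0, "body": 0.0}
transitions = {
    "title": {"title": 0.5, "body": 0.5},
    "body": {"title": 0.0, "body": 1.0},
}
posteriors = [
    {"title": 0.9, "body": 0.1},
    {"title": 0.4, "body": 0.6},
    {"title": 0.6, "body": 0.4},  # static classifier errs on this zone
    {"title": 0.2, "body": 0.8},
]

# Per-observation argmax, i.e. the static classifier used on its own.
static_labels = [max(p, key=p.get) for p in posteriors]
# Sequence decoding with the assumed syntax model.
syntax_labels = decode_labels(posteriors, transitions, start)

print(static_labels)  # ['title', 'body', 'title', 'body']
print(syntax_labels)  # ['title', 'body', 'body', 'body']
```

In this toy case the static classifier labels the third zone as a title, which the assumed syntax rules out, and the sequence decoder relabels it as body. This mirrors, in miniature, the kind of error correction the abstract attributes to class syntax models.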