Unsupervised Style Classification of Document Page Images

Mao S, Nie L, Thoma GR
Proc IEEE International Conference on Image Processing, September 2005, Genova, Italy; Vol. II: 510-13

Style classification of document page images is crucial for logical structure analysis of heterogeneous document collections. Both layout and contextual features carry significant information about document styles. Most existing methods are supervised: specific document models or classifiers are learned from a training set of document page images with known class labels. In this paper, we propose an unsupervised classification method that involves no training or manual selection of algorithm parameters. We first represent each document page as an ordered labeled X-Y tree. A tree matching algorithm is then used to compute the style dissimilarity between two document pages. We propose a set of tree edit cost functions based on the Karl Pearson distance between two multivariate feature observations; these cost functions are robust to the over-segmentation problem and to zone length variations among instances of the same logical entity. Finally, the K-medoids algorithm is used to find an optimal grouping of the trees into K clusters, each of which corresponds to a distinct document style. We evaluate our algorithm on test datasets with different cluster sizes and degrees of style similarity. Experimental results show that our algorithm achieves an average classification accuracy of 95.69% over six datasets consisting of 150 pages of 11 different styles.
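
The clustering stage described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the Karl Pearson distance takes its common form as a variance-standardized Euclidean distance, and it replaces the tree matching dissimilarity with that distance applied directly to made-up per-page feature vectors. The names pearson_distance, k_medoids, and pages, the feature values, and the naive medoid initialization are all hypothetical.

```python
import math
import statistics


def pearson_distance(x, y, variances):
    """Variance-standardized Euclidean distance between two observations.

    Assumption: this is the form of the Karl Pearson distance intended;
    each squared feature difference is divided by that feature's variance.
    """
    return math.sqrt(sum((a - b) ** 2 / v for a, b, v in zip(x, y, variances)))


def k_medoids(dist, k, max_iter=100):
    """Alternating K-medoids on a precomputed n x n dissimilarity matrix.

    Returns (medoids, labels): medoid indices and a cluster index per item.
    The first k items are used as initial medoids (a simplification).
    """
    n = len(dist)
    medoids = list(range(k))

    def assign(current):
        # Each item joins the cluster of its nearest medoid.
        return [min(range(k), key=lambda c: dist[i][current[c]]) for i in range(n)]

    for _ in range(max_iter):
        labels = assign(medoids)
        new_medoids = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if not members:
                # Keep the old medoid if a cluster became empty.
                new_medoids.append(medoids[c])
                continue
            # The new medoid minimizes total dissimilarity to its cluster.
            new_medoids.append(
                min(members, key=lambda m: sum(dist[m][j] for j in members))
            )
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, assign(medoids)


# Hypothetical two-feature observations for six pages of two styles.
pages = [(1.0, 10.0), (1.2, 11.0), (0.9, 9.5),
         (5.0, 50.0), (5.2, 52.0), (4.8, 49.0)]

# Per-feature population variance over all pages.
variances = [statistics.pvariance(column) for column in zip(*pages)]

# Pairwise dissimilarity matrix (stand-in for tree edit distances).
dist = [[pearson_distance(p, q, variances) for q in pages] for p in pages]

medoids, labels = k_medoids(dist, k=2)
print(medoids, labels)
```

On this toy input the first three and the last three vectors fall into separate clusters. Because k_medoids only needs a dissimilarity matrix, a matrix of pairwise tree edit distances could be passed in place of dist without changing the clustering code.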