Le DX, Straughan SR, Thoma GR
Proc. 6th World Multiconference on Systemics, Cybernetics and Informatics, eds: Callaos N, et al. 2002 July;III: 86-91.
Most current commercial optical character recognition (OCR) systems can accurately recognize text in documents written in a single language. However, they perform poorly on Greek characters embedded in predominantly English text, and most do not recognize these characters as belonging to the Greek alphabet. As a result, considerable manual review is needed to validate and correct OCR errors. To address this problem, we propose a new technique that combines features calculated from the output of multiple OCR systems with string pattern matching and document content analysis to improve the recognition of both Greek characters and regular text. The technique uses two passes of each document page image through OCR systems that use different recognition languages. A preliminary evaluation on a sample of medical journal page images shows that the approach is feasible and improves the recognition of Greek characters embedded in predominantly English text.
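
The two-pass idea can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the two passes use English and Greek as recognition languages, that each pass yields one character and one confidence value per position, and that the two outputs are already aligned. The names `OcrChar`, `is_greek`, `merge_passes`, and the `margin` threshold are hypothetical, and the string pattern matching and document content analysis steps mentioned in the abstract are left out.

```python
from dataclasses import dataclass


@dataclass
class OcrChar:
    """One recognized character with its confidence (hypothetical structure)."""
    text: str
    confidence: float  # assumed to lie between 0.0 and 1.0


def is_greek(ch: str) -> bool:
    """True if ch is a single character in the Unicode Greek and Coptic block."""
    return len(ch) == 1 and "\u0370" <= ch <= "\u03ff"


def merge_passes(english, greek, margin=0.2):
    """Merge two aligned OCR passes over the same page region.

    The English-pass character is kept unless the Greek pass returns a Greek
    letter whose confidence exceeds the English one by at least `margin`
    (an illustrative threshold, not a value from the paper).
    """
    if len(english) != len(greek):
        raise ValueError("this sketch assumes the two passes are aligned")
    merged = []
    for en, gr in zip(english, greek):
        if is_greek(gr.text) and gr.confidence - en.confidence >= margin:
            merged.append(gr.text)
        else:
            merged.append(en.text)
    return "".join(merged)


# Illustrative input: the English pass misreads alpha as a low-confidence "a".
english_pass = [OcrChar("T", 0.95), OcrChar("N", 0.95), OcrChar("F", 0.95),
                OcrChar("-", 0.90), OcrChar("a", 0.40)]
greek_pass = [OcrChar("\u03a4", 0.50), OcrChar("\u039d", 0.50), OcrChar("\u03a6", 0.30),
              OcrChar("-", 0.90), OcrChar("\u03b1", 0.90)]
result = merge_passes(english_pass, greek_pass)
print(result)
```

In this example only the last position is taken from the Greek pass, so the English letters survive even though the Greek engine proposes look-alike capitals for them. A real system would also need to align outputs of different lengths, which this sketch does not attempt.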