OCR Correction Using Historical Relationships from Verified Text in Biomedical Citations

Hauser SE, Sabir TF, Thoma GR
Proc. of 2003 Symposium on Document Image Understanding Technology. College Park MD: Institute for Advanced Computer Studies, University of Maryland. 2003 April: 171-7.

The Lister Hill National Center for Biomedical Communications has developed a system that combines OCR with automated recognition and reformatting algorithms to extract bibliographic citation data from scanned biomedical journal articles and populate the NLM’s MEDLINE database. The multi-engine OCR server incorporated in the system performs well in general, but fares less well with text printed in the small or italic fonts often used for institutional affiliations. Because of poor OCR output, among other reasons, the resulting affiliation field frequently takes a disproportionate amount of time to correct and verify manually. In contrast, author names are usually printed in large, normal fonts that the OCR system recognizes correctly. We describe techniques that exploit the more successful OCR conversion of author names to help find the correct affiliations from MEDLINE data.
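The general idea of using reliably recognized author names to retrieve candidate affiliations can be illustrated with a minimal sketch. It is not the system described in the paper: the `HISTORY` table, the function names, the similarity measure (Python's `difflib.SequenceMatcher`), and the 0.6 threshold are all assumptions chosen for illustration, and the records are invented placeholders rather than MEDLINE data.

```python
from difflib import SequenceMatcher

# Hypothetical stand-in for verified historical records:
# author name -> affiliations previously verified for that author.
HISTORY = {
    "Smith J": [
        "Department of Biology, Example University, Springfield, USA.",
        "Institute of Genetics, Sample College, Riverton, USA.",
    ],
    "Lee K": [
        "School of Medicine, Placeholder University, Lakeside, USA.",
    ],
}


def similarity(a, b):
    """Case-insensitive similarity ratio between two strings (0.0 to 1.0)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def suggest_affiliation(authors, ocr_affiliation, history, threshold=0.6):
    """Return the historical affiliation of any listed author that best
    matches the noisy OCR text, or None if no candidate reaches the
    (assumed) similarity threshold."""
    best, best_score = None, 0.0
    for author in authors:
        for candidate in history.get(author, []):
            score = similarity(ocr_affiliation, candidate)
            if score > best_score:
                best, best_score = candidate, score
    return best if best_score >= threshold else None


# Simulated OCR errors of the kind small or italic fonts tend to produce.
noisy = "Departrnent of Bio1ogy, Exarnple Universlty, Springfie1d, USA."
suggestion = suggest_affiliation(["Smith J", "Lee K"], noisy, HISTORY)
```

Returning `None` when no candidate clears the threshold leaves the decision with the human operator, which matters when an author has moved to an institution that has no historical record yet.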