Automated Labeling Algorithms for Biomedical Document Images

Kim J, Le DX, Thoma GR
Proc. 7th World Multiconference on Systemics, Cybernetics and Informatics. 2003 July;V: 352-57.

The National Library of Medicine (NLM) has developed an automated system, the Medical Article Records System (MARS), to process bibliographic data (title, authors, affiliation, abstract, etc.) from biomedical journal articles for its MEDLINE database. This paper describes the labeling module in MARS, which automatically extracts the bibliographic data from biomedical journal articles. The labeling module is composed of two submodules: the General Label Type Module (GLTM) and the Arbitrary Label Type Module (ALTM). Six commonly used label types were identified from a survey of several thousand journals. A journal is classified as a general label type if its label type is one of these six; otherwise, it is classified as an arbitrary label type. The GLTM processes journals of the general label types, and the ALTM processes journals of arbitrary label types. Both modules use rule-based algorithms, with rules derived from an analysis of journal articles and from features extracted from the optical character recognition (OCR) results. There are 126 rules for the GLTM and 49 rules for the ALTM. Experiments conducted with medical journal articles show relatively accurate labeling results.
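
The routing between the two submodules can be illustrated with a minimal sketch. It is not the MARS implementation: the abstract does not name the six general label types, so the identifiers in `GENERAL_LABEL_TYPES`, the `Journal` class, and the `select_module` function are all hypothetical placeholders. The sketch only shows the membership test that decides which module handles a journal.

```python
from dataclasses import dataclass

# Hypothetical placeholder names: the abstract says there are six general
# label types but does not list them.
GENERAL_LABEL_TYPES = frozenset(f"general_type_{i}" for i in range(1, 7))


@dataclass(frozen=True)
class Journal:
    """Assumed minimal record: a journal title and its label type."""
    title: str
    label_type: str


def select_module(journal: Journal) -> str:
    """Return the name of the submodule that would process this journal.

    A journal whose label type is one of the six general types goes to the
    GLTM; any other journal goes to the ALTM.
    """
    if journal.label_type in GENERAL_LABEL_TYPES:
        return "GLTM"
    return "ALTM"


if __name__ == "__main__":
    for j in (Journal("Journal A", "general_type_3"),
              Journal("Journal B", "custom_layout")):
        print(j.title, "->", select_module(j))
```

In this sketch the rule sets themselves (126 rules for the GLTM, 49 for the ALTM) are left out; only the dispatch step is modeled.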