Generating Robust Features for Style-Independent Labeling of Bibliographic Fields in Medical Journal Articles

Mao S, Kim J, Le DX, Thoma GR
Proc. 7th World Multiconference on Systemics, Cybernetics and Informatics. 2003 July;III:53-6.

Bibliographic data such as title, author, affiliation, and abstract are crucial for indexing biomedical journal articles. The Medical Article Records System (MARS) has been developed at the National Library of Medicine (NLM) to automate bibliographic data extraction for MEDLINE, the NLM’s premier database of citations to the biomedical literature. Automatic extraction of bibliographic data involves assigning logical labels (title, author, affiliation, and abstract) to homogeneous regions, or zones, on page images. While an OCR- and rule-based labeling module in MARS (called ZoneCzar) can reliably label medical journals with regular layout styles, it cannot accurately label journals with arbitrary or unusual layout styles, so new rules have to be created manually for these journals. Furthermore, OCR zoning errors, particularly merging errors, can greatly affect the labeling accuracy of ZoneCzar. In this paper, we describe an algorithm for automatically generating robust features that the labeling algorithm uses to perform style-independent labeling.