Entering information from journal articles into the thousands of bibliographic databases around the world remains a heavily manual task. At the National Library of Medicine (NLM) we are automating the production of bibliographic records for MEDLINE, NLM's premier database, which is used by clinicians and researchers worldwide. As a first step we developed a system called MARS (Medical Article Record System), in which the abstracts that appear in journal articles are scanned and converted by optical character recognition (OCR), while the remaining fields (e.g., article title, authors, affiliations) are keyboarded. This system has been in production since 1996 and employs a team of professionals to process 600 articles daily.
A second-generation system that automatically extracts the remaining fields is now being designed. Like the first, it employs scanning and OCR, and it adds modules that automatically zone the scanned pages, identify the zones as particular fields, and reformat the field syntax to adhere to MEDLINE conventions. Developing the second-generation system therefore consists of designing algorithms to detect page zones (page segmentation), to label these zones by field name (article title, author, affiliation, abstract), and to reformat the text of each zone. The system relies on a database both to track the workflow and to serve as a repository for data extracted from the scanned page, for use by subsequent processes.
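
The flow of the second-generation system (zone, label, reformat, record in a database) can be illustrated with a minimal Python sketch. It is not the MARS implementation: the names `Zone`, `segment_page`, `label_zones`, `reformat_author`, and `run_pipeline` are hypothetical, the blank-line segmentation and reading-order labeling are toy stand-ins for the actual algorithms, the "surname followed by initials" author rule is only an illustrative convention, and a plain dictionary stands in for the database.

```python
from dataclasses import dataclass

# Field names taken from the text; the fixed order is an assumption of this sketch.
FIELD_NAMES = ("article title", "author", "affiliation", "abstract")


@dataclass
class Zone:
    text: str        # OCR text of the zone
    top: int         # index of the first OCR line (stand-in for page position)
    label: str = ""  # field name assigned by label_zones


def segment_page(ocr_lines):
    """Toy page segmentation: blank lines separate zones."""
    zones, current, start = [], [], 0
    for i, line in enumerate(ocr_lines):
        if line.strip():
            if not current:
                start = i
            current.append(line.strip())
        elif current:
            zones.append(Zone(" ".join(current), start))
            current = []
    if current:
        zones.append(Zone(" ".join(current), start))
    return zones


def label_zones(zones):
    """Toy labeling: assume zones appear in the order given by FIELD_NAMES."""
    ordered = sorted(zones, key=lambda z: z.top)
    for zone, name in zip(ordered, FIELD_NAMES):
        zone.label = name
    return ordered


def reformat_author(text):
    """Illustrative rule: 'John A. Smith' becomes 'Smith JA'."""
    formatted = []
    for name in text.split(","):
        parts = name.split()
        if not parts:
            continue
        initials = "".join(p[0].upper() for p in parts[:-1])
        formatted.append(f"{parts[-1]} {initials}".strip())
    return ", ".join(formatted)


def run_pipeline(article_id, ocr_lines, store):
    """Run the three stages, recording status and results in `store`."""
    record = store.setdefault(article_id, {"status": "scanned", "fields": {}})
    zones = segment_page(ocr_lines)
    record["status"] = "zoned"
    zones = label_zones(zones)
    record["status"] = "labeled"
    for zone in zones:
        if not zone.label:
            continue  # zones beyond the known fields are left unlabeled
        text = reformat_author(zone.text) if zone.label == "author" else zone.text
        record["fields"][zone.label] = text
    record["status"] = "reformatted"
    return record


page = [
    "A Study of Things", "",
    "John A. Smith, Mary Jones", "",
    "Some University, City", "",
    "We studied things.", "It worked.",
]
store = {}  # in-memory stand-in for the workflow database
record = run_pipeline("article-1", page, store)
for name, value in record["fields"].items():
    print(f"{name}: {value}")
```

Keeping the intermediate status and the extracted fields in one shared record mirrors the role the text assigns to the database: each stage can be run, inspected, or retried independently, and later stages read what earlier ones stored.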