Thoma GR, Ford G
Proc. SPIE: Document Recognition and Retrieval IX. 2002 Jan;4670: 181-90.
This paper discusses the performance of a system for extracting bibliographic fields from scanned pages in biomedical journals to populate MEDLINE, the flagship database of the National Library of Medicine (NLM), and heavily used worldwide. This system consists of automated processes to extract the article title, author names, affiliations and abstract, and manual workstations for the entry of other required fields such as pagination, grant support information, databank accession numbers and others needed for a completed bibliographic record in MEDLINE. Labor and time data are given for (1) a wholly manual keyboarding process to create the records, (2) an OCR-based system that requires all fields except the abstract to be manually input, and (3) a more automated system that relies on document image analysis and understanding techniques for the extraction of several fields. It is shown that this last, most automated, approach requires less than 25% of the labor effort in the first, manual, process.