Le DX, Thoma GR
In: Callaos N, Lesso W, editors. SCI 2005. Proc 9th World Multiconference on Systemics, Cybernetics and Informatics; 2005 Jul 10-13; Vol. 3, Computer Science and Engineering. Orlando (FL): International Institute of Informatics and Systemics; c2005. p. 267-74
To provide online access to citations from old hardcopy indexes published from 1879 through 1965, an R&D division of the National Library of Medicine (NLM) is developing an automated system that converts bibliographic information in volumes of the printed Quarterly Cumulative Index Medicus (QCIM) to machine-readable form for inclusion in the OLDMEDLINE database. The system processes images scanned from a QCIM volume, segments and labels the image records, identifies multiple occurrences of the same record in the volume, and creates unique citation records. Record segmentation and labeling rely on three elements: a bottom-up smearing approach for text block segmentation, the document page layout formats, and a set of record-labeling rules derived from the QCIM document format guideline. Because a QCIM document lists bibliographic information under both author entries and subject entries, the same citation can appear more than once, so duplicate records have to be detected and combined into a single unique citation. Duplicates are identified by matching cross-reference information, such as author names, journal title abbreviation, volume, pagination, month, and year, among the different entries of the same citation. The cross-reference information can also be used to correct OCR errors, which improves the quality of the citations created. The performance of the system has been evaluated using a QCIM volume published in 1929 that consists of 95,717 citation records. The evaluation shows the technical and cost feasibility of building the proposed data conversion system.
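
The duplicate-detection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the field names (`journal`, `volume`, `pages`, `year`), the sample records, and the exact-match key are all assumptions made for illustration. The sketch normalizes a few cross-reference fields and groups author and subject entries that share the same normalized key.

```python
import re
from collections import defaultdict


def normalize(text):
    """Lowercase the text and keep only letters and digits."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def citation_key(record):
    """Build a matching key from cross-reference fields.

    The choice of fields is an assumption; the abstract also lists
    author names and month as cross-reference information.
    """
    return tuple(
        normalize(record[field])
        for field in ("journal", "volume", "pages", "year")
    )


def group_duplicates(records):
    """Group records that share the same normalized key."""
    groups = defaultdict(list)
    for record in records:
        groups[citation_key(record)].append(record)
    return list(groups.values())


# Hypothetical records: one citation listed under an author entry
# and a subject entry, plus one unrelated citation.
records = [
    {"entry": "author", "journal": "J.A.M.A.", "volume": "92",
     "pages": "1234", "year": "1929"},
    {"entry": "subject", "journal": "JAMA", "volume": "92",
     "pages": "1234", "year": "1929"},
    {"entry": "author", "journal": "Lancet", "volume": "1",
     "pages": "55", "year": "1929"},
]

groups = group_duplicates(records)
```

Exact matching on a normalized key is the simplest possible choice and would miss entries whose fields contain OCR errors. A system that tolerates such errors would presumably need approximate matching on some fields, which this sketch does not attempt.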