Mao S, Misra D, Seamans J, Thoma GR
IS&T Archiving 2005 Conference, April 2005; 48-53.
Among the digital material considered for preservation at the U.S. National Library of Medicine (NLM) are TIFF, PDF and HTML files of biomedical journals, laboratory notebooks, correspondence of major figures in biomedical research, and similar documents. Although most of these materials are already in digital form (either born digital or converted to digital form through scanning), their preservation involves complex administrative and technical issues, such as obtaining and storing adequate levels of metadata for a preserved resource, assuring the intellectual integrity of the contents, and avoiding technical obsolescence of encoded information [1,2]. An R&D project has been initiated at NLM to develop a prototype system for investigating the key technical functions required to preserve NLM’s digital resources effectively over the long term. The initial design and implementation phase of this system, named the System for Preservation of Electronic Resources (SPER), has been completed. Here we describe the main functions of SPER and the strategies adopted in designing the system to provide these functions in a modular and cost-effective manner. In particular, an automated metadata extraction subsystem is designed to minimize manual entry, using string matching and machine learning techniques. We discuss the overall system architecture, the automated metadata extraction techniques, and file migration in the SPER system, and give preliminary performance assessments of the prototype's subsystems.