Automated Metadata Extraction to Preserve the Digital Contents of Biomedical Collections

Thoma GR, Mao S, Misra D
Proc VIIP 2005. September 2005. Benidorm, Spain; 214-19

The long-term preservation of digital objects, a growing problem as libraries and archives acquire them, requires appropriate systems, standards and institutional policies. A key requirement is the acquisition of metadata about the objects to enable future access and usage, as well as the migration of digital files from obsolete formats to newer ones. Metadata is data about data. It typically consists of information about the intellectual content of a digital object, the data required for appropriate digital representation and interpretation, security or rights management information, and the object's relation to other digital objects. The manual recording of these metadata elements is highly labor-intensive, and automated means for doing this are key to successful preservation. This paper introduces a prototype system for digital preservation and describes its main functions, highlighting the strategies adopted in designing the system to provide these functions in a modular and cost-effective manner. It then presents an automated metadata extraction subsystem that uses string matching and machine learning techniques to minimize manual entry, and gives preliminary performance assessments.
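
As a rough illustration of the string matching side of automated metadata extraction, the sketch below pulls labeled fields out of the plain text of a document. It is not the system described in the paper: the field names, the label patterns, and the function `extract_metadata` are all hypothetical, and the machine learning component is not shown.

```python
import re
from typing import Dict

# Hypothetical label patterns, chosen only for illustration.
# The paper does not state which fields or labels its subsystem matches.
LABEL_PATTERNS = {
    "title": re.compile(r"^\s*title\s*:\s*(.+)$", re.IGNORECASE),
    "author": re.compile(r"^\s*authors?\s*:\s*(.+)$", re.IGNORECASE),
    "date": re.compile(r"^\s*date\s*:\s*(.+)$", re.IGNORECASE),
}


def extract_metadata(text: str) -> Dict[str, str]:
    """Return the first value found for each labeled field in `text`."""
    found: Dict[str, str] = {}
    for line in text.splitlines():
        for name, pattern in LABEL_PATTERNS.items():
            if name in found:
                continue  # keep the first occurrence of each field
            match = pattern.match(line)
            if match:
                found[name] = match.group(1).strip()
    return found


sample = (
    "Title: Annual Report\n"
    "Authors: J. Doe\n"
    "Some body text\n"
    "Date: 1962-05-01\n"
)
record = extract_metadata(sample)
```

Here `record` holds one entry per matched label. Fields with no matching label are simply absent, which is where a learned classifier or manual entry would have to take over.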