Archiving a Historic Medico-legal Collection: Automation and Workflow Customization

Misra D, Mao S, Rees J, Thoma GR
Proc IS&T Archiving 2007. Arlington, Virginia, May 2007; 157-61

The U.S. National Library of Medicine (NLM) has acquired a historical collection of documents, released by the Food and Drug Administration, containing the Notices of Judgment (NJs) issued against manufacturers of adulterated or misbranded food, drugs and cosmetics. These documents, comprising more than 70,000 pages and more than 65,000 NJs, are to be preserved and made accessible over the long term because of their legal and historical value. We developed a preservation system named SPER (System for Preservation of Electronic Resources), built on the DSpace infrastructure, for archiving and disseminating the NJs contained in these documents. For efficiency and cost-effectiveness, we developed algorithms that automatically identify the NJs and extract metadata from their contents; an archivist then reviews and edits the metadata and ingests the NJs into the archive. The contents of the documents are also captured as text streams to provide full-text search of the NJs. These functions required a number of changes to the open source DSpace software, including changing the ingest interface and workflow, handling a metadata schema that does not map to Dublin Core, and enhancing the database schema. This paper describes the overall SPER system, the customized workflow for automated metadata extraction, the extraction process itself, and an estimate of the labor saved through automation.
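
The workflow summarized above (automatic identification of NJs, metadata extraction, archivist review, then ingest) can be illustrated with a minimal Python sketch. It is not SPER's implementation and does not use the DSpace API. The header pattern (a case number, a period, and a title on one line), the `NoticeOfJudgment` record, the function names, and the dictionary standing in for the archive are all assumptions made for illustration only.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Optional

# ASSUMPTION: each NJ starts on a line of the form "<number>. <title>".
# The real layout of the source documents may differ.
NJ_HEADER = re.compile(r"^(?P<number>\d+)\.\s+(?P<title>.+)$", re.MULTILINE)


@dataclass
class NoticeOfJudgment:
    """Hypothetical record holding extracted metadata and the full text."""
    number: int
    title: str
    text: str                   # kept so the NJ can be full-text indexed
    status: str = "extracted"   # extracted -> reviewed -> ingested


def split_into_njs(page_text: str) -> List[NoticeOfJudgment]:
    """Identify NJ boundaries and extract basic metadata from each one."""
    headers = list(NJ_HEADER.finditer(page_text))
    njs = []
    for i, match in enumerate(headers):
        # An NJ runs from its header to the next header (or end of text).
        end = headers[i + 1].start() if i + 1 < len(headers) else len(page_text)
        njs.append(NoticeOfJudgment(
            number=int(match.group("number")),
            title=match.group("title").strip(),
            text=page_text[match.start():end].strip(),
        ))
    return njs


def review(nj: NoticeOfJudgment,
           corrections: Optional[Dict[str, object]] = None) -> NoticeOfJudgment:
    """Apply the archivist's corrections and mark the NJ as reviewed."""
    for field_name, value in (corrections or {}).items():
        setattr(nj, field_name, value)
    nj.status = "reviewed"
    return nj


def ingest(nj: NoticeOfJudgment, archive: Dict[int, NoticeOfJudgment]) -> None:
    """Store a reviewed NJ; unreviewed NJs are rejected."""
    if nj.status != "reviewed":
        raise ValueError("only reviewed NJs may be ingested")
    nj.status = "ingested"
    archive[nj.number] = nj
```

In this sketch the review step is a hard gate: `ingest` refuses any record an archivist has not marked as reviewed, which mirrors the human-in-the-loop step described in the abstract while leaving identification and extraction fully automatic.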