Inactive Project: Medical Article Records Groundtruth (MARG)

Why?
MARG is a freely available repository of document page images and their associated textual and layout data. The data has been reviewed and corrected to establish its “ground truth”. Such repositories greatly facilitate research in document image analysis and understanding, because they support the design, training, and testing of algorithms for data identification and extraction.

What is it?
The data here is ground truth for representative article page images drawn from the corpus of important biomedical journals. It is therefore best suited to developing algorithms that locate and extract text from the bibliographic fields typical of articles in such journals. These fields include the article title, author names, institutional affiliations, abstracts, and possibly others. Only the first page of each article is available, or the second page if the abstract runs over.

This data was collected during the operation of a system called MARS (Medical Article Records System). MARS was developed by the Communications Engineering Branch of the Lister Hill National Center for Biomedical Communications, an R&D division of the U.S. National Library of Medicine. MARS combines scanning, optical character recognition (OCR), and document image analysis techniques to automatically extract bibliographic data from paper-based biomedical journals and populate the Library’s flagship database, MEDLINE®, used worldwide by biomedical researchers and clinicians. The data includes document images and OCR-converted, operator-verified data at the page, zone, line, word, and character levels.

In addition to providing a public site where researchers worldwide can develop and test their algorithms, we offer a tool to graphically visualize the ground truth data and conduct comparisons and analysis. Code-named Rover (gROundtruth Vs. Engineered Results), this automated analysis assistant compares the results of a researcher’s program to the ground truth data. The ground truth and results data are in XML, and Rover is written in Java.
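To give a feel for the kind of comparison described above, here is a minimal Java sketch that scores a program's zones against ground-truth zones. It is not Rover and does not use the actual MARG XML schema: the `Zone` element, its `label`, `left`, `top`, `right`, and `bottom` attributes, the class name `ZoneCompare`, and the intersection-over-union matching rule are all assumptions made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class ZoneCompare {
    /** A labeled rectangular zone. Field names are hypothetical, not the MARG schema. */
    static final class Zone {
        final String label;
        final int left, top, right, bottom;

        Zone(String label, int left, int top, int right, int bottom) {
            this.label = label;
            this.left = left;
            this.top = top;
            this.right = right;
            this.bottom = bottom;
        }

        long area() {
            return (long) Math.max(0, right - left) * Math.max(0, bottom - top);
        }
    }

    /** Reads every Zone element (assumed layout) from an XML string. */
    static List<Zone> parseZones(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = doc.getElementsByTagName("Zone");
            List<Zone> zones = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element e = (Element) nodes.item(i);
                zones.add(new Zone(e.getAttribute("label"),
                        Integer.parseInt(e.getAttribute("left")),
                        Integer.parseInt(e.getAttribute("top")),
                        Integer.parseInt(e.getAttribute("right")),
                        Integer.parseInt(e.getAttribute("bottom"))));
            }
            return zones;
        } catch (Exception ex) {
            throw new IllegalArgumentException("Could not parse zone XML", ex);
        }
    }

    /** Intersection over union of two zones; 0.0 when they do not overlap. */
    static double iou(Zone a, Zone b) {
        int w = Math.min(a.right, b.right) - Math.max(a.left, b.left);
        int h = Math.min(a.bottom, b.bottom) - Math.max(a.top, b.top);
        long inter = (w > 0 && h > 0) ? (long) w * h : 0;
        long union = a.area() + b.area() - inter;
        return union == 0 ? 0.0 : (double) inter / union;
    }

    /** Fraction of ground-truth zones matched by a same-label result zone with IoU >= threshold. */
    static double matchRate(List<Zone> truth, List<Zone> results, double threshold) {
        if (truth.isEmpty()) {
            return 0.0;
        }
        int matched = 0;
        for (Zone t : truth) {
            for (Zone r : results) {
                if (t.label.equals(r.label) && iou(t, r) >= threshold) {
                    matched++;
                    break;
                }
            }
        }
        return (double) matched / truth.size();
    }

    public static void main(String[] args) {
        // Tiny made-up inputs; real MARG files will look different.
        String truth = "<Page>"
                + "<Zone label='Title' left='100' top='100' right='900' bottom='200'/>"
                + "<Zone label='Author' left='100' top='220' right='900' bottom='260'/>"
                + "</Page>";
        String results = "<Page>"
                + "<Zone label='Title' left='100' top='100' right='900' bottom='200'/>"
                + "<Zone label='Author' left='100' top='400' right='900' bottom='440'/>"
                + "</Page>";
        System.out.println(matchRate(parseZones(truth), parseZones(results), 0.5));
    }
}
```

The overlap threshold is a design choice rather than anything prescribed by MARG: a stricter value penalizes loose zone boundaries, while a looser one mostly checks that each field was found in roughly the right place.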

Who is it for?
This data is for the computer science and informatics research communities interested in the development of innovative and efficient algorithms for automatic zoning (page segmentation), labeling (field identification), and syntax reformatting to adhere to the conventions of the database for which the fields are extracted.

Glenn Ford

Project Links:
MARG Results