Le DX, Thoma GR
Proc. 7th World Multiconference on Systemics, Cybernetics and Informatics. 2003 July;II: 411-15.
A growing number of publishers use the Internet and the World Wide Web to give subscribers access to online journals. New techniques are needed to capture, classify, analyze, extract, modify, and reformat Web-based document information for computer storage, access, and processing. An R&D division of the National Library of Medicine (NLM) is developing an automated system, provisionally code-named WebMARS (Web-based Medical Article Records System), to download and analyze Web-based journal articles and extract bibliographic information from them to produce citation records for its MEDLINE database. This paper describes one component of the system: assigning meaningful labels to the text zones that contain the article title, author names, affiliation, and abstract. The labeling technique relies on features derived from the World Wide Web Consortium Document Object Model (W3C DOM) and on an analysis of each journal's page layout. It combines DOM-based analysis of node location and content, string pattern matching, and a depth-first node traversal algorithm. Experiments on a variety of Web-based medical journals demonstrate the feasibility of this automated labeling approach. A preliminary evaluation on a small set of Web-based medical journal articles shows that the system labels text zones with an accuracy of over 95%.
Keywords: W3C Document Object Model, Automated document labeling, MEDLINE database, National Library of Medicine.
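The abstract names the ingredients of the labeling technique (DOM node features, string pattern matching, depth-first traversal) but not how they fit together. The following is a minimal illustrative sketch of one way they could be combined, not the WebMARS implementation. The sample page, the `PATTERNS` rules, and the functions `text_of`, `classify`, and `label_zones` are all assumptions made for this example. It uses Python's standard `xml.dom.minidom`, so it expects well-formed markup.

```python
import re
from xml.dom import minidom

# Hypothetical, well-formed sample page; real journal layouts differ per publisher.
SAMPLE_PAGE = (
    "<html><body>"
    "<h1>Labeling Web Articles</h1>"
    "<div><p class='authors'>A. Author, B. Writer</p>"
    "<p>Department of Examples, Sample University</p></div>"
    "<p>Abstract: A short summary of the article.</p>"
    "<p>Copyright notice</p>"
    "</body></html>"
)

# Illustrative string patterns (assumed for this sketch, not taken from the paper).
PATTERNS = [
    ("abstract", re.compile(r"^\s*abstract\b", re.IGNORECASE)),
    ("affiliation",
     re.compile(r"\b(department|university|institute|hospital)\b", re.IGNORECASE)),
]


def text_of(node):
    """Collect the text of a node and all of its descendants."""
    if node.nodeType == node.TEXT_NODE:
        return node.data
    return "".join(text_of(child) for child in node.childNodes)


def classify(element, text):
    """Pick a label from node features first, then from string patterns."""
    if element.tagName == "h1":
        return "title"
    if element.getAttribute("class") == "authors":
        return "author"
    for label, pattern in PATTERNS:
        if pattern.search(text):
            return label
    return "other"


def label_zones(node, zones=None):
    """Depth-first (pre-order) traversal that labels leaf-level text zones."""
    if zones is None:
        zones = []
    if node.nodeType == node.ELEMENT_NODE:
        has_child_elements = any(
            child.nodeType == child.ELEMENT_NODE for child in node.childNodes
        )
        text = text_of(node).strip()
        # Treat an element with text and no element children as one text zone.
        if not has_child_elements and text:
            zones.append((classify(node, text), text))
    for child in node.childNodes:
        label_zones(child, zones)
    return zones


document = minidom.parseString(SAMPLE_PAGE)
zones = label_zones(document.documentElement)
for label, text in zones:
    print(f"{label:12s} {text}")
```

In this sketch the node-feature checks run before the pattern checks, so a heading that happens to contain the word "University" is still labeled as a title. A production system would need an HTML parser that tolerates malformed markup and per-journal layout rules, neither of which is attempted here.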