Automated Cleanup Processing for Extracting Bibliographic Data from Biomedical Online Journals

Kim I, Le DX, Thoma GR
In: Callaos N, Lesso W, editors. SCI 2005. Proc. 9th World Multiconference on Systemics, Cybernetics and Informatics; 2005 Jul 10-13; Vol. 4; Orlando (FL): International Institute of Informatics and Systemics; c2005. 401-5


An R&D division of the National Library of Medicine (NLM) has developed the Web-based Medical Article Records System (WebMARS) to create citations from online biomedical journals. This paper presents one important part of this system: the automated cleanup module, which uses a rule-based approach to extract bibliographic information from HTML-formatted text. A learning scheme that compares the output of the cleanup module with the verified processing result is introduced to create and update cleanup rules automatically, thereby minimizing the manual effort of rule setting and improving the performance of the cleanup processing. Experimental results show that the proposed automated cleanup module can effectively detect and extract the bibliographic data of interest from HTML-formatted online journal articles using relevant rules identified through the learning process.
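
The abstract does not give the module's rule format or learning algorithm, so the following is only an illustrative sketch of the general idea, not the WebMARS implementation. It assumes a cleanup rule is a regular-expression substitution and that "learning" means proposing a new rule whenever the cleaned text still contains residue around the verified value. The function names (`strip_html`, `apply_rules`, `learn_rules`) and the sample HTML are hypothetical.

```python
import re
from html import unescape

# Assumption: a naive tag pattern is enough for this sketch; a real system
# would more likely use an HTML parser.
TAG_RE = re.compile(r"<[^>]+>")


def strip_html(fragment):
    """Replace tags with spaces, decode entities, and collapse whitespace."""
    text = unescape(TAG_RE.sub(" ", fragment))
    return " ".join(text.split())


def apply_rules(text, rules):
    """Apply each (pattern, replacement) cleanup rule in order."""
    for pattern, replacement in rules:
        text = re.sub(pattern, replacement, text)
    return text.strip()


def learn_rules(cleaned, verified):
    """Compare cleanup output with the verified value (hypothetical scheme).

    If the verified text appears inside the cleaned text, return rules that
    strip the leftover prefix and/or suffix; otherwise return no rules.
    """
    start = cleaned.find(verified)
    if start == -1 or cleaned == verified:
        return []
    prefix = cleaned[:start].strip()
    suffix = cleaned[start + len(verified):].strip()
    new_rules = []
    if prefix:
        new_rules.append(("^" + re.escape(prefix) + r"\s*", ""))
    if suffix:
        new_rules.append((r"\s*" + re.escape(suffix) + "$", ""))
    return new_rules


if __name__ == "__main__":
    rules = []  # start with no cleanup rules

    # One verified record supplies the "ground truth" for a title field.
    raw = "<h1>Original Article: <i>Gene</i> therapy outcomes</h1>"
    cleaned = apply_rules(strip_html(raw), rules)
    rules.extend(learn_rules(cleaned, "Gene therapy outcomes"))

    # The learned rule now applies to a new article from the same journal.
    new_raw = "<h1>Original Article: Statin use in adults</h1>"
    print(apply_rules(strip_html(new_raw), rules))  # Statin use in adults
```

In this sketch, a rule is learned only when the verified value occurs verbatim in the cleaned text; cases where the two differ in other ways (for example, reordered author names) would need a richer comparison than simple substring matching.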