Naive Bayes Classifier for Extracting Bibliographic Information From Biomedical Online Articles

Kim J, Le DX, Thoma GR
Proc 2008 International Conference on Data Mining. Las Vegas, Nevada, USA. July 2008;II:373-8


A Naive Bayes classifier has been developed to extract grant numbers, a key piece of bibliographic information, from online, HTML-formatted biomedical articles for the National Library of Medicine’s MEDLINE database. Grant numbers identify research support from funding organizations and are part of MEDLINE citations. To train and test the classifier, 47,362 sentences were collected from articles cited in the MEDLINE database, and 4,721 words were identified as suitable features for classification. Experimental results are evaluated using three measures, Precision, Recall, and F-Measure, each of which exceeds 98.05%.
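
As a rough illustration of the approach summarized above, the following sketch trains a word-feature Naive Bayes classifier on labeled sentences and scores it with Precision, Recall, and F-Measure. It is not the authors' implementation: the abstract does not specify the model variant, so a multinomial model with Laplace smoothing and whitespace tokenization are assumed here, and the function names, labels ("grant", "other"), and toy sentences (including the placeholder grant identifiers) are invented for illustration.

```python
import math
from collections import Counter


def tokenize(sentence):
    # Assumption: lowercase whitespace tokenization stands in for feature selection.
    return sentence.lower().split()


def train(labeled_sentences, alpha=1.0):
    # Count sentences per class and word occurrences per class.
    class_counts = Counter()
    word_counts = {}
    vocab = set()
    for sentence, label in labeled_sentences:
        class_counts[label] += 1
        counts = word_counts.setdefault(label, Counter())
        for word in tokenize(sentence):
            counts[word] += 1
            vocab.add(word)
    return {"class_counts": class_counts, "word_counts": word_counts,
            "vocab": vocab, "alpha": alpha}


def predict(model, sentence):
    # Pick the class maximizing log P(class) + sum of log P(word | class).
    total = sum(model["class_counts"].values())
    vocab_size = len(model["vocab"])
    best_label, best_score = None, -math.inf
    for label, n in model["class_counts"].items():
        counts = model["word_counts"][label]
        denom = sum(counts.values()) + model["alpha"] * vocab_size
        score = math.log(n / total)
        for word in tokenize(sentence):
            if word in model["vocab"]:  # words never seen in training are skipped
                score += math.log((counts[word] + model["alpha"]) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def precision_recall_f(gold, predicted, positive):
    # Standard definitions with respect to one positive class.
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return precision, recall, f_measure


# Hypothetical toy training data; the grant identifiers are made-up placeholders.
TRAIN = [
    ("this work was supported by nih grant r01 lm000000", "grant"),
    ("supported by grant number ca000000 from the national cancer institute", "grant"),
    ("the patients were randomly assigned to two groups", "other"),
    ("results were analyzed using a standard statistical test", "other"),
]

if __name__ == "__main__":
    model = train(TRAIN)
    print(predict(model, "this study was supported by a grant"))
```

In a setting like the one described, the sentences labeled as grant-bearing would then be passed to a separate step that pulls out the grant number itself; that step is outside this sketch.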