Naive Bayes and SVM Classifiers For Classifying Databank Accession Number Sentences From Online Biomedical Articles

Kim J, Le DX, Thoma GR
IS&T/SPIE's 22nd Annual Symposium on Electronic Imaging. San Jose, CA. January 2010;7534:75340U-1 – 8


This paper describes two classifiers, Naive Bayes and Support Vector Machine (SVM), for identifying sentences that contain Databank Accession Numbers, a key piece of bibliographic information, in online biomedical articles. Correctly identifying these sentences is a prerequisite for extracting the numbers themselves. Both classifiers use the words that occur most frequently in sentences as classification features. Twelve sets of word features are collected to train and test the classifiers, with the number of word features per set ranging from 100 to 1,200. The performance of each classifier is evaluated using four measures: Precision, Recall, F-Measure, and Accuracy. The Naive Bayes classifier scores above 93.91% on all four measures at 200 word features. The SVM achieves 98.80% Precision at 200 word features, 94.90% Recall at 500 and 700, 96.46% F-Measure at 200, and 99.14% Accuracy at 200 and 400. To improve classification performance, we propose two merging operators, Max and Harmonic Mean, that combine the results of the two classifiers. The final results show a measurable improvement in Recall, F-Measure, and Accuracy.
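The four evaluation measures and the two merging operators named in the abstract can be illustrated with a minimal Python sketch. The abstract does not say which classifier outputs are merged or how the merged value is turned into a decision, so the sketch assumes that each classifier emits a positive-class score in [0, 1] and that a merged score is thresholded at 0.5. The function names, the threshold, and the sample values are hypothetical and are not taken from the paper.

```python
from typing import Dict, Sequence


def evaluate(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Compute Precision, Recall, F-Measure, and Accuracy for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F-Measure is the harmonic mean of Precision and Recall.
    f_measure = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "accuracy": accuracy,
    }


def merge_max(score_nb: float, score_svm: float) -> float:
    """Max operator: keep the larger of the two classifier scores."""
    return max(score_nb, score_svm)


def merge_harmonic_mean(score_nb: float, score_svm: float) -> float:
    """Harmonic Mean operator: 2ab / (a + b), defined as 0 when both scores are 0."""
    if score_nb + score_svm == 0:
        return 0.0
    return 2 * score_nb * score_svm / (score_nb + score_svm)


def classify(merged_score: float, threshold: float = 0.5) -> int:
    """Assumed decision rule (not from the paper): positive if the merged score meets the threshold."""
    return 1 if merged_score >= threshold else 0


if __name__ == "__main__":
    # Hypothetical scores for a single sentence from the two classifiers.
    nb_score, svm_score = 0.9, 0.3
    print(classify(merge_max(nb_score, svm_score)))
    print(classify(merge_harmonic_mean(nb_score, svm_score)))
    # Hypothetical gold labels and predictions for five sentences.
    print(evaluate([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))
```

Under these assumptions, Max accepts a sentence whenever either classifier is confident, which tends to favor Recall, while the Harmonic Mean is pulled toward the lower score and so requires the two classifiers to roughly agree.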