Medical Article Records System

The Medical Article Records System (MARS) is an automated system that extracts bibliographic text from journal articles in both paper and electronic form. For the approximately 560 journal titles that arrive at NLM in paper form, the system uses document scanning, optical character recognition (OCR), and rule-based and machine learning algorithms to extract citation data (title, author names, affiliation, abstract, etc.) that NLM’s indexers use to complete bibliographic records for MEDLINE®. The system uses Windows Communication Foundation (WCF) services and Web-based GUIs so that users can access it remotely.
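
As a rough illustration of the rule-based labeling step described above, the following Python sketch assigns a label (title, author, affiliation, abstract) to blocks of OCR text. It is a minimal sketch under stated assumptions, not the MARS implementation: the `Zone` structure, the cue words, and the `label_zones` and `looks_like_author_list` helpers are all hypothetical names introduced here for illustration.

```python
import re
from dataclasses import dataclass


@dataclass
class Zone:
    """One block of OCR text from a scanned page (hypothetical structure)."""
    text: str
    font_size: float  # point size reported by an assumed OCR step
    order: int        # top-to-bottom reading position on the page


# Illustrative cue words only; these are not the rules MARS uses.
AFFILIATION_CUES = re.compile(
    r"\b(University|Department|Institute|Hospital|School|Center|Library)\b",
    re.IGNORECASE,
)


def looks_like_author_list(text: str) -> bool:
    """Heuristic: comma-separated parts of 2 to 4 capitalized tokens each."""
    for part in text.split(","):
        tokens = part.split()
        if not 2 <= len(tokens) <= 4:
            return False
        if not all(tok[0].isupper() for tok in tokens):
            return False
    return True


def label_zones(zones: list[Zone]) -> dict[int, str]:
    """Map each zone's reading position to a label using simple rules."""
    labels: dict[int, str] = {}
    if not zones:
        return labels
    # Assumption: the title is the zone set in the largest font.
    title = max(zones, key=lambda z: z.font_size)
    for zone in zones:
        text = zone.text.strip()
        if zone is title:
            labels[zone.order] = "title"
        elif text.lower().startswith("abstract"):
            labels[zone.order] = "abstract"
        elif AFFILIATION_CUES.search(text):
            labels[zone.order] = "affiliation"
        elif looks_like_author_list(text):
            labels[zone.order] = "author"
        else:
            labels[zone.order] = "other"
    return labels


if __name__ == "__main__":
    page = [
        Zone("Greek Alphabet Recognition Technique for Biomedical Documents", 18.0, 0),
        Zone("Le DX, Straughan SR, Thoma GR", 11.0, 1),
        Zone("Department of Radiology, Example University Hospital", 9.0, 2),
        Zone("Abstract: This sketch labels zones.", 10.0, 3),
    ]
    for position, label in sorted(label_zones(page).items()):
        print(position, label)
```

Hand-written rules of this kind are easy to inspect but brittle across journal layouts, which is one reason a system like this would pair them with machine learning classifiers such as those described in the publications below.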

Publications/Tools

Zou J, Antani SK, Thoma GR. Localizing and Recognizing Labels for Multi-Panel Figures in Biomedical Journals. Proceedings of International Conference on Document Analysis and Recognition, November 13, 2017
Abstract

Kim J, Hong S, Thoma GR. Labeling Author Affiliations in Biomedical Articles Using Markov Model Classifiers. The 13th International Conference on Data Mining (DMIN2017), pp. 105-110, Las Vegas, USA, July 2017.
Abstract | PDF

Kim I, Thoma GR. Machine Learning with Selective Word Statistics for Automated Classification of Citation Subjectivity in Online Biomedical Articles. Proc. Int’l Conf. Artificial Intelligence (ICAI’17), pp. 201-207, Las Vegas, July 2017.
Abstract | PDF

Kim J, Thoma GR. Named Entity Recognition in Affiliations of Biomedical Articles Using Statistics and HMM Classifiers. The 2016 International Conference on Data Mining (DMIN2016), Las Vegas, USA, pp. 236-241, July, 2016.
Abstract | PDF

Kim J, Lobuglio PS, Thoma GR. Visualization of Statistics from MEDLINE. 2016 IEEE 29th International Symposium on Computer-Based Medical Systems (CBMS 2016), Dublin and Belfast, Ireland, pp. 290-291, June, 2016.
Abstract | PDF

Kim I, Thoma GR. Automated Classification of Author’s Sentiments in Citation Using Machine Learning Techniques: A Preliminary Study. Proc. the 2015 IEEE Conf. Computational Intelligence in Bioinformatics and Computational Biology (CIBCB 2015), Niagara Falls, Canada, Aug. 12-15, 2015.
Abstract | PDF

Kim J, Le DX, Thoma GR. Identification of Investigator Name Zones Using SVM Classifiers and Heuristic Rules. 12th International Conference on Document Analysis and Recognition (ICDAR). Washington D.C., August 2013.
Abstract | PDF

Kim I, Le DX, Thoma GR. Automated method for extracting “citation sentences” from online biomedical articles using SVM-based text summarization technique. Proc. the 2014 IEEE Int’l Conf. on Systems, Man, and Cybernetics (SMC 2014), pp. 2006-2011, San Diego, October, 2014
Abstract | URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6974213

Kim I, Le DX, Thoma GR. Automated identification of biomedical article type using support vector machines. Proc. 18th SPIE Document Recognition and Retrieval, 7874:787403 (1-9), San Francisco, January 2011.
Abstract | PDF

Kim I, Le DX, Thoma GR. Identifying “comment-on” citation data in online biomedical articles using SVM-based text summarization technique. Proc. Int’l Conf. Artificial Intelligence (ICAI’12), vol. 1, pp. 431-437, Las Vegas, July 2012.
Abstract | PDF

Kim J, Le DX, Thoma GR. Combining SVM Classifiers to Identify Investigator Name Zones in Biomedical Articles. IS&T/SPIE’s 22nd Annual Symposium on Electronic Imaging. San Francisco, CA, January 2012; 8297.
Abstract | PDF

Thoma GR, Ford G, Le DX, Li Z. Text Verification in an Automated System for the Extraction of Bibliographic Data. Proc. 5th International Workshop on Document Analysis Systems, Springer-Verlag: Berlin. 2002 Aug;: 423-32.
Abstract | PDF

Ford G, Thoma GR. Ground Truth Data for Document Image Analysis. Proceedings of 2003 Symposium on Document Image Understanding and Technology. 2003 April 9-11;: 199-205.
Abstract | PDF

Mao S, Kim J, Thoma G. A Dynamic Feature Generation System for Automated Metadata Extraction in Preservation of Digital Materials. Proc. International Workshop on Document Image Analysis for Libraries (DIAL2004). 2004 Jan;: 225-32.
Abstract | PDF

Zou J, Le DX, Thoma GR. Extracting a Sparsely-Located Named Entity from Online HTML Medical Articles Using Support Vector Machine. Proc SPIE-IS/T Electronic Imaging. San Jose, CA. January 2008;6815:6815OP(1-10)
Abstract

Le DX, Straughan SR, Thoma GR. Greek Alphabet Recognition Technique for Biomedical Documents. Proc. 6th World Multiconference on Systemics, Cybernetics and Informatics, eds: Callaos N, et al. 2002 July;III: 86-91.
Abstract | PDF

Zou J, Le DX, Thoma GR. Online Medical Journal Article Layout Analysis. Proc SPIE-IS&T Electronic Imaging 2007, SPIE Vol. 6500: 65000V (1-12)
Abstract | PDF

Hauser SE, Sabir TF, Thoma GR. OCR Correction Using Historical Relationships from Verified Text in Biomedical Citations. Proc. of 2003 Symposium on Document Image Understanding Technology. College Park MD: Institute for Advanced Computer Studies, University of Maryland. 2003 April;: 171-7.
Abstract | PDF

Mao S, Kim J, Thoma G. Style-Independent Document Labeling: Design and Performance Evaluation. Proc. SPIE – Document Recognition and Retrieval. 2004 Jan;: 14-22.
Abstract | PDF

Pearson G, Moon CW. Bridging Two Biomedical Journal Databases with XML – A Case Study. Proc. 14th IEEE Symposium on Computer-Based Medical Systems: IEEE Computer Society. 2001 Jul;:309-14.
Abstract | PDF

Zhang X, Zou J, Le DX, Thoma GR. Investigator Name Recognition From Medical Journal Articles: A Comparative Study of SVM and Structural SVM. International Workshop on Document Analysis Systems. June 2010:121-8
Abstract | PDF

Kim J, Le DX, Thoma GR. Naive Bayes Classifier for Extracting Bibliographic Information From Biomedical Online Articles. Proc 2008 International Conference on Data Mining. Las Vegas, Nevada, USA. July 2008;II:373-8
Abstract

Chen S, Mao S, Thoma GR. Simultaneous Layout Style and Logical Entity Recognition in a Heterogeneous Collection of Documents. Proc ICDAR2007. Curitiba, Brazil; September 2007, pp. 118-22
Abstract | PDF

Tran LQ, Moon CW, Le DX, Thoma GR. Web Page Downloading and Classification. Proc. 14th IEEE Symposium on Computer-Based Medical Systems: IEEE Computer Society. 2001 Jul;:321-6.
Abstract | PDF

Demner-Fushman D, Few B, Hauser SE, Thoma GR. Automatically Identifying Health Outcome Information in MEDLINE Records. J Am Med Inform Assoc. 2006 Jan-Feb;13(1):52-60. Epub 2005 Oct 12.
Abstract | PDF | PMID: 16221937 | PMCID: PMC1380197

Mao S, Kanungo T. Empirical Performance Evaluation Methodology and its Application to Page Segmentation Algorithms. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2001 Mar;23(3): 242-256.
Abstract | PDF

Mao S, Kanungo T. Automatic Training of Page Segmentation Algorithms: An Optimization Approach. International Conference on Pattern Recognition. 2000 Sept.;:531-534.
Abstract | PDF

Thoma GR, Ford G. Automated Data Entry System: Performance Issues. Proc. SPIE: Document Recognition and Retrieval IX. 2002 Jan;4670: 181-90.
Abstract | PDF

Hauser SE, Schlaifer J, Sabir TF, Demner-Fushman D, Thoma GR. Correcting OCR Text by Association with Historic Datasets. Proc. SPIE Electronic Imaging. 2003 Jan;5010: 84-93.
Abstract | PDF

Ford G, Hauser SE, Le DX, Thoma GR. Pattern Matching Techniques for Correcting Low Confidence OCR Words in a Known Context. Proc. SPIE, Document Recognition and Retrieval VIII. 2001 Jan;4307:241-9.
Abstract | PDF

Kim J, Le DX, Thoma GR. Automated Labeling of Bibliographic Data Extracted from Biomedical Online Journals. Proc. SPIE Electronic Imaging. 2003 Jan;5010: 47-56.
Abstract | PDF

Kim J, Le DX, Thoma GR. Automated Labeling in Document Images. Proc. SPIE, Document Recognition and Retrieval VIII. 2001 Jan;4307:111-22.
Abstract | PDF

Sabir TF, Hauser SE, Thoma GR. Historical Author Affiliations Assist Verification of Automatically Generated MEDLINE Citations. AMIA Annu Symp Proc. 2006:1082
Abstract | PDF | PMID: 17238701 | PMCID: PMC1839323

Zhang X, Zou J, Le DX, Thoma GR. A Semi-supervised Learning Method to Classify Grant Support Zone in Web-based Medical Articles. Proc SPIE Electronic Imaging Science and Technology, Document Recognition and Retrieval. January 2009;7247:7247 OW(1-8)
Abstract

Kim J, Le DX, Thoma GR. Naive Bayes and SVM Classifiers For Classifying Databank Accession Number Sentences From Online Biomedical Articles. IS&T/SPIE’s 22nd Annual Symposium on Electronic Imaging. San Jose, CA. January 2010;7534:75340U-1 – 8
Abstract

Mao S, Rosenfeld A, Kanungo T. Document Structure Analysis Algorithms: A Literature Survey. Proc. SPIE Electronic Imaging. 2003 Jan;5010:197-207.
Abstract | PDF

Hauser SE, Le DX, Thoma GR. Automated Zone Correction in Bitmapped Document Images. SPIE: Document Recognition and Retrieval VII. 2000 Jan;3976: 248-58.
Abstract | PDF

Lasko TA, Hauser SE. Approximate String Matching Algorithms for Limited-Vocabulary OCR Output Correction. Proc. SPIE, Document Recognition and Retrieval VIII. 2001 Jan;4307:232-40.
Abstract | PDF

Kim J, Le DX, Thoma GR. Inferring Grant Support Types From Online Biomedical Articles. 22nd IEEE ISCBMS. Albuquerque, NM. August 2009
Abstract | PDF

Zou J, Le DX, Thoma GR. Combining DOM Tree and Geometric Layout Analysis for Online Medical Journal Article Segmentation. Proc JCDL, June 2006, Chapel Hill, NC; 119-28
Abstract | PDF

Zhang X, Zou J, Le DX, Thoma GR. A Stacked Sequential Learning Method For Investigator Name Recognition From Web-based Medical Articles. 17th Document Recognition and Retrieval Conference (SPIE-DR&R). San Jose, CA. January 2010;7534:753404-7
Abstract | PDF

Kanungo T, Mao S. Stochastic Language Model for Style-Directed Physical Layout Analysis of Documents. IEEE Transactions on Image Processing. 2003 May;12(5):583-596.
Abstract | PDF

Zou J, Le DX, Thoma GR. Structure and Content Analysis for HTML Medical Articles: A Hidden Markov Model Approach. Proc August 2007 ACM Symposium on Document Engineering. pp. 199-201
Abstract | PDF

Mao S, Kanungo T. Empirical Performance Evaluation of Page Segmentation Algorithms. SPIE conference on Document Recognition and Retrieval. 2000 Jan.;:303-314.
Abstract | PDF

Le DX, Tran LQ, Chow J, Kim J, Hauser SE, Moon CW, Thoma GR. Automated Medical Citation Records Creation for Web-Based Online Journals. Proc. 14th IEEE Symposium on Computer-Based Medical Systems: IEEE Computer Society. 2001.
Abstract | PDF

Mao S, Kim J, Le DX, Thoma GR. Generating Robust Features for Style-Independent Labeling of Bibliographic Fields in Medical Journal Articles. Proc. 7th World Multiconference on Systemics, Cybernetics and Informatics. 2003 July;III:53-6.
Abstract | PDF

Zou J, Le DX, Thoma GR. Locating and parsing bibliographic references in HTML medical articles. Int J Doc Anal Recognit. 2010 Jun 1;13(2):107-119.
Abstract | PDF | PMID: 20640222 | PMCID: PMC2903768

Kim J, Le DX, Thoma GR. Automated Labeling Algorithms for Biomedical Document Images. Proc. 7th World Multiconference on Systemics, Cybernetics and Informatics. 2003 July;V: 352-57.
Abstract | PDF

Mao S, Nie L, Thoma GR. Unsupervised Style Classification of Document Page Images. Proc IEEE International Conference on Image Processing, September 2005, Genova, Italy; Vol. II: 510-13
Abstract | PDF

Kim J, Le DX, Thoma GR. Automatic Extraction of Bibliographic Information from Biomedical Online Journal Articles Using a String Matching Algorithm. Proc IEEE CBMS, June 2006, Salt Lake City, Utah; 905-10
Abstract | PDF

Le DX, Thoma GR. Page Layout Classification Technique for Biomedical Documents. Proc. World Multiconference on Systems, Cybernetics and Informatics (SCI). 2000 Jul.;X: 348-52.
Abstract | PDF

Kim IC, Le DX, Thoma GR. Identification of “comment-on sentences” in online biomedical documents using support vector machines. Proc. SPIE conference on Document Recognition and Retrieval, 6500:65000O (1-8), San Jose, January 2007.
Abstract | PDF

Kim IC, Le DX, Thoma GR. Hybrid approach combining contextual and statistical information for identifying MEDLINE citation terms. Proc. SPIE-IS/T Electronic Imaging. San Jose, CA. January 2008;6815:68150P(1-9)
Abstract | PDF

Kim J, Le DX, Thoma GR. Automated Labeling Of Biomedical Online Journal Articles. In: Callaos N, Lesso W, editors. SCI 2005. Proc 9th World Multiconference on Systemics, Cybernetics and Informatics; 2005 Jul 10-13; Vol. 4; Orlando (FL): International Institute of Informatics and Systemics; c2005. 406-11
Abstract | PDF

Mao S, Kanungo T. Software Architecture of PSET: A Page Segmentation Evaluation Toolkit. International Journal on Document Analysis and Recognition. 2002 Mar;4(3):205-217.
Abstract | PDF

Le DX, Thoma GR. Automated Document Labeling for Web-Based Online Medical Journals. Proc. 7th World Multiconference on Systemics, Cybernetics and Informatics. 2003 July;II: 411-15.
Abstract | PDF

Kim I, Le DX, Thoma GR. Automated Cleanup Processing for Extracting Bibliographic Data from Biomedical Online Journals. In: Callaos N, Lesso W, editors. SCI 2005. Proc. 9th World Multiconference on Systemics, Cybernetics and Informatics; 2005 Jul 10-13; Vol. 4; Orlando (FL): International Institute of Informatics and Systemics; c2005. 401-5
Abstract | PDF

Thoma GR, Le DX, Kim I, Kim JW, Moon C, Tran L, Zou J. Automation to Accelerate the Production of MEDLINE. April 2008 Technical Report to the LHNCBC Board of Scientific Counselors.
Abstract | PDF