RCDC Text Mining and Indexing for Categorization
Brown Bag Lecture by Richard Ikeda, Ph.D. and Jake Scholl, M.S. | 5/1/2018, 11:00 AM – 12:00 PM | 7th Floor Conference Room, Bldg 38A
The Research, Condition, and Disease Categorization (RCDC) system was developed to meet a requirement of the NIH Reform Act of 2006 that research activities be coded uniformly across all NIH Institutes and Centers (ICs). RCDC automates text mining, indexing, and categorization, and relies on a centralized, regularly curated biomedical thesaurus. Text is mined from each project's title, abstract, public health relevance statement, and specific aims. It is normalized using Natural Language Processing (NLP) logic, matched to existing terms in the RCDC thesaurus, and weighted by the relative frequency of each term in the project, producing an individual project index. This index is then compared to the expert-vetted Category Fingerprint (also known as the Category Definition) to assign the project a numerical match score, which determines the most scientifically relevant project listing for a category.
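The indexing and scoring steps described above can be illustrated with a minimal sketch. This is not the RCDC implementation: the thesaurus entries, the lowercase-and-strip-punctuation normalization, the longest-phrase matching, and the weighted-overlap score are all assumptions chosen to keep the example small, and the names THESAURUS, project_index, and match_score are hypothetical.

```python
import re
from collections import Counter

# Hypothetical toy thesaurus: surface phrase -> preferred concept.
# The real RCDC thesaurus is far larger and expert-curated.
THESAURUS = {
    "honeybee": "Bees",
    "honeybees": "Bees",
    "foraging behavior": "Foraging",
    "foraging": "Foraging",
    "physiology": "Physiology",
}

def normalize(text):
    """Lowercase and drop punctuation; a stand-in for real NLP normalization."""
    return re.sub(r"[^a-z0-9\s]", " ", text.lower()).split()

def project_index(text, max_phrase_len=2):
    """Match tokens to thesaurus concepts and weight by relative frequency."""
    tokens = normalize(text)
    counts = Counter()
    i = 0
    while i < len(tokens):
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(max_phrase_len, 0, -1):
            window = tokens[i:i + n]
            phrase = " ".join(window)
            if len(window) == n and phrase in THESAURUS:
                counts[THESAURUS[phrase]] += 1
                i += n
                break
        else:
            i += 1  # no thesaurus match starting at this token
    total = sum(counts.values())
    return {concept: k / total for concept, k in counts.items()} if total else {}

def match_score(index, fingerprint):
    """Assumed scoring rule: sum of project weight times fingerprint weight."""
    return sum(w * fingerprint.get(concept, 0.0) for concept, w in index.items())

# Usage with made-up project text and a made-up category fingerprint.
text = "Honeybee foraging behavior and physiology: how honeybees forage."
index = project_index(text)
fingerprint = {"Bees": 1.0, "Foraging": 0.5}
score = match_score(index, fingerprint)
print(index, score)
```

In this toy setup, a project would be listed under a category when its score clears some threshold; how RCDC sets such cutoffs is not stated in the abstract.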
Dr. Ikeda is Director of the NIH/OD Office of Research Information Services (ORIS). His current roles include Co-chair of the Data Stewardship Sub-Committee of the Federal Demonstration Partnership’s Compliance Committee, board member of the international Open Researcher and Contributor ID (ORCID) organization, and NIH representative and coordinator for implementing the Federal-wide DATA Act. He received a Ph.D. in Chemistry from the California Institute of Technology in 1984.
Mr. Scholl is a Scientific Information Analyst at ORIS, where he works with subject matter experts across NIH to develop and maintain research categories before they are reported. He received an M.S. in Zoology from Colorado State University, where he examined the intersection of behavior and physiology using the honeybee as a model. Before joining NIH, he worked at NSF.