Simplified Algorithm for Biomedical Abbreviation Identification

Identifying Abbreviation Definitions in Biomedical Text • Ariel Schwartz • Marti Hearst

The Problem • The volume of biomedical text is growing rapidly. New abbreviations are introduced frequently. • Manual abbreviation dictionaries are out of date. • The goal is a simple, fast and accurate algorithm to identify abbreviations and their definitions in biomedical text. • We use this algorithm as one of many preprocessing steps applied to biomedical texts, so that meaningful information can be extracted from them.

Abbreviation Examples • “Heat-shock protein 40 (Hsp40) enables Hsp70 to play critical roles in a number of cellular processes, such as protein folding, assembly, degradation and translocation in vivo.” • “Glutathione S-transferase pull-down experiments showed the direct interaction of in vitro translated p110, p64, and p58 of the essential CBF3 kinetochore protein complex with Cbf1p, a basic region helix-loop-helix zipper protein (bHLHzip) that specifically binds to the CDEI region on the centromere DNA.” • “Hpa2 is a member of the Gcn5-related N-acetyltransferase (GNAT) superfamily, a family of enzymes with diverse substrates including histones, other proteins, arylalkylamines and aminoglycosides.”

Related Work • Pustejovsky et al. present a solution based on hand-built regular expressions and syntactic information. Achieved 72% recall at 98% precision. • Chang et al. use linear regression on a pre-selected set of features. Achieved 83% recall at 80%* precision, and 75% recall at 95% precision. • Park and Byrd present a rule-based algorithm for extraction of abbreviation definitions in general text. • Yoshida et al. present an approach close to ours, trying to first match characters on word and syllable boundaries. * Counting partial matches and abbreviations missing from the “gold-standard”, their algorithm achieved 83% recall at 98% precision.

The Algorithm • Much simpler than other approaches. • Extracts abbreviation-definition candidates adjacent to parentheses. • Finds correct definitions by matching characters in the abbreviation to characters in the definition, starting from the right. • The first character in the abbreviation must match a character at the beginning of a word in the definition. • To increase precision, a few simple heuristics are applied to eliminate incorrect pairs. • Example: Heat shock transcription factor (HSF). • The algorithm finds the correct definition, but not the correct alignment: [H]eat shock tran[s]cription [f]actor (matched characters in brackets; scanning from the right, the S is matched to the “s” in “transcription” rather than the one in “shock”).
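The matching step described on this slide can be sketched in Python roughly as follows. This is a minimal illustration, not the authors' implementation: the function names `match_long_form` and `find_pairs` are hypothetical, the candidate window of `min(n + 5, 2 * n)` words before the parenthesis is an assumption that the slides do not state, and the precision heuristics mentioned above are left out because the slides do not spell them out.

```python
import re
from typing import List, Optional, Tuple


def match_long_form(short_form: str, candidate: str) -> Optional[Tuple[str, List[int]]]:
    """Match abbreviation characters against the candidate text, right to left.

    Returns (long_form, matched_positions) or None if some character
    cannot be matched. Positions are indices into the returned long form.
    """
    s = len(short_form) - 1
    l = len(candidate) - 1
    matched = []
    while s >= 0:
        ch = short_form[s].lower()
        if not ch.isalnum():
            s -= 1  # skip punctuation inside the abbreviation
            continue
        # Move left until the character matches; the first abbreviation
        # character must additionally sit at the beginning of a word.
        while (l >= 0 and candidate[l].lower() != ch) or (
            s == 0 and l > 0 and candidate[l - 1].isalnum()
        ):
            l -= 1
        if l < 0:
            return None
        matched.append(l)
        l -= 1
        s -= 1
    # Extend the definition back to the start of the word that was matched last.
    start = candidate.rfind(" ", 0, l + 1) + 1
    return candidate[start:], [i - start for i in reversed(matched)]


def find_pairs(sentence: str, max_extra_words: int = 5) -> List[Tuple[str, str]]:
    """Collect (abbreviation, definition) pairs of the form 'definition (ABBR)'."""
    pairs = []
    for m in re.finditer(r"\(([^()]+)\)", sentence):
        short_form = m.group(1).strip()
        if not short_form:
            continue
        words = sentence[: m.start()].split()
        n = len(short_form)
        # Assumed window size: at most min(n + 5, 2n) words before the parenthesis.
        window = min(n + max_extra_words, 2 * n)
        candidate = " ".join(words[-window:])
        result = match_long_form(short_form, candidate)
        if result is not None:
            pairs.append((short_form, result[0]))
    return pairs
```

Calling `match_long_form("HSF", "Heat shock transcription factor")` returns the full phrase together with the positions `[0, 15, 25]`, that is, the "H" of "Heat", the "s" of "transcription" and the "f" of "factor". This reproduces the slide's point that the definition is right even though the alignment is not the intended one.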

Results • On the “gold-standard” the algorithm achieved 83% recall at 96% precision.* • On a larger test collection the results were 90% recall at 95% precision. • An alternative algorithm, based on a modification of the Park and Byrd algorithm using decision lists, achieved only slightly better results: 83% recall at 97% precision, and 90% recall at 96% precision. • These results show that a very simple algorithm produces results comparable to those of the existing, more complex algorithms. * Counting partial matches and abbreviations missing from the “gold-standard”, our algorithm achieved 83% recall at 99% precision.
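For readers unfamiliar with the two measures, the sketch below shows how recall and precision could be computed from a set of extracted pairs and a gold-standard set. The exact-match comparison of (abbreviation, definition) pairs and the function name `precision_recall` are assumptions for illustration; the slides do not describe the scoring procedure in detail, and the footnote indicates that partial matches can be counted differently.

```python
from typing import Iterable, Tuple

Pair = Tuple[str, str]  # (abbreviation, definition)


def precision_recall(predicted: Iterable[Pair], gold: Iterable[Pair]) -> Tuple[float, float]:
    """Exact-match precision and recall over (abbreviation, definition) pairs."""
    predicted_set, gold_set = set(predicted), set(gold)
    correct = len(predicted_set & gold_set)
    # Precision: fraction of extracted pairs that are in the gold standard.
    precision = correct / len(predicted_set) if predicted_set else 0.0
    # Recall: fraction of gold-standard pairs that were extracted.
    recall = correct / len(gold_set) if gold_set else 0.0
    return precision, recall
```

With three extracted pairs of which two appear in a four-pair gold standard, this gives a precision of 2/3 and a recall of 1/2.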
