Simple Algorithm for Biomedical Abbreviation Definitions Identification
An algorithm to identify abbreviation definitions in biomedical text using a heuristic approach, evaluating performance through precision and recall on MEDLINE abstracts. Java code implementation for finding long forms. Discussion covers missed pairs and tradeoffs.
Presentation Transcript
A Simple Algorithm for Identifying Abbreviation Definitions in Biomedical Text A. S. Schwartz & M. A. Hearst UC Berkeley Presented by Jing Jiang
The Problem – to Identify Acronyms • To identify <“short form”, “long form”> pairs from biomedical text: • Short form is abbreviation of long form • There exists character mapping from short form to long form • Example: • Gcn5-related N-acetyltransferase (GNAT) • A non-trivial problem: • Words in long form may be skipped • Internal letters in long form may be used
Previous Work • Machine learning approach • Linear regression (Chang et al.) • Encoding and compression (Yeates et al.) • Heuristic approach • Rule-based • Factors considered include: • Distance between definition and abbreviation • Number of stop words • Capitalization
Step 1: Identifying Candidates • Consider only two cases: • long form ‘(’ short form ‘)’ • short form ‘(’ long form ‘)’ • Short form: • No more than 2 words • Between 2 and 10 chars • At least one letter • First char alphanumeric • Long form: • Adjacent to short form • No more than min(|A| + 5, |A| * 2) words, where |A| is the number of characters in the short form
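The Step 1 constraints can be sketched as two small checks. This is a minimal illustration, not the authors' code: the class and method names are hypothetical, and it assumes the 2-to-10 character limit is measured on the trimmed short form, including any internal space.

```java
class CandidateRules {
    /** Short form: at most 2 words, 2 to 10 characters, at least one letter, first char alphanumeric. */
    static boolean isValidShortForm(String s) {
        String t = s.trim();
        // Assumption: length is counted on the trimmed string, spaces included.
        if (t.length() < 2 || t.length() > 10) return false;
        if (t.split("\\s+").length > 2) return false;
        if (!Character.isLetterOrDigit(t.charAt(0))) return false;
        return t.chars().anyMatch(Character::isLetter);
    }

    /** Upper bound on long-form length in words: min(|A| + 5, |A| * 2), |A| = short-form length in chars. */
    static int maxLongFormWords(String shortForm) {
        int a = shortForm.trim().length();
        return Math.min(a + 5, a * 2);
    }

    public static void main(String[] args) {
        System.out.println(isValidShortForm("GNAT"));  // true
        System.out.println(isValidShortForm("12"));    // false: no letter
        System.out.println(maxLongFormWords("GNAT"));  // 8
    }
}
```

For a four-character short form such as GNAT, the bound works out to min(9, 8) = 8 words, so only the eight words next to the parenthesis need to be considered as a candidate long form.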
Step 2: Identifying Correct Long Forms • Scanning from right to left, find the shortest long form that matches the short form: • Each character in the short form must match a character in the long form, in the same order • The first character of the short form must match the initial character of the first word in the long form
Java Code for Finding the Best Long Form for a Given Short Form
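A minimal sketch of the Step 2 matching, written from the rules stated in Step 2 rather than copied from the slide. The class name `LongFormFinder` is hypothetical, and the candidate long form is assumed to be the text next to the parenthesis, already trimmed to the Step 1 word limit.

```java
class LongFormFinder {
    /**
     * Scans both strings right to left. Returns the shortest suffix of
     * longForm that matches shortForm, or null when no match exists.
     */
    static String findBestLongForm(String shortForm, String longForm) {
        int lIndex = longForm.length() - 1;
        for (int sIndex = shortForm.length() - 1; sIndex >= 0; sIndex--) {
            char curr = Character.toLowerCase(shortForm.charAt(sIndex));
            // Non-alphanumeric short-form characters (e.g. '-') need no match.
            if (!Character.isLetterOrDigit(curr)) continue;
            // Move left until curr is found; the first short-form character
            // must additionally sit at the start of a word.
            while ((lIndex >= 0 && Character.toLowerCase(longForm.charAt(lIndex)) != curr)
                    || (sIndex == 0 && lIndex > 0
                        && Character.isLetterOrDigit(longForm.charAt(lIndex - 1)))) {
                lIndex--;
            }
            if (lIndex < 0) return null;  // ran out of long-form text
            lIndex--;
        }
        // Extend left to the beginning of the word holding the first match.
        lIndex = longForm.lastIndexOf(' ', lIndex) + 1;
        return longForm.substring(lIndex);
    }

    public static void main(String[] args) {
        // prints "Gcn5-related N-acetyltransferase"
        System.out.println(findBestLongForm("GNAT", "the Gcn5-related N-acetyltransferase"));
        // prints "null": no character mapping exists
        System.out.println(findBestLongForm("5-HT", "serotonin"));
    }
}
```

Because the scan only ever moves left, pairs whose letters appear out of order, such as <ATN, anterior thalamus>, also come back as null.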
Evaluation • 1000 randomly selected MEDLINE abstracts • 82% recall, 95% precision • Medstract Gold Standard Evaluation Corpus • 82% recall, 96% precision • Compared with • 83% recall, 80% precision (Chang et al., linear regression) • 72% recall, 98% precision (Pustejovsky et al., heuristics)
Missing Pairs • Skipped characters in short form • <CNS1, cyclophilin seven suppressor> • No match • <5-HT, serotonin> • Out of order • <ATN, anterior thalamus> • Partial match • <Pol I, RNA polymerase I>
Discussion • Pros: • Simple method • Decent performance • Questions: • Tradeoff between complexity of rules and performance • Generality of the heuristic rules • Heuristics vs. machine learning
Mining MEDLINE for Implicit Links between Dietary Substances and Diseases P. Srinivasan & B. Libbus U. Iowa Presented by Jing Jiang
The Goal – to Discover Implicit Links between Topics • Open discovery • Start from topic A • Navigate through intermediate topics B1, B2, etc. • Reach terminal topics C1, C2, etc. • Closed discovery • Start from topics A and C • Find connections B1, B2, etc.
Terminology • Topic Profile: a set of terms that are highly related to the topic, together with weights assigned to each term • MeSH: Medical Subject Heading • UMLS types: Unified Medical Language System semantic types
Open Discovery Algorithm • Input: • Topic A • Two sets of UMLS types ST-B & ST-C • Threshold M • Output: • Terms related to A and of some type in ST-C
Open Discovery Algorithm (cont.) • Build topic A’s profile AP • For each type in ST-B, select M top terms B1, B2, etc. from AP • Build Bi’s profiles BPi • Build combined profile CP from BPs limited to types in ST-C • Remove terms directly linked to A from CP
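The open discovery steps can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: `Term`, `discover` and `toyProfiles` are hypothetical names, profile building is abstracted as a function from a topic to its weighted terms, "directly linked to A" is read as "appears in AP", and weights in the combined profile are summed across the B profiles because the slide does not say how they are combined. The toy data is invented purely to exercise the code.

```java
import java.util.*;
import java.util.function.Function;

class OpenDiscoverySketch {
    /** A profile entry: term name, UMLS semantic type, weight (hypothetical shape). */
    static final class Term {
        final String name; final String type; final double weight;
        Term(String name, String type, double weight) {
            this.name = name; this.type = type; this.weight = weight;
        }
    }

    static Map<String, Double> discover(String topicA, Function<String, List<Term>> profileOf,
                                        Set<String> stB, Set<String> stC, int m) {
        List<Term> ap = profileOf.apply(topicA);            // build AP
        Set<String> linkedToA = new HashSet<>();
        for (Term t : ap) linkedToA.add(t.name);
        Map<String, Double> cp = new HashMap<>();           // combined profile CP
        for (String type : stB) {
            ap.stream()
              .filter(t -> t.type.equals(type))
              .sorted((x, y) -> Double.compare(y.weight, x.weight))
              .limit(m)                                     // top M B terms for this type
              .forEach(b -> {
                  for (Term c : profileOf.apply(b.name)) {  // build BPi
                      boolean keep = stC.contains(c.type)
                              && !linkedToA.contains(c.name)
                              && !c.name.equals(topicA);
                      // Assumption: weights from different B profiles are summed.
                      if (keep) cp.merge(c.name, c.weight, Double::sum);
                  }
              });
        }
        return cp;
    }

    /** Invented toy profiles, for demonstration only. */
    static Function<String, List<Term>> toyProfiles() {
        Map<String, List<Term>> p = new HashMap<>();
        p.put("A", Arrays.asList(new Term("B1", "Enzyme", 0.9),
                new Term("B2", "Enzyme", 0.5), new Term("C0", "Disease", 0.3)));
        p.put("B1", Arrays.asList(new Term("C1", "Disease", 0.5), new Term("C0", "Disease", 0.7)));
        p.put("B2", Arrays.asList(new Term("C1", "Disease", 0.25), new Term("C2", "Disease", 0.4)));
        return topic -> p.getOrDefault(topic, Collections.emptyList());
    }

    public static void main(String[] args) {
        Map<String, Double> cp = discover("A", toyProfiles(),
                Collections.singleton("Enzyme"), Collections.singleton("Disease"), 2);
        System.out.println(cp);  // C0 is absent because it is already linked to A
    }
}
```

In the toy data, C0 is reachable through B1 but is dropped because it already appears in A's own profile, which is the point of the final filtering step.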
Building Profile for Topic A • Search PubMed for A • Extract MeSH terms from relevant documents • Compute TF * IDF • TF: # occurrences of the term in retrieved document set • IDF: log(N/TF) • N: # retrieved documents • Normalize the weight vector
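The weighting on this slide can be sketched as below, following the definitions exactly as given: TF is the term's count over the retrieved set, IDF is log(N/TF), and the weight is TF * IDF. Several things are assumptions: the names `TopicProfileSketch` and `buildProfile`, the input shape (one list of MeSH terms per retrieved document, each term at most once per document, so TF never exceeds N), and the use of the Euclidean norm, since the slide does not say which normalization is applied.

```java
import java.util.*;

class TopicProfileSketch {
    /** docs: one list of MeSH terms per retrieved document (assumed input shape). */
    static Map<String, Double> buildProfile(List<List<String>> docs) {
        int n = docs.size();                         // N: number of retrieved documents
        Map<String, Integer> tf = new HashMap<>();   // TF: occurrences over the whole set
        for (List<String> doc : docs) {
            for (String term : doc) tf.merge(term, 1, Integer::sum);
        }
        Map<String, Double> weights = new HashMap<>();
        double sumSquares = 0.0;
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            // Weight as defined on the slide: TF * log(N / TF).
            double w = e.getValue() * Math.log((double) n / e.getValue());
            weights.put(e.getKey(), w);
            sumSquares += w * w;
        }
        // Assumption: normalize to unit Euclidean length; skip if every weight is zero.
        final double norm = Math.sqrt(sumSquares);
        if (norm > 0) weights.replaceAll((term, w) -> w / norm);
        return weights;
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
                Arrays.asList("x", "y"), Arrays.asList("x", "z"),
                Arrays.asList("x", "y"), Arrays.asList("x"));
        System.out.println(buildProfile(docs));
    }
}
```

Under this definition, a term that appears in every retrieved document gets weight TF * log(1) = 0, so the terms most characteristic of the whole retrieved set drop out of the profile.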
Testing with Turmeric • Topic A: Turmeric • ST-B: • Gene or Genome • Enzyme • Amino Acid, Peptide or Protein • ST-C: • Body Part, Organ or Organ Component • Disease or Syndrome • Neoplastic Process • M: 5, 10, 15
Results • B terms: • 37% recall, 38% precision (compared with manually identified terms) • C terms: • 67% recall, 67% precision (compared with manual results)
Discussion • Pros: • Simple method • Domain knowledge (MeSH terms, UMLS types) to shape search direction • Questions: • TF & IDF? • Longer path? • What relationships? • Co-occurrence = link?