Biomedical Named Entity Recognition Ramakanth Kavuluru NLP Seminar – 8/21/2012
What are named entities? • The benefits of taking cholesterol lowering statin drugs outweigh the risks even among people who are likely to develop diabetes. • Acute exposure to resveratrol inhibits AMPK activity in human skeletal muscle cells
What are named entities? • Entity types marked in the two example sentences: Drug, Biologically Active Substance, Disorder, Enzyme, Organic Chemical, Cell
What are named entities? • Further labels shown for the same sentences: Drug, Cholesterol lowering drugs, Biological Function
Why do we need to extract them? • To provide effective semantic search • Clinical trial recruitment: find all discharge summaries of patients that have a history of diabetes and obesity and have taken statins as part of their treatment. • Literature review: find all biomedical articles that discuss the dopamine neurotransmitter in the context of depressive disorders.
Why do we need to extract them? • To use as features in machine learning for effective text classification • To build semantic clusters of textual documents to understand evolving themes • To reduce noise by avoiding keywords that are not indicative of the classes or clusters • More recently, as a first step in relation extraction and hence in knowledge discovery
A major task in text mining • Extract information from textual data • Use this information to solve problems • What type of information? • Relevant concepts – a medical condition or finding, a drug, a gene or protein, an emotion (hope, love, …) • Relevant (binary) relations – drug TREATS a condition, protein CAUSES a disease • What are the typical questions? • Does a pathology report indicate a reportable case? • Which patients satisfy the criteria for a clinical trial?
Knowledge Discovery • VIP Peptide – increases – Catecholamine Biosynthesis (in cattle) • Catecholamines – induce – β-adrenergic receptor activity (in rats) • β-adrenergic receptors – are involved in – fear conditioning (in humans) • VIP Peptide – affects – fear conditioning?
Linguistic Variation • Derivational variation: cranial, cranium • Inflectional variation: coughed, coughing • Synonymy • neurofibromin 2, merlin, NF2 protein, and schwannomin • Addison’s disease, adrenal insufficiency, hypocortisolism, bronzed disease • Feeding problems in newborn – The mother said she was having trouble feeding the baby.
Polysemy • Merlin – both a bird and a protein in UMLS • Discharge • Patient was prescribed codeine upon discharge • The discharge was yellow and purulent • Abbreviations • APC: activated protein C, adenomatosis polyposis coli, antigen presenting cell, aerobic plate count, advanced pancreatic cancer, age period cohort, antibody producing cells, atrial premature complex
Negation • Nearly half of all clinical concepts in dictated narratives are negated • There is no maxillary sinus tenderness • Implied absence without negation • Lungs are clear upon auscultation, so: • Rales: Absent • Rhonchi: Absent • Wheezing: Absent
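The explicit-negation case above can be sketched with a simple trigger-word rule. This is a minimal illustration, not the method of any particular tool: the trigger list, the window size, and the function name `is_negated` are all assumptions made for this example.

```python
import re

# Hypothetical trigger list and window size, chosen only for illustration.
NEGATION_TRIGGERS = {"no", "not", "without", "denies"}
WINDOW = 5  # number of tokens after a trigger that are treated as negated


def is_negated(sentence: str, concept: str) -> bool:
    """Return True if `concept` starts within WINDOW tokens after a trigger."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    concept_tokens = re.findall(r"[a-z]+", concept.lower())
    n = len(concept_tokens)
    if n == 0:
        return False
    for i, tok in enumerate(tokens):
        if tok not in NEGATION_TRIGGERS:
            continue
        # Look for the concept starting in the window that follows the trigger.
        for start in range(i + 1, min(i + 1 + WINDOW, len(tokens))):
            if tokens[start:start + n] == concept_tokens:
                return True
    return False


explicit = is_negated("There is no maxillary sinus tenderness",
                      "maxillary sinus tenderness")
implied = is_negated("Lungs are clear upon auscultation", "rales")
```

A rule like this catches the explicit case but cannot see the implied absence in "Lungs are clear upon auscultation", because neither a trigger word nor the absent finding appears in the text.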
Controlled Terminologies • Controlled vocabularies or taxonomies • Gene Ontology (gene products) • most cited, 450 per year in PubMed • Total of 33000+ terms • SNOMED CT (300K+ concepts) • NCI Thesaurus, ICD-9/10, ICD-O-3, LOINC, MedlinePlus • UMLS Metathesaurus (integration of 140+ vocabularies) • 2.3 million concepts
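More on the Metathesaurus • CUIs (concept unique identifiers) • LUIs (lexical, or term, unique identifiers) • SUIs (string unique identifiers) • AUIs (atom unique identifiers)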
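The four identifier levels nest: a concept (CUI) groups terms (LUI), a term groups strings that are lexical variants of one another (SUI), and each occurrence of a string in one source vocabulary is an atom (AUI). The sketch below models that nesting. The class names, identifier values, and source names are placeholders invented for this example, not real Metathesaurus data.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Atom:
    aui: str      # atom: one string as it occurs in one source vocabulary
    source: str


@dataclass
class StringForm:
    sui: str      # string: one exact spelling
    text: str
    atoms: List[Atom] = field(default_factory=list)


@dataclass
class Term:
    lui: str      # term: strings that are lexical variants of each other
    strings: List[StringForm] = field(default_factory=list)


@dataclass
class Concept:
    cui: str      # concept: all terms that share one meaning
    terms: List[Term] = field(default_factory=list)

    def all_strings(self) -> List[str]:
        return [s.text for t in self.terms for s in t.strings]


# Placeholder identifiers and source names, for illustration only.
example = Concept("CUI_1", [
    Term("LUI_1", [
        StringForm("SUI_1", "Headache", [Atom("AUI_1", "SOURCE_A"),
                                         Atom("AUI_2", "SOURCE_B")]),
        StringForm("SUI_2", "Headaches", [Atom("AUI_3", "SOURCE_A")]),
    ]),
    Term("LUI_2", [
        StringForm("SUI_3", "Cephalalgia", [Atom("AUI_4", "SOURCE_B")]),
    ]),
])
```

Here "Headache" and "Headaches" share a term because they differ only by inflection, while the synonym "Cephalalgia" gets its own term under the same concept.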
Semantic Types and Relations • NLM Semantic Network, the type system behind the UMLS Metathesaurus • Semantic Types (135) • Semantic Groups (15) • Semantic Relations (54) • SPECIALIST Lexicon • Malaria, malarial • Hyperplasia, hyperplastic • How do we extract named entities?
MetaMap from NLM • Identify phrases: use the SPECIALIST parser • Map to CUIs: use the SPECIALIST Lexicon, Metathesaurus, and Semantic Network
Output of syntactic analysis • Syntactic analysis – “ocular complications of myasthenia gravis” • Ocular (adj), complications (noun), of (prep), myasthenia (noun), gravis (noun) • Gives noun phrases (NP): “Ocular complications” and “Myasthenia gravis” • Prepositions are ignored • In a given NP, you have a head and modifiers: • Ocular (mod) and complications (head) • How about “male pattern baldness”?
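The chunking step above can be sketched as follows. This is a toy illustration, not the SPECIALIST parser: the part-of-speech tags are supplied by hand, and the rule that the last word of a noun phrase is its head is a simplifying assumption of this example.

```python
from typing import List, Tuple

Tagged = Tuple[str, str]  # (word, coarse part-of-speech tag)


def noun_phrases(tagged: List[Tagged]) -> List[List[str]]:
    """Split a tagged token sequence into chunks, dropping prepositions."""
    phrases: List[List[str]] = []
    current: List[str] = []
    for word, tag in tagged:
        if tag == "prep":
            if current:
                phrases.append(current)
            current = []
        else:
            current.append(word)
    if current:
        phrases.append(current)
    return phrases


def head_and_modifiers(phrase: List[str]) -> Tuple[str, List[str]]:
    """Assumption: the last word is the head, the words before it are modifiers."""
    return phrase[-1], phrase[:-1]


tagged_input = [("ocular", "adj"), ("complications", "noun"), ("of", "prep"),
                ("myasthenia", "noun"), ("gravis", "noun")]
chunks = noun_phrases(tagged_input)
head, mods = head_and_modifiers(chunks[0])
```

For “male pattern baldness” the same heuristic picks “baldness” as the head, but it cannot say whether “male” modifies “pattern” or “baldness”, which is the kind of ambiguity the slide's closing question points at.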
Candidate identification • Look for all variants in Metathesaurus strings and identify those candidate concepts (CUIs) that contain at least one variant as a substring • Example: For ocular complication, obtain all Metathesaurus strings that contain any of the following as substrings • Optic complication • Eyes complication • Ophthalmic complicated • ….
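A minimal sketch of the substring lookup described above. The variant table, the concept strings, and the identifiers `CUI_A` to `CUI_C` are made up for illustration; a real system would draw variants from a lexicon and strings from the Metathesaurus.

```python
from itertools import product
from typing import Dict, List, Set

# Hypothetical variant table; a real system would derive these from a lexicon.
WORD_VARIANTS: Dict[str, List[str]] = {
    "ocular": ["ocular", "optic", "eye", "ophthalmic"],
    "complication": ["complication", "complications", "complicated"],
}

# Hypothetical (identifier, string) pairs standing in for Metathesaurus strings.
STRINGS = [
    ("CUI_A", "eye complications of diabetes"),
    ("CUI_B", "ophthalmic complication"),
    ("CUI_C", "pulmonary complication"),
]


def phrase_variants(phrase: str) -> Set[str]:
    """Combine per-word variants into all variant phrases."""
    words = phrase.lower().split()
    options = [WORD_VARIANTS.get(w, [w]) for w in words]
    return {" ".join(combo) for combo in product(*options)}


def candidates(phrase: str) -> List[str]:
    """Return identifiers whose string contains at least one variant phrase."""
    variants = phrase_variants(phrase)
    return [cui for cui, text in STRINGS
            if any(v in text.lower() for v in variants)]


found = candidates("ocular complication")
```

Plain substring matching is deliberately loose: it returns every concept that mentions a variant, and leaves choosing among them to a later ranking step.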
Mapping and Evaluation • So now we have a bunch of candidate CUIs based on presence of variants of the given phrase in Metathesaurus strings. How do we select the best candidate? • Use several measures to compute a rank • Centrality (involvement of head) • Variation (average of inverse distance scores) • Coverage • Cohesiveness
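One way to turn the four measures into a single rank is a weighted average. The sketch below assumes each measure has already been computed as a number between 0 and 1. The weights (coverage and cohesiveness counted twice) are an assumption of this example, since the slide does not give the formula, and the candidate values are invented.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    cui: str
    centrality: float     # 1.0 if the head of the phrase is involved, else 0.0
    variation: float
    coverage: float
    cohesiveness: float


# Assumed weights for illustration; adjust to match the scoring being modeled.
WEIGHTS = {"centrality": 1.0, "variation": 1.0,
           "coverage": 2.0, "cohesiveness": 2.0}


def score(c: Candidate) -> float:
    """Weighted average of the four measures, in the range 0 to 1."""
    total = sum(WEIGHTS.values())
    weighted = sum(w * getattr(c, name) for name, w in WEIGHTS.items())
    return weighted / total


def rank(cands: List[Candidate]) -> List[Candidate]:
    """Best candidate first."""
    return sorted(cands, key=score, reverse=True)


# Invented candidates with placeholder identifiers.
a = Candidate("CUI_A", 1.0, 1.0, 1.0, 1.0)
b = Candidate("CUI_B", 1.0, 0.5, 0.5, 0.5)
c = Candidate("CUI_C", 0.0, 1.0, 0.5, 0.5)
ordered = [cand.cui for cand in rank([c, b, a])]
```

Keeping the weights in one table makes it easy to see how the ordering shifts when, say, centrality is given more influence.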
MetaMap Options • Types of variants: include or exclude derivational variants • Word sense disambiguation • Discharge (bodily secretion vs. release of the patient) • Concept gaps • “Obstructive apnea” mapping to “obstructive sleep apnea” or “obstructive neonatal apnea” • Term processing • Process the input string as a single concept, that is, don’t split it into noun phrases
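The concept-gap idea can be illustrated with an in-order word match that tolerates a few extra words inside the concept string. This is a rough sketch of the idea only: the function name, the `max_gap` parameter, and the greedy matching strategy are assumptions of this example, not MetaMap's actual behavior.

```python
def matches_with_gaps(phrase: str, concept_string: str, max_gap: int = 1) -> bool:
    """True if the phrase words occur in order in the concept string, with at
    most `max_gap` extra words between consecutive matched words. Extra words
    before the first or after the last matched word are not counted."""
    words = phrase.lower().split()
    target = concept_string.lower().split()
    if not words:
        return False
    # Try each position where the first phrase word occurs.
    for start, tok in enumerate(target):
        if tok != words[0]:
            continue
        pos, ok = start, True
        for w in words[1:]:
            # Greedy: take the nearest occurrence of the next word in the window.
            window = target[pos + 1: pos + 2 + max_gap]
            if w not in window:
                ok = False
                break
            pos = pos + 1 + window.index(w)
        if ok:
            return True
    return False


sleep = matches_with_gaps("obstructive apnea", "obstructive sleep apnea")
neonatal = matches_with_gaps("obstructive apnea", "obstructive neonatal apnea")
```

With `max_gap=0` the function falls back to requiring the phrase words to be adjacent, which corresponds to switching gap tolerance off.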
Output options • Human-readable format • XML format • Restrictions based on certain vocabularies: consider only ICD-9 • Restrictions based on certain types: consider only pharmacological substances (i.e., drugs) • DEMO TIME: Daniel Harris