Natural Language Processing in Bioinformatics: Uncovering Semantic Relations Barbara Rosario Joint work with Marti Hearst SIMS, UC Berkeley
Outline of Talk • Goal: Extract semantics from text • Information and relation extraction • Protein-protein interactions • Noun compounds
Text Mining • Text Mining is the discovery by computers of new, previously unknown information, via automatic extraction of information from text
Text Mining • Text: • Stress is associated with migraines • Stress can lead to loss of magnesium • Calcium channel blockers prevent some migraines • Magnesium is a natural calcium channel blocker 1: Extract semantic entities from text
Text Mining • Text: • Stress is associated with migraines • Stress can lead to loss of magnesium • Calcium channel blockers prevent some migraines • Magnesium is a natural calcium channel blocker 1: Extract semantic entities from text: Stress, Migraine, Magnesium, Calcium channel blockers
Text Mining (cont.) • Text: • Stress is associated with migraines • Stress can lead to loss of magnesium • Calcium channel blockers prevent some migraines • Magnesium is a natural calcium channel blocker 2: Classify relations between entities [Diagram: Stress, Migraine, Magnesium, Calcium channel blockers linked by Associated with, Lead to loss, Prevent, Subtype-of (is a)]
Text Mining (cont.) • Text: • Stress is associated with migraines • Stress can lead to loss of magnesium • Calcium channel blockers prevent some migraines • Magnesium is a natural calcium channel blocker 3: Do reasoning: find new correlations [Diagram: Stress, Migraine, Magnesium, Calcium channel blockers linked by Associated with, Lead to loss, Prevent, Subtype-of (is a)]
Text Mining (cont.) • Text: • Stress is associated with migraines • Stress can lead to loss of magnesium • Calcium channel blockers prevent some migraines • Magnesium is a natural calcium channel blocker 4: Do reasoning: infer causality [Diagram: as above, with inferred links: No prevention; Deficiency of magnesium → migraine]
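As a rough illustration of steps 3 and 4, the sketch below chains the extracted relations programmatically. The relation names and the two inference rules are invented for illustration; they are not part of the system described in the talk.

```python
# Toy relation sets built from the four example sentences (names are illustrative).
prevents = {("calcium channel blockers", "migraine")}
subtype_of = {("magnesium", "calcium channel blockers")}
leads_to_loss_of = {("stress", "magnesium")}

# Step 3 (find new correlations): a subtype inherits what its parent prevents.
inferred_prevents = {(sub, dis) for (sub, parent) in subtype_of
                     for (p, dis) in prevents if p == parent}
print(inferred_prevents)  # {('magnesium', 'migraine')}

# Step 4 (infer causality): losing a preventer suggests a causal link.
inferred_causes = {(cause, dis) for (cause, lost) in leads_to_loss_of
                   for (p, dis) in inferred_prevents if p == lost}
print(inferred_causes)  # {('stress', 'migraine')}, via magnesium deficiency
```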
My research: Information Extraction • Stress is associated with migraines • Stress can lead to loss of magnesium • Calcium channel blockers prevent some migraines • Magnesium is a natural calcium channel blocker [Extracted entities: Stress, Migraine, Magnesium, Calcium channel blockers]
My research: Relation extraction [Diagram: Stress, Migraine, Magnesium, Calcium channel blockers linked by Associated with, Lead to loss, Prevent, Subtype-of (is a)]
Information and relation extraction • Problems: • Given biomedical text: • Find all the treatments and all the diseases • Find the relations that hold between them [Diagram: Treatment and Disease linked by Cure?, Prevent?, Side Effect?]
Hepatitis Examples • Cure • These results suggest that con A-induced hepatitis was ameliorated by pretreatment with TJ-135. • Prevent • A two-dose combined hepatitis A and B vaccine would facilitate immunization programs • Vague • Effect of interferon on hepatitis B
Two tasks • Relationship extraction: • Identify the several semantic relations that can occur between the entities disease and treatment in bioscience text • Information extraction (IE): • Related problem: identify such entities
Outline of IE • Data and semantic relations • Quick intro to graphical models • Models and results • Features • Conclusions
Data and Relations • MEDLINE, abstracts and titles • 3662 sentences labeled • Relevant: 1724 • Irrelevant: 1771 • e.g., “Patients were followed up for 6 months” • 2 types of Entities • treatment and disease • 7 Relationships between these entities The labeled data are available at http://biotext.berkeley.edu
Semantic Relationships • 810: Cure • Intravenous immune globulin for recurrent spontaneous abortion • 616: Only Disease • Social ties and susceptibility to the common cold • 166: Only Treatment • Fluticasone propionate is safe in recommended doses • 63: Prevent • Statins for prevention of stroke
Semantic Relationships • 36: Vague • Phenylbutazone and leukemia • 29: Side Effect • Malignant mesodermal mixed tumor of the uterus following irradiation • 4: Does NOT cure • Evidence for double resistance to permethrin and malathion in head lice
Outline of IE • Data and semantic relations • Quick intro to graphical models • Models and results • Features • Conclusions
Graphical Models • Unifying framework for developing Machine Learning algorithms • Graph theory plus probability theory • Widely used • Error correcting codes • Systems diagnosis • Computer vision • Filtering (Kalman filters) • Bioinformatics
(Quick intro to) Graphical Models • Nodes are random variables • Edges are annotated with conditional probabilities • Absence of an edge between nodes implies conditional independence • "Probabilistic database" [Diagram: example graph over nodes A, B, C, D]
Graphical Models • Define a joint probability distribution: • P(X1, ..., XN) = ∏i P(Xi | Par(Xi)) • P(A,B,C,D) = P(A) P(D) P(B|A) P(C|A,D) • Learning • Given data, estimate P(A), P(B|A), P(D), P(C|A,D) [Diagram: example graph over nodes A, B, C, D]
Graphical Models • Define a joint probability distribution: • P(X1, ..., XN) = ∏i P(Xi | Par(Xi)) • P(A,B,C,D) = P(A) P(D) P(B|A) P(C|A,D) • Learning • Given data, estimate P(A), P(B|A), P(D), P(C|A,D) • Inference: compute conditional probabilities, e.g., P(A|B, D) • Inference = probabilistic queries. General inference algorithms (Junction Tree)
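A minimal sketch of estimating these conditional probability tables by counting, for the A → B, {A, D} → C structure on the slide. The data are made up, and a real system would also smooth the counts.

```python
from collections import Counter

# Toy fully observed samples of (A, B, C, D); values are binary and invented.
data = [(1, 1, 0, 0), (1, 0, 1, 1), (0, 0, 0, 1), (1, 1, 1, 1), (0, 1, 0, 0)]
N = len(data)

cA   = Counter(a for a, b, c, d in data)
cD   = Counter(d for a, b, c, d in data)
cAB  = Counter((a, b) for a, b, c, d in data)
cAD  = Counter((a, d) for a, b, c, d in data)
cADC = Counter((a, d, c) for a, b, c, d in data)

# Maximum-likelihood estimates of the conditional probability tables.
P_A = {a: cA[a] / N for a in cA}
P_D = {d: cD[d] / N for d in cD}
P_B_given_A = {(a, b): cAB[(a, b)] / cA[a] for (a, b) in cAB}
P_C_given_AD = {(a, d, c): cADC[(a, d, c)] / cAD[(a, d)] for (a, d, c) in cADC}

def joint(a, b, c, d):
    """P(A,B,C,D) = P(A) P(D) P(B|A) P(C|A,D), the factorization on the slide."""
    return P_A[a] * P_D[d] * P_B_given_A[(a, b)] * P_C_given_AD[(a, d, c)]

print(joint(1, 1, 1, 1))
```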
Naïve Bayes models • Simple graphical model • The Xi depend on Y • Naïve Bayes assumption: all Xi are independent given Y • Currently used for text classification and spam detection [Diagram: class node Y with children x1, x2, x3]
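A minimal Naïve Bayes text-classification sketch with scikit-learn; the sentences and labels are toy stand-ins, not the corpus used in the talk.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training sentences with relation labels (illustrative only).
sentences = [
    "hepatitis was ameliorated by pretreatment with TJ-135",
    "vaccine would facilitate immunization programs",
    "effect of interferon on hepatitis B",
    "statins for prevention of stroke",
]
labels = ["cure", "prevent", "vague", "prevent"]

# Word counts are the x_i; Naive Bayes treats them as independent given the class Y.
clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(sentences, labels)
print(clf.predict(["interferon ameliorated hepatitis"]))
```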
Dynamic Graphical Models • Graphical model composed of repeated segments • HMMs (Hidden Markov Models) • POS tagging, speech recognition, IE [Diagram: chain of tag nodes t1 ... tN emitting words w1 ... wN]
HMMs • Joint probability distribution • P(t1, ..., tN, w1, ..., wN) = P(t1) ∏i P(ti | ti-1) P(wi | ti) • Estimate P(t1), P(ti | ti-1), P(wi | ti) from labeled data
HMMs • Joint probability distribution • P(t1, ..., tN, w1, ..., wN) = P(t1) ∏i P(ti | ti-1) P(wi | ti) • Estimate P(t1), P(ti | ti-1), P(wi | ti) from labeled data • Inference: P(ti | w1, w2, ..., wN)
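A minimal sketch of evaluating the HMM joint probability above, with toy tag and word tables; every probability value here is invented for illustration.

```python
import math

# Toy parameters: P(t1), P(ti | ti-1), P(wi | ti). All values are illustrative.
P_init  = {"TREAT": 0.4, "DIS": 0.3, "NONE": 0.3}
P_trans = {("TREAT", "NONE"): 0.6, ("NONE", "NONE"): 0.5, ("NONE", "DIS"): 0.3,
           ("DIS", "NONE"): 0.5, ("TREAT", "DIS"): 0.2}
P_emit  = {("TREAT", "statins"): 0.01, ("NONE", "for"): 0.1, ("NONE", "of"): 0.1,
           ("NONE", "prevention"): 0.001, ("DIS", "stroke"): 0.02}

def hmm_log_joint(tags, words):
    """log P(t1..tN, w1..wN) = log P(t1) + sum_i log P(ti|ti-1) + sum_i log P(wi|ti)."""
    logp = math.log(P_init[tags[0]]) + math.log(P_emit[(tags[0], words[0])])
    for i in range(1, len(tags)):
        logp += math.log(P_trans[(tags[i - 1], tags[i])])
        logp += math.log(P_emit[(tags[i], words[i])])
    return logp

print(hmm_log_joint(["TREAT", "NONE", "NONE", "NONE", "DIS"],
                    ["statins", "for", "prevention", "of", "stroke"]))
```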
Graphical Models for IE • Different dependencies between the features and the relation nodes [Diagrams: Dynamic and Static model variants D1, S1, D2, S2, D3]
Graphical Model • Relation node: • Semantic relation (cure, prevent, none, ...) expressed in the sentence • The Relation node generates the state sequence and the observations
Graphical Model • Markov sequence of states (roles) • Role nodes: • Rolet ∈ {treatment, disease, none} [Diagram: chain Rolet-1 → Rolet → Rolet+1]
Graphical Model • Roles generate multiple observations • Feature nodes (observed): • word, POS, MeSH, ...
Graphical Model • Inference: find Relation and Roles given the observed features [Diagram: Relation and Role nodes marked with "?"]
Features • Word • Part of speech • Phrase constituent • Orthographic features • ‘is number’, ‘all letters are capitalized’, ‘first letter is capitalized’ … • Semantic features (MeSH)
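A minimal sketch of the kind of orthographic features listed above; the exact feature set used in the system may differ.

```python
import re

def orthographic_features(word):
    """Simple surface-form features of the kind listed on the slide."""
    return {
        "is_number": word.replace(".", "", 1).isdigit(),
        "all_letters_capitalized": word.isupper(),
        "first_letter_capitalized": word[:1].isupper(),
        "has_hyphen": "-" in word,
        "has_digit": bool(re.search(r"\d", word)),
    }

print(orthographic_features("HIV-1"))
# {'is_number': False, 'all_letters_capitalized': True,
#  'first_letter_capitalized': True, 'has_hyphen': True, 'has_digit': True}
```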
MeSH • MeSH Tree Structures 1. Anatomy [A] 2. Organisms [B] 3. Diseases [C] 4. Chemicals and Drugs [D] 5. Analytical, Diagnostic and Therapeutic Techniques and Equipment [E] 6. Psychiatry and Psychology [F] 7. Biological Sciences [G] 8. Physical Sciences [H] 9. Anthropology, Education, Sociology and Social Phenomena [I] 10. Technology and Food and Beverages [J] 11. Humanities [K] 12. Information Science [L] 13. Persons [M] 14. Health Care [N] 15. Geographic Locations [Z]
MeSH (cont.) • 1. Anatomy [A]: Body Regions [A01] + Musculoskeletal System [A02] Digestive System [A03] + Respiratory System [A04] + Urogenital System [A05] + Endocrine System [A06] + Cardiovascular System [A07] + Nervous System [A08] + Sense Organs [A09] + Tissues [A10] + Cells [A11] + Fluids and Secretions [A12] + Animal Structures [A13] + Stomatognathic System [A14] (...) • Body Regions [A01]: Abdomen [A01.047] Groin [A01.047.365] Inguinal Canal [A01.047.412] Peritoneum [A01.047.596] + Umbilicus [A01.047.849] Axilla [A01.133] Back [A01.176] + Breast [A01.236] + Buttocks [A01.258] Extremities [A01.378] + Head [A01.456] + Neck [A01.598] (...)
Use of Lexical Hierarchies in NLP • Big problem in NLP: few words occur a lot, most of them occur very rarely (Zipf's law) • Difficult to do statistics • One solution: use lexical hierarchies • Another example: WordNet • Statistics on classes of words instead of words
Mapping Words to MeSH Concepts • headache pain • C23.888.592.612.441 G11.561.796.444 • C23.888 G11.561 • [Neurologic Manifestations] [Nervous System Physiology] • C23 G11 • [Pathological Conditions, Signs and Symptoms] [Musculoskeletal, Neural, and Ocular Physiology] • headache recurrence • C23.888.592.612.441 C23.550.291.937 • breast cancer cells • A01.236 C04 A11
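A minimal sketch of generalizing a word to a MeSH ancestor by truncating its dot-separated tree code, as in the headache → C23.888 → C23 example above. The small word-to-code dictionary is a stand-in for a real MeSH lookup.

```python
# Illustrative word -> MeSH tree-code lookup (a real system would use the MeSH files).
mesh_code = {
    "headache": "C23.888.592.612.441",
    "recurrence": "C23.550.291.937",
    "breast": "A01.236",
}

def ancestors(code):
    """'C23.888.592.612.441' -> ['C23', 'C23.888', 'C23.888.592', ...]."""
    parts = code.split(".")
    return [".".join(parts[:i]) for i in range(1, len(parts) + 1)]

def generalize(word, depth=2):
    """Replace a word by its MeSH ancestor at the given depth, if it has a code."""
    code = mesh_code.get(word)
    if code is None:
        return word
    chain = ancestors(code)
    return chain[min(depth, len(chain)) - 1]

print(generalize("headache"))    # C23.888
print(generalize("recurrence"))  # C23.550
print(generalize("breast"))      # A01.236 (already at depth 2)
```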
Graphical Model • Joint probability distribution over relation, roles and features nodes • Parameters estimated with maximum likelihood and absolute discounting smoothing
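A minimal sketch of absolute-discounting smoothing for one conditional word distribution; the discount value and the uniform backoff over unseen words are assumptions, not necessarily the exact scheme used in the talk.

```python
from collections import Counter

def absolute_discount(counts, vocab_size, d=0.5):
    """P(w | context): subtract a fixed discount d from each seen count and
    spread the freed probability mass uniformly over unseen words."""
    total = sum(counts.values())
    reserved = d * len(counts) / total        # mass moved to unseen words
    unseen = vocab_size - len(counts)
    def prob(w):
        if w in counts:
            return (counts[w] - d) / total
        return reserved / unseen if unseen else 0.0
    return prob

# Toy counts of words emitted by the "treatment" role (illustrative).
counts = Counter({"statins": 3, "vaccine": 2, "interferon": 1})
p = absolute_discount(counts, vocab_size=10)
print(p("statins"), p("migraine"))  # a seen word vs. an unseen word
```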
Graphical Model • Inference: find Relation and Roles given the observed features [Diagram: Relation and Role nodes marked with "?"]
Relation extraction • Results in terms of classification accuracy (with and without irrelevant sentences) • 2 cases: • Roles given • Roles hidden (only features)
Relation classification: Results • Good results for a difficult task • One of the few systems to tackle several DIFFERENT relations between the same types of entities; this differs from the problem statement of other work on relations
Role Extraction: Results • Junction tree algorithm • F-measure = (2 × Prec × Recall) / (Prec + Recall) • (Related work extracting "diseases" and "genes" reports an F-measure of 0.50)
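For reference, the F-measure above computed directly; the precision and recall values are placeholders, not the reported results.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall, as defined on the slide."""
    return 2 * precision * recall / (precision + recall)

print(f_measure(0.70, 0.72))  # example values only
```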
Features impact: Role extraction • Most important features: 1) Word 2) MeSH • F-measure (Rel. + irrel. / Only rel.): • All features: 0.71 / 0.73 • No word: 0.61 (-14.1%) / 0.66 (-9.6%) • No MeSH: 0.65 (-8.4%) / 0.69 (-5.5%)
Features impact: Relation classification (rel. + irrel.) • Most important features: Roles • Accuracy: • All features + roles: 82.0 • All features - roles: 74.9 (-8.7%) • All features + roles - Word: 79.8 (-2.8%) • All features + roles - MeSH: 84.6 (+3.1%)
Features impact: Relation classification (rel. + irrel.) • Most realistic case: Roles not known • Most important features: 1) Word 2) MeSH • Accuracy: • All features - roles: 74.9 • All features - roles - Word: 66.1 (-11.8%) • All features - roles - MeSH: 72.5 (-3.2%)
Conclusions • Classification of subtle semantic relations in bioscience text • Graphical models for the simultaneous extraction of entities and relationships • Importance of MeSH, lexical hierarchy
Outline of Talk • Goal: Extract semantics from text • Information and relation extraction • Protein-protein interactions; using an existing database to gather labeled data
Protein-Protein interactions • One of the most important challenges in modern genomics, with many applications throughout biology • There are several protein-protein interaction databases (BIND, MINT,..), all manually curated
Protein-Protein interactions • Supervised systems require manually labeled data, while purely unsupervised methods have yet to prove effective for these tasks. • Some other approaches: semi-supervised learning, active learning, co-training. • We propose using resources developed in the biomedical domain to address the problem of gathering labeled data for the task of classifying interactions between proteins
HIV-1, Protein Interaction Database • Documents interactions between HIV-1 proteins and • host cell proteins • other HIV-1 proteins • disease associated with HIV/AIDS • 2224 pairs of interacting proteins, 65 types http://www.ncbi.nlm.nih.gov/RefSeq/HIVInteractions