
Natural Language Processing in Bioinformatics: Uncovering Semantic Relations




Presentation Transcript


  1. Natural Language Processing in Bioinformatics: Uncovering Semantic Relations Barbara Rosario Joint work with Marti Hearst SIMS, UC Berkeley

  2. Outline of Talk • Goal: Extract semantics from text • Information and relation extraction • Protein-protein interactions • Noun compounds

  3. Text Mining • Text Mining is the discovery by computers of new, previously unknown information, via automatic extraction of information from text

  4. Text Mining • Text: • Stress is associated with migraines • Stress can lead to loss of magnesium • Calcium channel blockers prevent some migraines • Magnesium is a natural calcium channel blocker 1: Extract semantic entities from text

  5. Text Mining • Text: • Stress is associated with migraines • Stress can lead to loss of magnesium • Calcium channel blockers prevent some migraines • Magnesium is a natural calcium channel blocker 1: Extract semantic entities from text → Stress, Migraine, Magnesium, Calcium channel blockers

  6. Text Mining (cont.) • Text: • Stress is associated with migraines • Stress can lead to loss of magnesium • Calcium channel blockers prevent some migraines • Magnesium is a natural calcium channel blocker 2: Classify relations between entities [diagram: (Stress, associated with, Migraine); (Stress, lead to loss, Magnesium); (Calcium channel blockers, prevent, Migraine); (Magnesium, subtype-of (is a), Calcium channel blockers)]

  7. Text Mining (cont.) • Text: • Stress is associated with migraines • Stress can lead to loss of magnesium • Calcium channel blockers prevent some migraines • Magnesium is a natural calcium channel blocker 3: Do reasoning: find new correlations [diagram: (Stress, associated with, Migraine); (Stress, lead to loss, Magnesium); (Calcium channel blockers, prevent, Migraine); (Magnesium, subtype-of (is a), Calcium channel blockers)]

  8. Text Mining (cont.) • Text: • Stress is associated with migraines • Stress can lead to loss of magnesium • Calcium channel blockers prevent some migraines • Magnesium is a natural calcium channel blocker 4: Do reasoning: infer causality [diagram adds: (Stress, no prevention, Migraine)] • Deficiency of magnesium → migraine
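The pipeline on these slides (extract entities, classify relations, then chain relations to find new correlations) can be sketched as a toy graph search. The relation names and data structures below are hypothetical illustrations, not the system from the talk:

```python
# Extracted relations stored as labeled edges (toy data from the slides).
relations = {
    ("stress", "migraine"): "associated_with",
    ("stress", "magnesium"): "leads_to_loss_of",
    ("calcium channel blockers", "migraine"): "prevents",
    ("magnesium", "calcium channel blockers"): "subtype_of",
}

def two_step_paths(relations, start):
    """Yield pairs of edges forming a chain start -> b -> c,
    which suggest an indirect (new) correlation."""
    for (a, b), r1 in relations.items():
        if a != start:
            continue
        for (b2, c), r2 in relations.items():
            if b2 == b:
                yield (a, r1, b), (b, r2, c)
```

Running `two_step_paths(relations, "stress")` surfaces the chain stress → magnesium → calcium channel blockers, the same indirect link the slide uses to suggest that magnesium deficiency may lead to migraine.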

  9. My research: Information Extraction • Stress is associated with migraines • Stress can lead to loss of magnesium • Calcium channel blockers prevent some migraines • Magnesium is a natural calcium channel blocker [extracted entities: Stress, Migraine, Magnesium, Calcium channel blockers]

  10. My research: Relation extraction [diagram: (Stress, associated with, Migraine); (Stress, lead to loss, Magnesium); (Calcium channel blockers, prevent, Migraine); (Magnesium, subtype-of (is a), Calcium channel blockers)]

  11. Information and relation extraction • Problems: • Given biomedical text: • Find all the treatments and all the diseases • Find the relations that hold between them [diagram: Treatment and Disease linked by candidate relations: Cure? Prevent? Side Effect?]

  12. Hepatitis Examples • Cure • These results suggest that con A-induced hepatitis was ameliorated by pretreatment with TJ-135. • Prevent • A two-dose combined hepatitis A and B vaccine would facilitate immunization programs • Vague • Effect of interferon on hepatitis B

  13. Two tasks • Relationship extraction: • Identify the semantic relations that can occur between the entities disease and treatment in bioscience text • Information extraction (IE): • Related problem: identify such entities

  14. Outline of IE • Data and semantic relations • Quick intro to graphical models • Models and results • Features • Conclusions

  15. Data and Relations • MEDLINE, abstracts and titles • 3662 sentences labeled • Relevant: 1724 • Irrelevant: 1771 • e.g., “Patients were followed up for 6 months” • 2 types of Entities • treatment and disease • 7 Relationships between these entities The labeled data are available at http://biotext.berkeley.edu

  16. Semantic Relationships • 810: Cure • Intravenous immune globulin for recurrent spontaneous abortion • 616: Only Disease • Social ties and susceptibility to the common cold • 166: Only Treatment • Fluticasone propionate is safe in recommended doses • 63: Prevent • Statins for prevention of stroke

  17. Semantic Relationships • 36: Vague • Phenylbutazone and leukemia • 29: Side Effect • Malignant mesodermal mixed tumor of the uterus following irradiation • 4: Does NOT cure • Evidence for double resistance to permethrin and malathion in head lice

  18. Outline of IE • Data and semantic relations • Quick intro to graphical models • Models and results • Features • Conclusions

  19. Graphical Models • Unifying framework for developing Machine Learning algorithms • Graph theory plus probability theory • Widely used • Error correcting codes • Systems diagnosis • Computer vision • Filtering (Kalman filters) • Bioinformatics

  20. (Quick intro to) Graphical Models • Nodes are random variables • Edges are annotated with conditional probabilities • Absence of an edge between nodes implies conditional independence • “Probabilistic database” [diagram: nodes A, B, C, D]

  21. Graphical Models • Define a joint probability distribution: • P(X1, …, XN) = ∏i P(Xi | Par(Xi)) • P(A,B,C,D) = P(A)P(D)P(B|A)P(C|A,D) • Learning • Given data, estimate P(A), P(B|A), P(D), P(C|A,D) [diagram: nodes A, B, C, D]

  22. Graphical Models • Define a joint probability distribution: • P(X1, …, XN) = ∏i P(Xi | Par(Xi)) • P(A,B,C,D) = P(A)P(D)P(B|A)P(C|A,D) • Learning • Given data, estimate P(A), P(B|A), P(D), P(C|A,D) • Inference: compute conditional probabilities, e.g., P(A|B, D) • Inference = probabilistic queries. General inference algorithms (Junction Tree) [diagram: nodes A, B, C, D]
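The factorization on this slide can be checked numerically. A toy sketch with made-up binary conditional probability tables (the numbers are illustrative assumptions, not from the talk):

```python
# Toy CPTs for the DAG A -> B, (A, D) -> C; all probabilities are made up.
P_A = {0: 0.6, 1: 0.4}
P_D = {0: 0.7, 1: 0.3}
P_B_given_A = {(0, 0): 0.9, (1, 0): 0.1, (0, 1): 0.2, (1, 1): 0.8}  # key: (b, a)
P_C_given_AD = {  # key: (c, a, d)
    (0, 0, 0): 0.95, (1, 0, 0): 0.05,
    (0, 0, 1): 0.5,  (1, 0, 1): 0.5,
    (0, 1, 0): 0.3,  (1, 1, 0): 0.7,
    (0, 1, 1): 0.1,  (1, 1, 1): 0.9,
}

def joint(a, b, c, d):
    # P(A,B,C,D) = P(A) P(D) P(B|A) P(C|A,D)
    return P_A[a] * P_D[d] * P_B_given_A[(b, a)] * P_C_given_AD[(c, a, d)]

# A valid factorization sums to 1 over all assignments.
total = sum(joint(a, b, c, d)
            for a in (0, 1) for b in (0, 1)
            for c in (0, 1) for d in (0, 1))
```

Learning, as the slide says, amounts to estimating each of these local tables from data; the joint never has to be stored explicitly.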

  23. Naïve Bayes models • Simple graphical model • Xi depend on Y • Naïve Bayes assumption: all Xi are independent given Y • Currently used for text classification and spam detection [diagram: node Y with children x1, x2, x3]
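A minimal Naïve Bayes text classifier along these lines, scoring P(Y) ∏i P(xi | Y) with add-one smoothing. The training sentences are toy examples, not the paper's data:

```python
from collections import Counter
import math

# Toy labeled sentences (class, text); purely illustrative.
train = [
    ("cure", "treatment ameliorated the disease"),
    ("cure", "vaccine prevents infection in patients"),
    ("none", "patients were followed for six months"),
]

class_counts = Counter(y for y, _ in train)
word_counts = {y: Counter() for y in class_counts}
for y, text in train:
    word_counts[y].update(text.split())
vocab = {w for y in word_counts for w in word_counts[y]}

def classify(text):
    # argmax_y log P(y) + sum_i log P(x_i | y), with add-one smoothing.
    def log_score(y):
        total = sum(word_counts[y].values())
        s = math.log(class_counts[y] / len(train))
        for w in text.split():
            s += math.log((word_counts[y][w] + 1) / (total + len(vocab)))
        return s
    return max(class_counts, key=log_score)
```

Despite the strong independence assumption, this kind of model is a standard baseline for text classification.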

  24. Dynamic Graphical Models • Graphical model composed of repeated segments • HMMs (Hidden Markov Models) • POS tagging, speech recognition, IE [diagram: tag nodes t1…tN over word nodes w1…wN]

  25. HMMs • Joint probability distribution • P(t1, …, tN, w1, …, wN) = P(t1) ∏i P(ti|ti-1) P(wi|ti) • Estimate P(t1), P(ti|ti-1), P(wi|ti) from labeled data

  26. HMMs • Joint probability distribution • P(t1, …, tN, w1, …, wN) = P(t1) ∏i P(ti|ti-1) P(wi|ti) • Estimate P(t1), P(ti|ti-1), P(wi|ti) from labeled data • Inference: P(ti | w1, w2, …, wN)
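Finding the best tag (role) sequence under this factorization is standard Viterbi decoding. A sketch with made-up parameters (in the talk's setting these would be estimated from the labeled data):

```python
import math

# Toy HMM: three role states and invented probabilities for illustration.
states = ["TREAT", "DIS", "NONE"]
start = {"TREAT": 0.3, "DIS": 0.3, "NONE": 0.4}
trans = {s: {t: 1 / 3 for t in states} for s in states}  # uniform transitions
emit = {
    "TREAT": {"vaccine": 0.6, "hepatitis": 0.1, "the": 0.3},
    "DIS":   {"vaccine": 0.1, "hepatitis": 0.6, "the": 0.3},
    "NONE":  {"vaccine": 0.1, "hepatitis": 0.1, "the": 0.8},
}

def viterbi(words):
    """Most likely state sequence, computed in log space."""
    V = [{s: math.log(start[s]) + math.log(emit[s][words[0]]) for s in states}]
    back = []
    for w in words[1:]:
        row, ptr = {}, {}
        for s in states:
            best = max(states, key=lambda p: V[-1][p] + math.log(trans[p][s]))
            row[s] = V[-1][best] + math.log(trans[best][s]) + math.log(emit[s][w])
            ptr[s] = best
        V.append(row)
        back.append(ptr)
    path = [max(states, key=lambda s: V[-1][s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))
```

For example, `viterbi(["the", "vaccine", "hepatitis"])` labels the determiner NONE and picks out "vaccine" as a treatment and "hepatitis" as a disease under these toy parameters.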

  27. Graphical Models for IE • Different dependencies between the features and the relation nodes • Dynamic and static variants [diagram: nodes D1, S1, D2, S2, D3]

  28. Graphical Model • Relation node: • Semantic relation (cure, prevent, none, …) expressed in the sentence • The relation generates the state sequence and the observations

  29. Graphical Model • Markov sequence of states (roles) • Role nodes: • Rolet ∈ {treatment, disease, none}

  30. Graphical Model • Roles generate multiple observations • Feature nodes (observed): • word, POS, MeSH…

  31. Graphical Model • Inference: find Relation and Roles given the observed features

  32. Features • Word • Part of speech • Phrase constituent • Orthographic features • ‘is number’, ‘all letters are capitalized’, ‘first letter is capitalized’ … • Semantic features (MeSH)
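The orthographic features listed above are simple per-token tests. A sketch of such a feature extractor (the feature names here are my own, hypothetical labels):

```python
def token_features(word):
    """Word identity plus orthographic cues of the kind listed on the slide."""
    return {
        "word": word.lower(),                # word feature, case-normalized
        "is_number": word.isdigit(),         # 'is number'
        "all_caps": word.isupper(),          # 'all letters are capitalized'
        "init_cap": word[:1].isupper(),      # 'first letter is capitalized'
    }
```

Features like these help the model recognize, e.g., gene or drug names written in capitals without having seen the exact word in training.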

  33. MeSH • MeSH Tree Structures 1. Anatomy [A] 2. Organisms [B] 3. Diseases [C] 4. Chemicals and Drugs [D] 5. Analytical, Diagnostic and Therapeutic Techniques and Equipment [E] 6. Psychiatry and Psychology [F] 7. Biological Sciences [G] 8. Physical Sciences [H] 9. Anthropology, Education, Sociology and Social Phenomena [I] 10. Technology and Food and Beverages [J] 11. Humanities [K] 12. Information Science [L] 13. Persons [M] 14. Health Care [N] 15. Geographic Locations [Z]

  34. MeSH (cont.) • 1. Anatomy [A]: Body Regions [A01] + Musculoskeletal System [A02] Digestive System [A03] + Respiratory System [A04] + Urogenital System [A05] + Endocrine System [A06] + Cardiovascular System [A07] + Nervous System [A08] + Sense Organs [A09] + Tissues [A10] + Cells [A11] + Fluids and Secretions [A12] + Animal Structures [A13] + Stomatognathic System [A14] (…..) • Body Regions [A01]: Abdomen [A01.047] Groin [A01.047.365] Inguinal Canal [A01.047.412] Peritoneum [A01.047.596] + Umbilicus [A01.047.849] Axilla [A01.133] Back [A01.176] + Breast [A01.236] + Buttocks [A01.258] Extremities [A01.378] + Head [A01.456] + Neck [A01.598] (….)

  35. Use of lexical Hierarchies in NLP • Big problem in NLP: few words occur a lot, most of them occur very rarely (Zipf’s law) • Difficult to do statistics • One solution: use lexical hierarchies • Another example: WordNet • Statistics on classes of words instead of words

  36. Mapping Words to MeSH Concepts • headache → C23.888.592.612.441; pain → G11.561.796.444 • Generalized one level: headache → C23.888 [Neurologic Manifestations]; pain → G11.561 [Nervous System Physiology] • Top level: headache → C23 [Pathological Conditions, Signs and Symptoms]; pain → G11 [Musculoskeletal, Neural, and Ocular Physiology] • headache recurrence → C23.888.592.612.441, C23.550.291.937 • breast cancer cells → A01.236, C04, A11
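Because MeSH tree numbers are dot-separated paths, generalizing a concept to a coarser class is just truncation, as in the headache/pain example above:

```python
def generalize(code, levels):
    """Keep only the first `levels` components of a MeSH tree number,
    e.g. C23.888.592.612.441 at two levels becomes C23.888."""
    return ".".join(code.split(".")[:levels])
```

This is how statistics can be gathered over classes of words (C23.888, Neurologic Manifestations) instead of individual rare words (headache).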

  37. Graphical Model • Joint probability distribution over relation, role, and feature nodes • Parameters estimated with maximum likelihood and absolute discounting smoothing

  38. Graphical Model • Inference: find Relation and Roles given the observed features

  39. Relation extraction • Results in terms of classification accuracy (with and without irrelevant sentences) • 2 cases: • Roles given • Roles hidden (only features)

  40. Relation classification: Results • Good results for a difficult task • One of the few systems to tackle several DIFFERENT relations between the same types of entities; this differs from the problem statement of other work on relations

  41. Role Extraction: Results • Junction tree algorithm • F-measure = (2 × Prec × Recall) / (Prec + Recall) • (Related work extracting “diseases” and “genes” reports an F-measure of 0.50)
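The F-measure on this slide is the harmonic mean of precision and recall:

```python
def f_measure(precision, recall):
    """F1 = (2 * Prec * Recall) / (Prec + Recall)."""
    return 2 * precision * recall / (precision + recall)
```

Unlike the arithmetic mean, the harmonic mean punishes imbalance: a system with perfect precision but 0.5 recall scores only about 0.67.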

  42. Features impact: Role extraction • Most important features: 1) Word 2) MeSH • F-measure (rel. + irrel. / only rel.): • All features: 0.71 / 0.73 • No word: 0.61 / 0.66 (-14.1% / -9.6%) • No MeSH: 0.65 / 0.69 (-8.4% / -5.5%)

  43. Features impact: Relation classification (rel. + irrel.) • Most important feature: Roles • Accuracy: • All feat. + roles: 82.0 • All feat. - roles: 74.9 (-8.7%) • All feat. + roles - Word: 79.8 (-2.8%) • All feat. + roles - MeSH: 84.6 (+3.1%)

  44. Features impact: Relation classification (rel. + irrel.) • Most realistic case: Roles not known • Most important features: 1) Word 2) MeSH • Accuracy: • All feat. - roles: 74.9 • All feat. - roles - Word: 66.1 (-11.8%) • All feat. - roles - MeSH: 72.5 (-3.2%)

  45. Conclusions • Classification of subtle semantic relations in bioscience text • Graphical models for the simultaneous extraction of entities and relationships • Importance of MeSH, lexical hierarchy

  46. Outline of Talk • Goal: Extract semantics from text • Information and relation extraction • Protein-protein interactions; using an existing database to gather labeled data

  47. Protein-Protein interactions • One of the most important challenges in modern genomics, with many applications throughout biology • There are several protein-protein interaction databases (BIND, MINT,..), all manually curated

  48. Protein-Protein interactions • Supervised systems require manually labeled data, while purely unsupervised approaches have yet to prove effective for these tasks. • Other approaches: semi-supervised learning, active learning, co-training. • We propose using resources already developed in the biomedical domain to gather labeled data for the task of classifying interactions between proteins

  49. HIV-1, Protein Interaction Database • Documents interactions between HIV-1 proteins and • host cell proteins • other HIV-1 proteins • disease associated with HIV/AIDS • 2224 pairs of interacting proteins, 65 types http://www.ncbi.nlm.nih.gov/RefSeq/HIVInteractions

  50. HIV-1, Protein Interaction Database
