NLP for Biomedicine - Ontology building and Text Mining -

Junichi Tsujii, GENIA Project (http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/), Computer Science, Graduate School of Information Science and Technology, University of Tokyo, Japan

Presentation Transcript


  1. NLP for Biomedicine - Ontology building and Text Mining - • Junichi Tsujii • GENIA Project (http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/) • Computer Science, Graduate School of Information Science and Technology, University of Tokyo, Japan

  2. My Talk
      1. Background: Why NLP in Biomedicine
      2. Examples of NLP in Biomedicine
      3. Text Mining and NLP
      4. Our Current Work
         4.1 Terms and NE
         4.2 Resource Building
         4.3 Event Recognition
      5. Concluding Remarks

  4. Why NLP in Biomedicine? • From Biology and Medical Sciences • From Natural Language Processing

  6. Genome sequencing (by D. Devos)

  7. Sequence, structure and function [Diagram] Information Exploitation • Function • Sequence • Structure

  8. Scientists in areas such as molecular biology and biochemistry aim to discover new biological entities and their functions. Typical cases could be discoveries of the implications of new proteins and genes in an already known process, or implication of proteins with previously characterized functions in a separate process. The use of available information (published papers, etc.) is a key step for the discovery process, since in many cases weak or indirect evidences about possible relations hidden in the literature are used to substantiate working hypothesis that are experimentally explored. [C.Blaschke, A.Valencia: 2001]

  11. Why NLP in Biomedicine? • From Biology and Medical Sciences • From Natural Language Processing

  12. [Diagram] Statistical Biases • Interpretation based on Knowledge • Grammar • Syntax-Semantic Mapping • Knowledge Acquisition • Machine Learning • Huge Ontology: Next Revolution? • Bio-Medical Application: UMLS, Gene Ontology, etc. • Revolution in LT in the last decade • Information • Language • Texts • Knowledge

  13. My Talk
      1. Background: Why NLP in Biomedicine
      2. Examples of NLP in Biomedicine
      3. Text Mining and NLP
      4. Our Current Work
         4.1 Terms and NE
         4.2 Resource Building
         4.3 Event Recognition
      5. Concluding Remarks

  14. What can we do in biomedical domains with NLP? Examples

  15. Protein-Protein Interaction extracted from texts by C. Blaschke

  16. Organized Knowledge through terms by C. Blaschke

  17. From Data to Understanding: Interpretation by Language (Oliveros, Blaschke et al., GIW 2000)

  18. Information Extraction from Texts • Question Answering Systems

  19. Characteristics of Signal Pathway (1)
      • Granularity of Knowledge Units: different types of entities which are interrelated with each other
        • Cells, sub-locations of cells
        • Proteins, substructures of proteins, subclasses of proteins
        • Ions, other chemical substances
        • Genes, RNA, DNA
      [Figure: G-protein coupled receptor pathway model, from TRANSPATH]

  20. CSNDB (National Institute of Health Sciences)
      • A data and knowledge base for signaling pathways of human cells.
      • It compiles information on biological molecules, sequences, structures, functions, and the biological reactions that transfer cellular signals.
      • Signaling pathways are compiled as binary relationships between biomolecules and represented by automatically drawn graphs.
      • CSNDB is built on ACEDB and the inference engine CLIPS, and is linked to TRANSFAC.
      • The final goal is a computerized model of various biological phenomena.

  21. Example 1, excerpted from [Takai98]
      • A Standard Reaction
      • Signal_Reaction: “EGF receptor → Grb2”
      • From_molecule “EGF receptor”
      • To_molecule “Grb2”
      • Tissue “liver”
      • Effect “activation”
      • Interaction “SH2+phosphorylated Tyr”
      • Reference [Yamauchi_1997]

  22. Example 3, excerpted from [Takai98]
      • A Polymerization Reaction
      • Signal_Reaction: “Ah receptor + HSP90”
      • Component “Ah receptor” “HSP90”
      • Effect “activationdissociation”
      • Interaction “PAS domain of Ah receptor”
      • Activity “inactivation of Ah receptor”
      • Reference [Powell-Coffman_1998]
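
The CSNDB slides describe pathways as binary relationships between biomolecules, drawn as graphs. A minimal Python sketch of that idea follows. The `SignalReaction` class and `build_graph` helper are hypothetical names chosen for illustration; only the field names and values are taken from the Example 1 record, and nothing here reflects how CSNDB itself is implemented on ACEDB and CLIPS.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class SignalReaction:
    """One binary relationship between two biomolecules (hypothetical schema)."""
    from_molecule: str
    to_molecule: str
    effect: str
    tissue: str = ""
    interaction: str = ""
    reference: str = ""


def build_graph(reactions):
    """Return an adjacency list: molecule -> list of (target, effect) edges."""
    graph = defaultdict(list)
    for reaction in reactions:
        graph[reaction.from_molecule].append((reaction.to_molecule, reaction.effect))
    return dict(graph)


# Values copied from the Example 1 record on slide 21.
egf_grb2 = SignalReaction(
    from_molecule="EGF receptor",
    to_molecule="Grb2",
    effect="activation",
    tissue="liver",
    interaction="SH2+phosphorylated Tyr",
    reference="Yamauchi_1997",
)

graph = build_graph([egf_grb2])
print(graph)
```

A multi-component record such as Example 3 would need a different shape (a list of components rather than a single from/to pair), which this sketch does not attempt.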

  23. My Talk
      1. Background: Why NLP in Biomedicine
      2. Examples of NLP in Biomedicine
      3. Text Mining and NLP
      4. Our Current Work
         4.1 Terms and NE
         4.2 Resource Building
         4.3 Event Recognition
      5. Concluding Remarks

  24. [Diagram] Observed Data • Theories in Science • Observable • Non-Observable • Data Mining

  25. [Diagram] Observable • Knowledge in Mind • Descriptions of Knowledge • Objects of Science • Observed Data • Non-Observable • Qualitative, Structures, Classification • Mathematical Formula • Texts • Ontology • Quantitative Data

  26. [Diagram] Knowledge in Mind • Descriptions of Knowledge • Objects of Science • Non-Observable • Observable • Natural Language • Incomplete System • Diversity • Ambiguity

  27. [Diagram] Observed Data • Theories in Science • Observable • Non-Observable • Data Mining

  28. [Diagram] Knowledge in Mind • Descriptions of Knowledge • Objects of Science • Observed Data • Non-Observable • Observable • Qualitative, Structures, Classification • Mathematical Formula • Texts • Ontology • Data Mining + Text Mining • Quantitative Data

  29. [Diagram] Objects of Science • Descriptions of Knowledge • Knowledge in Mind • Non-Observable • Observable • Characteristics of Knowledge • Data Mining • Text Mining • Characteristics of Language

  30. [Diagram] Knowledge in Mind • Descriptions of Knowledge • Objects of Science • Non-Observable • Observable • Natural Language • Incomplete System • Diversity • Ambiguity

  32. My Talk
      1. Background: Why NLP in Biomedicine
      2. Examples of NLP in Biomedicine
      3. Text Mining and NLP
      4. Our Current Work
         4.1 Terms and NE
         4.2 Resource Building
         4.3 Event Recognition
      5. Concluding Remarks

  33. Terms are the basic units of knowledge • Classification, Features • NE Recognition • Event Recognition • Semantic Disambiguation

  34. [Diagram] Linking Problem • Diversity • Lexicon • Static Processing • Term Recognition • Ambiguity • Context Dependent • Dynamic Processing
      Task difficulties in molecular biology
      • Inconsistent naming conventions
        • e.g. IL-2, IL2, Interleukin 2, Interleukin-2, Il-2
        • NF kappa B, NF-kappa B, (NF)-kappa B, NF-Kappa B, …
      • Wide-spread synonymy
        • Many synonyms in wide usage, e.g. PKB and Akt
        • cyclin-dependent kinase inhibitor p27, p27kip1
        • <cdc25, cdc25a>, <p52shc, p52(Shc)>
      • Open, growing vocabulary for many classes
      • Cross-over of names between classes depending on context
        • Protein vs DNA
      • Frequent use of coordination inside term formations
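
To make the "inconsistent naming conventions" point concrete, here is a small Python sketch that collapses purely orthographic variants (case, hyphens, brackets, spacing) onto one lookup key. The `normalize_term` function is a hypothetical helper written for this illustration, not a method presented in the talk.

```python
import re


def normalize_term(term: str) -> str:
    """Lowercase the term and drop whitespace, hyphens and brackets."""
    return re.sub(r"[\s\-()\[\]]+", "", term.lower())


il2_variants = ["IL-2", "IL2", "Il-2"]
nfkb_variants = ["NF kappa B", "NF-kappa B", "(NF)-kappa B", "NF-Kappa B"]

print({normalize_term(v) for v in il2_variants})   # a single key: {'il2'}
print({normalize_term(v) for v in nfkb_variants})  # a single key: {'nfkappab'}

# Spelled-out names and true synonyms are NOT unified by string normalization:
print(normalize_term("Interleukin 2") == normalize_term("IL-2"))  # False
print(normalize_term("PKB") == normalize_term("Akt"))             # False
```

The last two lines show why the slide separates diversity (handled with a lexicon) from simple spelling variation: "Interleukin 2" versus "IL-2" and "PKB" versus "Akt" need a synonym resource, and aggressive normalization can also merge names that should stay distinct.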

  35. Ambiguity • Abbreviation Extraction (Schwartz 2003) • Extracts short and long form pairs
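
The slide only names the abbreviation extraction step, so the following is a simplified Python sketch in the spirit of the cited approach (Schwartz 2003): for a pattern "long form (SHORT)", match the characters of the short form right to left against the preceding words, requiring the first character to start a word. The function names, the length limits, and the example sentence are assumptions made for this illustration; the published algorithm includes further validity checks that are omitted here.

```python
import re


def best_long_form(short_form: str, candidate: str):
    """Return the shortest suffix of `candidate` that matches `short_form`, or None."""
    s = len(short_form) - 1
    l = len(candidate) - 1
    while s >= 0:
        ch = short_form[s].lower()
        if not ch.isalnum():          # skip hyphens etc. in the short form
            s -= 1
            continue
        # Move left until the character matches; the first character of the
        # short form must additionally sit at the start of a word.
        while (l >= 0 and candidate[l].lower() != ch) or (
            s == 0 and l > 0 and candidate[l - 1].isalnum()
        ):
            l -= 1
        if l < 0:
            return None
        l -= 1
        s -= 1
    start = candidate.rfind(" ", 0, l + 1) + 1
    return candidate[start:]


def extract_pairs(sentence: str):
    """Find (short form, long form) pairs for 'long form (SHORT)' patterns."""
    pairs = []
    for match in re.finditer(r"\(([^()]+)\)", sentence):
        short_form = match.group(1).strip()
        plausible = (
            2 <= len(short_form) <= 10
            and short_form[0].isalnum()
            and any(c.isalpha() for c in short_form)
        )
        if not plausible:
            continue
        words = sentence[: match.start()].split()
        # Search window: at most min(|SF| + 5, 2 * |SF|) words before the bracket.
        max_words = min(len(short_form) + 5, len(short_form) * 2)
        long_form = best_long_form(short_form, " ".join(words[-max_words:]))
        if long_form:
            pairs.append((short_form, long_form))
    return pairs


sentence = (
    "The effect of tumour necrosis factor (TNF) on interleukin 2 (IL-2) "
    "expression was studied."
)
print(extract_pairs(sentence))
```

On the made-up sentence above this yields the pairs ("TNF", "tumour necrosis factor") and ("IL-2", "interleukin 2").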

  36. Experiment [Tsuruoka et al., SIGIR 03]
      • Corpus
        • MEDLINE: the largest collection of abstracts in the biomedical domain
      • Rule learning
        • 83,142 abstracts
        • Obtained rules: 14,158
      • Evaluation
        • 18,930 abstracts
        • Count the occurrences of each generated variant.
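
The evaluation step (generate variants with rules, then count how often each variant occurs) can be sketched as below. This is only an illustration of the counting procedure: the three rewrite rules are hand-written and hypothetical, whereas the experiment learned its 14,158 rules from data, and the three "abstracts" are made-up strings rather than MEDLINE text.

```python
from collections import Counter

# Hypothetical rewrite rules (old substring -> new substring), for illustration only.
RULES = [("-", " "), ("-", ""), ("tumour", "tumor")]


def generate_variants(term, rules=RULES, max_steps=2):
    """Apply each rule once per step, collecting every distinct string produced."""
    seen = {term}
    frontier = {term}
    for _ in range(max_steps):
        produced = set()
        for current in frontier:
            for old, new in rules:
                if old in current:
                    variant = current.replace(old, new, 1)
                    if variant not in seen:
                        seen.add(variant)
                        produced.add(variant)
        frontier = produced
    return seen


def count_variants(variants, abstracts):
    """Count substring occurrences of every variant across all abstracts."""
    counts = Counter()
    for text in abstracts:
        for variant in variants:
            counts[variant] += text.count(variant)
    return counts


toy_abstracts = [
    "NF-kappa B binds DNA.",
    "Activation of NF kappa B was observed.",
    "NF-kappa B and NF-kappa B again.",
]
variants = generate_variants("NF-kappa B")
counts = count_variants(variants, toy_abstracts)
for variant in sorted(variants):
    print(variant, counts[variant])
```

Variants that are generated but never observed (here "NFkappa B") receive a count of zero, which is the kind of signal a variant generator can use to rank its output.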

  37. Results: “NF-kappa B”

  38. Results: “antiinflammatory effect”

  39. Results: “tumour necrosis factor alpha”

  40. [Diagram] Linking Problem • Diversity • Lexicon • Static Processing • Term Recognition • Ambiguity • Context Dependent • Dynamic Processing
      Task difficulties in molecular biology
      • Inconsistent naming conventions
        • e.g. IL-2, IL2, Interleukin 2, Interleukin-2, Il-2
        • NF kappa B, NF-kappa B, (NF)-kappa B, NF-Kappa B, …
      • Wide-spread synonymy
        • Many synonyms in wide usage, e.g. PKB and Akt
        • cyclin-dependent kinase inhibitor p27, p27kip1
        • <cdc25, cdc25a>, <p52shc, p52(Shc)>
      • Open, growing vocabulary for many classes
      • Cross-over of names between classes depending on context
        • Protein vs DNA
      • Frequent use of coordination inside term formations

  41. GENIA Ontology: Substance
      +substance-+-compound-+-organic-+-nucleic_acid-+-poly_nucleotides
                 |          |         |              +-nucleotide
                 |          |         |              +-DNA
                 |          |         |              +-RNA
                 |          |         +-amino_acid-+-peptide
                 |          |         |            +-amino_acid_monomer
                 |          |         |            +-protein
                 |          |         +-lipid
                 |          |         +-carbohydrate
                 |          |         +-other_organic_compounds
                 |          +-inorganic
                 +-atom

  42. GENIA Ontology: Source
      +-source-+-natural-+-organism-+-multi_cell
               |         |          +-mono_cell
               |         |          +-virus
               |         +-body_part
               |         +-tissue
               |         +-cell_type
               +-artificial-+-cell_line
                            +-other_artificial_sources
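
The two trees on slides 41 and 42 can be held in a nested dictionary, which makes "is-a" questions easy to answer. The sketch below is one possible encoding in Python; `ONTOLOGY`, `path_to` and `is_a` are hypothetical names, and the class names are copied from the slides (with `other_artificial_sources` read as a child of `artificial`).

```python
ONTOLOGY = {
    "substance": {
        "compound": {
            "organic": {
                "nucleic_acid": {
                    "poly_nucleotides": {}, "nucleotide": {}, "DNA": {}, "RNA": {},
                },
                "amino_acid": {
                    "peptide": {}, "amino_acid_monomer": {}, "protein": {},
                },
                "lipid": {},
                "carbohydrate": {},
                "other_organic_compounds": {},
            },
            "inorganic": {},
        },
        "atom": {},
    },
    "source": {
        "natural": {
            "organism": {"multi_cell": {}, "mono_cell": {}, "virus": {}},
            "body_part": {},
            "tissue": {},
            "cell_type": {},
        },
        "artificial": {"cell_line": {}, "other_artificial_sources": {}},
    },
}


def path_to(tree, target, prefix=()):
    """Depth-first search; return the tuple of classes from the root to `target`."""
    for name, children in tree.items():
        here = prefix + (name,)
        if name == target:
            return here
        found = path_to(children, target, here)
        if found:
            return found
    return None


def is_a(cls, ancestor, tree=ONTOLOGY):
    """True if `ancestor` lies on the path from the root to `cls`."""
    path = path_to(tree, cls)
    return path is not None and ancestor in path


print(path_to(ONTOLOGY, "protein"))
print(is_a("DNA", "nucleic_acid"), is_a("DNA", "source"))
```

With this encoding a tagger can report a fine-grained class such as `protein` while an application asks the coarser question of whether the term is any `organic` compound.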

  43. Number of Tagged Objects
      • Texts: 2,500 MEDLINE Abstracts
        • Papers on Transcription Factors in Human blood cells
        • 550,000 words, 20,000 sentences
      • Tagged objects: 147,000
        • Protein: ~ 77,000
        • DNA: ~ 24,000
        • RNA: ~ 2,400
        • Source: ~ 27,000
        • Other: ~ 37,000

  44. Distributions of Semantic Classes

  45. Extension of GENIA Ontology
      • Small classes (to be embedded in UMLS)
      • 5242 terms labelled with ‘other_names’ class
        • Events, Biological reactions: 3800
        • Disease: 636
          • Names of Diseases: 501
          • Treatments: 61
          • Diagnoses: 52
          • Pathology: 3
          • Others: 39
        • Experiments: 578
          • Methods: 493
          • Materials: 25
          • Others: 60
        • Others: 228

  46. Biomedical NE Task (Collier Coling00, Kazama ACL02, Kim ISMB02)
      • Recognize “names” in the text
        • Technical terms expressing proteins, genes, cells, etc.
      • Identify and classify
      Example: Thus, CIITA not only activates the expression of class II genes but recruits another B cell-specific coactivator to increase transcriptional activity of class II promoters in B cells.
      Class labels shown on the slide: PROTEIN, DNA, CELLTYPE, DNA

  47. NE Task as Classification
      • Each word is assigned to a class (tag) representing the semantic class and the position in the term
      • The task is reduced to a tagging task
      • We can use methods developed for tagging
      • The structure is encoded in a tag
      • BIO (Begin, Inside, and Other) tagging
      Words:    …
      BIO tags: O B-X I-X I-X O O B-Y O O
      (B-X I-X I-X: term of class X; B-Y: term of class Y; O: OTHER)
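
The BIO encoding described on this slide can be sketched in a few lines of Python. The helper names `spans_to_bio` and `bio_to_spans` are hypothetical, and the span boundaries in the example are one reading of the class II promoters sentence from slide 46, not annotations taken from the GENIA corpus.

```python
def spans_to_bio(tokens, spans):
    """Encode (start, end_exclusive, label) token spans as one BIO tag per token."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags


def bio_to_spans(tags):
    """Decode BIO tags back into (start, end_exclusive, label) spans."""
    spans, start, label = [], None, None
    for i, tag in enumerate(list(tags) + ["O"]):   # sentinel closes a final span
        continues = tag.startswith("I-") and start is not None and tag[2:] == label
        if start is not None and not continues:
            spans.append((start, i, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return spans


tokens = ["activity", "of", "class", "II", "promoters", "in", "B", "cells"]
spans = [(2, 5, "DNA"), (6, 8, "CELLTYPE")]   # assumed boundaries, for illustration

tags = spans_to_bio(tokens, spans)
print(list(zip(tokens, tags)))
print(bio_to_spans(tags) == spans)
```

Because both the class and the position inside the term live in the tag, any per-word classifier or sequence tagger can be applied without a separate chunking step.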

  48. NE Tagging Illustrated
      • Classify a word depending on the context
      Words:    activity of class II promoters in
      POS tags: P N Sym Ns P N
      BIO tags: O O B-DNA I-DNA
      [Diagram] conversion to features • context • classifier
      • Deterministic tagging: only the most probable tag at each word (SVM)
      • Viterbi tagging: the most probable sequence among all (probabilistic models)
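
The difference between deterministic (per-word) tagging and Viterbi tagging can be shown with a toy example in Python. All numbers below are invented for illustration: the emission probabilities stand in for a classifier's per-word scores, and the only transition knowledge assumed is that I-DNA may not follow O or start a sentence. This is a generic Viterbi decoder, not the specific models cited in the talk.

```python
import math

TAGS = ["O", "B-DNA", "I-DNA"]
NEG_INF = float("-inf")


def log_scores(probs):
    return {tag: math.log(p) for tag, p in probs.items()}


# Made-up per-word tag probabilities for the words "of", "class", "II".
emissions = [
    log_scores({"O": 0.90, "B-DNA": 0.05, "I-DNA": 0.05}),
    log_scores({"O": 0.50, "B-DNA": 0.40, "I-DNA": 0.10}),
    log_scores({"O": 0.20, "B-DNA": 0.10, "I-DNA": 0.70}),
]
start = {"O": 0.0, "B-DNA": 0.0, "I-DNA": NEG_INF}
transition = {prev: {cur: 0.0 for cur in TAGS} for prev in TAGS}
transition["O"]["I-DNA"] = NEG_INF   # an Inside tag cannot follow Other


def greedy_decode(emissions):
    """Deterministic tagging: pick the best tag at each word independently."""
    return [max(scores, key=scores.get) for scores in emissions]


def viterbi_decode(emissions, transition, start):
    """Return the highest-scoring tag sequence under emission + transition scores."""
    best = [{t: start[t] + emissions[0][t] for t in TAGS}]
    backpointers = []
    for scores in emissions[1:]:
        prev = best[-1]
        column, pointers = {}, {}
        for t in TAGS:
            p = max(TAGS, key=lambda q: prev[q] + transition[q][t])
            column[t] = prev[p] + transition[p][t] + scores[t]
            pointers[t] = p
        best.append(column)
        backpointers.append(pointers)
    path = [max(TAGS, key=lambda t: best[-1][t])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]


print(greedy_decode(emissions))
print(viterbi_decode(emissions, transition, start))
```

With these numbers the greedy decoder outputs O, O, I-DNA, which is not a well-formed BIO sequence, while Viterbi returns O, B-DNA, I-DNA because it scores whole sequences rather than individual words.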
