Research in the Verspoor Lab
Text Mining • Information extraction from the biomedical literature • Entity recognition and normalization • Relation and event extraction • Last time, I promised that we would look at: • Ontologies as constraints for information extraction
Making BioNLP relevant • Recognition of OBO terms, relations • CRAFT corpus (first release later this year)
OpenDMAP extracts typed relations from the literature • Concept recognition tool • Connect ontological terms to literature instances • Built on Protégé knowledge representation system • Language patterns associated with concepts and slots • Patterns can contain text literals, other concepts, constraints (conceptual or syntactic), ordering information, or outputs of other processing. • Linked to many text analysis engines via UIMA • Best performance in BioCreative II IPS task • >500,000 instances of three predicates (with arguments) extracted from Medline Abstracts • [Hunter, et al., 2008] http://bionlp.sourceforge.net
OpenDMAP overview (diagram): free text, an ontology, and patterns go into OpenDMAP, which outputs extracted information.
OpenDMAP example (diagram)
Free text: Cyclin E2 interacts with Cdk2 in a functional kinase complex.
Ontology pattern: protein protein interaction := [int1] interacts with [int2]
Extracted information: protein protein interaction: interactor1: cyclin E2, interactor2: cdk2
PROTÉGÉ ONTOLOGY
CLASS: protein protein interaction
SLOT: interactor1 TYPE: molecule
SLOT: interactor2 TYPE: molecule
PATTERNS
{c-interact} := [interactor1] interacts with [interactor2]
{c-interact} := [interactor1] is bound by [interactor2]
…
BioCreative II Example
• Some BioCreative patterns for interact
{c-interact} := [interactor1] {w-is} {w-interact-verb1} {w-preposition} the? [interactor2];
{w-is} := is, are, was, were;
{w-interact-verb1} := co-immunoprecipitate, co-immunoprecipitates, co-immunoprecipitated, co-localize, co-localizes, co-localized;
{w-preposition} := among, between, by, of, with, to;
• Matched text: PMID 16494873, SENT_ID 16494873_114
Upon precipitation of the SOX10 protein with anti-HA antibody, Western blot detection revealed expression of UBC9-V5 (25 kDa) in the sample (Fig. 1, line 6), indicating that {UBC9 was co-immunoprecipitated with SOX10}.
INTERACTOR_1: UBC9 resolved to UniProtID: UBC9_RAT
INTERACTOR_2: SOX10 resolved to UniProtID: SOX10_RAT
{c-interact} := [UBC9_RAT]interactor_1, [SOX10_RAT]interactor_2
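The way the word classes expand into a concrete matcher can be sketched with a regular expression. This is only an illustration, not OpenDMAP's implementation: the word lists are copied from the slide, but the `SLOT` token pattern, the helper `alternation`, and the function `extract_interaction` are assumptions made for this sketch. In OpenDMAP the slot fillers are ontology-typed entities, not arbitrary tokens.

```python
import re

# Word classes copied from the BioCreative II patterns on the slide.
WORD_CLASSES = {
    "w-is": ["is", "are", "was", "were"],
    "w-interact-verb1": [
        "co-immunoprecipitate", "co-immunoprecipitates", "co-immunoprecipitated",
        "co-localize", "co-localizes", "co-localized",
    ],
    "w-preposition": ["among", "between", "by", "of", "with", "to"],
}


def alternation(name):
    """Turn a word class into a regex alternation, longest words first."""
    words = sorted(WORD_CLASSES[name], key=len, reverse=True)
    return "(?:" + "|".join(re.escape(w) for w in words) + ")"


# Assumption: a slot filler is one token of letters, digits, and hyphens.
SLOT = r"[A-Za-z0-9-]+"

# {c-interact} := [interactor1] {w-is} {w-interact-verb1} {w-preposition} the? [interactor2]
C_INTERACT = re.compile(
    rf"(?P<interactor1>{SLOT})\s+{alternation('w-is')}\s+"
    rf"{alternation('w-interact-verb1')}\s+{alternation('w-preposition')}\s+"
    rf"(?:the\s+)?(?P<interactor2>{SLOT})"
)


def extract_interaction(sentence):
    """Return (interactor1, interactor2) if the pattern matches, else None."""
    match = C_INTERACT.search(sentence)
    if match is None:
        return None
    return match.group("interactor1"), match.group("interactor2")
```

Applied to the matched clause from the slide, `extract_interaction("indicating that UBC9 was co-immunoprecipitated with SOX10.")` yields the pair `("UBC9", "SOX10")`. Resolving those strings to UniProt identifiers is a separate normalization step that the sketch does not attempt.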
BioCreative Results • 359 full-text articles in the test set • 385 interaction assertions produced • Performance averaged per article (to avoid dominance of a few assertion-heavy articles): P = 0.39, R = 0.31, F = 0.29 • Best result in the evaluation! • F score 10% higher than next-scoring system • F score > 3 standard deviations above mean • Recall 20% higher than next-scoring system
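Per-article averaging explains why the reported F (0.29) is lower than the harmonic mean of the reported P and R (about 0.35): F is computed for each article and then averaged, not derived from the averaged P and R. A minimal sketch of that averaging follows. The function names and the counts in the usage note are hypothetical, not BioCreative data.

```python
def prf(tp, fp, fn):
    """Precision, recall, and F1 for one article's counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


def macro_average(per_article_counts):
    """Average P, R, and F over articles, giving each article equal weight."""
    scores = [prf(tp, fp, fn) for tp, fp, fn in per_article_counts]
    n = len(scores)
    return tuple(sum(score[i] for score in scores) / n for i in range(3))
```

With two made-up articles, `macro_average([(2, 0, 0), (1, 1, 3)])`, the averaged F is about 0.667 while the harmonic mean of the averaged P (0.75) and R (0.625) is about 0.682, so the two numbers need not agree.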
BioCreative conclusions • Information extraction in biomedical text is hard • Linguistic variability in how concepts are expressed • Complex concepts with multiple “slots” • OpenDMAP advances the state of the art • Use of an ontology grounds the search for information • Flexibility of the pattern language to incorporate constraints at different levels (conceptual, lexical, word order, linguistic)
BioNLP’09: Methods
Protein_transport := [TRANSPORTED-ENTITY] translocation @(from {DET}? [TRANSPORT-ORIGIN]) @(to {DET}? [TRANSPORT-DESTINATION])
Bax translocation to mitochondria from the cytosol
Bax translocation from the cytosol to the mitochondria
Slide credit: Kevin B. Cohen
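The two example phrases suggest that the `@(...)` constituents may appear in either order. Under that reading, which is an assumption here and not a statement of OpenDMAP's documented semantics, the pattern can be approximated by matching the head first and then searching the remainder for each optional constituent independently. The determiner list, the single-token `ENTITY` pattern, and the function name are also assumptions made for this sketch.

```python
import re

# Assumption: {DET} covers these determiners; slot fillers are single tokens.
DET = r"(?:the\s+|a\s+|an\s+)?"
ENTITY = r"[A-Za-z0-9-]+"

HEAD = re.compile(rf"(?P<entity>{ENTITY})\s+translocation\b")
ORIGIN = re.compile(rf"\bfrom\s+{DET}(?P<filler>{ENTITY})")
DESTINATION = re.compile(rf"\bto\s+{DET}(?P<filler>{ENTITY})")


def extract_transport(phrase):
    """Fill a Protein_transport frame; origin and destination may come in any order."""
    head = HEAD.search(phrase)
    if head is None:
        return None
    frame = {"TRANSPORTED-ENTITY": head.group("entity")}
    tail = phrase[head.end():]
    # Each optional constituent is searched for on its own, so order does not matter.
    for slot, pattern in (("TRANSPORT-ORIGIN", ORIGIN),
                          ("TRANSPORT-DESTINATION", DESTINATION)):
        match = pattern.search(tail)
        if match:
            frame[slot] = match.group("filler")
    return frame
```

Both example phrases from the slide produce the same frame with this sketch: Bax as the transported entity, cytosol as the origin, and mitochondria as the destination.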
BioNLP’09: Methods Protein_transport := [TRANSPORTED-ENTITY] translocation @(from {DET}? [TRANSPORT-ORIGIN]) @(to {DET}? [TRANSPORT-DESTINATION]) Protein (Sequence Ontology) Cellular Component (Gene Ontology) Slide credit: Kevin B. Cohen
BioNLP’09: Methods Slide credit: Kevin B. Cohen
BioNLP’09: Methods • All event types represented as frames • Elements from ontology constrain every slot EVENT TYPE: REGULATION AtLoc: instance of biological_entity Cause: instance of protein CSite: instance of biological_concept or polypeptide_region Event_action: instance of trigger_word or detection_method Site: instance of biological_concept or polypeptide_region Theme: instance of protein or biological_process ToLoc: instance of biological_entity Sequence Ontology Molecular Interaction Ontology Gene Ontology Cell Cycle Ontology Slide credit: Kevin B. Cohen
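A frame whose slots are constrained by ontology classes can be sketched as a table of allowed classes plus an is-a check. The slot constraints below are copied from the REGULATION frame on the slide. The `IS_A` links, the class name `kinase`, and the function names are hypothetical and only illustrate how a subclass would satisfy a constraint.

```python
# Slot constraints copied from the REGULATION frame on the slide.
REGULATION_SLOTS = {
    "AtLoc": {"biological_entity"},
    "Cause": {"protein"},
    "CSite": {"biological_concept", "polypeptide_region"},
    "Event_action": {"trigger_word", "detection_method"},
    "Site": {"biological_concept", "polypeptide_region"},
    "Theme": {"protein", "biological_process"},
    "ToLoc": {"biological_entity"},
}

# Hypothetical is-a links (child -> parent), not taken from the real ontologies.
IS_A = {"kinase": "protein"}


def ancestors(cls):
    """Return the class followed by its parents, walking up the is-a links."""
    chain = []
    while cls is not None:
        chain.append(cls)
        cls = IS_A.get(cls)
    return chain


def slot_accepts(slot, filler_class):
    """True if the filler's class, or one of its ancestors, is allowed in the slot."""
    allowed = REGULATION_SLOTS[slot]
    return any(cls in allowed for cls in ancestors(filler_class))


def fill_frame(candidates):
    """Keep only candidate fillers (slot -> (text, class)) that satisfy the constraints."""
    return {slot: text for slot, (text, cls) in candidates.items()
            if slot_accepts(slot, cls)}
```

For example, a candidate of class `kinase` would be accepted as a Cause because it is a `protein` in the toy hierarchy, while a `biological_process` would be rejected as a Cause but accepted as a Theme.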
BioNLP’09: Methods Partial view of ontology—reality is a little bit less clean Slide credit: Kevin B. Cohen
BioNLP’09: Methods BTO: BRENDA Tissue Ontology CCO: Cell Cycle Ontology CTO: Cell Type Ontology GO: Gene Ontology SO: Sequence Ontology Slide credit: Kevin B. Cohen
BioNLP’09: Methods • Manual pattern-writing • Before availability of training data: based on native speaker intuitions, examples from PubMed, and variations on same, as in Cohen et al. (2004) • After release of training data: based on examination of corpus data, targeting high-frequency predicates only • Nominalizations predominated; used insights from Cohen et al. (2008) regarding Theme placement • Protein binding rules re-used from BioCreative II protein-protein interaction task • Eschewed use of wildcards Slide credit: Kevin B. Cohen
BioNLP’09: Results Task 1: P 10 points higher than second-highest Task 2: P 14 points higher than second-highest Task 3: P 3.4 points lower than highest (3/6) Slide credit: Kevin B. Cohen
BioNLP’09: Results Unofficial results: contribution of bug repairs Still the highest precision (#2 was 62.21) Slide credit: Kevin B. Cohen
BioNLP’09: Results • Contribution of coördination-handling • Bug-fixed results: F 27.62 (Task 1) • Without coordination-handling: F 24.72 • Decrease in F of 2.9 without coördination-handling Slide credit: Kevin B. Cohen
Syntax helps • 125I-labeled C3b was covalently deposited on CR2, when hemolytically active 125I-labeled C3 was added to Raji cells preincubated with iC3, factor B, properdin, and factor D, thus proving functionality of CR2-bound C3 convertase. <cr2> BINDS <c3 convertase> • CD8alpha(alpha) binds one HLA-A2/peptide molecule, interfacing with the alpha2 and alpha3 domains of HLA-A2 and also contacting beta2-microglobulin. <cd8alpha ( alpha )> BINDS <hla a2 / peptide molecule> • The binding of 109Cd to metallothionein and the thiol density of the protein were determined after incubation of a purified Zn/Cd-metallothionein preparation with either hydrogen peroxide alone, or with a number of free radical generating systems. <109cd> BINDS <metallothionein> • Although these shifts in alpha3 may provide a synergistic modulation of affinity, the binding of CD8 to MHC is clearly consistent with an avidity-based contribution from CD8 to TCR- peptide-MHC interactions. <Cd8> BINDS <major histocompatibility complex>
More complex examples • Complex noun phrases • The inactive C3 (iC3), which forms spontaneously in serum in low amounts by reaction of native C3 with H2O, binds noncovalently to the N-terminal part of CR2. <inactive c3> BINDS <cr2> • RelB binds transcriptionally active kappaB motifs in the TNF-alpha promoter in normal cells, and in vitro studies with macrophages isolated from RelB- deficient animals revealed impaired production of TNF-alpha in response to LPS and IFN-gamma. <relb> BINDS <tnf - alpha promoter> • Negation • TNP-BSA, however, did not bind to the CD4 receptor. <trinitrophenyl-bovine serum albumin> DOES_NOT_BIND <cd4 receptor> • Similarly, when cells expressing the wild type FSHR were treated with tunicamycin to prevent N-linked glycosylation, the resulting nonglycosylated FSHR was not able to bind FSH. <resulting nonglycosylated fsh receptor> DOES_NOT_BIND <follicle-stimulating hormone>
Coordination is particularly hard
• In contrast both the S4GGnM-R and the Man-R are able to bind Man-BSA.
<mannose receptor> BINDS <man bsa>
<s4ggnm - r> BINDS <man bsa>
• Purified recombinant NC1, like authentic NC1, also bound specifically to fibronectin, collagen type I, and a laminin 5/6 complex.
<authentic nc1> BINDS <laminin 5 / 6 complex>
<authentic nc1> BINDS <collagen type I>
<authentic nc1> BINDS <fibronectin>
<purified recombinant nc1> BINDS <laminin 5 / 6 complex>
<purified recombinant nc1> BINDS <collagen type I>
<purified recombinant nc1> BINDS <fibronectin>
• The nonvisual arrestins, beta-arrestin and arrestin3, but not visual arrestin, bind specifically to a glutathione S-transferase-clathrin terminal domain fusion protein.
*<Arrestin3> BINDS <glutathione s-transferase-clathrin terminal domain fusion protein>
<beta arrestin> BINDS <glutathione s-transferase-clathrin terminal domain fusion protein>
<nonvisual arrestin> BINDS <glutathione s-transferase-clathrin terminal domain fusion protein>
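The outputs above distribute the relation over every combination of coordinated subjects and objects. Once the conjuncts have been identified, which is the hard part and is not shown here, the distribution step itself is a cross product. The function name `distribute` is an assumption made for this sketch.

```python
from itertools import product


def distribute(subjects, predicate, objects):
    """Emit one (subject, predicate, object) triple per subject/object pair."""
    return [(s, predicate, o) for s, o in product(subjects, objects)]
```

With the two NC1 subjects and the three binding partners from the second example, `distribute` returns the six BINDS triples listed on the slide. Excluded conjuncts such as "but not visual arrestin" would have to be filtered out before this step.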
BioNLP Shared Task ‘11 • Extension of BioNLP’09 tasks • Generalization to full text (from abstracts) • Additional event types: post-translational modifications and catalysis • Methods: • Based on empirically derived patterns • Derived from training data + manual refinement • Using dependency relations (syntax) • Work of Haibin Liu (postdoc)
Integrating background knowledge • Can improve OpenDMAP precision with minimal cost to recall • Take advantage of background knowledge • Tighten constraints on slot fillers in the ontology • No change to existing patterns • Proof of concept: • Distinguish among several types of protein activation (enzyme and receptor) in GeneRIFs • Utilize Gene Ontology annotations
Refining selectional restrictions
TP: [GeneRIF 104155] an ER stress induces the activation of [caspase-12_protein - catalytic activity]activated_entity via [caspase-3_protein]activator prevented
FP: [GeneRIF 105594] factor Xa can induce mesangial cell proliferation through the activation of ERK_protein via PAR2_protein in mesangial cells
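One way to picture the tightened slot constraint is as a lookup against Gene Ontology annotations: a candidate filler for the activated-entity slot is kept only if it carries the GO term that the activation subtype requires. The sketch below is an assumption about how such a filter could look, not the lab's implementation. The annotation table is toy data, with only the caspase-12 entry taken from the slide, and the subtype-to-term mapping and the names `protein_A` and `protein_B` are hypothetical.

```python
# Toy annotation table. Only the caspase-12 entry mirrors the slide;
# protein_A and protein_B are hypothetical placeholders.
GO_ANNOTATIONS = {
    "caspase-12": {"catalytic activity"},
    "protein_A": {"receptor activity"},
    "protein_B": set(),
}

# Assumed mapping from activation subtype to the GO term its filler must carry.
REQUIRED_TERM = {
    "enzyme_activation": "catalytic activity",
    "receptor_activation": "receptor activity",
}


def activation_types(activated_entity):
    """Return the activation subtypes whose GO constraint the entity satisfies."""
    annotations = GO_ANNOTATIONS.get(activated_entity, set())
    return sorted(kind for kind, term in REQUIRED_TERM.items()
                  if term in annotations)
```

An entity with no matching annotation yields an empty list, so the pattern match would be discarded without changing the pattern itself.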
Biological entities • Genes (and their products) are particularly valuable to recognize, but are not the only entities of interest: • Diseases • Drugs, Chemicals, and other treatments • Anatomical and other locations • Time and temporal relationships • Methods and evidence • Molecular functions, biological processes
Two dictionary-based tools tested against CRAFT • UIMA ConceptMapper http://incubator.apache.org/uima/sandbox.html#concept.mapper.annotator • stemming and case matching relaxation • non-contiguous spans • ignore stopwords • order-independent lookup • Open Biomedical Annotator http://bioportal.bioontology.org/annotator • ignore stopwords • partial word matches
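The matching strategies listed above can be illustrated with a small normalizing dictionary lookup. This is a sketch of the general idea only and does not reproduce how ConceptMapper or the Open Biomedical Annotator work internally. The stopword list, the crude `stem` function, the identifier `EX:0000001`, and all function names are assumptions.

```python
# Assumed stopword list, for illustration only.
STOPWORDS = {"of", "the", "in", "a", "an"}

# Term labels to index. EX:0000001 is a made-up identifier.
TERMS = {
    "GO:0005694": "Chromosome",
    "GO:0005623": "Cell",
    "EX:0000001": "induction of apoptosis",
}


def stem(token):
    """Crude plural stripping, a stand-in for a real stemmer."""
    return token[:-1] if token.endswith("s") and len(token) > 3 else token


def normalize(text, case_insensitive=True, stemming=True,
              drop_stopwords=True, order_independent=True):
    """Reduce a label or mention to a lookup key under the chosen strategies."""
    tokens = text.split()
    if case_insensitive:
        tokens = [t.lower() for t in tokens]
    if drop_stopwords:
        tokens = [t for t in tokens if t.lower() not in STOPWORDS]
    if stemming:
        tokens = [stem(t) for t in tokens]
    if order_independent:
        tokens = sorted(tokens)
    return tuple(tokens)


def build_index(terms, **options):
    """Map each normalized label to its term identifier."""
    return {normalize(label, **options): term_id
            for term_id, label in terms.items()}


def lookup(mention, index, **options):
    """Return the term identifier for a mention, or None if it is not found."""
    return index.get(normalize(mention, **options))
```

With every strategy enabled, `lookup("apoptosis induction", build_index(TERMS))` finds the "induction of apoptosis" entry and "Chromosomes" resolves to GO:0005694. With every strategy switched off, only the exact label "Chromosome" matches, which shows how strongly the configuration affects recall.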
Best run results • CM/CTO: stemming + FindAllMatches: false • OBA/CTO: using default stop words • CM/GO_CC: stemming + caseMatch: insensitive • CM/ChEBI: caseMatch: sensitive
Concept Matching Conclusions • The kinds of terms in the ontology matter • The strategies used in the dictionary matching tools matter • OpenDMAP will support strategies that go beyond dictionary matching …
Evaluation via Test Suite • Big picture: How to evaluate ontology concept recognition systems? • Traditional approach: “corpus” • Expensive • Time-consuming to produce • Redundancy for some things… • …underrepresentation of others • Immediate (narrow) goal of this work: Use techniques from software testing and descriptive linguistics to build test suites that: • Control test data • Eliminate redundancy • Systematic coverage (Oepen 1998) • Immediate (broad) goal of this work: Are there general principles for test suite design? Slide credit: Kevin B. Cohen
Methods • Steps: develop “catalogue” of dimensions along which terms vary • Use insights from linguistics and from how we know concept recognition systems work • Structural aspects: length • Content aspects: typography, orthography, lexical contents (function words)… • …to build a structured set of test cases • Also compare to other test suite work (Cohen et al. 2004) to look for common principles Slide credit: Kevin B. Cohen
Structured test suite
Canonical -> Non-canonical
• GO:0000133 Polarisome -> Polarisomes
• GO:0000108 Repairosome -> Repairosomes
• GO:0000786 Nucleosome -> Nucleosomes
• GO:0001660 Fever -> Fevers
• GO:0001726 Ruffle -> Ruffles
• GO:0005623 Cell -> Cells
• GO:0005694 Chromosome -> Chromosomes
• GO:0005814 Centriole -> Centrioles
• GO:0005874 Microtubule -> Microtubules
• induction of apoptosis -> apoptosis induction (Syntax)
• cell migration -> cell migrated (Part of speech)
• ensheathment of neurons -> ensheathment of some neurons
Slide credit: Kevin B. Cohen
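Non-canonical variants of this kind can be generated mechanically from the canonical labels. The sketch below shows three such transformations that reproduce examples from the slide. The function names and the dimension labels in the output are assumptions, and the pluralization rule is deliberately naive.

```python
def pluralize(term):
    """Naive pluralization: append "s"."""
    return term + "s"


def swap_of(term):
    """Rewrite "X of Y" as "Y X"; return None if the term has no " of "."""
    head, sep, tail = term.partition(" of ")
    return f"{tail} {head}" if sep else None


def insert_word(term, word="some"):
    """Rewrite "X of Y" as "X of <word> Y"; return None if there is no " of "."""
    head, sep, tail = term.partition(" of ")
    return f"{head} of {word} {tail}" if sep else None


def build_cases(term_id, label):
    """Return (term_id, variant, dimension) test cases for one canonical label."""
    candidates = [
        ("plural", pluralize(label)),
        ("syntax", swap_of(label)),
        ("inserted word", insert_word(label)),
    ]
    return [(term_id, variant, dimension)
            for dimension, variant in candidates if variant is not None]
```

For instance, `swap_of("induction of apoptosis")` gives "apoptosis induction" and `insert_word("ensheathment of neurons")` gives "ensheathment of some neurons". Each generated case keeps the identifier of its canonical term, so a recognizer's output can be checked automatically.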
Methods/Results • Gene Ontology, revision 9/24/2009 • Canonical: 188 • Non-canonical: 117 • Observation: • A 5:1 ratio of “dirty” to “clean” tests is a mark of “mature” testing • Applied publicly available concept recognition system Slide credit: Kevin B. Cohen
Results • 97.9% of canonical terms were recognized • All exceptions contain the word “in” • No non-canonical terms were recognized • What would it take to recognize the error pattern with canonical terms with a corpus-based approach? • General principles: Length, ortho/typography (numerals/punctuation), function/stopwords, syntactic context Slide credit: Kevin B. Cohen
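Scoring a recognizer against a structured test suite amounts to computing a recognition rate per category. The sketch below uses a toy exact-match recognizer and a four-case suite, both hypothetical, so its numbers are not the results reported above. As a side calculation, 97.9% of 188 canonical terms is consistent with 184 recognized and 4 missed.

```python
def recognition_rate(recognizer, cases):
    """Fraction of (text, expected_id) cases where the recognizer returns expected_id."""
    hits = sum(1 for text, expected in cases if recognizer(text) == expected)
    return hits / len(cases)


def evaluate(recognizer, suite):
    """Recognition rate for each category of the test suite."""
    return {category: recognition_rate(recognizer, cases)
            for category, cases in suite.items()}


# Toy recognizer: case-insensitive exact match against two GO labels.
LABELS = {"cell": "GO:0005623", "chromosome": "GO:0005694"}


def exact_match(text):
    return LABELS.get(text.lower())


# Toy suite built from two terms that appear in the structured test suite.
SUITE = {
    "canonical": [("Cell", "GO:0005623"), ("Chromosome", "GO:0005694")],
    "non-canonical": [("Cells", "GO:0005623"), ("Chromosomes", "GO:0005694")],
}
```

Because every test case is labeled with its category and its expected identifier, a failure pattern such as "all misses share one function word" shows up directly in the per-category breakdown instead of having to be dug out of corpus-level scores.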