1 / 64

GO Tag: Assigning Gene Ontology Labels to Medline Abstracts

GO Tag: Assigning Gene Ontology Labels to Medline Abstracts. Robert Gaizauskas. Natural Language Processing Group Department of Computer Science. M. Ghanem, Tom Barnwell , Y. Guo Department of Computing. GO Tag: Assigning Gene Ontology Labels to Medline Abstracts. Robert Gaizauskas.

irma
Télécharger la présentation

GO Tag: Assigning Gene Ontology Labels to Medline Abstracts

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. GO Tag: Assigning Gene Ontology Labels to Medline Abstracts Robert Gaizauskas Natural Language Processing Group Department of Computer Science

  2. M. Ghanem, Tom Barnwell, Y. Guo Department of Computing GO Tag: Assigning Gene Ontology Labels to Medline Abstracts Robert Gaizauskas N. Davis, Y.K. Guo, H. Harkema Natural Language Processing Group Department of Computer Science J. Ratcliffe

  3. Outline • Context • Project Background • The Gene Ontology • Go Annotation in Model Organism Databases • Medline • Go Tagging Tasks • User types/scenarios • Possible tasks • Related Work • Data sets/Gold Standards • Approaches and Results to Date • Lexical lookup • Vector Space Similarity • Machine Learning • Exploiting the Results in Search Tools NaCTeM Seminar

  4. Project Background • Work is funded by the EPSRC as a Best Practice Project for collaboration between DiscoveryNet and myGrid -- E-Science Pilot Projects (2001-5) • Both projects • have developed text mining and data analysis components -- complementary approaches NLP vs. datamining/statistical analysis • workflow models for co-ordinating distributed services • working on life science applications • Aim: to develop a unified real-time e-Science text-mining infrastructure that builds upon and extends the technologies and methods developed by both Discovery Net and myGrid • Software engineering challenge: integrate complementary service-based text mining capabilities with different metadata models into a single framework • Application challenge: annotate biomedical abstracts with semantic categories from the Gene Ontology NaCTeM Seminar

  5. The Gene Ontology • “The Gene Ontology project provides a controlled vocabulary to describe gene and gene product attributes in any organism”http://www.geneontology.org/ • Consists of three structured, controlled vocabularies (ontologies) that describe gene products in terms of their associated: • biological processes • cellular components • molecular functions in a species-independent manner • E.g. gene product cytochrome c can be described by • the molecular function term electron transporter activity • the biological process terms oxidative phosphorylation and induction of cell death • the cellular component terms mitochondrial matrix and mitochondrial inner membrane NaCTeM Seminar

  6. Gene Ontology (cont) From: Gene Ontology: tool for the unification of biology. The Gene Ontology Consortium (2000) Nature Genet. 25: 25-29. NaCTeM Seminar

  7. The Gene Ontology (cont) • Started as a joint effort between three model organism databases (FlyBase (Drosophila), the Saccharomyces Genome Database (SGD) and the Mouse Genome Database (MGD)) • GO now (08/11/05) contains 19022 terms • GO Slim(s) are reduced versions of GO ontologies containing a subset of GO terms • Aim to give a broad overview of ontology content • GO Slim Generic currrently contains 127 terms • A typical GO Term Term name: isotropic cell growth Accession: GO:0051210 Ontology: biological_process Synonyms: related: uniform cell growth Definition: “The process by which a cell irreversibly increases in size uniformly in all directions. In general, a rounded cell morphology reflects isotropic cell growth.” … NaCTeM Seminar

  8. Evidence code Gene Go Annotation Reference TAS : Traceable Author Statement structural constituent of cytoskeleton Botstein D, et al. (1997) The yeast cytoskeleton. Pruyne D and Bretscher A (2000) Polarization of cell growth in yeast. Pruyne D and Bretscher A (2000) Polarization of cell growth in yeast. I. Establishment and maintenance Botstein D, et al. (1997) The yeast cytoskeleton. TAS : Traceable Author Statement ACT1 exocytosis Galarneau L, et al. (2000) Multiple links between the NuA4 histone acetyltransferase complex and epigenetic control of transcription IDA : Inferred from Direct Assay histone acetyltransferase complex GO Annotation in Model Organism DB’s • Model organism db’s typically record for each entry (gene) one or more GO codes + links to the literature supporting the assignment of the GO code • E.g. from the Saccharomyces Genome Database (SGD) IC: Inferred by Curator IDA: Inferred from Direct Assay IEA: Inferred from Electronic Annotation IEP: Inferred from Expression Pattern IGI: Inferred from Genetic Interaction IMP: Inferred from Mutant Phenotype IPI: Inferred from Physical Interaction ISS: Inferred from Sequence or Structural Similarity NAS: Non-traceable Author Statement ND: No biological Data available RCA: inferred from Reviewed Computational Analysis TAS: Traceable Author Statement NR: Not Recorded NaCTeM Seminar

  9. PubMed • PubMed • on-line bibliographic database designed to provide access to citations from biomedical literature • developed by the US NCBI at the NLM • Contains Medline, OldMedline, various other sources • Medline • Over 12 million citations dating back to 1960’s • Author abstracts and citations from > 4800 biomedical journals NaCTeM Seminar

  10. PubMed • Entrez is NCBI’s integrated, text-based search and retrieval system for the major databases it maintains NaCTeM Seminar

  11. Outline • Context • Project Background • The Gene Ontology • Go Annotation in Model Organism Databases • Medline • Go Tagging Tasks • User types/scenarios • Possible tasks • Related Work • Data sets/Gold Standards • Approaches and Results to Date • Lexical lookup • Vector Space Similarity • Machine Learning • Exploiting the Results in Search Tools NaCTeM Seminar

  12. User Types • Research Geneticists • Narrow information interest • Particular gene • Particular activity/functionality • Model Organism Genome DB Curators • Broader information interest • Typically track a number of publications, seeking to enhance information stored in the model organism genome DB at the locus level NaCTeM Seminar

  13. User Scenarios: Research Geneticist • Possible scenarios using GO tagging to support a research geneticist include: • Search result presentation: • Tag abstracts returned from a PubMed search with GO codes • Use GO codes to cluster/structure search results to support more effective information access • Structuring of related literature as workflow side-effect • Many typical researcher workflows involve BLAST searches yielding BLAST/Swissprot reports • Workflow can automatically assemble a set of “related” papers by extracting PMIDs of homologous genes/proteins from reports and collecting these abstracts plus, optionally, others closely related by text similarity • Resulting abstract set can be clustered/structured by GO terms and presented to researcher (Integrating Text Mining Services into Distributed Bioinformatics Workflows: A Web Services Implementation. Gaizauskas, Davis, Demetriou, Guo and Roberts, In Proceedings of the IEEE International Conference on Services Computing (SCC 2004), 2004.) NaCTeM Seminar

  14. Search Result Presentation: Motivating Example • One of the genes involved in the cognitive/social elements of Williams Beuren syndrome is LIM Kinase 1 (LIMK1/LIMK-1) • Putting LIM Kinase into Entrez gives 146 possible papers of interest. NaCTeM Seminar

  15. Search Result Presentation: Motivating Example • One of the genes involved in the cognitive/social elements of Williams Beuren syndrome is LIM Kinase 1 (LIMK1/LIMK-1) • Putting LIM Kinase into Entrez gives 146 possible papers of interest. • However search in the model organism corpus for LIM Kinase yields only 5 papers but a high number of associated GO codes (and this is from only partially annotated papers): • Suggests even a single gene may be involved in numerous roles and that clustering according to GO codes may give a more focused method of searching rather than simply supplying more and more keywords which may remove useful and important papers from the result set. GO:0006468 : biological_process : protein amino acid phosphorylation GO:0004674 : molecular_function : protein serine/threonine kinase activity GO:0004672 : molecular_function : protein kinase activity GO:0007283 : biological_process : spermatogenesis GO:0008064 : biological_process : regulation of actin polymerization and/or depolymerization GO:0005515 : molecular_function : protein binding GO:0005634 : cellular_component : nucleus GO:0005925 : cellular_component : focal adhesion GO:0005515 : molecular_function : protein binding NaCTeM Seminar

  16. Search Result Presentation: Motivating Example • However search in the model organism corpus for LIM Kinase yields only 5 papers but a high number of associated GO codes (and this is from only partially annotated papers): • Suggests even a single gene may be involved in numerous roles and that clustering according to GO codes may give a more focused method of searching rather than simply supplying more and more keywords which may remove useful and important papers from the result set. GO:0006468 : biological_process : protein amino acid phosphorylation GO:0004674 : molecular_function : protein serine/threonine kinase activity GO:0004672 : molecular_function : protein kinase activity GO:0007283 : biological_process : spermatogenesis GO:0008064 : biological_process : regulation of actin polymerization and/or depolymerization GO:0005515 : molecular_function : protein binding GO:0005634 : cellular_component : nucleus GO:0005925 : cellular_component : focal adhesion GO:0005515 : molecular_function : protein binding NaCTeM Seminar

  17. User Scenarios: Model Organism DB Curator • Possible scenarios using GO tagging/text mining to support DB curators include: • Help assemble texts that may support GO code assignment • GO tag texts in curator’s watching brief • Automated tagging could act as prompt for/check on curator’s judgement • Help to determine gene-GO term pairs for annotation • Perform GO tagging/ gene name identification at text level and suggest all pairs as candidates • Perform GO tagging/gene name identification at sentence level and suggest candidates • Attempt to assign GO evidence codes • To text segments providing evidence for GO code assignment without identifying GO code/gene pair to which the evidenced pertains • To text segments providing evidence plus the GO code/gene pair to which the evidenced pertains NaCTeM Seminar

  18. Possible Tasks (1) Assigning GO codes to abstracts/full papers • Given: a set of texts (PubMed abstracts/full papers) and the GO/GO Slim ontology • Task: assign 0 or more GO codes to a text iff the text is “about” the function/process/component identified by the code (assume most specific code only assigned) • Note in this task there is no association of GO code with any specific gene/gene product NaCTeM Seminar

  19. Possible Tasks (2) Assigning GO codes to genes/gene products in abstracts/full papers • Given: a set of texts (PubMed abstracts/full papers) and the GO/GO Slim ontology • Task: If the text supports the assignment of one or more GO codes to a gene/gene product, identify gene/gene product-GO code pairs and the text supporting the assignments • This capability would support additional tasks • Given a particular gene/gene product and a text collection, find all GO codes for the gene/gene product across the collection • Given a GO code and a text collection, find all genes/gene products tagged with the code across the collection NaCTeM Seminar

  20. Possible Tasks (3) Assigning evidence codes to genes/gene products-GO code pairings in abstracts/full papers • Given: a set of texts (PubMed abstracts/full papers) and the GO/GO Slim ontology • Task: As in Task 2. but additionally supply the evidence codes • A weaker variant of this is just to suggest evidence text that may assist in the assignment of GO code NaCTeM Seminar

  21. Related Work • Raychaudri, Chang, Sutphin & Altman (2002) • Task: associate GO codes with genes by • Associating GO codes with papers • Associating a specific GO code with a gene if sufficient number of papers mentioning the gene have the GO associated with them • Method: Treat 1. as a document classification task and evaluate maximum entropy, Naïve Bayes and Nearest Neighbours approaches • Evaluation: corpus of 20,000 Medline abstracts assigned one or more of 21 GO terms/categories • Results: maximum entropy best -- 72.8% classification accuracy over 21 categories NaCTeM Seminar

  22. Related Work • Go-KDS (Smith & Cleary, 2003) • Product of Reel Two • Task: assign arbitrary GO terms to PubMed articles • Method: • Proprietary Weighted Confidence learner (similar to Naïve Bayes), using only words as features • trained on gene/protein DB’s which use GO codes AND have links to Medline • Evaluated on approx. same data/task as Raychaudri et al. -- 70.5 % accuracy NaCTeM Seminar

  23. Related Work • GoPubMed • On-going work at Dresden University (www.gopubmed.org) • Task: Annotate PubMed abstracts with GO terms • Method: Use a local sequence alignment algorithm with weighted term matching (to overcome limits of strict matching) between GO terms and strings in texts • Evaluation: None reported • Kiritchenko et al. (U. of Ottawa) • Task: assign arbitrary GO terms to biomedical texs • Method: • Treat task as hierarchical text classification • use AdaBoost.MH • Evaluation: • introduce hierarchical evaluation measure • Results unclear NaCTeM Seminar

  24. Related Work (cont) • Biocreative challenge -- task 2 contained three related subtasks • Given an article, a protein and a GO code, where the article justifies the assignment of the GO code to the protein, find evidence text in the article supporting the assignment • Given protein-article pairs plus the number of GO code assignments supported by the article, find the GO code(s) that should be assigned to the protein based on the article • Given a set of proteins, retrieve a set of papers relevant to assigning codes to the proteins plus the GO code annotations and the supporting passages (not evaluated) • Results indicated no systems ready for practical use • Issues: lack of training data; complexity of tasks NaCTeM Seminar

  25. Related Work (cont) • TREC Genomics Track 2004 -- three tasks related to GO code assignment • Triage -- given a set of articles find those that contain some evidence for the assignment of a GO code, i.e. warrant being curated • Given an article and names of genes occurring in the article assign one or more of the top three GO ontologies from which human curators had assigned codes • Task 2 plus provide evidence code supporting each gene-GO hierarchy label association • Results for all three tasks poor NaCTeM Seminar

  26. Outline • Context • Project Background • The Gene Ontology • Go Annotation in Model Organism Databases • Medline • Go Tagging Tasks • User types/scenarios • Possible tasks • Related Work • Data sets/Gold Standards • Approaches and Results to Date • Lexical lookup • Vector Space Similarity • Machine Learning • Exploiting the Results in Search Tools NaCTeM Seminar

  27. Data Sets and Evaluation • In order to assess performance of GO tag assignment, a gold standard manually annotated/verified corpus is needed • However, no such corpus exists … NaCTeM Seminar

  28. Data Sets and Evaluation • Solution 1: SGD Gold Standard • Derive a corpus from SGD model organism database (yeast) • Assemble all Medline abstracts cited as evidence supporting assignment of GO terms • Associate with each abstract the GO term whose assignment it is cited as supporting • I.e. given the annotated genes in SGD, assign a GO term T to a paper P if the paper P is referenced in support of a Gene-GO term association involving T • SGD Gold Standard • 4922 PMIDS • 2455 GO terms • 10485 PMID-GO term pairs NaCTeM Seminar

  29. Data Sets and Evaluation: SGD “Gold” Standard • Advantages: • Data already exists -- no extra annotation work required • Can assemble similar corpora for each model organism DB • Disadvantages: • Each abstract has associated with it GO terms whose assignment to specific genes it supports, but may be missing other GO terms which can also be legitimately attached to it • Not every paper supporting a GO term assignment will be cited • Consequence: • SGD gold standard is “GO term incomplete” • Weak measure of recall • Precision figures difficult to interpret NaCTeM Seminar

  30. Data Sets and Evaluation: SGD “Gold” Standard • Further issue: • SGD Gene-GO term assignments are based on full papers, whereas system only has access to abstracts • Consequence: • Limit on maximum Recall obtainable by system NaCTeM Seminar

  31. Data Sets and Evaluation (cont) • Solution 2: IC Gold Standard • Manually extend the GO annotation of abstracts derivable from the SGD • Goal: GO term complete gold standard • Selected a subset (~800) for which support for all the assigned GO codes is found in the abstract (rather than the full paper) • Manually added additional GO annotations using a combination of fuzzy maching against GO and some manual addition of synonyms during checking • For included terms, include lowest within each ontology ‘cell wall biosynthesis’ => ‘cell wall biosynthesis’‘cell wall’ • Also applied same methodology to evidence paragraphs -- brief summaries written by curators deliberately using GO vocabulary • IC Gold Standard • 785 PMIDS • 1006 GO terms • 5170 PMID-GO term pairs NaCTeM Seminar

  32. Data Sets and Evaluation (cont) • Advantages: • Much closer to a GO-term complete gold standard • Disadvantages • Still not GO-term complete • Method of creation suggests there may still be many unannotated GO terms that ought to be marked (direct mentions of GO terms vs. semantically entailed GO terms) • Gold Standard creation method favors lexical look-up approach to GO-tagging • Dataset is small NaCTeM Seminar

  33. Outline • Context • Project Background • The Gene Ontology • Go Annotation in Model Organism Databases • Medline • Go Tagging Tasks • User types/scenarios • Possible tasks • Related Work • Data sets/Gold Standards • Approaches and Results to Date • Lexical lookup • Vector Space Similarity • Machine Learning • Exploiting the Results in Search Tools NaCTeM Seminar

  34. The Go Tagging Task Addressed • The approaches we investigated all considered Task 1, as defined earlier: • Given: a set of texts (PubMed abstracts/full papers) and the GO/GO Slim ontology • Task: assign 0 or more GO codes to a text iff the text is “about” the function/process/component identified by the code (assume most specific code only assigned) NaCTeM Seminar

  35. Approach 1: Lexical Lookup Using Termino • Termino: a large-scale terminological resource to support term processing for information extraction, retrieval, and navigation • Termino contains a database holding large numbers of terms imported from various existing terminological resources, e.g., UMLS, GO • Efficient recognition of terms in text is achieved through use of finite state recognizers compiled from contents of database • The results of lexical look-up in Termino can feed into further term processing components, e.g., term parser • Available as a Web Service (see http://nlp.shef.ac.uk) NaCTeM Seminar

  36. Existing Terminological Resources Raw Texts Neurofibromin GO annotations: - 0008181: tumor supressor - 0005737: cytoplasma - … Medline Abstracts UMLS GO … Mastectomy UMLS data: - CUI: C0024881 - semantic type: therapeutic or preventive procedure - synonyms: mammectomy - … Source-Specific Loaders TermDB Term Induction Peptidyl-prolyl isomerase - type: protein term - source: induced from Medline - … Finite State Look-Up Term Parser Termino Text out Text in Termino Terminology Engine NaCTeM Seminar

  37. Lexical Look-Up for GO Tag • Termino • Imported names of all terms in GO, plus their GO ids and namespace attributes (18270 names in total) • Go term synonyms • SGD yeast gene names • Recognition of terms in text • Case-insensitive • “Simple” morphological variants are recognized • Cells mapped onto cell • Mitochondrial, mitochondria not mapped onto mitochondrion NaCTeM Seminar

  38. Lexical Look-Up for GO Tag (cont) • GO code assignment • GO term T is assigned to text iff name of T is recognized in text • Extensions: • GO term T is assigned to a paper if synonym ofterm T occurs in the abstract of the paper • GO term T is assigned to a paper if yeast gene nameassociated with term T occurs in the abstract of the paper NaCTeM Seminar

  39. Lexical Lookup Results for GO Slim NaCTeM Seminar

  40. Lexical Lookup Results for GO Full NaCTeM Seminar

  41. Lexical Lookup Approach: Discussion • Recall • Effect of curators using full text (SGD) vs. abstracts only (IC) • Inherent drawbacks of lexical look-up: term variation, literal mentions • Effects of Gold Standard creation method (IC) • Precision • Effects of Gold Standard creation method (IC) • GO vs. GO Slim • Recognizing GO Slim terms is easier than recognizing GO terms • Effects of extensions (synonyms/gene names) on performance • Adding synonyms: variable decrease in Precision, substantial increase in Recall • Adding yeast terms: substantial decrease in Precision, substantial increase in Recall NaCTeM Seminar

  42. Error Analysis • False negatives for abstracts: • Abbreviation: mismatch repair(GO name) vs. MMR (in text) • Permutation, derivation: regulation of translation vs. regulated translation, sporulationvs.sporulate • Truncation: galactokinase activity vs. galactokinase • Alternative descriptions: protein catabolism vs. proteins for degradation, autophagic vacuolevs. autophagosomal NaCTeM Seminar

  43. Approach 2: IR-based Vector Space Similarity • Document Collection • Build a collection of “GO documents” where each GO document consists of GO term, its synonyms and its definition sentence • Query • Treat each abstract to which GO codes are to be assigned as a query against the GO document collection • Retrieval • Given a query (i.e abstract) retrieve relevant “GO documents” (i.e. GO terms) • assign top 1, 5, 10 … GO terms to an abstract which are most “similar” as measured by Vector Space Model(VSM) NaCTeM Seminar

  44. IR-based Approach • indexed the GO documents using Lucene search engine • Standard IR preprocessing: tokenization, stop word removal, case normalization, stemming • 4 Indices were built according varying as to whether they used • Standard GO or GO Slim • A GO document consisting of the GO term text (name + definition) or itself plus its ancestor GO terms; • Used standard weighting scheme included in Lucene • Postprocessing: • Re-weighting: give credit to duplicated GO documents (found on more than one path back to root) • Threshold: the number of relevant GOIDs to return NaCTeM Seminar

  45. IR-Based Results • Better performance on IC abstracts than on SGD abstracts • Hierarchical documents do slightly worse than flat documents • Discriminatory effect of specific GO terms may be reducedby occurrence of general terms such as cell and protein NaCTeM Seminar

  46. Approach 3: Machine Learning • Variety of text classification algorithms: Naïve Bayes, Decision Tree, SVM classifier, … • Naïve Bayes predicts only one GO term per abstract • SGD GS: 2.1 GO terms/abstract; IC GS: 6.6 GO terms/abstract • Features: words, frequent phrases • Preprocessing steps: tokenization, removal ofstop words, stemming • Training on 66% of annotated data, evaluation on remainder of data • GO term assignments vis-à-vis generic GO Slim to mitigate data sparsity problems NaCTeM Seminar

  47. Machine Learning Results • One GO term vs. multiple GO terms per abstract makes a difference • Higher precision scores than lexical look-up (SGD): GO terms directly mentioned in text not be assigned if GO terms not present in training set • Oracle Text Decision Tree (IC): classifier learns systematic, strong correlation between words in text and words in GO terms NaCTeM Seminar

  48. Comparison of Approaches • Best F scores for GO Slim • SGD Gold Standard • IC Gold Standard NaCTeM Seminar

  49. Outline • Context • Project Background • The Gene Ontology • Go Annotation in Model Organism Databases • Medline • Go Tagging Tasks • User types/scenarios • Possible tasks • Related Work • Data sets/Gold Standards • Approaches and Results to Date • Lexical lookup • Vector Space Similarity • Machine Learning • Exploiting the Results in Search Tools NaCTeM Seminar

  50. Input keywords here Upload a file containing a list of Medline abstracts Type/paste free texts to get results NaCTeM Seminar

More Related