This document discusses advancements in the efficiency of literature curation for WormBase by incorporating user contributions and the semi-automation capabilities of Textpresso. It outlines the current curation pipeline, emphasizes the importance of user-submitted data, and explores the use of full-text searching and data extraction techniques. By examining case studies, the document highlights the significant improvements in data types such as genetic interactions and gene ontology annotations, ultimately aiming to streamline the curation process and promote transparency in curation statistics.
Improving Curation Efficiency: User Contributions and Textpresso-Based Semi-Automation
WormBase Literature Curators
Textpresso
SAB 2008
How does data get into WormBase?
• User submission (email, web forms)
  Institution: Sanger Institute
  SUBMITTED FROM PAGE: http://www.wormbase.org/db/seq/gbrowse/elegans/
  COMMENT TEXT: Dear WormBase, I think that WormBase may be missing a gene between Y50E8A.6 and Y50E8A.7......
• First-pass curation
Current first-pass curation pipeline
Publication → Flagging/Triage → Curation
User submissions: first-pass flagging/triage
• Growing desire among biocurators for user submissions
• The authors are the first people to know what data are in a paper
• TAIR – partnered with Plant Physiology
  web interface for data submission (February 2008)
  voluntary, link included in acceptance letter
  Submitter email
  Paper identifier
  Locus name
  Term/descriptor, method
Data extraction: Textpresso
• Full-text searching
• Keywords and/or categories
Müller, Kenny, and Sternberg. PLoS Biology, November 2004.
Textpresso: What data types?
• Paper – entity association: pattern matching
  Transgenes (Wen): WBPaper00031242 – gqIs3, gqIs35, oxIs12
• Fact extraction: specialized categories
  Genetic interactions (Andrei): eor-2(op166) suppresses HSN death in the strong tra-1(e1099) background, but not noticeably in the weaker tra-1(e1076) background.
  GO cellular component curation (Kimberly): ...positions of these neurons are indicated with circles and localizations of GAR-3::YFP on the cell membranes are denoted by arrows.
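The paper – entity association step can be illustrated with a small pattern-matching sketch. This is not Textpresso's implementation; it assumes a simplified naming pattern (a 1-3 letter lowercase lab prefix, then "Is" or "Ex", then a number), and the function names `find_transgenes` and `associate` are hypothetical.

```python
import re

# Assumption: transgene names look like a 1-3 letter lowercase lab prefix,
# followed by "Is" (integrated) or "Ex" (extrachromosomal), then a number.
# This is a simplified pattern for illustration only.
TRANSGENE_RE = re.compile(r"\b[a-z]{1,3}(?:Is|Ex)\d+\b")


def find_transgenes(text):
    """Return the distinct transgene-like names in text, in order of first appearance."""
    seen = []
    for name in TRANSGENE_RE.findall(text):
        if name not in seen:
            seen.append(name)
    return seen


def associate(papers):
    """Map each paper identifier to the transgene-like names found in its text.

    `papers` is a dict of {paper_id: full_text}.
    """
    return {paper_id: find_transgenes(text) for paper_id, text in papers.items()}
```

Matches from a pattern like this would still need a manual check, as with the transgene connections reported in the deck.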
Textpresso-mediated CC curation: from sentences to annotations
Textpresso: How much data?
• Transgenes: 1,100 new paper-transgene connections
  250 new transgenes checked manually – 95% accuracy
  ultimately, connections will go directly into database
• Genetic Interactions: 1,875 (1/2007 – 5/2008)
  ~5,600 total interactions
  keeping current with new papers
• GO Cellular Component Annotations: 515 (1/2007 – 5/2008)
  2-3X rate prior to categories
  nearly complete
  keeping up with new data (1-2 hours/week)
Textpresso: Other data types
How else can we use Textpresso?
Other data types: Molecular Function Assays, Gene Product Interactions
• Pilot: GO molecular function annotations for protein kinase activity
  keyword: phosphorylate
  category: C. elegans proteins
  13 new GO annotations/hour
  Extension of this: protein modifications – not yet captured in WB
• Pilot: Gene product interactions for WB and BIND
  keywords: physically interact
  category: C. elegans proteins
  310 matches in 237 documents
  22 physical interactions – top 15 papers
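The keyword-plus-category searches in these pilots can be sketched as a sentence filter: keep a sentence only if it contains the keyword and at least one term from the category. The code below is a minimal illustration, not Textpresso's search engine. The function names are hypothetical, the sentence splitter is deliberately naive, and the category is assumed to be a plain list of terms supplied by the caller.

```python
import re


def split_sentences(text):
    """Naive splitter: break after ., ! or ? when followed by whitespace and a capital letter."""
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)
    return [p.strip() for p in parts if p.strip()]


def search(text, keyword, category_terms):
    """Return (sentence, matched_terms) pairs for sentences that contain the
    keyword and at least one category term (case-insensitive substring match)."""
    hits = []
    for sentence in split_sentences(text):
        lowered = sentence.lower()
        if keyword.lower() not in lowered:
            continue
        matched = [term for term in category_terms if term.lower() in lowered]
        if matched:
            hits.append((sentence, matched))
    return hits


# Usage with placeholder protein names (AAA-1 and BBB-2 are made up for illustration).
example_text = (
    "AAA-1 can phosphorylate BBB-2 in vitro. "
    "Worms were grown at 20 degrees. "
    "Extracts phosphorylate histone H1."
)
example_hits = search(example_text, "phosphorylate", ["AAA-1", "BBB-2"])
```

Requiring both conditions in the same sentence is what narrows the results to candidate statements a curator can review quickly.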
Textpresso for triage: Classifying text based on content
• Multiple levels:
  • Organismal triage – C. elegans, Drosophila
  • Identify, prioritize information-rich papers
  • Flag for specific data types
• Multiple strategies (using existing first-pass papers as training set):
  • Machine learning – SVM (Support Vector Machine)
    Word frequency analysis
  • Hand-crafted categories
  • Combine SVM and categories
  • Supplement with word weighting, contextual analyses
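As a rough illustration of the word-frequency strategy, the sketch below learns per-word weights from already-flagged (positive) and unflagged (negative) papers and flags new text when the summed weights exceed a threshold. It is a simple smoothed log-odds scorer, not an SVM and not the curators' actual classifier. The function names, the tokenizer, and the default threshold are all assumptions.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into runs of letters, digits and hyphens."""
    return re.findall(r"[a-z0-9\-]+", text.lower())


def train(positive_docs, negative_docs):
    """Return a {word: weight} dict of add-one-smoothed log-odds.

    Positive weights mark words that are more frequent in the positive set.
    """
    pos = Counter(w for doc in positive_docs for w in tokenize(doc))
    neg = Counter(w for doc in negative_docs for w in tokenize(doc))
    vocab = set(pos) | set(neg)
    pos_total = sum(pos.values()) + len(vocab)
    neg_total = sum(neg.values()) + len(vocab)
    return {
        w: math.log((pos[w] + 1) / pos_total) - math.log((neg[w] + 1) / neg_total)
        for w in vocab
    }


def score(weights, text):
    """Sum the weights of known words; unknown words contribute nothing."""
    return sum(weights.get(w, 0.0) for w in tokenize(text))


def flag(weights, text, threshold=0.0):
    """Flag text for a data type when its score exceeds the (assumed) threshold."""
    return score(weights, text) > threshold
```

In practice the threshold would be tuned on held-out first-pass papers, and a scorer like this could be combined with hand-crafted categories as the slide suggests.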
.....and making curation statistics more transparent to users.
• Users could search for curation status of any paper
• Users could search for curation status of a given data type
• Each database release would report newly curated papers
• Each database release would document increases in data-type curation
WormBase Literature Curation
• First Pass, Genetic Interactions: Andrei Petcherski, Caltech
• Gene Symbols, Alleles, Sequence Features, Mapping Data: Mary Ann Tuli, Sanger
• Gene Regulation, PWMs: Xiaodong Wang, Caltech; Erich Schwarz, Caltech
• Expression Patterns, Antibodies, Transgenes: Wen Chen, Caltech
• Gene Function: Concise Descriptions, Gene Ontology: Ranjana Kishore, Caltech; Erich Schwarz, Caltech; Kimberly Van Auken, Caltech
• Anatomy Ontology, Cell Function: Raymond Lee, Caltech
• Microarrays, SAGE: Igor Antoshechkin, Caltech
• Mutant Phenotypes (RNAi and Alleles): Igor Antoshechkin, Caltech; Jolene Fernandez, Caltech; Raymond Lee, Caltech; Gary Shindelman, Caltech; Karen Yook, Caltech
• Curation Tools, Database: Juancarlos Chan, Caltech
• Sequence, Gene Structures: Sanger, Wash U
• Authors, Papers: Cecilia Nakamura, Daniel Wang