
Extracting information from scientific papers: Challenges and Opportunities for Researchers and Curators



Presentation Transcript


  1. Extracting information from scientific papers: Challenges and Opportunities for Researchers and Curators DPB

  2. Discussion Plan • What does a curator do? • What do we ALL (researchers and curators) want from the papers we read? • What problems do we encounter when reading papers? • Identifying items • Choosing annotations • How can we work together to improve these processes? • Why does this matter to YOU?

  3. What does a curator do? • It depends on the type of curator! • Functional genomics curator / Metabolic pathway curator: • Help to maintain the TAIR and Plant Metabolic Network / AraCyc websites • Answer questions from users • Give presentations and workshops at conferences and universities • Interact with curators at other institutions to develop better curation practices and tools • Read LOTS of papers

  4. What do we all want from papers?

  5. What do we all want from papers? • It depends on the type of paper! • I focus on papers that describe: • genes/proteins (TAIR and PMN) • metabolic pathways (PMN) • We all want the important information! • Curators also want to be able to capture that information and display it for users on the TAIR and AraCyc/PMN websites.

  6. What do we all want from papers? • What gene / protein are they talking about? • AGI locus code (TAIR / PMN) • At2g46990 • Gene symbol and FULL names (TAIR / PMN) • BSK3 = Brassinosteroid (BR)-signaling kinase 3 • GGT2 = Glutamate:Glyoxylate aminotransferase 2 • Gene model (TAIR) • At2g46990.1
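The identifiers on this slide follow a regular shape, so they can be checked mechanically. The sketch below is only an illustration, not a TAIR tool: `parse_agi` is a hypothetical helper, and the pattern assumes the usual AGI convention of "At", a chromosome (1-5, C, or M), "g", and five digits, with an optional ".N" gene model suffix.

```python
import re

# Assumed AGI shape: At + chromosome (1-5, C, M) + g + five digits,
# optionally followed by ".N" for a specific gene model (e.g. At2g46990.1).
AGI_PATTERN = re.compile(r"^AT([1-5CM])G(\d{5})(?:\.(\d+))?$", re.IGNORECASE)


def parse_agi(identifier):
    """Return (locus, model_number) or None if the string is not an AGI code.

    model_number is None when only the locus is given.
    """
    match = AGI_PATTERN.match(identifier.strip())
    if match is None:
        return None
    chromosome, number, model = match.groups()
    locus = f"AT{chromosome.upper()}G{number}"
    model_number = int(model) if model else None
    return (locus, model_number)


if __name__ == "__main__":
    for candidate in ["At2g46990", "At2g46990.1", "BSK3"]:
        print(candidate, "->", parse_agi(candidate))
```

A gene symbol such as BSK3 returns `None`, which is the point: a symbol alone does not identify a locus or a gene model.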

  7. What do we all want from papers? • What does this gene do? • Molecular Function GO terms (TAIR) • has “protein kinase activity” - GO:0004672 • functions in “histone binding” - GO:0042393 • has “L-glutamine transmembrane transporter activity” - GO:0015186 • Phenotype description (TAIR) • “The ppc4-2 mutant has reduced PEP carboxylase activity” • Reactions catalyzed (PMN) • indole-3-acetonitrile + 2 H2O = ammonia + indole-3-acetate (IAA) • Information for gene summaries (TAIR) • Information for enzyme summaries (PMN)
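GO identifiers are easiest to search and compare when they are written in one canonical form: "GO:" followed by seven digits, with no space. A minimal sketch of a normalizer follows; `normalize_go_id` is a hypothetical helper name, not part of any TAIR or GO software.

```python
import re

# A GO identifier is "GO:" plus exactly seven digits; tolerate stray
# whitespace after the colon, as sometimes appears in slides and papers.
GO_ID_PATTERN = re.compile(r"GO:\s*(\d{7})")


def normalize_go_id(text):
    """Return the canonical 'GO:0000000' form or raise ValueError."""
    match = GO_ID_PATTERN.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"not a GO identifier: {text!r}")
    return f"GO:{match.group(1)}"


if __name__ == "__main__":
    print(normalize_go_id("GO: 0004672"))
    print(normalize_go_id("GO:0015186"))
```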

  8. What do we all want from papers? • Where is this protein found? • Cellular Component GO terms (TAIR) • located in “nucleolus” - GO:0005730 • located in “TOC complex” - GO:0010006 • Cellular Ontology (PMN) • chloroplast • Information for gene summaries (TAIR) • Information for enzyme summaries (PMN)

  9. What do we all want from papers? • When and where is this gene / protein expressed? • Plant Structure PO terms (TAIR) • expressed in “anther” - PO:0009066 • Plant Growth Stages PO terms (TAIR) • expressed during “expanded cotyledon stage” - PO:0001078 • Information for gene summaries (TAIR) • Information for enzyme summaries (PMN)

  10. What do we all want from papers? • What biological processes does this protein participate in? • Biological Process GO terms (TAIR) • involved in “petal development” - GO:0048441 • involved in “L-glutamate import” - GO:0051938 • involved in “brassinosteroid biosynthetic process” - GO:0016132 • Metabolic Pathways (PMN) • put enzyme in “alanine degradation” pathway • Phenotype descriptions • “The phot1-4 mutant shows reduced responses to blue light” • Information for gene summaries (TAIR) • Information for enzyme summaries (PMN)

  11. What do we all want from papers? • What mutant(s) did they describe? (TAIR) • Mutant ID • SALK_nnnnnn • SAIL_21_A07 • Mutant name and unique symbol • rte1-2 (reversion-to-ethylene-sensitivity 1-2) • Ecotype • Zygosity (e.g. heterozygous, homozygous) • Phenotype description

  12. What do we all want from papers? • What experiments did they do? • Assay conditions and reagents • Help curators • make GO and PO annotations (TAIR) • identify enzymatic reactions (PMN) • specific substrates, e.g. L-glutamate • necessary co-factors, e.g. Mg2+ • capture pH and temperature optimums (PMN) • We don’t capture: • PCR primers • good antibody sources • etc. • . . . but you are welcome to submit this information using “Comments”

  13. What do we all want from papers? A lot of important information . . . Gene identity Gene function Gene expression patterns and much more! Have you ever read a paper that’s missing important information? How did that make you feel? Did it interfere with your ability to do your work?

  14. Challenges : Identifying Objects • Case 1: Paper describes a gene or genes using a symbol • Authors never provide AGI code, sequence information, or other unique ID • Different genes can have the same symbols in TAIR • ASA: • Attenuated shade avoidance? • Anthranilate Synthase Alpha Subunit? • ARF1 • Auxin Response Factor 1? • ADP-Ribosylation Factor 1? • Not all symbols are in TAIR • Authors describe a new mutant or name a new gene family and never give IDs • Impossible for us to annotate / Impossible for you to do related experiments
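The symbol problem in Case 1 can be pictured as a lookup that returns more than one answer. The sketch below uses only the symbols and full names that appear in these slides; the data structure and function names are hypothetical, and no AGI codes are filled in because the slides do not give them.

```python
from collections import defaultdict

# (symbol, full name) pairs taken from the slides. A real lookup would
# key each entry on its AGI locus code, which is what makes it unambiguous.
SYMBOL_USES = [
    ("ASA", "Attenuated shade avoidance"),
    ("ASA", "Anthranilate Synthase Alpha Subunit"),
    ("ARF1", "Auxin Response Factor 1"),
    ("ARF1", "ADP-Ribosylation Factor 1"),
    ("GGT2", "Glutamate:Glyoxylate aminotransferase 2"),
]


def build_symbol_index(pairs):
    """Map each symbol to the list of full names that share it."""
    index = defaultdict(list)
    for symbol, full_name in pairs:
        index[symbol].append(full_name)
    return dict(index)


def ambiguous_symbols(index):
    """Return the symbols that point to more than one gene, sorted."""
    return sorted(symbol for symbol, names in index.items() if len(names) > 1)


if __name__ == "__main__":
    index = build_symbol_index(SYMBOL_USES)
    for symbol in ambiguous_symbols(index):
        print(symbol, "could mean:", " / ".join(index[symbol]))
```

A paper that writes only "ARF1" leaves the reader at the ambiguous branch; adding the locus code resolves it.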

  15. Challenges : Identifying Objects • Case 2: Paper does not specify gene model when appropriate • a. “The T-DNA insertion is in the third exon of TPK1” Which “third exon?” Which “TPK1?” • b. “We expressed TPK1 in E. coli and saw activity” Which “TPK1?” • c. “A TPK1:GFP fusion protein localizes to the nucleus”

  16. Challenges : Identifying Objects • Case 3: Not enough information is given about a mutant • “The phyb mutant had a longer hypocotyl than the wild type plant” • 30 alleles / germplasms associated with phyB in TAIR Which phyb? What ecotype?

  17. Challenges : Identifying Objects • Case 4: Not enough information is given about enzymatic reactions • Diagram in paper shows: arogenate → tyrosine • “In vitro, AR dehydrogenase catalyzed the formation of tyrosine from arogenate” D- or L-form of amino acid? What other substrates or products are involved? What oxidizing agent is involved? • “We detected the formation of arabidiol” What is the chemical structure of “arabidiol?”
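One way to see why "arogenate → tyrosine" is incomplete is to count atoms on each side. The sketch below is a simple element-balance check under stated assumptions: the molecular formulas are supplied here for illustration (neutral forms, charges ignored) and do not come from the slides, and the function names are hypothetical.

```python
import re
from collections import Counter


def parse_formula(formula):
    """Count atoms in a simple formula such as 'C10H13NO5' (no brackets)."""
    counts = Counter()
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] += int(number) if number else 1
    return counts


def side_total(side):
    """Sum atoms over a list of (coefficient, formula) pairs."""
    total = Counter()
    for coefficient, formula in side:
        for element, count in parse_formula(formula).items():
            total[element] += coefficient * count
    return total


def imbalance(substrates, products):
    """Atoms on the left minus atoms on the right; empty dict if balanced."""
    left, right = side_total(substrates), side_total(products)
    return {
        element: left[element] - right[element]
        for element in sorted(set(left) | set(right))
        if left[element] != right[element]
    }


# Assumed neutral-form formulas (not given in the slides).
AROGENATE, TYROSINE = "C10H13NO5", "C9H11NO3"
IAN, WATER, AMMONIA, IAA = "C10H8N2", "H2O", "NH3", "C10H9NO2"

if __name__ == "__main__":
    # The diagram as drawn: atoms are left over, so something is missing.
    print(imbalance([(1, AROGENATE)], [(1, TYROSINE)]))
    # The fully written reaction from slide 7 balances.
    print(imbalance([(1, IAN), (2, WATER)], [(1, AMMONIA), (1, IAA)]))
```

The leftover carbon, oxygen, and hydrogen atoms in the first case are exactly what the curator's questions are asking about: which other products form, and which oxidizing agent accepts the hydrogens.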

  18. Opportunities : Identifying Objects • You are the next generation of: • Authors • Reviewers • Journal Editors • You can help each other and curators to identify all the important items in the manuscripts you write or review • AGI locus code for all genes in paper (At2g46990) • Gene model information when relevant (At2g46990.1) • Specific mutant names (abc1-7), IDs (SALK_nnnnn) and ecotype • Complete and balanced biochemical reactions • Chemical structures or chemical database IDs for compounds • But, for curators, identifying objects is only one of the challenges . . .
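The checklist above could be turned into a quick self-check that an author or reviewer runs over a manuscript's text. This is a rough sketch, not an existing TAIR or journal tool: the function names are hypothetical, the SALK pattern (six digits with an optional trailing "C") is an assumption about how those line names are usually written, and the sample sentence is made up.

```python
import re

# Assumed identifier shapes; extend with other stock-center formats as needed.
ID_PATTERNS = {
    "AGI locus code": re.compile(r"\bAt[1-5CM]g\d{5}\b", re.IGNORECASE),
    "gene model": re.compile(r"\bAt[1-5CM]g\d{5}\.\d+\b", re.IGNORECASE),
    "SALK line": re.compile(r"\bSALK_\d{6}C?\b"),
}


def find_identifiers(text):
    """Return the unique identifiers of each kind found in the text."""
    return {
        kind: sorted(set(pattern.findall(text)))
        for kind, pattern in ID_PATTERNS.items()
    }


def missing_kinds(text):
    """List the identifier kinds that never appear in the text."""
    return [kind for kind, hits in find_identifiers(text).items() if not hits]


if __name__ == "__main__":
    # Made-up sentence, used only to exercise the patterns.
    sample = "We obtained the insertion line SALK_012345 for At2g46990."
    print(find_identifiers(sample))
    print("missing:", missing_kinds(sample))
```

A missing kind is not always a problem (a gene model only matters "when relevant"), so the output is a prompt for the author, not a verdict.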

  19. Challenges : Choosing annotations • Curators have to make decisions . . . • When should we make annotations? • What specific annotations should we make? • You should be concerned about how we “choose” annotations • You are data providers • We’re capturing the data from your papers • How would you like to see it presented? • You are data users • You use our annotations of individual genes • You analyze your microarray data using our GO and PO annotations • You view your transcript and metabolomic data using the OMICs viewer • How would you like to see it presented?

  20. Challenges : Choosing annotations – YOU make the call! • When and what should we annotate using GO terms?

  21. Challenges : Choosing annotations – YOU make the call! • Case 1: When is something “involved in” a biological process? • Molecular Function and Cellular Component annotations – pretty clear • Biological Process can be pretty ambiguous! • Glycine metabolic process • 6 mutants are uncovered that have altered levels of glycine • lgl1-1, lgl2-1, lgl3-1 make “Less GLycine” than wild-type plants • mgl1-1, mgl2-1, mgl3-1 make “More GLycine” than wild-type plants • Annotate all 6 genes: involved in “glycine metabolic process” • Use evidence code: IMP = inferred from mutant phenotype
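The annotations described on this slide can be pictured as small records: a gene, a relation, a term, and an evidence code. The sketch below assumes a hypothetical `Annotation` record and helper (this is not TAIR's actual data model) and builds the six IMP annotations from the made-up mutant alleles on the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    gene_symbol: str
    relation: str
    term: str
    evidence_code: str


def annotate_from_mutants(alleles, term):
    """One 'involved in' annotation per gene, inferred from mutant phenotype.

    Allele names like 'lgl1-1' are split at the hyphen to recover the gene
    symbol; every annotation gets the IMP evidence code.
    """
    genes = sorted({allele.split("-")[0].upper() for allele in alleles})
    return [Annotation(gene, "involved in", term, "IMP") for gene in genes]


if __name__ == "__main__":
    alleles = ["lgl1-1", "lgl2-1", "lgl3-1", "mgl1-1", "mgl2-1", "mgl3-1"]
    for annotation in annotate_from_mutants(alleles, "glycine metabolic process"):
        print(annotation)
```

Every record looks equally well supported here, which is why the evidence code matters: IMP says only that a mutant changed the phenotype, not that the gene product acts directly in the process.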

  22. Challenges : Choosing annotations – YOU make the call! • Which genes are “involved in” glycine metabolic process? • MGL1 = F-box protein (E3 ligase subunit): degrades kinase • MGL2 = phosphatase: promotes E3 ligase activity • LGL3 = tyrosine kinase: turns on TF • MGL3 = nucleoporin: allows phosphatase to enter nucleus • LGL2 = transcription factor: up-regulates enzyme • LGL1 = threonine aldolase • Where do we stop? • Should we change old annotations? (***Evidence code is important – be aware of IMP!) • What belongs in a GO annotation versus a phenotype description?

  23. Challenges : Choosing annotations – YOU make the call! • Case 2: How do we deal with over-expressers? RNAi? etc.? • What biological process is XYZ1 involved in? • 35S:XYZ1 • more petals than wild type plants • xyz1 KO mutants • normal number of petals • Is XYZ1 involved in “petal development?” • XYZ1 is “only expressed in roots” • XYZ1 is “expressed at very low levels in flowers” • XYZ1 – no expression data mentioned • What if XYZ1 is part of a large gene family? • What if XYZ1 is unique (not related to other genes)?

  24. Challenges : Choosing annotations – YOU make the call! • Case 3: When is it “enough” to make an annotation? • JKL is expressed in “rosette leaves” • “RT-PCR analyses show expression of JKL in rosette leaves” • “JKL is expressed at low levels in rosette leaves” • “JKL expression is barely detectable in rosette leaves” • GHI has enzymatic activity with the following substrates in vitro: • IAA + isoleucine -> IAA-Ile (90%) • IAA + leucine -> IAA-Leu (50%) • IAA + histidine -> IAA-His (20%) • IAA + cysteine -> IAA-Cys (5%) • IAA + proline -> IAA-Pro (1%) • Which Molecular Functions do we annotate with GO in TAIR? • Which reactions do we add to AraCyc? • What if the reactions are characterized in vivo?
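The GHI example shows how much a single cutoff decides. The sketch below treats the percentages from the slide as relative activities (an assumption about what they measure) and uses a hypothetical `substrates_above` helper; the cutoff values are purely illustrative and are not TAIR or PMN policy.

```python
# Percentages copied from the slide, keyed by the amino acid substrate.
RELATIVE_ACTIVITY = {
    "isoleucine": 90,
    "leucine": 50,
    "histidine": 20,
    "cysteine": 5,
    "proline": 1,
}


def substrates_above(activities, cutoff_percent):
    """Return substrates whose activity meets the cutoff, highest first."""
    ranked = sorted(activities.items(), key=lambda item: item[1], reverse=True)
    return [name for name, percent in ranked if percent >= cutoff_percent]


if __name__ == "__main__":
    for cutoff in (50, 20, 1):
        print(f">= {cutoff}%:", substrates_above(RELATIVE_ACTIVITY, cutoff))
```

Moving the cutoff from 50 to 1 changes the answer from two reactions to five, and nothing in the in vitro data alone says which choice reflects what the enzyme does in the plant.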

  25. Challenges : Choosing annotations – YOU make the call! Case 4: Figures without text support • Which genes are “expressed in” these tissues?

  26. Challenges : Choosing annotations – YOU make the call! Case 4: Figures without text support • “The expression of 11 genes was detected in leaves.”

  27. Challenges : Choosing annotations – YOU make the call! Case 5: Which term is “most” appropriate? • GRI (Grim Reaper) is involved in the regulation of extracellular ROS-induced cell death • “gri plants show increased ROS-induced cell death and reduced seed content.” • “The seed content in siliques was reduced in gri and GRI overexpressors compared with Col-0 and vector control.” (Wrzaczek et al. 2009) • Are the siliques shorter? Then involved in “fruit development”? • Are there empty spaces in normal siliques? Then involved in “seed development”?

  28. Opportunities : Choosing annotations – YOU make the call! • You can be the annotators of the future! • informally : e-mail us or drop by and say hello! • use TAIR or PMN submission forms • during journal publication process • Plant Physiology (now) • more journals in the future!

  29. Extracting information from scientific papers: Challenges and Opportunities for Researchers and Curators • We all read papers • We all want to extract important and useful information from papers • We all want reliable annotations in our databases • Challenges: • Sometimes it is difficult to find the information we need in papers • Sometimes it is hard to judge how to curate data in papers • Opportunities: • Authors, reviewers, and editors can make sure that papers have adequate information • Curators can help researchers to directly submit annotations to TAIR or the PMN • Curators and researchers can communicate about the curation process • You know what we want • We know what you want! • We all work together to advance scientific research!

  30. Thank you! TAIR, AraCyc, and the PMN Eva Huala (Director and Co-PI) Sue Rhee (PI and Co-PI) Tech Team Members: - Bob Muller (Manager) - Larry Ploetz (Sys. Administrator) - Raymond Chetty - Anjo Chi - Vanessa Kirkup - Cynthia Lee - Tom Meyer - Shanker Singh - Chris Wilks Metabolic Pathway Software: - Peter Karp and SRI group Current Curators: - Tanya Berardini (lead curator – functional annotation) - David Swarbreck (lead curator – structural annotation) - Peifen Zhang (Director and lead curator – metabolism) - A. S. Karthikeyan (curator) - Philippe Lamesch (curator) - Donghui Li (curator) - Rajkumar Sasidharan (curator) Recent Past Contributors: - Debbie Alexander (curator) - Christophe Tissier (curator) - Hartmut Foerster (curator) NSF
