Pathway/Genome Database

Overview of the Pathway Tools Software and Pathway/Genome Databases Peter D. Karp Bioinformatics Research Group SRI International pkarp@ai.sri.com. Pathway/Genome Database. Integrating Genomic and Biochemical Data. Pathways. Reactions. Compounds. Proteins. Operons, Promoters,

Presentation Transcript


  1. Overview of the Pathway Tools Software and Pathway/Genome Databases • Peter D. Karp • Bioinformatics Research Group • SRI International • pkarp@ai.sri.com

  2. Pathway/Genome Database: Integrating Genomic and Biochemical Data • Pathways • Reactions • Compounds • Proteins • Operons, Promoters, DNA Binding Sites • Genes • Chromosomes, Plasmids • CELL

  3. Key Functionality • Pathway analysis • Prediction of pathways from genomes • Comparative pathway analysis • Ongoing curation of PGDBs • WWW publishing of PGDBs • Analysis of gene expression data

  4. Tools and Datasets • Pathway/Genome Navigator: visualize, query, and analyze PGDBs • PathoLogic: create PGDBs • Editors: update PGDBs • PGDB: pathways, genes

  5. PathoLogic Pathway Predictor (diagram labels: Set of Annotated Genes • MetaCyc PGDB • Pathway Prediction Reports • New PGDB)

  6. Prediction of Pathways from Genomes (diagram labels: Annotated Genome • DNA Sequence • List of Genes/ORFs • List of Gene Products • PathoLogic • Pathway/Genome Database • Pathways • Reactions • Compounds • Proteins • Genes • Genomic Map • Metabolic Network)

  7. MetaCyc Overview • Meta Metabolic Encyclopedia • 439 pathways, 1095 enzymes, 4217 reactions • 173 E. coli pathways • Literature-based DB with extensive references and commentary • Pathways, reactions, enzymes, substrates • Editor in chief: Dr. Monica Riley

  8. Pathway/Genome Navigator • Query and visualization tools for PGDBs • Metabolic pathways, reactions, compounds • Enzymes, transporters, transcription factors • Genome maps, genes, operons, promoters, DNA sites • Retrieve nucleotide and DNA sequences • Perform Blast searches • Runs as an application on Solaris, Windows • Runs as a WWW server on Solaris • Query and comparative analysis functions

  9. Interactive Editing Tools • Pathway editor • Reaction editor • Gene editor • Enzyme editor • Compound editor • Transcription Unit Editor • Facilitate updates to PGDBs • Improved computational predictions • Literature-based data • Record citations, comments, evidence, history

  10. Pathway Views of Expression Data • Import gene expression data • Compute expression ratios • Obtain pathway based visualizations of data • Numerical spectrum of expression values mapped to a color spectrum • Steps of overview painted with color corresponding to expression level(s) of genes that encode enzyme(s) for that step • Absolute or relative expression values
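The mapping from expression values to colors described on this slide can be illustrated with a short sketch. This is not the Pathway Tools implementation: the use of a log2 ratio, the blue-to-red scale, and the clamping range of -2 to 2 are all assumptions chosen for illustration.

```python
import math


def log_ratio(experiment, control):
    """Relative expression as a log2 ratio (assumed convention, not from the slides)."""
    return math.log2(experiment / control)


def value_to_rgb(value, lo=-2.0, hi=2.0):
    """Map a numeric value linearly onto a blue (lo) to red (hi) color scale.

    Values outside [lo, hi] are clamped to the ends of the scale.
    """
    clamped = max(lo, min(hi, value))
    t = (clamped - lo) / (hi - lo)
    return (round(255 * t), 0, round(255 * (1 - t)))


if __name__ == "__main__":
    # Hypothetical expression measurements for two genes: (experiment, control).
    genes = {"geneA": (400.0, 100.0), "geneB": (50.0, 200.0)}
    for gene, (experiment, control) in genes.items():
        print(gene, value_to_rgb(log_ratio(experiment, control)))
```

In such a scheme, each step of the overview would be painted with the color computed for the gene or genes that encode its enzymes.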

  11. Environment for Computational Exploration of Genomes • Powerful ontology opens many facets of the biology to computational exploration • Global characterization of metabolic network • Analysis of interface between transport and metabolism • Nutrient analysis of metabolic network

  12. PathoLogic Pathway Predictor

  13. PathoLogic Pathway Predictor • Introduction • Description of PPP execution • Inputs to PPP • Using the GUI to create a pathway/genome database • Output from PPP • Caveats

  14. PathoLogic Goals • Create the set of class frames that encode DB schema • Copied from MetaCyc • Create the appropriate set of instance frames • Genes, genetic elements, proteins created from input files • Substrates, reactions, and pathways are copied from the reference database • Interconnect frames in a manner that accurately reflects their semantic relationships

  15. PathoLogic Input/Output
      • Inputs:
          • File listing genetic elements (http://bioinformatics.ai.sri.com/ptools/genetic-elements.dat)
          • Files containing DNA sequence for each genetic element
          • Files containing annotation for each genetic element
          • MetaCyc database
      • Output:
          • Pathway/genome database for the subject organism
          • Directory tree for the subject organism
          • Reports that summarize:
              • Evidence contained in the input genome for the presence of reference pathways
              • Reactions missing from inferred pathways

  16. Inputs to PathoLogic Pathway Predictor • genetic-elements.dat • Sequence files • GenBank file format • PathoLogic format • Directory Structure

  17. genetic-elements.dat
      ID TEST-CHROM-1
      NAME Chromosome 1
      TYPE :CHRSM
      CIRCULAR? N
      ANNOT-FILE chrom1.pf
      SEQ-FILE chrom1.fsa
      //
      ID TEST-CHROM-2
      NAME Chromosome 2
      CIRCULAR? N
      ANNOT-FILE /mydata/chrom2.gbk
      SEQ-FILE /mydata/chrom2.fna
      //

  18. File Naming Conventions • One pair of sequence and annotation files for each genetic element • Sequence files: FASTA format, suffix .fsa or .fna • Annotation files: GenBank format, suffix .gbk; PathoLogic format, suffix .pf
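As an illustration of the genetic-elements.dat layout and the naming conventions above, here is a minimal sketch that reads the records and checks the file suffixes. It assumes that attribute and value are separated by whitespace and that `//` closes each record; the function names are hypothetical and not part of Pathway Tools.

```python
from pathlib import PurePosixPath

# Suffixes taken from the naming conventions on the slide.
ANNOT_SUFFIXES = {".pf", ".gbk"}
SEQ_SUFFIXES = {".fsa", ".fna"}

SAMPLE = """\
ID\tTEST-CHROM-1
NAME\tChromosome 1
TYPE\t:CHRSM
CIRCULAR?\tN
ANNOT-FILE\tchrom1.pf
SEQ-FILE\tchrom1.fsa
//
ID\tTEST-CHROM-2
NAME\tChromosome 2
CIRCULAR?\tN
ANNOT-FILE\t/mydata/chrom2.gbk
SEQ-FILE\t/mydata/chrom2.fna
//
"""


def parse_genetic_elements(text):
    """Return one dict (attribute -> value) per record in the text."""
    records, current = [], {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line == "//":  # end of record
            if current:
                records.append(current)
            current = {}
            continue
        # Assumption: the first run of whitespace separates attribute from value.
        parts = line.split(None, 1)
        current[parts[0]] = parts[1] if len(parts) > 1 else ""
    if current:  # tolerate a final record without a closing //
        records.append(current)
    return records


def suffix_problems(record):
    """List naming-convention problems for one genetic element."""
    problems = []
    checks = (("ANNOT-FILE", ANNOT_SUFFIXES), ("SEQ-FILE", SEQ_SUFFIXES))
    for attr, allowed in checks:
        name = record.get(attr)
        if name is None:
            problems.append(f"{record.get('ID', '?')}: missing {attr}")
        elif PurePosixPath(name).suffix not in allowed:
            problems.append(f"{record.get('ID', '?')}: unexpected suffix in {attr} {name}")
    return problems


if __name__ == "__main__":
    for rec in parse_genetic_elements(SAMPLE):
        print(rec["ID"], suffix_problems(rec) or "ok")
```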

  19. GenBank File Format
      • Accepted feature types: CDS, tRNA, rRNA, misc_RNA
      • Accepted qualifiers ([req] = required, [recm] = recommended, [opt] = optional):
          • /label: unique ID [recm]
          • /gene: gene name [req]
          • /product [req]
          • /EC_number [recm]
          • /product_comment [opt]
          • /gene_comment [opt]
          • /alt_name: synonyms [opt]
      • For multifunctional proteins, put each function in a separate /product line

  20. Typical Problems Using GenBank Files With PathoLogic • Wrong qualifier names used • Extraneous information in a given qualifier • Check results of trial parse carefully
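A pre-check for wrong qualifier names could look like the sketch below. It works on plain dictionaries rather than on a parsed GenBank file, and the function name and return shape are hypothetical; only the feature types and qualifier names come from the slides.

```python
ACCEPTED_FEATURES = {"CDS", "tRNA", "rRNA", "misc_RNA"}

# Qualifier name -> status, as listed on the GenBank File Format slide.
QUALIFIERS = {
    "label": "recommended",
    "gene": "required",
    "product": "required",
    "EC_number": "recommended",
    "product_comment": "optional",
    "gene_comment": "optional",
    "alt_name": "optional",
}


def check_feature(feature_type, qualifiers):
    """Return (errors, warnings) for one feature.

    `qualifiers` maps qualifier names (without the leading slash) to values.
    """
    errors, warnings = [], []
    if feature_type not in ACCEPTED_FEATURES:
        errors.append(f"feature type {feature_type} is not accepted")
    for name in qualifiers:
        if name not in QUALIFIERS:
            warnings.append(f"unrecognized qualifier /{name}")
    for name, status in QUALIFIERS.items():
        if name in qualifiers:
            continue
        if status == "required":
            errors.append(f"missing required qualifier /{name}")
        elif status == "recommended":
            warnings.append(f"missing recommended qualifier /{name}")
    return errors, warnings


if __name__ == "__main__":
    # Hypothetical feature that uses /note where /product was intended.
    print(check_feature("CDS", {"gene": "deoD", "note": "purine nucleoside phosphorylase"}))
```

A check like this only catches naming problems. Extraneous text inside a qualifier still has to be found by reading the trial parse results.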

  21. PathoLogic File Format • Each record starts with a line containing an ID attribute • Tab delimited • Each record ends with a line containing // • One attribute-value pair is allowed per line • Use multiple FUNCTION lines for multifunctional proteins • Lines starting with ‘;’ are comment lines • Valid attributes are: ID, NAME, SYNONYM, STARTBASE, ENDBASE, GENE-COMMENT, FUNCTION, PRODUCT-TYPE, EC, FUNCTION-COMMENT, DBLINK

  22. PathoLogic File Format
      ID TP0734
      NAME deoD
      STARTBASE 799084
      ENDBASE 799785
      FUNCTION purine nucleoside phosphorylase
      DBLINK PID:g3323039
      PRODUCT-TYPE P
      GENE-COMMENT similar to GP:1638807 percent identity: 57.51; identified by sequence similarity; putative
      //
      ID TP0735
      NAME gltA
      STARTBASE 799867
      ENDBASE 801423
      FUNCTION glutamate synthase
      DBLINK PID:g3323040
      PRODUCT-TYPE P
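The format rules from the two slides above (tab-delimited pairs, `;` comments, `//` terminators, repeated FUNCTION lines) can be sketched as a small reader. This is an illustrative parser, not the one in PathoLogic. The second sample record, with its two FUNCTION lines, is hypothetical.

```python
VALID_ATTRIBUTES = {
    "ID", "NAME", "SYNONYM", "STARTBASE", "ENDBASE", "GENE-COMMENT",
    "FUNCTION", "PRODUCT-TYPE", "EC", "FUNCTION-COMMENT", "DBLINK",
}

SAMPLE_PF = (
    "; comment lines start with a semicolon\n"
    "ID\tTP0734\n"
    "NAME\tdeoD\n"
    "STARTBASE\t799084\n"
    "ENDBASE\t799785\n"
    "FUNCTION\tpurine nucleoside phosphorylase\n"
    "PRODUCT-TYPE\tP\n"
    "//\n"
    "ID\tHYPO-0001\n"  # hypothetical multifunctional protein
    "NAME\thypA\n"
    "FUNCTION\tfirst hypothetical function\n"
    "FUNCTION\tsecond hypothetical function\n"
    "PRODUCT-TYPE\tP\n"
    "//\n"
)


def parse_pf(text):
    """Return a list of records; each maps an attribute to a list of values."""
    records, current = [], None
    for line in text.splitlines():
        if not line.strip() or line.startswith(";"):
            continue  # skip blank lines and comments
        if line.strip() == "//":
            if current is not None:
                records.append(current)
            current = None
            continue
        attr, sep, value = line.partition("\t")
        attr = attr.strip()
        if not sep or attr not in VALID_ATTRIBUTES:
            raise ValueError(f"bad line: {line!r}")
        if current is None:
            if attr != "ID":
                raise ValueError(f"record must start with ID, got {attr}")
            current = {}
        # Attributes such as FUNCTION may repeat, so every value is kept in a list.
        current.setdefault(attr, []).append(value.strip())
    if current is not None:  # tolerate a missing final //
        records.append(current)
    return records


if __name__ == "__main__":
    for rec in parse_pf(SAMPLE_PF):
        print(rec["ID"][0], rec.get("FUNCTION", []))
```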

  23. Using the PPP GUI to Create a Pathway/Genome Database
      • Input Project Information
          • Organism -> Create New
      • Trial Parse
          • Build -> Trial Parse
      • Build pathway/genome database
          • Build -> Automated Build
      • Manual polishing
          • Refine -> Resolve Ambiguous Name Matches
          • Refine -> Assign Modified Proteins
          • Refine -> Create Protein Complexes
          • Refine -> Run Consistency Checker
          • Refine -> Update Overview

  24. PathoLogic Command Menus
      • Organism: Select, Create New, Save KB, Revert KB, Reinitialize KB, Exit
      • Build: Trial Parse, Automated Build
      • Refine: Resolve Ambiguous Name Matches, Assign Modified Proteins, Create Protein Complexes, Re-run Name Matcher, Rescore Pathways, Run Consistency Checker, Update Overview

  25. Input Project Information

  26. PathoLogic PP Parse Output

  27. Enzyme Name to Reaction Mapping

  28. Enzyme Name Matching Tool • Dictionary of enzyme names assembled from: • All metabolic reactions found in MetaCyc • Two files that map synonyms not found in MetaCyc to reaction names: • System file (pangea-enzyme-mappings.dat) • User-supplied file (local-enzyme-mappings.dat) • Location of sources: • $GPROOT/pathologic/$VERSION-NUMBER/data

  29. Enzyme Name Matcher • Matches on full enzyme name • Match is case-insensitive and removes the punctuation characters “ -_(){}',:” • Also matches after removal of prefixes and suffixes such as: • “Putative”, “Hypothetical”, etc • alpha|beta|…|catalytic|inducible chain|subunit|component • Parenthetical gene name
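The matching rules on this slide can be approximated as follows. This is a sketch under assumptions, not the PathoLogic matcher: only the prefixes and suffix words shown on the slide are included (the slide indicates there are more), any trailing parenthetical is treated as a gene name, and the dictionary entry and its reaction identifier are hypothetical.

```python
import re

PUNCTUATION = " -_(){}',:"  # characters removed before comparison, per the slide
PREFIXES = ("putative", "hypothetical")  # the slide lists these "etc."
SUFFIX_RE = re.compile(
    r"[\s,]+(?:alpha|beta|catalytic|inducible)\s+(?:chain|subunit|component)\s*$",
    re.IGNORECASE,
)
# Assumption: a trailing parenthetical is a gene name.
PAREN_RE = re.compile(r"\s*\([^()]*\)\s*$")


def canonical(name):
    """Lowercase the name and delete the punctuation characters."""
    return name.lower().translate(str.maketrans("", "", PUNCTUATION))


def strip_decorations(name):
    """Drop a trailing parenthetical, a subunit-style suffix, and leading prefixes."""
    stripped = PAREN_RE.sub("", name.strip())
    stripped = SUFFIX_RE.sub("", stripped)
    words = stripped.split()
    while words and words[0].lower() in PREFIXES:
        words = words[1:]
    return " ".join(words)


def match(name, dictionary):
    """Try the full name first, then the stripped name; return a reaction id or None."""
    for key in (canonical(name), canonical(strip_decorations(name))):
        if key in dictionary:
            return dictionary[key]
    return None


if __name__ == "__main__":
    # Hypothetical dictionary with a made-up reaction identifier.
    names = {canonical("succinate dehydrogenase"): "HYPOTHETICAL-RXN-1"}
    print(match("Putative succinate dehydrogenase, alpha subunit (sdhA)", names))
```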

  30. Enzyme Name Matcher • For names that do not match, software identifies probable metabolic enzymes as those • Containing “ase” • Not containing keywords such as • “sensor kinase” • “topoisomerase” • “protein kinase” • “peptidase” • Etc • Research unknown enzymes • MetaCyc, Swiss-Prot, PIR, Medline, EMP
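The heuristic for flagging unmatched names as probable metabolic enzymes is simple enough to sketch directly. The keyword list contains only the four examples shown on the slide, which indicates that the real list is longer.

```python
# Keywords taken from the slide; the slide indicates the real list is longer.
EXCLUDED_KEYWORDS = ("sensor kinase", "topoisomerase", "protein kinase", "peptidase")


def is_probable_metabolic_enzyme(name):
    """True if the name contains 'ase' and none of the excluded keywords."""
    lowered = name.lower()
    if "ase" not in lowered:
        return False
    return not any(keyword in lowered for keyword in EXCLUDED_KEYWORDS)


if __name__ == "__main__":
    for name in ("purine nucleoside phosphorylase", "DNA topoisomerase I", "hypothetical protein"):
        print(name, is_probable_metabolic_enzyme(name))
```

Because this is a plain substring test, any name that merely contains the letters "ase" is flagged, which is one reason the flagged names still need to be researched by hand.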

  31. Assigning Evidence Scores to Predicted Pathways • X|Y|Z denotes the score for pathway P in organism O, where: • X = total number of reactions in P • Y = number of reactions in P for which there is evidence in O of a catalyzing enzyme • Z = number of the Y reactions that are also used in other pathways in O • It is not clear how to convert these scores into a probability of occurrence
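A worked example may make the X|Y|Z score concrete. The sketch below assumes pathways are given as sets of reaction identifiers and that "other pathways in O" means every other pathway in the supplied dictionary. The pathway and reaction names are hypothetical.

```python
def evidence_score(pathway, pathway_reactions, reactions_with_evidence):
    """Return the X|Y|Z score of one pathway as a string.

    pathway_reactions: dict mapping pathway name -> set of reaction ids
    reactions_with_evidence: set of reaction ids with an enzyme found in the organism
    """
    reactions = pathway_reactions[pathway]
    present = reactions & reactions_with_evidence
    # Union of the reactions of every other pathway (assumed meaning of "other pathways in O").
    elsewhere = set().union(
        *(rxns for name, rxns in pathway_reactions.items() if name != pathway)
    )
    x = len(reactions)
    y = len(present)
    z = len(present & elsewhere)
    return f"{x}|{y}|{z}"


if __name__ == "__main__":
    # Hypothetical pathways and evidence.
    pathways = {"P1": {"r1", "r2", "r3", "r4"}, "P2": {"r3", "r5"}}
    evidence = {"r1", "r3", "r5"}
    print(evidence_score("P1", pathways, evidence))  # 4|2|1
    print(evidence_score("P2", pathways, evidence))  # 2|2|1
```

In this example P1 has four reactions, two of them with evidence (r1 and r3), and one of those two (r3) also occurs in P2, so its score is 4|2|1.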

  32. Algorithm for Automated Pathway Pruning • A pathway will never be pruned if it contains a unique enzyme (an enzyme not present in any other pathway) • A pathway will be pruned if one of the following conditions holds: • Evidence is better for a different pathway in the same variant set • There is evidence for only one reaction in the pathway, or • Its set of reactions present is a proper subset of the reactions present in some other pathway, and • If the pathway is a biosynthetic pathway, the final reaction(s) are missing • If the pathway is a degradation pathway, the initial reaction(s) are missing • If the pathway is an energy metabolism pathway, more than half the reactions are missing
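The slide leaves the grouping of these conditions open, so the following sketch fixes one reading and should be taken as an interpretation, not as the PathoLogic algorithm. It assumes that the proper-subset condition applies only together with the class-specific condition, that "better evidence" means more reactions with evidence, that "final" and "initial" reactions are the last and first entries of an ordered reaction list, and that the unique-enzyme test has been computed beforehand. All names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pathway:
    name: str
    kind: str                       # "biosynthesis", "degradation", or "energy"
    reactions: tuple                # reaction ids in pathway order
    variant_set: str = ""           # pathways sharing a non-empty label are variants
    has_unique_enzyme: bool = False  # assumed to be precomputed


def present_in(pathway, evidence):
    """Reactions of the pathway for which there is evidence."""
    return {r for r in pathway.reactions if r in evidence}


def class_condition(pathway, evidence):
    """Class-specific condition from the slide (one reading of it)."""
    missing = [r for r in pathway.reactions if r not in evidence]
    if pathway.kind == "biosynthesis":
        return pathway.reactions[-1] in missing
    if pathway.kind == "degradation":
        return pathway.reactions[0] in missing
    if pathway.kind == "energy":
        return len(missing) > len(pathway.reactions) / 2
    return False


def should_prune(pathway, all_pathways, evidence):
    if pathway.has_unique_enzyme:
        return False  # never pruned
    mine = present_in(pathway, evidence)
    others = [q for q in all_pathways if q.name != pathway.name]
    # Better evidence for a different pathway in the same variant set.
    for q in others:
        same_set = q.variant_set and q.variant_set == pathway.variant_set
        if same_set and len(present_in(q, evidence)) > len(mine):
            return True
    if len(mine) == 1:
        return True
    is_subset = any(mine < present_in(q, evidence) for q in others)
    return is_subset and class_condition(pathway, evidence)


if __name__ == "__main__":
    evidence = {"r1", "r2", "r3"}
    a = Pathway("A", "biosynthesis", ("r1", "r2", "r9"))
    b = Pathway("B", "biosynthesis", ("r1", "r2", "r3"))
    print(should_prune(a, [a, b], evidence), should_prune(b, [a, b], evidence))
```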

  33. Creating Protein Complexes

  34. Complex Subunit Stoichiometries

  35. Proteins as Reaction Substrates

  36. Manual Pruning of Pathways • Use pathway evidence report • Coloring scheme aids in assessing pathway evidence • Phase I: Prune extra variant pathways • Rescore pathways, re-generate pathway evidence report • Phase II: Prune pathways unlikely to be present • No/few unique enzymes • Most pathway steps present because they are used in another pathway • Pathway very unlikely to be present in this organism

  37. Overview Graph

  38. Output from PPP • Pathway/genome database • Summary pages • Pathway evidence page • Click “Summary of Organisms”, then click organism name, then click “Pathway Evidence”, then click “Save Pathway Report” • Missing enzymes report • Directory tree containing sequence files, reports, etc.

  39. Resulting Directory Structure
      • ROOT/aic-export/ecocyc/ORGIDcyc/VERSION/
          • input
              • organism.dat
              • organism-init.dat
              • genetic-elements.dat
              • annotation files
              • sequence files
          • reports
              • name-matching-report.txt
              • trial-parse-report.txt
          • kb
              • ORGIDbase.ocelot
          • data
              • overview.graph
      • released -> VERSION

  40. Caveats • Cannot predict pathways not present in MetaCyc • Evidence for short pathways is hard to interpret • Since many reactions occur in lots of pathways, many false positives

  41. The Pathway Tools Schema

  42. Motivations for Understanding Schema • Pathway Tools visualizations and analyses depend upon the software being able to find precise information in precise places within a Pathway/Genome DB • When writing complex Lisp queries to PGDBs, those queries must name classes and slots within the schema • A Pathway/Genome Database is a web of interconnected objects; each object represents a biological entity

  43. Reference • Pathway Tools User’s Guide, Volume I • Appendix A: Guide to the Pathway Tools Schema

  44. Web of Relationships for One Enzyme (diagram labels: sdhA, sdhB, sdhC, sdhD • Sdh-flavo, Sdh-Fe-S, Sdh-membrane-1, Sdh-membrane-2 • Succinate dehydrogenase • Enzymatic-reaction • Succinate + FAD = fumarate + FADH2 • TCA Cycle)

  45. Frame Data Model and Schema • Frame Data Model -- organizational principle for a DB • Object Displays • Schema • Gene slots • Polypeptide slots • Protein slots • Protein Complex slots • Reaction slots • Enzymatic Reaction slots

  46. Frame Data Model • Knowledge base (KB, Database, DB) • Frames • Slots • Facets • Annotations

  47. Knowledge Base • Collection of frames and their associated slots, values, facets, and annotations • Can be stored within • An Oracle DB • A disk file • A Pathway Tools binary program

  48. Frames • Entities with which facts are associated • Kinds of frames: • Classes: Genes, Pathways, Biosynthetic Pathways • Instances (objects): trpA, TCA cycle • Classes: • Superclass(es) • Subclass(es) • Instance(s) • A symbolic frame name (id, key) uniquely identifies each frame

  49. Slots • Encode attributes/properties of a frame • Integer, real number, string • Represent relationships between frames • The value of a slot is the identifier of another frame • Every slot is described by a “slot frame” in a KB that defines meta information about that slot

  50. Slot Links (diagram labels: sdhA, sdhB, sdhC, sdhD • Sdh-flavo, Sdh-Fe-S, Sdh-membrane-1, Sdh-membrane-2 • Succinate dehydrogenase • Enzymatic-reaction • Succinate + FAD = fumarate + FADH2 • TCA Cycle; slot names on the links: product, component-of, catalyzes, reaction, in-pathway)
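To make the frame data model concrete, the succinate dehydrogenase example can be written as a small in-memory knowledge base in which each frame is a dictionary of slots and slot values are the names of other frames. This is a sketch only: it is not the Pathway Tools storage format or API, and the pairing of each gene with a particular subunit, as well as the direction of each link, is assumed because the diagram did not survive in the transcript.

```python
# Each frame name maps to its slots; each slot holds a list of frame names.
KB = {
    "sdhA": {"product": ["Sdh-flavo"]},
    "sdhB": {"product": ["Sdh-Fe-S"]},
    "sdhC": {"product": ["Sdh-membrane-1"]},   # pairing assumed
    "sdhD": {"product": ["Sdh-membrane-2"]},   # pairing assumed
    "Sdh-flavo": {"component-of": ["Succinate dehydrogenase"]},
    "Sdh-Fe-S": {"component-of": ["Succinate dehydrogenase"]},
    "Sdh-membrane-1": {"component-of": ["Succinate dehydrogenase"]},
    "Sdh-membrane-2": {"component-of": ["Succinate dehydrogenase"]},
    "Succinate dehydrogenase": {"catalyzes": ["Enzymatic-reaction"]},
    "Enzymatic-reaction": {"reaction": ["Succinate + FAD = fumarate + FADH2"]},
    "Succinate + FAD = fumarate + FADH2": {"in-pathway": ["TCA Cycle"]},
    "TCA Cycle": {},
}


def follow(kb, start, slot_path):
    """Follow a chain of slots from one frame and return the frames reached."""
    frontier = [start]
    for slot in slot_path:
        frontier = [value for frame in frontier for value in kb[frame].get(slot, [])]
    return frontier


def dangling_references(kb):
    """Slot values that do not name a frame in the knowledge base."""
    return sorted(
        value
        for slots in kb.values()
        for values in slots.values()
        for value in values
        if value not in kb
    )


if __name__ == "__main__":
    path = ["product", "component-of", "catalyzes", "reaction", "in-pathway"]
    print(follow(KB, "sdhA", path))
    print(dangling_references(KB))
```

Following the slot chain from a gene to its pathway is the kind of traversal that a query against a PGDB has to spell out, which is why such queries must name the classes and slots of the schema.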
