The Pathway Tools Schema
Motivations for Understanding Schema • Pathway Tools visualizations and analyses depend upon the software being able to find precise information in precise places within a Pathway/Genome DB • When writing complex queries to PGDBs, those queries must name classes and slots within the schema • A Pathway/Genome Database is a web of interconnected objects; each object represents a biological entity
Reference • Pathway Tools User’s Guide, Volume I • Appendix A: Guide to the Pathway Tools Schema
Web of Relationships for One Enzyme Succinate + FAD = fumarate + FADH2 Enzymatic-reaction Succinate dehydrogenase Sdh-flavo Sdh-Fe-S Sdh-membrane-1 Sdh-membrane-2 sdhC sdhD sdhA sdhB TCA Cycle
Frame Data Model • Frame Data Model -- organizational structure for a PGDB • Knowledge base (KB, Database, DB) • Frames • Slots • Facets • Annotations
Knowledge Base • Collection of frames and their associated slots, values, facets, and annotations • AKA: Database, PGDB • Can be stored within • An Oracle DB • A disk file • A Pathway Tools binary program
Frames • Entities with which facts are associated • Kinds of frames: • Classes: Genes, Pathways, Biosynthetic Pathways • Instances (objects): trpA, TCA cycle • Classes: • Superclass(es) • Subclass(es) • Instance(s) • A symbolic frame name (id, key) uniquely identifies each frame
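The frame concepts on this slide can be illustrated with a minimal sketch. It is a toy in-memory model written for this transcript, not the Pathway Tools or Ocelot implementation; the `Frame` type, the `kb` dictionary, the helper names, and the parent links chosen for the example frames are all assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class Frame:
    """Toy frame: a class or an instance, identified by a unique frame ID."""
    frame_id: str
    is_class: bool
    # For a class: its superclasses. For an instance: the classes it belongs to.
    parents: list = field(default_factory=list)

# Frame IDs are the dictionary keys, so they are unique within this toy KB.
kb = {}

def add_frame(frame_id, is_class, parents=()):
    """Create a frame, refusing to reuse an existing frame ID."""
    if frame_id in kb:
        raise ValueError(f"duplicate frame ID: {frame_id}")
    kb[frame_id] = Frame(frame_id, is_class, list(parents))
    return kb[frame_id]

def all_superclasses(frame_id):
    """Return every class reachable by following parent links upward."""
    seen = []
    stack = list(kb[frame_id].parents)
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.append(parent)
            stack.extend(kb[parent].parents)
    return seen

# Illustrative frames; the hierarchy shown here is assumed for the example.
add_frame("Pathways", is_class=True)
add_frame("Biosynthetic-Pathways", is_class=True, parents=["Pathways"])
add_frame("Genes", is_class=True)
add_frame("trpA", is_class=False, parents=["Genes"])

print(all_superclasses("Biosynthetic-Pathways"))
print(all_superclasses("trpA"))
```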
Frame IDs • Naming conventions for frame IDs • Uniqueness of frame IDs • Frame IDs must be unique within a PGDB • Goal: Same frame ID within different PGDBs should refer to the same biological entity • Because many frames are imported from MetaCyc, this helps ensure consistency of frame names • Frame IDs for newly created frames (not imported) are generated by Pathway Tools • Those frame IDs contain a PGDB-specific identifier • Example: CPLXzz-nnnn CPLXB3-0035
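As a rough illustration of the CPLXzz-nnnn convention, the sketch below builds the next unused ID for a given prefix and PGDB-specific identifier. The slide does not say how Pathway Tools actually allocates numbers, so the function `next_frame_id`, the four-digit padding, and the "highest number plus one" rule are assumptions inferred only from the example CPLXB3-0035.

```python
def next_frame_id(prefix, pgdb_code, existing_ids):
    """Return an unused ID of the assumed form <prefix><pgdb_code>-<nnnn>."""
    stem = f"{prefix}{pgdb_code}-"
    # Collect the numeric suffixes already used with this stem.
    used = [
        int(frame_id[len(stem):])
        for frame_id in existing_ids
        if frame_id.startswith(stem) and frame_id[len(stem):].isdigit()
    ]
    number = max(used, default=0) + 1
    return f"{stem}{number:04d}"

existing = {"CPLXB3-0035", "CPLXB3-0007"}
print(next_frame_id("CPLX", "B3", existing))
```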
Slots • Encode attributes/properties of a frame • Integer, real number, string, symbols • Represent relationships between frames • The value of a slot is the identifier of another frame • Every slot is described by a “slot frame” in a KB that defines meta information about that slot
Slot Links Succinate + FAD = fumarate + FADH2 Enzymatic-reaction Succinate dehydrogenase Sdh-flavo Sdh-Fe-S Sdh-membrane-1 Sdh-membrane-2 sdhC sdhD sdhA sdhB TCA Cycle in-pathway reaction catalyzes component-of product
Slots • Number of values • Single valued • Multivalued: sets, bags • Slot values • Any LISP object: Integer, real, string, symbol (frame name) • Slotunits define properties of slots: datatypes, classes, constraints • Two slots are inverses if they encode opposite relationships • Slot Product in class Genes • Slot Gene in class Polypeptides
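The inverse-slot idea can be sketched with a toy KB in which adding a value to one slot mirrors it into the inverse slot of the referenced frame. This is only an illustration of the relationship between Product and Gene; the `kb` structure, the `INVERSES` table, and the helper are hypothetical, and the slide does not describe how the real system keeps inverses in sync.

```python
from collections import defaultdict

# Toy KB: frame ID -> slot name -> list of values.
kb = defaultdict(lambda: defaultdict(list))

# Inverse pair named on the slide: Product (in Genes) and Gene (in Polypeptides).
INVERSES = {"Product": "Gene", "Gene": "Product"}

def add_slot_value(frame, slot, value):
    """Add value to frame.slot and mirror the link in the inverse slot, if any."""
    if value not in kb[frame][slot]:
        kb[frame][slot].append(value)
    inverse = INVERSES.get(slot)
    if inverse and frame not in kb[value][inverse]:
        kb[value][inverse].append(frame)

add_slot_value("sdhA", "Product", "Sdh-flavo")
print(kb["Sdh-flavo"]["Gene"])  # ['sdhA']
```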
Representation of Function EC# Keq Succinate + FAD = fumarate + FADH2 Cofactors Inhibitors Enzymatic-reaction Molecular wt pI Succinate dehydrogenase Sdh-flavo Sdh-Fe-S Sdh-membrane-1 Sdh-membrane-2 sdhC sdhD sdhA sdhB TCA Cycle Left-end-position
Monofunctional Monomer Pathway Reaction Enzymatic-reaction Monomer Gene
Bifunctional Monomer Pathway Reaction Reaction Enzymatic-reaction Enzymatic-reaction Monomer Gene
Monofunctional Multimer Pathway Reaction Enzymatic-reaction Multimer Monomer Monomer Monomer Monomer Gene Gene Gene Gene
Pathway and Substrates Reactant-1 Pathway left in-pathway Reactant-2 Reaction Reaction Reaction Reaction Product-1 right Product-2
Transcriptional Regulation trp Int005 apoTrpR Int001 TrpR*trp site001 pro001 Int003 RpoSig70 trpL trpLEDCBA trpE trpD trpC trpB trpA
Principal Classes • Class names are capitalized, plural, separated by dashes • Genetic-Elements, with subclasses: • Chromosomes • Plasmids • Genes • Transcription-Units • RNAs • rRNAs, snRNAs, tRNAs, Charged-tRNAs • Proteins, with subclasses: • Polypeptides • Protein-Complexes
Principal Classes • Reactions, with subclasses: • Transport-Reactions • Enzymatic-Reactions • Pathways • Compounds-And-Elements
Slots in Multiple Classes • Common-Name • Synonyms • Comment • Citations • DB-Links
Genes Slots • Component-Of (links to replicon, transcription unit) • Left-End-Position • Right-End-Position • Centisome-Position • Transcription-Direction • Product
Proteins Slots • Molecular-Weight-Seq • Molecular-Weight-Exp • pI • Locations • Modified-Form • Unmodified-Form • Component-Of
Polypeptides Slots • Gene
Protein-Complexes Slots • Components
Reactions Slots • EC-Number • Left, Right • DeltaG0 • Keq • Spontaneous?
Enzymatic-Reactions Slots • Enzyme • Reaction • Activators • Inhibitors • Physiologically-Relevant • Cofactors • Prosthetic-Groups • Alternative-Substrates • Alternative-Cofactors
Pathways Slots • Reaction-List • Predecessors • Primaries
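The slots listed on the preceding slides are enough to walk from a pathway to the genes that encode its enzymes: Reaction-List, then the enzymatic reactions whose Reaction slot points at those reactions, then Enzyme, Components, and Gene. The sketch below does this over a toy dictionary modelled on the succinate dehydrogenase example. The frame IDs (`TCA-Cycle`, `SUCC-DEHYD-RXN`, `SUCC-DEHYD-ENZRXN`) and the pairing of membrane subunits with sdhC and sdhD are illustrative assumptions, and the code does not use the Pathway Tools API.

```python
# Toy KB: frame ID -> {slot name: list of values}. Slot names follow the slides.
kb = {
    "TCA-Cycle": {"Reaction-List": ["SUCC-DEHYD-RXN"]},
    "SUCC-DEHYD-ENZRXN": {
        "Reaction": ["SUCC-DEHYD-RXN"],
        "Enzyme": ["Succinate-dehydrogenase"],
    },
    "Succinate-dehydrogenase": {
        "Components": ["Sdh-flavo", "Sdh-Fe-S", "Sdh-membrane-1", "Sdh-membrane-2"],
    },
    "Sdh-flavo": {"Gene": ["sdhA"]},
    "Sdh-Fe-S": {"Gene": ["sdhB"]},
    "Sdh-membrane-1": {"Gene": ["sdhC"]},
    "Sdh-membrane-2": {"Gene": ["sdhD"]},
}
# Instances of Enzymatic-Reactions in this toy KB.
ENZYMATIC_REACTIONS = ["SUCC-DEHYD-ENZRXN"]

def values(frame, slot):
    """Return all values of frame.slot, or an empty list if there are none."""
    return kb.get(frame, {}).get(slot, [])

def genes_of_protein(protein):
    """Polypeptides carry a Gene slot; complexes are expanded via Components."""
    genes = list(values(protein, "Gene"))
    for component in values(protein, "Components"):
        genes.extend(genes_of_protein(component))
    return genes

def genes_of_pathway(pathway):
    """Collect the genes behind every enzyme that catalyzes a pathway reaction."""
    reactions = set(values(pathway, "Reaction-List"))
    genes = []
    for enzrxn in ENZYMATIC_REACTIONS:
        if reactions & set(values(enzrxn, "Reaction")):
            for enzyme in values(enzrxn, "Enzyme"):
                genes.extend(genes_of_protein(enzyme))
    return sorted(set(genes))

print(genes_of_pathway("TCA-Cycle"))  # ['sdhA', 'sdhB', 'sdhC', 'sdhD']
```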
GKB Editor • Browse class hierarchy and slot definitions • Tools -> Ontology Browser • GKB Editor described at • http://www.ai.sri.com/~gkb/user-man.html
Introduction • MANY ways to access and update PGDBs • APIs in Java, Perl, and Lisp • Import/export of files in many formats • Registry of Pathway/Genome Databases • Import PGDB data into BioWarehouse • Updating a PGDB from an external genome DB
Pathway Tools APIs • Support programmatic queries and updates to PGDBs • APIs in Java, Perl, and Lisp all provide access to a common set of procedures: • Generic Frame Protocol -- Ocelot object database API • Additional Pathway Tools functions • For more information see • http://bioinformatics.ai.sri.com/ptools/ptools-resources.html
Generic Frame Protocol (GFP) • A library of procedures for accessing Ocelot DBs • GFP specification: • http://www.ai.sri.com/~gfp/spec/paper/paper.html • A small number of GFP functions are sufficient for most complex queries • Knowledge of Pathway Tools schema is critical for using the APIs: • Appendix I of Pathway Tools User’s Guide, Vol I
Generic Frame Protocol • get-class-all-instances (Class) • Returns the instances of Class • Key Pathway Tools classes: • Genetic-Elements • Genes • Proteins • Polypeptides (a subclass of Proteins) • Protein-Complexes (a subclass of Proteins) • Pathways • Reactions • Compounds-And-Elements • Enzymatic-Reactions • Transcription-Units • Promoters • DNA-Binding-Sites
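A toy emulation may help show what a call such as get-class-all-instances has to do when classes have subclasses. The Python function below only borrows the name; it assumes that "all" means direct instances plus the instances of every subclass, and the tables and frame IDs are illustrative rather than taken from a real PGDB.

```python
# Toy class hierarchy: class -> direct subclasses (as listed on the slides).
SUBCLASSES = {
    "Proteins": ["Polypeptides", "Protein-Complexes"],
    "Polypeptides": [],
    "Protein-Complexes": [],
}
# Class -> direct instances (illustrative frame IDs).
DIRECT_INSTANCES = {
    "Proteins": [],
    "Polypeptides": ["Sdh-flavo", "Sdh-Fe-S"],
    "Protein-Complexes": ["Succinate-dehydrogenase"],
}

def get_class_all_instances(cls):
    """Return direct instances of cls plus the instances of all its subclasses."""
    found = list(DIRECT_INSTANCES.get(cls, []))
    for sub in SUBCLASSES.get(cls, []):
        found.extend(get_class_all_instances(sub))
    return found

print(get_class_all_instances("Proteins"))
```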
Generic Frame Protocol • Notation Frame.Slot means a specified slot of a specified frame • get-slot-value(Frame Slot) • Returns first value of Frame.Slot • get-slot-values(Frame Slot) • Returns all values of Frame.Slot as a list • slot-has-value-p(Frame Slot) • Returns T if Frame.Slot has at least one value • member-slot-value-p(Frame Slot Value) • Returns T if Value is one of the values of Frame.Slot • print-frame(Frame) • Prints the contents of Frame • Note: Frame and Slot must be symbols!
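The read operations above can be mimicked over a plain dictionary to make their return values concrete. This is a toy Python emulation of the behaviour described on the slide, not the Lisp, Java, or Perl API itself; the `kb` contents are illustrative, and returning `None` or `False` where Lisp would return NIL is a choice made for this sketch.

```python
# Toy KB: frame ID -> {slot name: list of values}.
kb = {
    "Sdh-flavo": {"Gene": ["sdhA"], "Component-Of": ["Succinate-dehydrogenase"]},
    "sdhA": {"Product": ["Sdh-flavo"], "Synonyms": []},
}

def get_slot_values(frame, slot):
    """Return all values of frame.slot as a list."""
    return list(kb[frame].get(slot, []))

def get_slot_value(frame, slot):
    """Return the first value of frame.slot, or None if the slot is empty."""
    vals = get_slot_values(frame, slot)
    return vals[0] if vals else None

def slot_has_value_p(frame, slot):
    """Return True if frame.slot has at least one value."""
    return bool(get_slot_values(frame, slot))

def member_slot_value_p(frame, slot, value):
    """Return True if value is one of the values of frame.slot."""
    return value in get_slot_values(frame, slot)

print(get_slot_value("Sdh-flavo", "Gene"))
print(slot_has_value_p("sdhA", "Synonyms"))
```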
Generic Frame Protocol • coercible-to-frame-p (Thing) • Returns T if Thing is the name of a frame, or a frame object • save-kb • Saves the current KB
Generic Frame Protocol: Update Operations • put-slot-value(Frame Slot Value) • Replace the current value(s) of Frame.Slot with Value • put-slot-values(Frame Slot Value-List) • Replace the current value(s) of Frame.Slot with Value-List, which must be a list of values • add-slot-value(Frame Slot Value) • Add Value to the current value(s) of Frame.Slot, if any • remove-slot-value(Frame Slot Value) • Remove Value from the current value(s) of Frame.Slot • replace-slot-value(Frame Slot Old-Value New-Value) • In Frame.Slot, replace Old-Value with New-Value • remove-local-slot-values(Frame Slot) • Remove all of the values of Frame.Slot
Additional Pathway Tools Functions: Semantic Inference Layer • Semantic inference layer defines built-in functions to compute commonly required relationships in a PGDB • http://bioinformatics.ai.sri.com/ptools/ptools-fns.html
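The difference between the update operations is easiest to see by applying them in sequence to one slot. The sketch below is a toy Python emulation of the semantics stated on the slide, with placeholder values such as `syn-1`; it is not the Pathway Tools API, and details such as whether duplicates are allowed are assumptions.

```python
# Toy KB: frame ID -> {slot name: list of values}. Values are placeholders.
kb = {"sdhA": {"Synonyms": ["syn-1"]}}

def put_slot_values(frame, slot, value_list):
    """Replace the current value(s) of frame.slot with value_list."""
    kb[frame][slot] = list(value_list)

def put_slot_value(frame, slot, value):
    """Replace the current value(s) of frame.slot with a single value."""
    put_slot_values(frame, slot, [value])

def add_slot_value(frame, slot, value):
    """Add value to the current value(s) of frame.slot, if any."""
    kb[frame].setdefault(slot, []).append(value)

def remove_slot_value(frame, slot, value):
    """Remove value from the current value(s) of frame.slot."""
    kb[frame][slot] = [v for v in kb[frame].get(slot, []) if v != value]

def replace_slot_value(frame, slot, old_value, new_value):
    """In frame.slot, replace old_value with new_value, keeping its position."""
    kb[frame][slot] = [new_value if v == old_value else v
                       for v in kb[frame].get(slot, [])]

def remove_local_slot_values(frame, slot):
    """Remove all of the values of frame.slot."""
    kb[frame][slot] = []

add_slot_value("sdhA", "Synonyms", "syn-2")
replace_slot_value("sdhA", "Synonyms", "syn-1", "syn-3")
print(kb["sdhA"]["Synonyms"])  # ['syn-3', 'syn-2']
```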
Internal note • Note: Refer to local copy of ptools-fns.html to go through the semantic inference layer fns
File Import/Export Capabilities • PGDBs can be exported in whole or part to: • SBML – Systems Biology Markup Language – sbml.org • Import supported by many simulation packages • File -> Export -> Selected Reactions to SBML File • Pathway Tools Attribute-Value format and column-delimited format files • http://brg.ai.sri.com/ptools/flatfile-format.shtml • Dump entire PGDB to a suite of files: File -> Export -> Entire DB to Flat Files • Dump selected frames to a single file: File -> Export -> Selected Frames to File
Import/Export • Import from attribute-value or column-delimited files • File -> Import -> Frames From File • Import/Export to/from internal Pathway Tools format that allows pathways, reactions, enzymes, and compounds to be easily moved between Pathway Tools installations • Edit -> Add Pathway to File Export List • File -> Export -> Selected Pathways to File • File -> Import -> Pathways from File • Import/Export to/from MDL molfile format • Edit -> Import compound structure from molfile • Edit -> Export compound structure to molfile
Miscellaneous Exports • Overview -> Highlight -> Save to File • Overview -> Highlight -> Load from File • Gene / Protein Sequence / Save to file • Chromosome -> Show Sequence of a Segment of Replicon
Napster Comes to Bioinformatics • Public sharing of Pathway/Genome Databases • PGDB registry maintained by SRI at URL http://biocyc.org/registry.html • Registry operations • List contents of registry • Download PGDBs listed in the registry • Register PGDBs you have created
Registry Details • Why register your PGDB? • Declare existence of your PGDB in a central location • Facilitate download by other scientists • Why download a PGDB? • Desktop Navigator provides more functionality than Web • Comparative operations • Programmatic querying and processing of PGDB • Registration process • Registered PGDBs have open availability by default • Authors can provide their own license agreements • Registered PGDBs reside on authors’ FTP site
BioWarehouse • Biospice.org
New Import/Export Tools • Suggestions? • Volunteers?
Updating a PGDB From an External Genome DB • Example: AraCyc forms a pathway module to the TAIR DB • TAIR is authoritative source for gene and gene-product information • Update AraCyc to reflect updates in TAIR
Proposed Approach • Export TAIR to PathoLogic files • Build AraCyc2 from those PathoLogic files – automated PathoLogic only • Compare AraCyc1 (A1) to AraCyc2 (A2) A. Import new genes/proteins from A2 to A1 B. Delete from A1 genes/proteins not found in A2 C. Rename genes/proteins whose names changed from A2 to A1 • Run name matcher on A1’ • Check for pathways with no enzymes and report them so user can keep any that otherwise PathoLogic will delete • What about enzymes that were assigned to a pathway by the hole filler? • Re-run pathway predictor • Remember what pathways user deletes so they are not re-predicted by PathoLogic • Consider movement of genes from contig to chromosome
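Steps A to C of the comparison amount to set operations over the two builds. The sketch below assumes that each gene or protein carries a stable identifier shared by A1 and A2, which the slide does not state, and represents each build as a dictionary from that identifier to the current name. The function name `plan_update` and the IDs `G1` to `G4` are hypothetical.

```python
def plan_update(a1, a2):
    """Compare two builds given as {stable ID: name} dictionaries.

    Returns (to_import, to_delete, to_rename):
      to_import - IDs present in A2 but not in A1 (step A)
      to_delete - IDs present in A1 but not in A2 (step B)
      to_rename - (ID, old name, new name) where the name differs (step C)
    """
    ids1, ids2 = set(a1), set(a2)
    to_import = sorted(ids2 - ids1)
    to_delete = sorted(ids1 - ids2)
    to_rename = sorted(
        (gene_id, a1[gene_id], a2[gene_id])
        for gene_id in ids1 & ids2
        if a1[gene_id] != a2[gene_id]
    )
    return to_import, to_delete, to_rename

# Hypothetical identifiers and names, for illustration only.
a1 = {"G1": "geneA", "G2": "geneB", "G3": "geneC"}
a2 = {"G1": "geneA", "G3": "geneC-renamed", "G4": "geneD"}

print(plan_update(a1, a2))
```

Keeping the plan separate from its application would let a curator review the proposed imports, deletions, and renames before A1 is modified.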