PathoLogic Pathway Predictor

Inference of Metabolic Pathways
[Figure: PathoLogic software integrates genome and pathway data to identify putative metabolic networks. Inputs: an annotated genomic sequence (DNA sequences, genes/ORFs, gene products) and the multi-organism pathway database MetaCyc (pathways, reactions, compounds). Output: a Pathway/Genome Database (pathways, reactions, compounds, gene products, genes, genomic map).]

PathoLogic Functionality
• Initialize schema for new PGDB
• Transform existing genome to PGDB form
• Infer metabolic pathways and store in PGDB
• Infer operons and store in PGDB
• Assemble Overview diagram
• Assist user with manual tasks
  • Assign enzymes to reactions they catalyze
  • Identify false-positive pathway predictions
  • Build protein complexes from monomers
  • Infer transport reactions

PathoLogic Input/Output
• Inputs:
  • File listing genetic elements
    • http://bioinformatics.ai.sri.com/ptools/genetic-elements.dat
  • Files containing DNA sequence for each genetic element
  • Files containing annotation for each genetic element
  • MetaCyc database
• Output:
  • Pathway/genome database for the subject organism
  • Reports that summarize:
    • Evidence contained in the input genome for the presence of reference pathways
    • Reactions missing from inferred pathways

PathoLogic Analysis Phases
• Trial parsing of input data files [few days]
• Initialize schema of new PGDB [3 min]
• Create DB objects for replicons, genes, proteins [5 min]
• Assign enzymes to reactions they catalyze [10 min / 1 week]
  • ferrochelatase
  • glutamate 1-semialdehyde 2,1-aminomutase
  • porphobilinogen deaminase
[Figure: schematic pathway diagram (labels A to G, E1, E2)]

PathoLogic Analysis Phases
• From assigned reactions, infer what pathways are present [5 min / few days]
• Define metabolic overview diagram [30 min]
• Define protein complexes [few days]

genetic-elements.dat

    ID          TEST-CHROM-1
    NAME        Chromosome 1
    TYPE        :CHRSM
    CIRCULAR?   N
    ANNOT-FILE  chrom1.pf
    SEQ-FILE    chrom1.fsa
    //
    ID          TEST-CHROM-2
    NAME        Chromosome 2
    CIRCULAR?   N
    ANNOT-FILE  /mydata/chrom2.gbk
    SEQ-FILE    /mydata/chrom2.fna
    //

File Naming Conventions
• One pair of sequence and annotation files for each genetic element
• Sequence files: FASTA format
  • suffix .fsa or .fna
• Annotation file:
  • GenBank format: suffix .gbk
  • PathoLogic format: suffix .pf
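The listing file and the naming conventions above can be checked before a trial parse. The following is a minimal sketch, not part of PathoLogic: it assumes the attribute and value on each line are separated by a tab and that lines starting with ';' are comments (both stated on the slides only for the PathoLogic annotation format). The function names are hypothetical.

```python
import os

# Suffixes taken from the File Naming Conventions slide.
SEQ_SUFFIXES = {".fsa", ".fna"}
ANNOT_SUFFIXES = {".gbk", ".pf"}

# The two example records from the genetic-elements.dat slide.
SAMPLE = """\
ID\tTEST-CHROM-1
NAME\tChromosome 1
TYPE\t:CHRSM
CIRCULAR?\tN
ANNOT-FILE\tchrom1.pf
SEQ-FILE\tchrom1.fsa
//
ID\tTEST-CHROM-2
NAME\tChromosome 2
CIRCULAR?\tN
ANNOT-FILE\t/mydata/chrom2.gbk
SEQ-FILE\t/mydata/chrom2.fna
//
"""


def parse_genetic_elements(text):
    """Return one dict per record; records are terminated by a '//' line."""
    elements, current = [], {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(";"):  # assumption: ';' marks a comment
            continue
        if line == "//":
            if current:
                elements.append(current)
            current = {}
            continue
        key, _, value = line.partition("\t")  # assumption: tab-delimited
        current[key.strip()] = value.strip()
    if current:  # tolerate a final record without '//'
        elements.append(current)
    return elements


def check_file_names(elements):
    """Report genetic elements whose file pair is missing or oddly named."""
    problems = []
    for el in elements:
        ident = el.get("ID", "<no ID>")
        for attr, allowed in (("ANNOT-FILE", ANNOT_SUFFIXES),
                              ("SEQ-FILE", SEQ_SUFFIXES)):
            path = el.get(attr)
            if path is None:
                problems.append(f"{ident}: missing {attr}")
            elif os.path.splitext(path)[1].lower() not in allowed:
                problems.append(f"{ident}: unexpected suffix in {attr} {path}")
    return problems


if __name__ == "__main__":
    for problem in check_file_names(parse_genetic_elements(SAMPLE)):
        print(problem)
```

Run on the sample records, the check reports no problems; it only inspects names and does not open the files.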

Typical Problems Using GenBank Files With PathoLogic
• Wrong qualifier names used: read the PathoLogic documentation!
• Extraneous information in a given qualifier
• Check the results of the trial parse carefully

GenBank File Format
• Accepted feature types:
  • CDS, tRNA, rRNA, misc_RNA
• Accepted qualifiers:
  • /locus_tag: unique ID [recommended]
  • /gene: gene name [required]
  • /product [required]
  • /EC_number [recommended]
  • /product_comment [optional]
  • /gene_comment [optional]
  • /alt_name: synonyms [optional]
  • /pseudo: gene is a pseudogene [optional]
• For multifunctional proteins, put each function in a separate /product line
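The accepted feature types and qualifiers listed above lend themselves to a simple pre-check of GenBank features. This is a hedged sketch, not PathoLogic's own parser: it takes a feature type and a plain dict of qualifiers (however you extracted them), and how PathoLogic treats qualifiers that are not on the list is not stated on the slide, so the sketch only warns about them. `check_feature` is a hypothetical name.

```python
ACCEPTED_FEATURES = {"CDS", "tRNA", "rRNA", "misc_RNA"}

# Qualifier status as given on the slide.
QUALIFIERS = {
    "locus_tag": "recommended",
    "gene": "required",
    "product": "required",
    "EC_number": "recommended",
    "product_comment": "optional",
    "gene_comment": "optional",
    "alt_name": "optional",
    "pseudo": "optional",
}


def check_feature(feature_type, qualifiers):
    """Return (errors, warnings) for one feature.

    `qualifiers` maps a qualifier name (without the leading '/')
    to the list of its values.
    """
    errors, warnings = [], []
    if feature_type not in ACCEPTED_FEATURES:
        errors.append(f"feature type {feature_type} is not accepted")
    for name in qualifiers:
        if name not in QUALIFIERS:
            # Assumption: unlisted qualifiers are only flagged, not rejected.
            warnings.append(f"/{name} is not a listed qualifier")
    for name, status in QUALIFIERS.items():
        if name in qualifiers:
            continue
        if status == "required":
            errors.append(f"missing required /{name}")
        elif status == "recommended":
            warnings.append(f"missing recommended /{name}")
    return errors, warnings


if __name__ == "__main__":
    print(check_feature("CDS", {"gene": ["deoD"]}))
```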

PathoLogic File Format
• Each record starts with a line containing an ID attribute
• Tab delimited
• Each record ends with a line containing //
• One attribute-value pair is allowed per line
• Use multiple FUNCTION lines for multifunctional proteins
• Lines starting with ‘;’ are comment lines
• Valid attributes are:
  • ID, NAME, SYNONYM
  • STARTBASE, ENDBASE, GENE-COMMENT
  • FUNCTION, PRODUCT-TYPE, EC, FUNCTION-COMMENT
  • DBLINK
  • INTRON

PathoLogic File Format

    ID            TP0734
    NAME          deoD
    STARTBASE     799084
    ENDBASE       799785
    FUNCTION      purine nucleoside phosphorylase
    DBLINK        PID:g3323039
    PRODUCT-TYPE  P
    GENE-COMMENT  similar to GP:1638807 percent identity: 57.51; identified by sequence similarity; putative
    //
    ID            TP0735
    NAME          gltA
    STARTBASE     799867
    ENDBASE       801423
    FUNCTION      glutamate synthase
    DBLINK        PID:g3323040
    PRODUCT-TYPE  P
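The record rules above (ID starts a record, // ends it, tab-delimited attribute-value pairs, ';' comment lines, repeated FUNCTION lines) can be illustrated with a small reader. This is a sketch under those stated rules only; it is not the PathoLogic parser, and `parse_pf` is a hypothetical name. Every attribute is stored as a list so that repeated FUNCTION lines are preserved.

```python
# The two example records from the slide; the second has no closing '//'.
PF_SAMPLE = """\
ID\tTP0734
NAME\tdeoD
STARTBASE\t799084
ENDBASE\t799785
FUNCTION\tpurine nucleoside phosphorylase
DBLINK\tPID:g3323039
PRODUCT-TYPE\tP
GENE-COMMENT\tsimilar to GP:1638807 percent identity: 57.51; identified by sequence similarity; putative
//
ID\tTP0735
NAME\tgltA
STARTBASE\t799867
ENDBASE\t801423
FUNCTION\tglutamate synthase
DBLINK\tPID:g3323040
PRODUCT-TYPE\tP
"""


def parse_pf(text):
    """Parse PathoLogic-format text into a list of {attribute: [values]} dicts."""
    records, current = [], {}
    for raw in text.splitlines():
        if not raw.strip() or raw.startswith(";"):  # blank or comment line
            continue
        if raw.strip() == "//":                     # end of record
            if current:
                records.append(current)
            current = {}
            continue
        attr, _, value = raw.partition("\t")        # one pair per line
        current.setdefault(attr.strip(), []).append(value.strip())
    if current:                                     # record without a final '//'
        records.append(current)
    return records


if __name__ == "__main__":
    for rec in parse_pf(PF_SAMPLE):
        length = int(rec["ENDBASE"][0]) - int(rec["STARTBASE"][0]) + 1
        print(rec["ID"][0], rec["NAME"][0], length, rec["FUNCTION"])
```

A multifunctional protein simply yields a FUNCTION list with more than one entry.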

Before you start: What to do when an error occurs
• Most Navigator errors are automatically trapped; debugging information is saved to the error.tmp file.
• All other errors (including most PathoLogic errors) will cause the software to drop into the Lisp debugger
  • Unix: the error message will show up in the original terminal window from which you started Pathway Tools.
  • Windows: the error message will show up in the Lisp console. The Lisp console usually starts out iconified; its icon is a blue bust of Franz Liszt
• 2 goals when an error occurs:
  • Try to continue working
  • Obtain enough information for a bug report to send to the pathway-tools support team.

The Lisp Debugger
• Sample error (details and number of restart actions differ for each case):

    Error: Received signal number 2 (Keyboard interrupt)
    Restart actions (select using :continue):
     0: continue computation
     1: Return to command level
     2: Pathway Tools version 10.0 top level
     3: Exit Pathway Tools version 10.0
    [1c] EC(2):

• To generate debugging information (stack backtrace): :zoom :count :all
• To continue from an error, find a restart that takes you to the top level (in this case, number 2): :cont 2
• To exit Pathway Tools: :exit

How to report an error
• Determine if the problem is reproducible, and how to reproduce it (make sure you have all the latest patches installed)
• Send email to ptools-support@ai.sri.com containing:
  • Pathway Tools version number and platform
  • Description of exactly what you were doing (which command you invoked, what you typed, etc.) or instructions for how to reproduce the problem
  • error.tmp file, if one was generated
  • If the software breaks into the Lisp debugger, the complete error message and stack backtrace (obtained using the command :zoom :count :all, as described on the previous slide)

Using the PPP GUI to Create a Pathway/Genome Database
• Input Project Information
  • Organism -> Create New

Input Project Information

Next Steps
• Trial Parse
  • Build -> Trial Parse
  • Fix any errors in input files
• Build pathway/genome database
  • Build -> Automated Build

PathoLogic Parser Output

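Assign Enzymes to Reactions
[Figure: flowchart. A gene product name (example: UDP-glucose-4-epimerase) is matched against MetaCyc. Match? yes: assign to the reaction (example: EC 5.1.3.2, UDP-D-glucose ↔ UDP-galactose). no: is it a probable enzyme (name contains "-ase")? no: not a metabolic enzyme. yes: manually search; if a reaction is found, assign, otherwise can't assign.]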

Enzyme Name Matcher
• Matches on full enzyme name
• Match is case-insensitive and removes the punctuation characters “ -_(){}',:”
• Also matches after removal of prefixes and suffixes such as:
  • “Putative”, “Hypothetical”, etc.
  • alpha|beta|…|catalytic|inducible chain|subunit|component
  • Parenthetical gene name

Enzyme Name Matcher
• For names that do not match, the software identifies probable metabolic enzymes as those
  • Containing “ase”
  • Not containing keywords such as
    • “sensor kinase”
    • “topoisomerase”
    • “protein kinase”
    • “peptidase”
    • etc.
• Research unknown enzymes
  • MetaCyc, Swiss-Prot, PubMed
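The matching rules on the two Enzyme Name Matcher slides can be approximated in a few lines. This is an illustrative sketch, not the PathoLogic implementation: the prefix, suffix, and excluded-keyword lists are only the examples given on the slides (which end in "etc."), the regular expressions are guesses at the intent of those rules, and the one-entry reference table stands in for MetaCyc. All function names are hypothetical.

```python
import re

PUNCTUATION = " -_(){}',:"               # characters removed before comparing
PREFIXES = ("putative", "hypothetical")  # slide examples only
SUFFIX_RE = re.compile(
    r"[\s,]+(alpha|beta|catalytic|inducible)\s+(chain|subunit|component)$"
)
GENE_NAME_RE = re.compile(r"\s*\([a-z]{3}[A-Z0-9]?\)\s*$")  # e.g. "(galE)"
EXCLUDED_KEYWORDS = ("sensor kinase", "topoisomerase",
                     "protein kinase", "peptidase")


def canonical(name):
    """Lower-case a name and strip gene name, prefixes, suffix, punctuation."""
    s = GENE_NAME_RE.sub("", name.strip()).lower()
    stripped = True
    while stripped:                      # allow several stacked prefixes
        stripped = False
        for prefix in PREFIXES:
            if s.startswith(prefix + " "):
                s = s[len(prefix) + 1:]
                stripped = True
    s = SUFFIX_RE.sub("", s)
    return s.translate(str.maketrans("", "", PUNCTUATION))


def is_probable_enzyme(name):
    """Contains 'ase' and none of the excluded keywords."""
    low = name.lower()
    return "ase" in low and not any(k in low for k in EXCLUDED_KEYWORDS)


def classify(name, index):
    """Assign on a name match, else flag as probable enzyme or non-enzyme."""
    key = canonical(name)
    if key in index:
        return ("assign", index[key])
    if is_probable_enzyme(name):
        return ("probable enzyme", None)
    return ("not a metabolic enzyme", None)


# Hypothetical stand-in for the MetaCyc enzyme-name table.
REFERENCE_NAMES = {"UDP-glucose 4-epimerase": "5.1.3.2"}
INDEX = {canonical(n): ec for n, ec in REFERENCE_NAMES.items()}

if __name__ == "__main__":
    for n in ("Putative UDP-glucose-4-epimerase (galE)",
              "hypothetical oxidoreductase",
              "DNA topoisomerase I"):
        print(n, "->", classify(n, INDEX))
```

Names flagged as probable enzymes are the ones that would go on to manual research; a plain substring test for "ase" will also catch words such as "phase", which is one reason the manual step matters.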

Enzyme Name to Reaction Mapping
• See also file PTools Tutorial/PathoLogic Reports/name-matching-report.txt

Manual Polishing
• Refine -> Assign Probable Enzymes (do this first)
• Refine -> Rescore Pathways (redo after assigning enzymes)
• Refine -> Create Protein Complexes (can be done at any time)
• Refine -> Assign Modified Proteins (can be done at any time)
• Refine -> Transport Identification Parser (can be done at any time)
• Refine -> Pathway Hole Filler
• Refine -> Predict Transcription Units
• Refine -> Update Overview (do this last, and repeat after any material changes to the PGDB)

Assign Probable Enzymes

How to find reactions for probable enzymes
• First, verify that the enzyme name describes a specific, metabolic function
• Search for a fragment of the name in MetaCyc; you may be able to find a match that PathoLogic missed
• Look up the protein in Swiss-Prot or other DBs
• Search for the gene name in a PGDB for a related organism (bear in mind that gene names are not reliable indicators of function, so check carefully)
• Search for the function name in PubMed
• Other…

Manual Polishing
• Refine -> Assign Probable Enzymes
• Refine -> Rescore Pathways
• Refine -> Create Protein Complexes
• Refine -> Assign Modified Proteins
• Refine -> Transport Identification Parser
• Refine -> Pathway Hole Filler
• Refine -> Predict Transcription Units
• Refine -> Run Consistency Checker
• Refine -> Update Overview

Automated Pathway Inference
• All pathways in MetaCyc for which there is at least one enzyme identified in the target organism are considered for possible inclusion.
• The algorithm errs on the side of inclusivity: it is easier to manually delete a pathway from an organism than to find a pathway that should have been predicted but wasn’t.

Considerations taken into account when deciding whether or not a pathway should be inferred:
• Is there a unique enzyme, that is, an enzyme not involved in any other pathway?
• Does the organism fall in the expected taxonomic domain of the pathway?
• Is this pathway part of a variant set, and, if so, is there more evidence for some other variant?
• If there is no unique enzyme:
  • Is there evidence for more than one enzyme?
  • If a biosynthetic pathway, is there evidence for the final reaction(s)?
  • If a degradation pathway, is there evidence for the initial reaction(s)?
  • If an energy metabolism pathway, is there evidence for more than half the reactions?

Assigning Evidence Scores to Predicted Pathways
• X|Y|Z denotes the score for pathway P in organism O, where:
  • X = total number of reactions in P
  • Y = number of reactions in P for which there is evidence in O (an enzyme catalyzing the reaction)
  • Z = number of the Y reactions that are also used in other pathways in O
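The X|Y|Z score follows directly from its definition and can be computed from two pieces of data. The sketch below is an illustration with made-up pathway and reaction IDs, not PathoLogic code; it assumes each pathway is given as a list of distinct reaction IDs and that "other pathways in O" means the other pathways in the supplied dictionary. `evidence_score` is a hypothetical name.

```python
def evidence_score(pathway, pathways, reactions_with_enzyme):
    """Return (X, Y, Z) for `pathway`.

    pathways: {pathway name: list of distinct reaction IDs}
    reactions_with_enzyme: set of reaction IDs with an assigned enzyme in O
    """
    reactions = pathways[pathway]
    x = len(reactions)                                  # all reactions in P
    present = [r for r in reactions if r in reactions_with_enzyme]
    y = len(present)                                    # reactions with evidence
    elsewhere = set()
    for name, other in pathways.items():
        if name != pathway:
            elsewhere.update(other)
    z = sum(1 for r in present if r in elsewhere)       # shared with other pathways
    return x, y, z


# Toy data (hypothetical IDs): R3 occurs in both pathways.
PATHWAYS = {
    "P1": ["R1", "R2", "R3", "R4"],
    "P2": ["R3", "R5"],
}
EVIDENCE = {"R1", "R3", "R5"}

if __name__ == "__main__":
    for name in PATHWAYS:
        print(name, "|".join(str(n) for n in evidence_score(name, PATHWAYS, EVIDENCE)))
```

For the toy data, P1 scores 4|2|1: two of its four reactions have evidence, and one of those two (R3) is also used in P2. Y minus Z is the number of evidenced reactions found in no other pathway, which relates to the "unique enzyme" consideration.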

Manual Pruning of Pathways
• Use the pathway evidence report
  • Coloring scheme aids in assessing pathway evidence
• Phase I: Prune extra variant pathways
  • Rescore pathways, re-generate the pathway evidence report
• Phase II: Prune pathways unlikely to be present
  • No/few unique enzymes
  • Most pathway steps present because they are used in another pathway
  • Pathway very unlikely to be present in this organism
  • Nonspecific enzyme name assigned to a pathway step

Caveats
• Cannot predict pathways not present in MetaCyc
• Evidence for short pathways is hard to interpret
• Since many reactions occur in multiple pathways, some false positives are to be expected

Output from PPP
• Pathway/genome database
• Summary pages
  • Pathway evidence page
    • Click “Summary of Organisms”, then click the organism name, then click “Pathway Evidence”, then click “Save Pathway Report”
  • Missing enzymes report
• Directory tree containing sequence files, reports, etc.

Resulting Directory Structure
• ROOT/ptools-local/pgdbs/user/ORGIDcyc/VERSION/
  • input
    • organism.dat
    • organism-init.dat
    • genetic-elements.dat
    • annotation files
    • sequence files
  • reports
    • name-matching-report.txt
    • trial-parse-report.txt
  • kb
    • ORGIDbase.ocelot
  • data
    • overview.graph
• released -> VERSION

Creating Protein Complexes

Complex Subunits Stoichiometries

Manual Polishing
• Refine -> Assign Probable Enzymes
• Refine -> Re-run Name Matcher
• Refine -> Create Protein Complexes
• Refine -> Assign Modified Proteins
• Refine -> Transport Identification Parser
• Refine -> Pathway Hole Filler
• Refine -> Predict Transcription Units
• Refine -> Run Consistency Checker
• Refine -> Update Overview

Proteins as Reaction Substrates

Nomenclature
• WO pair = pair of genes within an operon
• TUB pair = pair of genes at a transcription unit boundary (TUB pairs delineate operons)

Operation of the operon predictor
• For each contiguous gene pair, predict whether the two genes are within the same operon or at a transcription unit boundary
• Use the pairwise predictions to identify potential operons
• Example for contiguous genes A B C D E:
  • AB = TUB pair
  • BC = WO pair
  • CD = WO pair
  • DE = TUB pair
  • operon = BCD
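The step from pairwise predictions to operons amounts to cutting the gene order at every TUB pair. The following minimal sketch reproduces the A to E example from the slide; the function name and the "WO"/"TUB" string labels are choices made for this illustration.

```python
def assemble_units(genes, pair_labels):
    """Group contiguous genes into transcription units.

    genes: genes in chromosomal order
    pair_labels: label for each adjacent pair, "WO" or "TUB";
                 pair_labels[i] describes (genes[i], genes[i + 1])
    """
    if not genes:
        return []
    if len(pair_labels) != len(genes) - 1:
        raise ValueError("need exactly one label per adjacent gene pair")
    units, current = [], [genes[0]]
    for gene, label in zip(genes[1:], pair_labels):
        if label == "WO":          # same operon: extend the current unit
            current.append(gene)
        else:                      # boundary: close the unit, start a new one
            units.append(current)
            current = [gene]
    units.append(current)
    return units


if __name__ == "__main__":
    units = assemble_units(list("ABCDE"), ["TUB", "WO", "WO", "TUB"])
    operons = ["".join(u) for u in units if len(u) > 1]
    print(units, operons)
```

Units with more than one gene are the potential operons; in the slide's example that leaves BCD, with A and E as single-gene units.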

Operon predictor
• Predicts operon gene pairs based on:
  • intergenic distance between genes
  • genes in the same functional class
• Typically used for operon prediction
• We use the method from Salgado et al., PNAS (2000) as a starting point.
  • Uses experimentally verified E. coli data as a training set.
  • Computes the log likelihood of two genes being a WO or TUB pair based on intergenic distance.
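A distance-based log likelihood of this kind can be sketched as the log ratio of two binned frequency distributions, one estimated from known WO pairs and one from known TUB pairs. The code below is an illustration under stated assumptions and not the Pathway Tools implementation: the 10 bp bin width, the pseudocount smoothing, the zero decision threshold, and the toy training distances are all choices made for this example.

```python
import math
from collections import Counter

BIN_WIDTH = 10  # bp per distance bin (assumption)


def bin_of(distance):
    """Map an intergenic distance to a bin; negative values mean overlap."""
    return distance // BIN_WIDTH


def train(wo_distances, tub_distances, pseudocount=1.0):
    """Return ll(distance) = log(P(bin | WO) / P(bin | TUB))."""
    if not wo_distances or not tub_distances:
        raise ValueError("both training sets must be non-empty")
    wo = Counter(bin_of(d) for d in wo_distances)
    tub = Counter(bin_of(d) for d in tub_distances)
    n_bins = len(set(wo) | set(tub))

    def log_likelihood(distance):
        b = bin_of(distance)
        # Pseudocounts keep the ratio finite for bins unseen in one set.
        p_wo = (wo[b] + pseudocount) / (len(wo_distances) + pseudocount * n_bins)
        p_tub = (tub[b] + pseudocount) / (len(tub_distances) + pseudocount * n_bins)
        return math.log(p_wo / p_tub)

    return log_likelihood


def predict(log_likelihood, distance):
    """Label a gene pair; a threshold of 0 is an assumption."""
    return "WO" if log_likelihood(distance) > 0 else "TUB"


# Toy training distances in bp (made up for illustration).
WO_TRAIN = [5, 8, 12, -4, 20]
TUB_TRAIN = [150, 200, 95, 300, 8]

if __name__ == "__main__":
    ll = train(WO_TRAIN, TUB_TRAIN)
    for d in (5, 200):
        print(d, round(ll(d), 3), predict(ll, d))
```

With real data the training sets would be the experimentally verified E. coli gene pairs mentioned above, and the resulting WO/TUB labels would feed the operon assembly step.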

Operon predictor
Additional features easily computed from a PGDB:
1. both gene products are enzymes in the same metabolic pathway
2. both gene products are monomers in the same protein complex
3. one gene product transports a substrate for a metabolic pathway in which the other gene product is involved as an enzyme
4. a gene upstream or downstream from the gene pair (and within the same directon) is related to either one of the genes in the pair as per features 1, 2 and 3 above
