Pathway Tools Meeting - December 1, 2005, Geneva (SIB)
Putting together synteny and metabolic information to achieve relevant expert annotation of microbial genomes
Dr Claudine Médigue

What is MaGe? Yet another bacterial annotation platform!
• Shares functionalities with other existing annotation systems:
  • An automatic annotation process: syntactic and functional annotations; functional annotation and classification inferences
  • A relational database (MySQL) used to store the sequences and the analysis results
  • A web interface allowing multiple users to annotate a genome simultaneously
• Developed by biologists involved in manual expert annotation
• Connectivity to other databases or systems
• Graphical interface which focuses on gene context and synteny results with available bacterial proteomes
• Its development started in Oct. 2002. Context: the Acinetobacter sp. ADP1 genome annotation (Summer 2004)

Introduction to the Prokaryotic Genome DataBase (PkGDB)
• Complete bacterial genomes (RefSeq NCBI and Genome Reviews EBI)
  • Integration in PkGDB: correction of obvious errors, management of frameshifts
  • Syntactic re-annotation (NAR (WS), 2003); add missing gene annotations (NAR (WS), 2005)
• New bacterial genomes (annotation projects)
• Annotation tool results:
  • Intrinsic: genes, signals, repeats, …
  • Extrinsic: BLAST, InterPro, COG, synteny, …
• Relational DBMS (MySQL)
Purpose: storage of 'clean' and complete annotation data which are subsequently used in the comparative genomic analysis.

Simplified structure of PkGDB
[Diagram; labels only]
• Project customization: annotation project, re-annotation project
• Published genomes (NCBI RefSeq, Genome Reviews); newly sequenced genomes
• Reference annotation for model organisms: SubtiList, GenProtEC, EcoGene
• Gene prediction: AMIGene
• Annotation management: sequence updates and annotation transfer, annotator management, annotation history
• Genomic objects: automatic and manual functional assignations
• Functional classification: GeneOntology, MultiFun
• Functional predictions: specific regions, protein similarities, helices and signal peptides, enzymatic functions, domains and motifs, orthologs & paralogs
• COG, InterPro, UniProt, BioCyc, KEGG
• Syntenies: multiple correspondences, local rearrangements (ins/del) (Boyer et al., Bioinformatics, Nov 2005)

How to read the synteny maps?
Gene shown: ACIAD0574 (hutH)
• This P. syringae gene (PSPTO0599/hutH-1) is a putative 'ortholog' of ACIAD0574 and is involved in a synteny group containing 17 genes (in green)
• These two P. syringae genes (PSPTO5274/hutH-2 and PSPTO5276/hutH-3) are similar to ACIAD0574 (putative paralogs of PSPTO0599)
• Two 'homologs' of ACIAD0574 on the P. aeruginosa genome

A larger view of the previous Acinetobacter ADP1 region
• Genes labeled: 0562 (hisS), 0574 (hutH), 0582-0583 (fabG-fabF)
• 4 of 138 genomes in PkGDB
• 9 of 284 complete microbial proteomes (RefSeq section)

How are genes organized in a synteny group?
• Synteny with the Ralstonia solanacearum chromosome
• Synteny with the Ralstonia solanacearum megaplasmid

Synteny maps are useful to annotate gene fusion/fission
Fusion of genes involved in DNA replication:
• dnaQ (DNA pol III, epsilon subunit + proofreading 3'-5' exonuclease)
• rnhA (degradation of Okazaki fragments)
Genes shown on the map:
• dnaQ: YPO1082, STM0264, PA1816, NMB1514, PSPTO3711
• rnhA: YPO1081, STM0263, PA1815, NMB1618, PSPTO3712
Colored rectangles represent the part of the protein which aligns with the corresponding Acinetobacter protein.

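The fusion/fission reading of the map can be sketched in code. This is a minimal illustration, not MaGe's actual method: it assumes alignment hits are available as tuples with coordinates on the query protein, and it flags a fusion candidate when two different genes of the same genome align to distinct, barely overlapping parts of one query protein. The function name, the thresholds `min_cover` and `max_overlap`, the query identifier and all coordinates are hypothetical; only the gene labels YPO1081 and YPO1082 come from the slide.

```python
from collections import defaultdict

def fusion_candidates(hits, min_cover=0.3, max_overlap=20):
    """Return (query, genome, gene_a, gene_b) tuples suggesting a fusion.

    hits: iterable of (query, genome, subject_gene, q_start, q_end, q_len),
    with 1-based inclusive alignment coordinates on the query protein.
    Thresholds are illustrative assumptions, not values used by MaGe.
    """
    grouped = defaultdict(list)
    for query, genome, subject, q_start, q_end, q_len in hits:
        grouped[(query, genome)].append((q_start, q_end, subject, q_len))

    candidates = []
    for (query, genome), segments in grouped.items():
        segments.sort()  # order the aligned segments along the query protein
        for i, (s1, e1, gene_a, q_len) in enumerate(segments):
            for s2, e2, gene_b, _ in segments[i + 1:]:
                if gene_a == gene_b:
                    continue
                # Negative or small overlap means the two genes cover
                # different parts of the query protein.
                overlap = min(e1, e2) - max(s1, s2) + 1
                cover_a = (e1 - s1 + 1) / q_len
                cover_b = (e2 - s2 + 1) / q_len
                if overlap <= max_overlap and min(cover_a, cover_b) >= min_cover:
                    candidates.append((query, genome, gene_a, gene_b))
    return candidates

# Hypothetical coordinates; only the gene labels come from the slide.
hits = [
    ("QUERY_FUSED", "Y. pestis", "YPO1081", 5, 150, 450),
    ("QUERY_FUSED", "Y. pestis", "YPO1082", 200, 440, 450),
    ("QUERY_FUSED", "Genome X", "GX0001", 3, 445, 450),
]
print(fusion_candidates(hits))
```

A genome with a single full-length hit (here the made-up "Genome X") yields no candidate, which matches the visual cue on the map: one rectangle spanning the whole protein versus two rectangles covering separate parts.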
Simplified structure of PkGDB
[Diagram repeated from the earlier slide of the same title, with these additional labels:]
• PRIAM (http://bioinfo.genopole-toulouse.prd.fr/priam/): position-specific scoring matrices ('profiles') built with SwissProt proteins
• Dynamic requests; local installation
• http://www.biocyc.org/
• www.genome.jp/kegg/

Setting up a new annotation project: an example
Newly sequenced genomes:
• Bradyrhizobium sp. ORS278 (Genoscope) -> 1 chr (7.5 Mb)
• Bradyrhizobium sp. BTAi (DOE/JGI) -> 1 chr (8.5 Mb)
Available related sequences:
• Rhizobium leguminosarum (Sanger Center)
• Rhodobacter sphaeroides (DOE/JGI)
• Rhodospirillum rubrum (DOE/JGI)
Genomes in public databanks:
• Mesorhizobium loti (00)
• Sinorhizobium meliloti (01)
• Bradyrhizobium japonicum (02)
• Rhodopseudomonas palustris (03)
Steps shown:
• Re-annotation process (pseudogenes, missing genes)
• Complete pipeline of automatic annotations
• Automatic syntactic annotations (in some cases, functional annotations)
• Searching for synteny groups with complete proteomes available in the RefSeq section (NCBI, 284 to date) and in PkGDB (curated genomes, 138 to date)
• PkGDB -> metabolic pathway reconstruction: Pathway Tools (Ocelot object model), BioWarehouse (relational model)
Project and PGDB names shown: BrajapCyc, BradyORCyc, BradyBTCyc, RhizoCyc, RhizoScope, YersiniaScope, ColiScope, FrankiaScope, AcinetoScope, CloacaScope

Comparative Metabolic Capabilities: an example
Reaction content comparison between the 3 Bradyrhizobium organisms (BioWarehouse SQL query on reactions having gene -> protein -> reaction correspondences)
[Venn diagram]
• Totals: Bradyrhizobium sp. ORS278: 830; Bradyrhizobium sp. BTAi: 873; Bradyrhizobium japonicum USDA 110: 897
• Region counts: 76, 14, 43, 724, 30, 16, 127

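A three-way reaction content comparison of this kind reduces to set arithmetic once each organism's reaction identifiers have been retrieved. The sketch below is an illustration in Python rather than the BioWarehouse SQL query used for the slide; the function names and the toy reaction identifiers are hypothetical. The last lines check one assignment of the seven region counts to Venn regions, under the assumption that 724 is the three-way core and that 830, 873 and 897 are the per-organism totals; that assignment is inferred from the arithmetic, not stated on the slide.

```python
def venn3_counts(a, b, c):
    """Count the seven regions of a three-set Venn diagram."""
    a, b, c = set(a), set(b), set(c)
    return {
        "a_only": len(a - b - c),
        "b_only": len(b - a - c),
        "c_only": len(c - a - b),
        "ab_only": len((a & b) - c),
        "ac_only": len((a & c) - b),
        "bc_only": len((b & c) - a),
        "abc": len(a & b & c),
    }

def totals(regions):
    """Rebuild the three set sizes from the seven region counts."""
    core = regions["abc"]
    return (
        regions["a_only"] + regions["ab_only"] + regions["ac_only"] + core,
        regions["b_only"] + regions["ab_only"] + regions["bc_only"] + core,
        regions["c_only"] + regions["ac_only"] + regions["bc_only"] + core,
    )

# Toy reaction identifiers (hypothetical), one set per organism.
ors278 = {"R1", "R2", "R3", "R4"}
btai = {"R2", "R3", "R5"}
usda110 = {"R3", "R4", "R6", "R7"}
toy_regions = venn3_counts(ors278, btai, usda110)
print(toy_regions, totals(toy_regions))

# Assumed assignment of the slide's counts (a = ORS278, b = BTAi, c = USDA 110).
slide_regions = {
    "a_only": 14, "b_only": 43, "c_only": 127,
    "ab_only": 76, "ac_only": 16, "bc_only": 30,
    "abc": 724,
}
print(totals(slide_regions))
```

Under that assumption the rebuilt totals agree with the three figures on the slide, which is a useful sanity check when reading a Venn diagram whose labels have been separated from their regions.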
Bradyrhizobium ORS278 region containing CDS 5771 & 5772
[MaGe screenshot; labels: 15277747; !!! !!! ???; BRAOR5771-5772 - 5773]
"Cloning and Characterization of the Genes Encoding Enzymes for the Protocatechuate Meta-degradation Pathways of Pseudomonas ochraceae NGJ1", Maruyama et al. (2004) Biosci. Biotechnol. Biochem., 68, 1434-1441.

AUTOmatic vs EXPert annotation of the region
CDS | Annotation | Evidence | Gene | EC number | Product
BRAOR5770 | AUTO | BLAST R. palus; PRIAM (medium) | ligC | 1.1.1.18 | 4-carboxy-2-hydroxymuconate-6-semialdehyde dehydrogenase
BRAOR5770 | EXP | BLAST P. testosteroni; Publication + Enzyme | ligC | 1.2.1.45 | 4-carboxy-2-hydroxymuconate-6-semialdehyde dehydrogenase
BRAOR5771 | AUTO = EXP | BLAST R. palus; PRIAM (high) | ligB | 1.13.11.8 | Protocatechuate 4,5-dioxygenase, alpha subunit
BRAOR5772 | AUTO = EXP | BLAST R. palus; PRIAM (high) | ligA | 1.13.11.8 | Protocatechuate 4,5-dioxygenase, beta subunit
BRAOR5773 | AUTO | BLAST R. palus | ligI | none | 2-pyrone-4,6-dicarboxylic acid hydrolase
BRAOR5773 | EXP | BLAST R. palus; Publication + Enzyme | ligI | 3.1.1.57 | 2-pyrone-4,6-dicarboxylic acid hydrolase
BRAOR5774 | AUTO | BLAST R. palus | none | none | Putative dehydrogenase
BRAOR5774 | EXP | BLAST R. palus; InterproScan | none | 1.1.1.- | Putative dehydrogenase with NAD binding protein
BRAOR5775 | AUTO | BLAST R. palus | fidZ | none | Putative acyl transferase
BRAOR5775 | EXP | BLAST P. ochraceae; Publication + Enzyme | ligK | 4.1.3.17 | 4-hydroxy-4-methyl-2-oxoglutarate aldolase
BRAOR5776 | AUTO | BLAST R. palus | ligJ | none | 4-oxalomesaconate hydratase
BRAOR5776 | EXP | BLAST R. palus; Publication + Enzyme | ligJ | 4.2.1.83 | 4-oxalomesaconate hydratase

Bradyrhizobium ORS278 region after expert annotation
• BRAOR5770: ligC (1.2.1.45)
• BRAOR5771-72: ligBA (1.13.11.8)
• BRAOR5773: ligI (3.1.1.57)
• BRAOR5775: ligK (4.1.3.17)
• BRAOR5776: ligJ (4.2.1.83)
• BRAOR5777, BRAOR5778

Connectivity to KEGG database
[KEGG pathway map; color legend:]
• Enzymes encoded by genes in the MaGe region
• Enzymes encoded by genes elsewhere in the Bradyrhizobium genome
• Additional enzymes in E. coli
Label on the map: ? 4.2.1.83

Connectivity to KEGG database
[KEGG pathway map; color legend:]
• Enzymes encoded by genes in the MaGe region
• Enzymes encoded by genes elsewhere in the Bradyrhizobium genome
• Additional enzymes in E. coli

Bradyrhizobium ORS278 region after expert annotation
• BRAOR5770_ligC: 4-carboxy-2-hydroxymuconate-6-semialdehyde dehydrogenase (1.2.1.45)
• BRAOR5776_ligJ: 4-oxalomesaconate hydratase (4.2.1.83)
• Other labels: probable transcriptional regulator of protocatechuate degradation; probable protocatechuate transporter; ligR; BRAOR5777; BRAOR5778
• Other genes shown: 5771, 5772, 5773, 5775
The reactions catalyzed by 1.2.1.45 and 4.2.1.83 exist in MetaCyc, but they are not part of any pathway there.

Enzymatic activity predictions (PRIAM): some results
• Comparison of PRIAM predictions [P] and expert annotations [E]
Category | Frankia alni | Pseudomonas entomophila | Pseudoalteromonas haloplanktis | Acinetobacter ADP1
Total genes | 6861 | 5182 | 3514 | 3325
Nb EC_[P] vs EC_[E] | 1729 / 1498 | 1455 / 1232 | 1012 / 947 | 927 / 993
EC_[P] = EC_[E] | 912 (52.8%) | 820 (56.3%) | 632 (62.5%) | 697 (75.2%)
EC_[P] (3 digit) = EC_[E] | 68 (3.9%) | 46 (3.2%) | 47 (4.6%) | 23 (2.5%)
EC_[P] <> EC_[E] | 401 (23.2%) | 285 (19.6%) | 131 (12.9%) | 102 (11.0%)
EC_[P] & (NO EC_[E]) | 348 (20.1%) | 304 (20.9%) | 202 (20.0%) | 105 (11.3%)
EC_[E] & (NO EC_[P]) | 111 (7.4%) | 90 (7.3%) | 111 (11.7%) | 152 (15.3%)
• Limitations of PRIAM sequence-based enzyme prediction:
  • Requires at least one UniProt/SwissProt sequence in the Enzyme entry!
  • Existence of closely related enzymes with different substrate specificity
  • Relaxed substrate specificity exhibited by some enzymes
  • Several wrong predictions in case of Medium/Low PRIAM confidence

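The comparison categories used for PRIAM predictions [P] versus expert annotations [E] can be expressed as a small classifier. The sketch below is illustrative only: it assumes one EC number per gene, and it reads "3 digit" agreement as identity of the first three EC levels, which is an interpretation rather than something the slide defines. The function name and category strings are hypothetical; the example values are the AUTO and EXP EC numbers listed for BRAOR5770 to BRAOR5776 earlier in the talk.

```python
from collections import Counter

def compare_ec(predicted, expert):
    """Classify one gene by comparing its predicted and expert EC numbers.

    Either argument may be None when no EC number was assigned.
    """
    if predicted and expert:
        if predicted == expert:
            return "P = E"
        # Assumed meaning of "3 digit": the first three EC levels agree.
        if predicted.split(".")[:3] == expert.split(".")[:3]:
            return "P (3 digits) = E"
        return "P <> E"
    if predicted:
        return "P only"
    if expert:
        return "E only"
    return "no EC"

# (automatic EC, expert EC) per CDS, as listed in the AUTO vs EXP comparison.
region = {
    "BRAOR5770": ("1.1.1.18", "1.2.1.45"),
    "BRAOR5771": ("1.13.11.8", "1.13.11.8"),
    "BRAOR5772": ("1.13.11.8", "1.13.11.8"),
    "BRAOR5773": (None, "3.1.1.57"),
    "BRAOR5774": (None, "1.1.1.-"),
    "BRAOR5775": (None, "4.1.3.17"),
    "BRAOR5776": (None, "4.2.1.83"),
}
tally = Counter(compare_ec(p, e) for p, e in region.values())
print(tally)
```

Tallying the categories over a whole genome gives counts of the kind reported per organism; real data would also need a rule for genes carrying several EC numbers, which this sketch leaves out.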
PGDBs built at Genoscope
• Our PGDBs are currently available from the MaGe interface home page: http://www.genoscope.cns.fr/agc/mage/
• NO curation to date (Tier 3* databases), except for Acinetobacter ADP1 (-> Metabolic Thesaurus project)
• MaGe training courses include a quick overview of how to explore PathoLogic results to perform relevant expert annotation
• Automatic updates of PathoLogic predictions: every week
• To date: about 60 Tier 3 PGDBs
  • 16 PGDBs in common with the SRI/EBI Tier 3* PGDBs (and 4 with Tier 2*): «Expansion of the BioCyc collection of pathway/genome databases to 160 genomes» Karp et al., Nucleic Acids Research, 2005, 33: 6083-6089.
    • The number of enzymes and pathways is slightly greater in our PGDBs (source of annotations + process of PathoLogic file format generation)
    • Important discrepancies for Sinorhizobium meliloti (44 predicted pathways in the SRI/EBI PGDB vs 259 in the Genoscope PGDB)
  • 18 PGDBs: other published bacterial genomes
  • 25 PGDBs for newly sequenced and annotated bacterial genomes
*Tier 3: Computationally-Derived Databases Subject to No Curation
*Tier 2: Computationally-Derived Databases Subject to Moderate Curation

Some Questions / Perspectives
• Curation of PGDBs?
  • Remove false-positive pathways (Tier 3 -> Tier 2)
    • Integration and evaluation of Pathway Hole Filler
    • Automatic reduction of false-positive pathway predictions stored in the PGDBs
  • Better correspondences between BioCyc and MaGe
    • Finding a way to get a list of false-positive pathways at the end of the manual annotation process. Pathway X doesn't exist because: no enzyme has been found!!! / some enzymes correspond to pseudogenes
    • Optional fields in the PathoLogic file format (PubMedID, Funcat, …)
    • How to tackle the pseudogene information?
  • Tier 2 -> Tier 1*, especially creation of new metabolic pathways: not an easy task!!! (a strong knowledge of metabolism is required)
• PGDBs freely available for «adoption» by biologists
*Tier 1: Intensively Curated Databases

Metabolic Thesaurus project at Genoscope
[Diagram; labels only]
• 3325 Acinetobacter ADP1 annotated genes
• Knock-out collection: 2240 ADP1 genes knocked out
• Biological evidence: systematic phenotyping, accurate phenotyping, biochemical studies, functional complementation, transcriptome analyses
• Model: annotation, network reconstruction, flux models, metabolism prediction
• Teams: Vincent Schächter's bioinformatics team; Véronique de Berardinis's team

Metabolic Pathway Reconstruction / Experimental Data
• ColiScope: sequencing of 2 commensal and 4 pathogenic E. coli strains
  • Evolution of metabolic capabilities => adaptation of microorganisms, emergence of commensalism / virulence
• Metabolic Thesaurus: Acinetobacter ADP1 KO collection
  • Phenotypic analysis: growth assay on different nutrient sources
  • Metabolome analysis: LC/MS and CE/MS
  • Linking enzymatic activities to genes of unknown function
• Data integration and comparative analysis

Participating teams
• AGC team:
  • Zoé Rouy
  • David Vallenet
  • Aurélie Lajus
  • Stéphane Cruveiller
  • Claudine Médigue
• Genoscope informatic system team:
  • Claude Scarpelli
  • Laurent Sainte-Marthe
  • Sylvain Bonneval
• … and with the help of:
  • François Lefèvre (V. Schächter team)
• MaGe users' feedback helps in improving many functionalities of our system!