
microbial genome annotation

microbial genome annotation. to annotate: to make or furnish critical or explanatory notes or comments. annotation: a note or comment. bacterial genome features. sizes: 0.6 Mb (Mycoplasma) up to 10 Mb (Myxobacteria). chromosomes: circular, linear, megaplasmids.

mura

Presentation Transcript


  1. microbial genome annotation
     to annotate: to make or furnish critical or explanatory notes or comments
     annotation: a note or comment

  2. bacterial genome features
     sizes: 0.6 Mb (Mycoplasma) up to 10 Mb (Myxobacteria)
     chromosomes: circular, linear, megaplasmids
     G+C content: 25% (Buchnera) up to 75% (Streptomyces)
     average gene density: one gene per 1-1.3 kbp
     intergenic regions: 10-20%
     organization: many genes form operons or gene islands
     genome plasticity: insertions, deletions, inversions, gene pools

  3. sequenced vs. experimentally characterised genes
     [figure: current status in genome sequencing; published genome sequencing projects since 1995, sequenced vs. characterised genes]

  4. shotgun sequencing
     fragmentation of genomic DNA (2-10 kbp)
     plasmid library of genomic fragments (8-10-fold genome coverage)
     sequencing (500 bp reads):
       ..AGGCATCTAGGATTACCATCTACTT
       ..AGCTATCGAGCATCTAGGATTACCATCTA
       TACCATCTACTTCTCATTTTCTAAATA..
       GATTACCATCTACTTCTCATTTTCTAAATATCGCGCA..
     assembly of overlapping sequences
     gap closure
     annotation
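The "assembly of overlapping sequences" step can be illustrated with a greedy suffix/prefix merge. This is only a minimal sketch: the function names, the `min_len` parameter and the toy reads are assumptions made for illustration, exact matches are required (no sequencing errors), and real assemblers also handle reverse complements, repeats and quality values.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len: int = 3):
    """Repeatedly merge the pair of reads with the longest exact overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    # Drop reads that are fully contained in another read.
    reads = [r for r in reads if not any(r != o and r in o for o in reads)]
    while len(reads) > 1:
        best = (0, None, None)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        if k == 0:
            break  # no overlaps left: the remaining contigs are separated by gaps
        merged = reads[i] + reads[j][k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (i, j)] + [merged]
    return reads


# Hypothetical toy reads covering a 16 bp sequence.
contigs = greedy_assemble(["AAGGCTTAC", "ACGTTGCA", "TTGCAAGG"])
print(contigs)
```

When more than one contig is returned, the reads do not overlap sufficiently, which corresponds to the gaps that the gap-closure step has to resolve.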

  5. genome sequencing project
     example: Pseudomonas putida KT2440
     restriction mapping
     cosmid mapping
     shotgun sequencing
     assembly and gap closure
     primary annotation
     nucleotide sequence analysis
     functional data
     comparative genomics

  6. data interpretation

  7. annotation strategy
     sequencing
     automatic gene finding: prediction of putative coding regions, application of one or more algorithms
     biological databases

  8. gene finding
     • often based on probabilistic models (for instance HMMs, IMMs)
     • many algorithms available
     • no perfect algorithm (none reaches 99.9%): false positives and false negatives
     • consequence: additional evaluation needed (overlaps, intergenic regions, short genes, start sites)

  9. gene finding with Glimmer
     • Glimmer uses an Interpolated Markov Model (IMM) to predict those open reading frames (ORFs) most likely to be genes
     • The IMM makes predictions based on statistical probabilities generated when the model-building algorithm is trained on a set of 'known' genes (long ORFs, ORFs that match known proteins)
     • The algorithm calculates the occurrence of base x following oligomer y (up to 8-mers) in the set of 'known' genes and generates the probability of each combination occurring in a real gene
     • These probabilities are then used to predict whether any given ORF is a real gene or not
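The idea of scoring "base x following oligomer y" can be sketched with a fixed-order Markov model. This is not Glimmer's actual algorithm: it uses a single context length instead of interpolating between orders up to 8, it ignores codon position, and the function names, the pseudocount and the toy training sequences are assumptions made for illustration.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"


def train_markov(seqs, order=2, pseudocount=1.0):
    """Return prob(context, base) estimated from the training sequences."""
    counts = defaultdict(lambda: defaultdict(float))
    for s in seqs:
        for i in range(order, len(s)):
            counts[s[i - order:i]][s[i]] += 1

    def prob(context, base):
        ctx = counts.get(context, {})
        total = sum(ctx.values()) + pseudocount * len(ALPHABET)
        return (ctx.get(base, 0.0) + pseudocount) / total

    return prob


def log_odds(seq, coding, background, order=2):
    """Sum of log(P_coding / P_background) over all positions of seq."""
    score = 0.0
    for i in range(order, len(seq)):
        ctx, base = seq[i - order:i], seq[i]
        score += math.log(coding(ctx, base) / background(ctx, base))
    return score


# Hypothetical training data: one 'known' gene and one non-coding stretch.
coding_model = train_markov(["ATGGCGGCGGCGGCGTAA"])
background_model = train_markov(["ATATATATATATATAT"])

for orf in ("GCGGCGGCG", "ATATATATA"):
    print(orf, round(log_odds(orf, coding_model, background_model), 2))
```

A positive score means the candidate ORF resembles the training genes more than the background; where to place the decision threshold is a separate choice that this sketch does not address.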

  10. annotation strategy
      sequencing
      automatic gene finding: prediction of putative coding regions, application of one or more algorithms
      automatic annotation: similarity searches, assignments to protein families etc., sequence features, suggestion of function, classification
      manual annotation: validation of gene finding and automatic annotations, additional database searches, literature searches and other information sources, contextual analysis
      re-annotation: validation and update of previous annotations
      biological databases

  11. errors in annotation
      error types shown along the pipeline:
      • sequencing errors, assembly errors
      • false positives, start sites, false negatives
      • false negatives, under- and over-prediction
      • false positives, over-prediction, domain error, false negatives, under-prediction, undefined source, typographical errors
      pipeline stages: sequencing, automatic gene finding, automatic annotation, manual annotation, re-annotation, biological databases

  12. reannotation scores
      Score | Description | Comment
      7 | False positive | Original annotation predicts function without any supporting evidence
      6 | Over-prediction | Original annotation predicts a specific biochemical function without sufficient supporting evidence
      5 | Domain error | Original annotation overlooks different domain structure of query and reference proteins
      4 | False negative | Original annotation does not provide predicted function although there is sufficient evidence to characterize the query protein
      3 | Under-prediction | Original annotation predicts a nonspecific biochemical function although a more detailed prediction could have been made
      2 | Undefined source | Original annotation contains undefined terms, non-homology based predictions, and so on
      1 | Typographical error | Original annotation contains typographical errors that may be propagated in the database
      0 | Total agreement | Original annotation is correct, but annotations may be only semantically (but not computationally) identical
      CA Ouzounis and PD Karp, Genome Biology 2002, 3 (2): comment2001.1-2001.6

  13. annotation
      • basic annotation:
        • name, gene symbol, functional category
        • gene characteristics (length, position, G+C content, ...)
        • protein characteristics (domains/motifs, MW, pI, ...)
      • extended annotation:
        • genomic context, phylogenetic relations
        • protein interaction, pathways
        • further gene characteristics (codon usage, oligonucleotides)
        • experimental data (high-throughput data)

  14. diversity of nomenclature
      descriptive:
        multidrug efflux MFS transporter
        multidrug resistance efflux pump homolog
        efflux pump protein
        multidrug resistance protein B
        EmrB protein
      consistent:
        histidine sensor kinase
        sensor kinase
        two component sensor kinase
        transmembrane sensor kinase
        two component system, transmembrane sensor
        sensor histidine kinase
        sensory box protein

  15. new developments
      annotation using Gene Ontology (GO) categories
      • controlled vocabulary
      • annotation according to function, process, or localization
      • combination of these ontologies
      • evidence codes
      example: translation factor

  16. annotation strategies
      homology/structure
      > pairwise homology
      > protein domains/families
      > binding-sites
      > amino acid composition
      > secondary structures
      > 3D structures

  17. pairwise alignment
      • search against protein databases: GenBank, SwissProt, PIR, ...
      • different algorithms: Blast, PSI-Blast, Fasta, Smith-Waterman
      • different search strategies:
        • combination of Blast and Fasta
        • Blast search followed by Smith-Waterman alignment
        • PSI-Blast (iterative search: builds a profile in the first run and repeats the search against the profile)
      Example_1: functional characterization possible
      Example_2: functional characterization impossible
      Example_3: functional characterization ambiguous
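Of the algorithms listed, Smith-Waterman is the one that can be written down compactly. The sketch below is a minimal local alignment with a linear gap penalty; the scoring values (match 2, mismatch -1, gap -2) and the function name are illustrative assumptions, whereas real protein searches use substitution matrices and affine gap costs.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best local score, aligned part of a, aligned part of b)."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the dynamic-programming matrix; scores never drop below zero.
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + sub, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until a zero is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


print(smith_waterman("TTTACGTTT", "GGACGGG"))
```

Because the matrix is quadratic in sequence length, a fast heuristic search is typically run first and the exact alignment is computed only for the reported hits, which matches the "Blast search followed by Smith-Waterman alignment" strategy above.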

  18. known problems
      • no cut-off values, leads to overprediction
      • different degrees of conservation during evolution
      • ambiguous substrate/interaction specificity
      • no information on orthology/paralogy
      • wrong annotations, database artefacts
      • transitive annotation
      • multidomain proteins

  19. transitive annotation
      [diagram: A (new predicted protein), B (database entry from sequencing project), C (well characterised database entry); similarity labels of 30%]
      A is like B, B is like C, but C is not like A

  20. multidomain proteins
      • multidomain problem, dominant domains
      [diagram: X (new sequence) with domains A and B, compared with A-domain proteins and B-domain proteins; similarity labels of 30% and 70%]

  21. protein families - HMM
      highly curated multiple alignment of well characterised seed proteins:
        AIEEGEILVIMGLSGSGKST
        AIEEGEIFVIMGLSGSGKST
        EVYDGEIFVIMGLSGSGKST
        KIAKGEFICFIGPSGCGKTT
        DILKGEFICFIGPSGCGKTV
      generation of Hidden Markov Model (HMM) including cutoffs
      alignment to genome proteins, assignment of scores (column spacing of the match line was lost in extraction):
        HMM    eIakGEifvimGlSGsGKsT
        match  +++ GEi+ ++G SGsGKs
        query  DLYRGEILAVVGGSGSGKSV
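How a family model assigns scores to genome proteins can be approximated with a position-specific log-odds profile built from the seed alignment. This is a deliberately simplified stand-in for a profile HMM: it has no insert or delete states, assumes a uniform background, and the function names and pseudocount are assumptions for illustration. The seed sequences and the query are the ones shown on the slide.

```python
import math

AMINO = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(alignment, pseudocount=1.0):
    """Per-column log-odds scores against a uniform background."""
    length = len(alignment[0])
    assert all(len(s) == length for s in alignment), "ungapped, equal-length seeds expected"
    background = 1.0 / len(AMINO)
    profile = []
    for col in range(length):
        column = [s[col] for s in alignment]
        total = len(column) + pseudocount * len(AMINO)
        profile.append({
            aa: math.log(((column.count(aa) + pseudocount) / total) / background)
            for aa in AMINO
        })
    return profile


def score(profile, seq):
    """Sum of column scores for an ungapped sequence of the same length."""
    return sum(col[aa] for col, aa in zip(profile, seq))


seeds = [
    "AIEEGEILVIMGLSGSGKST",
    "AIEEGEIFVIMGLSGSGKST",
    "EVYDGEIFVIMGLSGSGKST",
    "KIAKGEFICFIGPSGCGKTT",
    "DILKGEFICFIGPSGCGKTV",
]
profile = build_profile(seeds)
print(round(score(profile, "DLYRGEILAVVGGSGSGKSV"), 2))
```

The query scores well mainly because of the fully conserved glycine, serine and lysine positions, which is the same signal the consensus line of the HMM highlights.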

  22. protein families
      > databases: Pfam, TIGRfam, Smart
      > based on highly curated sets of proteins known to share the same or similar functions or to be members of the same family
      > family name often refers to well characterized members
      > further classification into super- or subfamilies possible
      > trusted cutoff and noise cutoff can be used for evaluation of assignments
      example 1: uncovering of a MFS family transporter
      example 2: porins in P. putida KT2440
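Evaluating an assignment against the two cutoffs amounts to a three-way decision. The sketch below assumes the common reading that scores at or above the trusted cutoff are reliable family members, scores at or below the noise cutoff are likely spurious, and scores in between need manual inspection; the function name, the labels and the numbers are hypothetical.

```python
def classify_hit(score: float, trusted_cutoff: float, noise_cutoff: float) -> str:
    """Label a family assignment by comparing its score with both cutoffs."""
    if trusted_cutoff < noise_cutoff:
        raise ValueError("trusted cutoff must not be below the noise cutoff")
    if score >= trusted_cutoff:
        return "trusted"
    if score > noise_cutoff:
        return "uncertain"  # grey zone: check manually
    return "noise"


# Hypothetical cutoffs for one family and three hypothetical hit scores.
for s in (120.0, 90.0, 40.0):
    print(s, classify_hit(s, trusted_cutoff=100.0, noise_cutoff=80.0))
```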

  23. porins

  24. motifs/domains/structure
      PROSITE motifs: binding sites, phosphorylation sites, membrane anchors
      Lipoprotein motifs: putative lipid modification
      Signal peptides: characteristic for membrane and extracellular proteins
      Membrane spanning regions: typical for transporters, sensors, etc.
      Secondary structures: helices, β-sheets, coils
      Tertiary structures: scan against profiles of known protein structures

  25. annotation strategies
      homology/structure
      > pairwise homology
      > protein domains/families
      > binding-sites
      > amino acid composition
      > secondary structures
      > 3D structures
      phenotype/experiments
      > metabolic pathways
      > physiological features
      > localisation
      > expression data
      > knock-out phenotypes
      > comparative genomics

  26. metabolic pathways
      [figure: degradation of ferulic acid by Pseudomonas spp.; gene cluster labels: arylsulfatase, acyl-CoA dehydrogenase, ori, aat, fcs, vdh, ech, regulator, 4-hydroxycinnamic acid MFS transporter, vanB, vanA, pcaA, pcaB]
      Jörg Overhage, Horst Priefert, and Alexander Steinbüchel, AEM 1999, 65:4837ff

  27. annotation strategies
      homology/structure
      > pairwise homology
      > protein domains/families
      > binding-sites
      > amino acid composition
      > secondary structures
      > 3D structures
      phenotype/experiments
      > metabolic pathways
      > physiological features
      > localisation
      > expression data
      > knock-out phenotypes
      > comparative genomics
      genomic context
      > orthology, phylogeny
      > conserved neighborhood
      > operon structure
      > gene fusion, protein interaction
      > phylogenetic profiles

  28. orthology / paralogy
      > gene A from genome 1 is the ortholog of gene B from genome 2 if:
        • gene A is the best homolog of gene B among all genes in genome 1
        • gene B is the best homolog of gene A among all genes in genome 2
      > orthologs are genes that have diverged from each other after speciation events
      > paralogs are genes that have diverged from each other after gene duplication events
      > homologs are genes that descend from a common ancestor gene
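The two-sided "best homolog" criterion is usually called a reciprocal (bidirectional) best hit and can be sketched as follows. The function name and the input layout (a dictionary of pairwise similarity scores between genes of two genomes) are assumptions for illustration; the toy scores reuse the gene names X, Y, Z and the 50%/70% values, assuming X and Z lie in one genome and Y in the other.

```python
def reciprocal_best_hits(scores):
    """scores maps (gene_in_genome_1, gene_in_genome_2) to a similarity score."""
    best_for_1, best_for_2 = {}, {}
    for (g1, g2), s in scores.items():
        if g1 not in best_for_1 or s > best_for_1[g1][1]:
            best_for_1[g1] = (g2, s)
        if g2 not in best_for_2 or s > best_for_2[g2][1]:
            best_for_2[g2] = (g1, s)
    # Keep only pairs where each gene is the other's best hit.
    return sorted(
        (g1, g2) for g1, (g2, _) in best_for_1.items() if best_for_2[g2][0] == g1
    )


scores = {("gene X", "gene Y"): 50, ("gene Z", "gene Y"): 70}
print(reciprocal_best_hits(scores))
```

Gene X is homologous to gene Y but is not reported, because gene Y's best hit in the other genome is gene Z. Ties are broken by input order here, which a real pipeline would need to handle explicitly.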

  29. orthology / paralogy
      [diagram: gene X, gene Y and gene Z distributed over genome A and genome B; labels: lysine transporter, 50%, 70%]
      gene Z / gene Y: orthologs
      gene X / gene Y: homologs
      gene X / gene Z: paralogs

  30. paralogous families
      • protein clustering of a complete genome detects paralogous families
      • members of the same protein family share conserved domains that are often connected with a function, localization or process
      • paralogous families are used for maintaining consistency of annotation and for start-site editing
      • can be useful for genome comparison
      Enright et al. 2002, NAR 30, 1575-84

  31. conserved neighborhood
      • neighborhood (gene order or proximity) of two or more genes is conserved between different taxa or species
      • assumption: gene order is conserved during evolution due to co-regulation or co-transcription, which suggests participation in the same complex, pathway or process
      • problem: false positives due to short phylogenetic distances
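A minimal way to look for conserved neighborhoods is to count how many genomes contain the same pair of ortholog groups side by side. In this sketch the genomes are hypothetical ordered lists of ortholog-group identifiers, the function name and `min_genomes` threshold are assumptions, and strand, circularity and gaps between neighbors are ignored.

```python
from collections import Counter


def conserved_neighbors(genomes, min_genomes=2):
    """Return adjacent ortholog-group pairs found in at least min_genomes genomes."""
    counts = Counter()
    for order in genomes.values():
        # A set, so that each pair is counted once per genome.
        pairs = {frozenset((x, y)) for x, y in zip(order, order[1:]) if x != y}
        counts.update(pairs)
    return {tuple(sorted(p)) for p, n in counts.items() if n >= min_genomes}


# Hypothetical gene orders in three genomes.
genomes = {
    "genome 1": ["A", "B", "C", "D"],
    "genome 2": ["D", "B", "A", "C"],
    "genome 3": ["C", "D", "A", "B"],
}
print(conserved_neighbors(genomes, min_genomes=3))
```

The false-positive problem noted above shows up directly here: closely related genomes share almost all adjacencies, so the genomes counted should be phylogenetically distant.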

  32. conserved neighborhood
      [figure: conserved gene neighborhood of an ABC transporter; gene labels: inner membrane permease protein, ATP-binding protein, putative periplasmic substrate-binding protein, ABC transporter, binding protein, permease, putative ABC transporter ATPase, conserved hypothetical (one TM domain)]

  33. gene fusion analysis
      [diagram: gene X, gene Y and gene Z in genome A and genome B]
      • orthologs of individual 'component' proteins in a genome A are fused into a single protein in a genome B
      • assumption: component proteins in genome A are involved in the same (or similar) protein complex, pathway or process
      • problem: increased number of false positives with increasing level of paralogy
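Fusion candidates can be found by looking for a single protein in one genome that is hit by two different proteins of another genome in separate regions. The sketch assumes precomputed similarity hits given as (component gene, composite gene, start, end on the composite); the function name, the tuple layout and the coordinates are hypothetical.

```python
from itertools import combinations


def find_fusions(hits, max_overlap=0):
    """Return (component 1, component 2, composite) triples suggesting a fusion."""
    by_composite = {}
    for component, composite, start, end in hits:
        by_composite.setdefault(composite, []).append((component, start, end))

    links = set()
    for composite, regions in by_composite.items():
        for (g1, s1, e1), (g2, s2, e2) in combinations(regions, 2):
            if g1 == g2:
                continue
            # Number of shared positions; zero or negative means separate regions.
            overlap = min(e1, e2) - max(s1, s2) + 1
            if overlap <= max_overlap:
                first, second = sorted((g1, g2))
                links.add((first, second, composite))
    return links


hits = [
    ("gene X", "gene Z", 1, 200),    # gene X matches the N-terminal part of gene Z
    ("gene Y", "gene Z", 210, 450),  # gene Y matches the C-terminal part of gene Z
    ("gene X", "gene W", 1, 200),
    ("gene Q", "gene W", 150, 300),  # overlaps the gene X region, so no fusion call
]
print(find_fusions(hits))
```

With many paralogs, several genes match the same region of the composite protein, which is where the false positives mentioned above come from.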

  34. phylogenetic profiles
      [table: presence/absence of protein 1 to protein 5 in species a to species f]
      • phylogenetic profiles are co-occurrence patterns of genes (orthologs) in different genomes
      • assumption: similar phylogenetic profiles of genes could implicate participation in the same pathway or process
      • problem: false positives due to short phylogenetic distances and high noise-to-signal ratio
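Comparing phylogenetic profiles reduces to comparing presence/absence vectors. The sketch below uses the Jaccard index as the similarity measure, which is one choice among several; the function names, the threshold and the presence/absence values are hypothetical, since the original matrix did not survive in the transcript.

```python
from itertools import combinations


def jaccard(p, q):
    """Shared presences divided by presences in either profile."""
    both = sum(1 for x, y in zip(p, q) if x and y)
    either = sum(1 for x, y in zip(p, q) if x or y)
    return both / either if either else 0.0


def similar_profiles(profiles, threshold=0.8):
    """Return protein pairs whose profiles reach the similarity threshold."""
    return sorted(
        (a, b)
        for a, b in combinations(sorted(profiles), 2)
        if jaccard(profiles[a], profiles[b]) >= threshold
    )


# Hypothetical presence (1) / absence (0) in species a to f.
profiles = {
    "protein 1": (1, 1, 0, 1, 0, 1),
    "protein 2": (1, 1, 0, 1, 0, 1),
    "protein 3": (0, 0, 1, 0, 1, 0),
    "protein 4": (1, 1, 1, 1, 1, 1),
    "protein 5": (1, 1, 0, 1, 0, 0),
}
print(similar_profiles(profiles))
```

A protein present in every genome, such as protein 4 in this toy data, carries little information, and profiles built from closely related species look alike regardless of function, which is the noise problem noted above.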

  35. annotation strategies
      homology/structure
      > pairwise homology
      > protein domains/families
      > binding-sites
      > amino acid composition
      > secondary structures
      > 3D structures
      phenotype/experiments
      > metabolic pathways
      > physiological features
      > localisation
      > expression data
      > knock-out phenotypes
      > comparative genomics
      genomic context
      > orthology, phylogeny
      > conserved neighborhood
      > operon structure
      > gene fusion, protein interaction
      > phylogenetic profiles
      genome associated features
      > codon usage
      > mobile elements, islands
      > oligonucleotide frequencies
      > promoter/terminator
      > RNAs

  36. genome sequence features
      • nucleotide content
      • oligonucleotide bias
      • oligonucleotide variance
      > all three features are expected to be relatively constant throughout the genome
      • codon usage
      • (oligo)nucleotide skew
      • third position GC skew
      • repeats
      > atypical sequence features often indicate alien DNA, highly/lowly expressed genes, or unusual structural features
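Flagging atypical regions can be sketched with the simplest of these features, G+C content in sliding windows, plus the GC skew (G - C) / (G + C). The function names, window size, step and deviation threshold are illustrative assumptions, and the toy sequence is artificial; skew restricted to third codon positions would additionally require gene coordinates.

```python
def gc_content(seq):
    """Fraction of G and C bases."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def gc_skew(seq):
    """(G - C) / (G + C), or 0.0 if the sequence has neither base."""
    seq = seq.upper()
    g, c = seq.count("G"), seq.count("C")
    return (g - c) / (g + c) if g + c else 0.0


def sliding_windows(seq, size, step):
    """Yield (start, window) for every full-length window."""
    for start in range(0, len(seq) - size + 1, step):
        yield start, seq[start:start + size]


def atypical_windows(seq, size, step, max_dev):
    """Windows whose G+C content deviates from the overall value by more than max_dev."""
    overall = gc_content(seq)
    return [
        (start, gc_content(window))
        for start, window in sliding_windows(seq, size, step)
        if abs(gc_content(window) - overall) > max_dev
    ]


# Artificial sequence with an AT-rich block in the middle.
genome = "GCGC" * 5 + "ATAT" * 5 + "GCGC" * 5
print(atypical_windows(genome, size=20, step=20, max_dev=0.5))
```

On real genomes the windows would span several kilobases, and candidate regions would be cross-checked against other features such as codon usage or oligonucleotide frequencies before being called islands.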

  37. detection of gene islands
      Region 1: type II secretion proteins, others
      Region 2: lps/eps biosynthesis cluster
      Region 3: arsenate detoxification operon; unknown operon
      Region 4: prophage
      Region 5: hypothetical proteins, large non-coding regions
      Region 6+7: transposons
      Region 8+9: heavy metal resistance genes (transposons?)
