
Finding the genes in microbial genomes Natalia Ivanova MGM Workshop May 15, 2012


Presentation Transcript


  1. Finding the genes in microbial genomes
     Natalia Ivanova, MGM Workshop, May 15, 2012

  2. Outline
     1. Introduction
     2. Tools out there
     3. Basic principles behind tools and known problems
     4. Metagenomes

  3. Finding the genes in microbial genomes
     [Figure: well-annotated bacterial genome in the Artemis genome viewer, with features labeled]
     • Sequence features in prokaryotic genomes:
       • stable RNA-coding genes (rRNAs, tRNAs, RNA component of RNaseP, tmRNA)
       • protein-coding genes (CDSs)
       • transcriptional features (mRNAs, operons, promoters, terminators, protein-binding sites, DNA bends)
       • translational features (RBS, regulatory antisense RNAs, mRNA secondary structures, translational recoding and programmed frameshifts, inteins)
       • pseudogenes (tRNA and protein-coding genes)
       • …

  4. Outline
     1. Introduction
     2. Tools out there (don’t bother to write down the names and links, all presentations will be available on the web site)
     3. Known problems
     4. Metagenomes

  5. Publicly available genome annotation services
     • IMG-ER http://img.jgi.doe.gov/
     • RAST http://rast.nmpdr.org/
     • JCVI Annotation Service http://www.jcvi.org/cms/research/projects/annotation-service/
     • NCBI PGAAP http://www.ncbi.nlm.nih.gov/genomes/static/Pipeline.html

  6. Features predicted by most pipelines
     • “Non-coding” RNAs
       • 23S, 16S and 5S rRNAs
       • tRNAs
       • (tmRNA, RNaseP RNA component)
     • Protein-coding genes
     • Topological features (signal peptides, transmembrane regions)
     • Repeats (CRISPRs, IS elements)
     • Regulatory RNAs (riboswitches)
     • Pseudogenes and frameshifts

  7. RNA prediction
     • Best way: covariance models (they capture the secondary structure)
     • Used for small structural RNAs (5S rRNA, tRNAs, tmRNA, RNaseP RNA component)
       • Rfam database, INFERNAL search tool http://www.sanger.ac.uk/Software/Rfam/ http://rfam.janelia.org/ http://infernal.janelia.org/
       • tRNAscan-SE http://lowelab.ucsc.edu/tRNAscan-SE
     • Does not work for large RNAs (23S, 16S). Alternatives:
       • BLASTn
       • RNAmmer http://www.cbs.dtu.dk/services/RNAmmer/

  8. CDS (not ORF!) prediction
     • Reading frames: translations of the nucleotide sequence with an offset of 0, 1 and 2 nucleotides (three possible translations in each direction)
     • Open reading frame (ORF): reading frame between a start and stop codon
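
The two definitions on this slide can be sketched in a few lines of Python. This is only an illustration, not how any of the pipelines in this talk enumerate ORFs: the function names are hypothetical, the start and stop codon sets are the ones listed on the gene model slide (ATG, GTG, TTG and TAG, TAA, TGA), and the sketch reports only the longest ORF ending at each stop codon.

```python
START_CODONS = {"ATG", "GTG", "TTG"}
STOP_CODONS = {"TAG", "TAA", "TGA"}


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_orfs(seq, min_codons=0):
    """Scan all six reading frames for start-to-stop ORFs.

    Returns tuples (strand, frame, start, end). Coordinates are 0-based
    on the scanned strand, `end` is exclusive and includes the stop codon.
    For each stop codon only the most upstream start is reported.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):           # offsets 0, 1 and 2
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon in START_CODONS:
                    start = i            # first start after the last stop
                elif start is not None and codon in STOP_CODONS:
                    if (i - start) // 3 >= min_codons:
                        orfs.append((strand, frame, start, i + 3))
                    start = None
    return orfs


if __name__ == "__main__":
    for orf in find_orfs("CCATGAAATTTTAGGG"):
        print(orf)
```

Every tuple returned here is an ORF, but not necessarily a CDS: deciding which ORFs actually encode proteins is the job of the gene finders discussed on the following slides.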

  9. Two major approaches to CDS prediction
     Two major approaches to prediction of protein-coding genes:
     • ab initio (ORFs with nucleotide composition similar to CDSs are also CDSs)
       • Advantages: finds “unique” genes; high sensitivity; very fast!
       • Limitations: often misses “unusual” genes; high rate of false positives
     • “evidence-based” (ORFs with translations homologous to the known proteins are CDSs)
       • Advantages: finds “unusual” genes (e.g. horizontally transferred); relatively low rate of false positive predictions
       • Limitations: cannot find “unique” genes; low sensitivity on short genes; prone to propagation of false positive results of ab initio annotation tools; slow!

  10. How ab initio tools work, very briefly
     • Statistical model of coding and non-coding regions
     • Statistical model(s) of other components (RBS)
     • Additional algorithms for refinement of predictions (overlap resolution, etc.)
     Prokaryotic gene model used by all ab initio gene finders:
     • Ribosome-binding site within certain distance of the start codon
     • One of 3 start codons
     • One of 3 stop codons
     • No frame interruptions
     [Figure: gene model diagram showing the ribosome binding site and the open reading frame. Start codon: ATG, GTG, TTG. Stop codon: TAG, TAA, TGA]
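
The four conditions of the prokaryotic gene model can be written down as a simple check. The sketch below is a deliberately naive illustration and not the algorithm of any gene finder named in this talk: the function name, the exact-match RBS motif `AGGAGG` and the 20 nt upstream window are all assumptions chosen for the example, whereas real tools use statistical RBS models.

```python
START_CODONS = {"ATG", "GTG", "TTG"}
STOP_CODONS = {"TAG", "TAA", "TGA"}


def fits_gene_model(seq, start, end, rbs_motif="AGGAGG", rbs_window=20):
    """Check a candidate gene at seq[start:end] against the simple model.

    `rbs_motif` and `rbs_window` are illustrative assumptions, not values
    taken from the presentation.
    """
    seq = seq.upper()
    cds = seq[start:end]
    # No frame interruptions, part 1: length must be a whole number of codons.
    if len(cds) < 6 or len(cds) % 3 != 0:
        return False
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    # One of 3 start codons and one of 3 stop codons.
    if codons[0] not in START_CODONS or codons[-1] not in STOP_CODONS:
        return False
    # No frame interruptions, part 2: no in-frame stop before the end.
    if any(codon in STOP_CODONS for codon in codons[1:-1]):
        return False
    # Ribosome-binding site within a certain distance of the start codon.
    upstream = seq[max(0, start - rbs_window):start]
    return rbs_motif in upstream


if __name__ == "__main__":
    with_rbs = "TTAGGAGGTTACAATGAAACCCTAA"
    print(fits_gene_model(with_rbs, 13, 25))       # gene preceded by an RBS
    print(fits_gene_model("ATGAAACCCTAA", 0, 12))  # same gene, no upstream RBS
```

A gene with no upstream sequence, as in the second call, fails the RBS condition even though its coding region is intact, which is one way a strict model of this kind can reject real genes.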

  11. Gene finders: ab initio tools; evidence-based refinement
     Ab initio tools used by the pipelines:
     • Glimmer family (Glimmer2, Glimmer3, RBS finder) -> NCBI, RAST, JCVI http://glimmer.sourceforge.net/
     • GeneMark family (GeneMark-hmm, GeneMarkS) -> NCBI http://exon.gatech.edu/GeneMark/
     • PRODIGAL -> IMG-ER, NCBI http://compbio.ornl.gov/prodigal/
     Evidence-based refinement: mostly undocumented in-house developed tools.
     Types of corrections:
     • missed genes (RAST, JCVI, NCBI)
     • frameshifts (JCVI, NCBI)
     • start sites (RAST)

  12. Known problems of all annotation pipelines
     • RNAs
       • Incomplete rRNAs
       • Trans-spliced tRNA in archaeal genomes
       • Small structural RNAs not predicted at all
     • Protein-coding genes that don’t fit into prokaryotic gene model used by ab initio gene finders
       • no RBS (leaderless transcripts)
       • interrupted translation frame: sequencing errors or translational exceptions
       • non-canonical start
     [Figure: gene model diagram showing the ribosome binding site and the open reading frame. Start codon: ATG, GTG, TTG. Stop codon: TAG, TAA, TGA]

  13. Symptoms of gene finding problems
     • Some type of mandatory feature (rRNAs, tRNAs, CDSs) is missing
     • “Truncated” genes (shorter than homologs) => unusual translation initiation features (non-canonical start codons, leaderless transcripts)
     • Many “unique” genes without protein family assignment or BLASTp hit => sequencing errors (frameshifts)
     • Undetected selenocysteines, programmed frameshifts in ~50 well-conserved protein families

  14. Supplemental tools
     • TIS (translation initiation site) prediction/correction
       • TICO http://tico.gobics.de/
       • TriTISA http://mech.ctb.pku.edu.cn/protisa/TriTISA
     • The two tools often disagree about the best TIS, especially in high GC genomes
     • Operon prediction
       • JPOP http://csbl.bmb.uga.edu/downloads/#jpop
       • http://www.cse.wustl.edu/~jbuhler/research/operons/
       • http://www.sph.umich.edu/~qin/hmm/
     • Proteins with unusual translational features: selenocysteine-containing genes
       • bSECISearch http://genomics.unl.edu/bSECISearch/

  15. Metagenomes sequenced with new technologies: low-coverage problems
     • Both 454 and Illumina require high sequence coverage in order to achieve high sequence quality (25x to >100x)
     • High sequence coverage cannot be achieved for metagenome data
     How does this affect metagenome annotation?
     [Figure: metagenome coverage compared with genome coverage along the sequence]
     • ~70% of 454 Titanium reads have at least 1 sequencing artifact (basecalls in homopolymeric runs); there is no clear pattern of error distribution
     • >100 bp Illumina reads have ~3% error rate; the error rate is higher towards the end of the read; the majority of errors are substitutions
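
A back-of-the-envelope calculation shows why the 25x to >100x range is hard to reach for members of a community. The sketch below uses the usual definition of mean coverage (total sequenced bases divided by genome size). The function names are hypothetical, the read count, genome size and read fraction in the usage lines are made-up numbers, and treating abundance as the fraction of reads that come from one organism is an assumption, not something stated in the presentation.

```python
def mean_coverage(n_reads, read_length, genome_size):
    """Mean depth of coverage: total sequenced bases / genome size."""
    return n_reads * read_length / genome_size


def member_coverage(n_reads, read_length, genome_size, read_fraction):
    """Mean coverage of one community member in a metagenome.

    `read_fraction` is the (assumed) share of all reads that originate
    from this organism.
    """
    return mean_coverage(n_reads * read_fraction, read_length, genome_size)


if __name__ == "__main__":
    # Hypothetical isolate: 1 million 100 bp reads on a 4 Mb genome.
    print(mean_coverage(1_000_000, 100, 4_000_000))
    # Same sequencing effort, but the organism contributes 1% of the reads.
    print(member_coverage(1_000_000, 100, 4_000_000, 0.01))
```

With these made-up numbers the isolate reaches the lower end of the required range, while the community member stays far below 1x.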

  16. Just one example…
     [Figure: contig with a predicted gene]
     • 4-read contig, 1476 nt, no misassembly
     • 3 frameshifts
     • Contig has 27 homopolymers (3 nt and more), 3 of them have errors
     • No correlation with homopolymer type or error type
     • Reads were quality trimmed prior to assembly
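
Counting homopolymers of 3 nt and more, as done for the contig in this example, is straightforward to reproduce. The sketch below is an illustration only: the function name and the test sequence are hypothetical, and it does not detect which homopolymers contain errors, since that requires comparison against a reference or the underlying reads.

```python
import re


def find_homopolymers(seq, min_len=3):
    """Return (start, base, length) for every run of one base >= min_len.

    Coordinates are 0-based. Only A, C, G and T runs are considered.
    """
    # A base followed by at least (min_len - 1) copies of itself.
    pattern = re.compile(r"([ACGT])\1{%d,}" % (min_len - 1))
    return [
        (m.start(), m.group(1), m.end() - m.start())
        for m in pattern.finditer(seq.upper())
    ]


if __name__ == "__main__":
    runs = find_homopolymers("ACGGGTTAAAAC")
    print(len(runs), runs)
```

Each run found this way is a position where a 454 basecalling error could insert or delete a base and shift the reading frame of a predicted gene.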

  17. Metagenome annotation tools (more details will be given)
     • GeneMark (GeneMark-hmm for reads, GeneMarkS for longer contigs) http://exon.gatech.edu/GeneMark/
     • MetaGene http://metagene.cb.k.u-tokyo.ac.jp/metagene/
     • FragGeneScan http://omics.informatics.indiana.edu/FragGeneScan/
     Full-service annotation pipelines:
     • IMG/M-ER: “metagenome gene calling” + other options http://img.jgi.doe.gov/submit
     • MG-RAST http://metagenomics.nmpdr.org/
     • CAMERA annotation pipeline http://camera.calit2.net/
