MICROBIAL GENOME ANNOTATION



Presentation Transcript


  1. MICROBIAL GENOME ANNOTATION Loren Hauser Miriam Land Yun-Juan Chang Frank Larimer Doug Hyatt Cynthia Jeffries

  2. NEB Educational Support http://www.neb.com/nebecomm/course_support.asp?

  3. Why study Computational Biology and Bioinformatics? • DNA sequencing output is growing faster than Moore’s law! • One Illumina sequencing machine produces about 0.5 Tbp/week. • There are hundreds of these, and thousands of other sequencing machines, around the world. • New sequencing technology will conceivably allow sequencing a human genome for less than $1K in less than 1 day!

  4. Why study Medical Bioinformatics? • In the near future, most cancer diagnostics will involve DNA or RNA sequencing! • In the near future, every baby born in the developed world will have their genome sequenced. Protecting privacy and your doctor's ability to use that information are the only real impediments! • Hospitals are using DNA sequencing to track antibiotic-resistant bacterial infections.

  5. DOE Undergraduate Research in Microbial Genome Analysis and Functional Genomics http://www.jgi.doe.gov/education

  6. Why Study Microbial Genomes? • Large biological mass (50% of total) • photosynthetic (Prochlorococcus) • fix N2 gas to NH3 (Rhodopseudomonas) • NH3 to NO2 (Nitrosomonas) • bioremediation (Shewanella, Burkholderia) • pathogens, BW (Yersinia pestis - plague) • food production (Lactobacillus) • CH4 production (Methanosarcina) • H2 production (Rhodopseudomonas)

  7. Example of Current Microbial Genome Projects • UC Davis: FDA-funded 100K bacterial genomes project, covering bacteria associated with food. • 100K genomes over 5 years = 20K per year; at 200 days/year, that is 100 genomes/day!
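
The throughput arithmetic on this slide can be checked with a few lines of Python. This is only a sketch of the calculation; the function name is hypothetical, and the figures (100K genomes, 5 years, 200 days per year) are the ones quoted on the slide.

```python
def genomes_per_day(total_genomes: int, years: int, days_per_year: int) -> float:
    """Average number of genomes that must be completed per day."""
    per_year = total_genomes / years      # 100,000 / 5 = 20,000 per year
    return per_year / days_per_year       # 20,000 / 200 = 100 per day


print(genomes_per_day(100_000, 5, 200))
```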

  8. Web Resources and Contact Information • http://genome.ornl.gov/microbial/ • http://www.jgi.doe.gov/ • http://genome.jgi-psf.org/ • http://www.jcvi.org/ • http://www.ncbi.nlm.nih.gov/ • http://www.sanger.ac.uk/ • http://www.ebi.ac.uk/ • ftp://ftp.lsd.ornl.gov/pub/JGI • Artemis-ready files for each scaffold (feature table plus FASTA sequence file) • Contact: landml@ornl.gov; hauserlj@ornl.gov

  9. Evolution of Sequencing Throughput

  10. Sequenced Microbial Genomes (as of Sept 6, 2012) • ARCHAEAL GENOMES: 159 FINISHED; 218 IN PROGRESS • BACTERIAL GENOMES: 3363 FINISHED; 11831 IN PROGRESS • ENVIRONMENTAL COMMUNITIES: > 50,000 samples (see MG-RAST) • http://www.expasy.ch/alinks.html • http://www.genomesonline.org • http://metagenomics.anl.gov/

  11. Published Genomes • Nitrosomonas europaea - J.Bac. 185(9):2759-2773 (2003) • Prochlorococcus MED4 & MIT9313 - Nature 424:1042-1047 (2003) • Synechococcus WH8102 - Nature 424:1037-1042 (2003) • Rhodopseudomonas palustris - Nat. Biotech. 22(1):55-61 (2004) • Yersinia pseudotuberculosis - PNAS 101(22):13826-31 (2004) • Nitrobacter winogradskyi – Appl. Envir. Micro. 72(3):2050-63 (2006) • Nitrosococcus oceani - Appl. Envir. Micro. 72(9):6299-315 (2006) • Burkholderia xenovorans – PNAS 103(42):15280-7 (2006) • Thiomicrospira crunogena – PLoS Biology 4(12):e383 (2006) • Nitrosomonas eutropha C91 – Env. Micro. 9(12):2993-3007 (2007) • Sulfuromonas denitrificans – Appl. Envir. Micro. 74(4):1145-56 (2008) • Nitrosospira multiformis -- Appl. Envir. Micro. 74(11):3559-72 (2008) • Nitrobacter hamburgensis -- Appl. Envir. Micro. 74(9):2852-63 (2008) • Saccharophagus degradans – PLoS Genetics 4(5):e1000087 (2008) • R. palustris – 5 strain comparison – PNAS 105(47):18543-8 (2008) • L. rubarum and L. ferrodiazotrophum – Appl. Envir. Micro. (in press)

  12. Basic Annotation Impacts • Design of oligonucleotide arrays • Design & prioritize protein expression constructs • Design & prioritize gene knockouts • Assessment of overall metabolic capacity • Database for proteomics • Allows visualization of whole genome

  13. Additional Analysis Impacts • Revised functional assignments based on domain fusions, functional clustering, phylogenetic profile • Regulatory motif discovery • Operon and regulon discovery • Regulatory and protein association network discovery

  14. Microbial Annotation Genome Pipeline (flow diagram; box labels only) • Scaffolds or contigs • Simple repeats • Complex Repeats • Prodigal • Model correction • tRNAs • rRNA, Misc_RNAs • Final Gene List • InterPro • PRIAM • Blast • COGs • TMHMM • SignalP • GC Content, GC skew • Function call • Web Pages • Feature table
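
The pipeline slide lists "GC Content, GC skew" without saying how they are computed. A minimal sketch using the usual definitions, GC content = (G+C)/length and GC skew = (G-C)/(G+C), is given below. The function names and the windowing scheme are assumptions for illustration, not the pipeline's actual code.

```python
def gc_content(seq: str) -> float:
    """Fraction of bases in seq that are G or C."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def gc_skew(seq: str) -> float:
    """(G - C) / (G + C); returns 0.0 when the window has no G or C."""
    seq = seq.upper()
    g, c = seq.count("G"), seq.count("C")
    return (g - c) / (g + c) if (g + c) else 0.0


def windowed(seq: str, window: int, step: int, fn):
    """Apply fn to each full-length window along seq (hypothetical helper)."""
    return [fn(seq[i:i + window]) for i in range(0, len(seq) - window + 1, step)]


# Skew per non-overlapping 4-base window of a toy sequence.
print(windowed("GGGGCCCC", window=4, step=4, fn=gc_skew))
```

On a real genome the window and step would be thousands of bases; a sign change in cumulative GC skew is commonly used to locate the replication origin and terminus.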

  15. Prodigal (Prokaryotic Dynamic Programming Genefinding Algorithm) • Unsupervised: Automatically learns the statistical properties of the genome. • Indifferent to GC Content: Prodigal performs well irrespective of the GC content of the organism. • Draft: Prodigal can train on multiple sequences, then analyze individual draft sequences. • Open Source: Prodigal is freely available under the GPL. • Reference: Hyatt D, Chen GL, Locascio PF, Land ML, Larimer FW, Hauser LJ. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics. 2010 Mar 8;11(1):119. (Highly Accessed)

  16. G+C Frame Plot Training • Takes all ORFs above a specified length in the genome. • Examines the G+C bias in each frame position of these ORFs. • Runs a dynamic programming algorithm that uses G+C frame bias as its coding scoring function to predict genes. • Takes those predicted genes and gathers dicodon usage statistics.
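
The second training step above, examining G+C bias in each frame position, can be illustrated with a small sketch. It assumes the ORFs are already extracted and in frame; the function name and the `min_length` default are hypothetical, and this is not Prodigal's own implementation.

```python
def gc_by_codon_position(orfs, min_length=90):
    """Return the G+C fraction at codon positions 1, 2 and 3.

    orfs: iterable of in-frame nucleotide strings.
    min_length: ORFs shorter than this are ignored (assumed threshold).
    """
    gc = [0, 0, 0]
    total = [0, 0, 0]
    for orf in orfs:
        orf = orf.upper()
        if len(orf) < min_length:
            continue
        usable = len(orf) - len(orf) % 3   # drop a trailing partial codon
        for i in range(usable):
            pos = i % 3                    # 0, 1, 2 = codon position 1, 2, 3
            total[pos] += 1
            if orf[i] in "GC":
                gc[pos] += 1
    return [g / t if t else 0.0 for g, t in zip(gc, total)]


# Toy ORF: ATG GCG GCC TAA
print(gc_by_codon_position(["ATGGCGGCCTAA"], min_length=0))
```

In a high-GC genome the third codon position typically carries the strongest G+C bias, which is what makes the frame with the highest third-position G+C a usable first guess at the coding frame.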

  17. Gene Prediction • Dicodon usage coding score. • Length factor added to coding score (GC-content-dependent). • Coding/noncoding thresholds sharpened (a start located downstream of a start with a higher coding score is penalized by the difference). • Dynamic programming to put genes together. • Bonuses for operon distances, larger bonus for -1/-4 overlaps. • Same-strand overlap allowed (up to 60 bases). • Opposite-strand overlap -->3'r 5'f<- allowed (up to 250 bases).
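
A dicodon (hexamer) coding score is commonly built as a log-odds ratio between in-frame hexamer frequencies in training genes and hexamer frequencies in background sequence. The sketch below shows that general idea only; the pseudocount, the choice of background, and all function names are assumptions, and Prodigal's real scoring adds the length factor and threshold adjustments listed above.

```python
import math
from collections import Counter


def dicodon_counts(seqs, step):
    """Count 6-mers; step=3 counts in-frame dicodons, step=1 counts all."""
    counts = Counter()
    for s in seqs:
        s = s.upper()
        for i in range(0, len(s) - 5, step):
            counts[s[i:i + 6]] += 1
    return counts


def make_scorer(training_genes, background, pseudocount=1.0):
    """Return a function mapping a hexamer to log(P_coding / P_background)."""
    coding = dicodon_counts(training_genes, step=3)
    bg = dicodon_counts(background, step=1)
    n_coding, n_bg = sum(coding.values()), sum(bg.values())
    vocab = 4 ** 6  # number of possible hexamers, used for smoothing

    def score(hexamer):
        p_c = (coding[hexamer] + pseudocount) / (n_coding + pseudocount * vocab)
        p_b = (bg[hexamer] + pseudocount) / (n_bg + pseudocount * vocab)
        return math.log(p_c / p_b)

    return score


def coding_score(orf, score):
    """Sum the log-odds of every in-frame dicodon in the ORF."""
    orf = orf.upper()
    return sum(score(orf[i:i + 6]) for i in range(0, len(orf) - 5, 3))


# Toy data: the "genes" are ATGAAA repeats, the background also contains poly-C.
genes = ["ATGAAAATGAAAATGAAA"]
background = ["ATGAAAATGAAAATGAAA" + "C" * 18]
scorer = make_scorer(genes, background)
print(coding_score("ATGAAAATGAAA", scorer), coding_score("C" * 12, scorer))
```

A candidate ORF that resembles the training genes gets a positive score, while sequence that looks like background gets a negative one.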

  18. Start Site Scoring: Shine-Dalgarno Motif • Examines initially predicted genes and gathers statistics on the starts (RBS motifs, ATG vs GTG vs TTG frequency). • Moves starts based on these discoveries. • Gathers statistics on the new set of starts and repeats this process until convergence (5-10 iterations). • RBS motifs are based on the AGGAGG sequence: 3-6 base motifs, with one mismatch allowed in motifs of 5 bases or longer (e.g. GGTGG or AGCAG). • Does a final dynamic programming pass with the start scoring function.
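
The RBS motif rule above (sub-motifs of AGGAGG, 3 to 6 bases, one mismatch tolerated at 5 bases or longer) can be sketched as a simple scan of the region upstream of a candidate start. The ranking used here (most matched bases, then fewest mismatches) and the function names are assumptions; Prodigal's actual motif weights are learned from the genome and also depend on spacer distance.

```python
SD = "AGGAGG"


def sd_submotifs(min_len=3, max_len=6):
    """All distinct substrings of AGGAGG with length 3-6, longest first."""
    motifs = {SD[i:i + n]
              for n in range(min_len, max_len + 1)
              for i in range(len(SD) - n + 1)}
    return sorted(motifs, key=lambda m: (-len(m), m))


def best_rbs_match(upstream):
    """Return (matched_bases, mismatches, motif, offset) or None."""
    upstream = upstream.upper()
    best, best_key = None, None
    for motif in sd_submotifs():
        allowed = 1 if len(motif) >= 5 else 0   # mismatch only for 5+ bases
        for i in range(len(upstream) - len(motif) + 1):
            window = upstream[i:i + len(motif)]
            mm = sum(a != b for a, b in zip(motif, window))
            if mm > allowed:
                continue
            key = (len(motif) - mm, -mm)        # assumed ranking
            if best_key is None or key > best_key:
                best, best_key = (len(motif) - mm, mm, motif, i), key
    return best


print(best_rbs_match("TTTAGGAGGTTT"))   # exact 6-base motif
print(best_rbs_match("TTTGGTGGTTT"))    # GGTGG: GGAGG with one mismatch
```

The second call reproduces the slide's GGTGG example: it is accepted only because the 5-base motif GGAGG is allowed a single mismatch.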

  19. Start Site Scoring: Other Motifs • If Shine-Dalgarno scoring is strong, use it. This accounts for ~85% of genomes. • If Shine-Dalgarno scoring is weak, look for other motifs. • If a strong scoring motif is found, use it (example: GGTG in A. pernix). • If no strong scoring motif is found, use the highest score of all found motifs (example: Crenarchaea, where the Tc (transcription) and Tl (translation) start sites are the same, but internal operon genes use weak Shine-Dalgarno motifs).

  20. Annotated Gene Prediction

  21. Prodigal Scoring

  22. Gene Prediction Problems – Pseudogenes

  23. Pseudogenes – Internal deletion

  24. Pseudogenes – Premature stop codon

  25. Pseudogenes – N-terminal deletion

  26. Pseudogenes – Transposon insertion

  27. Pseudogenes – Multiple frameshifts

  28. Pseudogenes – Premature Stop and Frameshift

  29. Pseudogenes – Dead Start Codon

  30. GENE PAGE

  31. ORGANISM’S (PSYC) COGS LIST

  32. Taxonomic Distribution of Top KEGG BLAST Hits

  33. Frequency distance distributions Salgado et al. PNAS (2000) 97:6652 Fig. 2

  34. Frequency distance distributions Salgado et al. PNAS (2000) 97:6652 Fig. 3b

  35. Branched Chain Amino Acid Transporter family

  36. Probable Ancient Gene (Liv Operon)

  37. Branched Chain Amino Acid Transporter family – Rhodopseudomonas palustris

  38. Example of Lateral Transfer

  39. Transporter Gene Loss in Yersinia pestis • 36 genes involved in transport in YPSE (Y. pseudotuberculosis) are nonfunctional in YPES (Y. pestis): • 13 lost due to frameshifts • 11 lost due to deletions • 6 lost due to IS element insertions • 4 (2 pairs) lost due to recombination causing deletions and frameshifts • 2 lost due to premature stop codons
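
As a quick consistency check, the five loss categories on this slide do add up to the 36 genes quoted. The sketch below tallies them and prints each category's share; the dictionary keys are shorthand labels, not the authors' terms.

```python
# Counts as given on the slide.
losses = {
    "frameshifts": 13,
    "deletions": 11,
    "IS element insertions": 6,
    "recombination (deletions and frameshifts)": 4,
    "premature stop codons": 2,
}

total = sum(losses.values())
for cause, n in losses.items():
    print(f"{cause}: {n} ({n / total:.0%})")
print("total:", total)
```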

  40. Nostoc punctiforme Signal Transduction Histidine Kinases

  41. Nostoc punctiforme Signal Transduction Histidine Kinases

  42. Nostoc punctiforme Signal Transduction Histidine Kinases

  43. Nostoc punctiforme Signal Transduction Histidine Kinases

  44. Nostoc punctiforme Regulatory Proteins
