
Genovo : De Novo Assembly for Metagenomes



Presentation Transcript


  1. Genovo: De Novo Assembly for Metagenomes Gao Song 2010/07/14

  2. Outline • Overview of Metagenomics • Current Assemblers • Genovo Assembly

  3. Overview of Metagenomics

  4. Motivation • Metagenomics is: • Why Do We Need Metagenomics? • Snapshot of bacterial community • Most bacteria cannot be cultivated (<1% can)

  5. Applications • Monitoring the impact of pollutants on ecosystems • Discovery of new genes, enzymes… - Global Ocean Sampling Expedition • Human Microbiome Project • JGI sequenced Acid Mine Drainage sample

  6. Two Paradigms • Marker Gene Sequencing • 16S rRNA: • Two ways • Other marker genes: RuBisCO, NifH • Only composition • Whole Genome Sequencing (WGS) • Detailed picture of community

  7. Complex Communities X5000 >1000 200L 1million

  8. Current Assemblers

  9. Current Status • Why not assemble reads? • ORFome assembler* • Three steps: • The putative ORFs are annotated for each read • ORFs are assembled using EULER • ORF homologs are searched for in the Integrated Microbial Genomes (IMG) database • Existing WGS assemblers • Sanger reads: Phrap, Celera, Arachne, JAZZ… • Short reads: Velvet, Newbler… * Y. Ye and H. Tang, "An ORFome assembly approach to metagenomics sequences analysis," Journal of Bioinformatics and Computational Biology, vol. 7, no. 3, pp. 455-471, June 2009

  10. Genovo: De Novo Assembly for Metagenomes Jonathan Laserson, Vladimir Jojic, and Daphne Koller. RECOMB 2010, LNBI 6044, pp. 341-356, 2010

  11. Main Idea • Propose a generative model for metagenomic data • Use iterated conditional modes (ICM) • Apply hill-climbing steps iteratively • Design a score for evaluation

  12. Model • Initialize contigs: • Infinite contigs with infinite length • Partition the reads • Using Chinese Restaurant Process
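
The Chinese Restaurant Process partition on this slide can be illustrated with a minimal sketch. It assumes a concentration parameter `alpha` (a hypothetical name, not taken from the slides) and treats each "table" as a contig: a read joins an existing contig with weight proportional to the number of reads already there, or opens a new contig with weight `alpha`. This is a toy prior sampler, not the Genovo implementation.

```python
import random

def crp_partition(n_reads, alpha, seed=0):
    """Sample a partition of reads into contigs from a Chinese Restaurant Process.

    alpha is an assumed concentration parameter: larger values open more contigs.
    Returns (assignment, counts): contig index per read, and reads per contig.
    """
    rng = random.Random(seed)
    counts = []       # counts[k] = number of reads currently assigned to contig k
    assignment = []   # assignment[i] = contig index chosen for read i
    for _ in range(n_reads):
        # Existing contig k has weight counts[k]; a brand-new contig has weight alpha.
        weights = counts + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(0)  # open a new contig
        counts[k] += 1
        assignment.append(k)
    return assignment, counts

assignment, counts = crp_partition(n_reads=100, alpha=1.0)
print(len(counts), "contigs; sizes:", counts)
```

Because the number of contigs is not fixed in advance, the prior can accommodate "infinitely many" contigs while still favoring a small number of large ones.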

  13. Model • Generate the starting point o_i • Generate the length of read • Quality of assembly of each read

  14. Algorithm • Using ICM • Starting from initial condition, hill-climbing moves are performed iteratively • Move 1: Consensus Sequence: • Select the most frequent base
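
The consensus move (select the most frequent base) can be sketched as a column-wise majority vote. The sketch below assumes gapless alignments, with each read placed at an integer offset in the contig; the function name and the `N` placeholder for uncovered positions are illustrative choices, not part of the original method, which also handles insertions and deletions.

```python
from collections import Counter

def consensus(reads, offsets):
    """Majority-vote consensus of gaplessly aligned reads.

    reads:   list of read strings
    offsets: start position of each read within the contig (assumed known)
    Positions covered by no read are reported as 'N'.
    """
    columns = {}  # contig position -> Counter of bases observed there
    for read, start in zip(reads, offsets):
        for j, base in enumerate(read):
            columns.setdefault(start + j, Counter())[base] += 1
    lo, hi = min(columns), max(columns)
    return "".join(
        columns[p].most_common(1)[0][0] if p in columns else "N"
        for p in range(lo, hi + 1)
    )

print(consensus(["ACGT", "CGTA", "ACCT"], [0, 1, 0]))
```

With the other variables held fixed, picking the most frequent base per column is the hill-climbing step for the contig letters: it is the choice that disagrees with the fewest aligned reads.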

  15. Algorithm • Move 2: Read Mapping • For read i, first remove it, then recalculate its contig and alignment • First, for each potential location, compute the alignment • Then, select the location according to its probability • Filtering: only consider locations that share a common 10-mer with the read
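
The 10-mer filtering step can be sketched as a hash index over contig k-mers: only locations that share at least one exact k-mer with the read are passed on to the (more expensive) alignment. The function names and the dictionary layout below are assumptions made for illustration, not the authors' data structures.

```python
def build_index(contigs, k=10):
    """Map every k-mer to the (contig_id, position) pairs where it occurs."""
    index = {}
    for cid, seq in contigs.items():
        for i in range(len(seq) - k + 1):
            index.setdefault(seq[i:i + k], []).append((cid, i))
    return index

def candidate_locations(read, index, k=10):
    """Return (contig_id, implied_read_start) pairs sharing a k-mer with the read."""
    candidates = set()
    for j in range(len(read) - k + 1):
        for cid, i in index.get(read[j:j + k], []):
            # A k-mer at read offset j matching contig position i implies
            # that the read would start at contig position i - j.
            candidates.add((cid, i - j))
    return candidates

contigs = {"c1": "AAAACCCCGGGGTTTTACGTACGT", "c2": "TTTTTTTTTTTTTTTT"}
index = build_index(contigs)
print(candidate_locations("CCGGGGTTTTAC", index))
```

Reads that share no 10-mer with any contig yield an empty candidate set, in which case the read would have to be placed in a new contig.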

  16. Algorithm • Move 3: update geometric variable -> • Global moves: • Propose indels • Center • Merge contigs • Chimeric reads • Disassemble the dangling contigs

  17. Evaluation • BLAST • PFAM • Designed score • 1st term: quality of assembly • 2nd term: penalty for total length • 3rd term: prefer to merge when V > V0

  18. Results • Using 454 reads • Compare with Newbler, Velvet and EULER-SR • Single Genome

  19. Results • Metagenome data • Score • PFAM

  20. Discussion • New idea • Apply a mature algorithm to the assembly domain • Systematically describe and analyze the problem and algorithm • Results are better

  21. Discussion • Slow: minutes vs. hours for 300k 454 reads • Main idea: extend contigs as far as possible, so that they get more BLAST hits • Why choose 20 for V0? • How to deal with branching? Repeats? • Model: • Why can it capture the properties of metagenomic data? • How to argue the correctness of the model? • The distribution of starting points

  22. Thank you
