1 / 32

Parallel de novo Assembly of Large Genomes from Paired Short Reads

Parallel de novo Assembly of Large Genomes from Paired Short Reads. Srinivas Aluru Electrical and Computer Engineering Iowa State University Computer Science and Engineering Indian Institute of Technology Bombay. Sequencing Technology (1977-2005). Sanger Sequencing 500-1000 bp per run

hallam
Télécharger la présentation

Parallel de novo Assembly of Large Genomes from Paired Short Reads

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Parallel de novo Assembly of Large Genomes from Paired Short Reads Srinivas Aluru Electrical and Computer Engineering Iowa State University Computer Science and Engineering Indian Institute of Technology Bombay

  2. Sequencing Technology (1977-2005) Sanger Sequencing • 500-1000 bp per run • Up to 96 runs ABI 3700

  3. Sequencing long DNA Multiple copies of the same source Sequence Unordered genome fragments Randomly fragment the copies Iowa State University

  4. Next-gen Sequencing 454 36 –> 50 –> 75 bp 100 – 200 million reads 400-500 bp 1.25 million reads Polonator • 26 bp • 200 – 400 million reads Illumina Genome Analyzer

  5. High-throughput sequencing • 10-fold increase per year • Throughput of Solid sequencer doubling every quarter recently • In 2008, Illumina customers sequenced ≈10 X GenBank size in 2005 (Science 2009) • $15 million for 5-7X of Sanger vs. $48K for 30X of Illumina for human genome (June 2009)

  6. The Resulting Data Deluge Broad this year: 50 Gbases in 12 days ~4GB/instrument/day 98 instruments 400 Gbases/day 4 TB/day raw data 1.5 Petabytes/year Broad next year: 200 Gbases in 8 days 25GB/instrument/day 100 instruments 2.5 Tbases/day 28 TB/day raw data 10 Petabytes/year CERN 15 PB/year

  7. Exact Matching Filter Directly detectpromising pairs Sanger: De novo genome assembly Overlap-Layout-Consensus pairs ≡O(n2l2) run-time O(n) pairs O(nl2) run-time Iowa State University

  8. Genome Sequencing - Examples • Human; $1 B+ (Venter et al. 2001): • 10,000 CPU hours for overlaps • 10,000 CPU hours for the rest • Parallel assembler (3 months  1 day; Kalyanaraman & Aluru JPDC 2007) • B73 maize genome; $30 million (WSU, UAZ, ISU, CSHL; Science Nov. 2009) Iowa State University

  9. Next-Gen: Graph based assembly TACG ACGTAAC Build de Bruijn graph from reads. Each read is a path Genome is a super path Use paired read constraints as guide CGTA CGTA CGTA GTAA ACGT ACGT ACGT GTAA GTAA GTAA TAAC TAAC TAAC AACG TAAC

  10. Rapidly Growing Area • ALLPATHS (Butler et al. 2008) • EULER-SR (Chaissonet al. 2008) • SHARCGS (Dohmet al. 2008) • Velvet (Zerbino and Birney 2008) • Bidirected graphs (Medvedev & Brudno 2008) • YAGA (Jackson, Schnable and Aluru 2009) • ABySS (Simpson et al. 2009) Key Challenge: memory-intensive graph models for high quality + high-coverage to compensate for short reads

  11. Generating Paired Reads GACGGTGACGTTGGG ACGGTGACGTTGGGAGCGGAGCG AGGGCTTAGAC AGGGCTTAGACGGT AGGGCTTAGACGGT AGGGCTTAGACGGTG GACGGTGACGTTGGG ACGGTGACGTTGGGAGCGGAGCG AGGGCTTAGACGGTG AGGGCTTAGAC GGTGACGGTGACGTTGGGAGCGGAGCG GGTGACGGTGACGTTGGGAGCGGAGCG AGCGGAGCG AGCGGAGCG CCCAACGTCACCGTC CGCTCCGCTCCCAACGTCACCGT CGCTCCGCTCCCAACGTCACCGT ACCGTCTAAGCCCT ACCGTCTAAGCCCT CCCAACGTCACCGTC GTCTAAGCCCT GTCTAAGCCCT CGCTCCGCTCCCAACGTCACCGTCACC CGCTCCGCTCCCAACGTCACCGTCACC CGCTCCGCT CGCTCCGCT CACCGTCTAAGCCCT CACCGTCTAAGCCCT

  12. Multiple Fragment Types Minimum Fragment Size Maximum Fragment Size Minimum Read Length Maximum Read Length

  13. Error Correction K-spectrum approach (Chaissonet al., Bioinformatics 2008) SHREC (Shroderet al., Bioinformatics 2009) Reptile (Yang and Aluru, 2010)

  14. Importance of Error Correction BartonellaHenselae 330X coverage, 36bp Illumina reads

  15. Bidirected de Bruijn Graph + CGCCGAC TCGCCGAC CGCCGACT GCCGACTA CCGACTAC CGACTACT CGCCGAC GTAGTCGC TAGTCGGC AGTAGTCG GTCGGCGA AGTCGGCG GTCGGCG GTCGGCG - TCGCCGACTACT - AGTAGTCGGCGA + - + + + - -

  16. Bidirected String Graph • Edges labeled with two strings CGACTACT GCCGACTA CCGACTAC TCGCCGAC CGCCGACT AGTAGTCG GTAGTCGC GTCGGCGA AGTCGGCG TAGTCGGC T/A C/T TCGC/AGTA G/G C/A

  17. Parallel Graph Representation • Distributed tuple List for edges • <u, v, du, dv, Cu, Cv> • <v, u, dv, du, Cv, Cu> • 2k bits to represent node IDs • Sorting tuples: • by node label: edges incident to a node in local memory • by (smaller node label, larger node label): both tuples corresponding to an edge in local memory

  18. Parallel Graph Construction • Reads are partitioned to processors based on total size • Each processor generates (k+1)-molecules (edges) from its short reads • Sort to eliminate duplicates and keep frequency count

  19. Graph Compaction • Compact chains into single edges • Reduction to familiar list ranking • List ranking: • Distributed linked list with adjacency information • Find distance from each node to end of list • Extend: • Multiple linked lists • Identify nodes with multiple edges • Undirected

  20. Transforming Edge Compaction to List Ranking: The transformation can be accomplished using a constant number of parallel sorts.

  21. Errors Detection and Removal Bubble Spurious Link Tip Singleton k k k k TCGTTGCGTGCGTGAGCGT TCGTTGCGTGCGTGAGCGT TCGTTGCGTGCGTGAGCGT

  22. Repeats and the String Graph Repeat 1 Repeat 2 While a path through the graph corresponds to the genome, it is not obvious which path it is Paired reads provide approximate distance constraints that can be used to choose between alternate traversals

  23. Measuring path support • Given a path in the graph • Each (k+1)-molecule in a read maps to some position in the graph • Can check to see if the observed distance in the path matches distance constraints

  24. Clustering Path Support • Important idea: can specify the amount of expected support for two edges in a path • For each fragment type, much of some range should be supported • Can cluster observed support into support ranges and check to see if these observed ranges match expected ranges • Use single linkage clustering to combine multiple ranges into a single unit • Key Idea: Clustering remains invariant in the face of intervening path changes • can be precalculated prior to graph traversal • These “partial” clusters can be converted to true clusters during path walking

  25. Creating a path extension “lock” For each edge in the path, we create an expected support “lock” given a candidate edge extension:

  26. Finding a Key (cluster) that Fits • Convert precalculated “partial” clusters to clusters specific to this path • Check calculated clusters against the lock: • Fits if: • Lock is a subrange of key • Key fits “tightly” against one side of the lock Fit No Fit

  27. Parallel Short Sequence Assembler • YAGA (Yet Another Genome Assembler) • 12,000 line C++ implementation • A parallel support library built on MPI • A runtime library that allows alternate implementation of underlying functionality (eases debugging) • Computational methods as permutations and transformations on arrays of tuples. Assembler Computation Primitives Runtime Library Parallel Serial MPI 80 Runtime Perfect Speedup 40 20 10 5 16 32 64 128 256

  28. Experimental Results • Synthetic data from previously assembled genomes • Protocol I – 900±100, 4300±600 • Protocol II - 330±30, 660±60, 1100±100, 2200±200, 4400±400 • 30-50bp reads, 1% error rate, and 300X coverage

  29. Experimental Results NX score is the maximum length L such that at least X percent of the genome is covered by assembled contigs of length L

  30. Performance Results (Blue Gene/L) D. Melanogaster • ~900 million reads and 512 processors • 50 mins. for data transfer • 100 mins. for the parallel phases • 20 mins. for remaining

  31. Conclusions and Future Work • HPC needed to scale graph methods to large genomes • Method applicable to heterogeneous reads • Can we do novo assemble large mammalian and plant genomes? • Error correction for repeat-rich genomes? • Parallel path walking? • Improve assembly using multiple parameters? • Scaffolding using graph and (k+1) pair clusters?

  32. Acknowledgements

More Related