Parallel de novo Assembly of Large Genomes from Paired Short Reads

Parallel de novo Assembly of Large Genomes from Paired Short Reads Srinivas Aluru Electrical and Computer Engineering Iowa State University Computer Science and Engineering Indian Institute of Technology Bombay

Sequencing Technology (1977-2005) Sanger Sequencing • 500-1000 bp per run • Up to 96 runs ABI 3700

Sequencing long DNA Multiple copies of the same source Sequence Unordered genome fragments Randomly fragment the copies Iowa State University

Next-gen Sequencing 454 36 –> 50 –> 75 bp 100 – 200 million reads 400-500 bp 1.25 million reads Polonator • 26 bp • 200 – 400 million reads Illumina Genome Analyzer

High-throughput sequencing • 10-fold increase per year • Throughput of Solid sequencer doubling every quarter recently • In 2008, Illumina customers sequenced ≈10 X GenBank size in 2005 (Science 2009) • $15 million for 5-7X of Sanger vs. $48K for 30X of Illumina for human genome (June 2009)

The Resulting Data Deluge Broad this year: 50 Gbases in 12 days ~4GB/instrument/day 98 instruments 400 Gbases/day 4 TB/day raw data 1.5 Petabytes/year Broad next year: 200 Gbases in 8 days 25GB/instrument/day 100 instruments 2.5 Tbases/day 28 TB/day raw data 10 Petabytes/year CERN 15 PB/year

Exact Matching Filter Directly detectpromising pairs Sanger: De novo genome assembly Overlap-Layout-Consensus pairs ≡O(n2l2) run-time O(n) pairs O(nl2) run-time Iowa State University

Genome Sequencing - Examples • Human; $1 B+ (Venter et al. 2001): • 10,000 CPU hours for overlaps • 10,000 CPU hours for the rest • Parallel assembler (3 months  1 day; Kalyanaraman & Aluru JPDC 2007) • B73 maize genome; $30 million (WSU, UAZ, ISU, CSHL; Science Nov. 2009) Iowa State University

Next-Gen: Graph based assembly TACG ACGTAAC Build de Bruijn graph from reads. Each read is a path Genome is a super path Use paired read constraints as guide CGTA CGTA CGTA GTAA ACGT ACGT ACGT GTAA GTAA GTAA TAAC TAAC TAAC AACG TAAC

Rapidly Growing Area • ALLPATHS (Butler et al. 2008) • EULER-SR (Chaissonet al. 2008) • SHARCGS (Dohmet al. 2008) • Velvet (Zerbino and Birney 2008) • Bidirected graphs (Medvedev & Brudno 2008) • YAGA (Jackson, Schnable and Aluru 2009) • ABySS (Simpson et al. 2009) Key Challenge: memory-intensive graph models for high quality + high-coverage to compensate for short reads

Generating Paired Reads GACGGTGACGTTGGG ACGGTGACGTTGGGAGCGGAGCG AGGGCTTAGAC AGGGCTTAGACGGT AGGGCTTAGACGGT AGGGCTTAGACGGTG GACGGTGACGTTGGG ACGGTGACGTTGGGAGCGGAGCG AGGGCTTAGACGGTG AGGGCTTAGAC GGTGACGGTGACGTTGGGAGCGGAGCG GGTGACGGTGACGTTGGGAGCGGAGCG AGCGGAGCG AGCGGAGCG CCCAACGTCACCGTC CGCTCCGCTCCCAACGTCACCGT CGCTCCGCTCCCAACGTCACCGT ACCGTCTAAGCCCT ACCGTCTAAGCCCT CCCAACGTCACCGTC GTCTAAGCCCT GTCTAAGCCCT CGCTCCGCTCCCAACGTCACCGTCACC CGCTCCGCTCCCAACGTCACCGTCACC CGCTCCGCT CGCTCCGCT CACCGTCTAAGCCCT CACCGTCTAAGCCCT

Multiple Fragment Types Minimum Fragment Size Maximum Fragment Size Minimum Read Length Maximum Read Length

Error Correction K-spectrum approach (Chaissonet al., Bioinformatics 2008) SHREC (Shroderet al., Bioinformatics 2009) Reptile (Yang and Aluru, 2010)

Importance of Error Correction BartonellaHenselae 330X coverage, 36bp Illumina reads

Bidirected de Bruijn Graph + CGCCGAC TCGCCGAC CGCCGACT GCCGACTA CCGACTAC CGACTACT CGCCGAC GTAGTCGC TAGTCGGC AGTAGTCG GTCGGCGA AGTCGGCG GTCGGCG GTCGGCG - TCGCCGACTACT - AGTAGTCGGCGA + - + + + - -

Bidirected String Graph • Edges labeled with two strings CGACTACT GCCGACTA CCGACTAC TCGCCGAC CGCCGACT AGTAGTCG GTAGTCGC GTCGGCGA AGTCGGCG TAGTCGGC T/A C/T TCGC/AGTA G/G C/A

Parallel Graph Representation • Distributed tuple List for edges • <u, v, du, dv, Cu, Cv> • <v, u, dv, du, Cv, Cu> • 2k bits to represent node IDs • Sorting tuples: • by node label: edges incident to a node in local memory • by (smaller node label, larger node label): both tuples corresponding to an edge in local memory

Parallel Graph Construction • Reads are partitioned to processors based on total size • Each processor generates (k+1)-molecules (edges) from its short reads • Sort to eliminate duplicates and keep frequency count

Graph Compaction • Compact chains into single edges • Reduction to familiar list ranking • List ranking: • Distributed linked list with adjacency information • Find distance from each node to end of list • Extend: • Multiple linked lists • Identify nodes with multiple edges • Undirected

Transforming Edge Compaction to List Ranking: The transformation can be accomplished using a constant number of parallel sorts.

Errors Detection and Removal Bubble Spurious Link Tip Singleton k k k k TCGTTGCGTGCGTGAGCGT TCGTTGCGTGCGTGAGCGT TCGTTGCGTGCGTGAGCGT

Repeats and the String Graph Repeat 1 Repeat 2 While a path through the graph corresponds to the genome, it is not obvious which path it is Paired reads provide approximate distance constraints that can be used to choose between alternate traversals

Measuring path support • Given a path in the graph • Each (k+1)-molecule in a read maps to some position in the graph • Can check to see if the observed distance in the path matches distance constraints

Clustering Path Support • Important idea: can specify the amount of expected support for two edges in a path • For each fragment type, much of some range should be supported • Can cluster observed support into support ranges and check to see if these observed ranges match expected ranges • Use single linkage clustering to combine multiple ranges into a single unit • Key Idea: Clustering remains invariant in the face of intervening path changes • can be precalculated prior to graph traversal • These “partial” clusters can be converted to true clusters during path walking

Creating a path extension “lock” For each edge in the path, we create an expected support “lock” given a candidate edge extension:

Finding a Key (cluster) that Fits • Convert precalculated “partial” clusters to clusters specific to this path • Check calculated clusters against the lock: • Fits if: • Lock is a subrange of key • Key fits “tightly” against one side of the lock Fit No Fit

Parallel Short Sequence Assembler • YAGA (Yet Another Genome Assembler) • 12,000 line C++ implementation • A parallel support library built on MPI • A runtime library that allows alternate implementation of underlying functionality (eases debugging) • Computational methods as permutations and transformations on arrays of tuples. Assembler Computation Primitives Runtime Library Parallel Serial MPI 80 Runtime Perfect Speedup 40 20 10 5 16 32 64 128 256

Experimental Results • Synthetic data from previously assembled genomes • Protocol I – 900±100, 4300±600 • Protocol II - 330±30, 660±60, 1100±100, 2200±200, 4400±400 • 30-50bp reads, 1% error rate, and 300X coverage

Experimental Results NX score is the maximum length L such that at least X percent of the genome is covered by assembled contigs of length L

Performance Results (Blue Gene/L) D. Melanogaster • ~900 million reads and 512 processors • 50 mins. for data transfer • 100 mins. for the parallel phases • 20 mins. for remaining

Conclusions and Future Work • HPC needed to scale graph methods to large genomes • Method applicable to heterogeneous reads • Can we do novo assemble large mammalian and plant genomes? • Error correction for repeat-rich genomes? • Parallel path walking? • Improve assembly using multiple parameters? • Scaffolding using graph and (k+1) pair clusters?

Acknowledgements

Parallel de novo Assembly of Large Genomes from Paired Short Reads

Parallel de novo Assembly of Large Genomes from Paired Short Reads

Presentation Transcript

De Novo Genome Assembly Using vSMP

De novo assembly from Illumina

VELVET- Using de Bruijn graphs for the de novo sequence assembly of very short reads

De novo bacterial genome sequencing: millions of very short reads assembled on a desktop computer

Hybrid error correction and de novo assembly of single-molecule sequencing reads

Gene-Boosted Assembly of a Novel Bacterial Genome from Very Short Reads

Efficient de novo assembly of single-cell bacterial genomes from short-read data sets

PE-Assembler: De novo assembler using short paired-end reads

Finding Deletions with Exact Break Points from Noisy Low Coverage Paired-end Short Sequence Reads

De novo identification of repeat families in large genomes

Detecting Copy Number Variation With Short Paired Reads

De novo assembly of RNA

Assembly of Paired-end Solexa Reads by Kmer Extension using Base Qualities

De-novo Assembly

De novo assembly from clinical sample

Assembly of Solexa tomato reads

De novo genome assembly

De Novo Genome Assembly - Introduction

De Novo Genome Assembly - Introduction

De novo assembly validation

De novo bacterial genome sequencing: millions of very short reads assembled on a desktop computer

Velvet: Algorithms for De Novo Short Assembly Using De Bruijn Graphs