De novo bacterial genome sequencing: millions of very short reads assembled on a desktop computer

David Hernandez, Patrice François, Laurent Farinelli, Magne Østerås, Jacques Schrenzel • Presented by Lucas Lochovsky

Outline • Introduction • Edena’s Methodology • Reducing Read Redundancy • Overlap Graph Construction • Transitive Edge Reduction • Graph Cleanup • Contig Production • Results • Assemblers • Assembly tasks • Additional Edena Analyses • Graph Cleaning Effectiveness • Effective Coverage Depth • Conclusions

1) Introduction • NGS will allow us to explore strange new genomes, blah blah blah…. • WGS assemblers we’ve covered so far: • Medvedev-Brudno assembler • Arachne • AMOS-Cmp • Velvet • ALLPATHS • Think you’ve seen it all?

1) Introduction (cont’d) • Edena: De novo short read assembler • Uses a classic overlap graph approach to assembly • Anyone else get a feeling of déjà vu? • Compare to other recently published NGS read assemblers • De novo assembly of two bacterial genomes sequenced with the Illumina/Solexa platform

2) Edena’s Methodology • Built around a standard overlap-layout-consensus workflow • Opted to use exact matching for overlap detection • Reduce # of spurious overlaps • Faster than using approximate matching • Also assume that all reads have the same length • Is this assumption valid?

2) Edena’s Methodology (cont’d) Four major steps: • Remove redundant reads so that dataset size is more manageable • Overlap detection and overlap graph construction • Graph cleaning: simplification and ambiguity resolution • Produce contigs

2) Edena’s Methodology (cont’d) 1) Practice your 3 R’s: Reducing Read Redundancy • Illumina Genome Analyzer has high amount of over-sampling → many redundant reads • Reduce dataset so it contains only a single copy of each read → non-redundant • Index all reads into a prefix tree • Identical reads will be mapped to the same key → no duplicate reads in this structure

2) Edena’s Methodology (cont’d) • Prefix trees are associative arrays for strings where all descendants of a node have a common prefix • Reads and their reverse complements are considered the same read → merged into the same tree key

2) Edena’s Methodology (cont’d) • Ambiguous reads discarded, since they won’t work with exact matching • Opens up possibility of coverage gaps in read data (not explored by the authors) • Original read data still useful for getting read frequencies • Contig coverage depth • Repeat identification
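The redundancy-reduction step can be sketched in a few lines. This is only an illustration under stated assumptions: a plain Python dictionary (`Counter`) stands in for the prefix tree described on the slides, and the function names (`reverse_complement`, `canonical`, `deduplicate`) are hypothetical, not Edena's. A read and its reverse complement map to the same key, reads with ambiguous bases are dropped, and the original read counts are kept for later coverage estimates.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(read: str) -> str:
    """Return the reverse complement of a DNA read."""
    return read.translate(_COMPLEMENT)[::-1]


def canonical(read: str) -> str:
    """Pick one representative for a read and its reverse complement."""
    return min(read, reverse_complement(read))


def deduplicate(reads):
    """Return (unique_reads, counts); reads with non-ACGT bases are dropped."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        if set(read) - set("ACGT"):
            continue  # ambiguous base call (e.g. N): unusable with exact matching
        counts[canonical(read)] += 1
    return sorted(counts), counts


# "ACGTT" and "AACGT" are reverse complements, so they share one key.
reads = ["ACGTT", "AACGT", "ACGTT", "ACNTT", "GGGCC"]
unique, counts = deduplicate(reads)
print(unique)           # ['AACGT', 'GGCCC']
print(counts["AACGT"])  # 3
```

The counts are kept alongside the unique reads because the slides note that the original frequencies remain useful for contig coverage depth and repeat identification.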

2) Edena’s Methodology (cont’d) 2) Overlap Graph Construction • Non-redundant read dataset is indexed by a suffix array • Déjà vu moment: Almost exactly like suffix trees from MUMmer/MUMmerGPU! • Information used to produce a bidirected overlap graph • Déjà vu moment: Just like the Medvedev-Brudno assembler! (which I presented!)
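Exact suffix-prefix overlap detection can be illustrated with a sorted suffix index and binary search. This is a simplified stand-in for a real suffix array, not Edena's implementation: it stores suffix strings directly, handles a single strand only, and the names `build_suffix_index` and `find_overlaps` are made up for this sketch.

```python
from bisect import bisect_left


def build_suffix_index(reads, min_overlap):
    """Sorted (suffix, read_id) pairs for every proper suffix of length >= min_overlap."""
    entries = []
    for rid, read in enumerate(reads):
        for start in range(1, len(read) - min_overlap + 1):
            entries.append((read[start:], rid))
    entries.sort()
    return entries


def find_overlaps(reads, min_overlap):
    """Yield (a, b, k): the last k bases of reads[a] equal the first k bases of reads[b]."""
    index = build_suffix_index(reads, min_overlap)
    for b, read in enumerate(reads):
        for k in range(min_overlap, len(read)):
            prefix = read[:k]
            # Binary search for the first index entry whose suffix equals this prefix.
            pos = bisect_left(index, (prefix, -1))
            while pos < len(index) and index[pos][0] == prefix:
                a = index[pos][1]
                if a != b:
                    yield (a, b, k)
                pos += 1


reads = ["ACGTAC", "GTACGG", "TACGGA"]
print(sorted(find_overlaps(reads, 3)))
# [(0, 1, 4), (0, 2, 3), (1, 2, 5)]
```

Because matching is exact, a lookup is a plain equality test on sorted strings, which is part of why exact matching is faster than approximate matching.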

2) Edena’s Methodology (cont’d) This slide should be review for all of you! • Bidirected graphs are kind of like directed graphs, except each edge has an orientation on each of its ends • Gives rise to three types of edges: • Edges where one arrow points out of a vertex, and one arrow points into a vertex • Edges with both arrows pointing out, and • Edges with both arrows pointing in (easiest one to do in PowerPoint!) • For a walk in a bidirected graph, for each vertex on that walk, the orientation of the edge entering the vertex must be opposite that of the edge leaving the vertex
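The walk rule for bidirected graphs is easy to state in code. The sketch below assumes a hypothetical `BiEdge` record that stores, for each end of an edge, whether the arrowhead points "in" to or "out" of that vertex, and it ignores self-loops. A walk is valid when, at every interior vertex, the arrowhead on the entering edge is the opposite of the arrowhead on the leaving edge.

```python
from typing import NamedTuple


class BiEdge(NamedTuple):
    u: str
    v: str
    u_arrow: str  # "in" or "out": arrowhead at the u end, relative to u
    v_arrow: str  # "in" or "out": arrowhead at the v end, relative to v


def arrow_at(edge, vertex):
    """Arrowhead orientation of `edge` at `vertex`."""
    if vertex == edge.u:
        return edge.u_arrow
    if vertex == edge.v:
        return edge.v_arrow
    raise ValueError(f"{vertex} is not an endpoint of {edge}")


def is_valid_walk(vertices, edges):
    """vertices: v0..vn; edges: e1..en, where edge i joins vertices i-1 and i."""
    if len(edges) != len(vertices) - 1:
        return False
    for i, edge in enumerate(edges):
        if {edge.u, edge.v} != {vertices[i], vertices[i + 1]}:
            return False
    for i in range(1, len(vertices) - 1):
        entering = arrow_at(edges[i - 1], vertices[i])
        leaving = arrow_at(edges[i], vertices[i])
        if entering == leaving:
            return False  # orientations must be opposite at an interior vertex
    return True


e_ab = BiEdge("a", "b", "out", "in")      # one arrow out, one arrow in
e_bc = BiEdge("b", "c", "out", "in")
e_bc_in = BiEdge("b", "c", "in", "in")    # both arrows pointing in

print(is_valid_walk(["a", "b", "c"], [e_ab, e_bc]))     # True
print(is_valid_walk(["a", "b", "c"], [e_ab, e_bc_in]))  # False
```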

2) Edena’s Methodology (cont’d) More review! • In a bidirected overlap graph, each vertex is a double-stranded read • Edges represent read overlaps • Three possible ways that two double-stranded reads can overlap (corresponds to the three types of edges) • Suppose we have two ds reads r1 and r2 • Each read can be oriented to the left or to the right • The three possible overlaps are: • i) Both strands point in the same direction (both reads can point left, or both can point right, it’s the same overlap either way) • ii) r1 points left and r2 points right • iii) r1 points right and r2 points left

2) Edena’s Methodology (cont’d) • Parameter: Minimum overlap size • Sensitivity vs. specificity tradeoff • Small value: Higher frequency of chance overlaps → causes path branching in graph (sensitivity favoured) • Large value: Creates more dead-end (DE) paths, i.e. reads not extended by overlapping reads on one side (specificity favoured)

2) Edena’s Methodology (cont’d) 3a) Transitive Edge Reduction • Simplifies the graph by removing nonessential edges • Generally speaking, if the graph contains v1 → v2, v2 → v3 and v1 → v3, the edge v1 → v3 is transitive and can be removed: the path v1 → v2 → v3 represents the same sequence • Reduces graph complexity by a factor of the over-sampling rate c = NL/G • N: Number of reads • L: Read length • G: Genome size

2) Edena’s Methodology (cont’d) • In sequence terms: an overlap between two reads is removed when it is already implied by larger overlaps through an intermediate read
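A minimal sketch of transitive edge removal, under loud assumptions: the graph is treated as a plain directed acyclic graph rather than a bidirected one, and an edge is dropped only when a two-step path witnesses it. The function name `transitive_reduction` and the read labels are hypothetical.

```python
def transitive_reduction(edges):
    """Remove every edge (u, w) for which edges (u, v) and (v, w) also exist.

    edges: set of (u, v) pairs from a directed acyclic overlap graph.
    """
    successors = {}
    for u, v in edges:
        successors.setdefault(u, set()).add(v)

    reduced = set(edges)
    for u, v in edges:
        # u -> v and v -> w together make u -> w redundant.
        for w in successors.get(v, ()):
            reduced.discard((u, w))
    return reduced


# Four reads tiled along a sequence, each overlapping the next two.
edges = {
    ("r1", "r2"), ("r1", "r3"),
    ("r2", "r3"), ("r2", "r4"),
    ("r3", "r4"),
}
print(sorted(transitive_reduction(edges)))
# [('r1', 'r2'), ('r2', 'r3'), ('r3', 'r4')]
```

With deeper coverage each read overlaps more neighbours, so more edges per node are transitive; this is the intuition behind the reduction scaling with the over-sampling rate.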

2) Edena’s Methodology (cont’d) 3b) Graph Cleanup • Can have multiple paths branching off a single node (branching paths) • Due to genomic repetitions, sequencing errors, and clonal polymorphisms • Genomic repetitions cannot be fixed without additional information • But the other two can be resolved

2) Edena’s Methodology (cont’d) • Sequencing errors produce short dead-end (DE) paths • Attempt to elongate branching nodes up to a certain depth md (minimum depth) • Reads that cannot be extended to a depth of md are removed • Experimentally determined that md=10 is the best value
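Dead-end removal can be sketched on a simple forward-only graph. Everything here is an assumption made for illustration: the graph is a dictionary of successor lists (not a bidirected graph), depth is counted in nodes, and a branching node always keeps its branches when none of them reaches `md` (a safeguard added in this sketch, not something the slides state).

```python
def depth_from(graph, node, limit):
    """Length in nodes of the longest forward path from node, capped at limit."""
    successors = graph.get(node, [])
    if limit <= 1 or not successors:
        return 1
    return 1 + max(depth_from(graph, nxt, limit - 1) for nxt in successors)


def reachable(graph, start):
    """All nodes reachable from start, including start itself."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(graph.get(node, ()))
    return seen


def remove_dead_ends(graph, md):
    """Drop branches at branching nodes that cannot be extended to depth md."""
    doomed = set()
    for node, successors in graph.items():
        if len(successors) < 2:
            continue  # not a branching node
        depths = {s: depth_from(graph, s, md) for s in successors}
        if max(depths.values()) < md:
            continue  # sketch-only safeguard: never remove every branch
        for s, depth in depths.items():
            if depth < md:
                doomed |= reachable(graph, s)
    return {
        node: [s for s in successors if s not in doomed]
        for node, successors in graph.items()
        if node not in doomed
    }


# "x" is a one-node dead end hanging off "a", as a sequencing error would produce.
graph = {"a": ["b", "x"], "b": ["c"], "c": ["d"], "d": [], "x": []}
print(remove_dead_ends(graph, md=3))
# {'a': ['b'], 'b': ['c'], 'c': ['d'], 'd': []}
```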


2) Edena’s Methodology (cont’d) • Also disambiguate bubbles in the graph caused by single base substitutions (aka “p-bubbles”) • Length of p-bubble is at most ms = 4L - 2T - 1 • L: Read length • T: Min. overlap size • Explore each branching path up to length ms (guaranteed upper bound) • Remove path with less coverage • Polymorphisms can be retained for later analysis
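As a worked example of the bound, the snippet below evaluates ms = 4L - 2T - 1 and picks the better-covered of two alternative paths. The read length of 35 is an assumed value for illustration (only T = 21 appears in these slides), and scoring a path by the mean read count of its nodes is an assumption of this sketch; the slides only say the path with less coverage is removed.

```python
def max_bubble_length(read_length, min_overlap):
    """Upper bound ms on p-bubble length: 4L - 2T - 1."""
    return 4 * read_length - 2 * min_overlap - 1


def resolve_bubble(path_a, path_b, coverage):
    """Return (kept, removed) for two alternative paths between the same endpoints.

    coverage maps each node to the number of original reads it represents.
    """
    def mean_coverage(path):
        return sum(coverage[node] for node in path) / len(path)

    if mean_coverage(path_a) >= mean_coverage(path_b):
        return path_a, path_b
    return path_b, path_a


print(max_bubble_length(read_length=35, min_overlap=21))  # 97

coverage = {"m1": 12, "m2": 15, "v1": 2, "v2": 1}
kept, removed = resolve_bubble(["m1", "m2"], ["v1", "v2"], coverage)
print(kept, removed)  # ['m1', 'm2'] ['v1', 'v2']
```

Returning the removed path rather than discarding it matches the point that polymorphisms can be retained for later analysis.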


2) Edena’s Methodology (cont’d) 4) Contig Production • If run in strict mode, Edena goes straight to generating contig sequences • In non-strict mode, one more cleaning step is performed first • Longer overlaps are more reliable than shorter ones • At each branching node, keep only the edges with the largest overlap • Produce contig sequences by following non-intersecting simple paths in the overlap graph • Nodes on a path must have in-degree and out-degree of exactly one
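Contig production along simple paths might look like the following. This is a sketch under assumptions, not Edena's code: reads sit on a single strand, `overlaps` maps an ordered read pair to its overlap length, a path is extended only across edges that are unambiguous at both ends, and purely circular paths are not handled. The name `build_contigs` is hypothetical. It needs Python 3.8 or later for the `:=` operator.

```python
def build_contigs(reads, overlaps):
    """reads: dict id -> sequence; overlaps: dict (a, b) -> overlap length, a precedes b."""
    succ, pred = {}, {}
    for a, b in overlaps:
        succ.setdefault(a, []).append(b)
        pred.setdefault(b, []).append(a)

    def unique_next(node):
        """Successor of node if the edge is unambiguous at both ends, else None."""
        nxt = succ.get(node, [])
        if len(nxt) == 1 and len(pred.get(nxt[0], [])) == 1:
            return nxt[0]
        return None

    def unique_prev(node):
        """Predecessor of node if the edge is unambiguous at both ends, else None."""
        prv = pred.get(node, [])
        if len(prv) == 1 and len(succ.get(prv[0], [])) == 1:
            return prv[0]
        return None

    contigs, used = [], set()
    for start in sorted(reads):
        if start in used or unique_prev(start) is not None:
            continue  # not the first read of a simple path
        sequence, node = reads[start], start
        used.add(start)
        while (nxt := unique_next(node)) is not None and nxt not in used:
            # Append only the part of the next read that extends past the overlap.
            sequence += reads[nxt][overlaps[(node, nxt)]:]
            used.add(nxt)
            node = nxt
        contigs.append(sequence)
    return contigs


reads = {0: "ACGTAC", 1: "GTACGG", 2: "TACGGA"}
overlaps = {(0, 1): 4, (1, 2): 5}  # transitive edge (0, 2) already removed
print(build_contigs(reads, overlaps))  # ['ACGTACGGA']
```

A path stops wherever an edge is ambiguous, so unresolved branches (for example, genomic repeats) end up as contig boundaries rather than as guesses.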

3) Results Survivor: WGS Assembly • Four assemblers • Two challenges • One winner

3) Results (cont’d) Contestant #1: SSAKE • Indexes reads in a prefix tree based upon first eleven 5’ bases • Identify highest possible overlap between pairs of reads • Use most highly-covered reads as starting points for read extension (i.e. assembly “nucleation points”) • So far only used for partial genome sequencing for comparative metagenomic analysis (e.g. bacterial species distinction)

3) Results (cont’d) Contestant #2: Velvet • k-mer/q-gram/k-gram/q-mer de Bruijn graph representation of reads Contestant #3: SHARCGS • Can accept base quality scores along with read data for read filtering (low quality reads discarded) • Also filter out reads with low coverage • Assembly performed with a prefix tree Contestant #4: Edena

3) Results (cont’d) Reward Challenge • Assemble the 2.82 Mbp genome sequence and the 20.7 Kbp plasmid sequence of the Staphylococcus aureus MW2 strain from Illumina reads Immunity Challenge • Assemble 1.55 Mbp genome sequence and the 3.66 Kbp plasmid sequence of the Helicobacter acinonychis Sheeba strain from Illumina reads

3) Results (cont’d) Staphylococcus aureus results • Evaluated each assembler on the parameter configurations that produced the best results • Edena: Min. overlap size: 21 bases • Velvet: k-mer value: 23 • SHARCGS: Max. gap span: 14 • SSAKE: Default parameters

3) Results (cont’d) • Compared contig assembly to published reference sequence • Non-strict mode tends to produce longer contigs at the expense of additional misassemblies • Velvet comparable to Edena strict

3) Results (cont’d) • SHARCGS unable to assemble significant contigs → insufficient coverage depth • SSAKE produced a large number of mismatches mostly at contig boundaries

3) Results (cont’d) • Authors also tried combining contig results from Edena and Velvet due to significant overlaps between their contigs • N50 and mean contig size increased relative to original results • Edena non-strict has similar influence on results as previously
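For reference, N50 is the contig length at which the contigs of that length or longer account for at least half of the total assembled bases. The sketch below uses made-up contig lengths, not values from the paper, to show why merging overlapping contigs raises both N50 and the mean.

```python
def n50(contig_lengths):
    """Length L such that contigs of length >= L hold at least half of all bases."""
    lengths = sorted(contig_lengths, reverse=True)
    half = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half:
            return length
    return 0  # empty input


def mean_length(contig_lengths):
    return sum(contig_lengths) / len(contig_lengths)


before = [80, 70, 50, 40, 30, 20]   # hypothetical contig lengths
after = [150, 50, 40, 30, 20]       # the two largest contigs merged into one

print(n50(before), n50(after))                  # 70 150
print(mean_length(before) < mean_length(after)) # True
```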

3) Results (cont’d) Helicobacter acinonychis results • Best parameter settings: • Edena: Min. overlap size: 27 (strict), 26 (non-strict) • Velvet: k-mer value: 27 • SHARCGS: Max. gap span: 10 (also must remove last four bases from each read) • SSAKE: Default parameters

3) Results (cont’d) • Results similar to those from the previous assembly challenge

3) Results (cont’d) Survivor: WGS Assembly Conclusion • Granted Immunity: Edena, Velvet • Sent to the Tribal Council: SSAKE, SHARCGS

4) Additional Edena Analyses Graph Cleaning Effectiveness • Demonstrate the effectiveness of DE path removal and p-bubble fixing • Created an ideal read pool from the S. aureus MW2 strain • Consists of one read at every possible position • No errors • No polymorphisms • Distinguish between positive and negative reads • Positive reads have at least one exact occurrence in the reference sequence • Negative reads have none
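The ideal read pool and the positive/negative distinction can be sketched as follows. The assumptions are spelled out: the reference is treated as a linear string (a real bacterial chromosome is circular), a read counts as positive if it or its reverse complement occurs exactly in the reference, and the function names are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(read: str) -> str:
    return read.translate(_COMPLEMENT)[::-1]


def ideal_read_pool(reference, read_length):
    """One error-free read starting at every position of a linear reference."""
    return [
        reference[i:i + read_length]
        for i in range(len(reference) - read_length + 1)
    ]


def is_positive(read, reference):
    """True if the read has at least one exact occurrence on either strand."""
    return read in reference or reverse_complement(read) in reference


reference = "ACGTTGCA"
pool = ideal_read_pool(reference, 4)
print(pool)                            # ['ACGT', 'CGTT', 'GTTG', 'TTGC', 'TGCA']
print(is_positive("CAAC", reference))  # True  (reverse complement of GTTG)
print(is_positive("ACGA", reference))  # False (a negative read)
```

Every read in the ideal pool is positive by construction, which is what lets the ideal dataset isolate graph anomalies caused by the genome itself.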

4) Additional Edena Analyses (cont’d) • Ideal dataset indicates branching nodes and p-bubbles caused by genomic repetition • Anomalies in real datasets only due to negative reads • Due to small quantity of branching nodes in the ideal dataset, branch removal procedure is extremely effective

4) Additional Edena Analyses (cont’d) • Though many p-bubbles consist of sequences made of negative reads, most cannot be explained by base calling errors • Thought to correspond to underrepresented clonal polymorphisms

4) Additional Edena Analyses (cont’d) • Since there are no DE paths in the ideal dataset, expect that DE removal should remove all DE paths in real dataset (i.e. dead-ends correspond to negative reads) • From tests with different md values (below), authors decided 10 was best • Not so clear-cut to me

4) Additional Edena Analyses (cont’d) • Most DE paths have length 1 • Correspond to paths created by base calling errors • Longer DE paths exist that do not appear to be caused by such errors • Thought to be clonal polymorphisms in low abundance → can’t form a complete p-bubble

4) Additional Edena Analyses (cont’d) Effective Coverage Depth • Computed effective coverage depth according to formula from Lander and Waterman • E = N(L-T)/G • N: # of usable reads • L: Read length • T: Req. overlap length • G: Genome size • Can also estimate the number of gaps in read coverage with N · e^(-E)
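The two formulas translate directly into code. The input numbers below are hypothetical round values chosen to make the arithmetic easy to follow; they are not the read counts from the paper.

```python
import math


def effective_coverage(n_reads, read_length, min_overlap, genome_size):
    """Effective coverage depth E = N(L - T) / G."""
    return n_reads * (read_length - min_overlap) / genome_size


def expected_gaps(n_reads, effective_depth):
    """Expected number of coverage gaps, N * e^(-E)."""
    return n_reads * math.exp(-effective_depth)


# Hypothetical inputs: 1,000,000 usable reads of 35 bases, a required
# overlap of 21 bases, and a 2.8 Mbp genome.
E = effective_coverage(1_000_000, 35, 21, 2_800_000)
print(E)  # 5.0
print(expected_gaps(1_000_000, E))
```

Because the gap estimate falls off exponentially with E, a modest increase in effective depth changes the expected number of gaps by orders of magnitude. Requiring a longer overlap T lowers E even when raw coverage stays the same.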

4) Additional Edena Analyses (cont’d) • S. aureus sequencing • Raw coverage depth: 48x • Effective coverage depth: 14x • H. acinonychis sequencing • Raw coverage depth: 284x • Effective coverage depth: 36x • Statistics imply that there should be no gaps in H. acinonychis assembly, and only a few in S. aureus • But each actual assembly contained several hundred gaps

4) Additional Edena Analyses (cont’d) • Statistics assume uniform read sampling • Investigated underrepresented parts of genomes • After alignment of reads to reference genome, extracted low coverage sequences • These sequences have complex motifs and single base repeats → cause difficulty in replication

5) Conclusions • Edena holds up well against other recent assemblers, in both assembly quality and computational resources • Some assemblers are partially complementary to each other (Edena and Velvet) → can use together to produce results better than each individual assembler’s results • Rise of NGS paired read data will help produce longer contigs and clean up ambiguities

Is Edena The One? The One that will herald the beginning of cost-effective whole genome assembly with NGS? Maybe you should ask the Oracle…

That’s all folks! Discussion Questions • What were the strengths/weaknesses of Edena? How would you improve it? • How do you think Edena compares to the other assemblers tested? Would you test it against other assemblers not tested here? • Given Edena’s limitations, would you trust it for de novo genome assembly over traditional sequence assembly? • Why did we have to discuss yet another NGS genome assembler today?
