Assembling and comparing genomes or transcriptomes using de Bruijn graphs

Assembling and comparing genomes or transcriptomes using de Bruijn graphs Daniel Zerbino, Marcel Schulz, Wendy Lee, Ewan Birney and David Haussler

Outline • De Bruijn graphs Construction and error correction • Repeat resolution Resolving repeats using long reads or paired-end reads • Practical considerations • Extensions to Velvet Assembling RNAseq into full length transcripts. Assisted de novo assembly with de Bruijn graphs.

De Bruijn graphs Construction and error correction

A quick example TAGTCGAGGCTTTAGATCCGATGAGGCTTTAGAGACAG AGTCGAG CTTTAGA CGATGAG CTTTAGA GTCGAGG TTAGATC ATGAGGC GAGACAG GAGGCTC ATCCGAT AGGCTTT GAGACAG AGTCGAG TAGATCC ATGAGGC TAGAGAA TAGTCGA CTTTAGA CCGATGA TTAGAGA CGAGGCT AGATCCG TGAGGCT AGAGACA TAGTCGA GCTTTAG TCCGATG GCTCTAG TCGACGC GATCCGA GAGGCTT AGAGACA TAGTCGA TTAGATC GATGAGG TTTAGAG GTCGAGG TCTAGAT ATGAGGC TAGAGAC AGGCTTT ATCCGAT AGGCTTT GAGACAG AGTCGAG TTAGATT ATGAGGC AGAGACA GGCTTTA TCCGATG TTTAGAG CGAGGCT TAGATCC TGAGGCT GAGACAG AGTCGAG TTTAGATC ATGAGGC TTAGAGA GAGGCTT GATCCGA GAGGCTT GAGACAG

Comparison of different assembly techniques • Overlap-Layout-Consensus (traditional) • Traditional assemblers: Phrap, Arachne, Celera etc. • Short reads: Edena • Generally more expensive computationally • Pairwise global alignments • However, as reads get longer (>300bp ?) produce better results • They use the alignments of entire reads not isolated k-mer overlaps

A quick example AGTCGAG CTTTAGA CGATGAG CTTTAGA GTCGAGG TTAGATC ATGAGGC GAGACAG GAGGCTC ATCCGAT AGGCTTT GAGACAG AGTCGAG TAGATCC ATGAGGC TAGAGAA TAGTCGA CTTTAGA CCGATGA TTAGAGA CGAGGCT AGATCCG TGAGGCT AGAGACA TAGTCGA GCTTTAG TCCGATG GCTCTAG TCGACGC GATCCGA GAGGCTT AGAGACA TAGTCGA TTAGATC GATGAGG TTTAGAG GTCGAGG TCTAGAT ATGAGGC TAGAGAC AGGCTTT ATCCGAT AGGCTTT GAGACAG AGTCGAG TTAGATT ATGAGGC AGAGACA GGCTTTA TCCGATG TTTAGAG CGAGGCT TAGATCC TGAGGCT GAGACAG AGTCGAG TTTAGATC ATGAGGC TTAGAGA GAGGCTT GATCCGA GAGGCTT GAGACAG

A quick example AGTCGAG CTTTAGA CGATGAG GTCGAGG TTAGATC ATGAGGC GAGACAG GAGGCTC ATCCGAT TAGAGAA TAGTCGA CCGATGA TTAGAGA CGAGGCT AGATCCG TGAGGCT AGAGACA GCTTTAG TCCGATG TCGACGC GATCCGA GATGAGG TCTAGAT AGGCTTT GGCTTTA TAGATCC

A quick example TAGTCGA AGTCGAG GTCGAGG CGAGGCT GAGGCTC AGGCTTT TCTAGAT GGCTTTA TTAGATC GCTTTAG TAGATCC CTTTAGA AGATCCG GATCCGA ATCCGAT TCCGATG CCGATGA TTAGAGA CGATGAG TAGAGAA GATGAGG AGAGACA ATGAGGC GAGACAG TGAGGCT

Comparison of different assembly techniques • Greedy hash table searches (~2007) • Transitional assemblers: SSAKE, SHARCGS, VCAKE • Much faster & leaner than the previous • Just a k-mer hash • Less robust to variation and noise (i.e. shorter and/or less reliable contigs) • A local exploration of overlapping k-mers • Essentially simplified de Bruijn graph assemblers

TCGA (1x)‏ CGAG (1x)‏ A quick example One read: GTCGAGG GTCG (1x)‏ GAGG (1x)‏

A quick example All the others… GATT (1x)‏ GATC (8x)‏ TGAG (9x)‏ ATGA (8x)‏ GATG (5x)‏ CGAT (6x)‏ CCGA (7x)‏ TCCG (7x)‏ ATCC (7x)‏ AGAT (8x)‏ AGAA (1x)‏ GCTC (2x)‏ CTCT (1x)‏ TCTA (2x)‏ CTAG (2x)‏ GGCT (11x)‏ AGAC (9x)‏ TAGA (16x)‏ TAGT (3x)‏ AGTC (7x)‏ GTCG (9x)‏ TCGA (10x)‏ CGAG (8x)‏ GAGG (16x)‏ AGGC (16x)‏ AGAG (9x)‏ GAGA (12x)‏ GACA (8x)‏ ACAG (5x)‏ TTAG (12x)‏ CTTT (8x)‏ TTTA (8x)‏ GCTT (8x)‏ ACGC (1x)‏ CGAC (1x)‏ GACG (1x)‏

Comparison of different assembly techniques • De Bruijn graph assemblers • Currently the most common for NGS: Euler, Allpaths, Velvet, ABySS, SOAPdenovo • Essentially a compromise between the two previous approaches: • Just a k-mer hash • … but processed globally • The first part (graph construction & correction) is essentially common to all these assemblers, with a few implementation differences (e.g. parallelisation in ABySS)

A quick example All the others… GATT (1x)‏ GATC (8x)‏ TGAG (9x)‏ ATGA (8x)‏ GATG (5x)‏ CGAT (6x)‏ CCGA (7x)‏ TCCG (7x)‏ ATCC (7x)‏ AGAT (8x)‏ AGAA (1x)‏ GCTC (2x)‏ CTCT (1x)‏ TCTA (2x)‏ CTAG (2x)‏ GGCT (11x)‏ AGAC (9x)‏ TAGA (16x)‏ TAGT (3x)‏ AGTC (7x)‏ GTCG (9x)‏ TCGA (10x)‏ CGAG (8x)‏ GAGG (16x)‏ AGGC (16x)‏ AGAG (9x)‏ GAGA (12x)‏ GACA (8x)‏ ACAG (5x)‏ TTAG (12x)‏ CTTT (8x)‏ TTTA (8x)‏ GCTT (8x)‏ ACGC (1x)‏ CGAC (1x)‏ GACG (1x)‏

A quick example After simplification… GATT AGAT GATCCGATGAG AGAA GCTCTAG CGAG TAGTCGA TAGA GAGGCT AGAGA AGACAG GCTTTAG CGACGC

Error removal Tips removed… AGAT GATCCGATGAG GCTCTAG CGAG TAGTCGA TAGA GAGGCT AGAGA AGACAG GCTTTAG

Error removal Bubbles removed (Tour Bus)… AGAT GATCCGATGAG CGAG TAGTCGA TAGA GCTTTAG GAGGCT AGAGA AGACAG

Error removal Final simplification… AGATCCGATGAG AGAGACAG TAGTCGAG GAGGCTTTAGA

So what’s the difference? • Algebraic difference: • Reads in the OLC methods are atomic (i.e. a arc) • Reads in the DB graph are sequential paths through the graph • … this leads to practical differences: • DB graphs allow for a greater variety of overlaps. • Overlaps in the OLC approach require a global alignment, not just a shared k-mer

Simulations: Tour Bus - error removal

Repeat resolution Resolving repeats using long reads or paired-end reads

Scaling up the hard way: chromosome X • 548 million Solexa reads were generated from a flow-sorted human X chromosome. • Fit in 70GB of RAM. • Many contigs: 898,401 contigs • Short contigs: 260bp N50 (however max of 6,956bp) • Overall length: 130Mb. • Moral: there are engineering issues to be resolved but the complexity of the graph needs to be handled accordingly. • Reduced representation (Margulies et al.). • Combined re-mapping and de novo sequencing (Cheetham et al., Pleasance et al.). • Code parallelization (Birol et al., Jeffrey Cook) • Improved indexing (Zamin Iqbal and Mario Caccamo). • Use of intermediate re-mapping (Matthias Haimel).

Repeats in a de Bruijn graph

Rock Band: Using long and short reads together A B

Using long and short reads together Zerbino et al, 2009

Different approaches to repeat resolution • Theoretical: spectral graph analysis • Equivalent to a Principal Component Analysis or a heat conduction model • Relies on a (massive) matrix diagonalization • Comprehensive: all the data is integrated at once • Robust: small variations don’t disturb the overall result, • To the best of my knowledge, never used because of the computational cost.

Different approaches to repeat resolution • Traditional scaffolding • E.g. Arachne, Celera, BAMBUS. • Heuristic approach similar to that used in traditional overlap-layout-consensus contigging. • Build a big graph of pairwise connections, simplify, extract obvious linear components.

Different approaches to repeat resolution • In NGS assemblers: • EULER: for each pair of reads, find all possible paths from one read to the other. • ABySS: Same as above, but the read-pairs are bundled into node-to-node connections to reduce calculations • ALLPATHS: Same as above, but the search is limited to localized clouds around pre-computed scaffolds. A B

Different approaches to repeat resolution • Using the differences between insert length • The Shorty algorithm uses the variance between read pairs anchored on a common contig on k-mer.

Pebble: Handling paired-end reads Zerbino et al, 2009

Practical considerations

Colorspace • Di-base encoding has a 4 letter alphabet, but very different behavior to sequence space • Different rules for complementarity • Direct conversion to sequence-space is simple but wasteful • One error messes up all the remaining basepairs • Conversion must therefore be done at the very end of the process, when the reads are aligned • You can then use the transition rules to detect errors

Different error models • When using different technologies, you have to take into account different technologies • Trivial when doing OLC assembly • Much more tricky when doing de Bruijn assembly, since k-mers are not assigned to reads. • Different assemblers have different settings (e.. Newbler)

Pre-filtering the reads • Some assemblers have in-built filtering of the reads (e.g. Euler) but not a generality. • Efficient filtering of low quality bases can cut down on the computational cost (memory & time) • Beware that some assemblers require reads of identical lengths.

Oases: de novo transcriptome assembly How to study mRNA reads which do not map

De Bruijn graphs ~ splice graphs Heber et al, 2002

Oases: Process • Create clusters of contigs: • Connecting reads • Connecting read pairs • Re-implement traditional algorithms to run on graphs, instead of a reference genome • Greedy transitive reduction (Myers, 2005)‏ • Motif searches • Dynamic assembly of transcripts (Lee, 2003)‏

Oases: Output • Full length transcripts – handles alternative splicing • Transcripts from different genes – handles varying coverage levels • Alternative splicing events: when possible, distinguishes between common AS events.

Oases: Results • Benchmarked within the RGASP competition • Already 100+ users, some reporting very promising results: • N50 = 1886 bp (Monica Britton, UC Davis) • 99.8% of alignmed transcripts against unigene (Micha Bayer - Scottish Crop Research Institute)

#S2-DRSC PE-reads # loci # transcripts median length mapped to genome Overlapping ENSEMBL transcripts (strand specific) #identified exons (strand specific)‏ 43,836,085* 32,905 53,299 214 89% 61% (34%) 72,892 (40,272) 51,390,576 32,966 56,639 210 88% 59% (33%) 74,089 (41,372) Oases: results on Drosophila PE-reads Marcel Schulz read length=36 k=21 insert size=200 *only pairs where each read has < 7 bases with phred quality <= 10

transcript connecting 2 genes (locus 2589)‏ Transcript connecting 2 genes (locus 2589)‏ Oases: examples Marcel Schulz Unannotated gene with 7 exons (locus 136)‏

Columbus Assisted de novo assembly with de Bruijn graphs

De novo: No reference needed Handles novel structures without a priori Mapping vsde novo • Mapping: • Many tools available • Easy to compute on a farm • Simpler to analyze How to get the best of both worlds?

Repeats in a de Bruijn graph

Columbus

Columbus • Solution: the reference-based de Bruijn graph • Preserves the structure of reference sequences • Integrates alignment information (SAM/BAM file) • Edits, breaks up, reconnects, and extends the “alignment contigs” as necessary. • Applications: • Comparative transcriptome assembly • Cancer RNAseq • Comparative assembly

Summary • The explicit use of de Bruijn graphs is very convenient to handle next-generation sequencing data and to analyze genomic repeat structures, as well as splicing structures. • Velvet v.0.7 with Pebble is distributed freely under GPL license: www.ebi.ac.uk/~zerbino/velvet • Oases v.0.1 is also available freely under GPL license: www.ebi.ac.uk/~zerbino/oases • Columbus (to be released around summer) will allow users to merge the de novo and mapping paradigms, thus using all the available information efficiently, for both genomic and transcriptomic assembly

Acknowledgements • European Bioinformatics Institute • Ewan Birney • Zamin Iqbal • Mario Caccamo • Grégoire Pau • Tim Massingham • MPI Evol. Gen • Marcel Schulz • UC Santa Cruz • David Haussler • Wendy Lee • Sanger Institute • Richard Durbin • David Carter • Tony Cox • Zemin Ning • Julian Parkhill • Illumina • Clive Brown • David Bentley • Main Velvet collaborators/contributors • Dan MacLean, Jonathan Jones, David Studholme • Simon White, Steven Searle, John Marshall • Paul Havlak, Elliott Margulies • Simon Gladman, Scott Chandry • Lisa Murray, Tony Cox • Lixin Zhou, Mark Chee • Paul Harrison • Korbinian Schneeberger, Detlef Weigel • Michael Chen, Mark Chee, Melissa de la Bastide • Darren Platt • Vrunda Sheth, Craig Cummings

Assembling and comparing genomes or transcriptomes using de Bruijn graphs