Canadian Bioinformatics Workshops

Canadian Bioinformatics Workshops www.bioinformatics.ca

Module #: Title of Module 2

Module 2Genome Assembly Michael Brudno Informatics on High Throughput Sequencing Data July 2009

Questions to Ask • What do you want to learn? • Do you have a finished similar genome? • Do you have matepair data? • Do you have other read types? • What compute power do you have available?

What are your options • What do you want to learn? • -- Variation (within a species) • -- Conservation (between species) • You can try to build a genome from scratch • You can do reference assisted assembly • -- However you will miss some differences • You can map reads to a reference genome • -- Great if you want to discover variation • -- What if you don’t have a reference?

Genome re-sequencing: somatic mutation detection, organismal SNP discovery, mutational profiling, structural variation discovery reference genome DEL SNP De novo genome sequencing Dealing with non-uniqueness in the genome: resequenceability Mapping versus Assembly

Whole Genome Shotgun Sequencing DNA SEQUENCER Sanger vs. NGS reads ASSEMBLER C++ contigs FINISHING Insert Size ± Error sequence

Assembler • Input: set of strings over {A,C,G,T} called reads • Output: A common superstring of the reads. • {TACAT,CATAC, ACGTAC} TACATACGTAC • Initially: Shortest Common Superstring (SCS) • NP-hard [Gallant et al 1980] • Over-collapsing of repeats • Alternate approaches • de Bruijn graphs [Pevzner, Tang, Waterman 01] • string graphs [Myers 05] • Both formulations are NP-hard.

The Theory 0% 50% 100%

CA GC AG TC AT TT De Bruijn Graphs {AGC, ATC, ATT, CAG, CAT, GCA, TCA, TTC} • Nodes are (k-1)-mers • Edges are k-mers • The set of k-mers is called a k-spectrum • Finding shortest string with given k-spectrum equivalent to Chinese Postman Pevzner 1989

CA GC AG TC AT TT Chinese Postman Tours {AGC, ATC, ATT, CAG, CAT, GCA, TCA, TTC} • Solving Chinese Postman: An Eulerian tour is a solution • Euleriazation: make a graph Eulerian • Can be done with min cost flow: • Unbalanced nodes are sources/sinks • Duplicate all edges used in flow Pevzner 1989

Modeling Double-Strandedness 3’ 5’ GGCAAT • How can two DNA molecules overlap? ATTGCC 5’ 3’ A A C +CTT +AAC -AAG -GTT C T T A A C +TCG +AAC -CGA -GTT T C G T G G +TGG +AAC -CCA -GTT A A C Kececioglu 1992

+AC +AA +AT -GT -TT -AT +CC +CA +GC -GG -TG -GC Bidirected Graphs • A walk has to “match” directions at each node. • GGCAAT • ATTGCC

{GTT, TAA, TTG, TGG, GGC, GCA, CAA} 5’ 3’ GTTGGCAAT GC GT CC CA TG GC GG AT TT GG AC CC GT TT CA TG AC AT AA Using de Bruijn Graphs • The shortest walk that visits every edge at least once (a Chinese postman tour) is the shortest string with the given k-spectrum [Pevzner 1989] {AAC, ATT, CAA, CCA, GCC, TGC, TTG} ATTGCCAAC 5’ 3’ AA

The shortest walk that visits every edge at least once (a Chinese postman tour) is the shortest DNA molecule with the given k-spectrum {GTT, TAA, TTG, TGG, GGC, GCA, CAA} 5’ 3’ +AC +AA GTTGGCAAT +AT -GT -TT -AT +CC +CA +GC -GG -TG -GC +AC +AA +AT -GT -TT -AT +CC +CA +GC -GG -TG -GC Using Bidirected de Bruijn Graphs {AAC, ATT, CAA, CCA, GCC, TGC, TTG} ATTGCCAAC 5’ 3’

Motivation: String Graphs • Several downsides of the de Bruijn approach • Division into k-mers arbitrary • Very sensitive to sequencing errors • Not memory efficient (one node per k-mer) • Goal • One node per read (or better) • No division into k-mers • Flexibility in the presence of sequencing errors Myers 2005

ACGTAC CATAC TACAT How To Build A String Graph (1) TACATACGTAC {ACGTAC, CATAC, TACAT} • Nodes are reads • Edges are overlaps • Weights are lengths of non-overlapping prefix • Transitively inferable overlaps 3 5 3 2 2

How To Build A String Graph(2) • Build overlap graph • Remove transitively inferable overlaps • Collapse Chains Required Optional Goal: Find the shortest path using all of the required edges and any optional ones Myers 2005

Resolving Conflict Nodes (Repeats) With Mate Pairs A B A’ B’ B A A’ B’ ? Does there exist a short path between A’ and B’? • Dijkstra’s shortest path algorithm -- bounded • Greedily join edges if they have enough supporting reads.

repeat region From Paths To Contigs Merge reads up to potential repeat boundaries Unique Contig Overcollapsed Contig

Derive Consensus Sequence Derive multiple alignment from pairwise read alignments TAGATTACACAGATTACTGA TTGATGGCGTAA CTA TAGATTACACAGATTACTGACTTGATGGCGTAAACTA TAG TTACACAGATTATTGACTTCATGGCGTAA CTA TAGATTACACAGATTACTGACTTGATGGCGTAA CTA TAGATTACACAGATTACTGACTTGATGGGGTAA CTA TAGATTACACAGATTACTGACTTGATGGCGTAA CTA Derive each consensus base by weighted voting (Alternative: take maximum-quality letter)

Some Comparisons

Assemblers • Previous (Sanger) Assemblers • NGS Assemblers • SSAKE (Jeck et al., 2007) • VCAKE (Warren et al. 2007) • SHARCGS (Dohm et al. 2007) • Shorty (Chen and Skiena 2007) • ALLPATHS (Butler et al. 2008) • Edena (Hernandez et al. 2008) • Euler-(U)SR (Chaisson and Pevzner 2008, 2009) • Velvet (Zerbino and Birney, 2008)

Assembling Color-Space • Overlap & layout are pretty much the same • Calling consensus is tricky: T1200320010202312010100223 T1200320010202312010100223 T1023120101002213013300131 T0120101002213013300131120 T3201010022130133001311232 0101002213013300131120210T 0100223301330013112021101T 0022130133001311202100001T 2213013300131120211000010T T0001311222110000123202211 T120032001020231201010022130133001311202110000103202211 AAATCCCAAGGATGAACCAAAGACGGTATTTGCACTTCACCCCCAGCTNNNNNN

AB SOLiD: Dibase Sequencing AB SOLiD reads look like this: T012233102 T012033102 G G G A T G G C A A T A C G T T T A 0 0 TGAGCGTTC|||TGAATAGGA 2 A G 1 3 3 1 C T 2 0 0

Look at: 012233102 Recall: 012033102 4 translations for every color sequence AB SOLiD: Translations TGAGCGTTC|||||||||TGAGCGTTC TGAGCGTTC|||TGAATAGGA 0 0 2 A G 1 3 3 1 C T 2 0 0

Our Hidden Markov Model (for colors) • At every pair of consecutive positions: • don’t know the donor nucleotides, • have some color-space and/or letter-space reads • The donor could be: • letters: AA color 0 • letters: AC color 1 • : • letters: TT color 0 • 16 combinations Note: AA and TT give the same colors! So we have redundancy.

Colors and Letters letters: AA color 0 AA and TT give the same colors! So we have redundancy. letters: TT color 0 • can’t just call colors, since they can represent one of several translations • to properly call SNPs, we need to model underlying letters.

States of the Model Consider donor at positions 532, 533 and 534. At each pair we have one color, two letters … X Y Z … position 532/3 533/4 AA AA 16 states only certain transitions allowed each state depends on the previous states, but not further (Markov Process) : . . . . . . . . . AT CA : CT GA : : TT TT

Emissions Unknown genome …NNNNNNNNNNNNNN… T01020100311223 T1030101311223 T20100311223 Color reads color emissions T20100311223C letter emissions Letters

HMM Emissions Emissions AA TT emission probability color 0 1 – ε/3 color 1 ε Same distribution of emissions in color-space color 2 ε color 3 ε letters A (1- ξ/3) letters C ξ different emissions in letter-space letters G ξ letters T ξ

Emissions Probability …NNNNNNNNNNNNNN… How do we use emissions? Assign an Emission Probability to each state: What is the probability that this state emitted these reads. T01020100311223 T1030101311223 T20100311223 E.g. For state CC: T20100311223C

Just run Forward-Backward! • Have set-up a form of an HMM • run Forward-Backward algorithm • get probability distribution over states AA : likely state AT CA : CT GA : : TT

ADiR Color-Space Contigs T1200320010202312010100223 T1200320010202312010100223 T1023120101002213013300131 T0120101002213013300131120 T3201010022130133001311232 0101002213013300131120210T 0100223301330013112021101T 0022130133001311202100001T 2213013300131120211000010T T0001311222110000123202211 AAATCCCAAGGATGAACCAAAGACGGTATTTGCACTTCACCCCCAGCTNNNNNN 677999999999999999999999999999999999999999999322211110

Assembly For Different Data • For shorter reads, de Bruijn graphs capture most of the information • For longer ones, string graph-like approaches are better • For de novo, read length is key – many efforts to combine 454 and SOLiD/Solexa data • Color space is tricky, but can be handled with the right tools.

Thank you for your attention Michael Brudno University of Toronto

1 1 1 1 1 1 1 1 1 2 1 1 1 1 2 1 1 Simpifying String Graphs Repeatedly collapse vertices using 4 simple rules 2 1 1 2 1 1 2 1 1 2 1 1 1 1 After all collapsals are performed, the remaining nodes (conflict nodes) all have in and out degree of at least 2.

Canadian Bioinformatics Workshops