CSCI2950-C Lecture 3: DNA Sequencing and Fragment Assembly http://cs.brown.edu/courses/csci2950-c/
Outline • EULER fragment assembly • Mate-pairs, scaffolding and copy number • Next-generation DNA Sequencing • Cancer Genome Sequencing
Whole Genome Shotgun Sequencing • The genome is cut many times at random • Fragments are cloned into plasmids (2 – 10 Kbp) or cosmids (40 Kbp) • Forward-reverse paired reads (mate pair) of ~500 bp each are taken a known distance apart
Overlap-Layout-Consensus Assemblers: ARACHNE, PHRAP, CAP, TIGR, CELERA Overlap: find potentially overlapping reads Layout: merge reads into contigs and contigs into supercontigs Consensus: derive the DNA sequence and correct read errors ..ACGATTACAATAGGTT..
Approaches to Fragment Assembly Find a path visiting every VERTEX exactly once in the OVERLAP graph: Hamiltonian path problem NP-complete: no efficient (polynomial-time) algorithm is known
Approaches to Fragment Assembly (cont’d) Find a path visiting every EDGE exactly once in the REPEAT graph: Eulerian path problem Linear time algorithms are known
EULER - A New Approach to Fragment Assembly • Traditional “overlap-layout-consensus” technique has a high rate of mis-assembly • EULER uses the Eulerian Path approach borrowed from “sequencing by hybridization” (SBH) • Fragment assembly without repeat masking can be done in linear time with greater accuracy
Sequencing by Hybridization (SBH) • Build a microarray with all 4^l DNA sequences of length l (l ~ 20) • For DNA sequence s, measure l-mer composition
l-mer composition Def: Given a string s of length n, Spectrum ( s, l ) is the unordered multiset of all (n – l + 1) l-mers in s • The order of individual elements in Spectrum ( s, l ) does not matter • For s = TATGGTGC all of the following are equivalent representations of Spectrum ( s, 3 ): {TAT, ATG, TGG, GGT, GTG, TGC} {ATG, GGT, GTG, TAT, TGC, TGG} {TGG, TGC, TAT, GTG, GGT, ATG}
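The definition of Spectrum ( s, l ) can be sketched in a few lines of Python. This is only an illustration; the function name `spectrum` is an assumption, not something from the lecture. A `Counter` is used because the spectrum is a multiset and its order does not matter.

```python
from collections import Counter

def spectrum(s: str, l: int) -> Counter:
    """Return the multiset of all (n - l + 1) l-mers of s as a Counter."""
    return Counter(s[i:i + l] for i in range(len(s) - l + 1))

spec = spectrum("TATGGTGC", 3)
print(sorted(spec.elements()))
# ['ATG', 'GGT', 'GTG', 'TAT', 'TGC', 'TGG']
```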
The SBH Problem • Goal: Reconstruct a string from its l-mer composition • Input: A multiset S, representing all l-mers from an (unknown) string s • Output: String s such that Spectrum ( s,l ) = S
SBH: Eulerian Path Approach S = { ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT } Vertices correspond to ( l – 1 )-mers: { AT, TG, GC, GG, GT, CA, CG } Edges correspond to l-mers from S [Figure: de Bruijn graph of S on these vertices; the path shown visits every EDGE once]
SBH: Eulerian Path Approach S = { ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT } Two different paths give different sequence reconstructions: ATGGCGTGCA and ATGCGTGGCA [Figure: the two Eulerian paths in the de Bruijn graph of S]
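A minimal sketch of the de Bruijn graph construction for this example, assuming a plain adjacency-list representation (the helper name `de_bruijn` is hypothetical). It also checks that both reconstructions have exactly the l-mer composition S, which is why SBH alone cannot tell them apart.

```python
from collections import Counter, defaultdict

def de_bruijn(lmers):
    """Each l-mer becomes an edge from its (l-1)-prefix to its (l-1)-suffix."""
    graph = defaultdict(list)
    for lmer in lmers:
        graph[lmer[:-1]].append(lmer[1:])
    return dict(graph)  # vertices with no outgoing edge (here CA) are not keys

S = ["ATG", "TGG", "TGC", "GTG", "GGC", "GCA", "GCG", "CGT"]
graph = de_bruijn(S)
print(graph["TG"])  # ['GG', 'GC']: TG has two outgoing edges, TGG and TGC

for candidate in ("ATGGCGTGCA", "ATGCGTGGCA"):
    lmers = Counter(candidate[i:i + 3] for i in range(len(candidate) - 2))
    print(candidate, lmers == Counter(S))  # True for both candidates
```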
Euler Theorem • A graph is balanced if for every vertex the number of incoming edges equals the number of outgoing edges: in(v)=out(v) • Theorem: A connected graph is Eulerian if and only if each of its vertices is balanced.
Euler Theorem: Proof • Eulerian → balanced: an Eulerian cycle leaves v each time it enters v, so every edge entering v (incoming edge) is paired with a distinct edge leaving v (outgoing edge). Therefore in(v)=out(v) • Balanced → Eulerian ???
Algorithm for Constructing an Eulerian Cycle a. Start with an arbitrary vertex v and form an arbitrary cycle with unused edges until a dead end is reached. Since the graph is balanced this dead end is necessarily the starting point, i.e., vertex v.
Algorithm for Constructing an Eulerian Cycle (cont’d) b. If cycle from (a) above is not an Eulerian cycle, it must contain a vertex w, which has untraversed edges. Perform step (a) again, using vertex w as the starting point. Once again, we will end up in the starting vertex w.
Algorithm for Constructing an Eulerian Cycle (cont’d) c. Combine the cycles from (a) and (b) into a single cycle and iterate step (b).
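Steps (a) to (c) can be sketched as follows. This is the common stack-based formulation (Hierholzer's algorithm), which splices the sub-cycles together implicitly rather than merging them in a separate step; the function name and the small example graph are assumptions for illustration. The input is assumed to be balanced and connected.

```python
def eulerian_cycle(graph, start):
    """graph: dict mapping each vertex to a list of successor vertices."""
    unused = {v: list(ws) for v, ws in graph.items()}  # copy; input is not modified
    stack, cycle = [start], []
    while stack:
        v = stack[-1]
        if unused.get(v):
            # Step (a): keep walking along unused edges.
            stack.append(unused[v].pop())
        else:
            # Dead end at v: all of its edges are used, so emit it.
            # Vertices left on the stack with unused edges play the role of w in step (b).
            cycle.append(stack.pop())
    return cycle[::-1]

example = {"a": ["b"], "b": ["c", "d"], "c": ["a"], "d": ["b"]}
cycle = eulerian_cycle(example, "a")
print(cycle)
```

The running time is linear in the number of edges, since every edge is pushed and popped once.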
Overlap Graph: Hamiltonian Approach Each vertex represents a read from the original sequence. Vertices from repeats are connected to many others. Find a path visiting every VERTEX exactly once: Hamiltonian path problem [Figure: sequence with three copies of a repeat and its overlap graph]
Overlap Graph: Eulerian Approach Placing each repeat edge together gives a clear progression of the path through the entire sequence. Find a path visiting every EDGE exactly once: Eulerian path problem [Figure: the three repeat copies glued into a single repeat edge]
Multiple Repeats The graph can be easily constructed with any number of repeats [Figure: sequence with two copies each of Repeat1 and Repeat2]
Repeat Graph (a) DNA sequence with a triple repeat R; (b) the layout graph; (c) construction of the de Bruijn graph by gluing repeats; (d) de Bruijn graph. Pevzner P. A. et al. PNAS 2001;98:9748-9753
Building Repeat Graph • Problem: Construct the repeat graph from a collection of reads. • Solution: Break the reads into smaller pieces.
Building Repeat Graph • Reads are constructed from an original sequence in lengths that allow biologists a high level of certainty. • They are then broken again into k-mers
EULER Fragment Assembly Approach • Input: Reads s1, …, sN • Further subdivide reads into k-mers (k = 20) • Build repeat graph on resulting k-mers • Each read is a path in the resulting graph. • Solve Eulerian Superpath Problem. Given an Eulerian graph and a collection of paths in this graph, find an Eulerian path in this graph that contains all these paths as subpaths.
Repeat Graph Vertices correspond to ( k – 1 )-mers in each read Edges correspond to k-mers in each read Example: S = ATGGCGTGCA Reads = {ATGGC, GGCGTG, GTGCA} 3-mers = { ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT } Two Eulerian paths (visit every EDGE once): ATGCGTGGCA and ATGGCGTGCA [Figure: repeat graph on vertices AT, TG, GC, GG, GT, CA, CG]
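Reads in Repeat Graph Example: S = ATGGCGTGCA Reads = {ATGGC, GGCGTG, GTGCA} 3-mers = { ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT } Eulerian superpath: an Eulerian path that contains a set of paths (reads) as subpaths. Of the two Eulerian paths ATGCGTGGCA and ATGGCGTGCA, only ATGGCGTGCA contains every read as a subpath. [Figure: reads drawn as paths in the repeat graph on vertices AT, TG, GC, GG, GT, CA, CG]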
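The superpath condition for this example can be checked with a short sketch. It does not solve the Eulerian Superpath Problem; it only tests whether a given candidate path contains every read path as a contiguous subpath. All function names are hypothetical.

```python
def kmer_path(seq, k):
    """Sequence of (k-1)-mer vertices that seq visits in the repeat graph."""
    return [seq[i:i + k - 1] for i in range(len(seq) - k + 2)]

def contains_subpath(path, sub):
    """True if sub occurs in path as a contiguous run of vertices."""
    n = len(sub)
    return any(path[i:i + n] == sub for i in range(len(path) - n + 1))

def is_superpath(candidate, reads, k):
    path = kmer_path(candidate, k)
    return all(contains_subpath(path, kmer_path(r, k)) for r in reads)

reads = ["ATGGC", "GGCGTG", "GTGCA"]
print(is_superpath("ATGGCGTGCA", reads, 3))  # True
print(is_superpath("ATGCGTGGCA", reads, 3))  # False: read ATGGC is not a subpath
```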
Additional challenges in EULER Approach • Errors in reads • Reverse-complement of DNA string • Using mate-pair information to simplify the repeat graph. • Multiplicities of edges generally unknown (Copy number problem).
Sequencing Errors • If an error exists in one of the 20-mer reads, the error will be perpetuated among all of the smaller pieces broken from that read.
Sequencing Errors • However, an error will not be present in the other instances of the 20-mer read. “Consensus first” approach • Let T = {all l-tuples appearing in > M reads} • A string s is called a T-string if all its l-tuples belong to T. • Spectral Alignment Problem. Given a string s and a spectrum T, find the minimum number of mutations in s that transform s into a T-string.
Sequencing Errors • Solving Spectral Alignment Problem attempts to eliminate most point mutation errors before reconstructing the original sequence. • Not perfect!
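The set T and the T-string test can be sketched directly from the definitions above. The single-substitution search at the end is a brute-force stand-in for illustration only; it is not the spectral alignment algorithm used by EULER, and all names and the toy reads are assumptions.

```python
from collections import Counter

def solid_tuples(reads, l, M):
    """T = all l-tuples appearing in more than M reads."""
    counts = Counter()
    for read in reads:
        # A set, so that each read contributes at most once per l-tuple.
        counts.update({read[i:i + l] for i in range(len(read) - l + 1)})
    return {t for t, c in counts.items() if c > M}

def is_t_string(s, T, l):
    return all(s[i:i + l] in T for i in range(len(s) - l + 1))

def fix_one_mutation(s, T, l):
    """Try every single substitution; return the first T-string found, else None."""
    for i in range(len(s)):
        for base in "ACGT":
            if base != s[i]:
                t = s[:i] + base + s[i + 1:]
                if is_t_string(t, T, l):
                    return t
    return None

reads = ["ATGGC", "ATGGC", "ATGGC", "ATGAC"]  # last read carries one error
T = solid_tuples(reads, l=3, M=1)
print(is_t_string("ATGAC", T, 3))       # False
print(fix_one_mutation("ATGAC", T, 3))  # ATGGC
```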
Forward and Reverse Complements We obtain reads from both strands of DNA. Do not know strand of origin. s = CAGT s’ = ACTG (reverse complement) [Figure: double-stranded DNA, one strand running 5’ to 3’ and the other 3’ to 5’]
Forward and Reverse Complements In the EULER assembler, include the reverse complement of each read. “assume that S contains a complement of every read and that the de Bruijn graph can be partitioned into two subgraphs (the “canonical” one and its reverse complement)” Alternative approaches use bidirected graphs. [Figure: double-stranded DNA, one strand running 5’ to 3’ and the other 3’ to 5’]
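A small sketch of the reverse complement and of adding the reverse complement of every read to the input, assuming reads over the alphabet A, C, G, T only (function names are hypothetical):

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(s):
    """Complement each base, then reverse the string."""
    return s.translate(COMPLEMENT)[::-1]

def with_reverse_complements(reads):
    """Return the reads followed by the reverse complement of each read."""
    return list(reads) + [reverse_complement(r) for r in reads]

print(reverse_complement("CAGT"))  # ACTG
print(with_reverse_complements(["CAGT", "AAC"]))
```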
Using Mate-Pair Information Repeats and other ambiguities lead to tangles in the repeat graph. Do the paths pair up as 1 with 3 and 2 with 4, OR as 1 with 4 and 2 with 3? [Figure: a repeat v1 … vn and a system of paths 1, 2, 3, 4 overlapping with this repeat]
Using Mate-Pair Information Mate-pair (r1, r2) gives a pair of positions in G. Find path P in G from r1 to r2. Let l(r1, r2) be the length of the mate pair and d(r1, r2) the distance from r1 to r2 along P. If there is a unique path P with d(r1, r2) ≈ l(r1, r2), then use P as a “long read” in the superpath algorithm [Figure: mate pair r1, r2 placed on the tangle formed by paths 1, 2, 3, 4]
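One way to picture the mate-pair test is a bounded depth-first search that enumerates paths from r1 to r2 and keeps those whose length is close to the mate-pair length. This is only a sketch under stated assumptions: edges carry positive lengths, `tol` is a hypothetical tolerance standing in for "≈", and the toy graph is invented for illustration. It is not the procedure used in EULER.

```python
def paths_matching_distance(graph, src, dst, target, tol):
    """graph: dict u -> list of (v, edge_length), all lengths positive.
    Return every path from src to dst whose length is within tol of target."""
    found = []

    def dfs(v, length, path):
        if length > target + tol:  # too long already; also stops walks around cycles
            return
        if v == dst and abs(length - target) <= tol:
            found.append(path)
        for w, edge_len in graph.get(v, []):
            dfs(w, length + edge_len, path + [w])

    dfs(src, 0, [src])
    return found

toy = {
    "r1": [("a", 100), ("b", 1000)],
    "a": [("r2", 100)],
    "b": [("r2", 1000)],
}
candidates = paths_matching_distance(toy, "r1", "r2", target=200, tol=50)
if len(candidates) == 1:
    print("use as long read:", candidates[0])  # ['r1', 'a', 'r2']
```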
Using Mate-Pair Information Scaffolding
Copy number problem Let d(v) = indegree – outdegree Balanced graph: d(v) = 0 for all v. Goal: Introduce multiplicities on edges so that graph is balanced.
Copy number problem Goal: Introduce multiplicities on edges so that graph is balanced. Use as few extra edges as possible. Balance each vertex by adding edge multiplicities Assign flow f(e) to each edge such that d(v) = 0 for all vertices.
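The balance condition can be illustrated with a small sketch that computes d(v) for a given assignment of edge multiplicities f(e). The toy graph (a collapsed repeat edge R1 → R2 that the genome traverses twice) and the function name are assumptions for illustration; the sketch checks a proposed assignment and does not search for a minimal one.

```python
from collections import Counter

def imbalance(multiplicity):
    """multiplicity: dict (u, v) -> f(e). Return d(v) = indegree - outdegree
    for every vertex where it is nonzero."""
    d = Counter()
    for (u, v), f in multiplicity.items():
        d[u] -= f
        d[v] += f
    return {v: x for v, x in d.items() if x != 0}

# Every edge starts with multiplicity 1.
f = {("A", "R1"): 1, ("R1", "R2"): 1, ("R2", "B"): 1,
     ("B", "R1"): 1, ("R2", "C"): 1, ("C", "A"): 1}
print(imbalance(f))   # {'R1': 1, 'R2': -1}

f[("R1", "R2")] = 2   # the repeat edge is used twice
print(imbalance(f))   # {}
```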
Copy number problem Let d(v) = indegree – outdegree Balanced graph: d(v) = 0 for all v. Graph G = (V, E, w). Weights w(e) = 1 for all e. Copy Number Problem (Pevzner & Tang 2001): For an edge e in G, find a flow minimizing the multiplicity f(e) of e.
Copy number problem Copy Number Problem (Pevzner & Tang 2001): For an edge e in G, find a flow minimizing the multiplicity f(e) of e. Min-flow Max-cut Theorem: For a directed acyclic graph G = (V, E, w) with lower capacity bounds: min flow from v to w = capacity of the maximum cut separating v from w
Copy number problem Copy Number Problem (Pevzner & Tang 2001): For an edge e in G, find a flow minimizing the multiplicity f(e) of e. Min-cost circulation (See Myers 2005): Assign cost c(e) = 1 to each edge. min Σc(e) f(e) such that f(e) ≥ w(e) for all e. d(v) = 0 for all vertices.
Next-generation sequence platforms • 454 • http://www.454.com/enabling-technology/index.asp • Illumina • http://www.illumina.com/pages.ilmn?ID=203 • ABI Solid • solid.appliedbiosystems.com
Polony sequencing—Assembly ? • Resulting reads are likely to look different than Sanger reads: • Short (currently 100 to 200 bp) • Low error rates, except in homopolymeric runs (AAA…, CCC…, etc) • Currently, not known how to do paired reads on a chip. Maybe very soon!
Nanopore Sequencing http://www.mcb.harvard.edu/branton/index.htm
Nanopore Sequencing—Assembly • Resulting reads are likely to look different than Sanger reads: • Long (perhaps 10,000bp-1,000,000bp) • High error rate (perhaps 10% – 30%) • Two colors? • A/ CTG • AT/ CG • AG/ CT • How can we assemble under such conditions?
Some future directions for sequencing 1. Personalized genome sequencing • Find your ~1,000,000 single nucleotide polymorphisms (SNPs) • Find your rearrangements • Goals: • Link genome with phenotype • Provide personalized diet and medicine • (???) designer babies, big-brother insurance companies • Timeline: • Inexpensive sequencing: 2010-2015 • Genotype–phenotype association: 2010-??? • Personalized drugs: 2015-???
Some future directions for sequencing 2. Environmental sequencing • Find your flora: organisms living in your body • External organs: skin, mucous membranes • Gut, mouth, etc. • Normal flora: >200 species, >trillions of individuals • Flora–disease, flora–non-optimal health associations • Timeline: • Inexpensive research sequencing: today • Research & associations within next 10 years • Personalized sequencing 2015+ • Find diversity of organisms living in different environments • Hard to isolate • Assembly of all organisms at once