1 / 39

DNA Fragment Assembly

DNA Fragment Assembly. CIS 667 Spring 2004 February 18. Objectives. The problem: DNA Fragment Assembly The ideal case The complications Models The Shortest Common Superstring Reconstruction Multicontig A greedy algorithm Heuristics. The Problem.

jewell
Télécharger la présentation

DNA Fragment Assembly

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. DNA Fragment Assembly CIS 667 Spring 2004 February 18

  2. Objectives • The problem: DNA Fragment Assembly • The ideal case • The complications • Models • The Shortest Common Superstring • Reconstruction • Multicontig • A greedy algorithm • Heuristics

  3. The Problem • Assumption: We know the length of the target sequence approximately • The problem: Given a set of fragments from DNA, we want deduce the whole sequence of the DNA. • We determine only one of the strands of the original molecule

  4. The ideal case Input: • The set of fragments: ACCGT CGTGC TTAC TACCGT • Total length 10bp Output: _ _ A C C G T _ _ _ _ _ _ C G T G C T T A C _ _ _ _ _ _ T A C C G T _ _ T T A C C G T G C (consensus by majority of votes)

  5. Complications • real problem instance is very large • errors • substitutions • insertions • deletions • chimeras • unknown orientation of the fragments • repeated regions • causes ambiguity in sequencing • lack of coverage • causes gaps

  6. Errors: Substitution Input: • The set of fragments: ACCGT CGTGC TTAC TGCCGT  substitution • Total length 10bp Output: _ _ A C C G T _ _ _ _ _ _ C G T G C T T A C _ _ _ _ _ _ T G C C G T _ _ T T A C C G T G C (consensus by majority of votes)

  7. Errors: Insertion Input: • The set of fragments: ACCGT CAGTGC  insertion TTAC TACCGT • Total length 10bp Output: _ _ A C C * G T _ _ _ _ _ _ C A G T G C T T A C _ * _ _ _ _ _ T A C C * G T _ _ T T A C C * G T G C (consensus by majority of votes)

  8. Errors: Chimeras Input: • The set of fragments: ACCGT, CGTGC, TTAC, TACCGT, TTATGC • Total length 10bp Output: _ _ A C C G T _ _ _ _ _ _ C G T G C T T A C _ _ _ _ _ _ T A C C G T _ _ T T A C C G T G C (consensus) T T A _ _ _ T G C A chimera arises when two regular fragments from distinct parts of the target molecule join end-to end Remedy: recognize them before use!

  9. Repeated Regions • Unknown orientation with no errors • Unknown orientation with errors • Repeated regions causes ambiguity P X Q X R X S P X R X Q X S

  10. Direct repeat • Direct repeat • More complex are inverted repeat • repeated regions in opposite strands P X Q Y R X S Y P X S Y R X Q Y

  11. Lack of coverage • causes formation of gaps • compute the mean coverage • add up all the fragments and divide by the target length • insufficient coverage is covered by sampling more fragments • How many fragments do I need? • Assume • all fragments have the same length • let t be the safe overlap of at least t bases • n is the number of fragments • T is the target length Apparent contigs: p = n e –n(l-t)/T

  12. Shortest Common Superstring Input: A collection F of strings Output: A shortest possible string S |  fF, S is a superstring of f. Example: F={ATG, TGC, GCC} S= ATGCC Question: Is it the shortest? Observe: u=ATG and v=GCC overlap in G and TGC is a substring

  13. Shortest Common Superstring Is it a good problem? Advantages: • The problem finds the PERFECT superstring • Good for most ideal cases Disadvantages: • the problem does not deal with errors • good only in some ideal cases • in presence of no errors and known orientation, it fails in presence of repeat • repeated identical copies get absorbed in the search of the SHORTEST superstring and produces an assembly of uneven coverage • It does not consider lack of coverage and size of the target • NP-hard

  14. Reconstruction Objective:We want to consider errors and unknown orientation Substring Edit Distance ds(a,b) = min dss(b)(a,s) • one unit is charged for insertion, deletion, substitution • no charges for deletion in the extremity of 2nd sequence Example u=CGATGT v=AACTAATGTGC _ _ C G A * T G T _ _ A A C T A A T G T G C ds(u,v) = 2 • A string f is an approximate substring of S at error level  (between 0, 1) whends(f,S)  |f|

  15. Reconstruction Input: A collection F of strings, an error tolerance  with 01 Output: A shortest possible string S |  fF, we have min(ds(f, S), ds(f ,S))  |f| where f is the reverse complement Advantage: • takes into account errors and unknown orientation Disadvantages: • Is an NP-hard problem • It does not model repeats • It does not consider lack of coverage and size of the target

  16. Multicontig Objective: We want to consider internal linkage No special assumptions except: • for known orientation, fragment and reverse complement are not both present in the collection. • We want to have good linkage (overlap between fragments) • An overlap is a link if it is not (properly) contained in a bigger fragment • The smallest size of a link in a layout is called a weakest link • A layout is a t-contig if its weakest link is at least size t • We partition F into the minimum number of collections which admit a t-contig

  17. Multicontig Idea: Let's partition F in the minimum number of t-contigs! Example: F={GTAG, TAATG, TGTAA} for t=3 F1={TAATG, TGTAA} and F2={GTAG} for t=2 we have two solutions • F1={TAATG, TGTAA} and F2={GTAG} • F1={TAATG,TGTAA} and F2={GTAG} for t=1 we have the desired solution (the minimum) F1={TAATG, TGTAA, GTAG} For errors, we use the consensus of the multi-alignment and insist that the edit distance of the fragments be small

  18. Multicontig Input: A collection F of strings, and an integer t0 and an error tolerance  with 01 Output: A partition of F in the minimum number of subcollections Ci. 1i  k | every Ci admits a t-contig with an -consensus Advantage: • takes into account errors and unknown orientation • take into account internal linkage of the fragments • the answer is formed by several contigs Disadvantages: • Is an NP-hard problem even in the simplest case of no errors and known orientation • It contains as a special case finding a Hamiltonian path in a restricted class of graphs • It has no provision to use information on the approximate size of the target

  19. TACGG Idea: Why don't we search for a path in the overlap multigraph that covers all the vertices? 1 1 2 CTAAAG GCCC 1 1 1 GGACG Overlap Multigraph t-Overlap: suffix(a,t) = prefix(b,t) or (at)b = a(tb) or |a|-ta = b|b|-t

  20. Idea: Why don't we search for a path in the overlap multigraph that covers all the vertices? 1 Overlap Multigraph t-Overlap: suffix(a,t) = prefix(b,t) or (at)b = a(tb) or |a|-ta = b|b|-t TACGG 1 2 CTAAAG GCCC 1 1 1 GGACAG

  21. TACGG Idea: Why don't we search for a path in the overlap multigraph that covers all the vertices? 1 1 2 CTAAAG GCCC 1 1 1 GGACAG Overlap Multigraph t-Overlap: suffix(a,t) = prefix(b,t) or (at)b = a(tb) or |a|-ta = b|b|-t

  22. Idea: Why don't we search for a path in the overlap multigraph that covers all the vertices? 1 Overlap Multigraph t-Overlap: suffix(a,t) = prefix(b,t) or (at)b = a(tb) or |a|-ta = b|b|-t TACGG 1 2 CTAAAG GCCC 1 1 1 GGACAG

  23. Idea: Why don't we search for a path in the overlap multigraph that covers all the vertices? 1 Overlap Multigraph t-Overlap: suffix(a,t) = prefix(b,t) or (at)b = a(tb) or |a|-ta = b|b|-t TACGG 1 2 CTAAAG GCCC 1 1 1 GGACAG

  24. Idea: Why don't we search for a path in the overlap multigraph that covers all the vertices? 1 Overlap Multigraph t-Overlap: suffix(a,t) = prefix(b,t) or (at)b = a(tb) or |a|-ta = b|b|-t TACGG 1 2 CTAAAG GCCC 1 1 1 GGACAG

  25. Idea: Why don't we search for a path in the overlap multigraph that covers all the vertices? 1 Overlap Multigraph t-Overlap: suffix(a,t) = prefix(b,t) or (at)b = a(tb) or |a|-ta = b|b|-t TACGG 1 2 CTAAAG GCCC 1 1 1 GGACAG

  26. Idea: Why don't we search for a path in the overlap multigraph that covers all the vertices? 1 Overlap Multigraph t-Overlap: suffix(a,t) = prefix(b,t) or (at)b = a(tb) or |a|-ta = b|b|-t TACGG 1 2 CTAAAG GCCC 1 1 1 GGACAG

  27. Theoretical results Theorem: the total length of A (set of fragments) is ||A|| = w(P) + |S(P)| where • ||A||=aA |a| • w(P) is the weight of the path P • |S(P)| is the length of the superstring derived from P. • to convince yourself, read the proof from the book Other theoretical results: Looking at the shortest common superstring is the same as looking for the Hamiltonian path of maximum weight in a directed multigraph.

  28. The Greedy Methodology NP-hard problems cannot be solved in reasonable time, but we can look for approximate solutions in reasonable time To apply a greedy methodology: • the problem must show optimal substructure • A problem exhibits optimal substructure if the optimal solution to a problem contains within it optimal solutions to other problems • the optimal solution is reached by taking the best "local" choice

  29. TACGA 1 2 ACCC CTAAAG 1 1 1 GACA Overlap Graph • The Greedy Algorithm • input: weighted di-graph OG(F) with n vertices • output: Hamiltonian path in OG(F) • //Initialize • for i 1 to n do • in[i]=0 //how many selected edges enter i • out[i]=0 //how many selected edges exit i • MakeSet(i) • //Process • Sort the edges by weight, heaviest first • for each edge (f,g) in this order do • //test for acceptance • ifin[g] = 0 andout[f] = 0 and FindSet(f) ≠FindSet(g) • select (f,g) • in[g]1 • out[f] 1 • Union(FindSet(f), Findset(g)) • if there is only one component break • return selected edges An overlap graph has only edges with maximum weight

  30. A graph where "greedy" fails • F={GCAAAG, AGTA,TACGA} We order the edges by weight (AGAT, GCAAAG) = 3 (GCAAAG, AGTA) =2 (AGTA, TACGA) = 2 The algorithm will choose first (AGAT, GCAAAG) = 3 and then is forced to select an edge with weight 0 to complete the path. Instead the solution should be (GCAAAG, AGTA) =2 (AGTA, TACGA) = 2 TACGA 2 GCAAAG 3 2 AGTA

  31. Observations Local optimal decisions do not always work. Can we do any better? Use some heuristics. Issues: Scoring Coverage Linkage

  32. Heuristics • Scoring • Uniformity is good, variability is bad. • Compute the entropy of a column • the entropy is the measure of the chaos in a column. There are 5 possible characters, A, T, C, G, space E=-cpclog pc E=0 if pc=1 for a character; E=log5 if each pc=1/5 • To measure the uniformity we want a low entropy per column • Coverage • minimun, maximum or medium coverage • if the coverage reaches 0 for a column I, we do not have a connected layout • if we have more columns with zero coverage, any permutation of the intervening regions (the contig) is acceptable • Coverage gives confidence to the consensus • Linkage • High coverage with no links is not good. Overlap is required.

  33. More Observations Local optimal decisions do not always work. Can we do any better? Use some heuristics. • Assembly in practice consists of: • Finding overlaps • Building Layout • Computing the consensus Advantages:  • We treat each problem separately. Disadvantages:  • It becomes difficult to understand the relationship between the input and the final output

  34. Heuristics • Finding overlaps • use a dynamic programming approach with a score system such as • 1 for matches • -1 for mismatches • -2 for spaces • Do not charge for space after the first sequence and before the second one.

  35. Heuristics Ordering Fragments • there is no algorithm simple and general enough Considerations: • Use the set DF=F F • If f=uv g=wx then g =wx f = vu • if f is approximately the same as the beginning of g we can expect that whatever is the criterion used to assess the similarity between f and g, the same criterion will apply to their reverse complement

  36. Finding overlaps • Finding a good ordering of overlapping means finding a direct path in the overlap graph • Both strands are constructed simultaneously • Contained fragment are not essential in the path • A disconnected graph indicates lack of coverage • The presence of cycles indicates repeats • Unusual high coverage indicate possible repeats • The presence of reverse complement cycles indicates inverted repeats

  37. Alignment and Consensus • Use the minimal sum of the distances Suppose we have f  g  h CATAGTC TAACTAT AGACTATCC Two semiglobal aligments for f and g are: C A TAG T C_ _ _C ATA GT C_ _ _ _ _ TAA _ C TA T _ _TA_ A CT A T C ATA GT C_ _ _ _ _TA _ A CT A T _ _ _A G A CT A T C C CATA GA C T A T C C • ds(f, S) = 1 ds(g, S) = 1 ds(h, S) = 0 if we use the second aligment and ds(f, S) +ds(g, S) +ds(h, S) = 2 • ds(f, S) = 1 ds(g, S) = 2 ds(h, S) = 0 if we use the first aligment and A is chosen for column 6, ds(f, S) +ds(g, S) +ds(h, S) = 3

  38. A Linked List of Bases • Sometimes we know what is best only later. Is there a structure that helps us? • Use a Linked List of Bases • matches bases are unified in one node • unmatched bases are left separate • Technique: Traverse this graph in topological order. G  T C  A  T  A C  T  A  T A

  39. Conclusions • The models fail to address all the issues involved in the problem • The effective real problem is NP-hard • Approximation gives us some help but fails in some cases • Heuristics helps and the problem is broken in 3 smaller problems: • finding overlap • building layout and • computing the consensus • Are we sure there is nothing else to do? • We will look next week at the smaller problem of comparing only two sequences instead of many. Will we find something better?

More Related