Fragment assembly of DNA

Presentation Transcript


  1. Fragment assembly of DNA A typical approach to sequencing long DNA molecules is to sample and then sequence fragments from them.

  2. Fragment assembly of DNA • Biological background • Models • Algorithms • Heuristics ® Pei-Jie Wu

  3. Biological background • Problem as a puzzle • We do not know which letter from the set {A, C, G, T} is written on each card, but we do know that cards in the same position of opposite strands form a complementary pair. • Our goal is to obtain the letters using certain hints, which are (approximate) substrings of the rows.

  4. Biological background • Target: the long sequence to reconstruct. • Fragment vs. subsequence • Shotgun method: based on fragment overlap • Fragment assembly: putting a collection of fragments together

  5. Biological background--The ideal case • Case: p.106 • Align the input set, ignoring spaces at the extremities • Overlaps: the end part of one fragment is similar to the beginning of another • Consensus sequence based on majority vote
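
The majority-vote step can be sketched in a few lines. This is a minimal illustration, not the method from the referenced text: the function name `consensus`, the use of a space for padding at the extremities, and the use of `-` for an internal gap are all assumptions made for the example.

```python
from collections import Counter

def consensus(layout):
    """Majority-vote consensus of a layout.

    `layout` is a list of rows. A space marks padding at the extremities
    (ignored in the vote); '-' marks an internal gap (it takes part in the vote).
    """
    width = max(len(row) for row in layout)
    result = []
    for i in range(width):
        # Collect the characters that actually cover column i.
        column = [row[i] for row in layout if i < len(row) and row[i] != " "]
        if column:
            result.append(Counter(column).most_common(1)[0][0])
    # Columns whose majority is a gap contribute nothing to the consensus.
    return "".join(result).replace("-", "")

layout = [
    "  ACCGT  ",
    "    CGTGC",
    "TTAC     ",
    " TACCGT  ",
]
print(consensus(layout))
```

Ties are broken by whichever character is seen first in the column, which is an arbitrary choice; a real assembler would weigh base quality as well.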

  6. Biological background--Complications • The main factors that add to the complexity of the problem are: • Errors • Unknown orientation • Repeated regions • Lack of coverage

  7. Biological background--Complications Errors • Dealing with errors usually means algorithms that require more time and space. • The simplest errors are called base call errors and comprise base substitutions, insertions, and deletions in the fragments. • Base call errors occur in practice at rates varying from 1 to 5 errors every 100 characters. • Figures 4.2, 4.3, 4.4

  8. Biological background--Complications Errors • Two other types of errors: chimeras and contamination • Chimeras arise when two regular fragments from distinct parts of the target molecule join end-to-end to form a fragment that is not a contiguous part of the target • Figure 4.5 • Solution: chimeras must be recognized as such and removed from the fragment set in a preprocessing stage. • Contamination comes from host or vector DNA • Solution: most vectors are well known, so we can screen the data before starting assembly.

  9. Biological background--Complications Unknown orientation • We generally do not know to which strand a particular fragment belongs. • We treat the input fragments as approximate substrings of the consensus sought, either as given or in reverse complement. • Figure 4.6 • Complexity: 2^n orientation combinations for n fragments
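
As an illustration of the orientation problem, the sketch below computes a reverse complement and enumerates every orientation assignment for a small fragment set. The names `reverse_complement` and `orientation_choices` are hypothetical helpers introduced for this example.

```python
from itertools import product

# Translation table for the Watson-Crick complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(fragment):
    """Complement every base, then reverse the string."""
    return fragment.translate(COMPLEMENT)[::-1]

def orientation_choices(fragments):
    """Yield every way of taking each fragment as given or reverse-complemented."""
    pairs = [(f, reverse_complement(f)) for f in fragments]
    return product(*pairs)

print(reverse_complement("ACGGT"))
print(len(list(orientation_choices(["AC", "GT", "TT"]))))
```

With n fragments the generator yields 2^n tuples, which is why trying all orientations is only feasible for very small inputs.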

  10. Biological background--Complications Repeated regions • Repeats are sequences that appear two or more times in the target molecule. • Short repeats • Longer repeats • If the level of similarity between two copies of a repeat is high enough, the differences can be mistaken for base call errors • Figure 4.7

  11. Biological background--Complications Repeated regions • Problems: • If a fragment is totally contained in a repeat, we may have several places to put it in the final alignment. When the copies are not exactly equal, we may weaken the consensus by placing a fragment in the wrong copy. • Repeats can be positioned in such a way as to render assembly inherently ambiguous. (Figures 4.8 and 4.9) • Direct repeats: repeated copies in the same strand. • Inverted repeats: repeated regions in opposite strands (Figure 4.10)

  12. Biological background--Complications Lack of coverage • Coverage at position i of the target: the number of fragments that cover this position. • Contigs: the contiguously covered regions • Figure 4.11 • Solutions: • Sampling more fragments • Directed sequencing or walking
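
A small sketch of coverage and contigs, assuming (hypothetically) that each fragment's placement on the target is already known as a `(start, length)` pair. In a real project the placements are exactly what is unknown, so this only illustrates the definitions.

```python
def coverage(target_length, placements):
    """Number of fragments covering each target position.

    `placements` is a list of (start, length) pairs, 0-based.
    """
    cov = [0] * target_length
    for start, length in placements:
        for i in range(start, min(start + length, target_length)):
            cov[i] += 1
    return cov

def contigs(cov):
    """Maximal runs of covered positions, as (start, end_exclusive) pairs."""
    runs = []
    start = None
    for i, c in enumerate(cov):
        if c > 0 and start is None:
            start = i
        elif c == 0 and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(cov)))
    return runs

cov = coverage(10, [(0, 4), (2, 3), (7, 2)])
print(cov)
print(contigs(cov))
```

This simplification treats two fragments that merely touch as one contig; a stricter definition would also require consecutive fragments to overlap.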

  13. Biological background--Alternative methods for DNA sequencing • Directed sequencing: a method that can be used to cover small remaining gaps in a shotgun project. • Problems: • It is expensive to build special primers • It is sequential rather than parallel • Sequencing by hybridization (SBH): assembling the target molecule based on many hybridization experiments with very short, fixed-length sequences called probes.

  14. Models • Shortest common superstring (SCS) • RECONSTRUCTION • MULTICONTIG • All three assume that the fragment collection is free of contamination and chimeras.

  15. Models--Shortest common superstring • Seeking the shortest superstring of a collection of given strings • PROBLEM: Shortest common superstring (SCS) • INPUT: a collection F of strings. • OUTPUT: a shortest possible string S such that for every f ∈ F, S is a superstring of f.
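
To make the SCS problem statement concrete, here is a brute-force sketch that tries every ordering of the fragments. It is exponential and only meant for toy inputs; the names `max_overlap` and `shortest_common_superstring` are assumptions for this example, not part of the original material.

```python
from itertools import permutations

def max_overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for t in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:t]):
            return t
    return 0

def shortest_common_superstring(fragments):
    """Exact SCS by exhaustive search over fragment orderings."""
    unique = list(dict.fromkeys(fragments))
    # Fragments contained in another fragment are covered for free.
    kept = [f for f in unique if not any(f != g and f in g for g in unique)]
    if not kept:
        return ""
    best = None
    for order in permutations(kept):
        s = order[0]
        for prev, nxt in zip(order, order[1:]):
            # Glue consecutive fragments at their maximal overlap.
            s += nxt[max_overlap(prev, nxt):]
        if best is None or len(s) < len(best):
            best = s
    return best

print(shortest_common_superstring(["ACT", "CTA", "AGT"]))
```

The search visits n! orderings, which is one way to see why heuristics such as the greedy approach are used in practice.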

  16. Models--Shortest common superstring • Example 4.1 • Example 4.2 • Figure 4.12 • Figure 4.13 • When the target contains repeats, a shortest superstring may contain only one copy of the repeat, which will absorb all fragments totally contained in any of the copies

  17. Models--Reconstruction • Takes into account both errors and unknown orientation • Dynamic programming sequence comparison algorithm • Use distance rather than similarity • Expression: p.116

  18. Models--Reconstruction • PROBLEM: RECONSTRUCTION • INPUT: a collection F of strings and an error tolerance ε between 0 and 1. • OUTPUT: (p.117) • Find a string S as short as possible such that, for every f ∈ F, either f or its reverse complement is an approximate substring of S at error level ε • Does not model repeats, lack of coverage, or size of target
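
The "approximate substring" test can be sketched with a dynamic programming table whose first row is all zeros, so a match may start anywhere in S. This sketch assumes unit costs for substitutions, insertions, and deletions, and assumes that "error level ε" means at most ε·|f| edit operations; the function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(fragment):
    return fragment.translate(COMPLEMENT)[::-1]

def substring_edit_distance(f, s):
    """Smallest edit distance between f and any substring of s."""
    prev = [0] * (len(s) + 1)  # an empty prefix of f matches anywhere at cost 0
    for i in range(1, len(f) + 1):
        cur = [i] + [0] * len(s)
        for j in range(1, len(s) + 1):
            cost = 0 if f[i - 1] == s[j - 1] else 1
            cur[j] = min(prev[j - 1] + cost,  # match or substitution
                         prev[j] + 1,         # character of f left unmatched
                         cur[j - 1] + 1)      # extra character of s
        prev = cur
    return min(prev)  # the match may also end anywhere in s

def is_approximate_substring(f, s, eps):
    """True if f or its reverse complement fits in s within eps * len(f) edits."""
    d = min(substring_edit_distance(f, s),
            substring_edit_distance(reverse_complement(f), s))
    return d <= eps * len(f)

print(substring_edit_distance("ACCT", "TTACGTTT"))
print(is_approximate_substring("AAAC", "CCGTTTCC", 0.0))
```

The table costs O(|f|·|S|) time per fragment, which is one reason error tolerance makes assembly more expensive.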

  19. Models--Multicontig • Involves the internal linkage of the fragments in the layout • Nonlink: an overlap for which there is a fragment that properly contains it on both sides • Weakest link: the smallest size of any link • t-contig: the weakest link of a layout is at least as large as t • Example 4.4 • Definition: p.119

  20. Algorithms • Greedy algorithm • Acyclic subgraphs (no errors and known orientation)

  21. Algorithms--Representing overlaps • The overlap multigraph OM(F) of a collection F is a directed, weighted multigraph • The set V of nodes of this structure is just F itself. • A directed edge from a to a different fragment b with weight t ≥ 0 exists if the suffix of a with t characters is a prefix of b • There may be many edges from a to b • No self-loops
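
The definition above translates almost directly into code. In this sketch the multigraph is just a list of `(a, b, t)` triples, and the function name `overlap_multigraph` is an assumption made for illustration.

```python
def overlap_multigraph(fragments):
    """All edges (a, b, t): the last t characters of a equal the first t of b.

    Weight-0 edges are included, matching t >= 0 in the definition,
    and no fragment gets an edge to itself.
    """
    edges = []
    for a in fragments:
        for b in fragments:
            if a == b:
                continue  # no self-loops
            for t in range(min(len(a), len(b)) + 1):
                # a[len(a) - t:] is the suffix of length t (empty when t == 0).
                if a[len(a) - t:] == b[:t]:
                    edges.append((a, b, t))
    return edges

for edge in overlap_multigraph(["ACT", "CTA"]):
    print(edge)
```

Because every ordered pair has at least the weight-0 edge, the graph is always complete, and several edges between the same pair appear whenever more than one suffix length matches.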

  22. Algorithms--Paths originating superstrings • Each edge e = (f, g) in the path has a certain weight t, which means that the last t bases of the tail f of e are the same as the first t bases of its head g • Figure 4.15 • Example in p.121 • Equation 4.3 • Hamiltonian paths: a path that goes through every vertex • Equation 4.4 • Minimizing |S(P)| is equivalent to maximizing w(P)
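
The link between path weight and superstring length can be checked on a small case. The sketch below builds the superstring S(P) spelled by a path and relies on the standard identity that, for a Hamiltonian path, |S(P)| equals the total fragment length minus w(P). The function name and the sample fragments are assumptions for illustration.

```python
def superstring_from_path(nodes, weights):
    """Spell the superstring of a path.

    `nodes` lists the fragments in path order; `weights[k]` is the overlap
    between nodes[k] and nodes[k + 1].
    """
    s = nodes[0]
    for prev, nxt, t in zip(nodes, nodes[1:], weights):
        if prev[len(prev) - t:] != nxt[:t]:
            raise ValueError(f"{prev!r} and {nxt!r} do not overlap by {t}")
        s += nxt[t:]  # append only the part of nxt not already present
    return s

nodes = ["TACGA", "GACA", "ACCC", "CTAAAG"]
weights = [2, 1, 1]
s = superstring_from_path(nodes, weights)
total_length = sum(len(f) for f in nodes)
print(s, len(s), total_length - sum(weights))
```

Since the total fragment length is fixed, every extra unit of path weight removes one character from the superstring, which is why the two optimization goals coincide.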

  23. Algorithms--Shortest superstrings as paths • A collection F is said to be substring-free if there are no two distinct strings a and b in F such that a is a substring of b. • THEOREM 4.1 • COROLLARY 4.1 • LEMMA 4.1 • THEOREM 4.2

  24. Algorithms--The greedy algorithm • Looking for shortest common superstrings is the same as looking for Hamiltonian paths of maximum weight in a directed multigraph. • OM(F) → OG(F) • A “greedy” attempt at computing the heaviest path. The basic idea is to continuously add the heaviest available edge

  25. Algorithms--The greedy algorithm • Three conditions we have to test before accepting an edge in our Hamiltonian path: • Edges are processed in nonincreasing order by weight • The procedure ends when we have exactly n-1 edges, or • when the accepted edges induce a connected subgraph. • Figure 4.16 • Example 4.5 • Figure 4.17
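
A minimal sketch of a greedy heaviest-edge procedure follows. It is not a transcription of the referenced figure: it assumes distinct, substring-free fragments, keeps only the heaviest edge per ordered pair, and uses the usual acceptance checks for a Hamiltonian path (the tail has no successor yet, the head has no predecessor yet, and the edge closes no cycle). All names are hypothetical.

```python
def max_overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for t in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:t]):
            return t
    return 0

def greedy_superstring(fragments):
    n = len(fragments)
    if n == 0:
        return ""
    # Heaviest edge for each ordered pair, in nonincreasing order of weight.
    edges = [(max_overlap(a, b), i, j)
             for i, a in enumerate(fragments)
             for j, b in enumerate(fragments) if i != j]
    edges.sort(key=lambda e: -e[0])

    parent = list(range(n))  # union-find, used to detect cycles

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    succ, weight, has_pred = {}, {}, set()
    for t, i, j in edges:
        if len(succ) == n - 1:
            break  # n - 1 accepted edges form a Hamiltonian path
        if i in succ or j in has_pred or find(i) == find(j):
            continue
        succ[i], weight[i] = j, t
        has_pred.add(j)
        parent[find(i)] = find(j)

    # Walk the path from the only node without a predecessor.
    cur = next(i for i in range(n) if i not in has_pred)
    s = fragments[cur]
    while cur in succ:
        s += fragments[succ[cur]][weight[cur]:]
        cur = succ[cur]
    return s

print(greedy_superstring(["TACGA", "ACCC", "CTAAAG", "GACA"]))
print(greedy_superstring(["GCC", "ATGC", "TGCAT"]))
```

The second call shows that greedy is a heuristic: it returns a superstring of length 9, while `TGCATGCC` of length 8 also contains all three fragments.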

  26. Algorithms--Acyclic subgraphs • Assembling fragments without errors and with known orientation, assuming that the fragments have been obtained from a “good sampling” of the target DNA. • “Good sampling”: the fragments cover the entire target molecule, and the collection as a whole exhibits enough linkage to guarantee a safe assembly. • Figure 4.18

  27. Algorithms--Acyclic subgraphs • The presence of repeated regions, or repeated elements, in the target string S is related to the existence of cycles in the overlap graph. • Cycles in an overlap graph are necessarily due to repeats in S. The converse is not necessarily true; that is, we may have repeats but still an acyclic overlap graph. • THEOREM 4.5 • Algorithm: topological sorting • Example 4.6 • Figures 4.19, 4.20 and 4.21
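
Topological sorting of an acyclic overlap graph can be sketched with Kahn's algorithm. The node labels and edges below are made up for illustration, and the function name `topological_order` is an assumption; the point is only that an acyclic graph yields an ordering of the fragments, while a cycle (a sign of repeats) is reported as an error.

```python
from collections import deque

def topological_order(nodes, edges):
    """Kahn's algorithm: order nodes so that every edge goes left to right."""
    indegree = {v: 0 for v in nodes}
    successors = {v: [] for v in nodes}
    for u, v in edges:
        successors[u].append(v)
        indegree[v] += 1
    queue = deque(v for v in nodes if indegree[v] == 0)
    order = []
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in successors[u]:
            indegree[v] -= 1
            if indegree[v] == 0:
                queue.append(v)
    if len(order) != len(nodes):
        raise ValueError("overlap graph has a cycle; repeats are likely")
    return order

edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d"), ("b", "d")]
print(topological_order(["a", "b", "c", "d"], edges))
```

When the graph contains a Hamiltonian path, as in this example, the topological order is unique and is that path.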

  28. Heuristics • None of the formalisms proposed for fragment assembly is entirely adequate • Fragment assembly can be viewed as a multiple alignment problem with some additional features: • Each fragment can participate with either the direct or the reverse-complemented sequence. • The sequences themselves are usually much shorter than the alignment itself.

  29. Heuristics • Three criteria according to the second feature: • Scoring • Entropy is a quantity that is defined on a group of relative frequencies; it is low when one of these frequencies stands out from the others, and high when they are all more or less equal • The lower the entropy, the better • Coverage • A fragment covers a column i if it participates in this column either with a character or with an internal space. • Linkage • The way individual fragments are linked in the layout is another determinant of layout quality. • Figure 4.22
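
The entropy score of a single alignment column can be sketched as below, assuming the usual Shannon entropy with base-2 logarithms over the relative frequencies of the characters in the column. The function name `column_entropy` is hypothetical.

```python
from collections import Counter
from math import log2

def column_entropy(column):
    """Shannon entropy (in bits) of the character frequencies in a column."""
    counts = Counter(column)
    total = len(column)
    return -sum((c / total) * log2(c / total) for c in counts.values())

print(column_entropy("AAAA"))  # one frequency stands out completely
print(column_entropy("AAAC"))  # mostly one character
print(column_entropy("ACGT"))  # all frequencies equal
```

A column where every fragment agrees scores zero, and disagreement raises the score, so summing the column entropies gives one number that is lower for better layouts.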

  30. Heuristics--Assembly in practice • Practical implementations often divide the whole problem into three phases: • Finding overlaps • Building a layout • Computing the consensus

  31. Heuristics--Assembly in practice Finding overlaps • The first step in any assembly problem is fragment overlap detection. • Consider reverse complements as well • Consider fragments entirely contained in other fragments • Recall Section 3.2.3 • Figure 4.23

  32. Heuristics--Assembly in practice Ordering fragments • Finding a good ordering of fragments in a contig • There is no algorithm that is simple and general enough • There are four issues to keep in mind when building paths: • Every path has a corresponding complement path • It is not necessary to include contained fragments • Cycles usually indicate the presence of repeats • Unbalanced coverage may be related to repeats as well (see Figure 4.13)

  33. Heuristics--Assembly in practice Alignment and consensus • Building a layout from a path in an overlap graph • Two techniques related to alignment construction: • The first one helps in building a good layout from a path in the presence of errors. • Example 4.7 • Implementation: Figure 4.24 • The second one focuses on locally improving an already constructed layout • Example 4.8 in Figure 4.25 • Implementation: sum-of-pairs scoring scheme
