A New Approach to Fragment Assembly in DNA Sequencing

Fei Wu, April 24, 2006


Presentation Transcript


  1. A New Approach to Fragment Assembly in DNA Sequencing • Fei Wu • April 24, 2006

  2. Preface • Introduce the author • The background of the paper • The history of DNA Sequencing

  3. Traditional DNA Sequencing • Shear (shake) the DNA into millions of small fragments • Read 500–700 nucleotides at a time from the small fragments (Sanger method)

  4. Fragment Assembly • Computational challenge: assemble individual short fragments (reads) into a single genomic sequence ("superstring") • Until the late 1990s, shotgun fragment assembly of the human genome was viewed as an intractable problem

  5. Shortest Superstring Problem • Problem: Given a set of strings, find a shortest string that contains all of them • Input: Strings s1, s2, …, sn • Output: A string s that contains all strings s1, s2, …, sn as substrings, such that the length of s is minimized • Complexity: NP-complete • Note: this formulation does not take into account sequencing errors

  6. Reducing SSP to TSP • Define overlap(si, sj) as the length of the longest prefix of sj that matches a suffix of si. Example string (repeated on the slide to illustrate overlaps): aaaggcatcaaatctaaaggcatcaaa • Construct a graph with n vertices representing the n strings s1, s2, …, sn. • Insert edges of length overlap(si, sj) between vertices si and sj. • Find the path that visits every vertex exactly once and maximizes the total overlap (equivalently, the shortest such path when each edge is weighted by the negated overlap). This is the Traveling Salesman Problem (TSP), which is also NP-complete.
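
The overlap definition and the path search on slide 6 can be sketched in a few lines of Python. This is only an illustration, not code from the presentation: the function names `overlap` and `shortest_superstring` and the three example reads are made up here, the search simply tries every vertex ordering (so it is usable only for a handful of strings), and it assumes that no read is a substring of another.

```python
from itertools import permutations


def overlap(si: str, sj: str) -> int:
    """Length of the longest prefix of sj that matches a suffix of si."""
    for k in range(min(len(si), len(sj)), 0, -1):
        if si.endswith(sj[:k]):
            return k
    return 0


def shortest_superstring(strings):
    """Try every ordering of the vertices (every Hamiltonian path in the
    overlap graph) and keep the shortest merged string.

    Assumption: no input string is a substring of another one.
    Running time grows factorially, so this is for tiny inputs only.
    """
    best = None
    for order in permutations(strings):
        merged = order[0]
        for prev, nxt in zip(order, order[1:]):
            # Append only the part of nxt not already covered by the overlap.
            merged += nxt[overlap(prev, nxt):]
        if best is None or len(merged) < len(best):
            best = merged
    return best


if __name__ == "__main__":
    reads = ["AAAGGC", "GGCATC", "ATCAAA"]  # hypothetical reads
    print(overlap("AAAGGC", "GGCATC"))
    print(shortest_superstring(reads))
```

Maximizing the summed overlap along the path is the same as minimizing the length of the merged string, which is why the search compares merged lengths directly.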

  7. De Bruijn graph • Properties (for the n-dimensional de Bruijn graph over m symbols): if n = 1, the condition for any two vertices to form an edge holds vacuously, so all the vertices are connected, forming a total of m^2 edges • Each vertex has exactly m incoming and m outgoing edges
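
The two properties on slide 7 can be checked on small cases. The sketch below assumes the usual construction, in which a vertex is a string of n symbols and an edge goes from u to the string obtained by dropping the first symbol of u and appending any symbol; the function names `de_bruijn_graph` and `degree_counts` are illustrative, not taken from the presentation.

```python
from collections import Counter
from itertools import product


def de_bruijn_graph(alphabet, n):
    """Vertices and directed edges of the n-dimensional de Bruijn graph.

    There is an edge u -> v whenever v equals u shifted left by one
    symbol with an arbitrary symbol of the alphabet appended.
    """
    vertices = ["".join(p) for p in product(alphabet, repeat=n)]
    edges = [(u, u[1:] + a) for u in vertices for a in alphabet]
    return vertices, edges


def degree_counts(edges):
    """Return (in-degree, out-degree) counters for a list of edges."""
    out_deg = Counter(u for u, _ in edges)
    in_deg = Counter(v for _, v in edges)
    return in_deg, out_deg


if __name__ == "__main__":
    m_symbols = "ACGT"
    # n = 1: every ordered pair of symbols is an edge, m^2 in total.
    _, edges = de_bruijn_graph(m_symbols, 1)
    print(len(edges) == len(m_symbols) ** 2)
    # n = 3: every vertex has exactly m incoming and m outgoing edges.
    vertices, edges = de_bruijn_graph(m_symbols, 3)
    in_deg, out_deg = degree_counts(edges)
    print(all(in_deg[v] == 4 and out_deg[v] == 4 for v in vertices))
```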

  8. Sequencing by Hybridization

  9. l-mer (l-tuple) composition • Spectrum(s, l): the unordered multiset of all (n – l + 1) l-mers in a string s of length n • The order of individual elements in Spectrum(s, l) does not matter • For s = TATGGTGC, all of the following are equivalent representations of Spectrum(s, 3): {TAT, ATG, TGG, GGT, GTG, TGC} {ATG, GGT, GTG, TAT, TGC, TGG} {TGG, TGC, TAT, GTG, GGT, ATG}
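
As a small illustration of the definition on slide 9, the spectrum can be computed with a multiset type, so that element order is irrelevant by construction. The function name `spectrum` is chosen here for the example; using `collections.Counter` as the multiset is an implementation choice, not something the slides prescribe.

```python
from collections import Counter


def spectrum(s: str, l: int) -> Counter:
    """Multiset of all (n - l + 1) l-mers of s, where n = len(s)."""
    return Counter(s[i:i + l] for i in range(len(s) - l + 1))


if __name__ == "__main__":
    spec = spectrum("TATGGTGC", 3)
    # The three listings on the slide describe the same multiset.
    print(spec == Counter(["TAT", "ATG", "TGG", "GGT", "GTG", "TGC"]))
    print(spec == Counter(["TGG", "TGC", "TAT", "GTG", "GGT", "ATG"]))
```

A `Counter` also records multiplicities, which matters for strings in which the same l-mer occurs more than once.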

  10. SBH: Eulerian Path Approach • S = { ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT } • Vertices correspond to (l – 1)-mers: { AT, TG, GC, GG, GT, CA, CG } • Edges correspond to l-mers from S • The path visits every EDGE exactly once (figure: directed graph on the vertices listed above)

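  11. The graph on the vertices { AT, TG, GC, GG, GT, CA, CG } corresponds to two different paths: • AT → TG → GG → GC → CG → GT → TG → GC → CA, which spells ATGGCGTGCA • AT → TG → GC → CG → GT → TG → GG → GC → CA, which spells ATGCGTGGCA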
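
The ambiguity on slide 11 can be reproduced with a short backtracking search. This is a sketch under stated assumptions rather than the EULER implementation: the function name `eulerian_spellings` is invented for the example, the input is taken to be the 3-mer spectrum of ATGGCGTGCA, and the search enumerates all Eulerian paths by brute force, which is only practical for very small graphs.

```python
from collections import defaultdict


def eulerian_spellings(lmers):
    """Return every string spelled by an Eulerian path in the graph whose
    vertices are (l-1)-mers and whose edges are the given l-mers."""
    lmers = list(lmers)
    out_edges = defaultdict(list)  # prefix vertex -> indices of l-mers leaving it
    balance = defaultdict(int)     # out-degree minus in-degree per vertex
    for idx, lmer in enumerate(lmers):
        out_edges[lmer[:-1]].append(idx)
        balance[lmer[:-1]] += 1
        balance[lmer[1:]] -= 1
    # A path must start at the vertex with one surplus outgoing edge, if
    # there is one; for an Eulerian cycle any vertex with edges will do.
    starts = [v for v, b in balance.items() if b == 1] or list(out_edges)
    used = [False] * len(lmers)
    results = set()

    def extend(vertex, text, n_used):
        if n_used == len(lmers):
            results.add(text)
            return
        for idx in out_edges.get(vertex, []):
            if not used[idx]:
                used[idx] = True
                extend(lmers[idx][1:], text + lmers[idx][-1], n_used + 1)
                used[idx] = False

    for start in starts:
        extend(start, start, 0)
    return results


if __name__ == "__main__":
    s = "ATGGCGTGCA"
    lmers = [s[i:i + 3] for i in range(len(s) - 2)]
    print(sorted(eulerian_spellings(lmers)))
    # ['ATGCGTGGCA', 'ATGGCGTGCA']
```

Both results have the same 3-mer spectrum, so the spectrum alone cannot tell which of the two sequences was the original.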

  12. Error Correction or Data Corruption • The EULER algorithm sometimes introduces errors. • It accepts these errors in order to reduce the complexity of the de Bruijn graph. • Reducing the de Bruijn graph eliminates false edges. • For example, in the N. meningitidis sequencing project, orphan elimination corrects 234,410 errors and introduces 1,452 errors.

  13. Observations of the EULER

  14. Conclusions • Finishing is a bottleneck in large-scale DNA sequencing. • EULER has excellent scaling potential. • The complexity of EULER is mainly defined by the number of tangles rather than by the number of repeats or the length of the genomes.

  15. RESULTS AND DISCUSSION • The general performance of SEA on the benchmark • Prediction ambiguity improves alignment quality • Alignment quality versus local structure prediction ambiguity

  16. CONCLUSION

  17. Any Questions?