
De Novo Repeat Classification and Fragment Assembly


Presentation Transcript


  1. De Novo Repeat Classification and Fragment Assembly • First-year master's student 김우연

  2. Repeat-Related Programs • Repeat annotation with libraries • RepeatMasker (A.F.A. Smit and P. Green, unpubl.) • MaskerAid (Bedell et al. 2000) • No de novo compilation • Repeat analysis • RepeatMatch (Delcher et al. 1999) • REPuter (Kurtz et al. 2000, 2001) • RECON, RepeatFinder, LTR_STRUC • No compact overview or summary of the repeat families

  3. Genome Research • Received January 27, 2004 • Accepted in revised form June 29, 2004

  4. CONTENTS • Introduction • Concepts • Methods • De Bruijn Graphs & A-Bruijn Graphs • RepeatGluer Algorithm • Constructing A-Bruijn Graphs Without the Similarity Matrix • Fragment Assembly • FragmentGluer Algorithm • Results and Discussion

  5. INTRODUCTION • “The problem of automated repeat sequence family classification is inherently messy and ill-defined and does not appear to be amenable to a clean algorithmic attack” – Bao and Eddy (2002) • One of the difficulties in repeat classification is that many repeats represent mosaics of sub-repeats – Bailey et al. 2002 • Aims • Proposing a new approach to repeat classification • FragmentGluer assembler

  6. CONCEPTS

  7. Genomic dot-plot of an imaginary sequence • Figure panels: an imaginary evolutionary process; the final genome; the genomic dot-plot • Gluing repeated regions leads to the repeat graph

  8. The idea of our approach • Gluing matching points together transforms the repeats into the A-Bruijn graph

  9. Mosaic repeat organization • BAC from human Chromosome Y • Repeat pairs by REPuter & Sub-repeats by our division • Repeat multigraph • Repeat graph • RepeatFinder vs RECON vs REPuter

  10. METHODS

  11. De Bruijn Graphs & A-Bruijn Graphs • De Bruijn graph of the sequence ACTGCTGCC • Distinct 3-mers: ACT, CTG, TGC, GCT, GCC
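The de Bruijn graph of the slide's example sequence can be sketched in a few lines. This is only an illustration, assuming the convention that each distinct 3-mer is a vertex and consecutive 3-mers in the sequence are joined by a directed edge (other presentations put k-mers on edges instead); the function name `de_bruijn_graph` is hypothetical.

```python
def de_bruijn_graph(seq, k=3):
    """Map each distinct k-mer of seq to the set of k-mers that follow it.

    Assumption for this sketch: k-mers are vertices, and an edge a -> b
    exists when b starts one position after a somewhere in seq.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    graph = {}
    for kmer in kmers:
        graph.setdefault(kmer, set())  # every k-mer becomes a vertex
    for a, b in zip(kmers, kmers[1:]):
        graph[a].add(b)                # consecutive k-mers share an edge
    return graph


if __name__ == "__main__":
    for kmer, successors in sorted(de_bruijn_graph("ACTGCTGCC").items()):
        print(kmer, "->", sorted(successors))
```

The repeated 3-mers CTG and TGC each appear once as a vertex, so the two occurrences of CTGC collapse onto one path and TGC branches to both GCT and GCC.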

  12. De Bruijn Graphs & A-Bruijn Graphs A-Bruijn Graph: … AT … ACT … ACAT …
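The gluing idea behind the A-Bruijn graph can be illustrated with a small union-find sketch. It assumes the input is a list of position pairs that the alignments declare equivalent (the nonzero entries of a similarity matrix); glued positions become one vertex, and every pair of consecutive positions in the sequence contributes an edge. The function name `a_bruijn_graph` and the pair-list input format are hypothetical choices for this example, not the presenters' implementation.

```python
def a_bruijn_graph(n, glued_pairs):
    """Glue positions 0..n-1 and return (class_of, edges).

    class_of[i] is the smallest position glued to i (the vertex label).
    edges maps (vertex, vertex) to its multiplicity, one count per pair
    of consecutive positions i, i+1 in the sequence.
    """
    parent = list(range(n))

    def find(x):
        # Follow parent links to the class representative, halving paths.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in glued_pairs:
        ri, rj = find(i), find(j)
        if ri != rj:
            # Keep the smaller position as representative of the class.
            parent[max(ri, rj)] = min(ri, rj)

    class_of = [find(i) for i in range(n)]
    edges = {}
    for i in range(n - 1):
        edge = (class_of[i], class_of[i + 1])
        edges[edge] = edges.get(edge, 0) + 1
    return class_of, edges


if __name__ == "__main__":
    # ACTGCTGCC: glue the CTG at positions 1-3 onto the CTG at positions 4-6.
    classes, edges = a_bruijn_graph(9, [(1, 4), (2, 5), (3, 6)])
    print(classes)
    print(edges)
```

Edges with multiplicity greater than one mark the glued (repeated) region, which is how repeats become visible as shared paths in the graph.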

  13. Whirls & Bulges • Gaps and mismatches are allowed

  14. RepeatGluer Algorithm • Construct the A-Bruijn graph • Eliminate whirls • Remove bulges • Erosion: remove all leaves • Straighten zigzag paths • Form the consensus sequence • Output repeat families
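Of the steps above, erosion is the simplest to sketch. The example below assumes an undirected graph stored as a dictionary of neighbor sets and treats a leaf as a vertex with exactly one neighbor; the `rounds` parameter and the function name `erode` are hypothetical, since the slide does not say how many passes are applied.

```python
def erode(adjacency, rounds=1):
    """Return a copy of the graph after removing all leaves, `rounds` times.

    adjacency: {vertex: set of neighbors}, assumed symmetric (undirected).
    A leaf is a vertex with exactly one neighbor.
    """
    graph = {v: set(nbrs) for v, nbrs in adjacency.items()}
    for _ in range(rounds):
        leaves = [v for v, nbrs in graph.items() if len(nbrs) == 1]
        if not leaves:
            break  # nothing left to erode
        for v in leaves:
            for u in graph.pop(v):
                if u in graph:
                    graph[u].discard(v)  # detach the removed leaf
    return graph


if __name__ == "__main__":
    # A triangle a-b-c with a dangling tail c-d-e.
    g = {
        "a": {"b", "c"},
        "b": {"a", "c"},
        "c": {"a", "b", "d"},
        "d": {"c", "e"},
        "e": {"d"},
    }
    print(sorted(erode(g, rounds=1)))
    print(sorted(erode(g, rounds=2)))
```

Each pass shortens dangling branches by one vertex, so the number of rounds bounds the length of the tails that get trimmed.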

  15. Constructing A-Bruijn Graphs Without the Similarity Matrix • Construction of the A-Bruijn graph assumes that S and A are given • S and { S1, …, St } suffice to construct the A-Bruijn graph of S • A set for every pair of consecutive positions in S • Matrix |Si| x |Sj| • A snapshot of a “small” area of matrix A • Notation: S is a genomic sequence; n is the length of S; A is the n x n similarity matrix; { S1, …, St } is a set of substrings; |Si| is the length of the string Si

  16. Fragment Assembly • Assemblers • Phrap (Green 1994) • Celera assembler (Myers et al. 2000) • EULER assembler (Pevzner et al. 2001) • http://nbcr.sdsc.edu/euler • ARACHNE, Phusion, CAP, TIGR • Building an accurate assembler • EULER + Phrap = EULER+ • Combines EULER’s accuracy in analyzing repeats with Phrap’s ability to handle low-coverage regions, low-quality reads, and read ends • Uses less memory than the original EULER • FragmentGluer algorithm

  17. FragmentGluer Algorithm • Construct the A-Bruijn graph of S • Eliminate whirls by splitting the composed vertices • Remove bulges • Erosion: remove all leaves • Straighten zigzag paths • Thread each read • Define the consensus sequence • Output repeat families • Transform mate-pairs into mate-paths after step 6 (threading the reads) • Assemble the resulting contigs into scaffolds with the EULER Scaffolding algorithm

  18. RESULTS AND DISCUSSION

  19. Benchmarking • EULER produced the fewest misassembled contigs. • EULER also had the fewest missing repeat copies (4), ahead of Phrap (5) and ARACHNE (9). • Average coverage, over 518 clones, was 99.3% for Phrap, 98.8% for EULER, and 98.6% for ARACHNE. • The average number of contigs per clone was lowest for EULER (6.2), followed by Phrap (6.8) and ARACHNE (13.8).

  20. More research • The consensus sequence analysis of FragmentGluer • Detecting de novo HERVs as the consensus sequence of FragmentGluer
