
CAP5510 – Bioinformatics Sequence Assembly

Tamer Kahveci, CISE Department, University of Florida


Presentation Transcript


  1. CAP5510 – Bioinformatics: Sequence Assembly. Tamer Kahveci, CISE Department, University of Florida

  2. What is Sequence Assembly?
     • We can only sequence short fragments (100–500 bases).
     • How can we sequence long sequences (e.g., a single chromosome can have hundreds of millions of bases)?
        • Chop the long sequence into many small fragments.
        • Sequence all fragments.
        • Put them together to reconstruct the long sequence.
     • Problem: Consider a long sequence S. Given a collection of contiguous subsequences of S (a.k.a. fragments or reads), denoted R = {r1, r2, …, rn}, construct S from R.
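To make the problem statement above concrete, here is a minimal Python sketch that produces a read set R from a known sequence S. The function name `sample_reads`, the fixed read length, the uniform random start positions, and the toy sequence are all assumptions made for illustration; real reads vary in length and contain errors.

```python
import random

def sample_reads(s, n_reads, read_len, seed=0):
    """Return n_reads contiguous substrings of s, each read_len bases long.

    Start positions are drawn uniformly at random (an assumption of this
    sketch; it does not model any particular sequencing technology).
    """
    last_start = len(s) - read_len
    if last_start < 0:
        raise ValueError("read_len is longer than the sequence")
    rng = random.Random(seed)  # fixed seed so the example is repeatable
    starts = [rng.randint(0, last_start) for _ in range(n_reads)]
    return [s[i:i + read_len] for i in starts]

S = "ACGTTGCAAGCTTAGGCTAACGT"   # toy sequence, hypothetical
R = sample_reads(S, n_reads=10, read_len=8)
```

The assembly problem is the inverse of this function: given only R, recover S.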

  3. Sequence Assembly
     • Coverage: the average number of reads in R that contain a given base of S.
     • Issues:
        • Errors in R
        • Repeats in S
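The coverage definition above can be computed in two equivalent ways when every read lies fully inside S: as total read length divided by the length of S, or as the mean of the per-base read depth. The sketch below shows both; the function names and the toy numbers are assumptions for illustration.

```python
def coverage(read_lengths, sequence_length):
    """Average number of reads covering a base: total read bases / |S|."""
    return sum(read_lengths) / sequence_length

def per_base_depth(starts, read_len, sequence_length):
    """Number of reads covering each position of S, given read start offsets."""
    depth = [0] * sequence_length
    for start in starts:
        for i in range(start, min(start + read_len, sequence_length)):
            depth[i] += 1
    return depth

# 1,000 reads of 400 bases over a 100,000-base sequence
print(coverage([400] * 1000, 100_000))   # 4.0

# Three 4-base reads starting at offsets 0, 2 and 4 on an 8-base sequence
depth = per_base_depth([0, 2, 4], read_len=4, sequence_length=8)
print(depth)                             # [1, 1, 2, 2, 2, 2, 1, 1]
print(sum(depth) / len(depth))           # 1.5
```

The per-base view also shows why an average can mislead: some positions may have depth well below the mean, or zero.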

  4. Assemblers
     • De novo: no prior knowledge of S is available.
        • Slow
        • Phusion (Mullikin & Ning 2003)
        • Arachne (Batzoglou et al. 2002)
        • CAP (Huang & Madan, 1992)
     • Mapping: a sequence similar to S is known.
        • Needs prior knowledge of S.
        • SHRiMP (Rumble et al. 2009)

  5. Phusion (Mullikin & Ning 2003)
     • Clipping: Remove low-quality reads, clip ends.
     • Clustering: Group similar reads together.
        • Create a histogram of k-mers (k = 17).
        • Remove repetitive ones (13 or more occurrences).

  6. Phusion (Mullikin & Ning 2003)
     • Clipping: Remove low-quality reads, clip ends.
     • Clustering: Group similar reads together.
        • Create a histogram of k-mers (k = 17).
        • Remove repetitive ones (13 or more occurrences).
        • Keep a list for each k-mer showing the reads that contain it.
        • Find all pairs of reads sharing at least one k-mer.
        • Keep the number of common k-mers for each such pair.
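The clustering steps listed above can be sketched in Python as follows. This is an illustration of the bullet points only, not Phusion's actual implementation: the function names are hypothetical, reverse complements and sequencing errors are ignored, and the example uses k = 4 on toy reads rather than k = 17.

```python
from collections import Counter, defaultdict
from itertools import combinations

def kmers(read, k):
    """All overlapping k-mers of a read, in order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]

def shared_kmer_counts(reads, k=17, max_occurrences=12):
    """Count common non-repetitive k-mers for every pair of reads.

    A k-mer seen more than max_occurrences times (13 or more by default)
    is treated as repetitive and ignored.
    """
    # Histogram of k-mer occurrences across all reads.
    histogram = Counter(km for read in reads for km in kmers(read, k))

    # For each non-repetitive k-mer, the set of reads that contain it.
    index = defaultdict(set)
    for read_id, read in enumerate(reads):
        for km in kmers(read, k):
            if histogram[km] <= max_occurrences:
                index[km].add(read_id)

    # Every pair of reads listed under the same k-mer shares that k-mer.
    pair_counts = Counter()
    for read_ids in index.values():
        for a, b in combinations(sorted(read_ids), 2):
            pair_counts[(a, b)] += 1
    return pair_counts

reads = ["ACGTAC", "GTACGG", "TTTTTT"]
counts = shared_kmer_counts(reads, k=4, max_occurrences=2)
print(dict(counts))   # {(0, 1): 1}
```

Reads 0 and 1 share the single k-mer GTAC; TTTT occurs three times in read 2, exceeds the threshold, and is dropped. Building the index first avoids comparing every pair of reads directly.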

  7. Phusion (Mullikin & Ning 2003)
     • Clipping: Remove low-quality reads, clip ends.
     • Clustering: Group similar reads together.
     • Assemble each cluster into a contig.
        • Given a pair of reads, extend their matching k-mers.
     • Join overlapping contigs.
        • If two contigs share a read, try to put them together into a longer contig by splicing them first.
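One way to read "extend their matching k-mers" is as seed-and-extend: use a shared k-mer to fix the relative offset of two reads, then check that the reads agree over their whole overlap before merging. The sketch below follows that reading under strong simplifying assumptions: exact matching only (no error tolerance), only the first occurrence of each seed in the second read is tried, and the name `merge_on_shared_kmer` is hypothetical. It is not taken from Phusion.

```python
def merge_on_shared_kmer(a, b, k):
    """Merge reads a and b if a shared k-mer extends to an exact overlap.

    Returns the merged sequence, or None if no shared k-mer leads to a
    consistent overlap.
    """
    for i in range(len(a) - k + 1):
        j = b.find(a[i:i + k])       # seed: position of this k-mer in b
        if j < 0:
            continue
        # Order the reads so that `right` starts `offset` bases into `left`.
        if i >= j:
            left, right, offset = a, b, i - j
        else:
            left, right, offset = b, a, j - i
        overlap = len(left) - offset
        # Extend the seed: the reads must agree on the full overlap.
        if left[offset:offset + len(right)] == right[:overlap]:
            return left + right[overlap:]
    return None

print(merge_on_shared_kmer("ACGTTGCA", "TGCAAGCT", k=4))   # ACGTTGCAAGCT
print(merge_on_shared_kmer("ACGTTGCA", "GTGCAAGC", k=4))   # None
```

In the second call the reads share the seed TGCA, but the bases to its left disagree, so the pair is not merged.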

  8. Euler (Pevzner et al. 2001)
     • Clipping: Remove low-quality reads, clip ends.
     • Clustering: Group similar reads together.
     • Assemble each cluster into a contig.
        • Create a de Bruijn graph.
           • Each node is a k-mer.
           • A directed edge indicates a dovetail overlap of k-1 positions.
        • Find an Eulerian path on this graph (visit each edge once): solvable in polynomial time.
        • Not a Hamiltonian path (visit each vertex once): NP-complete.
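The graph construction and path search described above can be sketched as follows. Nodes are k-mers, as on the slide; as an assumption of this sketch, an edge is drawn for each distinct (k+1)-mer seen in the reads, joining its first k bases to its last k bases (which overlap in k-1 positions). The path is found with Hierholzer's algorithm, a standard linear-time method. The code assumes error-free reads, a non-empty edge list, and that an Eulerian path exists; collapsing duplicate edges discards repeat multiplicity. The function names are hypothetical and this is not the Euler assembler's implementation.

```python
from collections import defaultdict

def de_bruijn_edges(reads, k):
    """One edge per distinct (k+1)-mer: prefix k-mer -> suffix k-mer."""
    edges = set()
    for read in reads:
        for i in range(len(read) - k):
            edges.add((read[i:i + k], read[i + 1:i + k + 1]))
    return sorted(edges)

def eulerian_path(edges):
    """Hierholzer's algorithm: a node sequence that uses every edge once."""
    graph = defaultdict(list)
    balance = defaultdict(int)            # out-degree minus in-degree
    for u, v in edges:
        graph[u].append(v)
        balance[u] += 1
        balance[v] -= 1
    # Start at the node with one surplus outgoing edge, if there is one.
    start = next((n for n, b in balance.items() if b == 1), edges[0][0])
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if graph[node]:
            stack.append(graph[node].pop())   # follow an unused edge
        else:
            path.append(stack.pop())          # dead end: emit the node
    return path[::-1]

def spell(path):
    """Sequence spelled by a path: first node plus the last base of each next node."""
    return path[0] + "".join(node[-1] for node in path[1:])

reads = ["ACGTT", "GTTGC", "TGCAA"]
edges = de_bruijn_edges(reads, k=3)
print(spell(eulerian_path(edges)))   # ACGTTGCAA
```

Each edge is pushed and popped once, so the search runs in time linear in the number of edges, in contrast to the Hamiltonian formulation, for which no polynomial-time algorithm is known.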
