
Genomic Sequencing


navid

Presentation Transcript


  1. Genomic Sequencing

  2. DNA Sequencing • Fred Sanger, Cambridge, England • Copy DNA with one of four bases starved • ACGTAAGCTA with T starved produces ACG and ACGTAAGC • Run experiment with each of four bases starved, producing a ladder (all sub-fragments ending at the base) • Separate resulting fragments by length • Animations • http://www.youtube.com/watch?v=UT9wqaVCH5s • http://www.mun.ca/biology/scarr/4241_RMC_Sequencing.html • http://dnalc.org/view/15479-Sanger-method-of-DNA-sequencing-3D-animation-with-narration.html
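The starved-base idea on this slide can be sketched in a few lines of Python. This is a toy model of the description above, not of the laboratory chemistry, and the function names `starved_fragments` and `read_ladder` are made up for illustration.

```python
def starved_fragments(template: str, base: str) -> list[str]:
    """Copies of the template that stop just before each occurrence of the starved base."""
    return [template[:i] for i, b in enumerate(template) if b == base]


def read_ladder(template: str) -> str:
    """Run all four starved experiments and read the sequence off the sorted fragment lengths."""
    ladders = {b: starved_fragments(template, b) for b in "ACGT"}
    # A fragment of length i in the ladder for base b means position i holds b.
    base_at = {len(f): b for b, frags in ladders.items() for f in frags}
    return "".join(base_at[i] for i in sorted(base_at))


t_ladder = starved_fragments("ACGTAAGCTA", "T")   # the slide's example
recovered = read_ladder("ACGTAAGCTA")
print(t_ladder, recovered)
```

Sorting the fragments by length plays the role of the separation step: each length occurs in exactly one of the four ladders, which identifies the base at that position.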

  3. DNA Sequencing • Later, sequencing machines sequence 500-700 nt fragments, called reads • Reads are assembled into a continuous genome (difficult) • Shotgun sequencing • Current • Next Generation Sequencing (NGS) • www.cs.uml.edu/~kim/580/10_ngs.pdf

  4. Shotgun Method • Break up DNA into small fragments, each of which is sequenced • Use a computer to search for overlaps • Build a master sequence • Good for short prokaryote genomes • For n fragments, # of possible overlaps is 2n(n-1) • Repeats in the sequence are a problem because they create ambiguous overlaps

  5. Shot-gun Sequencing

  6. Shot-gun Sequencing

  7. Genetic Maps • For long genomes, use genetic markers • Use the shotgun method and locate known markers in the master sequence • Known genes can be markers

  8. Restriction Map • Restriction endonuclease • An enzyme that binds to specific DNA sequences and makes a double-stranded cut at or near them • Type II enzymes always cut at the same place (over 2,500 type II) • e.g., HindII cuts at GTYRAC (Y = C or T, R = A or G), such as GTTAAC or GTCGAC

  9. Complete and Partial Digest • Probability of restriction site being cut • = 1: complete digest • Distance between successive cuts is known and accurate • <1 : partial digest • Distances across more than one restriction site are generated
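A small simulation may make the difference between the two digests concrete. It is only a sketch: the molecule length, the site positions, and the function name `digest` are hypothetical, and the sites are assumed to be given in increasing order.

```python
import random


def digest(length, sites, p, rng):
    """Cut a molecule of the given length; each restriction site is cut with probability p."""
    cuts = [s for s in sites if rng.random() < p]
    bounds = [0] + cuts + [length]
    return [b - a for a, b in zip(bounds, bounds[1:])]


rng = random.Random(0)
sites, length = [2, 4, 7], 10

# p = 1: every site is cut, so only distances between successive sites appear.
complete = digest(length, sites, 1.0, rng)

# p < 1: over many molecules, distances spanning several sites appear too.
partial = set()
for _ in range(1000):
    partial.update(digest(length, sites, 0.5, rng))

print(complete)
print(sorted(partial))
```

Every length observed in the partial digest is a distance between two of the points 0, 2, 4, 7, 10, which is the kind of data the Partial Digest Problem starts from.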

  10. Partial Digest Problem (PDP) • $X = \{x_1 = 0, x_2, \ldots, x_n\}$: an ordered set of n points on a line • $\Delta X = \{x_j - x_i \mid 1 \le i < j \le n\}$: the multiset of pairwise distances, with $\binom{n}{2}$ elements • Partial Digest Problem (PDP) • Given a multiset L containing $\binom{n}{2}$ integers (pairwise distances) • Find a set X of n integers such that $\Delta X = L$ • Also called the Turnpike problem: reconstructing a highway from the distances between pairs of exits • The solution X is not always unique • e.g., $\Delta A = \Delta(A + v)$, where $A + v = \{a + v \mid a \in A\}$ (one set is a shift of the other) • $A = \{0, 2, 4, 7, 10\}$, $A + 100 = \{100, 102, 104, 107, 110\}$ • e.g., $\Delta A = \Delta(-A)$ (one set is a reflection of the other) • $A = \{0, 2, 4, 7, 10\}$, $-A = \{-10, -7, -4, -2, 0\}$ • In general, $U + V$ and $U - V$ are homometric (they have the same multiset of pairwise distances)
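The non-uniqueness claims on this slide can be checked numerically. The sketch below assumes a helper named `delta` (not from the slides) that returns the sorted multiset of pairwise distances; the sets U and V are chosen purely for illustration.

```python
from itertools import combinations


def delta(points):
    """Sorted multiset of pairwise distances between the points."""
    return sorted(abs(a - b) for a, b in combinations(points, 2))


A = [0, 2, 4, 7, 10]
shifted = [a + 100 for a in A]    # A + 100
mirrored = [-a for a in A]        # -A

# Illustrative U and V: U + V and U - V are different sets ...
U, V = [6, 7, 9], [-6, 2, 6]
u_plus_v = sorted({u + v for u in U for v in V})
u_minus_v = sorted({u - v for u in U for v in V})

print(delta(A) == delta(shifted) == delta(mirrored))
# ... with the same multiset of pairwise distances.
print(u_plus_v != u_minus_v and delta(u_plus_v) == delta(u_minus_v))
```

Shifts and reflections are harmless ambiguities, since they describe the same map. Homometric pairs such as U + V and U - V are genuinely different maps that a partial digest cannot tell apart.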

  11. PDP(1) • Brute force approach • Given L, • Compute ΔX for every possible candidate set X • Until an X is found such that ΔX = L • Need to examine $\binom{M-1}{n-2}$ different sets of positions • => $O(M^{n-2})$
      BruteForcePDP(L, n)
        M ← max(L)
        for every set of n-2 integers 0 < x2 < . . . < xn-1 < M
          X ← {0, x2, . . ., xn-1, M}
          form ΔX from X
          if ΔX = L
            return X
        return “No Solution”

  12. PDP(2) • Brute force approach • Given L, • Identical to BruteForcePDP() except that every xi Є L • Need to examine $\binom{|L|}{n-2}$ different sets of positions • => $O(n^{2n-4})$, since |L| = n(n-1)/2
      BruteForcePDP(L, n)
        M ← max(L)
        for every set of n-2 integers 0 < x2 < . . . < xn-1 < M from L
          X ← {0, x2, . . ., xn-1, M}
          form ΔX from X
          if ΔX = L
            return X
        return “No Solution”
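As a rough illustration, the second brute-force variant can be written directly in Python. This is a sketch under the assumption that L is given as a list of positive integers; the names `brute_force_pdp` and `delta` are chosen here, and `None` stands in for the "No Solution" result.

```python
from itertools import combinations


def delta(points):
    """Sorted multiset of pairwise distances."""
    return sorted(b - a for a, b in combinations(sorted(points), 2))


def brute_force_pdp(L, n):
    """Try every choice of n-2 inner points taken from the distances in L."""
    L = sorted(L)
    M = L[-1]
    # Every point is its own distance from x1 = 0, so inner points must occur in L.
    candidates = sorted({d for d in L if 0 < d < M})
    for inner in combinations(candidates, n - 2):
        X = [0, *inner, M]
        if delta(X) == L:
            return X
    return None


X = brute_force_pdp([2, 2, 3, 3, 4, 5, 6, 7, 8, 10], 5)
print(X)
```

Restricting candidates to values in L is what shrinks the search from all integers below M to at most |L| values. The running time is still exponential in n.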

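  13. PDP(3) • Steven Skiena, 1990 • The largest element of L determines the two outermost points in X • e.g., L = {2,2,3,3,4,5,6,7,8,10} • Pick 10: • X = {0,10} • L = {2,2,3,3,4,5,6,7,8} • Pick 8: X = {0,2,10} or X = {0,8,10}; take X = {0,2,10} and remove the distances 2 and 8 • L = {2,3,3,4,5,6,7} • Pick 7: the new point is 7 or 3, but placing it at 3 would require the distance 3-2=1, which is not in L • X = {0,2,7,10}; remove the distances 7, 5, and 3 • L = {2,3,4,6} • . . .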

  14. [Δ(y, X): multiset of distances between a point y and all points in set X]
      PartialDigest(L)
        width ← max(L)
        DELETE(width, L)
        X ← {0, width}
        PLACE(L, X)
      PLACE(L, X)
        if L is empty
          output X
          return
        y ← max(L)
        if Δ(y, X) is a subset of L
          add y to X and remove Δ(y, X) from L
          PLACE(L, X)
          remove y from X and add Δ(y, X) to L
        if Δ(width-y, X) is a subset of L
          add width-y to X and remove Δ(width-y, X) from L
          PLACE(L, X)
          remove width-y from X and add Δ(width-y, X) to L
        return
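The PLACE / PartialDigest pseudocode translates fairly directly into Python. The version below is a sketch rather than a reference implementation: it uses `collections.Counter` as the multiset and passes fresh copies down the recursion instead of undoing changes, which is a design choice made here for brevity.

```python
from collections import Counter


def delta(y, X):
    """Multiset of distances between point y and every point in X."""
    return Counter(abs(y - x) for x in X)


def place(L, X, width, solutions):
    if not L:                       # all distances explained
        solutions.append(sorted(X))
        return
    y = max(L)
    # The largest remaining distance is measured from one end: point y or width - y.
    for candidate in sorted({y, width - y}):
        d = delta(candidate, X)
        if all(L[dist] >= count for dist, count in d.items()):
            # Counter subtraction drops exhausted distances, so no explicit undo is needed.
            place(L - d, X | {candidate}, width, solutions)


def partial_digest(distances):
    L = Counter(distances)
    width = max(L)
    L = L - Counter([width])
    solutions = []
    place(L, {0, width}, width, solutions)
    return solutions


solutions = partial_digest([2, 2, 3, 3, 4, 5, 6, 7, 8, 10])
print(solutions)
```

On the example from the previous slide this finds the set {0, 2, 4, 7, 10} together with its mirror image {0, 3, 6, 8, 10}.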

  15. Shortest Superstring Problem • Find a superstring of the reads, and the shortest one • Given a set of strings, find a shortest string that contains all of them • Input: strings s1, s2, …, sn • Output: a shortest string s that contains all of s1, s2, …, sn as substrings • Example: for the set {000, 001, 010, 011, 100, 101, 110, 111}, a shortest superstring is 0001110100 [figure omitted]

  16. Shortest Superstring Problem -2 • Define overlap(si, sj) • The length of the longest prefix of sj that matches a suffix of si • The Shortest Superstring problem becomes • A Traveling Salesman Problem with a vertex for each string and an edge from si to sj weighted by -overlap(si, sj)
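To make the overlap definition and the tour formulation concrete, here is a brute-force sketch that tries every ordering of the strings. It assumes the strings are distinct and none is a substring of another, and it is only practical for a handful of strings; the function names are illustrative.

```python
from itertools import permutations


def overlap(a, b):
    """Length of the longest prefix of b that matches a suffix of a."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def shortest_superstring(strings):
    """Exhaustive search over orderings: maximal total overlap gives minimal length."""
    ov = {(a, b): overlap(a, b) for a in strings for b in strings if a != b}
    best = None
    for order in permutations(strings):
        merged = order[0]
        for prev, cur in zip(order, order[1:]):
            merged += cur[ov[prev, cur]:]
        if best is None or len(merged) < len(best):
            best = merged
    return best


triples = ["000", "001", "010", "011", "100", "101", "110", "111"]
s = shortest_superstring(triples)
print(len(s), s)
```

Each ordering corresponds to a path through all vertices of the overlap graph, which is why the problem inherits the difficulty of the Traveling Salesman Problem.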

  17. DNA Arrays • Sequencing by Hybridization (SBH) • A chip holds millions of short DNA fragments called probes • The input DNA sequence hybridizes to probes in the array (chip) via complementary base pairing

  18. Base coverage • A sample (genome) is amplified • A base in the sample is copied into many reads • But reads are randomly generated • The number of reads covering a base follows a Poisson distribution • Similarly for k-mers • Still a Poisson distribution, but with a different mean

  19. Coverage Depth and Extent • Coverage Depth • The average number of times each base or k-mer is sequenced • Coverage Extent • The fraction of the genome covered by at least one read or k-mer • Given a genome of size G, read length L, and number of reads N • Total number of bases (nb) and k-mers (nk) • nb = N*L; nk = N*(L-k+1) • nb/nk = L/(L-k+1)

  20. Coverage Depth and Extent • Coverage depth of bases (db) and k-mers (dk) • db = nb/G; dk = nk/G • db/dk = L/(L-k+1) • For de novo sequencing, these relationships can be used to estimate the unknown genome size (G) and the coverage depth for bases (db) from read data before assembly: • G = nk/dk and db = dk*L/(L-k+1)
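The two estimation formulas are simple enough to check with a worked example. The numbers below are hypothetical, and the k-mer depth `kmer_depth` is treated as a given input; in practice it would have to be measured from the reads (for example from a k-mer frequency histogram), which is an assumption not spelled out on the slide.

```python
def estimate_genome(num_reads, read_len, k, kmer_depth):
    """Estimate genome size G and base depth db from read statistics."""
    n_k = num_reads * (read_len - k + 1)                       # nk = N*(L-k+1)
    genome_size = n_k / kmer_depth                             # G = nk/dk
    base_depth = kmer_depth * read_len / (read_len - k + 1)    # db = dk*L/(L-k+1)
    return genome_size, base_depth


# Hypothetical run: one million 100-nt reads, k = 21, observed k-mer depth 16.
G, db = estimate_genome(num_reads=1_000_000, read_len=100, k=21, kmer_depth=16)
print(G, db)
```

As a consistency check, db computed this way equals nb/G: 100,000,000 bases over a 5,000,000-base genome gives a depth of 20.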

  21. Coverage Depth or Sequencing Depth • Coverage Depth (db) is called sequencing depth (c) • From Poisson, prob. of non-coverage is • P(X=0) = exp(-c) • Coverage extent is P(X>0) = 1- exp(-c) • To cover >99% of a genome, c>4.6 • To ensure the whole genome is covered, # of uncovered bases G*exp(-c)<1 • Human genome (3 Gb): c>22
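The thresholds quoted on this slide follow from the Poisson formula and can be reproduced with a few lines of Python. The function names are chosen here for illustration.

```python
import math


def coverage_extent(c):
    """Expected fraction of the genome covered at sequencing depth c: 1 - exp(-c)."""
    return 1.0 - math.exp(-c)


def depth_for_extent(fraction):
    """Smallest depth c with 1 - exp(-c) >= fraction."""
    return -math.log(1.0 - fraction)


def depth_for_whole_genome(genome_size):
    """Depth at which the expected number of uncovered bases, G*exp(-c), drops to 1."""
    return math.log(genome_size)


c99 = depth_for_extent(0.99)            # about 4.6
c_human = depth_for_whole_genome(3e9)   # about 21.8, hence c > 22
print(round(c99, 2), round(c_human, 2))
```

Both values are natural logarithms: ln(100) for 99% coverage and ln(G) for an expected count of uncovered bases below one.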

  22. SBH • Given an unknown DNA sequence, a DNA array provides • All strings of length l that the sequence contains • No information about their positions • Spectrum (s, l) • For a string s of length n, the multiset of all n-l+1 l-mers in s • l=3, s=TATGGTGC • Spectrum(s, l) = {TAT, ATG, TGG, GGT, GTG, TGC}
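Computing a spectrum is a one-line sliding window, shown here as a minimal sketch. The l-mers are returned in string order only for readability; a DNA array would not provide that order.

```python
def spectrum(s, l):
    """All n-l+1 l-mers of s (a multiset, represented as a list)."""
    return [s[i:i + l] for i in range(len(s) - l + 1)]


lmers = spectrum("TATGGTGC", 3)
print(lmers)
```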

  23. SBH as a Hamiltonian Path Problem • Two l-mers p and q overlap if overlap(p,q) = l-1 • Hamiltonian Path Problem • Given Spectrum (s, l), create a vertex for every l-mer in Spectrum (s, l) • Connect p to q with a directed edge if they overlap • Find a path that visits every vertex exactly once • Corresponds to Overlap-Layout-Consensus (OLC) • NP-complete
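For a tiny spectrum, the Hamiltonian path formulation can be solved by exhaustive search. The sketch below uses plain backtracking, which is exponential in general (consistent with the NP-completeness noted above); the function name is illustrative.

```python
def hamiltonian_reconstructions(lmers):
    """Every string spelled by a path that visits each l-mer exactly once."""
    n = len(lmers)
    # Edge i -> j when the last l-1 characters of lmers[i] equal the first l-1 of lmers[j].
    adjacent = {
        i: [j for j in range(n) if j != i and lmers[i][1:] == lmers[j][:-1]]
        for i in range(n)
    }
    results = set()

    def extend(path, visited):
        if len(path) == n:
            results.add(lmers[path[0]] + "".join(lmers[i][-1] for i in path[1:]))
            return
        for j in adjacent[path[-1]]:
            if j not in visited:
                extend(path + [j], visited | {j})

    for start in range(n):
        extend([start], {start})
    return sorted(results)


found = hamiltonian_reconstructions(["ATG", "GGT", "GTG", "TAT", "TGC", "TGG"])
print(found)
```

The input order is deliberately scrambled, since the array gives no positional information.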

  24. OLC • Conventional shotgun sequencing • Overlap-layout-consensus • Overlap: use a computer to search for overlaps, trying all possible pairs of fragments • Layout: putting fragments together • Consensus: error correction • Good for short prokaryote genomes • For n fragments, # of possible overlaps is 2n(n-1) • Difficult • No solution for the “repeat problem” of finding the correct path in the layout step • Produces sequencing errors • Programs • PHRAP, CAP, TIGR, CELERA

  25. SBH as an Eulerian Path Problem • A graph with all (l-1)-mers (later) • edges corresponding to l-mers from Spectrum (s, l) • Find a path visiting every edge exactly once

  26. Eulerian Path Problem • Repeatedly find Eulerian cycles in the graph • Linear time
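The Eulerian formulation can be sketched with Hierholzer's algorithm, a standard linear-time method that this example names explicitly because the slides do not say which algorithm is meant. The code assumes an Eulerian path exists and does not verify that every edge was used.

```python
from collections import defaultdict


def eulerian_reconstruction(lmers):
    """Spell the string along a path that uses every l-mer (edge) exactly once."""
    graph = defaultdict(list)     # (l-1)-mer -> list of successor (l-1)-mers
    balance = defaultdict(int)    # out-degree minus in-degree
    for lmer in lmers:
        u, v = lmer[:-1], lmer[1:]
        graph[u].append(v)
        balance[u] += 1
        balance[v] -= 1
    # Start at the vertex with one extra outgoing edge, if there is one.
    start = next((node for node, b in balance.items() if b == 1), lmers[0][:-1])
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if graph[node]:
            stack.append(graph[node].pop())   # follow an unused edge
        else:
            path.append(stack.pop())          # dead end: emit vertex
    path.reverse()
    return path[0] + "".join(node[-1] for node in path[1:])


text = eulerian_reconstruction(["TAT", "ATG", "TGG", "GGT", "GTG", "TGC"])
print(text)
```

Each edge is pushed and popped once, which is where the linear running time comes from.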

  27. De Bruijn Graph • Partition read fragments into fixed-size k-mers • k = 27, for example • Each (k-1)-mer becomes a graph node
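Building such a graph from reads takes only a few lines. This sketch uses a very small k and two made-up reads; it counts how many times each edge is seen and ignores reverse complements and sequencing errors.

```python
from collections import Counter


def de_bruijn_edges(reads, k):
    """Map each edge ((k-1)-mer prefix, (k-1)-mer suffix) to the number of k-mers supporting it."""
    edges = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            edges[kmer[:-1], kmer[1:]] += 1
    return edges


edges = de_bruijn_edges(["ACGTAC", "GTACG"], k=3)
nodes = {node for edge in edges for node in edge}
for (u, v), count in sorted(edges.items()):
    print(u, "->", v, "x", count)
```

Identical k-mers from different reads land on the same edge, so overlaps between reads are found implicitly, without comparing reads pairwise.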

  28. OLC vs. De Bruijn Graph

  29. De Bruijn Graph • Eulerian Graph • De Bruijn Graph • Glue parallel links with multiplicity (e.g., multiplicity of 3) • Tangle: # of input edges is not equal to # of output edges

  30. De Bruijn Graph • How to construct a de Bruijn graph from a collection of sequencing reads? • Gluing requires knowledge of the finished sequence • Cannot construct the de Bruijn graph from a collection of sequencing reads until sequencing is completed • Let s be a sequencing read with errors • If the genome sequence G is known, errors in s can be corrected by aligning s against G • But G is not known until the last “consensus” step • EULER uses Spectral Alignment (SA) to minimize errors in the first step

  31. Programs • ABySS (Assembly By Short Sequences) • Simpson, 2009 • www.cs.uml.edu/~kim/580/08_abyss.pdf • Velvet • Zerbino and Birney, 2008 • www.cs.uml.edu/~kim/580/08_velvet.pdf • Euler • Pevzner, 2001 • www.cs.uml.edu/~kim/580/01_pevzner.pdf • www.cs.uml.edu/~kim/580/09_chaisson.pdf • SOAPdenovo (Short Oligonucleotide Alignment Program) • Beijing Genomics Institute • www.cs.uml.edu/~kim/580/09_soap.pdf

  32. ABySS • Proceeds in two stages • Stage 1 • All possible k-mers are generated from reads • Remove read errors and construct initial contigs • Stage 2 • Use mate-pairs to extend contigs • Distributed implementation of de Bruijn graph in a cluster using Message Passing Interface (MPI) over multiple computers

  33. ABySS – Stage 1 • Three steps • Load read data into distributed de Bruijn graph • Resolve read errors • Merge graph nodes • Load read data into distributed de Bruijn graph • Reads with unknown bases are discarded • Each read is broken into (read_length-k+1) overlapping k-mers • A k-mer is assigned to one cluster node • Compute adjacency of k-mers • For each k-mer, a message is sent to its eight possible neighbors • If a neighbor exists, there must be a k-1 bp overlap
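The loading and adjacency steps can be sketched on a single machine. This is not ABySS code: it leaves out the MPI distribution and reverse complements, and the function names are hypothetical. A set lookup stands in for the message sent to each possible neighbor.

```python
BASES = "ACGT"


def kmers_from_reads(reads, k):
    """Break reads into overlapping k-mers, discarding reads with unknown bases."""
    kmers = set()
    for read in reads:
        if any(base not in BASES for base in read):
            continue
        for i in range(len(read) - k + 1):
            kmers.add(read[i:i + k])
    return kmers


def possible_neighbors(kmer):
    """The eight k-mers that overlap this one by k-1 bases: four successors, four predecessors."""
    successors = [kmer[1:] + b for b in BASES]
    predecessors = [b + kmer[:-1] for b in BASES]
    return successors + predecessors


def adjacency(kmers):
    """For each k-mer, keep the possible neighbors that actually occur."""
    return {kmer: [n for n in possible_neighbors(kmer) if n in kmers] for kmer in kmers}


kmers = kmers_from_reads(["ACGTA", "ACNTA", "CGTAC"], k=3)
graph = adjacency(kmers)
print(sorted(kmers))
print(graph["CGT"])
```

Because the neighbors of a k-mer can be generated from the k-mer itself, adjacency can be computed without ever comparing reads to each other.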

  34. ABySS – Stage 1 (cont’d) • Resolve read errors • Remove dead-ends • When correct k-mers of a read connect to incorrect k-mers, • They are likely to be unique and most will not have an extension • One end of the branch will terminate with no extension • Dead-end branches are traced backward to the ambiguous point and are removed if their lengths are shorter than a threshold • Remove bubbles • A branch diverges and rejoins later • Caused by single base differences

  35. ABySS – Stages 1 and 2 • Vertex merging • Merge vertices linked by unambiguous edges • Contig merging • Use paired-end info

  36. ABySS Results • Genome of African male from NCBI Short Read Archive: Accession # SRA000271 • 3.5 Billion mate-paired reads, x42 • Read length: 36-42 bp, median fragment 210 bp • At k=27, 15h run time without paired-end info

  37. ABySS Comparisons

  38. Velvet • Construct a graph • Transform reads into roadmaps • From a read, generate k-mers with read ID and position in the read (called roadmaps) • Each read is transformed to a set of k-mers with overlaps and hash links to previous reads with the same k-mers • 2nd database • For each read, which k-mers are overlapped by subsequent reads • Trace reads through the graph using roadmaps

  39. Velvet • Graph simplification • A node with one outlink can be combined with a next node with one input link • Error removal • Focus on topological features • Tips (dead-ends) shorter than 2k • bubbles due to internal read errors (Tour Bus algorithm) • Erroneous connections due to distant merging tips • Breadcrumb – use read pairs to extend contigs

  40. EULER, 2001 • EULER • Implement Eulerian Path problem • Issues with real data • Reads may have errors • Error correction is typically done in ‘consensus’ stage • EULER corrects errors in the first step • SA (Spectral Alignment) • Repeat problem • De Bruijn graph

  41. Spectral Alignment (SA) • Although the genome sequence G is not known, the set Gl of all l-mers present in G can be accurately predicted • An l-mer is called solid if it belongs to more than M reads • EULER approach: approximate Gl as the set of all solid l-mers • SBH problem without read errors • Construct a graph with edges corresponding to l-mers from Spectrum(s, l) • Find a path visiting every edge exactly once • SA • Given a string s and Gl, find the minimum number of mutations that transform s into a string s* such that Spectrum(s*, l) ⊆ Gl • Can be solved efficiently by dynamic programming
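The notion of a solid l-mer is easy to illustrate. The sketch below only computes the approximation of Gl, not the spectral alignment itself; the reads and the threshold M are made up, and reverse complements are ignored.

```python
from collections import Counter


def solid_lmers(reads, l, M):
    """l-mers that occur in more than M reads."""
    counts = Counter()
    for read in reads:
        # Use a set so that each l-mer is counted once per read.
        counts.update({read[i:i + l] for i in range(len(read) - l + 1)})
    return {lmer for lmer, c in counts.items() if c > M}


# Three hypothetical reads over the same region; the third carries an error.
reads = ["ACGTA", "ACGTT", "ACGAA"]
solid = solid_lmers(reads, l=3, M=1)
print(sorted(solid))
```

The l-mers created by the error in the third read occur only once, so they fall below the threshold and are excluded from the approximation of Gl.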

  42. Spectral Alignment (SA) • Formulation • Given a set of reads R = {r1, .., rn}, an integer l, and an upper bound Δ on the number of errors in each read • Spectrum Sl is the set of all l-mers from the reads r1, .., rn and their reverse complements r1’, .., rn’ • Introduce up to Δ corrections in each read in R such that |Sl| is minimized • Result • One correction in a read can remove up to l erroneous l-mers from R and l from R’ • Reduces 86.5% of read errors • But can create errors • One change in a read may change all reads in the region • Error introduction is OK as long as the errors from overlapping reads covering the same position are consistent, corresponding to a single mutation in the genome • Corrected 234,410 errors and introduced 1,452 errors in the NM (Neisseria meningitidis) data

  43. EULER - Results • No incorrect contigs

  44. Summary of EULER • Eulerian Path approach – de Bruijn graph • Do error correction early – SA • Fill gaps ASAP

  45. EULER+, 2004 • A-Bruijn graph • To handle errors in reads, introduce vertices from ungapped alignments that allow mismatches, rather than exact l-mers as in de Bruijn assembly • Graph simplification algorithms remove errors in edges • The size of the de Bruijn graph is proportional to the coverage, so high-coverage short-read data needs more memory than long-read sequencing
