Genomic Sequencing

DNA Sequencing • Fred Sanger, Cambridge, England • Copy DNA with one of four bases starved • ACGTAAGCTA with T starved produces ACG and ACGTAAGC • Run experiment with each of four bases starved, producing a ladder (all sub-fragments ending at the base) • Separate resulting fragments by length • Animations • http://www.youtube.com/watch?v=UT9wqaVCH5s • http://www.mun.ca/biology/scarr/4241_RMC_Sequencing.html • http://dnalc.org/view/15479-Sanger-method-of-DNA-sequencing-3D-animation-with-narration.html
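The ladder idea on this slide can be sketched in a few lines of Python. This is only an illustration of the slide's simplified model (copying stops just before the starved base); the function names `starved_fragments` and `read_ladder` are made up for this sketch.

```python
def starved_fragments(template: str, starved_base: str) -> list[str]:
    """Prefixes of the template that stop just before each starved base."""
    return [template[:i] for i, base in enumerate(template) if base == starved_base]


def read_ladder(length: int, ladders: dict[str, list[str]]) -> str:
    """Recover the sequence using only the fragment lengths seen in each lane."""
    sequence = [""] * length
    for base, fragments in ladders.items():
        for fragment in fragments:
            # A fragment of length i in the lane for `base` means position i holds `base`.
            sequence[len(fragment)] = base
    return "".join(sequence)


template = "ACGTAAGCTA"
ladders = {base: starved_fragments(template, base) for base in "ACGT"}
print(ladders["T"])                         # ['ACG', 'ACGTAAGC']
print(read_ladder(len(template), ladders))  # ACGTAAGCTA
```

Only `len(fragment)` is used in `read_ladder`, which mirrors the slide's point that separating the fragments by length is enough to read the sequence.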

DNA Sequencing • Later, sequencing machines sequence 500-700 nt fragments, called reads • Reads are assembled into a continuous genome (difficult) • Shotgun sequencing • Current • Next Generation Sequencing (NGS) • www.cs.uml.edu/~kim/580/10_ngs.pdf

Shot-gun Method • Shotgun method • Break up DNA into small fragments, each of which is sequenced • Use computer to search for overlap • Build a master sequence • Good for short prokaryote genomes • For n fragments, # of possible overlaps is 2n(n-1) • Repeats in sequences are problems

Shot-gun Sequencing

Genetic Maps • For long genomes, use genetic markers • Use shot gun method and locate known markers in the master sequence • Known genes can be markers

Restriction Map • Restriction endonuclease • An enzyme that binds to a specific DNA sequence and makes a double-stranded cut at or near that sequence • Type II enzymes always cut at the same place (over 2,500 type II enzymes are known) • e.g., HindII cuts at GTYRAC (Y = C or T, R = A or G), for example GTTAAC or GTCGAC

Complete and Partial Digest • Probability of restriction site being cut • = 1: complete digest • Distance between successive cuts is known and accurate • <1 : partial digest • Distances across more than one restriction site are generated

Partial Digest Problem (PDP) • X = {x1 = 0, x2, ..., xn}: an ordered set of n points on a line • ΔX = {xj - xi | 1 ≤ i < j ≤ n}: the multiset of pairwise distances, with C(n, 2) = n(n-1)/2 elements • Partial Digest Problem (PDP) • Given a multiset L of C(n, 2) integers (the pairwise distances) • Find a set X of n integers such that ΔX = L • Also called the Turnpike problem: reconstructing a highway from the distances between pairs of exits • The solution X is not always unique • e.g., ΔA = Δ(A+v), where A+v = {a+v | a ∈ A} (one set is a shift of the other) • A = {0,2,4,7,10}, A+100 = {100, 102, 104, 107, 110} • e.g., ΔA = Δ(-A) • A = {0,2,4,7,10}, -A = {-10, -7, -4, -2, 0} • In general, U + V and U - V are homometric, i.e., Δ(U+V) = Δ(U-V)
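A small Python check of the definitions on this slide: `delta` (a name chosen for this sketch) builds the multiset of pairwise distances as a `Counter`, and the assertions confirm that shifting or mirroring A leaves the multiset unchanged.

```python
from collections import Counter
from itertools import combinations


def delta(points):
    """Multiset of pairwise distances between the points, as a Counter."""
    return Counter(abs(b - a) for a, b in combinations(points, 2))


A = [0, 2, 4, 7, 10]
shifted = [a + 100 for a in A]   # A + 100
mirrored = [-a for a in A]       # -A

assert delta(A) == delta(shifted) == delta(mirrored)
print(sorted(delta(A).elements()))  # [2, 2, 3, 3, 4, 5, 6, 7, 8, 10]
```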

PDP(1) • Brute force approach • Given L, • Compute ΔX for every possible choice of X • Until an X is found such that ΔX = L • Need to examine C(M-1, n-2) different sets of positions • => O(M^(n-2))

BruteForcePDP(L, n)
  M ← max(L)
  for every set of n-2 integers 0 < x2 < ... < xn-1 < M
    X ← {0, x2, ..., xn-1, M}
    form ΔX from X
    if ΔX = L
      return X
  return “No Solution”

PDP(2) • Brute force approach • Given L, • Identical to BruteForcePDP() except that every xi ∈ L • Need to examine C(|L|, n-2) different sets of positions • => O(n^(2n-4)), since |L| = n(n-1)/2

BruteForcePDP(L, n)
  M ← max(L)
  for every set of n-2 integers 0 < x2 < ... < xn-1 < M taken from L
    X ← {0, x2, ..., xn-1, M}
    form ΔX from X
    if ΔX = L
      return X
  return “No Solution”
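Both brute-force variants can be sketched with one Python function. This is an illustrative version, not the course's code: the flag `restrict_to_L` is an invented name that switches between trying all integers below M (PDP(1)) and trying only values that occur in L (PDP(2)), and n is derived from |L| instead of being passed in.

```python
from collections import Counter
from itertools import combinations


def delta(points):
    """Multiset of pairwise distances of a sorted point list."""
    return Counter(b - a for a, b in combinations(sorted(points), 2))


def brute_force_pdp(L, restrict_to_L=False):
    """Return the first X with delta(X) == L, or None if there is no solution."""
    target = Counter(L)
    # Recover n from |L| = n(n-1)/2.
    n = next((k for k in range(2, len(L) + 2) if k * (k - 1) // 2 == len(L)), None)
    if n is None:
        return None
    M = max(L)
    # PDP(1) tries every integer in (0, M); PDP(2) only values that occur in L.
    pool = sorted(set(L) - {M}) if restrict_to_L else range(1, M)
    for interior in combinations(pool, n - 2):
        X = [0, *interior, M]
        if delta(X) == target:
            return X
    return None


L = [2, 2, 3, 3, 4, 5, 6, 7, 8, 10]
print(brute_force_pdp(L))
print(brute_force_pdp(L, restrict_to_L=True))
```

Restricting the candidates to L is safe because every point xi is itself a distance (xi - 0) and therefore must appear in L.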

PDP(3) • Steven Skiena, 1990 • The largest element of L determines the two outermost points of X • e.g., L = {2,2,3,3,4,5,6,7,8,10} • Pick 10: X = {0,10} • L = {2,2,3,3,4,5,6,7,8} • Pick 8: X = {0,2,10} or X = {0,8,10}; take X = {0,2,10} • L = {2,3,3,4,5,6,7} • Pick 7: placing a point at 3 (= 10-7) would require the distance 3-2 = 1, which is not in L, so place it at 7 • X = {0,2,7,10} • L = {2,3,4,6} • . . .

PLACE(L, X)
  if L is empty
    output X
    return
  y ← max(L)
  if Δ(y, X) is a subset of L
    add y to X and remove Δ(y, X) from L
    PLACE(L, X)
    remove y from X and add Δ(y, X) to L
  if Δ(width-y, X) is a subset of L
    add width-y to X and remove Δ(width-y, X) from L
    PLACE(L, X)
    remove width-y from X and add Δ(width-y, X) to L
  return

PartialDigest(L)
  width ← max(L)
  DELETE(width, L)
  X ← {0, width}
  PLACE(L, X)

[Δ(y, X): the multiset of distances between a point y and all points in the set X]
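The PLACE / PartialDigest pseudocode translates fairly directly into Python. The sketch below is one possible rendering, not a reference implementation: it keeps L as a `Counter`, collects every solution in a set (the same X can be reached along two branches), and returns them sorted.

```python
from collections import Counter


def partial_digest(L):
    """Backtracking search: return every point set X (as a tuple) with delta(X) == L."""
    remaining = Counter(L)
    width = max(remaining)
    remaining[width] -= 1          # DELETE(width, L)
    X = [0, width]
    solutions = set()

    def place():
        if sum(remaining.values()) == 0:
            solutions.add(tuple(sorted(X)))
            return
        y = max(d for d, count in remaining.items() if count > 0)
        # The largest remaining distance is measured from 0 or from width.
        for candidate in {y, width - y}:
            dists = Counter(abs(candidate - x) for x in X)   # Δ(candidate, X)
            if all(remaining[d] >= c for d, c in dists.items()):
                remaining.subtract(dists)
                X.append(candidate)
                place()
                X.pop()                  # backtrack
                remaining.update(dists)

    place()
    return sorted(solutions)


for X in partial_digest([2, 2, 3, 3, 4, 5, 6, 7, 8, 10]):
    print(X)
```

On the slide's example this finds the set {0, 2, 4, 7, 10} as well as its mirror image.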

Shortest Superstring Problem • Find a superstring of the reads, and the shortest such one • Shortest Superstring Problem: given a set of strings, find a shortest string that contains all of them • Input: strings s1, s2, ..., sn • Output: a shortest string s that contains all of s1, s2, ..., sn as substrings • Example: {000, 001, 010, 011, 100, 101, 110, 111} has the shortest superstring 0001110100 (length 10)

Shortest Superstring Problem -2 • Define overlap(si, sj) • The length of the longest prefix of sj that matches a suffix of si • The Shortest Superstring problem becomes • a Traveling Salesman Problem with one vertex per string and an edge from si to sj weighted by -overlap(si, sj)
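As a sketch of how overlap(si, sj) is used, the Python below computes the overlap and then merges strings greedily. Note the swap: the greedy merge is a common heuristic, not the exact Traveling Salesman formulation on the slide, and it is not guaranteed to return a shortest superstring. The function names are chosen for this example.

```python
def overlap(a: str, b: str) -> int:
    """Length of the longest prefix of b that matches a suffix of a."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(strings):
    """Greedy heuristic: repeatedly merge the pair of strings with the largest overlap."""
    # Drop duplicates and strings already contained in another string.
    pool = sorted(s for s in set(strings)
                  if not any(s != t and s in t for t in strings))
    while len(pool) > 1:
        k, i, j = max((overlap(a, b), i, j)
                      for i, a in enumerate(pool)
                      for j, b in enumerate(pool) if i != j)
        merged = pool[i] + pool[j][k:]
        pool = [s for idx, s in enumerate(pool) if idx not in (i, j)] + [merged]
    return pool[0] if pool else ""


three_bit = ["000", "001", "010", "011", "100", "101", "110", "111"]
s = greedy_superstring(three_bit)
print(s, len(s))
```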

DNA Arrays • Sequencing by Hybridization (SBH) • Millions of short DNA fragments, called probes, are fixed on a chip • The input DNA hybridizes to the probes on the array (chip) through complementary base pairing

Base coverage • A sample (genome) is amplified • Each base in the sample is copied into many reads • But reads are generated at random positions • So the number of reads covering a base follows a Poisson distribution • Similarly for k-mers • Still a Poisson distribution, but with a different mean

Coverage Depth and Extent • Coverage Depth • The average number of times each base or k-mer is sequenced • Coverage Extent • The fraction of the genome covered by at least one base or k-mer • Given a genome of size G, read length L, read number N • Total number of bases (nb) and k-mers (nk) • nb = N*L; nk = N*(L-k+1) • nb/nk = L/(L-k+1)

Coverage Depth and Extent • Coverage Depth of bases (db) and k-mers (dk) • db = nb/G; dk = nk/G • db / dk = L/(L-k+1) • For de novo sequencing, these relationships can be used to estimate the unknown genome size (G) and the coverage depth for bases (db) from the read data before assembly: • G = nk / dk and db = dk * L/(L-k+1)
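The relationships on the last two slides can be checked numerically. The Python below is a minimal sketch; the read count, read length, k, and k-mer depth are made-up values for illustration, and in practice dk would be estimated from the read data (for example from the peak of a k-mer count histogram).

```python
def estimate_genome_size(N: int, L: int, k: int, d_k: float) -> float:
    """G = nk / dk, with nk = N * (L - k + 1)."""
    return N * (L - k + 1) / d_k


def base_depth_from_kmer_depth(d_k: float, L: int, k: int) -> float:
    """db = dk * L / (L - k + 1)."""
    return d_k * L / (L - k + 1)


# Hypothetical data set: 1,000,000 reads of 100 nt, k = 21, observed k-mer depth 16.
N, L, k, d_k = 1_000_000, 100, 21, 16.0
G = estimate_genome_size(N, L, k, d_k)
d_b = base_depth_from_kmer_depth(d_k, L, k)
print(G)    # 5000000.0
print(d_b)  # 20.0
```

The two estimates are consistent with each other: nb / G = (1,000,000 * 100) / 5,000,000 = 20 = db.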

Coverage Depth or Sequencing Depth • Coverage Depth (db) is called sequencing depth (c) • From Poisson, prob. of non-coverage is • P(X=0) = exp(-c) • Coverage extent is P(X>0) = 1- exp(-c) • To cover >99% of a genome, c>4.6 • To ensure the whole genome is covered, # of uncovered bases G*exp(-c)<1 • Human genome (3 Gb): c>22
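The two thresholds quoted on this slide follow directly from the Poisson model. A short Python sketch (function names are chosen for this example):

```python
import math


def coverage_extent(c: float) -> float:
    """P(X > 0) = 1 - exp(-c): fraction of the genome covered at depth c."""
    return 1 - math.exp(-c)


def depth_for_extent(fraction: float) -> float:
    """Smallest depth c with 1 - exp(-c) >= fraction."""
    return -math.log(1 - fraction)


def depth_for_full_coverage(G: float) -> float:
    """Depth c at which the expected number of uncovered bases, G * exp(-c), drops to 1."""
    return math.log(G)


print(round(depth_for_extent(0.99), 2))        # 4.61
print(round(depth_for_full_coverage(3e9), 2))  # 21.82
```

So c > 4.6 covers more than 99% of the genome, and for a 3 Gb genome the expected number of uncovered bases falls below one once c exceeds about 21.8, which the slide rounds up to 22.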

SBH • Given an unknown DNA sequence, a DNA array provides • All strings of length l that the sequence contains • No information about their positions • Spectrum(s, l) • For a string s of length n, the multiset of the n-l+1 l-mers in s (its l-mer composition) • l=3, s=TATGGTGC • Spectrum(s, l) = {TAT, ATG, TGG, GGT, GTG, TGC}
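Spectrum(s, l) is a one-liner in Python. The sketch below returns the l-mers in left-to-right order; sorting them afterwards mimics the fact that the array reports them without positions.

```python
def spectrum(s: str, l: int) -> list[str]:
    """All n-l+1 l-mers of s, in order of position."""
    return [s[i:i + l] for i in range(len(s) - l + 1)]


lmers = spectrum("TATGGTGC", 3)
print(lmers)          # ['TAT', 'ATG', 'TGG', 'GGT', 'GTG', 'TGC']
print(sorted(lmers))  # what the array reports: the same l-mers, order lost
```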

SBH as a Hamiltonian Path Problem • Two l-mers p and q overlap if overlap(p, q) = l-1 • Hamiltonian Path Problem • Given Spectrum(s, l), create a vertex for every l-mer in Spectrum(s, l) • Connect p to q with a directed edge if overlap(p, q) = l-1 • Find a path that visits every vertex exactly once • This is the approach taken by Overlap-Layout-Consensus (OLC) • NP-complete

OLC • Conventional shotgun sequencing • Overlap-layout-consensus • Overlap: use a computer to search for overlaps among all possible pairs of fragments • Layout: putting fragments together • Consensus: error correction • Good for short prokaryote genomes • For n fragments, # of possible overlaps is 2n(n-1) • Difficult • No solution for the “repeat problem” of finding the correct path in the layout step • Produces sequencing errors • Programs • PHRAP, CAP, TIGR, CELERA

SBH as an Eulerian Path Problem • A graph with all (l-1)-mers (later) • edges corresponding to l-mers from Spectrum (s, l) • Find a path visiting every edge exactly once

Eulerian Path Problem • Repeatedly find Eulerian cycles in the graph • Linear time
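A minimal Python sketch of the Eulerian formulation: each l-mer becomes an edge from its (l-1)-prefix to its (l-1)-suffix, and an iterative form of Hierholzer's algorithm walks every edge once. It assumes an error-free spectrum in which each l-mer is listed as many times as it occurs; the function name is chosen for this example.

```python
from collections import defaultdict


def reconstruct_from_spectrum(lmers: list[str]) -> str:
    """Spell a string whose l-mer composition is exactly `lmers`."""
    graph = defaultdict(list)      # (l-1)-mer -> list of successor (l-1)-mers
    balance = defaultdict(int)     # out-degree minus in-degree
    for lmer in lmers:
        prefix, suffix = lmer[:-1], lmer[1:]
        graph[prefix].append(suffix)
        balance[prefix] += 1
        balance[suffix] -= 1
    # An Eulerian path starts at the vertex with one more outgoing edge, if any.
    starts = [v for v, b in balance.items() if b == 1]
    start = starts[0] if starts else lmers[0][:-1]
    stack, path = [start], []
    while stack:
        v = stack[-1]
        if graph[v]:
            stack.append(graph[v].pop())   # follow an unused edge
        else:
            path.append(stack.pop())       # dead end: emit vertex
    path.reverse()
    if len(path) != len(lmers) + 1:
        raise ValueError("no Eulerian path uses every l-mer")
    return path[0] + "".join(v[-1] for v in path[1:])


print(reconstruct_from_spectrum(["TAT", "ATG", "TGG", "GGT", "GTG", "TGC"]))  # TATGGTGC
```

Every edge is pushed and popped once, which is where the linear running time on the slide comes from.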

De Bruijn Graph • Partition read fragments into fixed-size k-mers • k = 27, for example • Each (k-1)-mer becomes a graph node
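The construction can be sketched in Python as follows. This simplified version ignores reverse complements and read errors, and it stores each edge with a count, so k-mers seen in several reads become one edge with a multiplicity. The function name and the toy reads are made up for illustration.

```python
from collections import Counter


def de_bruijn_edges(reads: list[str], k: int) -> Counter:
    """Map each edge ((k-1)-mer prefix, (k-1)-mer suffix) to the number of times its k-mer occurs."""
    edges = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            edges[(kmer[:-1], kmer[1:])] += 1
    return edges


edges = de_bruijn_edges(["ACGTAC", "CGTACG"], k=3)
for (u, v), multiplicity in sorted(edges.items()):
    print(u, "->", v, "x", multiplicity)
```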

OLC vs. De Bruijn Graph

De Bruijn Graph • Eulerian Graph • De Bruijn Graph • Glue parallel links with multiplicity (e.g., multiplicity of 3) • Tangle: # of input edges is not equal to # of output edges

De Bruijn Graph • How to construct a de Bruijn graph from a collection of sequencing reads? • Gluing requires knowledge of the finished sequence • Cannot construct the de Bruijn graph from a collection of sequencing reads until sequencing is completed • Let s be a sequencing read with errors • If the genome sequence G is known, errors in s can be corrected by aligning s against G • But G is not known until the last “consensus” step • EULER uses SA to minimize errors in the first step

Programs • ABySS (Assembly By Short Sequencing) • Simpson, 2009 • www.cs.uml.edu/~kim/580/08_abyss.pdf • Velvet • Zerbino and Birney, 2008 • www.cs.uml.edu/~kim/580/08_velvet.pdf • Euler • Pevzner, 2001 • www.cs.uml.edu/~kim/580/01_pevzner.pdf • www.cs.uml.edu/~kim/580/09_chaisson.pdf • SOAPdenovo (Short Oligonucleotide Alignment Program) • Beijing Genomics Institute • www.cs.uml.edu/~kim/580/09_soap.pdf

ABySS • Proceeds in two stages • Stage 1 • All possible k-mers are generated from reads • Remove read errors and construct initial contigs • Stage 2 • Use mate-pairs to extend contigs • Distributed implementation of de Bruijn graph in a cluster using Message Passing Interface (MPI) over multiple computers

ABySS – Stage 1 • Three steps • Load read data into distributed de Bruijn graph • Resolve read errors • Merge graph nodes • Load read data into distributed de Bruijn graph • Reads with unknown bases are discarded • Each read is broken into (read_length-k+1) overlapping k-mers • A k-mer is assigned to one cluster node • Compute adjacency of k-mers • For each k-mer, a message is sent to its eight possible neighbors • If a neighbor exists, there must be a k-1 bp overlap
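The "eight possible neighbors" of a k-mer are its four possible predecessors and four possible successors, each sharing a k-1 bp overlap with it. The Python sketch below illustrates this on a single machine. It is not ABySS code: reverse complements are ignored, and `owner` is a hypothetical stand-in for whatever rule assigns a k-mer to a cluster node.

```python
import zlib

BASES = "ACGT"


def neighbours(kmer: str) -> list[str]:
    """The eight k-mers that overlap `kmer` by k-1 bases (four before, four after)."""
    predecessors = [b + kmer[:-1] for b in BASES]
    successors = [kmer[1:] + b for b in BASES]
    return predecessors + successors


def owner(kmer: str, num_nodes: int) -> int:
    """Hypothetical assignment of a k-mer to a cluster node (assumption: hash modulo node count)."""
    return zlib.crc32(kmer.encode()) % num_nodes


def adjacency(kmers) -> dict[str, list[str]]:
    """For each k-mer, keep only the neighbours that actually exist in the data."""
    present = set(kmers)
    return {kmer: [n for n in neighbours(kmer) if n in present] for kmer in present}


read, k = "ACGTAC", 3
kmers = [read[i:i + k] for i in range(len(read) - k + 1)]   # read_length - k + 1 k-mers
print(adjacency(kmers)["CGT"])   # ['ACG', 'GTA']
```

In the distributed setting each membership test `n in present` would become a message to the node returned by `owner(n, num_nodes)`.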

ABySS – Stage 1 (cont’d) • Resolve read errors • Remove dead-ends • When correct k-mers of a read connect to incorrect k-mers, • They are likely to be unique and most will not have an extension • One end of the branch will terminate with no extension • Dead-end branches are traced backward to the ambiguous point and are removed if their lengths are shorter than a threshold • Remove bubbles • A branch diverges and rejoins later • Caused by single base differences

ABySS – Stages 1 and 2 • Vertex merging • Merge vertices linked by unambiguous edges • Contig merging • Use paired-end info

ABySS Results • Genome of African male from NCBI Short Read Archive: Accession # SRA000271 • 3.5 Billion mate-paired reads, x42 • Read length: 36-42 bp, median fragment 210 bp • At k=27, 15h run time without paired-end info

ABySS Comparisons

Velvet • Construct a graph • Transform reads into roadmaps • From a read, generate k-mers with read ID and position in the read (called roadmaps) • Each read is transformed to a set of k-mers with overlaps and hash links to previous reads with the same k-mers • 2nd database • For each read, which k-mers are overlapped by subsequent reads • Trace reads through the graph using roadmaps

Velvet • Graph simplification • A node with one outlink can be combined with a next node with one input link • Error removal • Focus on topological features • Tips (dead-ends) shorter than 2k • bubbles due to internal read errors (Tour Bus algorithm) • Erroneous connections due to distant merging tips • Breadcrumb – use read pairs to extend contigs

EULER, 2001 • EULER • Implement Eulerian Path problem • Issues with real data • Reads may have errors • Error correction is typically done in ‘consensus’ stage • EULER corrects errors in the first step • SA (Spectral Alignment) • Repeat problem • De Bruijn graph

Spectral Alignment (SA) • Although the genome sequence G is not known, the set Gl of all l-mers present in G can be approximated accurately • An l-mer is called solid if it belongs to more than M reads • EULER approach: approximate Gl by the set of all solid l-mers • SBH problem without read errors • Construct a graph with edges corresponding to l-mers from Spectrum(s, l) • Find a path visiting every edge exactly once • SA • Given a string s and Gl, find the minimum number of mutations that transform s into a string s* whose l-mers all belong to Gl, i.e., Spectrum(s*, l) ⊆ Gl • Can be solved efficiently by dynamic programming
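The definition of a solid l-mer can be sketched in Python as below. This is an illustration under simplifying assumptions, not EULER's implementation: reverse complements are ignored, each read is counted at most once per l-mer, and the reads and threshold are toy values.

```python
from collections import Counter


def solid_lmers(reads: list[str], l: int, M: int) -> set[str]:
    """l-mers that occur in more than M reads."""
    counts = Counter()
    for read in reads:
        # Use a set so that a read contributes at most once to each l-mer.
        for lmer in {read[i:i + l] for i in range(len(read) - l + 1)}:
            counts[lmer] += 1
    return {lmer for lmer, c in counts.items() if c > M}


def is_error_free(read: str, l: int, solid: set[str]) -> bool:
    """A read needs no correction if all of its l-mers are solid."""
    return all(read[i:i + l] in solid for i in range(len(read) - l + 1))


reads = ["ACGTA", "ACGTA", "ACGTT", "CGTAC"]
solid = solid_lmers(reads, l=3, M=1)
print(sorted(solid))                      # ['ACG', 'CGT', 'GTA']
print(is_error_free("ACGTT", 3, solid))   # False: GTT is not solid
```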

Spectral Alignment (SA) • Formulation • Given a set of reads R = {r1, ..., rn}, an integer l, and an upper bound Δ on the number of errors in each read • Spectrum Sl is the set of all l-mers from the reads r1, ..., rn and their reverse complements r1’, ..., rn’ • Introduce up to Δ corrections in each read in R such that |Sl| is minimized • Result • One correction in a read can remove up to l erroneous l-mers from the read and l from its reverse complement • Eliminates 86.5% of read errors • But it can also create errors • One change in a read may affect all reads in the region • Introduced errors are acceptable as long as the changes in overlapping reads covering the same position are consistent, so that they correspond to a single mutation in the genome • Corrected 234,410 errors and introduced 1,452 errors in the NM data set

EULER - Results • No incorrect contigs

Summary of EULER • Eulerian Path approach – de Bruijn graph • Do error correction early – SA • Fill gaps ASAP

EULER+, 2004 • A-Bruijn graph • To handle errors in reads, introduce vertices based on ungapped alignments that allow mismatches, rather than the exact l-mers used in de Bruijn assembly • Graph simplification algorithms to remove errors in edges • The size of the de Bruijn graph grows with the coverage, so short reads, which need higher coverage than long reads, require a large amount of memory
