String Matching

String Matching String matching: definition of the problem (text,pattern) depends on what we have: text or patterns • Exact matching: • The patterns ---> Data structures for the patterns • 1 pattern ---> The algorithm depends on |p| and || • k patterns ---> The algorithm depends on k, |p| and || • Extensions • Regular Expressions • The text ----> Data structure for the text (suffix tree, ...) • Approximate matching: • Dynamic programming • Sequence alignment (pairwise and multiple) • Sequence assembly: hash algorithm • Probabilistic search: Hidden Markov Models

2. Comparison of short sequences (<10.000 bps) Summary (more or less) • 2.1 Dot matrix • 2.2 Pairwise alignment. • 2.3 Hash algorithms. • 2.4 Multiple alignment.

2. Dot matrix S2 y S1 x Given two sequences, how we can analyse their degree of identity? By searching those parts that match: 1/0 1 if both characters coincide

2. Dot matrix S2 S2 y y . . . . . S1 S1 x x . . 1/0 1 if both characters coincide ? Given two sequences, how we can analyse their degree of identity? By searching those parts that match:

2.1 Dot matrix accaccacaccacaacgagcata… acctgagcgatat a c c . . t • m(i,j)=1 iff S1(i..i+L)=S2(j..j+L): exact matching • m(i,j)=1 iff k over L coincide: approximate matching. • m(i,j)=k iff k over L coincide: approximate matching L=window length What is the cost of the algorithm? When are the matchings relevant?

2.1. Dot matrix: algorithm cost accaccacaccacaacgagcata… acctgagcgatat a c c . . t • long(S1)*long(S2)* L in other words O(n2 L) • can long(S1)*long(S2)be possible?

2.1. Dot matrix: signals C: Random B: S1=S2 A: transposons When are signals statistically significant?

2.1. Dot matrix: statistical significance: Given L=window length S2 y . . . . . S1 x . . We need to define a random model against which to compare the signals: we define RV: X number of characters that coincide, then Prob(X=k)=comb(L,k) pk (1-p)L-k E(X) = Lp What is its expected value?

2.1. Dot matrix: signals C: Random B: S1=S2

2.2 Pairwise alignment Given two DNA sequences A (a1a2...an) and B (b1b2...bm) from the alphabet {a,c,t,g} we say that A* and B* from {a,c,t,g,-} are aligned iff • A* and B* become A and B if gaps ( – ) are removed. • |A*|=|B*| • For all i, it is not possible that ai = bi = - MALIG (an example) or CLUSTAL How many alignments of two sequences exist? Which is the best alignment?

2.2 Number of alignments b1 b2 b3 a1 a2 a3 Given two DNA sequences A (a1a2...an) and B (b1b2...bm) there are: #(a1a2...an ,b1b2...bm) = #(a1a2...an-1 ,b1b2...bm) those that end with (an,-) + #(a1a2...an ,b1b2...bm-1) those that end with (-,bm) + #(a1a2...an-1 ,b1b2...bm-1) those that end with (an,bm) #(a1,b1)

2.2 Number of alignments b1 b2 b3 a1 a2 a3 Given two DNA sequences A (a1a2...an) and B (b1b2...bm) then: #(a1a2...an ,b1b2...bm) = #(a1a2...an-1 ,b1b2...bm) those that end with ( an , -) + #(a1a2...an ,b1b2...bm-1) those that end with ( - , bm) + #(a1a2...an-1 ,b1b2...bm-1) those that end with ( an , bm) 1 1 1 1 1 1 1 3 5 7 • 25 • 25 63 5 7 But, what is the assymptotic value? #(a1a2...an ,b1b2...bn) > 22n

2.2 Best alignment • Match: favorable • Mismatch: unfavorable • Gap: worst case How can an alignment be scored? catcactactgacgactatcgtagcgcggctatacatctacgccaa- ctac-t-gtgtagatcgccgg c- tgactgc--acgactatcgt- attgcggctacacactacgcacaactactgtatgtcgc-cgg---- * * *** * * ** * ******* * * **** **** ******* * **** ** * *** Then we assign a score for each case, for example 1,-1,-2. How can the best alignment be found?

2.2 Edit distance and alignment of strings The best alignment of two strings … …is related with the edit distance, first discussed in 1966... The most efficient algorithm was proposed in 1968 and in 1970 using the technique called “Dynamic programming”

2.2 Best alignment - C C C C - BA(AC,CTA) BA(A,CTA) BA(A,CTAC) C T A C T A C T A C G T A C T G A C T A C T A C T A C G T 0 -2 -4-6 -8 … A-2 C-4 T -6 G A s(AC,CTA)-2 s(A,CTA)+1 BA(AC,CTAC)= best s(AC,CTAC)=max s(A,CTAC)-2

Best alignment accaccacaccacaacgagcata… acctgagcgatat a c c . . t Given the maximum score, how can the best alignment be found? • Quadratic cost in space and time • Up to 10,000 bps sequences in length Download alggen tool

2.2 Some slides revisited We have developed the theory according to the following principles: 1) Both sequences have a similar length (global). 2) The model of gaps is linear If there are k consecutive gaps the penalty scores k(-2).

2.2 Semiglobal pairwise alignment Assume that we have sequences with different length S1 S2 It is meaningless to introduce gaps until both sequences have similar length …. The most probable alignment should be Initial gaps Final gaps How can these alignments be found?

2.2 Semiglobal pairwise alignment Note that Initial gaps Final gaps C T A C T A C T A C G T A C T

2.2 Semiglobal pairwise alignment Given a cell C T A C T A C T A C G T A C T 0 0 0 0 0 0 0 0 0 0 0 0 0 The cell contains the score of the best alignment of CTA with the empty sequence.

2.2 Semiglobal pairwise alignment C T A C T A C T A C G T 0 0 0 0 0 0 0… A 1 C 2 T 3 C T A C T A C T A C G T 0 0 0 0 0 0 0… A C T The contribution of the initial gaps is disregarded, then but, what happens with the final gaps?

2.2 Semiglobal pairwise alignment … by checking the last row for the best score. C T A C T A C T A C G T 0 0 0 0 0 0 0… A 1 C 2 T 3 How does the algorithm search for the best alignment? Practice with the alggen tool.

2.2 Affine-gap model score Given the following alignments that have the same score … a g t a c c c c g t a g a g t - c c - - g t a - a g t a c c c c g t a g a g t - c - c - g t a - a g t a c c c c g t a g a g t - c - - c g t a - a g t a c c c c g t a g a g t - - c c - g t a - a g t a c c c c g t a g a g t - - - c c g t a - a g t a c c c c g t a g a g t - - c - c g t a - Which is the most reliable case from a biological point of view?

2.2 Affine-gap model score By scoring the opening gapsgreaterthan the extension gaps, for instance, -10 and -0.5. Then, how can we distinguish between consecutive gaps and separated gaps? a g t a c c c c g t a g a g t - - - c c g t a - a g t a c c c c g t a g a g t - - c - c g t a - Then, the penalty of k consecutive gaps becomes OG + (k-1) EG which is an affine-gap function. How is the best alignment found?.

2.2 Affine-gap model score C T A C T A C T A C G T A C T G A Smallest arrows: refer to the introduction of an opening gap. Largest arrows: refer to the introduction of an extension gap. But from which cell do the largest arrows originate?

2.2 Affine-gap model score In both cases we know which cell contributes with the minimum penalty score. C T A C T A C T A C G T A C T G A Acces to clustalW: http://www.ebi.ac.uk/clustalw

2.2 Local alignment Given two sequences, we can consider the alignments of all their substrings… …how can the best of them be found? Two questions arise: - how can the alignments be compared? - how can the best one be selected?

2.2 Local alignment accaccacaccacaacgagcata… acctgagcgatat a c c . . t Given a path Imagine the graph of the scores: can the best subalignments be detected? … It suffices to compare the value of each cell with zero!

String Matching

String Matching

Presentation Transcript

String Matching

Approximate String Matching

String Matching

String Matching

String Matching

String Matching

String Matching

String Matching

String Matching

String Matching

String Matching II

String Matching

String Matching Algorithms

String Matching

String matching

Approximate String Matching

String Matching Algorithms

String Matching

String Matching

String Matching