Function Matching: Algorithm, Application, and a Lower Bound

Bioinformatics Algorithms report
• 張譽馨 (B90902003, Computer Science, year 3)
• 吳俐瑩 (B91902051, Computer Science, year 2)

Function Matching
• Function matching in computational biology.
• The Grand Challenge protein folding problem is one of the most important problems in computational biology. The goal is to determine a protein's tertiary structure from the linear arrangement of its peptide sequence.

Example

Introduction
• Abstract: Function matching captures several different applications. The input is a text T of length n over alphabet Σ_T and a pattern P = P[1]P[2]…P[m] of length m over alphabet Σ_P. We seek all text locations i for which there is some function f: Σ_P → Σ_T such that the m-length substring of T starting at i equals f(P[1])f(P[2])…f(P[m]).

About This Paper
• By Amihood Amir, Yonatan Aumann, Richard Cole, Moshe Lewenstein, and Ely Porat. In Proceedings of ICALP, pages 929-942, 2003.
• Three main contributions of the paper:
• A new type of generalized matching: function matching.
• A formalization of a new general convolution model.
• Efficient randomized and deterministic algorithms for two-dimensional parameterized and function matching.

Algorithms
• Definition: Let U and V be equal-length strings. Symbol τ in U is said to cover symbol σ in V if every occurrence of σ in V is aligned with an occurrence of τ in U (i.e., they occur at the same index). U is said to cover σ in V if some symbol τ in U covers σ. The cover is an exact cover if, in addition, every occurrence of τ in U is aligned with an occurrence of σ in V.
• Example:
U: a b a a c c a b a
V: n k n e d d n s s
→ U covers n in V (a covers n, but not exactly, because a is also aligned with e and s). U exactly covers d in V (c and d occupy the same positions).

Algorithms
• Definition: There is a function match of V with U if every symbol occurring in V is covered by U (this relation need not be symmetric). If each of the covers is an exact cover, the match is a parameterized match (and this relation is symmetric).
• Example of a function match (e and n are both covered by a):
U: a a b b a a a c
V: e e h h e n n f
• Example of a parameterized match:
U: a a b b a a a c
V: e e h h e e e f
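The cover, function match, and parameterized match definitions can be checked directly on the example strings. The following is a minimal sketch, not code from the paper; the function names are made up for illustration, and it simply tests each definition by brute force.

```python
def covers(u, v, tau, sigma):
    """True if symbol tau in u covers symbol sigma in v:
    every occurrence of sigma in v is aligned with tau in u."""
    return all(a == tau for a, b in zip(u, v) if b == sigma)


def is_exact_cover(u, v, tau, sigma):
    """True if tau covers sigma and, in addition, every occurrence
    of tau in u is aligned with sigma in v."""
    return covers(u, v, tau, sigma) and all(
        b == sigma for a, b in zip(u, v) if a == tau
    )


def function_match(u, v):
    """True if every symbol of v is covered by some symbol of u."""
    return all(
        any(covers(u, v, tau, sigma) for tau in set(u)) for sigma in set(v)
    )


def parameterized_match(u, v):
    """True if every symbol of v is exactly covered by some symbol of u."""
    return all(
        any(is_exact_cover(u, v, tau, sigma) for tau in set(u))
        for sigma in set(v)
    )
```

With the slide's strings, `function_match("aabbaaac", "eehhennf")` holds while `parameterized_match` on the same pair does not, because a is aligned with both e and n.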

Algorithms
• Definition: Given a text T (of length n) and a pattern P (of length m), the function matching problem is to find the alignments (positions) of P at which P function matches the aligned portion of T. Note that every match may use a different function to associate the symbols of P with those in the aligned portion of T.
• As is standard, we can limit T to length at most 2m by breaking T into overlapping pieces of length 2m.
• Naive algorithm: check each possible alignment of the pattern in turn. Each check takes O(m) time, so function matching takes O(nm) time in total.

Time Complexity: Naive algorithm
P = a b c a a c
T = v z x s z x x z v s x g s s g s a z
The pattern is slid along the text one alignment at a time, and each alignment is checked symbol by symbol.
S = 3, 10
Time complexity = O(nm)
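The naive O(nm) scan can be sketched as follows. This is an illustrative implementation, not the authors' code; the function name is an assumption, and positions are reported 1-indexed to agree with the slides.

```python
def function_matches_naive(text, pattern):
    """Return all 1-indexed alignments at which pattern function matches text.

    Each alignment is checked in O(m) time by building the mapping
    f: pattern symbol -> text symbol and rejecting on the first conflict.
    """
    n, m = len(text), len(pattern)
    hits = []
    for start in range(n - m + 1):
        mapping = {}
        ok = True
        for offset, p in enumerate(pattern):
            t = text[start + offset]
            # setdefault records f(p) the first time p is seen,
            # and returns the recorded value afterwards.
            if mapping.setdefault(p, t) != t:
                ok = False
                break
        if ok:
            hits.append(start + 1)
    return hits
```

For the pattern `abcaac` and the 18-symbol text `vzxszxxzvsxgssgsaz`, this returns the alignments 3 and 10.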

Algorithms
• An O(n|Σ_P||Σ_T| log m) time algorithm.
• Definition: The σ-indicator of a string U, written χ_σ(U), is a binary string of length |U| in which each occurrence of σ is replaced by 1 and every other symbol is replaced by 0.
• For a pattern symbol σ and a text symbol τ, the procedure uses the strings χ_σ(P) and χ_τ(T). For each alignment of χ_σ(P) with χ_τ(T), it computes the dot product of χ_σ(P) and the aligned portion of χ_τ(T). τ covers σ in that alignment exactly if the dot product equals the number of occurrences of σ in P.

Algorithms
• An O(n|Σ_P||Σ_T| log m) time algorithm.
• Example:
U: a a b b a a a c → χ_a(U) = 1 1 0 0 1 1 1 0
V: e e h h e n n f → χ_e(V) = 1 1 0 0 1 0 0 0
Dot product: 1+1+0+0+1+0+0+0 = 3, which equals the number of occurrences of e in V, so e in V is covered by a in U.
• The dot products for all alignments of χ_σ(P) with χ_τ(T) are computed in O(n log m) time by means of a convolution. Checking one σ in P against every τ in T takes O(n|Σ_T| log m) time; doing so for every σ in P takes O(n|Σ_P||Σ_T| log m) time.
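The indicator test for a single pair (σ, τ) can be sketched as below. This is a minimal illustration with made-up function names; the dot products are computed directly for clarity, whereas the slides obtain all of them at once with a convolution in O(n log m) time.

```python
def indicator(s, symbol):
    """The symbol-indicator of s: 1 where s has `symbol`, 0 elsewhere."""
    return [1 if c == symbol else 0 for c in s]


def covered_alignments(text, pattern, sigma, tau):
    """1-indexed alignments at which tau (in text) covers sigma (in pattern).

    tau covers sigma exactly when the dot product of the two indicators
    equals the number of occurrences of sigma in the pattern.
    """
    chi_p = indicator(pattern, sigma)
    chi_t = indicator(text, tau)
    occurrences = sum(chi_p)
    m = len(pattern)
    hits = []
    for start in range(len(text) - m + 1):
        window = chi_t[start:start + m]
        dot = sum(p * t for p, t in zip(chi_p, window))
        if dot == occurrences:
            hits.append(start + 1)
    return hits
```

Running this for every pair (σ, τ) and keeping the alignments at which each σ has some covering τ gives the function matches.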

Algorithms
• An O(n|Σ_P| log m) time algorithm.
• Lemma: Let a_1, ..., a_k be natural numbers. Then k·Σ_{h=1}^{k} (a_h)^2 = (Σ_{h=1}^{k} a_h)^2 iff a_i = a_j for all 1 ≤ i < j ≤ k. That is, k(a_1^2 + a_2^2 + … + a_k^2) = (a_1 + a_2 + … + a_k)^2 exactly when all the a_h are equal.
• The algorithm encodes the text symbols as numbers and uses the strings T and T^2, where T^2 is defined by T^2[i] = (T[i])^2, i = 0, ..., n-1. By the lemma, T covers σ in a given alignment exactly if (χ_σ(P)·T)^2 = k·(χ_σ(P)·T^2), where both dot products are taken with the aligned portion of the text and k is the number of occurrences of σ in P.
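One way to see why the lemma holds (a standard identity, given here as a supporting derivation rather than as the paper's own proof) is to expand the difference of the two sides as a sum of squared pairwise differences:

```latex
k \sum_{h=1}^{k} a_h^2 - \Bigl(\sum_{h=1}^{k} a_h\Bigr)^2
  = (k-1) \sum_{h=1}^{k} a_h^2 - 2 \sum_{1 \le i < j \le k} a_i a_j
  = \sum_{1 \le i < j \le k} (a_i - a_j)^2 .
```

The right-hand side is zero exactly when a_i = a_j for every pair i < j, which is the statement of the lemma.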

Algorithms
• An O(n|Σ_P| log m) time algorithm.
• Example:
U: a a b c a → U' = χ_a(U) = 1 1 0 0 1
V: c c d d c → T = 2 2 1 1 2 (c = 2, d = 1), T^2 = 4 4 1 1 4
U'·T = 2+2+2 = 6
U'·T^2 = 4+4+4 = 12
6 · 6 = 36 = k · 12 with k = 3
By the lemma, a in U is covered by V.
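The squares test can be sketched as follows. This is an illustrative version with a hypothetical function name; it assumes the text has already been encoded as positive integers (c = 2, d = 1 in the example), and it computes the two dot products directly instead of by convolution.

```python
def covered_by_text(text_codes, pattern, sigma):
    """1-indexed alignments at which the text covers pattern symbol sigma.

    text_codes is the text with each symbol replaced by a positive integer.
    By the lemma, all text values facing sigma are equal exactly when
    (chi . T)^2 == k * (chi . T^2), with k the number of sigma's in pattern.
    """
    chi = [1 if c == sigma else 0 for c in pattern]
    k = sum(chi)
    m = len(pattern)
    squares = [t * t for t in text_codes]
    hits = []
    for start in range(len(text_codes) - m + 1):
        s1 = sum(p * t for p, t in zip(chi, text_codes[start:start + m]))
        s2 = sum(p * t for p, t in zip(chi, squares[start:start + m]))
        if s1 * s1 == k * s2:
            hits.append(start + 1)
    return hits
```

One pass of this test per pattern symbol replaces the |Σ_T| separate indicator tests of the previous algorithm.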

Algorithms
• An O(n|Σ_P| log m) time algorithm.
• Time: one symbol σ of the pattern is compared against all symbols in the aligned portion of the text at the same time, so the factor |Σ_T| is no longer needed; we only check every symbol of P against the text. As before, the dot products for all alignments of χ_σ(P) with T and with T^2 are computed in O(n log m) time by convolution, so the total time is O(n|Σ_P| log m).

Algorithms
• An O(kn log m) time algorithm that may report a non-match as a match with probability at most 1/n^k.
• Construction: create a new text T' of length 2n and a new pattern P' of length 2m by doubling every symbol; in P', the j-th occurrence of a symbol σ becomes σ_j σ_{j+1}. There is a match of P with T starting at location i in T exactly if there is a match of P' starting at location 2i-1 in T'.
• Example:
T = abca → T' = aabbccaa
P = aabba → P' = a1a2 a2a3 b1b2 b2b3 a3a4

Algorithms
• An O(kn log m) time algorithm.
• Each distinct symbol is assigned an integer chosen uniformly at random from the range [1, 2n^(k+1)]:
T' = aabbccaa → T'' = 1 1 2 2 3 3 1 1
P' = a1a2 a2a3 b1b2 b2b3 a3a4 → P'' = 0 1 -1 2 0 3 -3 0 -2 0
(In P'', the first occurrence of a symbol σ is replaced by u_σ and its second occurrence by -u_σ; a symbol that occurs only once is replaced by 0.)

Algorithms
• An O(kn log m) time algorithm.
• For each possible alignment of P'' with T'', the dot product of P'' with the aligned portion of T'' is computed. If there is a function match of P with T, the two occurrences of each symbol of P' face equal text values, so the terms u_σ and -u_σ cancel and the corresponding dot product evaluates to 0.

Algorithms
• An O(kn log m) time algorithm.
• Example 1 (U is the text, V is the pattern):
U = aabba → U' = aaaabbbbaa → U'' = 7 7 7 7 9 9 9 9 7 7
V = aabba → V' = a1a2 a2a3 b1b2 b2b3 a3a4 → V'' = 0 1 -1 2 0 3 -3 0 -2 0
U''·V'' = 7 + (-7) + 14 + 27 + (-27) + (-14) = 0 (match)
• Example 2:
U = aabca → U' = aaaabbccaa → U'' = 7 7 7 7 9 9 8 8 7 7
V = aabaa → V' = a1a2 a2a3 b1b2 a3a4 a4a5 → V'' = 0 1 -1 2 0 0 -2 3 -3 0
U''·V'' = 7 + (-7) + 14 + (-16) + 24 + (-21) = 1 (mismatch)
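The doubling and encoding steps can be sketched as follows. This is a hedged illustration rather than the authors' implementation: the function names, the `seed` parameter, and the default of numbering the pattern coefficients 1, 2, 3, … (as the slide examples do) are assumptions, and the dot products are computed directly where the algorithm would use a convolution.

```python
import random
from collections import Counter
from itertools import count


def double_text(text, value_of):
    """T'': every text symbol doubled and replaced by its integer."""
    return [value_of[c] for c in text for _ in range(2)]


def encode_pattern(pattern, coeffs=None):
    """P'': the j-th occurrence of a symbol s becomes (s, j), (s, j + 1);
    a doubled symbol seen twice gets +u then -u, one seen once gets 0.
    coeffs is an iterator of integers u (default 1, 2, 3, ...)."""
    coeffs = count(1) if coeffs is None else coeffs
    occurrence = Counter()
    doubled = []
    for c in pattern:
        occurrence[c] += 1
        doubled += [(c, occurrence[c]), (c, occurrence[c] + 1)]
    times = Counter(doubled)
    assigned, encoded = {}, []
    for sym in doubled:
        if times[sym] == 1:
            encoded.append(0)
        elif sym not in assigned:
            assigned[sym] = next(coeffs)
            encoded.append(assigned[sym])
        else:
            encoded.append(-assigned[sym])
    return encoded


def candidate_matches(text, pattern, k=2, seed=0):
    """1-indexed alignments whose dot product is 0. Every true function
    match is reported; a non-match may be reported with small probability."""
    rng = random.Random(seed)
    n, m = len(text), len(pattern)
    bound = 2 * n ** (k + 1)
    value_of = {c: rng.randint(1, bound) for c in sorted(set(text))}
    t2 = double_text(text, value_of)
    # iter(callable, None) yields fresh random integers indefinitely.
    p2 = encode_pattern(pattern, iter(lambda: rng.randint(1, bound), None))
    return [i + 1 for i in range(n - m + 1)
            if sum(p * t for p, t in zip(p2, t2[2 * i:2 * i + 2 * m])) == 0]
```

With the values a = 7, b = 9, c = 8 and the default coefficients, `double_text` and `encode_pattern` reproduce the sequences U'' and V'' of the two examples.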

Algorithms
• An O(kn log m) time algorithm.
• The test can make mistakes. In Example 2, if the integer chosen for c in U'' equals the one chosen for a, or if the integer chosen for a4 in V'' equals the one chosen for a3, then the dot product is 0 even though there is clearly no match.
• Each of these coincidences happens with probability 1/(2n^(k+1)), so for a non-match the dot product evaluates to 0 with probability at most 2/(2n^(k+1)) = 1/n^(k+1).
• As there are n-m+1 ≤ n possible alignments of P with T, the overall failure probability is at most n · 1/n^(k+1) = 1/n^k.

Algorithms
• An O(kn log m) time algorithm.
• Time: a single convolution checks P'' against every aligned portion of T'' at once, so apart from assigning the random integers only that convolution is needed. The total time is O(kn log m).
• We have shown: there is a randomized algorithm for function matching that, given a constant k, runs in O(kn log m) time; it reports all function matches and, with probability at least 1 - 1/n^k, reports no mismatches as matches.

Convolution Model
• Symbols (each is defined by two cases; only the case conditions are shown):
• if a matches b / if a mismatches b
• if x = a / if x ≠ a
• if x ≠ a / if x = a
• if x < a / if x ≥ a

Convolution Model
Pattern:           p1 p2 p3 p4 . . . pm
Text:              t1 t2 t3 t4 . . . tn-2 tn-1 tn
Reversed pattern:  pm pm-1 . . . p2 p1
Partial products, one row per pattern symbol, each row shifted by one position as in long multiplication:
  p1t1 p1t2 . . . p1tn-2 p1tn-1 p1tn
  p2t1 p2t2 p2t3 . . . p2tn-2 p2tn-1 p2tn
  p3t1 p3t2 p3t3 p3t4 . . . p3tn-1 p3tn
  . . .
  pmt1 . . . pmtm pmtm+1 . . . pmtn-1 pmtn
Summing each column gives the dot product of the pattern with one alignment of the text.
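The multiplication scheme can be sketched in code as a convolution of the text with the reversed pattern. This is a minimal pure-Python illustration, not the paper's procedure: the recursive FFT and the function names are assumptions chosen to keep the example self-contained, and results are rounded back to integers because the FFT works in floating point.

```python
import cmath


def fft(a, invert=False):
    """Recursive radix-2 FFT; len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + w
        out[k + n // 2] = even[k] - w
    return out


def convolve(x, y):
    """Integer convolution of two integer sequences via the FFT."""
    length = len(x) + len(y) - 1
    size = 1
    while size < length:
        size *= 2
    fx = fft(list(x) + [0] * (size - len(x)))
    fy = fft(list(y) + [0] * (size - len(y)))
    prod = fft([p * q for p, q in zip(fx, fy)], invert=True)
    # The inverse transform needs the 1/size scaling.
    return [round((v / size).real) for v in prod[:length]]


def alignment_dot_products(text, pattern):
    """Dot product of pattern with every length-m window of text."""
    m = len(pattern)
    full = convolve(text, pattern[::-1])
    # Entry i + m - 1 of the convolution is the dot product at alignment i.
    return full[m - 1:len(text)]
```

For example, `alignment_dot_products` applied to two 0/1 indicator strings returns, for every alignment, how many 1s of the pattern face 1s of the text.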

Example - String Matching with Don’t Cares

T = a b a b c a b d a s c a s x
P = a b a c c
P^R = c c a b a
T_a = 1 0 1 0 0 1 0 0 1 0 0 1 0 0
P^R_b = 0 0 0 1 0
Multiplying T_a by P^R_b as in long multiplication gives one shifted copy of T_a (for the single 1 in P^R_b) and four all-zero rows. Their sum is:
0 0 0 1 0 1 0 0 1 0 0 1 0 0 1 0 0 0

T = a b a b c a b d a s c a s x
P = a b a c c
P^R = c c a b a
T_b = 0 1 0 1 0 0 1 0 0 0 0 0 0 0
P^R_a = 0 0 1 0 1
Multiplying T_b by P^R_a as in long multiplication gives two shifted copies of T_b (one for each 1 in P^R_a) and three all-zero rows. Their sum is:
0 0 0 1 0 2 0 1 1 0 1 0 0 0 0 0 0 0
S = 1, 3, 6, 9, 10
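The example can be reproduced by counting mismatches with indicator strings. The sketch below rests on an assumption inferred from the result S: only a and b are real symbols, and every other symbol, in both the text and the pattern, acts as a don't care. The function names are illustrative, and the products are summed directly instead of by convolution.

```python
def indicator(s, symbol):
    """1 where s has `symbol`, 0 elsewhere."""
    return [1 if c == symbol else 0 for c in s]


def dont_care_matches(text, pattern, alphabet="ab"):
    """1-indexed alignments with no mismatch between alphabet symbols.

    Symbols outside `alphabet` are treated as don't cares (an assumption).
    For each ordered pair x != y, the dot product of the x-indicator of the
    text window with the y-indicator of the pattern counts the positions
    where x in the text faces y in the pattern.
    """
    n, m = len(text), len(pattern)
    mismatches = [0] * (n - m + 1)
    for x in alphabet:
        for y in alphabet:
            if x == y:
                continue
            t_x = indicator(text, x)
            p_y = indicator(pattern, y)
            for i in range(n - m + 1):
                mismatches[i] += sum(
                    p * t for p, t in zip(p_y, t_x[i:i + m])
                )
    return [i + 1 for i, c in enumerate(mismatches) if c == 0]
```

Calling `dont_care_matches("ababcabdascasx", "abacc")` gives the alignments 1, 3, 6, 9, and 10.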

Two Dimensional Algorithm
• Input: a two-dimensional text T of size n×n and a two-dimensional pattern P of size m×m.
• Output: all locations [i, j] in T where there is a parameterized occurrence of the pattern.
• Idea: write the two-dimensional text and pattern in row-major order to obtain a one-dimensional text and pattern.

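[Figure: the n×n text and the m×m pattern written out row by row. P' consists of the pattern rows (Line 1, Line 2, …), each followed by n-m padding positions (shown as "?"), so that consecutive pattern rows line up with consecutive text rows. T' is the concatenation of the text rows.]
There is a function match at location (i, j) if there is a match of P' at location n(i-1)+j in T'.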
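The row-major reduction can be sketched as follows. This is an illustrative version under stated assumptions: the padding positions are modeled as a wildcard value (`None`) that matches anything, the function names are made up, and each alignment is checked naively rather than with the randomized convolution.

```python
WILDCARD = None  # padding position that matches any text symbol (assumption)


def linearize_text(text_rows):
    """T': the rows of the n x n text concatenated."""
    return [c for row in text_rows for c in row]


def linearize_pattern(pattern_rows, n):
    """P': each of the m pattern rows followed by n - m wildcards."""
    m = len(pattern_rows)
    flat = []
    for row in pattern_rows:
        flat.extend(row)
        flat.extend([WILDCARD] * (n - m))
    return flat


def function_match_at(flat_text, flat_pattern, start):
    """Naive function-match check of P' at 0-indexed position start of T'."""
    mapping = {}
    for offset, p in enumerate(flat_pattern):
        if p is WILDCARD:
            continue
        t = flat_text[start + offset]
        if mapping.setdefault(p, t) != t:
            return False
    return True


def matches_2d(text_rows, pattern_rows):
    """1-indexed locations (i, j) of two-dimensional function matches."""
    n, m = len(text_rows), len(pattern_rows)
    t = linearize_text(text_rows)
    p = linearize_pattern(pattern_rows, n)
    return [
        (i, j)
        for i in range(1, n - m + 2)
        for j in range(1, n - m + 2)
        # Location n(i-1)+j in T', converted to a 0-indexed offset.
        if function_match_at(t, p, n * (i - 1) + j - 1)
    ]
```

For the 3×3 text with rows `xyx`, `zxy`, `yyz` and the 2×2 pattern with rows `ab`, `ca`, the function matches are at (1, 1) and (1, 2).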

Two Dimensional Algorithm
• Since |T'| = n^2 and |P'| = nm, function matching between T' and P' takes O(kn^2 log(nm)) = O(kn^2 log n) time.
• Corollary: There is a randomized algorithm for two-dimensional function matching that, given a constant k, runs in O(kn^2 log(nm)) time, reports all function matches, and with probability at most 1/n^k falsely reports a mismatch as a match.

Two Dimensional Algorithm: a parameterized match
• For each m×m subarray of an n×n array, the number of distinct characters appearing in the subarray can be computed in O(n^2 log n) total time (by Church and Dar).
• A function match in which the aligned portion of the text has the same number of distinct characters as the pattern uses a one-to-one function, so it is a parameterized match. We therefore check, for every function match, whether the aligned portion of T and P have the same number of distinct characters.
→ The time is still O(kn^2 log n).

Two Dimensional Algorithm: a deterministic parameterized algorithm
• The relative position of (x, z) with respect to (w, y) is (x-w, z-y).
• Each symbol in P or T is replaced by an equal-length sequence of relative positions. Each entry is the position, relative to the symbol being replaced, of some other occurrence of the same symbol in P (or T).
• In this way, for each symbol a, all the occurrences of a in the pattern are linked.

Two Dimensional Algorithm: a deterministic parameterized algorithm
• How the two-dimensional pattern is converted to a one-dimensional pattern:
• Select the neighbors of each symbol.
• Divide the m×m pattern into disjoint rectangles. Each rectangle provides one neighbor.

Two Dimensional Algorithm: a deterministic parameterized algorithm
• Select the first occurrence of the same symbol.
• Example pattern (rows and columns numbered 1 to 8):
    1 2 3 4 5 6 7 8
1   a b a c c c a b
2   a b b c a a c c
3   b b c c a b b c
4   c c a a b b b b
5   a a a a a c c a
6   a b b b c c c a
7   a c b b b c b c
8   b b b a a a a a
• The symbol in position (2,2) is recorded as (0,0) (0,1) (0,0) (0,0) (6,1) (0,0) (0,0) (1,0) (1,-1) (1,0) (4,1) (2,3) (1,4).
• Every symbol occurring in P is replaced by its sequence of relative positions → this gives P'.
• The rectangles contain the following numbers of rows: 1, 2, 4, …, 2^(i-2), 2^(i-2), 2^(i-3), …, 4, 2, 1, 1.

Two Dimensional Algorithm: a deterministic parameterized algorithm
• T is processed in the same way as P.
• The two-dimensional problem is thereby reduced to one-dimensional string matching with wildcards, where the relative position (0,0) is the wildcard.
• It can be done in O(n^2 log^2 m) time. Why?
