
Function Matching: Algorithm, Application, and a Lower Bound

Function Matching: Algorithm, Application, and a Lower Bound. Report for the Bioinformatics Algorithms course. Presented by 張譽馨 (B90902003, Computer Science, year 3) and 吳俐瑩 (B91902051, Computer Science, year 2).


Presentation Transcript


  1. Function Matching: Algorithm, Application, and a Lower Bound • Bioinformatics Algorithms course report • 張譽馨 (B90902003, Computer Science, year 3) • 吳俐瑩 (B91902051, Computer Science, year 2)

  2. Function Matching • Function matching in computational biology. • The Grand Challenge protein folding problem is one of the most important problems in computational biology. The goal is to determine a protein's tertiary structure from the linear arrangement of its peptide sequence.

  3. Example

  4. Introduction • Abstract: Function matching captures several different applications. Given a text T of length n over alphabet ΣT and a pattern P = P[1]P[2]…P[m] of length m over alphabet ΣP, we seek all text locations i such that, for some function f: ΣP → ΣT, the m-length substring of T starting at i is equal to f(P[1])f(P[2])…f(P[m]).

  5. About This Paper • By Amihood Amir, Yonatan Aumann, Richard Cole, Moshe Lewenstein, and Ely Porat. In Proceedings of ICALP, pages 929-942, 2003. • Three main contributions of the paper: • A new type of generalized matching: function matching. • A formalization of a new general convolutions model. • Efficient randomized and deterministic algorithms for two-dimensional parameterized and function matching.

  6. Algorithms • Definition: • Let U and V be equal-length strings. Symbol τ in U is said to cover symbol σ in V if every occurrence of σ in V is aligned with an occurrence of τ in U (i.e. they occur at equal index locations). U is said to cover σ in V if there is some symbol τ in U covering σ. Finally, the cover is said to be an exact cover if, in addition, every occurrence of τ in U is aligned with an occurrence of σ in V. • Example: • U: a b a a c c a b a • V: n k n e d d n s s • U covers n in V (a covers n, but not exactly, since the a at position 4 faces e). • U exactly covers d in V (c and d occupy the same positions).
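The cover definition can be checked mechanically. The following is a minimal sketch, not code from the paper; the helper names `covering_symbol` and `is_exact_cover` are made up for illustration, and strings are compared position by position.

```python
def covering_symbol(u, v, sigma):
    """Return the symbol of u that covers sigma in v, or None if there is none."""
    if len(u) != len(v):
        raise ValueError("u and v must have equal length")
    # Collect the symbols of u that face an occurrence of sigma in v.
    aligned = {a for a, b in zip(u, v) if b == sigma}
    # sigma is covered only if all its occurrences face one and the same symbol.
    return aligned.pop() if len(aligned) == 1 else None


def is_exact_cover(u, v, tau, sigma):
    """True if tau (in u) and sigma (in v) occupy exactly the same positions."""
    return sigma in v and all((a == tau) == (b == sigma) for a, b in zip(u, v))


U = "abaaccaba"
V = "nkneddnss"
print(covering_symbol(U, V, "n"), is_exact_cover(U, V, "a", "n"))
print(covering_symbol(U, V, "d"), is_exact_cover(U, V, "c", "d"))
```

On the slide's strings this reports that a covers n without being an exact cover, and that c exactly covers d.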

  7. Algorithms • Definition: • There is a function match of V with U if every symbol occurring in V is covered by U (this relation need not be symmetric). If each of the covers is an exact cover, the match is a parameterized match (and this relation is symmetric). • Example: • U: a a b b a a a c • V: e e h h e n n f → a function match (both e and n are covered by a). • U: a a b b a a a c • V: e e h h e e e f → a parameterized match.
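As a small illustration of the two definitions, the sketch below builds the mapping from symbols of V to symbols of U on the fly. The function names are hypothetical. It treats a parameterized match as a function match in both directions, which is equivalent to every cover being exact.

```python
def is_function_match(u, v):
    """True if every symbol of v is covered by u (some f maps v onto u)."""
    if len(u) != len(v):
        return False
    f = {}
    for a, b in zip(u, v):
        # The first time b is seen, record f(b) = a; later occurrences must agree.
        if f.setdefault(b, a) != a:
            return False
    return True


def is_parameterized_match(u, v):
    """True if the covers are exact, i.e. the mapping works in both directions."""
    return is_function_match(u, v) and is_function_match(v, u)


U = "aabbaaac"
print(is_function_match(U, "eehhennf"), is_parameterized_match(U, "eehhennf"))
print(is_function_match(U, "eehheeef"), is_parameterized_match(U, "eehheeef"))
```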

  8. Algorithms • Definition: • Given a text T (of length n) and a pattern P (of length m), the function matching problem is to find the alignments (positionings) of P such that P function matches the aligned portion of T. Note that every match may use a different function to associate the symbols of P with those in the aligned portion of T. • As is standard, we can limit T to have length at most 2m by breaking T into overlapping pieces of length 2m. • Naïve algorithm: • Simply check each possible alignment of the pattern in turn. Each check takes O(m) time, so the total time for function matching is O(nm).
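The naïve algorithm can be sketched in a few lines. This is an illustrative implementation under the assumption that text and pattern are Python strings and that alignments are reported 0-based; `naive_function_matching` is a made-up name.

```python
def naive_function_matching(text, pattern):
    """Return all 0-based alignments where the pattern function matches the text."""
    n, m = len(text), len(pattern)
    matches = []
    for start in range(n - m + 1):
        f = {}  # the function used at this alignment; it may differ per alignment
        ok = True
        for k in range(m):
            # Each pattern symbol must always face the same text symbol.
            if f.setdefault(pattern[k], text[start + k]) != text[start + k]:
                ok = False
                break
        if ok:
            matches.append(start)
    return matches


# Strings taken from the example on slide 9.
print(naive_function_matching("vzxszxxzvsxgssgsaz", "abcaac"))  # [2, 9]
```

The two reported alignments, 2 and 9, are locations 3 and 10 in the 1-based numbering used on the slides.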

  9. Time Complexity - Naive algorithm • P = a b c a a c • T = v z x s z x x z v s x g s s g s a z • The pattern is tried at each of the n - m + 1 = 13 alignments in turn. • S = 3, 10 • Time Complexity = O(nm)

  10. Algorithms • An O(n|ΣP||ΣT| log m) time algorithm. • Definition: • The σ-indicator of a string U, χσ(U), is a binary string of length |U| in which each occurrence of σ is replaced by 1 and every other symbol occurrence is replaced by 0. • For a pattern symbol σ and a text symbol τ, the procedure uses the strings χσ(P) and χτ(T). For each alignment of χσ(P) with χτ(T) it computes the dot product of χσ(P) with the aligned portion of χτ(T). τ covers σ at that alignment exactly if the dot product equals the number of occurrences of σ in P.

  11. Algorithms • An O(n|ΣP||ΣT| log m) time algorithm. • Example: • U: a a b b a a a c → χa(U) = 1 1 0 0 1 1 1 0 • V: e e h h e n n f → χe(V) = 1 1 0 0 1 0 0 0 • Dot product: 1+1+0+0+1+0+0+0 = 3, which equals the number of occurrences of e in V, so e in V is covered by U. • The dot products for all alignments of χσ(P) with χτ(T) are computed together in O(n log m) time by means of a convolution. Testing one σ in P against all τ in T takes O(n|ΣT| log m) time; doing so for all σ in P takes O(n|ΣP||ΣT| log m) time.
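The indicator test can be illustrated as follows. This sketch computes each dot product directly rather than by convolution, so it shows the test but not the O(n log m) running time; the function names are assumptions made for the example.

```python
def indicator(s, symbol):
    """The symbol-indicator of s: 1 where s has that symbol, 0 elsewhere."""
    return [1 if ch == symbol else 0 for ch in s]


def covered_alignments(text, pattern, sigma, tau):
    """0-based alignments at which tau in the text covers sigma in the pattern."""
    chi_p = indicator(pattern, sigma)
    chi_t = indicator(text, tau)
    occurrences = sum(chi_p)
    m = len(pattern)
    result = []
    for start in range(len(text) - m + 1):
        # Direct dot product; a convolution would give all of these at once.
        dot = sum(p * t for p, t in zip(chi_p, chi_t[start:start + m]))
        if dot == occurrences:
            result.append(start)
    return result


def function_matches_by_indicators(text, pattern):
    """Alignments where every pattern symbol is covered by some text symbol."""
    good = set(range(len(text) - len(pattern) + 1))
    for sigma in set(pattern):
        covered = set()
        for tau in set(text):
            covered.update(covered_alignments(text, pattern, sigma, tau))
        good &= covered
    return sorted(good)


print(indicator("aabbaaac", "a"))  # [1, 1, 0, 0, 1, 1, 1, 0]
print(covered_alignments("aabbaaac", "eehhennf", "e", "a"))  # [0]
```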

  12. Algorithms • An O(n|ΣP| log m) time algorithm. • Lemma 1: Let a_1, ..., a_k be natural numbers. Then $k \sum_{h=1}^{k} a_h^2 = \left(\sum_{h=1}^{k} a_h\right)^2$ if and only if $a_i = a_j$ for all $1 \le i < j \le k$. Written out: $k(a_1^2 + a_2^2 + \dots + a_k^2) = (a_1 + a_2 + \dots + a_k)^2$. • The text symbols are encoded as distinct natural numbers. The algorithm uses the strings T and T², where T² is defined by T²[i] = (T[i])², i = 0, ..., n-1. By Lemma 1, T covers σ at a given alignment exactly if the square of the dot product of χσ(P) with the aligned portion of T equals k times the dot product of χσ(P) with the aligned portion of T², where k is the number of occurrences of σ in P.
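One way to see why the lemma holds (this short derivation is not on the slide) is to expand the difference of the two sides into a sum of squares:

```latex
k \sum_{h=1}^{k} a_h^2 - \left( \sum_{h=1}^{k} a_h \right)^2
  = (k-1) \sum_{h=1}^{k} a_h^2 - 2 \sum_{1 \le i < j \le k} a_i a_j
  = \sum_{1 \le i < j \le k} (a_i - a_j)^2 .
```

The right-hand side is a sum of non-negative terms, so it is zero exactly when $a_i = a_j$ for every pair $i < j$.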

  13. Algorithms • An O(n|ΣP| log m) time algorithm. • Example (here U plays the role of the pattern and V of the text): • U: a a b c a → U' = χa(U) = 1 1 0 0 1 • V: c c d d c → T = 2 2 1 1 2 (c → 2, d → 1), T² = 4 4 1 1 4 • U'·T = 2+2+2 = 6 • U'·T² = 4+4+4 = 12 • With k = 3 occurrences of a: 6 · 6 = 36 = 3 · 12, so by the lemma a in U is covered by V.

  14. Algorithms • An O(n|ΣP| log m) time algorithm. • Time: all text symbols in the aligned portion are compared with one pattern symbol σ at the same time, so the factor |ΣT| is no longer needed; we only check each symbol of P against the aligned portion of T. Again, the dot products for all alignments of χσ(P) with T and with T² are computed in O(n log m) time by convolution, so the total time is O(n|ΣP| log m).
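The test from the lemma is easy to try on the numbers above. The sketch below handles a single alignment; `covered_by_text` is a hypothetical helper, and the text is assumed to be already encoded as natural numbers, with distinct symbols getting distinct numbers.

```python
def covered_by_text(pattern, text_codes, sigma):
    """Check at one alignment whether the text covers sigma in the pattern."""
    chi = [1 if ch == sigma else 0 for ch in pattern]
    k = sum(chi)                                              # occurrences of sigma
    s1 = sum(c * t for c, t in zip(chi, text_codes))          # chi . T
    s2 = sum(c * t * t for c, t in zip(chi, text_codes))      # chi . T^2
    # By the lemma, all text codes facing sigma are equal iff s1^2 == k * s2.
    return s1 * s1 == k * s2


print(covered_by_text("aabca", [2, 2, 1, 1, 2], "a"))  # True: 6 * 6 == 3 * 12
print(covered_by_text("aabca", [2, 2, 1, 1, 1], "a"))  # False: 5 * 5 != 3 * 9
```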

  15. Algorithms • An O(kn log m) time algorithm that allows a probability of at most 1/n^k of reporting a non-match as a match. • Define: create a new text T' of length 2n and a new pattern P' of length 2m. There is a match of P with T starting at location i in T exactly if there is a match of P' starting at location 2i-1 in T'. • Each text symbol is written twice: if T = abca → T' = aabbccaa. • The i-th occurrence of a pattern symbol σ becomes the pair σi σi+1: if P = aabba → P' = a1a2 a2a3 b1b2 b2b3 a3a4.

  16. Algorithms • An O(kn log m) time algorithm. • Define: each distinct symbol is assigned an integer chosen uniformly at random from the range [1, 2n^(k+1)]: • T' = aabbccaa → T'' = 1 1 2 2 3 3 1 1 • P' = a1a2 a2a3 b1b2 b2b3 a3a4 → P'' = 0 1 -1 2 0 3 -3 0 -2 0 • (In P', the first occurrence of a symbol σ is replaced by uσ and the second occurrence by -uσ; a symbol that occurs only once is replaced by 0.)

  17. Algorithms • An O(kn log m) time algorithm. • For each possible alignment of P'' with T'', the dot product of P'' with the aligned portion of T'' is computed. Clearly, if there is a function match of P with T, the corresponding dot product evaluates to 0.

  18. Algorithms • An O(kn log m) time algorithm. • Example 1 (U is the text, V the pattern): • U = aabba → U' = aaaabbbbaa → U'' = 7 7 7 7 9 9 9 9 7 7 • V = aabba → V' = a1a2 a2a3 b1b2 b2b3 a3a4 → V'' = 0 1 -1 2 0 3 -3 0 -2 0 • U''·V'' = 7+(-7)+14+27+(-27)+(-14) = 0 (match!) • Example 2: • U = aabca → U' = aaaabbccaa → U'' = 7 7 7 7 9 9 8 8 7 7 • V = aabaa → V' = a1a2 a2a3 b1b2 a3a4 a4a5 → V'' = 0 1 -1 2 0 0 -2 3 -3 0 • U''·V'' = 7+(-7)+14+(-16)+24+(-21) = 1 (mismatch!)

  19. Algorithms • An O(kn log m) time algorithm. • But there can be errors. In Example 2, if the integer chosen for c in U'' were the same as that of a, or the integer chosen for a4 in V'' were the same as that of a3, the dot product would be 0, although this is clearly not a match. The probability of choosing the same integer for a and c in U'' is 1/(2n^(k+1)), and likewise for a4 and a3 in V'', so the probability of this polynomial evaluating to 0 is at most 2/(2n^(k+1)) = 1/n^(k+1). As there are n-m+1 possible alignments of P with T, the overall failure probability is at most 1/n^k.

  20. Algorithms • An O(kn log m) time algorithm. • Time: the check of P'' against every aligned portion of T'' is done at once by a single convolution, so we only need the time to assign the random numbers and to compute that convolution. Thus O(kn log m) time suffices. • We have shown: there is a randomized algorithm for function matching that, given a constant k, runs in time O(kn log m); it reports all function matches and, with probability at least 1 - 1/n^k, reports no mismatches as matches.
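The construction of slides 15 to 18 can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: the names `double_text`, `encode_pattern` and `randomized_function_matching` are invented, the weights are keyed by (symbol, subscript), and each dot product is computed directly instead of through one convolution, so the sketch does not achieve the O(kn log m) bound.

```python
import random
from collections import Counter


def double_text(text, code):
    """T'': every text symbol is written twice, as its integer code."""
    out = []
    for ch in text:
        out += [code[ch], code[ch]]
    return out


def encode_pattern(pattern, weight):
    """P'': the i-th occurrence of s becomes the pair (s_i, s_(i+1)).

    The first copy of a subscripted symbol gets +weight, the second copy gets
    -weight, and a subscripted symbol that appears only once gets 0.
    """
    total = Counter(pattern)
    seen = Counter()
    out = []
    for ch in pattern:
        seen[ch] += 1
        i = seen[ch]
        first = -weight[(ch, i)] if i >= 2 else 0              # second copy of s_i
        second = weight[(ch, i + 1)] if i < total[ch] else 0   # first copy of s_(i+1)
        out += [first, second]
    return out


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def randomized_function_matching(text, pattern, k=2, rng=random):
    """Report 0-based alignments whose dot product is 0 (may include false matches)."""
    n, m = len(text), len(pattern)
    top = 2 * n ** (k + 1)
    code = {ch: rng.randint(1, top) for ch in set(text)}
    weight = {(ch, j): rng.randint(1, top)
              for ch, c in Counter(pattern).items() for j in range(2, c + 1)}
    t2 = double_text(text, code)
    p2 = encode_pattern(pattern, weight)
    return [i for i in range(n - m + 1) if dot(p2, t2[2 * i:2 * i + 2 * m]) == 0]


# Example 1 from slide 18, with the slide's fixed integers instead of random ones.
p2 = encode_pattern("aabba", {("a", 2): 1, ("a", 3): 2, ("b", 2): 3})
t2 = double_text("aabba", {"a": 7, "b": 9})
print(p2, dot(p2, t2))  # [0, 1, -1, 2, 0, 3, -3, 0, -2, 0] 0
```

A true function match always gives a dot product of exactly 0, because the +u and -u copies of each subscripted symbol face the same text code; only the converse can fail, when the random integers happen to cancel.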

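  21. Convolution Model • Symbols: [piecewise definitions whose symbols and values were images on the slide; only the case conditions survive] • if a matches b / if a mismatches b • if x = a / if x ≠ a • if x ≠ a / if x = a • if x < a / if x ≥ a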

  22. Convolution Model • Text: t1 t2 t3 t4 … tn-2 tn-1 tn • Reversed pattern: pm pm-1 … p2 p1 • The products are laid out as in long multiplication, one row per pattern symbol, each row shifted by one column: • p1t1 p1t2 … p1tn-2 p1tn-1 p1tn • p2t1 p2t2 p2t3 … p2tn-2 p2tn-1 p2tn • p3t1 p3t2 p3t3 p3t4 … p3tn-1 p3tn • … • pmt1 … pmtm pmtm+1 … pmtn-1 pmtn

  23. Convolution Model • Same layout as slide 22, with the pattern p1 p2 p3 p4 … pm shown above the text t1 t2 t3 t4 … tn-2 tn-1 tn.

  24. Convolution Model • Same layout as slide 23.

  25. Example - String Matching with Don’t Cares

  26. T = a b a b c a b d a s c a s x • P = a b a c c • PR = c c a b a (P reversed) • Ta = 1 0 1 0 0 1 0 0 1 0 0 1 0 0 • PRb = 0 0 0 1 0 • Long multiplication of Ta by PRb (one shifted row per entry of PRb, summed by column) gives: 0 0 0 1 0 1 0 0 1 0 0 1 0 0 1 0 0 0

  27. T = a b a b c a b d a s c a s x • P = a b a c c • PR = c c a b a (P reversed) • Tb = 0 1 0 1 0 0 1 0 0 0 0 0 0 0 • PRa = 0 0 1 0 1 • Long multiplication of Tb by PRa gives: 0 0 0 1 0 2 0 1 1 0 1 0 0 0 0 0 0 0 • S = 1,3,6,9,10
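The products on slides 26 and 27 are ordinary convolutions, which is what lets all alignments be handled in O(n log m) time with an FFT. The sketch below is a textbook radix-2 FFT written with the standard library only. It is an illustration rather than the presenters' code, the function names are assumptions, and it rounds the result, so it is meant for small integer inputs.

```python
import cmath


def fft(a, invert=False):
    """Recursive radix-2 FFT; len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return a[:]
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + w
        out[k + n // 2] = even[k] - w
    return out


def convolve(x, y):
    """Integer convolution of x and y (length len(x) + len(y) - 1)."""
    size = 1
    while size < len(x) + len(y) - 1:
        size *= 2
    fx = fft([complex(v) for v in x] + [0j] * (size - len(x)))
    fy = fft([complex(v) for v in y] + [0j] * (size - len(y)))
    prod = fft([p * q for p, q in zip(fx, fy)], invert=True)
    # The inverse transform above is unnormalized, so divide by size here.
    return [round(v.real / size) for v in prod[:len(x) + len(y) - 1]]


def sliding_dot_products(text_vec, pattern_vec):
    """Dot product of the pattern with every aligned portion of the text."""
    m, n = len(pattern_vec), len(text_vec)
    # Convolving with the reversed pattern puts alignment s at index s + m - 1.
    return convolve(text_vec, pattern_vec[::-1])[m - 1:n]


Ta = [1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0]
PRb = [0, 0, 0, 1, 0]
print(convolve(Ta, PRb))
```

With Ta and PRb from slide 26 this prints the same row as the long multiplication; replacing them with Tb and PRa reproduces the row on slide 27.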

  28. Two Dimensional Algorithm • Input: a two-dimensional text T of size n × n and a two-dimensional pattern P of size m × m. • Output: all locations [i, j] in T where there is a parameterized occurrence of the pattern. • Idea: write the two-dimensional text and pattern in row-major order to obtain a one-dimensional text and pattern.

  29. [Figure: the n × n text and the m × m pattern written in row-major order. In P', each pattern line (Line 1, Line 2, …) is followed by n-m don't-care symbols "?"; T' is the concatenation of the text rows.] • There is a function match at location (i, j) if there is a match of P' at location n(i-1)+j in T'.

  30. Two Dimensional Algorithm • Since |T'| = n² and |P'| = nm, we can do the function matching between T' and P' in O(kn² log nm) = O(kn² log n) time. • Corollary: there is a randomized algorithm for two-dimensional function matching which, given a constant k, runs in O(kn² log nm) time, reports all function matches, and with probability at most 1/n^k falsely reports a mismatch as a match.

  31. Two Dimensional Algorithm - a parameterized match • It is known that, for all m × m subarrays of an n × n array, the number of distinct characters appearing in each subarray can be computed in O(n² log n) time (by Church and Dar). • So we can check whether each aligned portion of T' has the same number of distinct characters as P'. A function match in which the two counts agree uses a one-to-one function, so it is a parameterized match. • The time is still O(kn² log n).
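The row-major reduction can be sketched as below. The names (`WILDCARD`, `linearize_pattern`, `match_2d`) are made up for the example, locations are 0-based, and the one-dimensional matching is done naively rather than with the randomized convolution algorithm, so the sketch shows only the linearization step.

```python
WILDCARD = None  # stands for the don't-care symbol "?" of slide 29


def linearize_text(text_rows):
    """T': the n x n text written in row-major order."""
    return [ch for row in text_rows for ch in row]


def linearize_pattern(pattern_rows, n):
    """P': each pattern row, padded with n - m wildcards between rows."""
    m = len(pattern_rows)
    out = []
    for r, row in enumerate(pattern_rows):
        out.extend(row)
        if r < m - 1:
            out.extend([WILDCARD] * (n - len(row)))
    return out


def function_match_at(p, t, start):
    """Naive function-match test of p against t at offset start, skipping wildcards."""
    f = {}
    for k, sym in enumerate(p):
        if sym is WILDCARD:
            continue
        if f.setdefault(sym, t[start + k]) != t[start + k]:
            return False
    return True


def match_2d(text_rows, pattern_rows):
    """0-based locations (i, j) where the pattern function matches the text."""
    n, m = len(text_rows), len(pattern_rows)
    t = linearize_text(text_rows)
    p = linearize_pattern(pattern_rows, n)
    # Location (i, j) in T corresponds to offset n * i + j in T'.
    return [(i, j)
            for i in range(n - m + 1)
            for j in range(n - m + 1)
            if function_match_at(p, t, n * i + j)]


print(match_2d(["aab", "ccb", "abc"], ["xx", "yy"]))  # [(0, 0)]
```

Restricting j to at most n - m keeps a pattern row from wrapping around the end of a text row.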

  32. Two Dimensional Algorithm - a deterministic parameterized algorithm • The relative position of (x, z) with respect to (w, y) is (x-w, z-y). • Each symbol in P or T is replaced by a sequence of relative positions; all sequences have equal length. Each entry is the relative position of another occurrence of the same symbol in P (or T) with respect to the original occurrence. • In this way, for each symbol a, all the occurrences of a in the pattern are linked.

  33. Two Dimensional Algorithm - a deterministic parameterized algorithm • How the two-dimensional pattern is changed into a one-dimensional pattern: • Select the neighbors of each symbol. • Divide the m × m pattern into disjoint rectangles. • Each rectangle provides one neighbor.

  34. Two Dimensional Algorithm - a deterministic parameterized algorithm • Select the first occurrence of the same symbol. • Pattern (rows and columns numbered 1 to 8):
       1 2 3 4 5 6 7 8
     1 a b a c c c a b
     2 a b b c a a c c
     3 b b c c a b b c
     4 c c a a b b b b
     5 a a a a a c c a
     6 a b b b c c c a
     7 a c b b b c b c
     8 b b b a a a a a
     • So the symbol in position (2,2) is recorded as (0,0) (0,1) (0,0) (0,0) (6,1) (0,0) (0,0) (1,0) (1,-1) (1,0) (4,1) (2,3) (1,4). • Every symbol occurring in P is replaced by its sequence of relative positions, which gives P'. • Each rectangle contains the following numbers of rows: 1, 2, 4, …, 2^(i-2), 2^(i-2), 2^(i-3), …, 4, 2, 1, 1.

  35. Two Dimensional Algorithm - a deterministic parameterized algorithm • T is processed in the same way as P. • The two-dimensional matching problem thus becomes one-dimensional string matching with wildcards, where the relative position (0,0) is the wildcard. • It can be done in O(n² log² m) time. Why?
