1 / 20

Presentation For VLSI Application of Suffix Trees

Presentation For VLSI Application of Suffix Trees. Chunkai Yin Apr 2002. Topics. Approximate palindromes and repeats Multiple common substring problem. Approximate palindromes and repeats. Definitions Find all tandem repeats (naive method) Find all tandem repeats (Landau-Schmidt)

salene
Télécharger la présentation

Presentation For VLSI Application of Suffix Trees

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Presentation For VLSIApplication of Suffix Trees Chunkai Yin Apr 2002

  2. Topics • Approximate palindromes and repeats • Multiple common substring problem

  3. Approximate palindromes and repeats • Definitions • Find all tandem repeats (naive method) • Find all tandem repeats (Landau-Schmidt) • Find all k-mismatch tandem repeats (L-S)

  4. Definitions • Tandem repeat string written as AA, where A is a substring abcabc Specified by starting position and length of A • K-mismatch palindrome/tandem a substring that becomes a palindrome/tandem after k or fewer characters are changed zzcbbcca 2-mismatch palindrome pzcb + bcca axabaybb 2-mismatch tamdem axab + aybb

  5. Find all tandem repeats(naive method) • Guess a start position i , middle position j for the tandem • Do a longest common extension query from i and j. If the extension from i reaches j, the tandem exists starting at position i. • Time complexity O(n2) • EX1 fababd

  6. Problem: Find all tandem repeats The Landau-Schmidt method divide the problem into four subproblems (Recursive, divide-and-conquer) • Find all tandem repeats contained entirely in the first half of S (up to h = n/2) • Find all tandem repeats contained entirely in the second half of S (after h ) • Find all tandem repeats where the first copy contains position h of S. • Find all tandem repeats where the second copy contains position h of S.

  7. How to handle this ? • Subproblem A, B will be solved by recursively applying the method • Subproblem C and D are symmetric, just consider subproblem C. Therefore, the core problem is subproblem C.

  8. Solve the subproblem 3 • Initialization: h= n/2; q=h+L; L is a fixed number each time • Compute the longest common extension (from left to right) from h and q. L1 is the length of extension. • Compute the longest common extension (from right to left) from h-1 and q-1. L2

  9. Solve the subproblem 3 • If and only if L1+L2>=L, a tandem repeat exists whose first copy contains position h with the length 2L, and it can begin at any position between h-L2 and h+L1-L inclusive. The second copy begins L places to the right. Output the starting point. • EX2

  10. L1, L2 and range for starting point L q A B h Δ1=L-L2, L= Δ1 +L2 A=h-L2 Δ2=L-L1, L= Δ2 +L1 B=h- Δ2=h+L1-L Δ2 L1 Δ2 L1 L2 Δ1 L2 Δ1

  11. An Example ….b d c a b b d c a b b t t t…. h q L=5 L1=3 L2=3 A:1 B:2 Two forms available: bdcab + bdcab dcabb + dcabb

  12. Time Complexity • Time used for output O(Z) Z is the total number of tandem repeats in S • Time used for the extension queries which is proportional to the number of extension queries ext. T(n)=T(n/2) + T(n/2)+ n + n n=n/2 (->) +n/2(<-) A B C D T(n)= O(nlogn), total=O(nlogn)+Z

  13. Find all k-mismatch tandem repeats (L-S) • Immediate extension of the O(nlogn+Z) method for finding exact tandem repeats. • Run k successive longest common extension queries forward from h and q and run k successive longest common extension queries backward from h-1 and q-1. • Then try to find t, a midpoint of the tandem repeat. In the interval between h and q, the number of mismatches from h to t (found during the forward extension) plus the the number of mismatches from t+1 to q-1(found during the backward ext) is <=k.

  14. Time complexity O(knlogn +Z)

  15. Multiple common substring prob. • Problem: What substring are common to a large number of distinct strings? • Definitions T: a generalized suffix tree for K strings **The identical suffixes appearing in more than one string end at distinct leaves. Each leaf has one of K unique string identifier.

  16. C(v): the number of distinct leaf string identifiers in the subtree of v • S(v): the total number of leaves in the subtree of v • U(v): how many duplicate suffixes from the same string occur in v’s subtree. • C(v)=S(v)-U(v) • L(k): the length of the longest substring common to at least k of the strings. • U(v)=i:ni(v)>0(ni(v)-1) • ni(v): the number of leaves with identifier i in the subtree rooted at v

  17. The key work we need do • Our object: get C(v) • What is the algorithm? • Define Li: the list of leaves with identifier i, in increasing order of their DFS numbers

  18. Algorithm for the problem • Build a generalized suffix tree T for the K strings. • Number the leaves of T as they are encountered in a depth-first traversal of T • For each string identifier i, extract the ordered list Li, of leaves with identifier i. • For each identifier i, compute the lca of each consecutive pair of leaves in Li, and increment by one each time that w is the computed lca.

  19. Algorithm Cont. Lemma: If we compute the lca for each consecutive pair of leaves in Li, then for any node v, exactly ni(v)-1 of the computed lcas will lie in the substring of v. • With a bottom-up traversal of T, compute, for each node v, S(v), and U(v)=i:ni(v)>0(ni(v)-1)= [h(w):w is in the subtree of v] • Set C(v)=S(v)-U(v) for each v • Accumulate the table of L(k) values. • Time complexity: O(n)

  20. End of the presentation Thank you Thank you Thank you

More Related