A Polynomial Space and Polynomial Delay Algorithm for Enumeration of Maximal Motifs in a Sequence

A Polynomial Space and Polynomial Delay Algorithm for Enumeration of Maximal Motifs in a Sequence. Hiroki Arimura (Hokkaido University) Takeaki Uno (National Institute of Informatics).


Presentation Transcript


  1. A Polynomial Space and Polynomial Delay Algorithm for Enumeration of Maximal Motifs in a Sequence
     Hiroki Arimura (Hokkaido University), Takeaki Uno (National Institute of Informatics)
     This work is partly supported by MEXT Grant-in-Aid for Specially Promoted Research "Semi-structured Data Mining", 2005-2007, and by the Cooperative Fund of the National Institute of Informatics, 2005.

  2. Our problem: Maximal Motif Enumeration
     • Alphabet Σ = {A, B, ...} and the wildcard "o".
     • Input: a string s in Σ* and an integer q ≥ 0 (the quorum).
     • Motif with wildcards: a string x over Σ ∪ {o} that starts and ends with a constant symbol in Σ.
     • Matching: the wildcard matches any single symbol.
     • Location list: the list L(x) = {pos1, ..., posm} of the positions of x in s.
     • A motif must be frequent: |L(x)| ≥ q.
     • The problem: enumerate all maximal motifs in an input sequence, without duplicates, for the class of repeated motifs with wildcards.
     Example (q = 3): s = ABCABRRABRABCABABRABBC (positions 0 to 21). The motif ABoAB is marked at pos = 0, pos = 7 and pos = 15.
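
The definitions of matching, location list and quorum on this slide can be sketched in a few lines of Python. This is only an illustration: the function names `matches_at`, `location_list` and `is_frequent` are made up for the sketch and do not come from the presentation, and the wildcard is written "o" as on the slides.

```python
WILDCARD = "o"  # the slides write the wildcard symbol as "o" (assumed not to be a letter of the alphabet)

def matches_at(motif, s, pos):
    """True if motif occurs in s at position pos; a wildcard matches any symbol."""
    if pos < 0 or pos + len(motif) > len(s):
        return False
    return all(m == WILDCARD or m == s[pos + i] for i, m in enumerate(motif))

def location_list(motif, s):
    """L(x): all start positions of motif in s, in increasing order."""
    return [p for p in range(len(s) - len(motif) + 1) if matches_at(motif, s, p)]

def is_frequent(motif, s, q):
    """Quorum test |L(x)| >= q."""
    return len(location_list(motif, s)) >= q

s = "ABCABRRABRABCABABRABBC"          # the example string as transcribed
occ = location_list("ABoAB", s)
print(occ, is_frequent("ABoAB", s, 3))
```

On the string as transcribed, the three positions marked on the slide (0, 7 and 15) are among the occurrences the function reports for ABoAB, so the motif passes the quorum q = 3.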

  3. Our problem: Maximal Motif Enumeration
     • A maximal motif: a representative motif x that is not properly contained in any other motif y with the same location list under displacement.
     • That is, there exists no motif y such that (1) x is properly contained in y and (2) L(x) = L(y) + d for some (possibly negative) integer d.
     Example: s = ABCABRRABRABCABABRABBC (positions 0 to 21).
     • Motif BoAB, location list pos = 1, pos = 8, pos = 16: non-maximal, because it is contained in ABoAB.
     • Motif ABoAB, location list pos = 0, pos = 7, pos = 15: maximal.
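
Condition (2), equality of location lists under displacement, can be checked mechanically. The sketch below is an illustration under assumptions: `location_list` and `displacement` are hypothetical helper names, and the wildcard is the letter "o" used on the slides.

```python
def location_list(motif, s, wildcard="o"):
    """All start positions of motif in s; the wildcard matches any symbol."""
    return [p for p in range(len(s) - len(motif) + 1)
            if all(c == wildcard or c == s[p + i] for i, c in enumerate(motif))]

def displacement(lx, ly):
    """Return d such that lx == [p + d for p in ly], or None if no such d exists.

    Both lists are assumed to be sorted in increasing order.
    """
    if len(lx) != len(ly) or not lx:
        return None
    d = lx[0] - ly[0]
    return d if all(a == b + d for a, b in zip(lx, ly)) else None

s = "ABCABRRABRABCABABRABBC"
d = displacement(location_list("BoAB", s), location_list("ABoAB", s))
print(d)
```

For the slide's pair the sketch finds d = 1: every occurrence of BoAB sits one position to the right of an occurrence of ABoAB, which is why BoAB is not maximal.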

  4. Our problem: Maximal Motif Enumeration
     • The problem: enumerate all maximal motifs in an input sequence, without duplicates, for the class of repeated motifs with wildcards.
     • Input: a string s in Σ* and an integer q ≥ 0 (the quorum).
     • Task: enumerate all maximal motifs without duplicates.
     Example (q = 3): s = ABCABRRABRABCABABRABBC (positions 0 to 21).
     Solutions: ABRAB, AB, BoABoAB, ε, ABoAB, B, BoABoooooB, BoAB, BC, BoooooB

  5. Why maximal motifs?
     • In real datasets, the number of maximal motifs is much smaller than the number of all motifs, yet the maximal motifs still contain the complete information.
     • They are a succinct representation of all (frequent) motifs.

  6. How many solutions?
     • How many solutions
       • Th 1: There exist 2^Θ(n) maximal motifs in s in general.
     • Succinctness of maximal motifs
       • Th 2: There exists an infinite series of input strings (s_n)_n such that the numbers of (frequent) motifs and of maximal (frequent) motifs in s_n are 2^Ω(n) and O(n), respectively.
       • By reduction from the maximal bi-clique enumeration problem (closed set enumeration).
       • From this theorem, we know that a naive generate-and-test algorithm using frequent motif enumeration does not work for maximal motifs.
     • Hardness of counting
       • Fact 3 (not included in the paper): The counting version of maximal motif enumeration is #P-complete.
       • By the reduction for Th 2 and the #P-completeness of the maximal bi-clique enumeration problem.

  7. Classes of Enumeration Algorithms
     • No known output-polynomial time algorithms exist for maximal motif enumeration.
     • Output-polynomial (OUT-POLY): total time is poly(Input, Output).
     • Polynomial-time enumeration (POLY-ENUM): amortized delay is poly(Input), or total time is Output · poly(Input).
     • Polynomial-delay (POLY-DELAY): maximum delay is poly(Input).
     • Polynomial-space (POLY-SPACE).
     [Figure: timeline of an enumeration run, labelled with the input size, the output size M, the delay D and the total time T.]
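
The three time classes can be written side by side. The notation below is an assumption made for this sketch and is not taken from the slide: N is the input size, M the number of solutions, T the total running time, and t_i the time at which the i-th solution is output (with t_0 = 0).

```latex
% Assumed notation: N = input size, M = number of solutions,
% T = total running time, t_i = output time of the i-th solution, t_0 = 0.
\begin{align*}
\text{OUT-POLY:}   &\quad T \le \mathrm{poly}(N, M) \\
\text{POLY-ENUM:}  &\quad T \le M \cdot \mathrm{poly}(N) \\
\text{POLY-DELAY:} &\quad \max_{1 \le i \le M} \, (t_i - t_{i-1}) \le \mathrm{poly}(N)
\end{align*}
```

Summing the M delays shows that a polynomial-delay algorithm is also a polynomial-time enumeration algorithm (provided it halts within polynomial time after the last output), and M · poly(N) is in turn a special case of poly(N, M).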

  8. Related works
     An approach based on a basis of maximal motifs: generate the set M of maximal motifs from a small subset B.
     • Parida et al. [SODA'00]: the basis of irredundant motifs BI. Claimed that the size of BI is at most linear in n = |s| for any quorum q ≥ 0 [Th 1, Parida SODA'00]; this claim was later shown to be false by [Pisanti et al. MFCS'03].
     • Parida et al. [CPM'00]: output-polynomial time enumeration of flexible motifs, using the claim of [Parida et al. SODA'00].
     • Pisanti et al. [MFCS'03] (Pelfrene et al. [CPM'03]): the basis of tiling motifs BT. Showed that the sizes of BI and BT are Θ(n^(q-1)) [Pisanti et al., MFCS'03].
     • It is not known whether there exists any output-polynomial time algorithm for enumerating all maximal motifs from an input sequence s.

  9. Main result
     • Th 8: There exists an algorithm that enumerates all maximal motifs in an input sequence with quorum q in O(n^2 m) delay and O(km) space without duplicates.
       • n: the length of the input sequence s.
       • k: the length of a longest maximal motif (k = O(n)).
       • m: the length of the location list L(x) of motif x (m = O(n)).
     • Corollary 2: The maximal motif enumeration problem is solvable in polynomial space and polynomial delay in the input size n.
     This seems to be the first output-polynomial time result for the maximal motif enumeration problem.
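
Corollary 2 follows from Th 8 by substituting the stated worst-case bounds k = O(n) and m = O(n). A short worked step, reading the delay bound as n squared times m:

```latex
\begin{align*}
\text{delay} &= O(n^2 m) = O(n^2 \cdot n) = O(n^3), \\
\text{space} &= O(k m)   = O(n \cdot n)   = O(n^2).
\end{align*}
```

Both bounds are polynomial in the input size n alone, independent of the number of maximal motifs.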

  10. Polynomial space polynomial delay Algorithm

  11. Basic Idea: Incremental Generation by Back-tracking
      [Figure: backtracking search tree rooted at ⊥. Node labels: B, AB, BC, BoAB, ABoAB, ABRAB, BoABoAB, BoABoooooB, BoooooB, BCoB, BCA. Annotations on the figure: "maximal" and "no maximal motifs exist".]

  12. Difficulties in enumerating all maximal motifs
      • How to test the maximality of a generated motif?
      • How to define a tree-shaped search route over all maximal motifs?
      • How to perform depth-first search on the tree with polynomial space and delay?
      [Figure: motifs BC, ABoAB, BCoB, ABRAB, BCA.]

  13. The closure Clo(x) of a motif x
      Procedure Closure(Q) := Merge(L(Q))
      • STEP 1: Compute the location list of Q: L(Q) = {d1, ..., dm}.
      • STEP 2: Align the copies of the input sequence at the occurrence positions: S(Q) = {s - d1, ..., s - dm}.
      • STEP 3: Compute the common letters at each position and return R = Merge(S(Q)).
      [Figure: copies of the input sequence aligned at the occurrences of Q = BAB and merged into the closure Clo(Q) = BABAB.]
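
The three steps above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `location_list` and `closure` are hypothetical names, the wildcard is the letter "o", and the example reuses the string from the earlier slides rather than the one in this slide's figure.

```python
def location_list(motif, s, wildcard="o"):
    """STEP 1: all start positions of motif in s; the wildcard matches any symbol."""
    return [p for p in range(len(s) - len(motif) + 1)
            if all(c == wildcard or c == s[p + i] for i, c in enumerate(motif))]

def closure(motif, s, wildcard="o"):
    """Merge the copies of s aligned at every occurrence of motif.

    Returns None if motif does not occur in s.
    """
    locs = location_list(motif, s, wildcard)
    if not locs:
        return None
    # STEP 2: offsets (relative to an occurrence) that stay inside s for every copy.
    lo, hi = -min(locs), len(s) - max(locs)
    merged = []
    for off in range(lo, hi):
        # STEP 3: keep a letter only if all aligned copies agree at this offset.
        letters = {s[p + off] for p in locs}
        merged.append(letters.pop() if len(letters) == 1 else wildcard)
    # A motif starts and ends with a constant symbol, so trim outer wildcards.
    return "".join(merged).strip(wildcard)

s = "ABCABRRABRABCABABRABBC"
print(closure("BoAB", s))
```

On this string the sketch maps BoAB to ABoAB, in line with the earlier slide that calls BoAB non-maximal; a motif that equals its own closure cannot be extended without losing occurrences.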

  14. How to define a tree-shaped search route?
      • The parent of a maximal motif Q: Pa(Q) = Clo( Q[1..kmin] ), where kmin = core_i(Q) - 1 and core_i(Q) is the core index of Q.
      • Assign to each maximal motif y the unique parent Pa(y).
      • Lemma: The parent relation Pa(.) defines a spanning tree over all maximal motifs, which serves as the tree-shaped search route.
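
A rough sketch of the parent function, under stated assumptions: core_i(Q) is read as the length of the shortest prefix of Q with the same location list as Q (the definition given on the PPC-extension slide), prefixes ending in a wildcard are skipped, and the displacement bookkeeping needed when a closure grows to the left is ignored. All function names are hypothetical.

```python
def location_list(motif, s, wildcard="o"):
    """All start positions of motif in s; the wildcard matches any symbol."""
    return [p for p in range(len(s) - len(motif) + 1)
            if all(c == wildcard or c == s[p + i] for i, c in enumerate(motif))]

def closure(motif, s, wildcard="o"):
    """Common letters of the copies of s aligned at the occurrences of motif."""
    locs = location_list(motif, s, wildcard)
    if not locs:
        return None
    lo, hi = -min(locs), len(s) - max(locs)
    merged = []
    for off in range(lo, hi):
        letters = {s[p + off] for p in locs}
        merged.append(letters.pop() if len(letters) == 1 else wildcard)
    return "".join(merged).strip(wildcard)

def core_index(motif, s, wildcard="o"):
    """Length of the shortest prefix (ending in a constant) with the same location list."""
    target = location_list(motif, s, wildcard)
    for k in range(1, len(motif) + 1):
        prefix = motif[:k]
        if prefix[-1] != wildcard and location_list(prefix, s, wildcard) == target:
            return k
    return len(motif)

def parent(motif, s, wildcard="o"):
    """Pa(Q) = Clo(Q[1..kmin]) with kmin = core_i(Q) - 1; the empty string plays the role of the root."""
    k_min = core_index(motif, s, wildcard) - 1
    prefix = motif[:k_min].rstrip(wildcard)   # drop wildcards left dangling at the end
    return closure(prefix, s, wildcard)

s = "ABCABRRABRABCABABRABBC"
print(core_index("ABoAB", s), parent("ABoAB", s), repr(parent("AB", s)))
```

With these assumptions, ABoAB has core index 4 on the example string and its parent is AB, whose own parent is the empty motif at the root of the tree.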

  15. Prefix-Preserving Closure Extension
      • Given a parent, compute all of its children.
      • Input: the parent maximal motif P and its "core index" k = core_i(P).
        • The core index is the length of the shortest prefix of P that has the same location list as P.
      • Method: for every index i = k+1, ..., n and every letter c = c1, ..., cs (∈ Σ), do the following:
        • Q := P〈i := c〉
        • Compute the closure R := Clo(Q).
        • Check that the prefix is identical: is P[1..i-1] = R[1..i-1]?
        • If the check succeeds, return the motif R.
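
The loop above (substitute, close, compare prefixes) can be sketched as below. This is a simplified illustration under assumptions, not a verified reimplementation of the authors' algorithm: the helper names are hypothetical, positions are 0-based (so the slide's index i is j + 1), a candidate is rejected when its closure gains a symbol to the left of P's first position, and infrequent candidates are skipped using the quorum q.

```python
def location_list(motif, s, wildcard="o"):
    """All start positions of motif in s; the wildcard matches any symbol."""
    return [p for p in range(len(s) - len(motif) + 1)
            if all(c == wildcard or c == s[p + i] for i, c in enumerate(motif))]

def aligned_closure(locs, s, wildcard="o"):
    """Return (R, left): the merged motif and how many symbols it gains on the left.

    locs must be the non-empty location list of a motif starting with a constant.
    """
    lo, hi = -min(locs), len(s) - max(locs)
    merged = []
    for off in range(lo, hi):
        letters = {s[p + off] for p in locs}
        merged.append(letters.pop() if len(letters) == 1 else wildcard)
    text = "".join(merged)
    first = len(text) - len(text.lstrip(wildcard))  # index of the first constant
    return text.strip(wildcard), min(locs) - first

def ppc_children(parent, k, s, q, wildcard="o"):
    """Candidates R = Clo(P<j := c>) whose first j symbols agree with P."""
    children = []
    for j in range(k, len(s)):                  # 0-based position to fill
        if j < len(parent) and parent[j] != wildcard:
            continue                            # already a constant symbol
        padded = parent.ljust(j, wildcard)      # P extended with wildcards up to j
        for c in sorted(set(s)):
            candidate = padded[:j] + c + padded[j + 1:]
            locs = location_list(candidate, s, wildcard)
            if not locs or len(locs) < q:
                continue                        # below the quorum
            r, left = aligned_closure(locs, s, wildcard)
            if left == 0 and r[:j] == padded[:j]:   # prefix-preserving check
                children.append(r)
    return children

s = "ABCABRRABRABCABABRABBC"
children = ppc_children("AB", 1, s, 3)
print(children)
```

The prefix check is what prevents duplicates within one call: a closure R accepted at position j carries a constant at j, so it cannot also pass the check at a later position where P still has a wildcard there. On the example string, ABoAB appears among the candidates generated from AB.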

  16. Difficulties in enumerating all maximal motifs
      • How to test the maximality of a generated motif?
      • How to define a tree-shaped search route over all maximal motifs?
      • How to perform depth-first search on the tree with polynomial space and delay?
      [Figure: motifs BC, ABoAB, BCoB, ABRAB, BCA.]

  17. Algorithm MAXMOTIF
      • A polynomial space polynomial delay algorithm for maximal motifs.
      • Based on depth-first search using PPC-extension for maximal motifs.

  18. Our method (MAXMOTIF): comparison to the previous methods
      • Our method: depth-first search of a tree. Memory efficient, quick, simple. Memory = (depth of tree) × (length of location list).
      • Previous method: breadth-first search of a DAG. Large memory footprint: memory proportional to the output size.

  19. Main result
      • Th 8: There exists an algorithm that enumerates all maximal motifs in an input sequence with quorum q in O(n^2 m) delay and O(km) space without duplicates.
        • n: the length of the input sequence s.
        • k: the length of a longest maximal motif (k = O(n)).
        • m: the length of the location list L(x) of motif x (m = O(n)).
      • Corollary 2: The maximal motif enumeration problem is solvable in polynomial space and polynomial delay in the input size n.
      This seems to be the first output-polynomial time result for the maximal motif enumeration problem.

  20. Experiments

  21. Conclusion
      • Maximal motif enumeration problem
        • Difficulties in maximal motif enumeration.
      • A polynomial space polynomial delay algorithm MAXMOTIF
        • Enumerates all maximal motifs x in an input string of length n in O(n^2 m) delay and O(lm) space without duplicates, where m = |L(x)| and l = |x|.
        • Output-polynomial time enumerability of the problem.
      • Future research
        • Extension of the algorithm to maximal motif problems over combinatorial objects such as trees and graphs [Arimura and Uno, ILP2005].
