This presentation describes a space-efficient approach to frequent string mining under frequency constraints. By combining compressed suffix array techniques with Hui's color set size technique, the algorithm runs in O(n log n) time and uses O(n log |Σ| + d log n) bits of working space for a collection of d documents of total length n, compared with the O(n log n) bits used by previous methods. It processes collections of documents to identify frequent patterns and patterns that differentiate one collection from another. The work was presented at the Workshop on Compression in Santiago, Chile.
Space-Efficient String Mining under Frequency Constraints
Johannes Fischer, Ludwig-Maximilians-Universität München
Veli Mäkinen and Niko Välimäki, University of Helsinki
Frequent string mining: optimal time
DB1                              DB2
I am frequent                    We are frequent
I am also frequent               We are also frequent
Am I also making a difference    We are all frequent
• "frequent" is most frequent but does not make a difference...
• "I" differentiates DB1 from DB2
• "We are" differentiates DB2 from DB1
• String mining under several kinds of frequency constraints can be done in optimal linear time using suffix array techniques [FHK06].
Workshop on Compression, Santiago, Chile
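The three observations on this slide can be checked with a brute-force sketch. It assumes that the first three sentences form DB1 and the last three form DB2, and that the frequency of a pattern is the number of strings in a database containing it. The names `support` and `differentiates` and the thresholds `min_pos` and `max_neg` are hypothetical, chosen only for illustration; this is not the suffix-array algorithm of [FHK06].

```python
DB1 = ["I am frequent", "I am also frequent", "Am I also making a difference"]
DB2 = ["We are frequent", "We are also frequent", "We are all frequent"]

def support(pattern, database):
    """Number of strings in the database that contain the pattern at least once."""
    return sum(1 for text in database if pattern in text)

def differentiates(pattern, positive, negative, min_pos, max_neg):
    """True if the pattern is frequent in `positive` and infrequent in `negative`.

    min_pos and max_neg are illustrative thresholds, not values from the slides.
    """
    return (support(pattern, positive) >= min_pos
            and support(pattern, negative) <= max_neg)

# Support of each example pattern in (DB1, DB2).
table = {p: (support(p, DB1), support(p, DB2)) for p in ["frequent", "I", "We are"]}
print(table)
```

With thresholds such as `min_pos=2` and `max_neg=0`, `differentiates("I", DB1, DB2, 2, 0)` holds while `differentiates("frequent", DB1, DB2, 2, 0)` does not, which mirrors the bullets above.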
Frequent string mining: optimal space?
• Problem: Can string mining be done using asymptotically the same space as is needed for storing the string collection?
Our result: Space-efficient string mining
• Given a collection C of d documents with overall length n = ||C|| = ∑_{T ∈ C} |T|, where T ∈ Σ* for every T ∈ C.
• We give a string mining algorithm that uses
  • O(n log |Σ| + d log n) bits of working space and
  • O(n log n) time.
• Since usually d << n, the solution is significantly more space-efficient than previous ones that use O(n log n) bits of working space.
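To get a feeling for the two bounds, the following sketch evaluates n log |Σ| + d log n and n log n for one hypothetical input size. The values of `n`, `sigma` and `d` are made up for illustration, and the constants hidden by the O-notation are ignored, so the numbers only indicate orders of magnitude.

```python
import math

def bits_compressed(n, sigma, d):
    """n log|Sigma| + d log n, with logarithms rounded up to whole bits."""
    return n * math.ceil(math.log2(sigma)) + d * math.ceil(math.log2(n))

def bits_plain(n):
    """n log n, the order of magnitude of a plain suffix array."""
    return n * math.ceil(math.log2(n))

# Hypothetical collection: 10^9 symbols over a 4-letter alphabet, 1000 documents.
n, sigma, d = 10**9, 4, 1000
compressed = bits_compressed(n, sigma, d)
plain = bits_plain(n)
print(compressed, plain, plain / compressed)
```

For these inputs the d log n term is negligible next to n log |Σ|, which is the point of the remark that usually d << n.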
High-level description
• Tight integration of the algorithm of Kasai et al. [Kasetal01] for visiting all branching substrings of a text with Hui's [Hui92] color set size technique.
• Toolbox: compressed suffix array, compressed LCP values, range minimum queries, searchable partial sums.
Overview without compressed structures
i:     1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
T:     a  a  b  a  #  a  b  a  a  a  b  #  b  b  a  b  b  #  a  b  b  a  #
SA:    5 12 18 23  4 22  8  9  1 10  2  6 15 19 11 17  3 21  7 14 16 20 13
LCP:   0  0  0  0  0  1  1  2  3  1  2  3  2  3  0  1  1  2  2  2  1  2  3
RMQ(LCP,8,14)=1
[Figure: the suffixes of T in lexicographic order, written as columns below the arrays. Each # separator is treated as a distinct symbol, so the suffixes starting with # have LCP 0.]
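The arrays of this slide can be recomputed with a naive sketch. It assumes that every # is a distinct symbol smaller than all letters, with earlier separators sorting first, which is what the LCP values of 0 among the #-suffixes suggest. The function names are hypothetical, and sorting whole suffixes is only suitable for tiny inputs; it stands in for a real suffix array construction.

```python
T = "aaba#abaaab#bbabb#abba#"

def encode(text, sep="#"):
    """Map the k-th separator to code k so that separators never match each other."""
    num_docs = text.count(sep)
    codes, seen = [], 0
    for ch in text:
        if ch == sep:
            codes.append(seen)
            seen += 1
        else:
            codes.append(num_docs + ord(ch))  # letters sort after all separators
    return codes

def suffix_array(codes):
    """Naive construction by sorting suffixes; returns 1-based text positions."""
    return [i + 1 for i in sorted(range(len(codes)), key=lambda i: codes[i:])]

def lcp_array(codes, sa):
    """lcp[k] = length of the longest common prefix of suffixes sa[k-1] and sa[k]."""
    lcp = [0] * len(sa)
    for k in range(1, len(sa)):
        a, b = codes[sa[k - 1] - 1:], codes[sa[k] - 1:]
        h = 0
        while h < min(len(a), len(b)) and a[h] == b[h]:
            h += 1
        lcp[k] = h
    return lcp

def rmq(lcp, i, j):
    """Minimum of LCP[i..j], using the 1-based inclusive indices of the slide."""
    return min(lcp[i - 1:j])

codes = encode(T)
SA = suffix_array(codes)
LCP = lcp_array(codes, SA)
print(SA)
print(LCP)
print(rmq(LCP, 8, 14))
```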
Right-most path of suffix tree
[Figure: the right-most path of the suffix tree, with edge labels a, a, b, a#, drawn over the SA and LCP arrays of T]
Suffixes-insertion algorithm
[Figure: suffix tree fragment with edge labels a, b#, a, b, a#, drawn over the SA and LCP arrays of T]
Maintain only the right-most path
• Once a node is popped, its subtree is ready, and all statistics for the substring ending at the node can be reported.
[Figure: suffix tree fragment with edge labels a, b#, a, b, a#, drawn over the SA and LCP arrays of T]
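The idea of keeping only the right-most path can be sketched with an ordinary stack driven by the LCP array. This is a plain, uncompressed illustration in the spirit of the traversal of [Kasetal01], not the space-efficient version of the talk; the function name and the choice to report (string depth, number of leaves) pairs are assumptions made for this example. The root is never popped and is therefore not reported.

```python
LCP = [0, 0, 0, 0, 0, 1, 1, 2, 3, 1, 2, 3, 2, 3, 0, 1, 1, 2, 2, 2, 1, 2, 3]

def popped_nodes(lcp):
    """Visit branching nodes in the order they leave the right-most path.

    Returns (string depth, number of leaves below the node) for every popped node.
    """
    n = len(lcp)
    stack = [[0, 0]]                   # entries are [string depth, leaves so far]
    report = []
    for i in range(1, n + 1):
        h = lcp[i] if i < n else 0     # sentinel 0 flushes the path after the last suffix
        count = 1                      # the leaf of suffix i-1 (0-based rank)
        while stack[-1][0] > h:        # nodes deeper than h are finished
            depth, leaves = stack.pop()
            count += leaves
            report.append((depth, count))
        if stack[-1][0] < h:
            stack.append([h, count])   # new branching node at depth h
        else:
            stack[-1][1] += count      # hang the finished subtree below the top node
    return report

nodes = popped_nodes(LCP)
print(nodes)
```

For the example text the node of depth 1 popped after the fourteenth suffix has 10 leaves, one for each suffix starting with a.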
Hui's algorithm
• Store at each node v of the suffix tree the values:
  • S[v]: number of leaves in the subtree of v, and
  • C[v]: number of duplicate occurrences of the substring ending at node v (occurrences beyond the first within the same document).
• S[v]-C[v] tells how many different documents there are in the subtree of v. In other words, S[v]-C[v] defines the frequency of the substring ending at node v.
• Example from the figure: the node reached by a, a has S[v]=3 and C[v]=1.
i:     1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
D:     0  1  2  3  0  3  1  1  0  1  0  1  2  3  1  2  0  3  1  2  2  3  2
[Figure: suffix tree node with S[v]=3 and C[v]=1, drawn over the D, SA and LCP arrays of T]
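The definitions of S[v] and C[v] can be illustrated directly on the table D. The sketch below assumes that a node is given by its range of lexicographic ranks, and that the node reached by a, a covers ranks 7 to 9 (the suffixes starting with aa). It computes the counts by brute force from the definition and is not Hui's algorithm itself; `hui_counts` is a hypothetical name.

```python
from collections import Counter

# Document number of each suffix in lexicographic order, as on the slide.
D = [0, 1, 2, 3, 0, 3, 1, 1, 0, 1, 0, 1, 2, 3, 1, 2, 0, 3, 1, 2, 2, 3, 2]

def hui_counts(doc, left, right):
    """Return (S, C, S - C) for the node covering ranks left..right (1-based, inclusive)."""
    docs = doc[left - 1:right]
    s = len(docs)                                        # leaves below the node
    c = sum(k - 1 for k in Counter(docs).values())       # repeats within a document
    return s, c, s - c

aa = hui_counts(D, 7, 9)     # suffixes starting with "aa"
a = hui_counts(D, 5, 14)     # suffixes starting with "a"
print(aa, a)
```

Under these assumptions the node for aa yields S=3 and C=1, matching the values shown on the slide, so aa occurs in 2 different documents.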
Making it all space-efficient [1/5]
• The right-most path is kept in a special stack:
  • Relative string depths are coded using Elias codes.
  • Takes O(n) bits.
  • Allows constant-time pop/push.
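A minimal sketch of such a stack is given below, assuming Elias gamma codes and storing each code reversed so that the top entry can be decoded from the end of the bit sequence. The class name `GammaStack` is hypothetical, and a Python list of characters stands in for a packed bit vector, so the sketch shows the encoding but not the constant-time machinery of the actual structure.

```python
def gamma_encode(x):
    """Elias gamma code of a positive integer, as a string of '0' and '1'."""
    binary = bin(x)[2:]
    return "0" * (len(binary) - 1) + binary

class GammaStack:
    """Stack of increasing string depths, stored as gamma-coded differences."""

    def __init__(self):
        self.bits = []    # stand-in for a packed bit vector
        self.top = 0      # absolute string depth of the top node (0 = root)

    def push(self, depth):
        delta = depth - self.top                  # must be at least 1
        self.bits.extend(reversed(gamma_encode(delta)))
        self.top = depth

    def pop(self):
        zeros = 0                                 # reversed code ends with its unary part
        while self.bits[-1 - zeros] == "0":
            zeros += 1
        width = 2 * zeros + 1
        code = "".join(reversed(self.bits[-width:]))
        del self.bits[-width:]
        depth = self.top
        self.top -= int(code[zeros:], 2)
        return depth

stack = GammaStack()
for depth in (2, 7, 8):
    stack.push(depth)
used_bits = len(stack.bits)
popped = [stack.pop(), stack.pop(), stack.pop()]
print(used_bits, popped)
```

One way to see the O(n) bound: the differences along the path sum to the depth of the top node, which is at most n, and the gamma code of a difference δ takes at most 2δ bits.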
Making it all space-efficient [2/5]
• Preliminary counter S[v] values along the right-most path are encoded in the same way as the stack.
• Once a node v is popped, its S[v] value is final and is added to its parent's value.
• O(n) bits with constant-time updates.
Making it all space-efficient [3/5]
• Preliminary counter C[v] values along the right-most path are encoded using a dynamic searchable partial sums structure.
• Once a node v is popped, its C[v] value is final and is added to its parent's value.
• O(n) bits with O(log n) time updates.
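The interface of a searchable partial sums structure can be illustrated with a Fenwick tree. This is a swapped-in, non-succinct stand-in: it uses O(n log n) bits and a fixed number of slots, whereas the structure referred to on the slide is dynamic and fits in O(n) bits. The class and method names are hypothetical; only the operations (update, prefix sum, search) with O(log n) time are meant to carry over.

```python
class PartialSums:
    """Fenwick tree over slots 1..n holding non-negative counters."""

    def __init__(self, n):
        self.n = n
        self.tree = [0] * (n + 1)

    def update(self, i, delta):
        """Add delta to the counter in slot i."""
        while i <= self.n:
            self.tree[i] += delta
            i += i & -i

    def sum(self, i):
        """Sum of the counters in slots 1..i."""
        total = 0
        while i > 0:
            total += self.tree[i]
            i -= i & -i
        return total

    def search(self, target):
        """Smallest i with sum(i) >= target, or n + 1 if the total is too small."""
        pos = 0
        step = 1 << self.n.bit_length()
        while step:
            nxt = pos + step
            if nxt <= self.n and self.tree[nxt] < target:
                pos = nxt
                target -= self.tree[nxt]
            step >>= 1
        return pos + 1

ps = PartialSums(8)
ps.update(2, 1)
ps.update(5, 2)
print(ps.sum(4), ps.sum(5), ps.search(1), ps.search(2), ps.search(4))
```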
Making it all space-efficient [4/5]
• Table D encodes, in lexicographic order of the suffixes, the number of the document each suffix belongs to.
• A predecessor query on D gives the previous suffix (in lexicographic order) from the same document.
• An RMQ between the two occurrences gives the string depth where the C[v] counter should be incremented.
[Figure: the D, SA and LCP arrays of T with three example range minimum queries, returning RMQ=0, RMQ=2 and RMQ=1]
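The three bullets can be combined with the right-most-path traversal into one uncompressed sketch. It assumes the LCP and D arrays of the running example, keeps predecessors in a dictionary `pred`, answers each RMQ by scanning, and stores pending C increments in a dictionary keyed by string depth instead of a partial sums structure. All names are hypothetical; the sketch aims to show where the counters are incremented, not how the talk achieves its space bound.

```python
LCP = [0, 0, 0, 0, 0, 1, 1, 2, 3, 1, 2, 3, 2, 3, 0, 1, 1, 2, 2, 2, 1, 2, 3]
D = [0, 1, 2, 3, 0, 3, 1, 1, 0, 1, 0, 1, 2, 3, 1, 2, 0, 3, 1, 2, 2, 3, 2]

def mine_frequencies(lcp, doc):
    """Return (string depth, S, C, S - C) for every node popped from the right-most path."""
    n = len(lcp)
    pred = {doc[0]: 0}     # document -> rank of its most recent suffix
    dup = {}               # string depth on the path -> pending C increments
    stack = [[0, 0]]       # right-most path as [string depth, S], root at the bottom
    report = []
    for i in range(1, n + 1):
        h = lcp[i] if i < n else 0       # sentinel 0 flushes the path at the end
        s, c = 1, 0                      # leaf i-1: one leaf, no duplicate of its own
        while stack[-1][0] > h:
            depth, s_node = stack.pop()
            s += s_node
            c += dup.pop(depth, 0)
            report.append((depth, s, c, s - c))
        if stack[-1][0] < h:
            stack.append([h, s])
        else:
            stack[-1][1] += s
        dup[h] = dup.get(h, 0) + c       # popped children pass their C to the parent
        if i < n:
            j = pred.get(doc[i])         # previous suffix of the same document
            if j is not None:
                d = min(lcp[j + 1:i + 1])        # naive RMQ over LCP[j+1..i]
                dup[d] = dup.get(d, 0) + 1       # duplicate is charged at that depth
            pred[doc[i]] = i
    return report

result = mine_frequencies(LCP, D)
print(result)
```

The last component of each tuple is the number of different documents containing the corresponding substring; for instance, the second reported node has depth 2, S=3 and C=1.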
Making it all space-efficient [5/5]
• Table D does not need to be stored, as predecessors can be updated "on-the-fly" using an array pred[1..d].
• The compressed suffix array supports access in O(log^ε n) time and takes O(n log |Σ|) bits.
• A bit-vector B[1,n] marks the document boundaries in the text, so that rank(B,SA[i])=D[i].
• LCP and RMQ structures each take 2n(1+o(1)) bits [HS02,FH07].
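The relation rank(B,SA[i])=D[i] can be tried out on the example text. The sketch assumes that B has a 1 at every # position and that rank counts the ones strictly before the queried position, which reproduces document numbers starting from 0; other conventions are possible. A prefix-sum list stands in for a succinct rank structure, and the suffix array is built naively with separators ordered by position.

```python
from itertools import accumulate

T = "aaba#abaaab#bbabb#abba#"
n = len(T)

# Assumption: B marks the separator positions (1-based positions 5, 12, 18, 23).
B = [1 if ch == "#" else 0 for ch in T]
prefix = [0] + list(accumulate(B))       # prefix[p] = number of ones in B[1..p]

def rank(p):
    """Ones in B strictly before 1-based position p (convention assumed here)."""
    return prefix[p - 1]

def suffix_key(p):
    """Sort key: separators sort before letters and are distinct by position."""
    return [(0, q) if T[q - 1] == "#" else (1, ord(T[q - 1])) for q in range(p, n + 1)]

SA = sorted(range(1, n + 1), key=suffix_key)   # naive suffix array, 1-based
D = [rank(p) for p in SA]                      # document of each suffix in lex. order
print(D)
```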
Extensions
• This presentation only sketched how to compute the frequency values inside one document collection. In addition,
  • the computation is easy to adjust to report patterns occurring frequently in one document collection and infrequently in the other;
  • the computation gives a space-efficient construction algorithm for Sadakane's scheme of storing the frequency values [Sad07]; and
  • other compressed text indexes can be plugged in to obtain other space/time tradeoffs.
Epilogue
• Thanks to discussions with Luis Russo after the workshop, we were able to improve the space from O(n log d) to O(d log n).
• The presentation has been changed accordingly.
References
[FHK06] Johannes Fischer, Volker Heun, Stefan Kramer: Optimal String Mining under Frequency Constraints. In Proc. PKDD'06, LNAI 4213, pages 139-150, 2006.
[FH07] Johannes Fischer, Volker Heun: A New Succinct Representation of RMQ-Information and Improvements in the Enhanced Suffix Array. In Proc. ESCAPE'07, LNCS 4614, pages 459-470, 2007.
[FMV07] Johannes Fischer, Veli Mäkinen, Niko Välimäki: Space-efficient String Mining under Frequency Constraints. Submitted.
[HS02] Wing-Kai Hon, Kunihiko Sadakane: Space-Economical Algorithms for Finding Maximal Unique Matches. In Proc. CPM 2002, LNCS 2373, pages 144-152, 2002.
[Hui92] Lucas Hui: Color Set Size Problem with Application to String Matching. In Proc. CPM 1992, LNCS 644, pages 230-243, 1992.
[Kasetal01] Toru Kasai, Gunho Lee, Hiroki Arimura, Setsuo Arikawa, Kunsoo Park: Linear-Time Longest-Common-Prefix Computation in Suffix Arrays and Its Applications. In Proc. CPM 2001, LNCS 2089, pages 181-192, 2001.
[Sad07] Kunihiko Sadakane: Succinct data structures for flexible text retrieval systems. J. Discrete Algorithms 5(1), pages 12-22, 2007.