Space-Efficient String Mining under Frequency Constraints

Johannes Fischer, Ludwig-Maximilians-Universität München
Veli Mäkinen and Niko Välimäki, University of Helsinki

Frequent string mining: optimal time
• "frequent" is most frequent but does not make a difference...
• "I" differentiates DB1 from DB2
• "We are" differentiates DB2 from DB1
• String mining under several kinds of frequency constraints can be done in optimal linear time using suffix array techniques [FHK06].
DB1: I am frequent / I am also frequent / Am I also making a difference
DB2: We are frequent / We are also frequent / We are all frequent
Workshop on Compression, Santiago, Chile
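The three observations about the example can be checked by brute force. The sketch below is only an illustration of what "frequency" means here (the number of strings in a database that contain a pattern); it is not the linear-time method of [FHK06]. The split of each database into three strings and the helper name `document_frequency` are assumptions.

```python
def document_frequency(pattern, database):
    """Number of strings in `database` that contain `pattern` at least once."""
    return sum(1 for text in database if pattern in text)

# Assumed split of the slide's two example databases into individual strings.
DB1 = ["I am frequent", "I am also frequent", "Am I also making a difference"]
DB2 = ["We are frequent", "We are also frequent", "We are all frequent"]

for pattern in ["frequent", "I", "We are"]:
    # Print the pattern with its frequency in DB1 and in DB2.
    print(repr(pattern),
          document_frequency(pattern, DB1),
          document_frequency(pattern, DB2))
```

A pattern such as "frequent" scores high in both databases, whereas "I" and "We are" score high in one and zero in the other, which is what makes them discriminative.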

Frequent string mining: optimal space?
• Problem: Can string mining be done using asymptotically the same space as is needed for storing the string collection?

Our result: Space-efficient string mining
• Given a collection C of d documents with overall length n = ||C|| = ∑_{T ∈ C} |T|, where T ∈ Σ* for every T ∈ C.
• We give a string mining algorithm that uses
  • O(n log |Σ| + d log n) bits of working space and
  • O(n log n) time.
• Since usually d << n, the solution is significantly more space-efficient than previous ones, which use O(n log n) bits of working space.

High-level description
• Tight integration of the algorithm of Kasai et al. [Kasetal01], which visits all branching substrings of a text, with Hui's [Hui92] color set size technique.
• Toolbox: compressed suffix array, compressed LCP values, range minimum queries, searchable partial sums.

Overview without compressed structures
Example text T of length 23, where each # ends one document, with its suffix array SA and LCP array:
i:    1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
T:    a  a  b  a  #  a  b  a  a  a  b  #  b  b  a  b  b  #  a  b  b  a  #
SA:   5 12 18 23  4 22  8  9  1 10  2  6 15 19 11 17  3 21  7 14 16 20 13
LCP:  0  0  0  0  0  1  1  2  3  1  2  3  2  3  0  1  1  2  2  2  1  2  3
RMQ(LCP,8,14)=1, that is, the minimum of LCP[8..14] is 1.
[Figure: the sorted suffixes written out column by column beneath SA and LCP.]
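The uncompressed arrays of this slide can be reproduced with a naive construction. The following is a minimal sketch, assuming that every # acts as a unique terminator that sorts before all letters, with earlier terminators first; the function names are hypothetical, and the quadratic sorting and the scanning RMQ only stand in for the linear-time and constant-time structures named in the talk.

```python
T = "aaba#abaaab#bbabb#abba#"  # example text; each '#' ends one document

def suffix_key(text, start):
    """Sort key for the suffix at 0-based `start`. Each '#' is treated as a
    unique terminator: smaller than any letter, earlier ones first."""
    key = []
    for pos in range(start, len(text)):
        if text[pos] == "#":
            key.append((0, pos))  # unique terminator, comparison ends here
            break
        key.append((1, ord(text[pos])))
    return key

def common_prefix(text, i, j):
    """Length of the common prefix of suffixes i and j; '#' never matches."""
    length = 0
    while (i + length < len(text) and j + length < len(text)
           and text[i + length] == text[j + length] != "#"):
        length += 1
    return length

def build_sa_lcp(text):
    """Return (SA, LCP); SA holds 1-based text positions as on the slide."""
    order = sorted(range(len(text)), key=lambda s: suffix_key(text, s))
    lcp = [0] * len(order)
    for k in range(1, len(order)):
        lcp[k] = common_prefix(text, order[k - 1], order[k])
    return [s + 1 for s in order], lcp

def rmq(values, lo, hi):
    """Minimum of values[lo..hi] (1-based, inclusive) by a plain scan."""
    return min(values[lo - 1:hi])

SA, LCP = build_sa_lcp(T)
print("SA: ", SA)
print("LCP:", LCP)
print("RMQ(LCP,8,14) =", rmq(LCP, 8, 14))
```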

Right-most path of suffix tree
[Figure: the SA and LCP arrays of the example text with the right-most path of the suffix tree drawn above them; edge labels a, a, b, a#.]

Suffix-insertion algorithm
[Figure: the SA and LCP arrays of the example text with part of the suffix tree drawn above them; edge labels a, b#, a, b, a#.]

Maintain only the right-most path
• Once a node is popped, its subtree is ready, and all statistics for the substring ending at the node can be reported.
[Figure: the SA and LCP arrays of the example text with part of the suffix tree drawn above them; edge labels a, b#, a, b, a#.]
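The idea of keeping only the right-most path can be sketched as a stack-driven scan over the LCP array. This is a plain, uncompressed illustration; the function name and the choice to report each node as (string depth, left, right) over 1-based suffix array positions are assumptions. The LCP values are copied from the example.

```python
LCP = [0, 0, 0, 0, 0, 1, 1, 2, 3, 1, 2, 3, 2, 3, 0, 1, 1, 2, 2, 2, 1, 2, 3]

def branching_nodes(lcp):
    """Yield (string_depth, left, right) for every internal suffix-tree node.
    left/right are 1-based suffix array positions; a node is reported at the
    moment it is popped, i.e. when its subtree is complete."""
    n = len(lcp)
    stack = [(0, 0)]  # (string depth, 0-based left end); bottom entry = root
    for i in range(1, n + 1):
        depth = lcp[i] if i < n else 0  # a final 0 closes all open nodes
        left = i - 1
        while stack[-1][0] > depth:
            node_depth, left = stack.pop()
            yield (node_depth, left + 1, i)
        if stack[-1][0] < depth:
            # A new branching node starts where the last popped node started,
            # or at the previous suffix if nothing was popped.
            stack.append((depth, left))
    yield (0, 1, n)  # the root spans the whole suffix array

for node in branching_nodes(LCP):
    print(node)
```

At any time the stack holds exactly the nodes on the right-most path, so its size is bounded by the largest LCP value rather than by the size of the tree.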

Hui's algorithm
• Store at each node v of the suffix tree the values:
  • S[v]: number of leaves in the subtree of v, and
  • C[v]: number of duplicate occurrences of the substring ending at node v.
• S[v]-C[v] tells how many different documents there are in the subtree of v. In other words, S[v]-C[v] is the frequency of the substring ending at node v.
D: 0 1 2 3 0 3 1 1 0 1 0 1 2 3 1 2 0 3 1 2 2 3 2
[Figure: the D, SA and LCP arrays of the example text with part of the suffix tree drawn above them; the node reached by the edge labels a, a is annotated with S[v]=3 and C[v]=1.]
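The two counters can be computed during the same stack-driven scan. The sketch below is an uncompressed illustration under several assumptions: D and LCP are copied from the example and stored explicitly, a plain `min` over a slice stands in for the RMQ structure, leaves are pushed as stack entries of infinite depth, and the function name is hypothetical.

```python
LCP = [0, 0, 0, 0, 0, 1, 1, 2, 3, 1, 2, 3, 2, 3, 0, 1, 1, 2, 2, 2, 1, 2, 3]
D   = [0, 1, 2, 3, 0, 3, 1, 1, 0, 1, 0, 1, 2, 3, 1, 2, 0, 3, 1, 2, 2, 3, 2]
LEAF = float("inf")  # pseudo depth for leaf entries on the stack

def document_frequencies(lcp, doc):
    """Return (string_depth, S, C) for each internal node, in pop order."""
    last_seen = {}       # document id -> last suffix array position (0-based)
    stack = [[0, 0, 0]]  # entries [depth, S, C]; bottom entry is the root
    reported = []
    for i in range(len(lcp) + 1):
        limit = lcp[i] if i < len(lcp) else -1  # -1 also pops the root
        carry_s = carry_c = 0
        while stack and stack[-1][0] > limit:
            depth, s, c = stack.pop()
            # Totals of a popped node include those of its popped child.
            carry_s, carry_c = s + carry_s, c + carry_c
            if depth != LEAF:
                reported.append((depth, carry_s, carry_c))
        if i == len(lcp):
            break
        if stack[-1][0] < limit:
            stack.append([limit, 0, 0])  # new branching node
        stack[-1][1] += carry_s          # final child values go to the parent
        stack[-1][2] += carry_c
        prev = last_seen.get(doc[i])
        if prev is not None:
            # Depth of the lowest common ancestor of the two occurrences.
            lca_depth = min(lcp[prev + 1:i + 1])
            for entry in reversed(stack):
                if entry[0] == lca_depth:
                    entry[2] += 1        # one more duplicate below this node
                    break
        last_seen[doc[i]] = i
        stack.append([LEAF, 1, 0])       # the leaf for suffix i
    return reported

for depth, s, c in document_frequencies(LCP, D):
    print("depth", depth, "S", s, "C", c, "documents", s - c)
```

For the node of string depth 2 covering the suffixes aaab#, aab# and aaba#, this yields S=3 and C=1, matching the annotation on the slide.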

Making it all space-efficient [1/5]
• The right-most path is kept in a special stack:
  • Relative string depths are coded using Elias codes.
  • Takes O(n) bits.
  • Allows constant-time pop/push.
[Figure: the SA and LCP arrays of the example text with the right-most path drawn above them; edge labels a, a, b, a#.]
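One way to picture such a stack is sketched below: each pushed string depth is stored as the Elias gamma code of its difference to the depth beneath it, written in reverse so that it can be decoded from the top. This is only an illustration under assumptions: the class name is hypothetical, a Python list holds one bit per entry instead of packed machine words, and pop decodes bit by bit rather than in constant time.

```python
class GammaStack:
    """Stack of strictly increasing positive integers (string depths),
    stored as reversed Elias gamma codes of consecutive differences."""

    def __init__(self):
        self.bits = []  # one list entry per bit; a real version packs words
        self.top = 0    # current top value; 0 means empty (the root depth)

    def push(self, value):
        delta = value - self.top
        if delta < 1:
            raise ValueError("values must be strictly increasing")
        binary = bin(delta)[2:]
        code = "0" * (len(binary) - 1) + binary  # Elias gamma code of delta
        # Store the code reversed so that pop can read it from the end.
        self.bits.extend(int(b) for b in reversed(code))
        self.top = value

    def pop(self):
        if not self.bits:
            raise IndexError("pop from empty stack")
        value = self.top
        zeros = 0
        while self.bits.pop() == 0:  # unary part; stops at the leading 1
            zeros += 1
        delta = 1
        for _ in range(zeros):       # remaining binary digits of delta
            delta = (delta << 1) | self.bits.pop()
        self.top = value - delta
        return value

stack = GammaStack()
for depth in [1, 2, 3]:   # string depths along the path a, aa, aab
    stack.push(depth)
print(stack.pop(), stack.pop(), stack.pop())
```

Since the differences along a path sum to at most the deepest string depth, and a gamma code of x takes 2⌊log x⌋+1 bits, the total stays linear in n.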

Making it all space-efficient [2/5]
• The preliminary counter values S[v] along the right-most path are encoded in the same way as the stack.
• Once a node v is popped, its S[v] value is final, and this value is added to that of its parent.
• O(n) bits with constant-time updates.
[Figure: the SA and LCP arrays of the example text; the node reached by the edge labels a, a is annotated with S[v]=3 and C[v]=1.]

Making it all space-efficient [3/5]
• The preliminary counter values C[v] along the right-most path are encoded using a dynamic searchable partial sums structure.
• Once a node v is popped, its C[v] value is final, and this value is added to that of its parent.
• O(n) bits with O(log n) time updates.
[Figure: the SA and LCP arrays of the example text; the node reached by the edge labels a, a is annotated with S[v]=3 and C[v]=1.]
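To give a feeling for the operations of a searchable partial sums structure, here is a Fenwick tree with O(log n) update, prefix sum and search. It is a stand-in only: it has a fixed size and uses O(n log n) bits, so it is not the succinct dynamic structure meant on the slide, and how the talk maps the C[v] counters onto positions is not shown here. The class and method names are assumptions.

```python
class FenwickTree:
    """Partial sums over positions 1..size with non-negative values."""

    def __init__(self, size):
        self.size = size
        self.tree = [0] * (size + 1)

    def add(self, index, delta):
        """Add delta to the value at 1-based position index."""
        while index <= self.size:
            self.tree[index] += delta
            index += index & -index

    def prefix_sum(self, index):
        """Sum of the values at positions 1..index."""
        total = 0
        while index > 0:
            total += self.tree[index]
            index -= index & -index
        return total

    def search(self, target):
        """Smallest position whose prefix sum is >= target,
        or size + 1 if the total sum is smaller than target."""
        pos = 0
        step = 1 << self.size.bit_length()
        while step:
            nxt = pos + step
            if nxt <= self.size and self.tree[nxt] < target:
                pos = nxt
                target -= self.tree[nxt]
            step >>= 1
        return pos + 1

counters = FenwickTree(5)
for position, value in [(2, 2), (4, 1), (5, 3)]:
    counters.add(position, value)
print(counters.prefix_sum(4), counters.search(3))
```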

Making it all space-efficient [4/5]
• Table D stores, in lexicographic order of the suffixes, the number of the document each suffix belongs to.
• A predecessor query on D gives the previous occurrence inside the same document.
• An RMQ between the two occurrences gives the string depth at which the C[v] counter should be incremented.
[Figure: the D, SA and LCP arrays of the example text with three example range minimum queries, annotated RMQ=0, RMQ=2 and RMQ=1.]

Making it all space-efficient [5/5]
• Table D does not need to be stored, as predecessors can be updated "on-the-fly" using an array pred[1..d].
• The compressed suffix array supports access in O(log^ε n) time and takes O(n log |Σ|) bits.
• A bit-vector B[1,n] marks the document boundaries in the text, so that rank(B,SA[i])=D[i].
• The LCP and RMQ structures each take 2n(1+o(1)) bits [HS02,FH07].
[Figure: the D, SA and LCP arrays of the example text with three example range minimum queries, annotated RMQ=0, RMQ=2 and RMQ=1.]
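The interplay of B, rank and pred can be sketched as follows. The exact convention is an assumption: here B marks the first position of every document except the first, so that an inclusive rank yields 0-based document numbers as in the D row of the example. A prefix-sum list stands in for a succinct rank structure, and only the first nine suffix array entries of the example are used.

```python
from itertools import accumulate

T = "aaba#abaaab#bbabb#abba#"  # example text; each '#' ends one document

# B[p] = 1 iff 1-based text position p starts a document other than the first.
B = [0] + [1 if ch == "#" else 0 for ch in T[:-1]]
PREFIX = [0] + list(accumulate(B))  # PREFIX[p] = B[1] + ... + B[p]

def rank(p):
    """Number of ones in B[1..p], i.e. the document number of position p."""
    return PREFIX[p]

def previous_occurrences(sa_positions, d):
    """For each suffix, in suffix array order, return the index of the
    previous suffix from the same document (None if there is none).
    Only pred[0..d-1] is kept; the table D is never materialised."""
    pred = [None] * d
    result = []
    for i, text_pos in enumerate(sa_positions):
        doc = rank(text_pos)  # D[i], computed on the fly
        result.append(pred[doc])
        pred[doc] = i
    return result

SA_PREFIX = [5, 12, 18, 23, 4, 22, 8, 9, 1]  # first nine entries of SA
print([rank(p) for p in SA_PREFIX])
print(previous_occurrences(SA_PREFIX, d=4))
```

The working space of this step is the d entries of pred, each an index into the suffix array, which is where the d log n term of the space bound comes from.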

Extensions
• This presentation only sketched how to compute the frequency values inside one document collection. In addition,
  • the computation is easy to adjust to report patterns occurring frequently in one document collection and infrequently in the other;
  • the computation gives a space-efficient construction algorithm for Sadakane's scheme of storing the frequency values [Sad07]; and
  • other compressed text indexes can be plugged in to obtain other space/time tradeoffs.

Epilogue
• Thanks to the discussions with Luis Russo after the workshop, we were able to improve the space from O(n log d) to O(d log n).
• The presentation has been changed accordingly.

References
[FHK06] Johannes Fischer, Volker Heun, Stefan Kramer: Optimal String Mining under Frequency Constraints. In Proc. PKDD'06, LNAI 4213, pages 139-150, 2006.
[FH07] Johannes Fischer, Volker Heun: A New Succinct Representation of RMQ-Information and Improvements in the Enhanced Suffix Array. In Proc. ESCAPE'07, LNCS 4614, pages 459-470, 2007.
[FMV07] Johannes Fischer, Veli Mäkinen, Niko Välimäki: Space-efficient String Mining under Frequency Constraints. Submitted.
[HS02] Wing-Kai Hon, Kunihiko Sadakane: Space-Economical Algorithms for Finding Maximal Unique Matches. In Proc. CPM 2002, LNCS 2373, pages 144-152, 2002.
[Hui92] Lucas Hui: Color Set Size Problem with Application to String Matching. In Proc. CPM 1992, LNCS 644, pages 230-243, 1992.
[Kasetal01] Toru Kasai, Gunho Lee, Hiroki Arimura, Setsuo Arikawa, Kunsoo Park: Linear-Time Longest-Common-Prefix Computation in Suffix Arrays and Its Applications. In Proc. CPM 2001, LNCS 2089, pages 181-192, 2001.
[Sad07] Kunihiko Sadakane: Succinct data structures for flexible text retrieval systems. J. Discrete Algorithms 5(1), pages 12-22, 2007.
