This paper addresses the dictionary matching problem, where a set of d short patterns is given, and we aim to preprocess these patterns to create an index for rapid searching within any input text T. We explore existing solutions like Patricia tries and suffix trees, comparing their performance across various metrics, including space and searching time. Our novel solution incorporates a simple sampling idea to handle irregularities in patterns while maintaining optimal matching time. We discuss the advantages of our approach, along with open problems and potential future work.
Compressed Index for Dictionary Matching • WK Hon (NTHU), TW Lam (HKU), R Shah (LSU), SL Tam (HKU), JS Vitter (Purdue)
Outline • Dictionary Matching Problem • Summary of Results • Description of Our Solution (Brief): Based on (I) Suffix Tree (II) A Simple Sampling Idea (III) Handling Irregularities • Open Problems
Dictionary Matching • Input: A set of d short patterns, { P1, P2, …, Pd }, of total length n • Problem: Preprocess the patterns and create an index so that, on receiving any text T, we can report, for each Pj, all positions in T where it occurs
Dictionary Matching • Relevant parameters to measure index’s performance: d = # of patterns n = total length of patterns |T| = length of T s = size of alphabet of T and patterns occ = total occurrences in search result
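As a point of reference for the problem statement and the parameters above, here is a minimal brute-force sketch in Python. It builds no index at all and is not one of the solutions in the slides; the function name `naive_dictionary_match` and the sample text are hypothetical, and positions are reported 0-based.

```python
def naive_dictionary_match(patterns, text):
    """Return {pattern: [0-based start positions in text]} by brute force."""
    result = {p: [] for p in patterns}
    for p in patterns:
        start = text.find(p)
        while start != -1:
            result[p].append(start)
            # Continue one position later so overlapping matches are found.
            start = text.find(p, start + 1)
    return result


patterns = ["ate", "chair", "chat", "hat", "have", "vet"]
hits = naive_dictionary_match(patterns, "chathave")

d = len(patterns)                          # number of patterns
n = sum(len(p) for p in patterns)          # total length of patterns
occ = sum(len(v) for v in hits.values())   # total occurrences reported
```

Each pattern is scanned against the text separately, so the cost grows with d; the indexes discussed in the slides aim to avoid exactly that.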
Summary of Results • [Table not recoverable from the extracted text; surviving annotations: "optimal"; e = constant in (0,1); |patterns| + o(n log s)]
Existing Solution I: Patricia Trie • Compact trie storing all d patterns • [Figure: Patricia trie for { ate, chair, chat, hat, have, vet }]
Existing Solution I: Patricia Trie • Advantage: Space: |patterns| + O( d log n ) bits Very small overhead in addition to the input patterns
Existing Solution I: Patricia Trie Searching Strategy: For each position k in T • Match T from the root starting at k • Report occurrence of any Pj found • Disadvantage: Searching: worst-case O(|T|n + occ) time
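The searching strategy above can be sketched as follows. This is a simplified illustration, not the authors' structure: it uses a plain (uncompacted) trie of nested dicts rather than a Patricia trie, and the names `build_trie`, `trie_search`, and `END` are hypothetical.

```python
END = None  # key marking "a pattern ends here"; cannot clash with a character


def build_trie(patterns):
    """Build an uncompacted trie of nested dicts over the patterns."""
    root = {}
    for p in patterns:
        node = root
        for ch in p:
            node = node.setdefault(ch, {})
        node[END] = p
    return root


def trie_search(root, text):
    """For each position k in text, walk down from the root and report matches."""
    occurrences = []
    for k in range(len(text)):
        node = root
        i = k
        while True:
            if END in node:
                occurrences.append((k, node[END]))
            if i == len(text) or text[i] not in node:
                break
            node = node[text[i]]
            i += 1
    return occurrences


patterns = ["ate", "chair", "chat", "hat", "have", "vet"]
found = trie_search(build_trie(patterns), "chathave")
```

Every position k restarts at the root, and a single walk can be as long as the longest pattern, which is where the |T|n term in the worst-case bound comes from.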
Existing Solution II: Suffix Tree • Compact trie storing all suffixes of all d patterns • [Figure: suffix tree for { ate, chair, chat, hat, have, vet }]
Existing Solution II: Suffix Tree • Same Searching Strategy: • For each position k in T • Match T from the root starting at k • Report occurrence of any Pj found • Matching time = O(|T|) • Searching: worst-case O(|T| + occ) time
Existing Solution II: Suffix Tree • Disadvantage: Space: O( n log n ) bits, which could be much larger than O( n log s ), the space for |patterns|
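To get a feel for the gap between n log n and n log s, the small calculation below compares the two quantities with constant factors ignored. The values n = 2^20 and s = 4 (a DNA-like alphabet) are assumptions chosen for illustration and do not come from the slides.

```python
import math


def suffix_tree_bits(n):
    """n log n, the order of the suffix tree space (constants ignored)."""
    return n * math.log2(n)


def pattern_bits(n, s):
    """n log s, the order of the space for the patterns themselves."""
    return n * math.log2(s)


n, s = 2**20, 4  # hypothetical: about a million characters over a 4-letter alphabet
ratio = suffix_tree_bits(n) / pattern_bits(n, s)
```

Under these assumed values the ratio is log n / log s = 20 / 2, so the suffix tree bound is about ten times the size of the input before any constants are counted.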
Our Solution • no suffixes: poor searching • all suffixes: poor space • some suffixes: good space + searching
Our Solution: Sampling • Store one suffix for every a suffixes • [Figure: sampled suffix tree with a = 2 for { ate, chair, chat, hat, have, vet }; parts of the tree are marked as irregularities]
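A rough sketch of the sampling idea: instead of all suffixes, keep only one in every a. The slides do not say which offsets are kept, so the choice below (offsets 0, a, 2a, … within each pattern) is an assumption, and the helper names are hypothetical.

```python
def all_suffixes(patterns):
    """Distinct suffixes of all patterns (what a full suffix tree stores)."""
    return sorted({p[i:] for p in patterns for i in range(len(p))})


def sampled_suffixes(patterns, a):
    """Distinct suffixes starting at offsets 0, a, 2a, ... of each pattern.

    Assumption: sampling is aligned to the start of each pattern.
    """
    return sorted({p[i:] for p in patterns for i in range(0, len(p), a)})


patterns = ["ate", "chair", "chat", "hat", "have", "vet"]
full = all_suffixes(patterns)
sampled = sampled_suffixes(patterns, 2)
```

With a = 2 this keeps 12 of the 17 distinct suffixes in the example; larger values of a shrink the stored set further, at the price of the irregularities the slides go on to handle.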
Our Solution: Sampling • Same Searching Strategy: • For each position k in T • Match T from the root starting at k • Report occurrence of any Pj found • Need to handle irregularities • Matching time = O(|T|) despite irregularities
Handling irregularities • When a = log_s n: predecessor search in a set of (log n)-bit integers • Data structure: Y-fast trie • Search: O(|T| log log n + occ) time • Space: O( n log s ) bits
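The operation that irregularity handling is reduced to, predecessor search over a set of integers, can be illustrated as below. This sketch uses binary search on a sorted list as a stand-in and is not a Y-fast trie: it takes O(log m) time for m keys rather than the doubly logarithmic time a Y-fast trie offers. The function name and the sample keys are hypothetical.

```python
from bisect import bisect_right


def predecessor(sorted_keys, x):
    """Return the largest key <= x, or None if every key is greater than x."""
    i = bisect_right(sorted_keys, x)
    return sorted_keys[i - 1] if i > 0 else None


keys = [3, 8, 15, 42]       # hypothetical integer keys, kept sorted
p1 = predecessor(keys, 10)  # 8
p2 = predecessor(keys, 42)  # 42 (a key is its own predecessor here)
p3 = predecessor(keys, 2)   # None
```

How the keys are derived from the sampled suffix tree is not spelled out in the slides, so the example only shows the query itself.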
Handling irregularities • When a = (log^(1+e) n) / log s: predecessor search in a set of (log^(1+e) n)-bit strings • Data structure: String B-tree • Search: O(|T| (log^e n + log d) + occ) time • Space: |patterns| + o(n log s) bits
Handling irregularities • When a = (log^(1+e) n) / log s: predecessor search in a set of (log^(1+e) n)-bit strings • Data structure: String B-tree • Search: O(|T| (log^e n + log d) + occ) time • Space: nHk + o(n log s) bits [FerVen 07]
Open Problems • Compressed + Dynamic Version: Can an index support updates to the set of patterns? Target: achieve an nHk-type space bound • External Memory Version: Can an index operate in external memory and still support fast searching?