Efficient Dictionary Matching with Suffix Trees and Sampling Techniques

Compressed Index for Dictionary Matching
WK Hon (NTHU), TW Lam (HKU), R Shah (LSU), SL Tam (HKU), JS Vitter (Purdue)

Outline
• Dictionary Matching Problem
• Summary of Results
• Description of Our Solution (Brief): Based on
  (I) Suffix Tree
  (II) A Simple Sampling Idea
  (III) Handling Irregularities
• Open Problems

Dictionary Matching
• Input: A set of d short patterns, { P1, P2, …, Pd }, of total length n
• Problem: Preprocess the patterns and create an index so that, on receiving any text T, we can report, for each Pj, all positions in T where it occurs
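To pin down what the index must output, here is a minimal brute-force sketch of the problem statement. It is only a reference for the expected result, not an index; the function name `dictionary_match` and the 0-based position convention are assumptions made for illustration.

```python
def dictionary_match(patterns, text):
    """Return sorted (pattern_index, position) pairs, one per occurrence.

    Positions are 0-based (an assumption; the slides do not fix a convention).
    """
    hits = []
    for j, p in enumerate(patterns):
        start = text.find(p)
        while start != -1:
            hits.append((j, start))
            # Resume one character later so overlapping occurrences are found.
            start = text.find(p, start + 1)
    return sorted(hits)


patterns = ["ate", "chair", "chat", "hat", "have", "vet"]
print(dictionary_match(patterns, "chathave"))
```

This scans T once per pattern, which is exactly the cost a preprocessed index is meant to avoid.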

Dictionary Matching
• Relevant parameters to measure the index's performance:
  d = # of patterns
  n = total length of patterns
  |T| = length of T
  s = size of alphabet of T and patterns
  occ = total occurrences in search result

Summary of Results
• [Results table not recovered; surviving entries: "optimal" and "|patterns| + o(n log s)"]
• e = constant in (0,1)

Existing Solution I: Patricia Trie
• Compact trie storing all d patterns
• [Figure: Patricia trie for { ate, chair, chat, hat, have, vet }]

Existing Solution I: Patricia Trie
• Advantage: Space: |patterns| + O( d log n ) bits
  → Very small overhead in addition to the input patterns

Existing Solution I: Patricia Trie
• Searching Strategy: For each position k in T
  • Match T from the root starting at k
  • Report occurrence of any Pj found
• Disadvantage: Searching: worst-case O(|T| n + occ) time
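The searching strategy above can be sketched as follows. This is an illustration under stated assumptions: it uses a plain (uncompacted) trie built from nested dicts rather than a true Patricia trie, and the names `build_trie`, `search`, and the `"$ids"` key are hypothetical.

```python
def build_trie(patterns):
    """Plain trie as nested dicts; "$ids" lists the patterns ending at a node."""
    root = {}
    for j, p in enumerate(patterns):
        node = root
        for ch in p:
            node = node.setdefault(ch, {})
        # "$ids" has more than one character, so it cannot clash with an edge label.
        node.setdefault("$ids", []).append(j)
    return root


def search(root, text):
    """For each position k in text, walk down from the root and report matches."""
    hits = []
    for k in range(len(text)):
        node, i = root, k
        while True:
            for j in node.get("$ids", ()):
                hits.append((j, k))  # pattern j occurs at position k
            if i == len(text) or text[i] not in node:
                break
            node = node[text[i]]
            i += 1
    return hits


patterns = ["ate", "chair", "chat", "hat", "have", "vet"]
print(search(build_trie(patterns), "chathave"))
```

Every position k restarts at the root, so the walk can cost up to the longest pattern length per position. That restart is the source of the poor worst-case search time.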

Existing Solution II: Suffix Tree
• Compact trie storing all suffixes of all d patterns
• [Figure: suffix tree for { ate, chair, chat, hat, have, vet }]

Existing Solution II: Suffix Tree
• Same Searching Strategy: For each position k in T
  • Match T from the root starting at k
  • Report occurrence of any Pj found
• Matching Time = O(|T|)
• Searching: worst-case O(|T| + occ) time

Existing Solution II: Suffix Tree
• Disadvantage: Space: O( n log n ) bits
  → could be much larger than O( n log s ), the space for |patterns|
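A rough back-of-the-envelope comparison of the two space bounds, as a sketch only: it ignores the constants hidden in the O() notation and assumes one pointer of ceil(log2 n) bits per suffix and ceil(log2 s) bits per pattern character. The function names and the sample values of n and s are hypothetical.

```python
import math


def pointer_bits(n):
    """n suffixes, each addressed by a pointer of ceil(log2 n) bits."""
    return n * math.ceil(math.log2(n))


def text_bits(n, s):
    """n characters, each stored in ceil(log2 s) bits."""
    return n * math.ceil(math.log2(s))


n, s = 2**20, 4  # e.g. about a million characters over a 4-letter alphabet
print(pointer_bits(n), text_bits(n, s), pointer_bits(n) // text_bits(n, s))
```

With these sample values the pointers alone take ten times the space of the patterns themselves, and the gap grows as n increases or s shrinks.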

Our Solution
• no suffixes: poor searching
• all suffixes: poor space
• some suffixes: good space + searching

Our Solution: Sampling
• Store one suffix for every a suffixes
• [Figure: sampled suffix tree with a = 2 for { ate, chair, chat, hat, have, vet }]

Our Solution: Sampling
• Store one suffix for every a suffixes
• [Figure: sampled suffix tree with a = 2 for { ate, chair, chat, hat, have, vet }, with the irregularities marked]
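To make the sampling idea concrete, the sketch below lists which suffixes would be kept for a given a. It assumes the sampled suffixes of each pattern start at offsets 0, a, 2a, …; the slides do not state the offset convention, so treat that choice and the name `sampled_suffixes` as assumptions.

```python
def sampled_suffixes(patterns, a):
    """Distinct suffixes starting at offsets 0, a, 2a, ... of each pattern."""
    kept = set()
    for p in patterns:
        for start in range(0, len(p), a):
            kept.add(p[start:])
    return sorted(kept)


patterns = ["ate", "chair", "chat", "hat", "have", "vet"]
print(sampled_suffixes(patterns, 1))  # a = 1 keeps every suffix (full suffix tree)
print(sampled_suffixes(patterns, 2))  # a = 2 keeps roughly every other suffix
```

Under this convention the tree stores about n / a suffixes instead of n, which is where the space saving comes from.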

Our Solution: Sampling
• Same Searching Strategy: For each position k in T
  • Match T from the root starting at k
  • Report occurrence of any Pj found
• Need to handle irregularities
• Matching time = O(|T|) despite irregularities

Handling Irregularities: When a = log_s n
• Predecessor search in a set of (log n)-bit integers
• Data structure: y-fast trie
• Search: O(|T| log log n + occ) time
• Space: O( n log s ) bits
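As a rough illustration of the predecessor query involved: a block of a = log_s n characters, at log s bits per character, fits in log n bits, so it can be treated as one integer. The sketch below packs short blocks into integers and answers predecessor queries with binary search from Python's `bisect` module. This is a stand-in and not a y-fast trie: binary search costs O(log m) per query on m keys rather than O(log log n). The alphabet, the sample blocks, and the names `pack` and `predecessor` are all hypothetical.

```python
import bisect


def pack(block, alphabet):
    """Encode a block of characters as one integer in base len(alphabet)."""
    s = len(alphabet)
    value = 0
    for ch in block:
        value = value * s + alphabet.index(ch)
    return value


def predecessor(sorted_keys, x):
    """Largest key <= x, or None if every key is larger than x."""
    i = bisect.bisect_right(sorted_keys, x)
    return sorted_keys[i - 1] if i else None


alphabet = "acehirtv"  # sorted, so integer order matches lexicographic order
keys = sorted(pack(b, alphabet) for b in ["at", "ch", "ha", "ve"])
print(predecessor(keys, pack("he", alphabet)) == pack("ha", alphabet))
```

Because the alphabet string is sorted and all blocks have the same length, comparing the packed integers is equivalent to comparing the blocks lexicographically.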

Handling Irregularities: When a = (log^(1+e) n) / log s
• Predecessor search in a set of (log^(1+e) n)-bit strings
• Data structure: String B-tree
• Search: O(|T| (log^e n + log d) + occ) time
• Space: |patterns| + o(n log s) bits

Handling Irregularities: When a = (log^(1+e) n) / log s
• Predecessor search in a set of (log^(1+e) n)-bit strings
• Data structure: String B-tree
• Search: O(|T| (log^e n + log d) + occ) time
• Space: n H_k + o(n log s) bits [FerVen 07]

Open Problems
• Compressed + Dynamic Version: Can an index support updates to the set of patterns?
  Target: Achieve n H_k-type space bound
• External Memory Version: Can an index operate in external memory and still support fast searching?
