Cost-Based Variable-Length-Gram Selection for String Collections to Support Approximate Queries Efficiently

Xiaochun Yang, Bin Wang, Chen Li • Northeastern University, China

Approximate selection queries • Example of a misspelled query: Schwarrzenger • Query errors: limited knowledge about data; typos; limited input device (cell phone) • Data errors: typos; Web data; OCR • Similarity functions: edit distance; Jaccard; cosine; … • Applications: spellchecking; query relaxation; …

Performance is a big issue • Answer queries interactively • Many queries on a server

Outline • Motivation • Tightening lower bound of common strings • Effects of adding a gram on index and queries • Cost-based construction of gram dictionary • Experiments

q-grams • Example string: bingon • 2-grams: bi, in, ng, go, on

q-gram inverted lists (2-grams) • Strings by id: 1 bingo, 2 bioinng, 3 bitingin, 4 biting, 5 boing, 6 going

Query processing (2-grams) • Strings by id: 1 bingo, 2 bioinng, 3 bitingin, 4 biting, 5 boing, 6 going • Query: ED(bingon, ?) ≤ 1 • Filter: # of common grams >= 3
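The count filter on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `qgrams`, `build_index`, and `candidates` are assumptions, and grams are counted with bag semantics (a gram repeated in both strings counts once per matched occurrence).

```python
from collections import Counter, defaultdict

def qgrams(s, q=2):
    """All overlapping substrings of length q, in order."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]

def build_index(strings, q=2):
    """Inverted lists: gram -> {string id: occurrences of the gram}."""
    index = defaultdict(dict)
    for sid, s in strings.items():
        for gram, count in Counter(qgrams(s, q)).items():
            index[gram][sid] = count
    return index

def candidates(index, query, threshold, q=2):
    """Ids of strings sharing at least `threshold` grams with the query."""
    common = Counter()
    for gram, qcount in Counter(qgrams(query, q)).items():
        for sid, count in index.get(gram, {}).items():
            common[sid] += min(qcount, count)
    return sorted(sid for sid, c in common.items() if c >= threshold)

# The six strings from the slide.
strings = {1: "bingo", 2: "bioinng", 3: "bitingin",
           4: "biting", 5: "boing", 6: "going"}
index = build_index(strings)
result = candidates(index, "bingon", threshold=3)
print(result)  # [1, 2, 3, 4, 6]
```

Only string 5 (boing) is pruned here. The surviving candidates would still need an edit-distance check, which this sketch leaves out.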

VGRAM: variable-length grams [VLDB07] • [2,3]-gram dictionary (figure: trie of the gram dictionary) • Example string: bingon

Adopting VGRAM in algorithms • Example string: bingon, # of common grams >= 3 (figure: trie of the gram dictionary; labels: gram dictionary, string, grams, VGRAM lower bound)

Contributions of this study • Tightening lower bounds using dynamic programming • Cost-based quantitative approach • Analyze and estimate query performance when adding each gram • Automatically find high-quality grams (figure labels: string collection, high-quality gram, gram dictionary)

Calculating lower bound • Fixed length (q) • Example string: biinding • If ed(s1, s2) <= k, then # of common grams >= (# of grams of s1) - k * q
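The fixed-length bound can be checked on the slide's example string. The sketch below assumes bag semantics for common grams, and the helper names (`common_grams`, `lower_bound`) are illustrative. It substitutes a fresh character at each position of biinding and confirms that the number of shared 2-grams never drops below the bound for k = 1.

```python
from collections import Counter

def qgrams(s, q):
    """All overlapping substrings of length q."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]

def common_grams(s1, s2, q):
    """Size of the bag intersection of the two gram multisets."""
    shared = Counter(qgrams(s1, q)) & Counter(qgrams(s2, q))
    return sum(shared.values())

def lower_bound(s, q, k):
    """(# of grams of s) - k * q; a value <= 0 means the filter prunes nothing."""
    return (len(s) - q + 1) - k * q

s, q, k = "biinding", 2, 1
bound = lower_bound(s, q, k)  # 7 grams - 1 * 2 = 5

# One substitution touches at most q grams, so every variant keeps >= bound grams.
worst = min(common_grams(s, s[:i] + "#" + s[i + 1:], q) for i in range(len(s)))
print(bound, worst)
```

A substitution in the middle of the string destroys exactly q = 2 grams, so the bound is reached; substitutions at either end destroy only one.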

Calculating lower bound • Variable lengths • Example string: biinding, with per-position values 2 2 2 1 1 3 3 1 (b:2, i:2, i:2, n:1, d:1, i:3, n:3, g:1) • Lower bound = (# of grams of s1) - NAG(s1, k)

Too pessimistic? • k-Max: summation of the k largest values • Example: biinding with per-position values 2 2 2 1 1 3 3 1 gives NAG(s,2) = 3 + 3 = 6
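The k-Max estimate is simple enough to state in a few lines. The function name `nag_kmax` is an assumption; the input vector is the one shown on the slide.

```python
def nag_kmax(values, k):
    """k-Max estimate of NAG(s, k): sum of the k largest per-position values."""
    return sum(sorted(values, reverse=True)[:k])

# Per-position values for "biinding" as shown on the slide.
values = [2, 2, 2, 1, 1, 3, 3, 1]
estimate = nag_kmax(values, 2)
print(estimate)  # 6
```

The estimate adds the two 3s even though they sit at adjacent positions, which is why the slide asks whether it is too pessimistic.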

Tightening lower bound • Dynamic programming: tightening NAG(s,k) • Subproblems: NAG(s[1,j], i) (figure: string s, positions 1 to j, operation op_i)

Dynamic programming • Recurrence function (figure: string s, positions 1 to j, operations op_{i-1} and op_i, value B[j])

Dynamic programming • Example string: biinding with per-position values 2 2 2 1 1 3 3 1 • NAG vector for k=0, k=1, k=2 (figure)
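The slides do not preserve the recurrence, so the following is only a sketch of why a dynamic program can beat k-Max, under a simplifying assumption that is not taken from the talk: an edit at position j affects a contiguous range of gram ids (lo_j, hi_j), with both ends nondecreasing in j. The ranges below are hypothetical; they were chosen only so that their sizes match the slide's values 2 2 2 1 1 3 3 1. Under that model, NAG is the largest number of distinct grams that k positions can affect, and overlapping ranges are counted once.

```python
from itertools import combinations

def nag_kmax(ranges, k):
    """k-Max: add up the k largest range sizes, ignoring overlaps."""
    sizes = sorted((hi - lo + 1 for lo, hi in ranges), reverse=True)
    return sum(sizes[:k])

def nag_dp(ranges, k):
    """Max number of distinct grams affected by at most k edited positions.

    Assumes ranges[j] = (lo, hi) is inclusive and monotone in j.
    f[j] is the best total when the rightmost edited position is j.
    """
    n = len(ranges)
    if k <= 0 or n == 0:
        return 0
    neg = float("-inf")
    f = [hi - lo + 1 for lo, hi in ranges]  # one operation
    best = max(f)
    for _ in range(1, min(k, n)):           # add one more operation per round
        g = [neg] * n
        for j, (lo, hi) in enumerate(ranges):
            for p in range(j):
                if f[p] == neg:
                    continue
                # Grams of position j not already covered up to hi of position p.
                fresh = max(hi - max(lo, ranges[p][1] + 1) + 1, 0)
                g[j] = max(g[j], f[p] + fresh)
        f = g
        best = max(best, max(f))
    return best

def nag_bruteforce(ranges, k):
    """Reference answer by enumerating every choice of up to k positions."""
    best = 0
    for r in range(1, min(k, len(ranges)) + 1):
        for combo in combinations(ranges, r):
            grams = set()
            for lo, hi in combo:
                grams.update(range(lo, hi + 1))
            best = max(best, len(grams))
    return best

# Hypothetical affected-gram ranges; sizes are 2 2 2 1 1 3 3 1.
ranges = [(0, 1), (0, 1), (1, 2), (2, 2), (3, 3), (3, 5), (4, 6), (6, 6)]
print(nag_kmax(ranges, 2), nag_dp(ranges, 2))  # 6 5
```

With these ranges the two positions of size 3 share two grams, so no pair of edits can affect more than 5 grams, one fewer than the k-Max estimate. A smaller NAG gives a larger lower bound on common grams and therefore stronger filtering.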

Effects on inverted lists • Gram dictionary {ab, bc} → add gram abc → gram dictionary {ab, bc, abc} • Strings in the figure: --abc--, --ab--, --bc--
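The effect on the inverted lists can be illustrated with a small decomposition routine. It is a sketch in the spirit of VGRAM, not the published algorithm verbatim: at each position it takes the longest dictionary gram that matches, falls back to the qmin-gram, and skips any gram contained in one already emitted. The three strings are hypothetical stand-ins for the slide's --abc--, --ab--, and --bc--.

```python
from collections import defaultdict

def vgrams(s, dictionary, qmin=2):
    """Greedy variable-length grams of s (assumed longest-match policy)."""
    grams = []
    covered_end = 0  # exclusive end of the furthest gram emitted so far
    for p in range(len(s) - qmin + 1):
        best = s[p:p + qmin]  # fallback: the qmin-gram at p
        for g in dictionary:
            if len(g) > len(best) and s.startswith(g, p):
                best = g
        end = p + len(best)
        if end > covered_end:  # otherwise it lies inside an earlier gram
            grams.append(best)
            covered_end = end
    return grams

def inverted_lists(strings, dictionary, qmin=2):
    """gram -> sorted ids of the strings whose decomposition contains it."""
    lists = defaultdict(set)
    for sid, s in strings.items():
        for g in vgrams(s, dictionary, qmin):
            lists[g].add(sid)
    return {g: sorted(ids) for g, ids in lists.items()}

# Hypothetical strings containing abc, ab only, and bc only.
strings = {1: "xabcx", 2: "xabx", 3: "xbcx"}
before = inverted_lists(strings, {"ab", "bc"})
after = inverted_lists(strings, {"ab", "bc", "abc"})
print(before["ab"], before["bc"])              # [1, 2] [1, 3]
print(after["ab"], after["bc"], after["abc"])  # [2] [3] [1]
```

After abc is added, string 1 leaves the lists of ab and bc and appears only on the new list of abc, so the existing lists shrink while one new list is created.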

Effects on query performance • Shrinks the query’s inverted lists • Changes the lower bound • Changes the # of candidates

Effects on query’s inverted lists • Gram dictionary {ab, bc} → add gram abc → gram dictionary {ab, bc, abc} (figure: query Q) • Adding a new gram abc either leaves the query’s inverted lists unchanged or shrinks them

Effects on lower bound • Query: Q, ED(Q, ?) ≤ 1 (figure: query Q before and after adding the gram)

Effects on # of candidates • Change lower bound → change # of candidates • Gram dictionary {ab, bc} → add gram abc → gram dictionary {ab, bc, abc} (figure: query Q)

Outline • Motivation • Tightening lower bound of common strings • Effects of adding a gram on index and queries • Cost-based construction of gram dictionary • Experiments

Construct a gram dictionary [VLDB07] • qmin=2, qmax=4

Cost-based construction • qmin=2

Outline • Motivation • Tightening lower bound of common strings • Effects of adding a gram on index and queries • Cost-based construction of gram dictionary • Experiments

Data sets • Environment: GNU C++; Dell GX620 PC with an Intel Pentium 2.40GHz Dual Core CPU, 2GB memory, 250GB disk; Ubuntu (Linux) OS • Index structures were assumed to be in memory

Effect of Tightening Lower Bound • Dataset: 1M actor names • Gram dictionary construction: 100,000 sample strings, 5000 queries, qmin = 4

Comparison with algorithm Prune [VLDB07] • Dataset: 1M article titles • Prune: qmin=5, qmax=7, T=2000, LargeFirst policy • GramGen: 1% sampling ratio, 2000 queries (qmin=5 automatically determined)

Choosing qmin Construct gram dictionary: (a) 3,000 queries, (b) sample ratio=2%

Conclusions • Tightening lower bound • Dynamic programming • Analysis of how adding a gram affects • Index structure • Performance of queries • Efficient algorithm • Automatically generating a high-quality gram dictionary

Thank you Questions or Comments?

Related work • Approximate String Matching • q-Grams, q-Samples • Inside DBMS • Substring matching • Set similarity join • Estimation • Selectivity of SQL LIKE substring queries • Approximate string answers
