Extending Q-Grams to Estimate Selectivity of String Matching with Low Edit Distance [1]

Pirooz Chubak, May 22, 2008

Motivation • Selectivity estimation of approximate string matching queries • Applications • Misspelling correction/suggestion • Data integration and data cleaning • Query optimization (generating query plans)

Approximate String Matching • String similarity measures • Edit distance • Hamming distance • Jaccard similarity coefficient • Edit distance • Minimum number of edit operations (insertion, deletion, replacement) needed to convert one string into the other
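The edit distance definition above can be made concrete with the standard dynamic-programming recurrence. This is a minimal sketch for illustration, not code from [1]; the function name `edit_distance` is an assumption.

```python
def edit_distance(s: str, t: str) -> int:
    """Minimum number of insertions, deletions and replacements turning s into t."""
    # prev[j] holds the distance between the processed prefix of s and t[:j].
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]  # distance from s[:i] to the empty string: i deletions
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete cs
                            curr[j - 1] + 1,      # insert ct
                            prev[j - 1] + cost))  # replace cs by ct, or keep it
        prev = curr
    return prev[-1]
```

For example, `edit_distance("abcd", "abxd")` is 1 (one replacement), so "abxd" would be in the answer set of "abcd" for any threshold of 1 or more.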

Short Identifying Substring • SIS by Chaudhuri et al. [2] • A string s usually has a substring s’ such that if an attribute value contains s’, it almost always contains s • Thus, approximate the selectivity of a long query string by that of its shorter substrings

Related Work • SEPIA [3] • Clusters similar strings • Selects a pivot for each cluster • Captures the edit distance distribution with histograms • For each query, visit all the clusters and estimate the number of strings within the distance threshold

Problem Statement • Given a query string sq and a bag of strings DB, estimate the size of the answer set • Interested in low edit distance thresholds (1-3)

Basic Definitions • Q-gram • Any string of length q • N-gram table • Frequencies of all q-grams for q=1…N • Ans(sq,iDjImR) = set of strings s’ such that sq can be converted to s’ with i deletions, j insertions and m replacements • Ans(sq,k) = set of strings s’ obtained from sq with exactly k edit operations

Examples • Ans(“abcd”,1R) = {“?bcd”, “a?cd”, “ab?d”, “abc?”} • Alphabet for extended Q-grams = the original alphabet plus the wildcard ? and the begin/end markers # and $ • 3-gram table for “beau” contains frequencies for • 1-grams (b, e, a, u) • 2-grams (#b, be, ea, au, u$) • 3-grams (#be, bea, eau, au$) • Extended 3-gram table also contains frequencies for • 2-grams (?b, ?e, ?u, b?, e?, a?, u?, ??, #?, ?$) • 3-grams (?ea, #?e, ??$, etc.)
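The “beau” example can be reproduced with a short sketch. The function names (`qgrams`, `wildcard_variants`, `ngram_table`) are hypothetical, and two choices are assumptions rather than statements from [1]: each q-gram is counted at most once per string, and wildcard variants are generated only for q ≥ 2 by replacing any position (markers included) with ?, which is what the listed examples such as ?b and u? suggest.

```python
from collections import Counter
from itertools import combinations

def qgrams(s, q):
    """q-grams of s padded with the begin marker '#' and the end marker '$'."""
    padded = "#" + s + "$"
    grams = [padded[i:i + q] for i in range(len(padded) - q + 1)]
    # A marker on its own carries no information, so drop it from the 1-grams.
    return [g for g in grams if g not in ("#", "$")]

def wildcard_variants(gram):
    """All variants of gram with a non-empty subset of positions set to '?'."""
    variants = set()
    for r in range(1, len(gram) + 1):
        for positions in combinations(range(len(gram)), r):
            chars = list(gram)
            for p in positions:
                chars[p] = "?"
            variants.add("".join(chars))
    return variants

def ngram_table(db, n, extended=False):
    """Map each (extended) q-gram, q = 1..n, to the number of strings having it."""
    table = Counter()
    for s in db:
        grams = set()
        for q in range(1, n + 1):
            for g in qgrams(s, q):
                grams.add(g)
                if extended and q >= 2:  # assumption: no wildcard 1-grams
                    grams |= wildcard_variants(g)
        table.update(grams)  # assumption: count each gram once per string
    return table
```

With `ngram_table(["beau"], 3, extended=True)`, every q-gram listed on the slide appears with frequency 1.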

Replacement semi-lattice • Assume only replacements are allowed • E.g. Ans(“abcd”,2R) • Possible answers = ab??, a?c?, ?bc?, a??d, ?b?d, ??cd • Find the value of |Ans(“abcd”,2R)| = |S1 ∪ … ∪ S6| by inclusion-exclusion, where S1 = ab??, … , S6 = ??cd

Replacement semi-lattice (Cont.) • Semi-lattice for Ans(“abcd”,2R) • Get the values of the intersections from this table and plug them into the formula for |Ans(“abcd”,2R)|
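The size of the union of the six wildcard patterns can be computed by inclusion-exclusion over their intersections. The sketch below is an illustration under assumptions, not the implementation from [1]: the names are hypothetical, and `frequency` scans a small in-memory list where the real method would look values up in (or estimate them from) the extended N-gram table.

```python
from functools import reduce
from itertools import combinations

def base_strings(sq, k):
    """Patterns obtained by putting the wildcard '?' at k positions of sq."""
    patterns = []
    for positions in combinations(range(len(sq)), k):
        chars = list(sq)
        for p in positions:
            chars[p] = "?"
        patterns.append("".join(chars))
    return patterns

def intersect(p1, p2):
    """Pattern matching exactly the strings matched by both p1 and p2.

    Both patterns come from the same query string, so a position is either
    '?' or the same concrete character in each of them.
    """
    return "".join(b if a == "?" else a for a, b in zip(p1, p2))

def matches(pattern, s):
    return len(pattern) == len(s) and all(
        p == "?" or p == c for p, c in zip(pattern, s))

def frequency(pattern, db):
    """Stand-in for a table lookup: number of strings in db matching pattern."""
    return sum(matches(pattern, s) for s in db)

def replacement_answer_size(sq, k, db):
    """|S1 ∪ ... ∪ Sm| by inclusion-exclusion over all r-intersections."""
    bases = base_strings(sq, k)
    total = 0
    for r in range(1, len(bases) + 1):
        sign = 1 if r % 2 else -1
        for combo in combinations(bases, r):
            total += sign * frequency(reduce(intersect, combo), db)
    return total
```

For instance, intersecting ab?? with a?c? gives abc?, which is one of the nodes one level down in the semi-lattice.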

General Formulas • Generalize the above idea to find |Ans(sq,kR)| • The general formula for deletion is trivial: it is always the sum of the frequencies of the level-0 nodes • The general case for insertion can be very complex; only at most 3 insertions are of interest
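The deletion case is simple because deletion base strings contain no wildcards, so distinct base strings match disjoint sets of strings and their frequencies just add up. A minimal sketch, assuming hypothetical function names and using exact counts over an in-memory list in place of table frequencies:

```python
from collections import Counter
from itertools import combinations

def deletion_base_strings(sq, k):
    """Distinct strings obtained by deleting k characters from sq."""
    return {
        "".join(c for i, c in enumerate(sq) if i not in drop)
        for drop in combinations(range(len(sq)), k)
    }

def deletion_answer_size(sq, k, db):
    """Sum of the frequencies of the (distinct) level-0 nodes."""
    freq = Counter(db)  # stand-in for frequencies read from the N-gram table
    return sum(freq[b] for b in deletion_base_strings(sq, k))
```

Collecting the base strings in a set matters when the query has repeated characters: deleting either b from "abbc" yields the same string "abc", which must be counted once.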

Estimate selectivity • General idea • Group Ans(sq,k) by string length (l-k … l+k, where l = |sq|) • Estimate the size of each subset separately • E.g. Ans(“abcde”,2) • 5 subsets, with strings of length 3 to 7 • Length 3 is Ans(“abcde”,2D) • Length 5 is Ans(“abcde”,1I1D) ∪ Ans(“abcde”,2R) • The two sets overlap heavily
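The grouping by length can be enumerated mechanically: with exactly k operations, i deletions and j insertions give strings of length l - i + j. The sketch below lists the operation mixes per length; the function name and the tuple order (deletions, insertions, replacements), chosen to mirror the iDjImR notation, are assumptions.

```python
def operations_by_length(length, k):
    """Map each result length to the (deletions, insertions, replacements)
    combinations with exactly k operations that produce it."""
    groups = {}
    for d in range(k + 1):
        for i in range(k - d + 1):
            r = k - d - i
            if d + r > length:
                continue  # cannot delete or replace more characters than exist
            groups.setdefault(length - d + i, []).append((d, i, r))
    return dict(sorted(groups.items()))
```

For a query of length 5 and k = 2 this yields lengths 3 to 7, with length 3 coming only from (2, 0, 0), i.e. 2D, and length 5 from both (0, 0, 2) and (1, 1, 0), i.e. 2R and 1I1D.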

Estimate selectivity (Cont.) • Combined Approach • Obtain base strings for both sets • Remove redundant base strings • Ans(“abcde”,2R) generates “abc??” • Ans(“abcde”,1I1D) generates “abcd?” • “abc??” covers all the strings in “abcd?” • Remove “abcd?” from the base strings
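Removing redundant base strings amounts to a subsumption test between wildcard patterns of the same length. A minimal sketch with hypothetical function names:

```python
def covers(general, specific):
    """True if every string matching `specific` also matches `general`."""
    return len(general) == len(specific) and all(
        g == "?" or g == s for g, s in zip(general, specific))

def remove_redundant(base_strings):
    """Drop duplicates and every pattern covered by another pattern."""
    unique = list(dict.fromkeys(base_strings))  # de-duplicate, keep order
    return [p for p in unique
            if not any(q != p and covers(q, p) for q in unique)]
```

Here `covers("abc??", "abcd?")` holds, so "abcd?" is dropped, while "abc??" is kept because no other base string covers it.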

Estimate selectivity (cont.) • BasicEQ, for a given string length • Find the base strings (remove redundancies) • Iteratively intersect base strings to obtain r-intersections (r = 2..|base strings|) • This will generate new nodes in the hierarchy • Partition the nodes and estimate their frequencies • Add these estimated frequencies

Estimate selectivity (cont.) • Node Partitioning • Partition the nodes, so that every node q in a partition has the same coefficient Cq • Cq is the number of times q appears in all the intersections of base strings • For each partition, find Cq and the sum of the frequencies of its nodes
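The coefficients can be illustrated by expanding inclusion-exclusion symbolically and grouping the resulting nodes. This sketch assumes that Cq is the signed count of q over all r-intersections (plus for odd r, minus for even r); the slides do not spell out the sign convention, and the function names are hypothetical.

```python
from collections import Counter, defaultdict
from functools import reduce
from itertools import combinations

def base_strings(sq, k):
    """Replacement base strings: sq with '?' at k positions."""
    patterns = []
    for positions in combinations(range(len(sq)), k):
        chars = list(sq)
        for p in positions:
            chars[p] = "?"
        patterns.append("".join(chars))
    return patterns

def intersect(p1, p2):
    """Pattern for the strings matched by both p1 and p2 (same query string)."""
    return "".join(b if a == "?" else a for a, b in zip(p1, p2))

def node_coefficients(bases):
    """Signed number of times each node appears among all r-intersections."""
    coeff = Counter()
    for r in range(1, len(bases) + 1):
        sign = 1 if r % 2 else -1
        for combo in combinations(bases, r):
            coeff[reduce(intersect, combo)] += sign
    return coeff

def partition_by_coefficient(coeff):
    """Group nodes sharing a coefficient; nodes that cancel out are dropped."""
    parts = defaultdict(list)
    for node, c in coeff.items():
        if c != 0:
            parts[c].append(node)
    return {c: sorted(nodes) for c, nodes in parts.items()}
```

For Ans(“abcd”,2R) this gives three partitions: the six two-wildcard nodes with coefficient 1, the four one-wildcard nodes with coefficient -2, and “abcd” with coefficient 3. The estimate is then the sum over partitions of Cq times the total frequency of the partition's nodes. Enumerating all subsets is exponential in the number of base strings, which is acceptable only for the small cases shown here.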

Frequency Estimation • Estimate the frequency of an extended q-gram from the extended N-gram table • Maximal Overlap (MO) [4] • Finds the substring in the table that has the maximum overlap with sq • MAX approach • If MO(“abc?”) < MO(“abcd”), then use MO(“abcd”) as the estimate for “abc?” • MO+ • Find the substring with the minimum frequency • MM • Combination of MAX and MO+
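The MAX correction rests on a simple fact: a pattern with a wildcard matches every string its specializations match, so its frequency cannot be lower than theirs. The sketch below applies that rule to a dictionary of raw estimates. It does not implement MO itself; the input values stand for MO estimates obtained elsewhere, and generalizing the slide's single “abc?” versus “abcd” case to all covered patterns in the dictionary is an assumption.

```python
def covers(general, specific):
    """True if every string matching `specific` also matches `general`."""
    return len(general) == len(specific) and all(
        g == "?" or g == s for g, s in zip(general, specific))

def max_adjust(raw_estimates):
    """Raise each estimate to the largest raw estimate among the patterns
    it covers (every pattern covers itself)."""
    return {
        p: max(est for q, est in raw_estimates.items() if covers(p, q))
        for p in raw_estimates
    }
```

With raw estimates of 3 for “abc?” and 5 for “abcd”, the adjusted estimate for “abc?” becomes 5, while “abcd” is unchanged.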

Estimate selectivity (cont.) • BasicEQ is efficient if the general formulas are applicable • Propose OptEQ, which adds two enhancements to BasicEQ • Approximates the coefficient Cq in exchange for better performance • Groups the set of strings obtained in each iteration of BasicEQ to speed up the tests for empty intersections

Experimental Evaluation (method, NB, NE, PT)

Experimental Evaluation

Experimental Evaluation • Space vs. Accuracy

Conclusions • Proposed OptEQ • Approximates coefficients of partitions • Groups semi-lattices to obtain scalability • More accurate than SEPIA • Exploits disk space to give higher precision • MM and MAX estimates give good results

References
[1] H. Lee, R. T. Ng, and K. Shim, “Extending Q-grams to Estimate Selectivity of String Matching with Low Edit Distance”, VLDB 2007
[2] S. Chaudhuri, V. Ganti, and L. Gravano, “Selectivity Estimation for String Predicates: Overcoming the Underestimation Problem”, ICDE 2004
[3] L. Jin and C. Li, “Selectivity Estimation for Fuzzy String Predicates in Large Data Sets”, VLDB 2005
[4] H. V. Jagadish, R. T. Ng, and D. Srivastava, “Substring Selectivity Estimation”, PODS 1999
