Efficient Algorithms for Approximate Member Extraction Using Signature-based Inverted Lists

Jialong Han • Co-authored with Jiaheng Lu, Xiaofeng Meng • Renmin University of China

Introduction: An Example • A dictionary of strings we are interested in • E.g. product names, postal addresses… • We want to locate their “approximate appearances” in a series of documents. • See the meaning of “approximate appearance” in the following example:

Problem Definition • Given a dictionary R and a threshold δ, extract all proper substrings m from input documents S such that there exists r ∈ R with Similarity(r, m) ≥ δ (or Distance(r, m) ≤ k). • Here we call r a piece of evidence for m. • Similarity() is a function measuring the similarity of two strings • Strings are viewed as sets of tokens (words) • An example of Sim() is Jaccard similarity
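As a minimal sketch of the similarity function named above, the following computes plain (unweighted) Jaccard similarity over token sets. The whitespace tokenizer and the function names are assumptions made for illustration; the slides do not specify them.

```python
def tokens(s):
    """Split a string into a set of lowercase word tokens (assumed tokenizer)."""
    return frozenset(s.lower().split())


def jaccard(r, m):
    """Jaccard similarity |r ∩ m| / |r ∪ m| over the token sets of r and m."""
    a, b = tokens(r), tokens(m)
    if not a and not b:
        return 1.0  # convention chosen here for two empty strings
    return len(a & b) / len(a | b)


r = "canon eos 5d digital camera"   # dictionary string
m = "canon eos digital camera"      # candidate substring of a document
print(jaccard(r, m))                # 4 shared tokens out of 5 distinct tokens
```

With a threshold δ = 0.8, this pair would count as a match, so r would be a piece of evidence for m.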

Outline • Introduction • State-of-the-art techniques • The filtration-verification framework • K-signature scheme • Inverted Signature-based Hashtable • Our algorithms and evaluations • Conclusion

Why pre-pruning is needed • We need to spot evidence to decide whether a substring m should be extracted • Simply verifying m against all dictionary strings may be inefficient • Pruning first and verifying afterwards is beneficial • But should pruning be oriented towards running speed or towards filtering power? • Less time or fewer survivors?

The issue of compromise comes again • Balance between the two stages should be reached: • Strong (weak) filtration power costs more (less) filtration time • Fewer (more) candidates mean less (more) verification time • Overall performance = Tf + Tv

K-signature scheme • Proposed by Chakrabarti et al. (SIGMOD 2008) • Choose several top-weighted tokens in a string as signatures to represent it: s => Sig(s) • Observation: if r cannot match m, r is likely to have insufficient signature overlap with m • K is a parameter for tuning filtration power • Potential evidence loss • A counter-example was found for k=3 • We could only prove that it works for k=1 and k=∞

Inverted Signature-based Hashtable • Proposed by Chakrabarti et al. (SIGMOD 2008) • Each dictionary string is encoded into a solid 0-1 matrix • A ‘1’ for each occurrence of a <token, sig-token> tuple (‘1’-rectangle) • Bitwise-or all solid matrices to get the matrix of R • Observation: if m is an approximate member of R, the matrix of m must have enough intersections with that of R. • Formalized into an NP-complete problem • The solution yields filtering power that is too weak

Outline • Introduction • State-of-the-art techniques • Our algorithms and evaluations • Corrected filtering conditions • EvSCAN: Filtration by SIL • EvITER: Incremental optimization on EvSCAN • Supporting Dynamic Thresholds • Conclusion

If Sim(m, r) ≥ δ, what do we have? • wt(Sig(m)∩Sig(r)) ≥ τ(m): too strict! • wt(Sig(m)∩Sig(r)) ≥ min{τ(m), τ(r)}: our proposed theorem, proved by us • So the threshold does not remain constant • It involves the unknown evidence r • Our solution: use inverted lists to count sig-token overlaps. • Note that sig-tokens usually have low document frequency (e.g. with IDF as weights)

Signature-based Inverted Lists • Lists indexed by sig-tokens • Each sig-token of a string creates a node (containing the string’s id) in the corresponding list. • E.g. R = { r1 = “canon eos 5d digital camera”, r2 = “nikon digital slr camera”, r3 = “canon slr camera” }. • wt(digital, camera, canon, nikon, slr, eos, 5d) = (1, 1, 2, 2, 2, 7, 9). • [Figure: inverted lists headed by the sig-tokens 5d (9.0), canon (2.0), camera (1.0), eos (7.0), nikon (2.0), slr (2.0), each holding the ids of the strings that have that sig-token]
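The list structure described on this slide can be sketched as a dictionary from sig-token to string ids. This is only an illustration: the signature sets in `SIGS` are hypothetical, since the slides do not state which tokens each string keeps as sig-tokens, and `build_sil` is a name invented here.

```python
from collections import defaultdict

# Token weights from the slide's example.
WT = {"digital": 1.0, "camera": 1.0, "canon": 2.0, "nikon": 2.0,
      "slr": 2.0, "eos": 7.0, "5d": 9.0}

# Hypothetical signature sets Sig(r) for r1, r2, r3 (assumed, not from the slides).
SIGS = {
    1: {"5d", "eos", "canon"},
    2: {"nikon", "slr"},
    3: {"canon", "slr", "camera"},
}


def build_sil(sigs):
    """Map each sig-token to the ascending ids of strings whose signature contains it."""
    lists = defaultdict(list)
    for rid in sorted(sigs):          # ascending ids keep every list sorted
        for tok in sigs[rid]:
            lists[tok].append(rid)    # one node per (sig-token, string id)
    return dict(lists)


sil = build_sil(SIGS)
for tok in sorted(sil):
    print(tok, WT[tok], sil[tok])
```

Tokens that are never chosen as a sig-token (here `digital`) get no list at all, which is what keeps the index small.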

Filtration by SIL • Using an array called the “accumulator” to compute the overlapped sig weight wt(Sig(m)∩Sig(r)) • E.g. m = “canon eos digital camera”, δ = 0.8 • [Figure: the lists for 5d (9.0), canon (2.0), camera (1.0), eos (7.0), nikon (2.0) and slr (2.0) are scanned into the accumulator, with a “Qualified!” callout]
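A rough sketch of accumulator-based filtration follows, using the condition wt(Sig(m)∩Sig(r)) ≥ min{τ(m), τ(r)} from the earlier slide. The slides do not define τ or Sig(m) numerically, so the list contents, `sig_m`, `tau_m` and `tau_r` below are all hypothetical inputs, and `filter_candidates` is a name invented for this example.

```python
from collections import defaultdict

# Token weights from the slide's example.
WT = {"camera": 1.0, "canon": 2.0, "nikon": 2.0, "slr": 2.0, "eos": 7.0, "5d": 9.0}

# Hypothetical signature-based inverted lists: sig-token -> string ids.
SIL = {"5d": [1], "eos": [1], "canon": [1, 3], "nikon": [2],
       "slr": [2, 3], "camera": [3]}


def filter_candidates(sig_m, sil, wt, tau_m, tau_r):
    """Return ids r with wt(Sig(m) ∩ Sig(r)) >= min(tau_m, tau_r[r]).

    Strings sharing no sig-token with m never enter the accumulator,
    which assumes every threshold is positive.
    """
    acc = defaultdict(float)              # accumulator: rid -> overlapped sig weight
    for tok in sig_m:
        for rid in sil.get(tok, ()):      # scan the list of each sig-token of m
            acc[rid] += wt[tok]
    return sorted(rid for rid, w in acc.items() if w >= min(tau_m, tau_r[rid]))


sig_m = {"eos", "canon"}                  # hypothetical Sig(m)
tau_r = {1: 8.0, 2: 3.0, 3: 3.0}          # hypothetical per-string thresholds
print(filter_candidates(sig_m, SIL, WT, tau_m=8.0, tau_r=tau_r))
```

Only the surviving ids need to go through full verification, so the cost of this scan is bounded by the lengths of the lists for the sig-tokens of m.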

EvITER: Progressive Computation • Recall that we are checking all substrings • Some of them are quite similar, so they share duplicate computation • An intuition: if m has potential evidence r, then m t (m followed by the next token t) is very likely to match r • Formally, we proved the following • Let ES(m) be the set of “potential evidence” for m, and list[t] = {s | s is a dictionary string that contains token t} • We have ES(m t) ⊆ ES(m) ∪ list[t]

Example • Document M: “… canon eos digital camera lens …”, with m = “canon eos digital camera” and t = “lens” • ES(m) = {r1}; list[t] for lens (weight 3.0) holds r22 and r53 • We know that only r1, r22 and r53 can possibly match “canon eos digital camera lens”
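The containment ES(m t) ⊆ ES(m) ∪ list[t] suggests an incremental update, sketched below. This is an illustration rather than the EvITER algorithm itself: `extend_evidence` and the `is_potential` callback are hypothetical names, and the predicate used in the example is a stand-in for the real filtering condition.

```python
def extend_evidence(es_m, t, lists, is_potential):
    """Compute ES(m t) by testing only the members of ES(m) ∪ list[t].

    es_m         -- ids of the potential evidence already known for m
    t            -- the token appended to m
    lists        -- inverted lists: token -> ids of strings containing it
    is_potential -- hypothetical predicate deciding whether an id still
                    qualifies as potential evidence for the extended string
    """
    pool = set(es_m) | set(lists.get(t, ()))   # nothing outside this pool can qualify
    return {rid for rid in pool if is_potential(rid)}


# Data from the example: ES(m) = {r1}, and the list for "lens" holds r22 and r53.
lists = {"lens": [22, 53]}
checked = []


def demo_predicate(rid):
    """Stand-in check that records which strings were examined."""
    checked.append(rid)
    return rid != 53                            # arbitrary outcome for the demo


es_mt = extend_evidence({1}, "lens", lists, demo_predicate)
print(sorted(es_mt), sorted(checked))
```

However large the dictionary is, only the three strings in the pool are examined when m is extended by one token.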

Flow of Evidence • EvITER for “Evidence ITERATION” …

The Static Threshold Problem • How does this index work so far? • -“Get ready for δ=0.8 please.” • -“Please wait 30min for index generation…” • -“Ready!” • -“Document M1, δ=0.8. Go!” • -“…Extraction complete.” • -“Document M2, and I want δ=0.9…” • -“Sorry, please wait another 30min for index regeneration…” • -“:-(”

The Static Threshold Problem • This One Seems Better • -“Get ready for δ≥0.8 please.” • -“Please wait 30min for index generation…” • -“Ready!” • -“Document M1, δ=0.8. Go!” • -“…Extraction complete.” • -“Document M2, and I want δ=0.9…” • -“…Extraction complete.” • -“:-)”

Supporting Dynamic Thresholds • An Observation • As δ decreases, a string r’s tokens fall into Sig(r) one by one, in the order of their weight ranking. • I.e. any node <sig-token, rid> is “active” when δ is below a certain “threshold” u<sig-token, rid>. • We record u<sig-token, rid> in each node and sort the nodes in each list in descending order of their u value. • For any given δ, we only need to retrieve a prefix of each list to get all “active nodes”
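The prefix-retrieval idea above can be sketched with one sorted list and a binary search. The class name `ThresholdList`, the u values in the example, and the strict comparison (a node is active when δ < u) are assumptions made for illustration.

```python
from bisect import bisect_left


class ThresholdList:
    """One inverted list whose nodes are sorted by descending threshold u."""

    def __init__(self, nodes):
        # nodes: iterable of (u, rid); a node is active for any delta < u
        self.nodes = sorted(nodes, key=lambda n: -n[0])
        self._neg_u = [-u for u, _ in self.nodes]    # ascending keys for bisect

    def active(self, delta):
        """Return the rids in the active prefix, i.e. all nodes with u > delta."""
        end = bisect_left(self._neg_u, -delta)       # first position with u <= delta
        return [rid for _, rid in self.nodes[:end]]


# Hypothetical u values for three nodes of one sig-token's list.
lst = ThresholdList([(0.70, 2), (0.95, 1), (0.85, 3)])
print(lst.active(0.8))
print(lst.active(0.9))
```

Because the list is built once and only the prefix length depends on δ, a query with a different δ needs no index regeneration, as long as δ stays within the range the index was built for.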

Experimental Datasets • DBLP: 274,788 Paper titles • 1,838,973 URLs

Balance should be reached • Recall our two stages of filtration and verification

Performance (DBLP)

Conclusion • Our method causes no false negatives • Our method achieves a good balance between the two phases of filtration and verification • We also propose EvITER to eliminate duplicate computation • Our method is both effective and efficient

Thank You ! Q&A

References • [1] A. Arasu, V. Ganti, and R. Kaushik. Efficient exact set-similarity joins. In VLDB, pages 918-929, 2006. • [2] K. Chakrabarti, S. Chaudhuri, V. Ganti, and D. Xin. An efficient filter for approximate membership checking. In SIGMOD Conference, 2008. • [3] A. Chandel, P. C. Nagesh, and S. Sarawagi. Efficient batch top-k search for dictionary-based entity recognition. In ICDE, page 28, 2006. • [4] S. Chaudhuri, V. Ganti, and R. Kaushik. A primitive operator for similarity joins in data cleaning. In ICDE, page 5, 2006. • [5] M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. • [6] L. Gravano, P. G. Ipeirotis, H. V. Jagadish, N. Koudas, S. Muthukrishnan, and D. Srivastava. Approximate string joins in a database (almost) for free. In VLDB, pages 491-500, 2001.

References • [7] C. Li, J. Lu, and Y. Lu. Efficient merging and filtering algorithms for approximate string searches. In ICDE, pages 257-266, 2008. • [8] C. Li, B. Wang, and X. Yang. VGRAM: Improving performance of approximate queries on string collections using variable-length grams. In VLDB, 2007. • [9] G. Navarro. A guided tour to approximate string matching. ACM Comput. Surv., 33(1):31-88, 2001. • [10] S. Sarawagi and A. Kirpal. Efficient set joins on similarity predicates. In SIGMOD Conference, 2004. • [11] A. Singhal. Modern information retrieval: A brief overview. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 24(4):35-43, 2001. • [12] E. Sutinen and J. Tarhio. On using q-gram locations in approximate string matching. In ESA, pages 327-340, 1995. • [13] W. Wang, C. Xiao, X. Lin, and C. Zhang. Efficient approximate entity extraction with edit distance constraints. In SIGMOD Conference, 2009.
