Efficient Algorithms for Approximate Member Extraction Using Signature-based Inverted Lists
This paper explores efficient algorithms for approximate member extraction from a dictionary of strings using signature-based inverted lists. The focus is on a filtration and verification framework that utilizes a k-signature scheme for effective pre-pruning and post-verification. The goal is to identify proper substrings in input documents that closely match dictionary entries based on a defined similarity threshold. Key contributions include the improved filtration conditions and dynamic thresholds to optimize performance, helping to balance speed and filtering power in substring extraction tasks.
Efficient Algorithms for Approximate Member Extraction Using Signature-based Inverted Lists • Jialong Han, co-authored with Jiaheng Lu and Xiaofeng Meng • Renmin University of China
Introduction: An Example • A dictionary of strings we are interested in • E.g. product names, postal addresses… • We want to locate their “approximate appearances” in a series of documents. • See the meaning of “approximate appearance” in the following example:
Problem Definition • Given a dictionary R and a threshold δ, extract all proper substrings m from input documents S such that there exists r ∈ R with Similarity(r, m) ≥ δ (or Distance(r, m) ≤ k). • Here we call r a piece of evidence for m. • Similarity() is a function measuring the similarity of two strings • Strings are viewed as sets of tokens (words) • An example for Sim(): Jaccard similarity:
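The formula that followed on this slide did not survive in the transcript. As a minimal sketch, assuming the standard set-based definition Sim(r, m) = |r ∩ m| / |r ∪ m| and, optionally, a token-weighted variant wt(r ∩ m) / wt(r ∪ m) in the spirit of the wt() notation used on later slides, Jaccard similarity over token sets could be computed as follows. The function name and the `weight` parameter are illustrative assumptions, not taken from the talk.

```python
def jaccard(r_tokens, m_tokens, weight=None):
    """Jaccard similarity of two token sets.

    If `weight` (a dict token -> weight) is given, set sizes are replaced
    by summed token weights; this weighted form is an assumption.
    """
    r, m = set(r_tokens), set(m_tokens)
    if weight is None:
        size = len
    else:
        size = lambda tokens: sum(weight[t] for t in tokens)
    union = size(r | m)
    # Two empty strings are treated as identical.
    return size(r & m) / union if union else 1.0


r1 = "canon eos 5d digital camera".split()
m = "canon eos digital camera".split()
print(jaccard(r1, m))  # 4 shared tokens out of 5 distinct tokens
```

With the unweighted form, the substring “canon eos digital camera” reaches a similarity of 4/5 against r1 = “canon eos 5d digital camera”.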
Outline • Introduction • State-of-the-art techniques • The filtration-verification framework • K-signature scheme • Inverted Signature-based Hashtable • Our algorithms and evaluations • Conclusion
Why pre-pruning is needed • We need to spot evidence to decide whether a substring m should be extracted • Simply verifying m against all dictionary strings may be inefficient • Pre-pruning followed by post-verification is beneficial • But should it be running-speed-oriented or filtering-power-oriented? • Less time or fewer survivors?
The issue of compromise comes again • A balance between the two stages should be reached: • More (less) filtration time → strong (weak) filtration power → fewer (more) candidates → less (more) verification time • Overall performance = Tf + Tv?
K-signature scheme • Proposed by Chakrabarti et al. (SIGMOD 2008) • Choose several top-weighted tokens in a string as signatures to represent it: s => Sig(s) • Observation: if r cannot match m, r is likely to have insufficient signature overlap with m • K is a parameter for tuning the filtration power • Potential evidence loss • A counter-example was found for k=3 • We tried, and could only prove that it works for k=1 and k=∞
Inverted Signature-based Hashtable • Proposed by Chakrabarti et al. (SIGMOD 2008) • Each dictionary string is encoded into a solid 0-1 matrix • A ‘1’ for each occurrence of a <token, sig-token> tuple (‘1’-rectangle) • Bitwise-or all solid matrices to get the matrix of R • Observation: if m is an approximate member of R, the matrix of m must have enough intersections with that of R. • Formalized into an NP-complete problem • The solution yields filtering power that is too weak
Outline • Introduction • State-of-the-art techniques • Our algorithms and evaluations • Corrected filtering conditions • EvSCAN: Filtration by SIL • EvITER: Incremental optimization on EvSCAN • Supporting Dynamic Thresholds • Conclusion
Our proposed theorem • If Sim(m, r) ≥ δ, what do we have? • wt(Sig(m)∩Sig(r)) ≥ τ(m): too strict! • wt(Sig(m)∩Sig(r)) ≥ min{τ(m), τ(r)}: proved by us • So the threshold does not remain constant: it involves the unknown evidence r • Our solution: use inverted lists to count sig-token overlaps. • Note that sig-tokens usually have low document frequency (e.g. with IDF as weights)
Signature-based Inverted Lists • Lists indexed by sig-tokens • Each sig-token of a string creates a node (containing the string’s id) in the corresponding list. • E.g. R = { r1 = “canon eos 5d digital camera”, r2 = “nikon digital slr camera”, r3 = “canon slr camera” }. • wt(digital, camera, canon, nikon, slr, eos, 5d) = (1, 1, 2, 2, 2, 7, 9). • [Figure: inverted lists headed by the sig-tokens 5d (9.0), canon (2.0), camera (1.0), eos (7.0), nikon (2.0) and slr (2.0), whose nodes hold the ids of r1, r2 and r3]
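As an illustration of the structure described on this slide, here is a minimal sketch that builds one list per sig-token, each holding the ids of the strings whose signature contains that token. The signatures are passed in precomputed, and the `signatures` values below are hypothetical: the transcript does not preserve which tokens were actually selected as signatures for r1, r2 and r3.

```python
from collections import defaultdict


def build_sil(signatures):
    """Build signature-based inverted lists.

    signatures: dict mapping a string id to its set of sig-tokens.
    Returns a dict mapping each sig-token to a list of string ids,
    in ascending id order.
    """
    sil = defaultdict(list)
    for rid in sorted(signatures):
        for token in signatures[rid]:
            sil[token].append(rid)
    return dict(sil)


# Hypothetical signatures, chosen only to illustrate the layout.
signatures = {1: {"5d", "eos"}, 2: {"nikon", "slr"}, 3: {"canon", "slr"}}
sil = build_sil(signatures)
print(sil["slr"])  # ids of the strings that have "slr" as a sig-token
```

Tokens that are never chosen as a signature (here “digital” and “camera”) get no list at all, which keeps the index small.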
Filtration by SIL • Use an array called the “accumulator” to compute the overlapped signature weight wt(Sig(m)∩Sig(r)) • E.g. m = “canon eos digital camera”, δ = 0.8 • [Figure: the lists for 5d (9.0), canon (2.0), camera (1.0), eos (7.0), nikon (2.0) and slr (2.0) are scanned into the accumulator, and a candidate is marked “Qualified!”]
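The accumulator scan can be sketched roughly as follows. This is an illustrative reading of the slide, not the authors' implementation: for every sig-token of m, the corresponding list is walked and the token's weight is added to the accumulator entry of each string id found there. A string survives if its accumulated weight reaches min{τ(m), τ(r)}, the condition stated in the theorem slide. How τ is computed is not given in the transcript, so `tau_m` and `tau_r` are plain inputs here, and their values, the index contents and Sig(m) are hypothetical.

```python
from collections import defaultdict


def filter_by_sil(sil, weight, sig_m, tau_m, tau_r):
    """Return {rid: overlapped sig weight} for strings passing the filter.

    sil:    dict sig-token -> list of string ids
    weight: dict token -> weight
    sig_m:  set of sig-tokens of the substring m
    tau_m:  threshold of m; tau_r: dict rid -> threshold of r
    """
    accumulator = defaultdict(float)
    for token in sig_m:
        for rid in sil.get(token, ()):
            accumulator[rid] += weight[token]
    return {
        rid: w
        for rid, w in accumulator.items()
        if w >= min(tau_m, tau_r[rid])
    }


weight = {"digital": 1.0, "camera": 1.0, "canon": 2.0, "nikon": 2.0,
          "slr": 2.0, "eos": 7.0, "5d": 9.0}
# Hypothetical index, signature of m, and thresholds.
sil = {"5d": [1], "eos": [1], "nikon": [2], "slr": [2, 3], "canon": [3]}
tau_r = {1: 9.0, 2: 4.0, 3: 4.0}
print(filter_by_sil(sil, weight, {"eos", "canon"}, 7.0, tau_r))  # {1: 7.0}
```

Only strings that share at least one sig-token with m are ever touched, which is why the scan is cheap when sig-tokens have low document frequency.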
EvITER: Progressive Computation • Recall that we are checking all substrings • Some of them are quite similar, so they share duplicate computation • An intuition: if m has potential evidence r, then m·t (m extended by one more token t) is very likely to match r • Formally, we proved the following • Let ES(m) be the set of “potential evidence” for m, and list[t] = {s | s is a dictionary string that contains token t} • We have ES(m·t) ⊆ ES(m) ∪ list[t]
Example • Document M: “…. cannon eos digital camera lens…”, with m = “cannon eos digital camera” and t = “lens” • ES(m) = {r1}; list[lens] (weight 3.0) = {r22, r53} • We know that only r1, r22 and r53 can possibly match “cannon eos digital camera lens”
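The containment property, that the potential evidence of a substring extended by one token t lies within ES(m) ∪ list[t], suggests an incremental step like the sketch below. It only illustrates the idea: `passes_filter` is a hypothetical callback standing in for whatever filtering condition is applied to the extended substring, and the ids mirror the example (r1, r22, r53).

```python
def extend_evidence(es_m, list_t, passes_filter):
    """Potential evidence for m extended by token t.

    Only ids in ES(m) ∪ list[t] are examined, instead of rescanning
    every inverted list for the extended substring.
    """
    candidates = set(es_m) | set(list_t)
    return {rid for rid in candidates if passes_filter(rid)}


es_m = {1}            # ES("cannon eos digital camera")
list_lens = [22, 53]  # list["lens"]
# Hypothetical filter outcome: assume r53 fails the condition.
es_mt = extend_evidence(es_m, list_lens, lambda rid: rid != 53)
print(sorted(es_mt))  # [1, 22]
```

Repeating this step token by token lets each substring reuse the candidate set of its one-token-shorter prefix.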
Flow of Evidence • EvITER for “Evidence ITERATION” …
The Static Threshold Problem • How does this index work so far? • -“Get ready for δ=0.8 please.” • -“Please wait 30min for index generation…” • -“Ready!” • -“Document M1, δ=0.8. Go!” • -“…Extraction complete.” • -“Document M2, and I want δ=0.9…” • -“Sorry, please wait another 30min for index regeneration…” • -“:-(”
The Static Threshold Problem • This One Seems Better • -“Get ready for δ>=0.8 please.” • -“Please wait 30min for index generation…” • -“Ready!” • -“Document M1, δ=0.8. Go!” • -“…Extraction complete.” • -“Document M2, and I want δ=0.9…” • -“…Extraction complete.” • -“:-)”
Supporting Dynamic Thresholds • An Observation • When δ descends, a string r’s tokens fall into Sig(r) one by one, in the order of their weight ranking. • I.e. any node <sig-token, rid> is “active” when δ is below a certain “threshold” u<sig-token, rid>. • We record u<sig-token, rid> in each node and sort all nodes in each list in descending order of their u value. • For any given δ, we only need to retrieve a prefix of each list to get all “active nodes”
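The prefix retrieval can be sketched as follows, assuming a node is active exactly when δ < u (the slide says “below”; whether the bound is strict is an assumption). Each list is stored sorted by descending u, so the active nodes for any δ form a prefix whose length is found by binary search. The helper names and the u values are hypothetical.

```python
from bisect import bisect_left


def build_list(nodes):
    """nodes: iterable of (rid, u) pairs for one sig-token.

    Returns two parallel lists sorted by descending u: the negated u
    values (ascending, so bisect can be used) and the string ids.
    """
    ordered = sorted(nodes, key=lambda node: -node[1])
    return [-u for _, u in ordered], [rid for rid, _ in ordered]


def active_rids(neg_us, rids, delta):
    """Ids of the nodes with u > delta, i.e. a prefix of the list."""
    return rids[:bisect_left(neg_us, -delta)]


# Hypothetical u values for three nodes of one list.
neg_us, rids = build_list([(3, 0.6), (1, 0.95), (2, 0.85)])
print(active_rids(neg_us, rids, 0.8))  # [1, 2]
print(active_rids(neg_us, rids, 0.9))  # [1]
```

An index built once for the lowest supported δ can therefore serve any higher δ without regeneration, which matches the second dialogue above.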
Experimental Datasets • DBLP: 274,788 Paper titles • 1,838,973 URLs
Balance should be reached • Recall our two stages of filtration and verification
Conclusion • Our method causes no false negatives • Our method achieves a good balance between the two phases of filtration and verification • We also propose EvITER to eliminate duplicate computation • Our method is both effective and efficient
Thank You ! Q&A
References • [1] A. Arasu, V. Ganti, and R. Kaushik. Efficient exact set-similarity joins. In VLDB, pages 918-929, 2006. • [2] K. Chakrabarti, S. Chaudhuri, V. Ganti, and D. Xin. An efficient filter for approximate membership checking. In SIGMOD Conference, 2008. • [3] A. Chandel, P. C. Nagesh, and S. Sarawagi. Efficient batch top-k search for dictionary-based entity recognition. In ICDE, page 28, 2006. • [4] S. Chaudhuri, V. Ganti, and R. Kaushik. A primitive operator for similarity joins in data cleaning. In ICDE, page 5, 2006. • [5] M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. • [6] L. Gravano, P. G. Ipeirotis, H. V. Jagadish, N. Koudas, S. Muthukrishnan, and D. Srivastava. Approximate string joins in a database (almost) for free. In VLDB, pages 491-500, 2001.
References • [7] C. Li, J. Lu, and Y. Lu. Efficient merging and filtering algorithms for approximate string searches. In ICDE, pages 257-266, 2008. • [8] C. Li, B. Wang, and X. Yang. VGRAM: Improving performance of approximate queries on string collections using variable-length grams. In VLDB, 2007. • [9] G. Navarro. A guided tour to approximate string matching. ACM Comput. Surv., 33(1):31-88, 2001. • [10] S. Sarawagi and A. Kirpal. Efficient set joins on similarity predicates. In SIGMOD Conference, 2004. • [11] A. Singhal. Modern information retrieval: A brief overview. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 24(4):35-43, 2001. • [12] E. Sutinen and J. Tarhio. On using q-gram locations in approximate string matching. In ESA, pages 327-340, 1995. • [13] W. Wang, C. Xiao, X. Lin, and C. Zhang. Efficient approximate entity extraction with edit distance constraints. In SIGMOD Conference, 2009.