Efficient Approximate Entity Extraction with Edit Distance Constraints

Wei Wang (1), Chuan Xiao (1), Xuemin Lin (1) and Chengqi Zhang (2)
(1) University of New South Wales and NICTA
(2) University of Technology, Sydney

Named Entity Recognition
• Dictionary-based NER
• Dictionary of Entities
  • Isaac Newton, Sigmund Freud, English, Austrian, physicist, mathematician, astronomer, philosopher, alchemist, theologian, psychiatrist, economist, historian, sociologist, ...
• Documents
  • 1: Sir Isaac Newton was an English physicist, mathematician, astronomer, natural philosopher, alchemist, and theologian and one of the most influential men in human history. His Philosophiæ Naturalis Principia Mathematica, published in 1687, is by itself considered to be among the most influential books in the history of science, laying the groundwork for most of classical mechanics.
  • 2: Sigmund Freud was an Austrian psychiatrist who founded the psychoanalytic school of psychology. Freud is best known for his theories of the unconscious mind and the defense mechanism of repression and for creating the clinical practice of psychoanalysis for curing psychopathology through dialogue between a patient and a psychoanalyst.

Approximate Entity Extraction
• What if data are not cleaned or standardized?
  • due to typos, multiple representations, etc.
• Example: multiple representations (all should match the same entity)
  • al qaeda
  • al qaida
  • al-qaeda
  • al-qa’ida
• Using similarity measures
  • token-based measures: Jaccard
  • e.g.
    • x = {al, qaeda}, y = {al, qaida}
    • J(x, y) = 1/3 ≈ 0.33
  • If we set the threshold to 0.33,
    • it works well for entities with several tokens,
    • but {al, qaeda} will also match {al, gore}, since their Jaccard similarity is again 1/3!
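The Jaccard example can be checked with a few lines of Python. This is only an illustrative sketch; the function name `jaccard` and the variable names are not from the slides.

```python
def jaccard(x: set, y: set) -> float:
    """Jaccard similarity |x ∩ y| / |x ∪ y| of two token sets."""
    if not x and not y:
        return 1.0  # convention chosen here for two empty sets
    return len(x & y) / len(x | y)


al_qaeda = {"al", "qaeda"}
al_qaida = {"al", "qaida"}
al_gore = {"al", "gore"}

# Both pairs share exactly one token out of three distinct tokens,
# so a token-based threshold cannot tell them apart.
print(jaccard(al_qaeda, al_qaida))
print(jaccard(al_qaeda, al_gore))
```

Both calls give the same score, which is why the slides move on to a character-level measure.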

Using Edit Distance Constraints
• Using string-based measures
  • edit distance
• Problem Definition
  • Given a document R and a dictionary E of entities, the task of approximate entity extraction with edit distance threshold d is to find all substrings in R that are within edit distance d of one of the entities in E.
  • $\{\, (R[i..j],\, E_k) \mid E_k \in E,\ \mathrm{ed}(R[i..j],\, E_k) \le d \,\}$
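To make the problem definition concrete, here is a brute-force baseline that simply tests every substring of suitable length against every entity. It is not the method proposed in the talk, only a reference point; `edit_distance` and `naive_extract` are illustrative names, and positions are 0-based with an exclusive end, which is a choice made for this sketch.

```python
def edit_distance(s: str, t: str) -> int:
    """Levenshtein distance via the standard dynamic program (two rows)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        cur = [i]
        for j, ct in enumerate(t, start=1):
            cur.append(min(prev[j] + 1,                # delete cs
                           cur[j - 1] + 1,             # insert ct
                           prev[j - 1] + (cs != ct)))  # substitute or match
        prev = cur
    return prev[-1]


def naive_extract(doc: str, entities: list, d: int) -> list:
    """Return (i, j, entity) for every doc[i:j] within edit distance d."""
    results = []
    for e in entities:
        # A match can only have a length in [len(e) - d, len(e) + d].
        for length in range(max(1, len(e) - d), len(e) + d + 1):
            for i in range(len(doc) - length + 1):
                if edit_distance(doc[i:i + length], e) <= d:
                    results.append((i, i + length, e))
    return results


print(naive_extract("al qaida", ["qaeda"], d=1))
```

The cost of this baseline grows with the number of entities times the number of substrings, which is what the filtering techniques on the following slides try to avoid.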

Previous Approaches
• q-gram based method
  • count filtering
    • at least LB(s,t) common q-grams, where LB(s,t) = max(|s|, |t|) - q + 1 - q*d
    • at most q*d q-grams are destroyed by d edit operations
  • position filtering
    • positions of common q-grams should be within d
  • length filtering
    • | len(s) - len(t) | ≤ d
• Steps
  • index the q-grams of the entities
  • probe the index with the q-grams of each substring (query) of the document → form candidates
  • verify the candidates
• Example: q = 3
  • Rhode_Island → Rho, hod, ode, de_, e_I, _Is, Isl, sla, lan, and
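The count filter can be sketched as follows. The helper names (`qgrams`, `count_lower_bound`, `passes_count_filter`) are illustrative, and the misspelling `Rhode_Izland` is a made-up test string rather than an example from the slides.

```python
from collections import Counter


def qgrams(s: str, q: int) -> list:
    """All overlapping substrings of length q, in order."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def count_lower_bound(s: str, t: str, q: int, d: int) -> int:
    """LB(s, t) = max(|s|, |t|) - q + 1 - q*d from the slide."""
    return max(len(s), len(t)) - q + 1 - q * d


def passes_count_filter(s: str, t: str, q: int, d: int) -> bool:
    """True if s and t share enough q-grams to possibly be within distance d."""
    common = sum((Counter(qgrams(s, q)) & Counter(qgrams(t, q))).values())
    return common >= count_lower_bound(s, t, q, d)


print(qgrams("Rhode_Island", 3))
# One substitution destroys at most q = 3 of the 10 q-grams.
print(passes_count_filter("Rhode_Island", "Rhode_Izland", q=3, d=1))
```

When the lower bound is zero or negative, every pair passes, so the filter prunes nothing; that case matters for short entities.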

Drawbacks of q-gram Based Methods
• entities are short
  • we have to use a small q to ensure the lower bound of matching q-grams is positive
• short q-grams result in poor performance
  • short q-grams are frequent → long inverted lists
  • the lower bound is low for short entities → large candidate size
• It has to try all the queries with length from Lmin - d to Lmax + d at every starting position.
• Document
  • 1: Sir Isaac Newton was an English physicist, mathematician, astronomer, natural philosopher, alchemist, and theologian and one of the most influential men in human history. His Philosophiæ Naturalis Principia Mathematica, published in 1687, is by itself considered to be among the most influential books in the history of science, laying the groundwork for most of classical mechanics.
• Dictionary (Lmin = 9, Lmax = 43)
  • 1: physicist
  • 2: mathematician
  • 3: Philosophiæ Naturalis Principia Mathematica
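Both drawbacks can be quantified with a small sketch. It uses the count-filtering bound max(|s|, |t|) - q + 1 - q*d from the previous slide with |s| = |t|; the function names and the document length of 100 characters are assumptions made for illustration.

```python
def largest_usable_q(length: int, d: int) -> int:
    """Largest q for which length - q + 1 - q*d is still positive (0 if none)."""
    q = 0
    while length - (q + 1) + 1 - (q + 1) * d > 0:
        q += 1
    return q


def query_count(doc_len: int, lmin: int, lmax: int, d: int) -> int:
    """Number of substrings tried when every length in [lmin - d, lmax + d]
    is enumerated at every starting position."""
    lo, hi = max(1, lmin - d), lmax + d
    return sum(max(0, doc_len - length + 1) for length in range(lo, hi + 1))


# For the shortest dictionary entry (9 characters), q must shrink as d grows.
for d in (1, 2, 3):
    print(d, largest_usable_q(9, d))

# Queries generated for a hypothetical 100-character document with d = 2.
print(query_count(100, lmin=9, lmax=43, d=2))
```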

FastSS Algorithm [T. Bocek et al. 2007]
• Basic Idea: Neighborhood Generation
  • generate the variants of each entity and query by enumerating edit operations at every possible position
• Steps
  • enumerate at most d deletions for each entity
  • the resulting strings are called the d-variant family and are inserted into an inverted index
  • generate the d-variant family for each query, probe the index to form candidates, and then verify them
• Example, d = 1
  • e = qaeda
  • q = qaida
  • Ve = {qaeda, aeda, qeda, qada, qaea, qaed}
  • Vq = {qaida, aida, qida, qada, qaia, qaid}
  • Ve and Vq share the variant qada, so e becomes a candidate for q
• Problem
  • the size of the d-variant family for each entity (query) is O(|s|^d)
  • too many variants when entities are long or d is large!
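The d-variant family of the example can be reproduced with a short sketch. The name `deletion_variants` is illustrative; the code only enumerates deletions and leaves out the inverted index and the verification step.

```python
from itertools import combinations


def deletion_variants(s: str, d: int) -> set:
    """All strings obtained from s by deleting at most d characters."""
    variants = set()
    for k in range(min(d, len(s)) + 1):
        for positions in combinations(range(len(s)), k):
            removed = set(positions)
            variants.add("".join(c for i, c in enumerate(s) if i not in removed))
    return variants


ve = deletion_variants("qaeda", 1)
vq = deletion_variants("qaida", 1)
print(sorted(ve))
print(sorted(vq))
print(ve & vq)  # a shared variant makes the pair a candidate
```

A shared variant only yields a candidate; the pair still has to be verified with a real edit distance computation.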

Partitioning Scheme
• How to reduce the number of variants?
  • immediate solution: divide an entity (query) into several partitions
  • generate d-variants within each partition only → guaranteed not to miss any result
• still too many variants?
  • pigeon-hole principle
  • If we consider shifting and scaling, there exists an entity partition and a query partition whose edit distance is at most 1 → generate the 1-variant family for each partition
  • divide each entity (query) into k = ceil[(d+1)/2] partitions
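The choice of k can be checked numerically. If d edit operations are spread over k partitions, some partition receives at most floor(d/k) of them, and k = ceil[(d+1)/2] is the smallest k that brings this down to 1. The sketch below ignores edits that straddle partition boundaries, which is what shifting and scaling are meant to handle; the function names are illustrative.

```python
import math


def num_partitions(d: int) -> int:
    """k = ceil((d + 1) / 2), as on the slide."""
    return math.ceil((d + 1) / 2)


def edits_in_luckiest_partition(d: int, k: int) -> int:
    """Pigeon-hole bound: some partition has at most floor(d / k) edits."""
    return d // k


for d in range(1, 7):
    k = num_partitions(d)
    print(d, k, edits_in_luckiest_partition(d, k))
```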

Partitioning Scheme
• divide each entity (query) into k = ceil[(d+1)/2] partitions
• shift within the range of [-d, d]
• scale within the range of [-2, 2] (it can be proved that 2 is enough)
• shifting and scaling are only needed on entities
• special cases
  • first partition: always starts from the first character, so we only need to consider scaling within [-2, 2]
  • last partition: always ends with the last character, so we only need to consider the same amount of shifting and scaling within [-d, d]

Partitioning Scheme - Example
• Example, d = 3
  • e = abcdefgh
  • q = axxbcdefgyh
• Partitioning (each partition is represented in the form <str, partition_id>)
  • k = 2
  • Pe = {<ab,1>; <abc,1>; <abcd,1>; <abcde,1>; <abcdef,1>; <bcdefgh,2>; <cdefgh,2>; <defgh,2>; <efgh,2>; <fgh,2>; <gh,2>; <h,2>}
  • Pq = {<axxbc,1>; <defgyh,2>}
• Generating 1-variants
  • V{defgh} and V{defgyh} share a common variant ‘defgh’, so this candidate will be identified
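The partitions in this example can be regenerated with the sketch below. Two points are assumptions rather than statements from the slides: base partitions are cut as evenly as possible (boundaries at floor(p*n/k)), and a middle partition (only present when k ≥ 3) combines a shift of its start with a scale of its length. All function names are illustrative.

```python
def one_variants(s: str) -> set:
    """s itself plus every string obtained by deleting one character."""
    return {s} | {s[:i] + s[i + 1:] for i in range(len(s))}


def query_partitions(q: str, k: int) -> list:
    """Split a query into k contiguous partitions, no shifting or scaling."""
    bounds = [p * len(q) // k for p in range(k + 1)]  # assumed even split
    return [(q[bounds[p]:bounds[p + 1]], p + 1) for p in range(k)]


def entity_partitions(e: str, d: int, k: int) -> list:
    """Shifted and scaled partitions of an entity as (str, partition_id)."""
    n = len(e)
    if k == 1:
        return [(e, 1)]
    bounds = [p * n // k for p in range(k + 1)]  # assumed even split
    out = []
    for p in range(k):
        start, end, pid = bounds[p], bounds[p + 1], p + 1
        if p == 0:
            # First partition: start is fixed, length scales within [-2, 2].
            for scale in range(-2, 3):
                if 0 < end + scale <= n:
                    out.append((e[:end + scale], pid))
        elif p == k - 1:
            # Last partition: end is fixed, start shifts within [-d, d].
            for shift in range(-d, d + 1):
                if 0 <= start + shift < n:
                    out.append((e[start + shift:], pid))
        else:
            # Middle partition (assumed behaviour): shift start, scale length.
            for shift in range(-d, d + 1):
                for scale in range(-2, 3):
                    s2, e2 = start + shift, end + shift + scale
                    if 0 <= s2 < e2 <= n:
                        out.append((e[s2:e2], pid))
    return out


pe = entity_partitions("abcdefgh", d=3, k=2)
pq = query_partitions("axxbcdefgyh", k=2)
indexed = {(v, pid) for s, pid in pe for v in one_variants(s)}
probes = {(v, pid) for s, pid in pq for v in one_variants(s)}
print(pe)
print(pq)
print(("defgh", 2) in indexed & probes)
```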

Prefix Pruning
• What if a partition is still quite long?
  • still many 1-variants
  • solution: generate the 1-variant family on the prefix only!
• Prefix Pruning
  • If a partition is longer than a threshold l, we only generate the 1-variant family on its l-prefix.
• Example, l = 5
  • P = abcdefg
  • generate the 1-variant family on its 5-prefix
  • P[1 .. 5] = abcde
  • Vp[1 .. 5] = {abcde, bcde, acde, abde, abce, abcd}
• space complexity: number of variants generated
  • FastSS: O(|s|^d)
  • after partitioning and prefix pruning: O(l * d^2)
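Prefix pruning amounts to truncating a partition before generating its variants, as in the following sketch (function names are illustrative). Deleting each of the five characters of abcde in turn gives bcde, acde, abde, abce and abcd, so the family has l + 1 = 6 members including the prefix itself.

```python
def one_variants(s: str) -> set:
    """s itself plus every string obtained by deleting one character."""
    return {s} | {s[:i] + s[i + 1:] for i in range(len(s))}


def prefix_one_variants(partition: str, l: int) -> set:
    """1-variant family of the l-prefix; short partitions are used whole."""
    return one_variants(partition[:l])


print(sorted(prefix_one_variants("abcdefg", 5)))
# Without pruning, the family grows with the partition length.
print(len(one_variants("abcdefg")), len(prefix_one_variants("abcdefg", 5)))
```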

NGPP Algorithm
• Neighborhood Generation + Partitioning + Prefix Pruning
• Balance between variant size and selectivity
  • different schemes to deal with short and long entities
• Index short and long entities
  • short: for entities that are shorter than k*l+d, we index the d-variant family of the l-prefix (prefix pruning only)
  • long: for entities that are no shorter than k*l, we first divide them into k partitions, and index the 1-variant family of the l-prefix of each partition (partitioning + prefix pruning)
• Scan documents
  • scan for each starting position
  • enumerate the query length from Lmin - d to l
  • generate its d-variant family, search for short entities
  • generate its 1-variant family, search for long entities

NGPP Example
• d = 2, l = 4 → short < 10, long >= 8
• Entities
  • e1 = ‘Providence’ (long)
  • e2 = ‘capital’ (short)
• Document
  • Prowidnce is the kaepital of Rhode Island.
• Index
  • e1 (long): partitions pr, pro, prov, provi, provid; vidence, idence, dence, ence, nce → generate 1-variant family
  • e2 (short): capital → generate d-variant family
• Scan
  • queries: Prow, rowi, owid, …; kaep, …
  • 1-variant match → e1 Providence
  • d-variant match → e2 capital
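The thresholds in this example follow from the rules on the NGPP Algorithm slide: short means shorter than k*l+d and long means no shorter than k*l, with k = ceil[(d+1)/2]. A minimal sketch, with `classify` as an illustrative name:

```python
import math


def classify(entity: str, d: int, l: int) -> tuple:
    """Return (is_short, is_long) according to the NGPP indexing rules."""
    k = math.ceil((d + 1) / 2)
    is_short = len(entity) < k * l + d
    is_long = len(entity) >= k * l
    return is_short, is_long


for e in ("Providence", "capital"):
    print(e, classify(e, d=2, l=4))
```

As the two rules are stated, an entity whose length lies in [k*l, k*l+d) satisfies both conditions; the sketch reports both flags and does not choose between them.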

Experiment Settings
• Algorithms
  • NGPP
  • FastSS
  • q-gram based method
• Measure
  • number of variants, candidate size, running time
• Dataset

Experiment Results
• NGPP vs FastSS
  • DBLP; d = 2

Experiment Results
• NGPP vs q-gram based method
  • DBLP; d = 1, 2, 3
  • [Figure panels: Candidate Size; Running Time]

Conclusion
• Contributions
  • an efficient algorithm for approximate entity extraction with edit distance constraints
    • based on neighborhood generation
  • two techniques to reduce the number of variants generated, as well as running time
    • partitioning
    • prefix pruning
• Future work
  • approximate multiple pattern matching
  • other similarity measures, e.g., the function used in DNA/protein sequence alignment

Thank you! Questions?

Related Work
• neighborhood generation approaches
  • E. W. Myers. A sublinear algorithm for approximate keyword searching. Algorithmica, 12(4/5):345–374, 1994.
  • T. Bocek, E. Hunt, B. Stiller. Fast Similarity Search in Large Dictionaries. Technical Report ifi-2007.02, Department of Informatics, University of Zurich, April 2007.
• q-gram based approaches
  • L. Gravano, P. G. Ipeirotis, H. V. Jagadish, N. Koudas, S. Muthukrishnan, and D. Srivastava. Approximate string joins in a database (almost) for free. In VLDB, 2001.
  • C. Xiao, W. Wang, and X. Lin. Ed-Join: an efficient algorithm for similarity joins with edit distance constraints. PVLDB, 1(1):933–944, 2008.
• alternative: use vgrams instead of q-grams
  • C. Li, B. Wang, and X. Yang. VGRAM: Improving performance of approximate queries on string collections using variable-length grams. In VLDB, 2007.
  • X. Yang, B. Wang, and C. Li. Cost-based variable length gram selection for string collections to support approximate queries efficiently. In SIGMOD, 2008.
