Relational Clustering for Entity Resolution Queries

Indrajit Bhattacharya, Louis Licamele and Lise Getoor
University of Maryland, College Park

The Entity Resolution Problem
References (paper title, author names as written):
• P1: “A mouse immunity model”, W. Wang, C. Chen, A. Ansari
• P2: “A better mouse immunity model”, W. Wang, A. Ansari
• P3: “Measuring protein-bound fluxetine”, L. Li, C. Chen, W. Wang
• P4: “Autoimmunity in biliary cirrhosis”, W. W. Wang, A. Ansari
Author entities shown on the slide: Abdulla Ansari, Chih Chen, WeiWei Wang, Wenyi Wang, Liyuan Li, Chien-Te Chen
Goals:
• Discover the domain entities
• Map each reference to an entity

Query-time ER: Motivation
• Most publicly available databases do not have resolved entities
  • PubMed, CiteSeer have many unresolved authors
• Millions of queries every day require resolved entities directly or indirectly
  • “I am looking for all papers by Stuart Russell”
• How do we address this problem?
  • Leave the burden on the user to do the resolution
  • Ask owners to ‘clean’ their databases
  • Develop techniques for query-time resolution

Entity Resolution Queries
• Disambiguation Query
  • Among all papers with ‘W Wang’ as author, find those written by WeiWei Wang
• Resolution Query
  • Do disambiguation
  • Also retrieve papers by WeiWei Wang with a different author name, e.g. ‘W W Wang’
First set of papers shown on the slide:
• P1: “A mouse immunity model”, W. Wang, C. Chen, A. Ansari
• P2: “A better mouse immunity model”, W. Wang, A. Ansari
• P3: “Measuring protein-bound fluxetine”, L. Li, C. Chen, W. Wang
Second set of papers shown on the slide:
• P1: “A mouse immunity model”, W. Wang, C. Chen, A. Ansari
• P2: “A better mouse immunity model”, W. Wang, A. Ansari
• P4: “Autoimmunity in biliary cirrhosis”, W. W. Wang, A. Ansari

Query-time ER using Relations
• Simple approach for resolving queries
  • Use attributes
  • Quick but not accurate
• Use best techniques available
  • Collective resolution using relationships
  • How can we localize collective resolution?
• Two-phase collective resolution for query
  • Extract minimal set of relevant records
  • Collective resolution on extracted records

Cut-based Evaluation of Relational Clustering
• Vertices embedded in attribute space
• Additional (hyper)edges represent relationships
[Figure: two clusterings of the same vertices into clusters C1, C2, C3, C4]
First clustering:
• Good separation of attributes
• Many cluster-cluster relationships: C1-C3, C1-C4, C2-C4
Second clustering:
• Worse in terms of attributes
• Fewer cluster-cluster relationships: C1-C3, C2-C4

A Cut-based Objective Function
[Equation: the objective function itself did not survive extraction; its annotated terms are listed below]
• weight for attributes
• similarity of attributes
• weight for relations
• 1 iff relational edge exists between ci and cj
• compatibility of ci and cj
Greedy clustering algorithm: merge the cluster pair with the maximum reduction in the objective function
• Similarity of attributes
  • Jaro, Levenshtein; TF-IDF
• Common cluster neighborhood
  • Jaccard works better than intersection
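The contrast between Jaccard and plain intersection for the common cluster neighborhood can be illustrated with a minimal sketch. The function names, the equal default weights, and the linear combination of attribute and neighborhood similarity are assumptions made for illustration; the slide does not give the exact formula.

```python
def jaccard(a, b):
    """Jaccard coefficient |a & b| / |a | b| of two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def intersection_size(a, b):
    """Raw count of shared neighbors, with no normalization."""
    return len(a & b)


def combined_similarity(attr_sim, nbrs_i, nbrs_j, w_attr=0.5, w_rel=0.5):
    """Hypothetical weighted mix of attribute similarity (assumed to lie in [0, 1])
    and Jaccard similarity of the two clusters' neighborhoods."""
    return w_attr * attr_sim + w_rel * jaccard(nbrs_i, nbrs_j)


# Two small clusters whose neighborhoods coincide exactly.
small_i, small_j = {1, 2}, {1, 2}
# Two large clusters that share the same two neighbors but little else.
large_i, large_j = {1, 2, 3, 4, 5, 6}, {1, 2, 7, 8, 9, 10}

print(intersection_size(small_i, small_j), intersection_size(large_i, large_j))
print(jaccard(small_i, small_j), jaccard(large_i, large_j))
```

Intersection scores both pairs identically, while Jaccard separates them, which is one plausible reading of why the slide reports that Jaccard works better.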

Extracting Relevant Records
[Figure: expansion from the query ‘W Wang’ through Level 0, Level 1 and Level 2, by name expansion, hyper-edge expansion and name expansion in turn]
• Level 0: P1: W Wang, P2: W Wang, P3: W Wang, P4: W W Wang
• Level 1: P1: A Ansari, P2: A Ansari, P4: A Ansari, P1: C Chen, P3: C Chen, P3: L Li
• Level 2: P: A Ansari, P: A Ansari, P: C Chen, P: C Chen, P: L Li, P: L Li
Start with the query name or record, then alternate between
• Name expansion: For any relevant record, include other records with that name
• Hyper-edge expansion: For any relevant record, include other related records
Terminate at some depth k
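The alternating expansion can be sketched on the four example papers. This is a minimal illustration under stated assumptions: `REFS`, `name_key` and `extract_relevant` are hypothetical names, a hyper-edge is taken to be the set of author references on one paper, and names are matched by first initial plus last name so that ‘W W Wang’ is picked up by a ‘W Wang’ query.

```python
from collections import defaultdict

# References from the running example: (paper id, author name as written).
REFS = [
    ("P1", "W Wang"), ("P1", "C Chen"), ("P1", "A Ansari"),
    ("P2", "W Wang"), ("P2", "A Ansari"),
    ("P3", "L Li"), ("P3", "C Chen"), ("P3", "W Wang"),
    ("P4", "W W Wang"), ("P4", "A Ansari"),
]


def name_key(name):
    """Hypothetical blocking key: first initial plus last name."""
    parts = name.split()
    return (parts[0][0], parts[-1])


def extract_relevant(query_name, refs, depth):
    """Collect references reachable from the query within `depth` expansion steps."""
    by_name = defaultdict(set)
    by_paper = defaultdict(set)
    for ref in refs:
        paper, name = ref
        by_name[name_key(name)].add(ref)
        by_paper[paper].add(ref)

    # Level 0: name expansion of the query itself.
    relevant = set(by_name[name_key(query_name)])
    frontier = set(relevant)
    for level in range(1, depth + 1):
        reached = set()
        for paper, name in frontier:
            if level % 2 == 1:
                # Odd levels: hyper-edge expansion (co-references on the same paper).
                reached |= by_paper[paper]
            else:
                # Even levels: name expansion (other references with a matching name).
                reached |= by_name[name_key(name)]
        frontier = reached - relevant
        relevant |= frontier
    return relevant


level0 = extract_relevant("W Wang", REFS, 0)
level1 = extract_relevant("W Wang", REFS, 1)
print(sorted(level0))
print(len(level1))
```

On this tiny example, depth 0 yields the four Wang references and depth 1 already reaches all ten references, because every paper has a Wang author. On a real database the set keeps growing with depth.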

Adaptive Expansion for a Query
• Too many records with unconstrained expansion
• Adaptively select records based on ‘ambiguity’
  • ‘Chen’ is more ambiguous than ‘Ansari’
• Adaptive Name Expansion
  • Expand the more ambiguous records
  • They need extra evidence
• Adaptive Hyper-edge Expansion
  • Add fewer of the ambiguous records
  • They lead to imprecision
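One way to picture the two adaptive rules is with a hard threshold on a per-name ambiguity score. This is only a sketch: the slide does not state the selection rule, so the thresholds, the function names and the ambiguity scores below are all hypothetical.

```python
def adaptive_name_expansion(names, ambiguity, min_ambiguity):
    """Name-expand only the names that are ambiguous enough to need extra evidence."""
    return [n for n in names if ambiguity.get(n, 0.0) >= min_ambiguity]


def adaptive_hyperedge_expansion(names, ambiguity, max_ambiguity):
    """When following hyper-edges, keep only the less ambiguous neighbors,
    since highly ambiguous ones tend to add imprecision."""
    return [n for n in names if ambiguity.get(n, 0.0) <= max_ambiguity]


# Hypothetical ambiguity scores in [0, 1]; names missing from the map count as 0.0.
ambiguity = {"C Chen": 0.9, "A Ansari": 0.1, "L Li": 0.6}
neighbors = ["C Chen", "A Ansari", "L Li"]

to_expand = adaptive_name_expansion(neighbors, ambiguity, min_ambiguity=0.5)
to_add = adaptive_hyperedge_expansion(neighbors, ambiguity, max_ambiguity=0.5)
print(to_expand)
print(to_add)
```

A probabilistic selection, where the chance of keeping a record depends on its ambiguity, would fit the same interface. The threshold form is used here only because it is easy to follow.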

Unsupervised Estimation of Ambiguity
• Ambiguity: probability of multiple entities sharing an attribute value
• Estimate ambiguity of one single-valued attribute (A1=a) using another (A2)
  • Count the number of different values of A2 observed for records having A1=a
  • e.g. number of different first initials for last name ‘Smith’
• Estimate improves with more independent attributes
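The counting step can be sketched as follows. The record layout, the field names `last` and `initial`, and the sample records are assumptions made for illustration. The sketch returns raw counts and leaves out any normalization into a probability.

```python
from collections import defaultdict


def estimate_ambiguity(records, a1, a2):
    """For each value of attribute a1, count the distinct values of a2 seen with it."""
    seen = defaultdict(set)
    for rec in records:
        seen[rec[a1]].add(rec[a2])
    return {value: len(others) for value, others in seen.items()}


# Hypothetical author references, split into last name and first initial.
records = [
    {"last": "Smith", "initial": "J"},
    {"last": "Smith", "initial": "A"},
    {"last": "Smith", "initial": "J"},
    {"last": "Smith", "initial": "R"},
    {"last": "Ansari", "initial": "A"},
    {"last": "Ansari", "initial": "A"},
]

counts = estimate_ambiguity(records, "last", "initial")
print(counts)
```

Here ‘Smith’ appears with three different first initials and ‘Ansari’ with one, so ‘Smith’ would be treated as the more ambiguous last name.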

Evaluation Datasets
• arXiv High Energy Physics
  • 29,555 publications, 58,515 refs to 9,200 authors
  • Queries: All ambiguous names (75 in total)
  • True authors per name: 2 to 11 (avg. is 2.4)
• Elsevier BioBase
  • 156,156 publications, 831,991 author refs
  • Keywords, topic classifications, language, country and affiliation of corresponding author, etc.
  • Queries: 100 most frequent names
  • True authors per name: 1 to 100 (avg. is 32)

Growth Rate of Relevant Records and Query Processing Time
• Number of relevant references grows rapidly with expansion depth
• RC-ER is fast but not good enough for query-time resolution

Query-time ER Results
Unconstrained expansion
• Collective resolution more accurate
• Accuracy improves beyond depth 1
• Legend: A: pair-wise attribute similarity; A+N: also neighbors’ attributes; *: transitive closure
Adaptive expansion
• Minimal loss in accuracy
• Dramatic reduction in query processing time
• Legend: AX-2: adaptive expansion at depths 2 and beyond; AX-1: adaptive expansion even at depth 1

Conclusions
• Query-centric entity resolution
• Cut-based evaluation of relational clustering
• Adaptive selection of relevant references for a query
• Resolution at query-time with minimal loss in accuracy
Future Directions
• Spectral algorithm for relational clustering
• Stronger coupling between extraction and resolution
• Localized resolution for incoming records

References
• Indrajit Bhattacharya, Louis Licamele and Lise Getoor, "Query-Time Entity Resolution", ACM SIGKDD, 2006
• Indrajit Bhattacharya and Lise Getoor, "A Latent Dirichlet Model for Unsupervised Entity Resolution", SIAM Data Mining, 2006
• Indrajit Bhattacharya and Lise Getoor, "Entity Resolution in Graphs", chapter in Mining Graph Data, Lawrence B. Holder and Diane J. Cook, editors, Wiley, 2006 (to appear)
• Indrajit Bhattacharya and Lise Getoor, "Relational Clustering for Multi-type Entity Resolution", SIGKDD Workshop on Multi Relational Data Mining (MRDM), 2005
