
Entity Resolution in Network Data



Presentation Transcript


  1. Entity Resolution in Network Data Lise Getoor University of Maryland, College Park NetSci07 May 24, 2007

  2. Entity Resolution • The Problem • The Algorithms • Graph-based Clustering (GBC) • Probabilistic Model (LDA-ER) • The Tool • The Big Picture

  3. The Entity Resolution Problem • Entities: James Smith, John Smith, Jonathan Smith • References: “John Smith”, “Jim Smith”, “J Smith”, “James Smith”, “Jon Smith”, “J Smith”, “Jonthan Smith” • Issues: • Identification • Disambiguation

  4. InfoVis Co-Author Network Fragment (figure: before and after)

  5. Entity Resolution in Networks • References not observed independently • Links between references indicate relations between the entities • Co-author relations for bibliographic data • To, cc: lists for email • Use relations to improve identification and disambiguation

  6. Relational Identification Very similar names. Added evidence from shared co-authors

  7. Relational Disambiguation Very similar names but no shared collaborators

  8. Collective Entity Resolution One resolution provides evidence for another => joint resolution

  9. Entity Resolution • The Problem • The Algorithms • Relational Clustering (RC-ER) • Bhattacharya and Getoor, DMKD’04, Wiley’06, TKDD’07 • Probabilistic Model (LDA-ER) • Experimental Evaluation • The Tool • The Big Picture

  10. Objective Function • Minimize, over pairs of clusters ci and cj, a weighted combination of two terms: • (weight for attributes) × (similarity of attributes) • (weight for relations) × (1 iff a relational edge exists between ci and cj) • Greedy clustering algorithm: merge the cluster pair with the max reduction in the objective function • Cluster-pair similarity combines similarity of attributes and common cluster neighborhood
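A minimal sketch of how a cluster-pair score might combine the two ingredients named on this slide. Everything concrete here is an assumption made for illustration and is not taken from the talk: the function names, the use of `difflib.SequenceMatcher` as the attribute similarity, Jaccard overlap as the "common cluster neighborhood" measure, and the equal default weights `w_attr` and `w_rel`.

```python
from difflib import SequenceMatcher


def attribute_similarity(name_a: str, name_b: str) -> float:
    """Stand-in attribute similarity in [0, 1] (assumption: plain string ratio)."""
    return SequenceMatcher(None, name_a.lower(), name_b.lower()).ratio()


def neighborhood_similarity(nbrs_a: set, nbrs_b: set) -> float:
    """Jaccard overlap of the two neighborhoods (assumed relational measure)."""
    union = nbrs_a | nbrs_b
    if not union:
        return 0.0
    return len(nbrs_a & nbrs_b) / len(union)


def combined_similarity(name_a, nbrs_a, name_b, nbrs_b, w_attr=0.5, w_rel=0.5):
    """Weighted sum of the attribute term and the relational term."""
    return (w_attr * attribute_similarity(name_a, name_b)
            + w_rel * neighborhood_similarity(nbrs_a, nbrs_b))


# Same pair of names, with and without shared co-authors.
shared = combined_similarity("J Smith", {"A", "B"}, "John Smith", {"A", "B"})
unshared = combined_similarity("J Smith", {"A", "B"}, "John Smith", {"C"})
```

With identical names on both sides, `shared` exceeds `unshared` purely because of the relational term, which mirrors the identification and disambiguation cases on slides 6 and 7.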

  11. Relational Clustering Algorithm • Find similar references using ‘blocking’ • Bootstrap clusters using attributes and relations • Compute similarities for cluster pairs and insert into priority queue • Repeat until priority queue is empty • Find ‘closest’ cluster pair • Stop if similarity below threshold • Merge to create new cluster • Update similarity for ‘related’ clusters • O(n k log n) algorithm with an efficient implementation • Code, data, and data generator available at: http://www.cs.umd.edu/~indrajit/ER/
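The loop on this slide can be sketched as a greedy agglomerative procedure driven by a priority queue. This is a simplified illustration, not the released RC-ER code: it skips blocking and bootstrapping (all pairs are scored up front), it recomputes similarities only for the newly merged cluster rather than for all ‘related’ clusters, and `greedy_cluster`, `same_surname`, and the toy `names` list are hypothetical.

```python
import heapq
from itertools import combinations


def greedy_cluster(num_refs, similarity, threshold):
    """Merge the most similar cluster pair until the best score drops below threshold.

    Clusters are frozensets of reference indices; similarity(c1, c2) returns a float.
    """
    clusters = {i: frozenset([i]) for i in range(num_refs)}
    next_id = num_refs
    heap = []  # max-heap emulated with negated similarities
    for a, b in combinations(range(num_refs), 2):
        heapq.heappush(heap, (-similarity(clusters[a], clusters[b]), a, b))

    while heap:
        neg_sim, a, b = heapq.heappop(heap)
        if a not in clusters or b not in clusters:
            continue  # stale entry: one side was already merged away
        if -neg_sim < threshold:
            break  # closest remaining pair is not similar enough
        merged = clusters.pop(a) | clusters.pop(b)
        # Score the new cluster against every surviving cluster.
        for other_id, other in clusters.items():
            heapq.heappush(heap, (-similarity(merged, other), other_id, next_id))
        clusters[next_id] = merged
        next_id += 1
    return sorted(clusters.values(), key=min)


# Toy references; the two "J Smith" entries are distinct references.
names = ["J Smith", "John Smith", "Jon Smith", "A Jones", "J Smith"]


def same_surname(cluster_a, cluster_b):
    """Toy similarity: 1.0 if every reference in both clusters shares a surname."""
    surnames = {names[i].split()[-1] for i in cluster_a | cluster_b}
    return 1.0 if len(surnames) == 1 else 0.0


result = greedy_cluster(len(names), same_surname, threshold=0.5)
```

Giving each merged cluster a fresh id lets outdated queue entries be detected and skipped lazily, which avoids deleting from the middle of the heap.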

  12. Entity Resolution • The Problem • Relational Entity Resolution • Algorithms • Relational Clustering (RC-ER) • Probabilistic Model (LDA-ER) • SIAM SDM’06, Best Paper Award • Experimental Evaluation • Query-time Entity Resolution

  13. Probabilistic Generative Model for Collective Entity Resolution • Model how references co-occur in data • Generation of references from entities • Relationships between underlying entities • Groups of entities instead of pair-wise relations

  14. LDA-ER Model (plate diagram; symbols: α, P, θ, R, z, β, T, a, Φ, A, r, V) • Entity label a and group label z for each reference r • θ: ‘mixture’ of groups for each co-occurrence • Φz: multinomial for choosing entity a for each group z • Va: multinomial for choosing reference r from entity a • Dirichlet priors with α and β
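The generative story in these bullets can be sketched as a small forward sampler. This is only an illustration under stated assumptions: `generate_references` and `sample_dirichlet` are hypothetical names, the priors are symmetric, Va is simplified to a uniform choice over a hand-written list of name variants, and the number of references per co-occurrence is fixed. It samples from the model; it does not perform the inference described on the next slides.

```python
import random


def sample_dirichlet(rng, alphas):
    """Draw from a Dirichlet by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def generate_references(num_cooccurrences, refs_per_cooccurrence,
                        entity_variants, num_groups,
                        alpha=0.5, beta=0.5, seed=0):
    """Return one list of (reference, entity, group) triples per co-occurrence."""
    rng = random.Random(seed)
    entities = list(entity_variants)
    # Phi_z: one distribution over entities per group, with a Dirichlet(beta) prior.
    phi = [sample_dirichlet(rng, [beta] * len(entities)) for _ in range(num_groups)]
    data = []
    for _ in range(num_cooccurrences):
        # theta: mixture of groups for this co-occurrence, with a Dirichlet(alpha) prior.
        theta = sample_dirichlet(rng, [alpha] * num_groups)
        refs = []
        for _ in range(refs_per_cooccurrence):
            z = rng.choices(range(num_groups), weights=theta)[0]  # group label
            a = rng.choices(entities, weights=phi[z])[0]          # entity label
            r = rng.choice(entity_variants[a])                    # V_a, assumed uniform
            refs.append((r, a, z))
        data.append(refs)
    return data


variants = {
    "John Smith": ["John Smith", "J Smith", "Jon Smith"],
    "James Smith": ["James Smith", "J Smith", "Jim Smith"],
}
data = generate_references(3, 4, variants, num_groups=2)
```

Note that both entities can emit the reference “J Smith”, which is exactly the ambiguity the inference procedure has to undo.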

  15. Approx. Inference Using Gibbs Sampling • Conditional distribution over labels for each ref. • Sample next labels from conditional distribution • Repeat over all references until convergence • Converges to most likely number of entities

  16. Faster Inference: Split-Merge Sampling • Naïve strategy reassigns references individually • Alternative: allow entities to merge or split • For entity ai, find conditional distribution for • Merging with existing entity aj • Splitting back to last merged entities • Remaining unchanged • Sample next state for ai from distribution • O(n g + e) time per iteration compared to O(n g + n e)

  17. Entity Resolution • The Problem • Relational Entity Resolution • Algorithms • Relational Clustering (RC-ER) • Probabilistic Model (LDA-ER) • Experimental Evaluation • Query-time Entity Resolution • ER User Interface

  18. Evaluation Datasets • CiteSeer • 1,504 citations to machine learning papers (Lawrence et al.) • 2,892 references to 1,165 author entities • arXiv • 29,555 publications from High Energy Physics (KDD Cup’03) • 58,515 refs to 9,200 authors • Elsevier BioBase • 156,156 Biology papers (IBM KDD Challenge ’05) • 831,991 author refs • Keywords, topic classifications, language, country and affiliation of corresponding author, etc

  19. Baselines • A: Pair-wise duplicate decisions w/ attributes only • Names: Soft-TFIDF with Levenshtein, Jaro, Jaro-Winkler • Other textual attributes: TF-IDF • A*: Transitive closure over A • A+N: Add attribute similarity of co-occurring refs • A+N*: Transitive closure over A+N • Evaluate pair-wise decisions over references • F1-measure (harmonic mean of precision and recall)
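Two pieces of this setup lend themselves to a short sketch: taking the transitive closure of pair-wise duplicate decisions (the starred baselines) and scoring a resolution with pair-wise F1. The function names and the union-find implementation are illustrative choices, not taken from the talk, and the input pairs are made up.

```python
from itertools import combinations


def transitive_closure(num_refs, duplicate_pairs):
    """Turn pair-wise duplicate decisions into cluster labels via union-find."""
    parent = list(range(num_refs))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in duplicate_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a
    return [find(i) for i in range(num_refs)]


def pairwise_f1(predicted_labels, true_labels):
    """F1 over all reference pairs: a pair is positive if both share a label."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(true_labels)), 2):
        pred_same = predicted_labels[i] == predicted_labels[j]
        true_same = true_labels[i] == true_labels[j]
        if pred_same and true_same:
            tp += 1
        elif pred_same:
            fp += 1
        elif true_same:
            fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Decisions (0,1) and (1,2) imply (0,2) once the closure is taken.
labels = transitive_closure(4, [(0, 1), (1, 2)])
score = pairwise_f1(labels, [0, 0, 0, 1])
```

Without the closure step, the pair (0, 2) would count as a missed duplicate even though it follows from the other two decisions.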

  20. ER Evaluation • RC-ER & LDA-ER outperform baselines in all datasets • Collective resolution better than naïve relational resolution • CiteSeer: Near perfect resolution; 22% error reduction • arXiv: 6,500 additional correct resolutions; 20% error reduction • BioBase: Biggest improvement over baselines

  21. Trends in Synthetic Data Bigger improvement with • bigger % of ambiguous refs • more refs per co-occurrence • more neighbors per entity

  22. Entity Resolution • The Problem • Relational Entity Resolution • The Algorithms • The Tool • H. Kang, M. Bilgic, L. Licamele, B. Shneiderman, VAST06, IV07 • The Big Picture

  23. D-Dupe: An Interactive Tool for Entity Resolution • http://www.cs.umd.edu/projects/linqs/ddupe • Novel combination of network visualization and statistical relational models, well suited to the visual analytic task at hand

  24. Entity Resolution • The Problem • Relational Entity Resolution • The Algorithms • The Tool • The Big Picture

  25. Putting Everything Together…

  26. Summary • In reality, we want to be able to flexibly combine node, edge, and graph-based inferences: Entity Resolution + Link Prediction + Collective Classification = Graph Identification • While there are important pitfalls to take into account (confidence and privacy), there are many potential benefits and payoffs

  27. Thanks! http://www.cs.umd.edu/~getoor • Work sponsored by the National Science Foundation, Google, KDD program and National Geospatial Agency
