
Self-tuning in Graph-Based Reference Disambiguation

Rabia Nuray-Turan, Dmitri V. Kalashnikov, Sharad Mehrotra. Computer Science Department, University of California, Irvine.



  1. Self-tuning in Graph-Based Reference Disambiguation. Rabia Nuray-Turan, Dmitri V. Kalashnikov, Sharad Mehrotra. Computer Science Department, University of California, Irvine.

  2. Overview
     • Intro to Data Cleaning
     • Entity resolution
     • RelDC Framework
     • Past work
     • Adapting to data
     • The new part
     • Reduction to an optimization problem
     • Linear programming
     • Experiments
     DASFAA 2007, Bangkok, Thailand

  3. Data Cleaning: analysis on bad data leads to wrong conclusions.

  4. Example of the problem: CiteSeer top-k
     Suspicious entries
     • Let's go to the DBLP website, which stores bibliographic entries of many CS authors
     • Let's check two people: “A. Gupta” and “L. Zhang”
     [Figures: CiteSeer, the top-k most cited authors; two DBLP pages]

  5. Two Most Common Entity-Resolution Challenges
     Fuzzy lookup
     • reference disambiguation
     • match references to objects
     • list of all objects is given
     Fuzzy grouping
     • group together object representations that correspond to the same object

  6. Standard Approach to Entity Resolution

  7. Overview
     • Intro to Data Cleaning
     • RelDC Framework
     • Past work
     • Adapting to data
     • The new part
     • Reduction to an optimization problem
     • Linear programming
     • Experiments

  8. RelDC Framework

  9. RelDC Framework
     • Past work
       • SDM’05, TODS’06
     • Domain-independent framework
     • Views the dataset as an entity-relationship graph
     • Analyzes paths in this graph
     • Solid theoretical foundation
     • Optimization problem
     • Scales to large datasets
     • Robust under uncertainty
     • High disambiguation quality
     • No self-tuning
       • This paper solves this challenge

  10. Entity-Relationship Graph
      • Choice node
        • For uncertain references
        • To encode options/possibilities yr1, …, yrN
      • Among options yr1, …, yrN
        • Pick the most strongly connected one
        • CAP principle
      • Analyze paths in G
        • that exist between xr and yrj, for all j
      • Use a model to measure connection strength
      • “Connection strength” model
        • c(u,v), for nodes u and v in G
        • how strongly u and v are connected in G
        • RandomWalk-based
        • Fixed
        • Based on intuition!
      • This paper, instead, learns such a model from data.
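
The selection step on this slide (among the options of a choice node, pick the most strongly connected one) can be sketched as below. This is only an illustration, not the RelDC implementation: `pick_option`, the node names, and the toy `strength` table are hypothetical, and the connection-strength model c(u,v) is passed in as a plain callable.

```python
from typing import Callable, Hashable, Sequence


def pick_option(
    x_r: Hashable,
    options: Sequence[Hashable],
    c: Callable[[Hashable, Hashable], float],
) -> Hashable:
    """Return the option y_rj with the largest connection strength c(x_r, y_rj)."""
    if not options:
        raise ValueError("a choice node needs at least one option")
    return max(options, key=lambda y: c(x_r, y))


# Hypothetical connection strengths for one uncertain reference x_r.
strength = {("x_r", "y_r1"): 0.7, ("x_r", "y_r2"): 0.2}

best = pick_option(
    "x_r",
    ["y_r1", "y_r2"],
    lambda u, v: strength.get((u, v), 0.0),
)
```

Any model of c(u,v), fixed or learned, can be plugged in through the callable without changing the selection logic.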

  11. Overview
      • Intro to Data Cleaning
      • RelDC Framework
      • Past work
      • Adapting to data
      • The new part
      • Reduction to an optimization problem
      • Linear programming
      • Experiments

  12. Adaptive Solution
      • Classify the paths found in the graph into a finite set of path types ST = {T1, T2, …, TN}
      • If paths p1 and p2 are of the same type, they are treated as identical.
      • The connection between nodes u and v can then be represented by a path-type count vector Tuv = {c1, c2, …, cN}
      • If each path type Ti can be associated with a weight wi, the connection strength can be computed from Tuv and these weights.
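
The formula from this slide is not preserved in the transcript. A natural reading, assumed here and not confirmed by the transcript, is a weighted sum of the path-type counts, c(u,v) = c1·w1 + … + cN·wN. The sketch below uses that assumption; the function name and the sample numbers are hypothetical.

```python
from typing import Sequence


def connection_strength(counts: Sequence[int], weights: Sequence[float]) -> float:
    """Assumed form: weighted sum of path-type counts, sum_i c_i * w_i."""
    if len(counts) != len(weights):
        raise ValueError("counts and weights need one entry per path type")
    return sum(c * w for c, w in zip(counts, weights))


# Hypothetical example with N = 4 path types.
t_uv = [2, 0, 1, 3]        # c_1..c_4: number of u-v paths of each type
w = [1.0, 0.5, 0.0, 0.0]   # w_1..w_4: weight of each path type
score = connection_strength(t_uv, w)
```

Under this reading, a path type with weight 0 contributes nothing to the score, no matter how many paths of that type connect u and v.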

  13. Problems to Answer
      • How will we classify the paths?
      • How will we associate each path type with a weight?

  14. Classifying Paths
      • Path Type Model (PTM):
        • Views each path as a sequence of edges <e1, e2, e3, …, en>
        • Each edge ei has a type Ei associated with it
        • Thus, each path p can be associated with a string <E1, E2, E3, …, En>
        • Different strings correspond to different path types
        • Associate a weight with each string
      • Different models are also possible
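
The Path Type Model can be illustrated with a short sketch: each path is mapped to the sequence of its edge types, and paths with the same sequence are counted together. The edge representation, the edge-type labels ("writes", "affiliated"), and the helper names are assumptions made for this example only.

```python
from collections import Counter
from typing import Dict, Iterable, Sequence, Tuple

Edge = Tuple[str, str]        # (node, node); hypothetical representation
PathType = Tuple[str, ...]    # the string <E1, ..., En> as a tuple


def path_type(path: Sequence[Edge], edge_type: Dict[Edge, str]) -> PathType:
    """Map a path <e1, ..., en> to its type string <E1, ..., En>."""
    return tuple(edge_type[e] for e in path)


def count_path_types(
    paths: Iterable[Sequence[Edge]], edge_type: Dict[Edge, str]
) -> Counter:
    """Count how many of the given paths fall into each path type."""
    return Counter(path_type(p, edge_type) for p in paths)


# Hypothetical edge types for a tiny publications graph.
edge_type = {
    ("a1", "p1"): "writes", ("p1", "a2"): "writes",
    ("a1", "p2"): "writes", ("p2", "a2"): "writes",
    ("a1", "o1"): "affiliated", ("o1", "a2"): "affiliated",
}
paths = [
    [("a1", "p1"), ("p1", "a2")],
    [("a1", "p2"), ("p2", "a2")],
    [("a1", "o1"), ("o1", "a2")],
]
counts = count_path_types(paths, edge_type)
```

Here the two co-authorship paths share one type and the affiliation path forms another, so the three paths collapse into two path types.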

  15. Learning Path Weights: Optimization Problem
      • The CAP principle states that the right option will be better connected
      • Linear programming
      • Learn the path-type weights w

  16. Final Solution
      • The value of c(xr,yrj) - c(xr,yrl) should be maximized for all r, l≠j
      • The final solution is obtained by solving the resulting optimization problem
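
The exact formulation from this slide is not preserved in the transcript. One plausible way to write the stated objective as a linear program is sketched below. It rests on two assumptions: that connection strength is a weighted sum of path-type counts, and that each weight is bounded in [0, 1] (the example solution on a later slide is consistent with such bounds). Here y_{rj} denotes the correct option for reference r, and c_i(u,v) is the number of paths of type T_i between u and v.

```latex
\begin{aligned}
\max_{w_1,\dots,w_N}\quad
  & \sum_{r}\sum_{l \neq j}\bigl[\,c(x_r, y_{rj}) - c(x_r, y_{rl})\,\bigr] \\
\text{subject to}\quad
  & c(u,v) = \sum_{i=1}^{N} c_i(u,v)\, w_i, \\
  & 0 \le w_i \le 1, \qquad i = 1,\dots,N .
\end{aligned}
```

Because c(u,v) is linear in the weights under the first assumption, both the objective and the constraints are linear, which is what makes linear programming applicable.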

  17. Example: Graph
      P1 = e1-e3-e1
      P2 = e1-e1-e3
      P3 = e1-e2-e2-e3
      P4 = e1-e2-e3-e2-e3

  18. Example: Solution
      • w1 = 1
      • w3 = w4 = 0
      • w2 can be anything between 0 and 1.

  19. Overview
      • Intro to Data Cleaning
      • RelDC Framework
      • Past work
      • Adapting to data
      • The new part
      • Reduction to an optimization problem
      • Linear programming
      • Experiments

  20. Experimental Setup
      Parameters
      • When looking for L-short simple paths, L = 5
      • L is the path-length limit
      RealMov:
      • movies (12K)
      • people (22K)
        • actors
        • directors
        • producers
      • studios (1K)
        • producing
        • distributing
      • ground truth is known
      SynPub datasets:
      • many datasets of five different types
      • emulation of RealPub
      • publications (5K)
      • authors (1K)
      • organizations (25K)
      • departments (125K)
      • ground truth is known
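
The "L-short simple paths" parameter can be illustrated with a depth-first enumeration of simple paths (no repeated nodes) that use at most L edges. This is a minimal sketch, not the authors' path-discovery code; the adjacency-list format, the function name, and the toy graph are assumptions.

```python
from typing import Dict, Hashable, Iterator, List, Sequence


def l_short_simple_paths(
    graph: Dict[Hashable, Sequence[Hashable]],
    source: Hashable,
    target: Hashable,
    max_len: int,
) -> Iterator[List[Hashable]]:
    """Yield simple paths from source to target that use at most max_len edges."""
    path = [source]
    on_path = {source}

    def dfs(node: Hashable) -> Iterator[List[Hashable]]:
        if node == target:
            yield list(path)          # reached the target: report a copy
            return
        if len(path) - 1 == max_len:  # edge budget exhausted
            return
        for nxt in graph.get(node, ()):
            if nxt not in on_path:    # keep the path simple
                path.append(nxt)
                on_path.add(nxt)
                yield from dfs(nxt)
                path.pop()
                on_path.remove(nxt)

    yield from dfs(source)


# Hypothetical undirected graph given as adjacency lists.
graph = {
    "x": ["a", "b"],
    "a": ["x", "y", "b"],
    "b": ["x", "a", "c"],
    "c": ["b", "d"],
    "d": ["c", "y"],
    "y": ["a", "d"],
}
paths_l5 = list(l_short_simple_paths(graph, "x", "y", 5))
paths_l3 = list(l_short_simple_paths(graph, "x", "y", 3))
```

Lowering the limit from 5 to 3 drops the two longer routes through c and d, which shows how L trades coverage for search cost.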

  21. Experimental Results on Movies
      • Parameters:
        • Fraction: fraction of uncertain references in the dataset
        • Each reference has 2 choices

  22. Experimental Results on Movies II: number of options based on PMF distribution

  23. Hybrid Model: Experimental Results on SynPub
      • RandomWalk, PTM, and the Hybrid Model have the same accuracy
      • Is RandomWalk the optimum model for the Publications domain?

  24. Effect of Random Relationships in the Publications Domain

  25. Summary
      • Main contribution
        • An adaptive solution for connection strength
        • The model learns the weights of different path types
      • Ongoing work
        • Using different models to learn the importance of paths in the connection strength
        • Using standard machine learning techniques, such as decision trees, for learning
        • Different ways to classify paths

  26. Contact Information
      • RelDC project
        • www.ics.uci.edu/~dvk/RelDC
        • www.itr-rescue.org (RESCUE)
      • Rabia Nuray-Turan (contact author)
        • www.ics.uci.edu/~rnuray
      • Dmitri V. Kalashnikov
        • www.ics.uci.edu/~dvk
      • Sharad Mehrotra
        • www.ics.uci.edu/~sharad

  27. Thank you!
