Self-tuning in Graph-Based Reference Disambiguation

Rabia Nuray-Turan, Dmitri V. Kalashnikov, Sharad Mehrotra • Computer Science Department, University of California, Irvine

Overview • Intro to Data Cleaning • Entity resolution • RelDC Framework • Past work • Adapting to data • The new part • Reduction to an Optimization problem • Linear programming • Experiments DASFAA 2007, Bangkok, Thailand

Data Cleaning • Analysis on bad data leads to wrong conclusions

Example of the Problem: CiteSeer Top-K • CiteSeer lists the top-k most cited authors, and some entries look suspicious • Let's go to the DBLP website, which stores bibliographic entries of many CS authors • Let's check two people: “A. Gupta” and “L. Zhang”

Two Most Common Entity-Resolution Challenges • Fuzzy lookup (reference disambiguation): match references to objects; the list of all objects is given • Fuzzy grouping: group together object representations that correspond to the same object

Standard Approach to Entity Resolution

Overview • Intro to Data Cleaning • RelDC Framework • Past work • Adapting to data • The new part • Reduction to an Optimization problem • Linear programming • Experiments

RelDC Framework

RelDC Framework • Past work • SDM’05, TODS’06 • Domain-independent framework • Views the dataset as an entity-relationship graph • Analyzes paths in this graph • Solid theoretical foundation • Optimization problem • Scales to large datasets • Robust under uncertainty • High disambiguation quality • No self-tuning • This paper solves this challenge

Entity-Relationship Graph • Choice node • For uncertain references • Encodes the options/possibilities yr1, …, yrN • Among the options yr1, …, yrN, pick the most strongly connected one (CAP principle) • Analyze the paths in G that exist between xr and yrj, for all j • Use a model to measure connection strength • “Connection strength” model • c(u,v), for nodes u and v in G: how strongly u and v are connected in G • RandomWalk-based • Fixed • Based on intuition • This paper, instead, learns such a model from data
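The CAP-based choice described on this slide can be sketched as an argmax over the options of a choice node. This is only an illustrative sketch: `resolve_reference`, the toy strength table, and the entity labels are hypothetical names that do not come from the slides, and the `strength` callable stands in for whichever connection-strength model c(u,v) is used.

```python
from typing import Callable, Hashable, Sequence


def resolve_reference(
    x_r: Hashable,
    options: Sequence[Hashable],
    strength: Callable[[Hashable, Hashable], float],
) -> Hashable:
    """Return the option y_rj that is most strongly connected to x_r."""
    if not options:
        raise ValueError("a choice node needs at least one option")
    return max(options, key=lambda y: strength(x_r, y))


# Hypothetical connection strengths, invented purely for illustration.
toy_strength = {
    ("paper1", "A. Gupta #1"): 0.7,
    ("paper1", "A. Gupta #2"): 0.2,
}

best = resolve_reference(
    "paper1",
    ["A. Gupta #1", "A. Gupta #2"],
    lambda u, v: toy_strength.get((u, v), 0.0),
)
```

Keeping the strength model behind a callable makes it easy to swap a fixed model for a learned one without touching the selection step.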

Adaptive Solution • Classify the paths found in the graph into a finite set of path types ST = {T1, T2, …, TN} • If paths p1 and p2 are of the same type, they are treated as identical • The connection between nodes u and v can be represented by a path-type count vector Tuv = {c1, c2, …, cN}, where ci is the number of paths of type Ti • If each path type Ti is associated with a weight wi, the connection strength will be: c(u,v) = c1·w1 + c2·w2 + … + cN·wN
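A minimal sketch of the weighted connection strength, assuming it is the sum of each path-type count multiplied by the weight of that type (a linear form, as the later linear-programming slides suggest). The function name and the numbers are hypothetical and only illustrate the arithmetic.

```python
from typing import Sequence


def connection_strength(counts: Sequence[int], weights: Sequence[float]) -> float:
    """Weighted sum of path-type counts: sum over i of c_i * w_i."""
    if len(counts) != len(weights):
        raise ValueError("counts and weights must cover the same path types")
    return sum(c * w for c, w in zip(counts, weights))


# Hypothetical count vector T_uv and weights, invented for illustration.
t_uv = [2, 0, 1]
w = [0.5, 1.0, 0.25]
strength = connection_strength(t_uv, w)  # 2*0.5 + 0*1.0 + 1*0.25
```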

Problems to Answer • How will we classify the paths? • How will we associate each path type with a weight?

Classifying Paths • Path Type Model (PTM) • Views each path as a sequence of edges <e1,e2,e3,…,en> • Each edge ei has a type Ei associated with it • Thus, each path p can be associated with a string <E1,E2,E3,…,En> • Different strings correspond to different path types • Associate a weight with each string • Different models are also possible
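The Path Type Model above can be sketched as mapping each path to the tuple of its edge types and then counting paths per type. The edge identifiers, edge-type names, and helper functions below are hypothetical and chosen only for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, Sequence, Tuple

PathType = Tuple[str, ...]


def path_type(path: Sequence[str], edge_type: Dict[str, str]) -> PathType:
    """Map a path <e1,...,en> to its type string <E1,...,En>."""
    return tuple(edge_type[e] for e in path)


def count_path_types(
    paths: Iterable[Sequence[str]], edge_type: Dict[str, str]
) -> Counter:
    """Count how many of the given paths fall into each path type."""
    return Counter(path_type(p, edge_type) for p in paths)


# Hypothetical edges and edge types, invented for illustration.
edge_type = {"a": "writes", "b": "writes", "c": "affiliated_with"}
paths = [["a", "c"], ["b", "c"], ["a", "b"]]
type_counts = count_path_types(paths, edge_type)
```

Here the two paths that follow a "writes" edge and then an "affiliated_with" edge share one type, so they are treated as identical when counting.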

Learning Path Weights: Optimization Problem • The CAP principle states that the right option will be better connected • Linear programming • Learn the path-type weights wi

Final Solution • The value of c(xr,yrj) - c(xr,yrl), where yrj is the right option for reference r, should be maximized for all r, l ≠ j • Then the final solution:
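The idea of maximizing the differences can be illustrated with a deliberately simplified sketch. It assumes a linear connection strength, weights bounded to [0, 1], and an objective equal to the sum of all differences between the right option and each wrong option. The full formulation is not preserved in this transcript and may contain additional constraints, so this is not the authors' linear program. Under these assumptions the optimum can be read off from the sign of each aggregated coefficient, so no solver is needed. The function name and the training counts are hypothetical and are not taken from the slides' example graph.

```python
from typing import List, Optional, Sequence, Tuple

CountVector = Sequence[int]
# One training item: (counts for the right option, counts for each wrong option).
TrainingItem = Tuple[CountVector, Sequence[CountVector]]


def learn_weights(training: Sequence[TrainingItem]) -> List[Optional[float]]:
    """Maximize the summed strength differences subject to 0 <= w_i <= 1.

    Returns 1.0 or 0.0 per path type, or None when any value in [0, 1]
    is equally good (the aggregated coefficient is zero).
    """
    n = len(training[0][0])
    coef = [0] * n
    for right, wrongs in training:
        for wrong in wrongs:
            for i in range(n):
                coef[i] += right[i] - wrong[i]
    weights: List[Optional[float]] = []
    for c in coef:
        if c > 0:
            weights.append(1.0)   # raising w_i increases the objective
        elif c < 0:
            weights.append(0.0)   # raising w_i decreases the objective
        else:
            weights.append(None)  # the objective does not depend on w_i
    return weights


# Hypothetical path-type counts for two training references.
training = [
    ([2, 1, 0, 0], [[0, 1, 1, 0]]),
    ([1, 0, 0, 0], [[0, 0, 0, 2]]),
]
learned = learn_weights(training)
```

With extra constraints the shortcut no longer applies, and a general LP solver would be the natural replacement for the sign check.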

Example: Graph • P1 = e1-e3-e1 • P2 = e1-e1-e3 • P3 = e1-e2-e2-e3 • P4 = e1-e2-e3-e2-e3

Example: Solution • w1 = 1 • w3 = w4 = 0 • w2 can be anything between 0 and 1

Experimental Setup • Parameters • When looking for L-short simple paths, L = 5 • L is the path-length limit • RealMov • movies (12K) • people (22K): actors, directors, producers • studios (1K): producing, distributing • ground truth is known • SynPub datasets • many datasets of five different types • emulation of RealPub • publications (5K) • authors (1K) • organizations (25K) • departments (125K) • ground truth is known
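Finding L-short simple paths can be sketched as a depth-first search that never revisits a node and stops extending a path once it has L edges. The adjacency-list representation, the function name, and the toy graph are assumptions made for illustration; the slides do not describe how the path search is implemented.

```python
from typing import Dict, Hashable, Iterator, List

Graph = Dict[Hashable, List[Hashable]]


def l_short_simple_paths(
    graph: Graph, source: Hashable, target: Hashable, max_len: int = 5
) -> Iterator[List[Hashable]]:
    """Yield simple paths from source to target with at most max_len edges."""
    path = [source]
    on_path = {source}

    def dfs(node: Hashable) -> Iterator[List[Hashable]]:
        if node == target:
            yield list(path)
            return
        if len(path) - 1 >= max_len:  # path already uses max_len edges
            return
        for nxt in graph.get(node, ()):
            if nxt in on_path:  # keep the path simple: no repeated nodes
                continue
            path.append(nxt)
            on_path.add(nxt)
            yield from dfs(nxt)
            path.pop()
            on_path.remove(nxt)

    yield from dfs(source)


# Hypothetical undirected toy graph, stored as symmetric adjacency lists.
toy_graph: Graph = {
    "x": ["a", "b"],
    "a": ["x", "b", "y"],
    "b": ["x", "a", "y"],
    "y": ["a", "b"],
}
short_paths = list(l_short_simple_paths(toy_graph, "x", "y", max_len=2))
all_paths = list(l_short_simple_paths(toy_graph, "x", "y", max_len=5))
```

The length limit keeps the search tractable, since the number of simple paths can grow very quickly with path length in a large graph.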

Experimental Results on Movies • Parameters • Fraction: the fraction of uncertain references in the dataset • Each reference has 2 choices

Experimental Results on Movies II • The number of options is based on a PMF distribution

Hybrid Model: Experimental Results on SynPub • RandomWalk, PTM, and the Hybrid Model have the same accuracy • Is RandomWalk the optimal model for the publications domain?

Effect of Random Relationships in the Publications Domain

Summary • Main contribution • An adaptive solution for connection strength • The model learns the weights of different path types • Ongoing work • Using different models to learn the importance of paths in the connection strength • Using standard machine learning techniques, such as decision trees, for learning • Different ways to classify paths

Contact Information • RelDC project • www.ics.uci.edu/~dvk/RelDC • www.itr-rescue.org (RESCUE) • Rabia Nuray-Turan (contact author) • www.ics.uci.edu/~rnuray • Dmitri V. Kalashnikov • www.ics.uci.edu/~dvk • Sharad Mehrotra • www.ics.uci.edu/~sharad

Thank you!
