SimRank: A Measure of Structural-Context Similarity
Glen Jeh and Jennifer Widom, Stanford University
ACM SIGKDD 2002
January 19, 2011
Taikyoung Kim, SNU IDB Lab.
Outline
• Introduction
• Basic Graph Model
• SimRank
• Random Surfer-Pairs Model
• Conclusion
• Future Work
Introduction
• Many applications require a measure of “similarity” between objects
  • “Find-similar-document” queries in a search engine
  • Collaborative filtering in a recommender system
Introduction
“Two objects are similar if they are referenced by similar objects”
• Proposes a general approach that exploits the object-to-object relationships found in many domains
• An algorithm computes similarity scores between nodes based on their structural context
• Intuition behind the algorithm
  • Similar objects are related to similar objects
  • The base case is that objects are similar to themselves
Basic Graph Model
[Figure: example graph illustrating O(Univ) and I(ProfB)]
• G = (V, E), with vertex set V and edge set E
• Nodes in V: objects in the domain
• Directed edges in E: relationships between objects
  • <p, q>: an edge from object p to object q
• For a node v, denote:
  • I(v): the set of in-neighbors of v
  • O(v): the set of out-neighbors of v
  • I_i(v): an individual in-neighbor (1 ≤ i ≤ |I(v)|)
  • O_i(v): an individual out-neighbor (1 ≤ i ≤ |O(v)|)
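As a minimal sketch of this graph model, the in-neighbor and out-neighbor sets can be built from a list of directed edges. The function name `build_neighbors` and the edge list are hypothetical; the edges only borrow the node names that appear in the slides and are not claimed to reproduce the slide's figure.

```python
from collections import defaultdict


def build_neighbors(edges):
    """Return (I, O): in-neighbor and out-neighbor sets keyed by node."""
    in_nbrs = defaultdict(set)
    out_nbrs = defaultdict(set)
    for p, q in edges:  # each pair is a directed edge <p, q>
        out_nbrs[p].add(q)
        in_nbrs[q].add(p)
    return in_nbrs, out_nbrs


# Hypothetical example edges (assumption, for illustration only).
example_edges = [
    ("Univ", "ProfA"),
    ("Univ", "ProfB"),
    ("ProfA", "StudentA"),
    ("ProfB", "StudentB"),
    ("StudentA", "Univ"),
    ("StudentB", "ProfB"),
]
I, O = build_neighbors(example_edges)
```

With these edges, `O["Univ"]` holds the two professors and `I["ProfB"]` holds every node that points at ProfB.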
SimRank
[Figure: example graph; similar nodes: {ProfA, ProfB}, {StudentA, StudentB}, {Univ, ProfB}, …]
• Motivation
  • Two objects are similar if they are referenced by similar objects
  • An object is considered maximally similar to itself (similarity score of 1)
SimRank: Basic SimRank Equation
• The similarity between objects a and b: s(a, b) ∈ [0, 1]
• C is a constant between 0 and 1
  • Confidence level or decay factor
  • C gives the rate of decay as similarity flows across edges (since C < 1)
• If a or b has no in-neighbors, s(a, b) = 0
• SimRank scores are symmetric, i.e., s(a, b) = s(b, a)
• The similarity between a and b is the average similarity between the in-neighbors of a and the in-neighbors of b, scaled by C
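The equation itself is not present in this slide text. For reference, the basic SimRank equation in Jeh and Widom's paper, written with the in-neighbor notation I(v) and I_i(v) from the graph model, has the following form:

```latex
s(a,b) \;=\; \frac{C}{|I(a)|\,|I(b)|}
  \sum_{i=1}^{|I(a)|} \sum_{j=1}^{|I(b)|} s\bigl(I_i(a),\, I_j(b)\bigr)
  \quad (a \neq b),
\qquad s(a,a) = 1 .
```

When I(a) or I(b) is empty the double sum has no terms, and s(a, b) is defined to be 0.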
SimRank: Basic SimRank Equation
• Similarity can be thought of as “propagating” from pair to pair
• Consider the derived graph G^2 = (V^2, E^2), where
  • V^2 = V × V; each node in V^2 represents a pair (a, b) of nodes in G
  • An edge from (a, b) to (c, d) exists in E^2 iff the edges <a, c> and <b, d> exist in G
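The derived graph can be sketched directly from its definition. This is an illustration only, with a hypothetical function name `derived_graph` and a toy three-node graph; materializing all of E^2 like this is only practical for very small graphs.

```python
from itertools import product


def derived_graph(nodes, edges):
    """Build G^2 = (V^2, E^2) from a directed graph G = (V, E)."""
    edge_set = set(edges)
    # V^2: every ordered pair of nodes of G.
    v2 = list(product(nodes, repeat=2))
    # E^2: (a, b) -> (c, d) exists iff <a, c> and <b, d> are both edges of G.
    e2 = {((a, b), (c, d)) for (a, c) in edge_set for (b, d) in edge_set}
    return v2, e2


# Toy graph (assumption): u points to both v and w.
toy_nodes = ["u", "v", "w"]
toy_edges = [("u", "v"), ("u", "w")]
V2, E2 = derived_graph(toy_nodes, toy_edges)
```

In this toy graph the pair (u, u) has an edge to (v, w), which is how similarity "flows" from the singleton (u, u) to the pair (v, w).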
SimRank: Bipartite SimRank
• Bipartite domains consist of two types of objects
• Recommender system
  • People are similar if they purchase similar items
  • Items are similar if they are purchased by similar people
SimRank: Bipartite SimRank
• Bipartite Equation
  • Directed edges go from people to items
  • s(A, B) denotes the similarity between persons A and B (A ≠ B)
  • s(c, d) denotes the similarity between items c and d (c ≠ d)
  • The similarity between persons A and B is the average similarity between the items they purchased
  • The similarity between items c and d is the average similarity between the people who purchased them
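The bipartite equations are not present in this slide text. In the paper they take the following form, where people use their out-neighbors (items purchased) and items use their in-neighbors (purchasers). The separate decay constants C_1 and C_2 come from the paper and are not named on the slide.

```latex
s(A,B) \;=\; \frac{C_1}{|O(A)|\,|O(B)|}
  \sum_{i=1}^{|O(A)|} \sum_{j=1}^{|O(B)|} s\bigl(O_i(A),\, O_j(B)\bigr)
  \quad (A \neq B)

s(c,d) \;=\; \frac{C_2}{|I(c)|\,|I(d)|}
  \sum_{i=1}^{|I(c)|} \sum_{j=1}^{|I(d)|} s\bigl(I_i(c),\, I_j(d)\bigr)
  \quad (c \neq d)
```

As in the basic equation, s(A, A) = s(c, c) = 1.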
SimRank: Computing SimRank - Naïve Method
• R_k(a, b) gives the score between a and b on iteration k
• The values R_k(*, *) are non-decreasing as k increases
• In experiments, R_k converges rapidly; K = 5 iterations were sufficient
• Complexity
  • Space: O(n^2) to store the results R_k
  • Time: O(K n^2 d_2), where d_2 is the average of |I(a)||I(b)| over all node pairs (a, b)
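A minimal sketch of the naïve iteration, assuming the basic SimRank equation with R_0(a, b) = 1 if a = b and 0 otherwise. The function name `simrank`, the dictionary-based storage, and the default C = 0.8 are choices made for this illustration, not details taken from the slides.

```python
def simrank(nodes, in_nbrs, C=0.8, K=5):
    """Naive iterative SimRank over all node pairs.

    nodes:   list of node ids
    in_nbrs: dict mapping a node to the set of its in-neighbors
    C:       decay factor (0.8 is an assumed default)
    K:       number of iterations
    """
    # R_0: a node is maximally similar to itself, all other pairs start at 0.
    scores = {(a, b): 1.0 if a == b else 0.0 for a in nodes for b in nodes}
    for _ in range(K):
        new_scores = {}
        for a in nodes:
            for b in nodes:
                if a == b:
                    new_scores[(a, b)] = 1.0
                    continue
                ia = in_nbrs.get(a, ())
                ib = in_nbrs.get(b, ())
                if not ia or not ib:
                    # No in-neighbors on one side: similarity is 0.
                    new_scores[(a, b)] = 0.0
                    continue
                # Average similarity over all in-neighbor pairs, scaled by C.
                total = sum(scores[(x, y)] for x in ia for y in ib)
                new_scores[(a, b)] = C * total / (len(ia) * len(ib))
        scores = new_scores
    return scores


# Toy graph (assumption): u points to both v and w.
toy_scores = simrank(["u", "v", "w"], {"v": {"u"}, "w": {"u"}})
```

In the toy graph, v and w share their only in-neighbor u, so their score is C times s(u, u), while u has no in-neighbors and scores 0 against every other node.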
SimRank: Computing SimRank - Pruning
• Pruning the logical graph G^2
• In the naïve method,
  • All n^2 nodes of G^2 are considered
  • Similarity scores are computed for every node pair
• Nodes far from a node v tend to have lower similarity scores with v than nodes near v
• Pruning
  • Set the similarity between two nodes that are far apart to 0
  • Consider only node pairs whose nodes lie within radius r of each other
• Complexity
  • Space: O(n d_r), where d_r is the average number of nodes within radius r of a node
  • Time: O(K n d_r d_2)
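One way to pick the node pairs that survive pruning is a bounded breadth-first search from every node. This sketch assumes that "radius r" is measured as hop distance in the undirected version of the graph; the slides do not spell out the distance measure, and the function name `pairs_within_radius` is hypothetical.

```python
from collections import deque


def pairs_within_radius(nodes, edges, r):
    """Return ordered pairs (a, b) with undirected hop distance at most r."""
    # Assumption: edge direction is ignored when measuring distance.
    adj = {v: set() for v in nodes}
    for p, q in edges:
        adj[p].add(q)
        adj[q].add(p)

    pairs = set()
    for src in nodes:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            v = queue.popleft()
            if dist[v] == r:
                continue  # do not expand beyond radius r
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        pairs.update((src, v) for v in dist)
    return pairs


# Toy path graph (assumption): a -> b -> c -> d.
path_nodes = ["a", "b", "c", "d"]
path_edges = [("a", "b"), ("b", "c"), ("c", "d")]
near_1 = pairs_within_radius(path_nodes, path_edges, 1)
near_2 = pairs_within_radius(path_nodes, path_edges, 2)
```

Scores would then be stored and updated only for pairs in the returned set, and every other pair would be treated as having similarity 0.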
Random Surfer-Pairs Model
• Provides an intuitive model for interpreting the similarity scores
  • Based on “random surfers”
  • Shows that the SimRank score s(a, b) measures how soon two random surfers are expected to meet at the same node
• Expected Distance
  • u and v are nodes in a strongly connected graph
  • The expected distance from u to v is exactly the expected number of steps a random surfer would take before first reaching v, starting from u
  • Tour t = <w_1, …, w_k>
  • l(t): length of t
  • P[t]: probability of traveling t
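The formula for expected distance is not present in this slide text. In the paper's notation it is a sum over all tours from u to v that touch v only at their final step, and a surfer at a node picks each out-edge with equal probability:

```latex
d(u,v) \;=\; \sum_{t \,:\, u \rightsquigarrow v} P[t]\; l(t),
\qquad
l(t) = k - 1,
\qquad
P[t] = \prod_{i=1}^{k-1} \frac{1}{|O(w_i)|}
\quad \text{for } t = \langle w_1, \ldots, w_k \rangle .
```

A tour of length 0 (u = v) has probability 1, so d(v, v) = 0.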
Random Surfer-Pairs Model
[Figure: sample graphs annotated with EMD values: m(*,*) = ∞; m(v,w) = 1, m(u,v) = ∞, m(u,w) = ∞; m(*,*) = 3]
• Expected Meeting Distance (EMD)
  • EMD is symmetric
  • The EMD m(a, b) is simply the expected distance in G^2 from (a, b) to any singleton node (x, x) ∈ V^2
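Written out in the paper's notation, the expected meeting distance sums over tours in G^2 that start at (a, b) and touch a singleton node only at their final step. The symbols P[t] and l(t) are the tour probability and tour length, as for expected distance.

```latex
m(a,b) \;=\; \sum_{t \,:\, (a,b) \rightsquigarrow (x,x)} P[t]\; l(t)
```

If some tour from (a, b) can run forever without reaching a singleton, the tours that do reach one carry total probability less than 1, and m(a, b) is taken to be ∞. This is the "infinite EMD" case shown in the figure's labels.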
Random Surfer-Pairs Model
• Expected-f Meeting Distance
  • The approach taken to circumvent the “infinite EMD” problem
  • Map all distances to a finite interval: instead of computing the expected length l(t) of a tour, compute the expected value of a bounded function f of l(t)
• Equivalence to SimRank
  • s'(*,*) exactly models the original definition of SimRank scores
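The paper's choice of f is the exponential f(z) = C^z with the same constant C ∈ (0, 1) as in the SimRank equation, which gives the following definition over tours in G^2:

```latex
s'(a,b) \;=\; \sum_{t \,:\, (a,b) \rightsquigarrow (x,x)} P[t]\; C^{\,l(t)}
```

Two boundary cases follow directly: s'(a, a) = 1, because the only tour has length 0, and s'(a, b) = 0 when no tour from (a, b) reaches a singleton node. In the paper's equivalence theorem the surfers travel along back-edges, that is, from a node to its in-neighbors, which is what ties s' to the in-neighbor form of the SimRank equation.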
Conclusion
• Main contributions
  • A formal definition of SimRank similarity scoring over arbitrary graphs, several useful derivatives of SimRank, and an algorithm to compute SimRank
  • A graph-theoretic model for SimRank that gives intuitive mathematical insight into its use and computation
  • Experimental results using an in-memory implementation of SimRank over two real data sets show the effectiveness and feasibility of SimRank
Future Work
• Address efficiency and scalability issues
  • Including additional pruning heuristics and disk-based algorithms
• Consider ternary (or higher-order) relationships in computing structural-context similarity
• Explore the combination of SimRank with other domain-specific similarity measures