SimRank: A Measure of Structural-Context Similarity

SimRank: A Measure of Structural-Context Similarity. Glen Jeh and Jennifer Widom, Stanford University. ACM SIGKDD 2002. January 19, 2011. Taikyoung Kim, SNU IDB Lab.


Presentation Transcript


  1. SimRank: A Measure of Structural-Context Similarity • Glen Jeh and Jennifer Widom, Stanford University • ACM SIGKDD 2002 • January 19, 2011 • Taikyoung Kim, SNU IDB Lab.

  2. Outline • Introduction • Basic Graph Model • SimRank • Random Surfer-Pairs Model • Conclusion • Future Work

  3. Introduction • Many applications require a measure of “similarity” between objects • A “find-similar-document” query in a search engine • Collaborative filtering in a recommender system

  4. Introduction • “Two objects are similar if they are referenced by similar objects” • Proposes a general approach that exploits the object-to-object relationships found in many domains • An algorithm computes similarity scores between nodes based on their structural context • Intuition behind the algorithm • Similar objects are related to similar objects • The base case is that objects are similar to themselves

  5. Basic Graph Model [Figure labels: O(Univ), I(ProfB)] • G = (V, E) (vertices, edges) • Nodes in V: objects in the domain • Directed edges in E: relationships between objects • <p, q>: an edge from object p to object q • For a node v, denote: • I(v): the set of in-neighbors of v • O(v): the set of out-neighbors of v • I_i(v): an individual in-neighbor (1 ≤ i ≤ |I(v)|) • O_i(v): an individual out-neighbor (1 ≤ i ≤ |O(v)|)
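The notation above maps directly onto two adjacency maps. A minimal Python sketch follows; the function name `build_neighbors` and the edge list are assumptions made for illustration (the edges are modeled on the Univ/Prof/Student node names used in the slides, not copied from the figure).

```python
from collections import defaultdict

def build_neighbors(edges):
    """Return (I, O): in-neighbor and out-neighbor sets keyed by node."""
    in_nbrs = defaultdict(set)   # I(v)
    out_nbrs = defaultdict(set)  # O(v)
    for p, q in edges:           # each <p, q> is a directed edge from p to q
        out_nbrs[p].add(q)
        in_nbrs[q].add(p)
    return in_nbrs, out_nbrs

# Hypothetical edge list, assumed for illustration only.
edges = [
    ("Univ", "ProfA"), ("Univ", "ProfB"),
    ("ProfA", "StudentA"), ("ProfB", "StudentB"),
    ("StudentA", "Univ"), ("StudentB", "ProfB"),
]
in_nbrs, out_nbrs = build_neighbors(edges)
print(sorted(out_nbrs["Univ"]))   # O(Univ)
print(sorted(in_nbrs["ProfB"]))   # I(ProfB)
```

Sets are used because the model treats I(v) and O(v) as sets; the index i in I_i(v) is then just an arbitrary enumeration of the set.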

  6. Outline • Introduction • Basic Graph Model • SimRank • Random Surfer-Pairs Model • Conclusion • Future Work

  7. SimRank [Figure: similar nodes include {ProfA, ProfB}, {StudentA, StudentB}, {Univ, ProfB}, …] • Motivation • Two objects are similar if they are referenced by similar objects • An object is considered maximally similar to itself (similarity score of 1)

  8. SimRank: Basic SimRank Equation • The similarity between objects a and b: s(a, b) ∈ [0, 1] • C is a constant between 0 and 1 • Called the confidence level or decay factor • C gives the rate of decay as similarity flows across edges (since C < 1) • If a or b has no in-neighbors, s(a,b) = 0 • SimRank scores are symmetric, i.e., s(a,b) = s(b,a) • The similarity between a and b is the average similarity between the in-neighbors of a and the in-neighbors of b, scaled by C
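The equation itself did not survive in this transcript. Reconstructed from the bullets on this slide (an average over in-neighbor pairs, scaled by the decay factor C, with the two stated base cases), the recurrence can be written as follows; treat it as a reconstruction rather than a copy of the slide.

```latex
s(a,b) =
\begin{cases}
1 & \text{if } a = b,\\[4pt]
\dfrac{C}{|I(a)|\,|I(b)|}
  \displaystyle\sum_{i=1}^{|I(a)|} \sum_{j=1}^{|I(b)|}
  s\bigl(I_i(a),\, I_j(b)\bigr)
  & \text{if } a \neq b,\ I(a) \neq \emptyset,\ I(b) \neq \emptyset,\\[4pt]
0 & \text{otherwise.}
\end{cases}
```

For example, if a has one in-neighbor u and b has in-neighbors u and w, then s(a,b) = (C/2)·(s(u,u) + s(u,w)) = (C/2)·(1 + s(u,w)).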

  9. SimRank: Basic SimRank Equation • Similarity can be thought of as “propagating” from pair to pair • Consider the derived graph G² = (V², E²), where • Each node in V² = V × V represents a pair (a,b) of nodes in G • An edge from (a,b) to (c,d) exists in E² iff the edges <a,c> and <b,d> exist in G
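The edge rule for the derived graph can be sketched in a few lines of Python. The function name `pair_graph_edges` and the toy edge list are assumptions for illustration; the sketch materializes every edge of G², which is only practical for very small graphs.

```python
def pair_graph_edges(edges):
    """Edges of G^2: (a,b) -> (c,d) exists iff <a,c> and <b,d> are edges of G."""
    return {((a, b), (c, d)) for (a, c) in edges for (b, d) in edges}

# Hypothetical toy graph: x points to both y and z.
toy_edges = [("x", "y"), ("x", "z")]
g2_edges = pair_graph_edges(toy_edges)
for src, dst in sorted(g2_edges):
    print(src, "->", dst)
```

Every pair of edges of G, including an edge paired with itself, contributes one edge to G², so |E²| = |E|².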

  10. SimRank: Bipartite SimRank • Bipartite domains consist of two types of objects • Example: a recommender system • People are similar if they purchase similar items • Items are similar if they are purchased by similar people

  11. SimRank: Bipartite SimRank • Bipartite Equation • Directed edges go from people to items • s(A,B) denotes the similarity between persons A and B (A ≠ B) • s(c,d) denotes the similarity between items c and d (c ≠ d) • The similarity between persons A and B is the average similarity between the items they purchased • The similarity between items c and d is the average similarity between the people who purchased them
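The bipartite equations are also missing from the transcript. Because edges go from people to items, a person's purchases are its out-neighbors and an item's purchasers are its in-neighbors, so the two averages described above can be written as below. The separate decay constants C_1 and C_2 are an assumption here, playing the same role as C in the basic equation.

```latex
\begin{align}
s(A,B) &= \frac{C_1}{|O(A)|\,|O(B)|}
          \sum_{i=1}^{|O(A)|} \sum_{j=1}^{|O(B)|}
          s\bigl(O_i(A),\, O_j(B)\bigr), & A &\neq B,\\
s(c,d) &= \frac{C_2}{|I(c)|\,|I(d)|}
          \sum_{i=1}^{|I(c)|} \sum_{j=1}^{|I(d)|}
          s\bigl(I_i(c),\, I_j(d)\bigr), & c &\neq d.
\end{align}
```

The two equations are mutually recursive: person similarity is defined through item similarity and vice versa, with s(A,A) = s(c,c) = 1 as the base case.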

  12. SimRank: Computing SimRank - Naïve Method • R_k(a,b) gives the score between a and b at iteration k • The values R_k(*,*) are non-decreasing as k increases • In experiments, R_k converged rapidly, with K = 5 iterations being enough • Complexity • Space: O(n²) to store the results R_k • Time: O(K n² d_2), where d_2 is the average of |I(a)||I(b)| over all node pairs (a,b)
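The naïve method can be sketched as a fixed number of sweeps over all node pairs. This is a minimal illustration, not the authors' implementation: it assumes the usual starting point R_0(a,b) = 1 if a = b and 0 otherwise, and the names `simrank`, `nodes`, and `in_nbrs`, the value C = 0.8, and the example graph are all assumptions.

```python
from itertools import product

def simrank(nodes, in_nbrs, C=0.8, K=5):
    """Naive SimRank: K sweeps over all n^2 pairs, O(n^2) space."""
    # R_0: every node is maximally similar to itself, everything else is 0.
    R = {(a, b): 1.0 if a == b else 0.0 for a, b in product(nodes, repeat=2)}
    for _ in range(K):
        R_next = {}
        for a, b in product(nodes, repeat=2):
            Ia, Ib = in_nbrs.get(a, ()), in_nbrs.get(b, ())
            if a == b:
                R_next[(a, b)] = 1.0
            elif not Ia or not Ib:
                R_next[(a, b)] = 0.0   # no in-neighbors: score is 0
            else:
                # Average of the previous scores over all in-neighbor pairs.
                total = sum(R[(i, j)] for i in Ia for j in Ib)
                R_next[(a, b)] = C * total / (len(Ia) * len(Ib))
        R = R_next
    return R

# Hypothetical graph, given directly as in-neighbor sets I(v).
nodes = ["Univ", "ProfA", "ProfB", "StudentA", "StudentB"]
in_nbrs = {
    "Univ": {"StudentA"},
    "ProfA": {"Univ"},
    "ProfB": {"Univ", "StudentB"},
    "StudentA": {"ProfA"},
    "StudentB": {"ProfB"},
}
scores = simrank(nodes, in_nbrs, C=0.8, K=5)
print(round(scores[("ProfA", "ProfB")], 3))
```

Each sweep reads only the previous iteration's table `R`, which is why two tables are kept; the inner sum over in-neighbor pairs is the source of the d_2 factor in the time bound.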

  13. SimRank: Computing SimRank - Pruning • Pruning the logical graph G² • In the naïve method, • All n² nodes of G² are considered • Similarity scores are computed for every node pair • Nodes far from a node v tend to have lower similarity scores with v than nodes near v • Pruning • Set the similarity between two nodes that are far apart to 0 • Consider only node pairs whose nodes lie within radius r of each other • Complexity • Space: O(n d_r), where d_r is the average number of nodes near a given node • Time: O(K n d_r d_2)
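One way to picture the pruning step is to enumerate only the node pairs that fall within radius r of each other and treat every other pair as having similarity 0. The sketch below assumes "near" means at most r hops when edge direction is ignored; that distance notion, and the names `neighborhood` and `candidate_pairs`, are assumptions, since the slide does not define them.

```python
from collections import deque

def neighborhood(adj, start, r):
    """Nodes within r undirected hops of start (start included), via BFS."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        if dist[u] == r:
            continue  # do not expand beyond the radius
        for w in adj.get(u, ()):
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return set(dist)

def candidate_pairs(edges, r):
    """Node pairs kept after pruning; all other pairs are assumed to score 0."""
    adj = {}
    for p, q in edges:  # ignore direction when measuring distance
        adj.setdefault(p, set()).add(q)
        adj.setdefault(q, set()).add(p)
    return {(a, b) for a in adj for b in neighborhood(adj, a, r)}

# Hypothetical chain a -> b -> c -> d.
chain = [("a", "b"), ("b", "c"), ("c", "d")]
print(len(candidate_pairs(chain, 1)), "pairs kept out of", 4 * 4)
```

An iteration like the naïve method would then loop over these pairs only, which is where the n·d_r factor in the space and time bounds comes from.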

  14. Outline • Introduction • Basic Graph Model • SimRank • Random Surfer-Pairs Model • Conclusion • Future Work

  15. Random Surfer-Pairs Model • Provides an intuitive model for interpreting similarity scores • Based on “random surfers” • Shows that the SimRank score s(a,b) measures how soon two random surfers are expected to meet at the same node • Expected Distance (ED) • u and v are nodes in a strongly connected graph • The ED from u to v is exactly the expected number of steps a random surfer would take before first reaching v, starting from u • Tour t = <w_1, …, w_k> • l(t): length of t • P[t]: probability of traveling t
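In terms of the tour notation above, the expected distance can be written as a sum over tours. The formula below is a reconstruction under two assumptions not stated on the slide: the sum ranges over tours from u to v that do not touch v before the final step, and the surfer picks uniformly among out-edges at each node.

```latex
d(u,v) = \sum_{t \,:\, u \rightsquigarrow v} P[t]\, l(t),
\qquad
l(t) = k - 1,
\qquad
P[t] = \prod_{i=1}^{k-1} \frac{1}{|O(w_i)|}
```

Here l(t) counts the steps in the tour t = <w_1, …, w_k>, and P[t] multiplies the probability of each individual step.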

  16. Random Surfer-Pairs Model [Figure labels: m(*,*) = ∞; m(v,w) = 1; m(u,v) = ∞; m(u,w) = ∞; m(*,*) = 3] • Expected Meeting Distance (EMD) • EMD is symmetric • The EMD m(a,b) is simply the expected distance in G² from (a,b) to any singleton node (x,x) ∈ V²
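Read as an expected distance in G², the EMD can be written in the same tour notation used for expected distance (a tour t, its length l(t), and its probability P[t]). This is a reconstruction; it assumes the sum ranges over tours in G² that start at (a,b) and reach a singleton node only at their last step.

```latex
m(a,b) = \sum_{t \,:\, (a,b) \rightsquigarrow (x,x)} P[t]\, l(t)
```

Symmetry follows because swapping the two surfers maps each tour from (a,b) to a tour from (b,a) with the same length and probability.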

  17. Random Surfer-Pairs Model • Expected-f Meeting Distance • Our approach to circumvent the “infinite EMD” problem • Map all distances to a finite interval: instead of computing the expected length l(t) of a tour, compute the expected value of f(l(t)) for a bounded function f • Equivalence to SimRank • s'(*,*) exactly models our original definition of SimRank scores
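A natural choice for the mapping is f(z) = C^z with the same constant C in (0, 1) used as the SimRank decay factor; this choice is stated here as an assumption, since the slide's formula did not survive. With it, the expected-f meeting distance over tours t in G² becomes:

```latex
s'(a,b) = \sum_{t \,:\, (a,b) \rightsquigarrow (x,x)} P[t]\, C^{\,l(t)}
```

This maps every tour length in [0, ∞) into (0, 1]: a pair that has already met (l(t) = 0) contributes 1, long tours contribute values close to 0, and pairs that can never meet get a score of 0 instead of an infinite distance. For the scores to line up with SimRank's in-neighbor averages, the surfers are assumed to walk edges backward, from a node to one of its in-neighbors.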

  18. Outline • Introduction • Basic Graph Model • SimRank • Random Surfer-Pairs Model • Conclusion • Future Work

  19. Conclusion • Main contributions • A formal definition of SimRank similarity scoring over arbitrary graphs, several useful derivatives of SimRank, and an algorithm to compute SimRank • A graph-theoretic model for SimRank that gives intuitive mathematical insight into its use and computation • Experimental results using an in-memory implementation of SimRank over two real data sets show the effectiveness and feasibility of SimRank

  20. Future Work • Address efficiency and scalability issues • Including additional pruning heuristics and disk-based algorithms • Consider ternary (or more) relationships in computing structural-context similarity • Explore the combination of SimRank with other domain-specific similarity measures
