
SimRank: A Measure of Structural-Context Similarity

SimRank: A Measure of Structural-Context Similarity. Glen Jeh & Jennifer Widom KDD 2002. Motivation. Many applications require a measure of “similarity” between objects. Web search Shopping Recommendations Search for “Related Works” among scientific papers


Presentation Transcript


  1. SimRank: A Measure of Structural-Context Similarity Glen Jeh & Jennifer Widom KDD 2002

  2. Motivation • Many applications require a measure of “similarity” between objects. • Web search • Shopping recommendations • Search for “related works” among scientific papers • But “similarity” may be domain-dependent. • Can we define a generic model for similarity?

  3. Common Ground • What do all these applications have in common? A data set of objects linked by a set of relations. • Then, a generic concept of similarity is structural-context similarity. • “Two objects are similar if they relate to similar objects.” • Recall automorphic equivalence: • “Two objects are equivalent if they relate to equivalent objects.”

  4. Problem Statement • Given a graph G = (V, E), for each pair of vertices a, b ∈ V, compute a similarity (ranking) score s(a,b) based on the concept of structural-context similarity.

  5. Basic Graph Model • Directed graph G = (V, E) • V = set of objects • E = set of unweighted edges • Edge (u, v) exists if there is a relation u → v • I(v) = set of in-neighbors of vertex v • O(v) = set of out-neighbors of vertex v

  6. SimRank Similarity • Recursive model • “Two objects are similar if they are referenced by similar objects” • That is, a ~ b if • c → a and d → b, and • c ~ d • An object is equivalent to itself (score = 1) • Example • ProfA ~ ProfB because both are referenced by Univ. • StudentA ~ StudentB because they are referenced by similar nodes {ProfA, ProfB}

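  7. Basic SimRank Equation • s(a,b) = similarity between a and b = average similarity between in-neighbors of a and in-neighbors of b • s(a,b) is in the range [0, 1] • If a = b, then s(a,b) = 1 • If a ≠ b, $s(a,b) = \frac{C}{|I(a)|\,|I(b)|} \sum_{i=1}^{|I(a)|} \sum_{j=1}^{|I(b)|} s(I_i(a), I_j(b))$, where $I_i(a)$ is the i-th in-neighbor of a • C is a constant, 0 < C < 1 • If I(a) = ∅ or I(b) = ∅, then s(a,b) = 0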

  8. Decay Factor C • x is identical to itself: s(x,x) = 1 • Since we have x → a and x → b, should s(a,b) = 1 also? • If the graph represented all the information about x, a, and b, then s(a,b) would ideally = 1. • But in reality the graph does not describe everything about them, so we expect s(a,b) < 1. • Therefore, the constant C expresses our limited confidence, or decay with distance: s(a,b) = C ∙ average similarity of (I(a), I(b))
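As a quick check of the decay idea, consider the situation on this slide, where x is the only in-neighbor of both a and b. Applying "C times the average similarity of the in-neighbors" gives the following; the value C = 0.8 is only an illustrative assumption, not a value taken from the slides.

```latex
\begin{aligned}
s(a,b) &= \frac{C}{|I(a)|\,|I(b)|} \sum_{i=1}^{|I(a)|} \sum_{j=1}^{|I(b)|} s(I_i(a), I_j(b)) \\
       &= \frac{C}{1 \cdot 1}\, s(x,x) = C = 0.8
\end{aligned}
```

So two objects with a single shared parent score exactly C, and each further step away from a shared ancestor multiplies in another factor of C.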

  9. G² Paired-Vertex Perspective • Given graph G, define G² = (V², E²) where • V² = V × V. Each vertex in V² is a pair of vertices in V. • E²: (a,b) → (c,d) in G² iff a → c and b → d in G • Since similarity scores are symmetric, (a,b) and (b,a) are merged into a single vertex.
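The construction of the paired-vertex graph can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `pair_graph`, the edge-list input format, and the toy graph (a simplified, acyclic take on the Univ/Prof/Student example) are all assumptions made for the example.

```python
from itertools import combinations_with_replacement

def pair_graph(edges):
    """Build the edge set of G^2 from directed edges (u, v) of G.

    Pairs (a, b) and (b, a) are merged by always storing them in sorted order.
    """
    nodes = sorted({u for u, _ in edges} | {v for _, v in edges})
    out = {n: [] for n in nodes}
    for u, v in edges:
        out[u].append(v)

    def key(a, b):
        # Canonical representative of the unordered pair {a, b}
        return (a, b) if a <= b else (b, a)

    pair_edges = set()
    for a, b in combinations_with_replacement(nodes, 2):
        # (a, b) -> (c, d) exists iff a -> c and b -> d in G
        for c in out[a]:
            for d in out[b]:
                pair_edges.add((key(a, b), key(c, d)))
    return pair_edges

# Hypothetical toy graph
edges = [("Univ", "ProfA"), ("Univ", "ProfB"),
         ("ProfA", "StudentA"), ("ProfB", "StudentB")]
g2 = pair_graph(edges)
```

In this toy graph, the self-vertex (Univ, Univ) points to (ProfA, ProfB), which in turn points to (StudentA, StudentB).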

  10. Source and Flow of Similarity • The SimRank score for a vertex (a,b) in G² = similarity between a and b in G. • The sources of similarity are self-vertices, like (Univ, Univ). • Then, similarity propagates along pair-paths in G², away from the sources. • Note that values decrease away from (Univ, Univ)

  11. SimRank in Bipartite Domains • Bipartite: 2 types of objects • Example: Buyers and Items

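  12. Bipartite SimRank Equations • Two types of similarity: • Two buyers are similar if they buy similar items • Out-neighbors of buyers are relevant: $s(A,B) = \frac{C_1}{|O(A)|\,|O(B)|} \sum_{i=1}^{|O(A)|} \sum_{j=1}^{|O(B)|} s(O_i(A), O_j(B))$ for buyers A ≠ B • Two items are similar if they are bought by similar buyers • In-neighbors of items are relevant: $s(c,d) = \frac{C_2}{|I(c)|\,|I(d)|} \sum_{i=1}^{|I(c)|} \sum_{j=1}^{|I(d)|} s(I_i(c), I_j(d))$ for items c ≠ d • In general, we can use I(.) and/or O(.) for any graph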
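The mutual reinforcement between buyer similarity and item similarity can be sketched as below. This is only an illustration under stated assumptions: the function name `bipartite_simrank`, the dictionary input format, the default decay factors of 0.8, and the tiny purchase data are all hypothetical.

```python
def bipartite_simrank(purchases, C1=0.8, C2=0.8, iterations=5):
    """purchases maps each buyer to the list of items bought (out-neighbors)."""
    buyers = sorted(purchases)
    items = sorted({item for bought in purchases.values() for item in bought})
    # In-neighbors of each item: the buyers who bought it
    bought_by = {item: [b for b in buyers if item in purchases[b]] for item in items}

    def average(xs, ys, table):
        # Average pairwise similarity; 0 if either neighbor set is empty
        if not xs or not ys:
            return 0.0
        return sum(table[(x, y)] for x in xs for y in ys) / (len(xs) * len(ys))

    s_buyers = {(a, b): float(a == b) for a in buyers for b in buyers}
    s_items = {(c, d): float(c == d) for c in items for d in items}
    for _ in range(iterations):
        # Both tables are updated from the previous iteration's values
        new_buyers = {(a, b): 1.0 if a == b
                      else C1 * average(purchases[a], purchases[b], s_items)
                      for a in buyers for b in buyers}
        new_items = {(c, d): 1.0 if c == d
                     else C2 * average(bought_by[c], bought_by[d], s_buyers)
                     for c in items for d in items}
        s_buyers, s_items = new_buyers, new_items
    return s_buyers, s_items

# Hypothetical purchase data
purchases = {"A": ["sugar", "frosting"], "B": ["frosting", "eggs"]}
s_buyers, s_items = bipartite_simrank(purchases)
```

After the first iteration, buyers A and B score 0.8 × 1/4 = 0.2, because only one of the four item pairs (frosting, frosting) is a match; later iterations raise this as item similarities become nonzero.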

  13. MiniMax Variant • Motivation: Two students A and B take the same courses: {Eng1, Math1, Chem1, Hist1} • SimRank compares each course of A with each course of B • But intuitively we just want the best matching pairs: s(Eng1_A, Eng1_B), s(Math1_A, Math1_B), etc. • Solution: Two steps • Max: Pair each neighbor of A with only its most similar neighbor of B: $s_A(A,B) = \frac{C_1}{|O(A)|} \sum_{i=1}^{|O(A)|} \max_{j} s(O_i(A), O_j(B))$. Do the same in the other direction: $s_B(A,B) = \frac{C_1}{|O(B)|} \sum_{j=1}^{|O(B)|} \max_{i} s(O_i(A), O_j(B))$, with decay factor C_1 • Min: Final s(A,B) is the smaller of s_A(A,B) and s_B(A,B) [weakest link]
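The two steps can be sketched for a single pair of students as follows. The helper names (`minimax_similarity`, `average_similarity`, `same_course`), the decay factor of 0.8, and the crude course-similarity function are assumptions made only for this illustration.

```python
def average_similarity(neighbors_a, neighbors_b, sim, C=0.8):
    """Plain SimRank step: C times the average over all neighbor pairs."""
    if not neighbors_a or not neighbors_b:
        return 0.0
    total = sum(sim(x, y) for x in neighbors_a for y in neighbors_b)
    return C * total / (len(neighbors_a) * len(neighbors_b))

def minimax_similarity(neighbors_a, neighbors_b, sim, C=0.8):
    """MiniMax step: best match per neighbor in each direction, then the minimum."""
    if not neighbors_a or not neighbors_b:
        return 0.0
    # Max step, from A's side: each neighbor of A is paired with its best match in B
    s_a = C * sum(max(sim(x, y) for y in neighbors_b) for x in neighbors_a) / len(neighbors_a)
    # Max step, from B's side
    s_b = C * sum(max(sim(x, y) for x in neighbors_a) for y in neighbors_b) / len(neighbors_b)
    # Min step: weakest link
    return min(s_a, s_b)

# Hypothetical course similarity: 1 for the same course, 0 otherwise
def same_course(x, y):
    return 1.0 if x == y else 0.0

courses = ["Eng1", "Math1", "Chem1", "Hist1"]
plain = average_similarity(courses, courses, same_course)
minimax = minimax_similarity(courses, courses, same_course)
```

With identical course lists and this similarity function, the plain average gives 0.8 × 4/16 = 0.2, while the MiniMax step gives 0.8, which matches the intuition that students taking exactly the same courses should look very similar.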

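  14. Computing SimRank • R_k(a,b) = estimate of SimRank after k iterations. • Initialization: $R_0(a,b) = 1$ if a = b, and 0 otherwise • Iteration: $R_{k+1}(a,b) = \frac{C}{|I(a)|\,|I(b)|} \sum_{i=1}^{|I(a)|} \sum_{j=1}^{|I(b)|} R_k(I_i(a), I_j(b))$ for a ≠ b, and $R_{k+1}(a,a) = 1$ • R_k(a,b) is the similarity that has flowed a distance k away from the sources. R_k values are non-decreasing as k increases. • We can prove that R_k(a,b) converges to s(a,b)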
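A naive version of this iteration can be sketched as below. It is an illustration rather than the authors' implementation: the function name `simrank`, the dictionary-of-in-neighbors input, the value C = 0.8, and the toy graph (a simplified, acyclic variant of the Univ/Prof/Student example) are assumptions.

```python
from itertools import product

def simrank(in_neighbors, C=0.8, iterations=5):
    """Naive iterative SimRank.

    in_neighbors maps every vertex to the list of its in-neighbors.
    Returns a dict mapping (a, b) to the estimate R_k(a, b).
    """
    nodes = list(in_neighbors)
    # R_0: 1 on the diagonal, 0 elsewhere
    scores = {(a, b): 1.0 if a == b else 0.0 for a, b in product(nodes, repeat=2)}
    for _ in range(iterations):
        new_scores = {}
        for a, b in product(nodes, repeat=2):
            ia, ib = in_neighbors[a], in_neighbors[b]
            if a == b:
                new_scores[(a, b)] = 1.0
            elif not ia or not ib:
                new_scores[(a, b)] = 0.0
            else:
                # C times the average R_k over all in-neighbor pairs
                total = sum(scores[(x, y)] for x in ia for y in ib)
                new_scores[(a, b)] = C * total / (len(ia) * len(ib))
        scores = new_scores
    return scores

# Hypothetical toy graph, given as vertex -> in-neighbors
graph = {
    "Univ": [],
    "ProfA": ["Univ"],
    "ProfB": ["Univ"],
    "StudentA": ["ProfA"],
    "StudentB": ["ProfB"],
}
scores = simrank(graph, C=0.8, iterations=5)
```

On this toy graph, (ProfA, ProfB) reaches 0.8 after one iteration and (StudentA, StudentB) reaches 0.8 × 0.8 = 0.64 after two, showing similarity flowing one step further from the source (Univ, Univ) per iteration.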

  15. Time and Space Complexity • Space complexity: O(n²) to store R_k(a,b) • Time complexity: O(K n² d_2) for K iterations, where d_2 is the average of |I(a)||I(b)| over all vertex pairs (a,b) • To improve performance, we can prune G²: • Idea: vertices that are far apart should have very low similarity. We can approximate it as 0. • Select a radius r. If vertex-pair (a,b) cannot meet in fewer than r steps, remove it from the graph G². • Space complexity: O(n d_r) • Time complexity: O(K n d_r d_2), where d_r = average number of neighbors within radius r.
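One way to pick the vertex pairs that survive pruning is sketched below. The slide does not say exactly how distance is measured, so this sketch assumes hop distance in the undirected version of the graph and keeps pairs within r hops of each other; the function names and the chain graph are hypothetical.

```python
from collections import deque

def within_radius(adj, source, r):
    """Breadth-first search: all vertices at most r hops from source."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if dist[u] == r:
            continue  # do not expand beyond the radius
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return set(dist)

def candidate_pairs(edges, r):
    """Vertex pairs kept after pruning (assumption: undirected hop distance <= r)."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    pairs = set()
    for a in adj:
        for b in within_radius(adj, a, r):
            pairs.add((a, b) if a <= b else (b, a))
    return pairs

# Hypothetical chain a -> b -> c -> d -> e
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")]
pairs = candidate_pairs(edges, r=2)
```

For this five-vertex chain, 12 of the 15 unordered pairs (including self-pairs) remain; distant pairs such as (a, e) are dropped and their similarity is treated as 0.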

  16. Random Surfer-Pairs Model • SimRank s(a,b) measures how soon two random surfers are expected to meet at the same node if they start at nodes a and b and randomly walk the graph backwards • Background: Basic Forward Random Walk • Motion is in discrete steps, using edges of the graph. • At each time step, there is an equal probability of moving from your current vertex to any one of your out-neighbors. • Given adjacency matrix A, the probability of walking from x to y is $p_{xy} = a_{xy} / |O(x)|$. • Random Walk as a Markov Process • Initial location is described by the probability distribution vector π(0) • Probability of being at y at time 1: $\pi_y(1) = \sum_x \pi_x(0)\, p_{xy}$

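  17. Random Walk Transition Matrices • Given adjacency matrix A: $a_{xy} = 1$ if there is an edge x → y, and 0 otherwise • The forward and backward transition matrices: forward, $p_{xy} = a_{xy} / |O(x)|$ (step from x to an out-neighbor y); backward, $\tilde{p}_{yx} = a_{xy} / |I(y)|$ (step from y back to an in-neighbor x)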
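Both matrices can be derived from the adjacency matrix by normalizing rows (forward) or columns (backward). The sketch below uses plain Python lists; the function names, the convention that `backward[y][x]` is the probability of stepping from y back to x, and the three-vertex graph are assumptions for illustration.

```python
def transition_matrices(A):
    """Return (forward, backward) transition matrices for adjacency matrix A.

    forward[x][y]  = probability of stepping from x to out-neighbor y
    backward[y][x] = probability of stepping from y back to in-neighbor x
    Rows for vertices with no out-neighbors (or in-neighbors) are all zeros.
    """
    n = len(A)
    out_deg = [sum(A[x]) for x in range(n)]
    in_deg = [sum(A[x][y] for x in range(n)) for y in range(n)]
    forward = [[A[x][y] / out_deg[x] if out_deg[x] else 0.0 for y in range(n)]
               for x in range(n)]
    backward = [[A[x][y] / in_deg[y] if in_deg[y] else 0.0 for x in range(n)]
                for y in range(n)]
    return forward, backward

def step(pi, P):
    """One Markov step: new_pi[y] = sum over x of pi[x] * P[x][y]."""
    n = len(P)
    return [sum(pi[x] * P[x][y] for x in range(n)) for y in range(n)]

# Hypothetical graph with edges 0 -> 1, 0 -> 2, 1 -> 2
A = [[0, 1, 1],
     [0, 0, 1],
     [0, 0, 0]]
forward, backward = transition_matrices(A)
pi1 = step([1.0, 0.0, 0.0], forward)
```

Starting at vertex 0, one forward step lands on vertex 1 or 2 with probability 0.5 each; a backward step from vertex 2 reaches vertex 0 or 1 with probability 0.5 each.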

  18. Paired Backwards Random Walk • Probability of walking backwards from a to x in one step: $p(x \leftarrow a) = 1/|I(a)|$ if x ∈ I(a), and 0 otherwise • Two walkers meet at x if they start at a and b, and if one goes x ← a and the other goes x ← b, respectively. $s_x(a,b) = P(\text{meeting at } x) = \pi(a,b)\, p(x \leftarrow a)\, p(x \leftarrow b)$; $s(a,b) = P(\text{meeting}) = \sum_x \pi(a,b)\, p(x \leftarrow a)\, p(x \leftarrow b)$ • If they start together, they have met, so $s^{(0)}_{xy} = 1$ if x = y; 0 otherwise [identity matrix] • Then

  19. Experiments: Data Sets • Two data sets • ResearchIndex (www.researchindex.com) • a corpus of scientific research papers • 688,898 cross-references among 278,628 papers • Student transcripts • 1030 undergraduate students in the School of Engineering at Stanford University • Each transcript lists all courses that the student has taken so far (average: 40 courses/student)

  20. Performance Validation Metric • Problem: It is difficult to know the “correct” similarity between items. • Solution: Define a rough domain-specific metric σ(p,q): • For scientific papers, we have two versions: • σ_C(p,q) = fraction of q’s citations also cited by p • σ_T(p,q) = fraction of words in q’s title also in p’s title • For university courses: • σ_D(p,q) = 1 if p, q are in the same department, else 0

  21. Computing the Performance Score • Run the similarity algorithms: • SimRank (naïve, pruned, MiniMax) • Co-citation • For each object p and algorithm A, form a set top_{A,N}(p) of the N objects most similar to p. • For each q ∈ top_{A,N}(p), compute σ(p,q). • Return the average σ_{A,N}(p) over all q.
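The scoring procedure above can be sketched as follows for a single object p. The function names, the score-dictionary format, the tie-breaking rule, and the made-up course data are assumptions for illustration; the numbers are not results from the paper.

```python
def top_n(scores, p, n):
    """The n objects most similar to p, given scores[(p, q)] = similarity."""
    candidates = [(s, q) for (x, q), s in scores.items() if x == p and q != p]
    # Highest score first; ties broken by name so the result is deterministic
    candidates.sort(key=lambda t: (-t[0], t[1]))
    return [q for _, q in candidates[:n]]

def performance(scores, p, n, sigma):
    """Average of the validation metric sigma(p, q) over the top-n list for p."""
    top = top_n(scores, p, n)
    if not top:
        return 0.0
    return sum(sigma(p, q) for q in top) / len(top)

# Hypothetical department lookup and similarity scores
department = {"CS101": "CS", "CS102": "CS", "CS103": "CS", "MATH51": "MATH"}

def sigma_d(p, q):
    # 1 if both courses are in the same department, else 0
    return 1.0 if department[p] == department[q] else 0.0

scores = {
    ("CS101", "CS102"): 0.9,
    ("CS101", "MATH51"): 0.5,
    ("CS101", "CS103"): 0.4,
}
score = performance(scores, "CS101", 2, sigma_d)
```

Here the top-2 list for CS101 is [CS102, MATH51], so the score is (1 + 0) / 2 = 0.5. Averaging this quantity over all objects p would give one point on a curve for a given algorithm and N.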

  22. Experiment: Scientific Papers • Setup • Used bipartite SimRank, only considering in-neighbors (validation uses out-neighbors) • N ∈ {5, 10, …, 45, 50} • Results • Not very sensitive to decay factors C1 and C2 • Pruning the search radius had little effect on the rank order of scores.

  23. Results: Scientific Papers

  24. Experiment: Students and Courses • Setup • Bipartite domain • N ∈ {5, 10} • Results • Min-Max version of SimRank performed the best • Not very sensitive to decay factors C1 and C2

  25. Results: Students and Courses • Co-citation scores are very poor (0.161 for N = 5 and 0.147 for N = 10), so they are not shown in the graph.

  26. Conclusions • Defined a recursive model of structural similarity between objects in a network • Mathematically formulated SimRank based on the recursive concept • Presented a convergent algorithm to compute SimRank • Described a random-walk interpretation of SimRank equations and scores • Experimentally validated SimRank over two real data sets

  27. Open Issues and Critique • O(n²) is large; scalability needs to be improved. • s(a,b) only includes contributions from paths where a and b are the same distance from some x. What if the distances are offset (total is odd)? • As |I(a)| and |I(b)| increase, SimRank decreases, even if I(a) = I(b)! • Addressed partially by the MiniMax method
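The third critique can be checked with a short calculation. Suppose a and b share exactly the same n in-neighbors x_1, …, x_n, and assume for simplicity that distinct in-neighbors have zero similarity to each other (s(x_i, x_j) = 0 for i ≠ j); this simplifying assumption is introduced here only for illustration.

```latex
\begin{aligned}
s(a,b) &= \frac{C}{|I(a)|\,|I(b)|} \sum_{i=1}^{n} \sum_{j=1}^{n} s(x_i, x_j) \\
       &= \frac{C}{n^2} \sum_{i=1}^{n} s(x_i, x_i) = \frac{C \cdot n}{n^2} = \frac{C}{n}
\end{aligned}
```

So with identical in-neighbor sets the score still shrinks as 1/n: only the n matching pairs contribute, but the average is taken over all n² pairs. The MiniMax variant avoids this particular effect because each neighbor is compared only with its best match.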
