260 likes | 382 Vues
Link Prediction in Social Networks. Laks V.S. Lakshmanan (based on Kleinberg and Liben-Nowell . The Link Prediction Problem for Social Networks. CIKM 2003.). The Problem.
E N D
Link Prediction in Social Networks Laks V.S. Lakshmanan (based on Kleinberg and Liben-Nowell. The Link Prediction Problem for Social Networks. CIKM 2003.)
The Problem • Examples: Co-authorship graphs, scientists serving on common PC/editorial board, business leaders working on common board, etc. • Can current state of network be used to predict future links? • Note: factors extraneous to n/w (need to be) ignored (e.g., they met on a beach).
Motivation & Significance • Model for Network evolution. • Predict likely interactions, not explicitly observed, based on observed links – useful for terrorist network monitoring. • Link prediction is one instance of Social Network Analysis. • Application for “Friend” suggestion in online SN. Notice, this takes on a flavor of link recommendation.
Main Contributions • Features intrinsic to n/w can offer a good measure of how likely a future link is. • Can significantly outperform chance. • Need more sophisticated measures than direct ones (e.g., geodesic distances). • Note: Extra-network features can significantly improve prediction accuracy: e.g., keywords describing interests of each scientist, or keywords extracted from their papers/abstracts/titles. Not considered here. • See Al Hasan et al. SDM workshop 2006. http://www.cs.rpi.edu/~zaki/PaperDir/LINK06.pdf • Ben Taskar et al. Link prediction in relational data. In Proceedings of Neural Information Processing Systems, 2004. • A. Popescul and L. Ungar. Statistical relational learning for link prediction. In Workshop on Learning Statistical Models from Relational Data at IJCAI, 2003.
Link Prediction Problem Formalized • G[t0,t0’] = the multi-graph of social interactions between actors (users) during the training interval [t0,t0’]. • Separate edge for each interaction (paper, phone call, email, ...) with timestamp. • Predict edges in test interval [t1,t1’], where t1>t0’. • E_old = edges present in G[t0,t0’]. • E_new = edges in G[t1,t1’] but not in G[t0,t0’]. • Refinement: Restrict to those nodes adjacent to >=κ edges in G[t0,t0’] and to >=κ’ edges in G[t1,t1’]. Call them the corenodes.
Example • Say κ = κ’ = 2. • Only evaluate how well we do on {John, Ram, Asif, Mary}. • E*_new = {(J,M), (R,M)}, when restricted to the “core” nodes above. • Evaluation: among the top-most likely edges predicted, how well we do on precision and recall. • Precision = analog of soundness. • Recall = analog of completeness. John Mary Hui Asif Ram Nita John Mary Hui Asif Ram
Some simple notions • Triadic closure [Easley & Kleinberg 2010]: if x and y have a common friend, then the likelihood of x, y being friends increases. • Clustering coefficient of node x = fraction of its neighbors that are connected by a link. • Bearman and Moody’s study: teenage girls who have a low clustering coefficient in their network of friends are significantly more likely to contemplate suicide than those whose clustering coefficient is high! (Ouch!)
Various predictors • Notation: Γ(u) = neighbors of u in G[t0,t0’]. • score(u,v) = score that indicates likelihood of (u,v) being in E*_new. • Neighborhood-based: • Shortest distance: (negated) length of shortest path (geodesic). • Common neighbors: Γ(u) пΓ(v). • Jaccard coefficient: |Γ(u)пΓ(v)|/|Γ(u)υΓ(v)|.
More predictors • Adamic/Adar: ∑zЄΓ(u)пΓ(v) 1/log|Γ(z)|. • Preferential attachment: |Γ(u)|x|Γ(v)|. • Based on ensemble of paths: • Katz, Hitting/Commute Time, Rooted PageRank, SimRank.
The Katz Score • Katz: ∑iβi|pathsi(u,v)|, where 0<β<1. • Dealing with multi-graphs: • Unweighted boolean. • Weighted count #interactions b/w neighbors. • Shorter paths more affinity. • More paths more affinity. • K = βA + β2A2 + ... = I – (I – βA)-1, where A = adjacency matrix of graph. • Notice computational challenges. • Fact: For convergence, need β < [1/(largest eigenvalue of A)].
Hitting & Commute Time • Hitting & Commute Time: Perform a random walk, starting at u, follow neighbor of current node w.p. 1/deg(curr). • Adapted in the obvious way if edges are weighted. • Hu,v=expected time for walk to reach v. • Not symmetric. • Cu,v=Hu,v+Hv,u. • πx=stationary distribution weight of x. • Stationary normed versions: • Hu,vπv. • Hu,vπv+Hv,uπu. • Score, in all cases, negative of measure. (Why?)
Rooted PageRank • Rooted PR:πv in the following RW: At any step, • Jump to u with probability p. • Move to one of the neighbors of current node at random with probability 1-p. • Avoids some pitfalls of HT/CT by minimizing dependence on nodes far away from both u and v. • Computational challenge (of RPR, HT, CT) – similar to PageRank.
SimRank • SimRank: SR(u,u) = 1. When u != v, SR(u,v) = γ.∑wЄΓ(u)∑xЄΓ(v)SR(w,x)/|Γ(u)||Γ(v)|, where 0<γ<1. SR(u,v) = 0 when Γ(u) u Γ(v) = . RW interpretation: E[γi], where i=earliest time at which RWs started at u and v meet for the first time. Qn: Can the RW interpretation be leveraged to exploit PageRank-like techniques to compute SimRank?
Predictors Summary • Neighborhood-based (easy to compute) and Path-based (more accurate). • Output sorted in descending score order of pages. • Consider top-n pages among core nodes for evaluation. • A performance trick, which pays off with quality as well: use low rank approximationAkof adjacency matrix A. • Works very well since usually A is extremely sparse.
Meta-measures • score*(u,v) = |Γ(v) п Top-K(u)|, for some K. //how many of your neighbors are my nearest score-neighbors?// • score*(u,v) = ∑zЄΓ(v)пTop-K(u)score(u,z). //what is the total strength of my relationship with my nearest score-neighbors who are your neighbors?// unseen bigrams.
Conclusions • While many predictors significantly outperform random, their precision can use considerable improvement (e.g., < 20%). • Need for better prediction algorithms: • Perhaps more recent training data is more important? • Combining measures (e.g., using Bayesian classifier). • Leveraging node properties. • Need fast algorithms for approximating measures.
Recommended Papers • Link Prediction: & SN evolution • Al Hasan et al. Link prediction using supervised learning. SDM 2006 Workshop. • Ben Taskar et al. Link prediction in relational data. In Proceedings of Neural Information Processing Systems, 2004. • A. Popescul and L. Ungar. Statistical relational learning for link prediction. In Workshop on Learning Statistical Models from Relational Data at IJCAI, 2003. • Jure Leskovec, Lars Backstrom, Ravi Kumar, and Andrew Tomkins. Microscopic evolution of social networks. In Proceedings of the 14th ACM International Conference on Knowledge Discovery and Data Mining (KDD'08). • Jure Leskovec, Jon M. Kleinberg, and Christos Faloutsos. Graphs over time: densification laws, shrinking diameters and possible explanations. In Proceedings of the 11th ACM International Conference on Knowledge Discovery and Data Mining (KDD'05).
Recommended Papers • Computational Aspects: • Kurt C. Foster, Stephen Q. Muth, John J. Potterat, and Richard B. Rothenberg. A faster katz status score algorithm. Computational & Mathematical Organization Theory, 7(4):275285, 2001. • PurnamritaSarkar and Andrew W. Moore. A tractable approach to finding closest truncated-commute-time neighbors in large graphs. In Proceedings of the 23rd Conference on Uncertainty in Artificial Intelligence (UAI'07). • PurnamritaSarkar, Andrew W. Moore, and AmitPrakash. Fast incremental proximity search in large graphs. In Proceedings of the 25th International Conference on Machine Learning (ICML'08). • Glen Jeh and Jennifer Widom. Simrank: a measure of structural-context similarity. In Proceedings of the Eighth ACM International Conference on Knowledge Discovery and Data Mining (KDD'02). • Dmitry Lizorkin et al. Accuracy estimate and optimization techniques for SimRank computation. VLDB Journal, 2008. • Daniel Fogaras, et al. Towards scaling fully personalized pagerank: Algorithms, lower bounds, and experiments. Internet Mathematics, 2(3), 2005. • Peixiang Zhao, Jiawei Han, and Yizhou Sun. P-Rank: a comprehensive structural similarity measure over information networks. CIKM 2009. • P. Zhao, J. Han, and Y. Sun. P-rank: a comprehensive structural similarity measure over information networks. InCIKM, pages 553{562, 2009. • Pei Lee, Laks V.S. Lakshmanan and Jeffrey Xu Yu. On Top-k Structural Similarity Search.ICDE 2012.
Recommended Papers • Surveys & Some Theory: • LiseGetoorand Christopher P. Diehl. Link Mining: A Survey. SIGKDD Explorations 2005. • LinyuanLü and Tao Zhou. Link prediction in complex networks: A survey. PhysicaA: Statistical Mechanics and its Applications. Volume 390, Issue 6, 15 March 2011, Pages 1150–1170. • Mohammad Al Hasanand Mohammed J. Zaki. A Survey of Link Prediction in Social Networks. Social Network Data Analytics 2011, pp. 243-275. • PurnamritaSarkar, DeepayanChakrabarti, and Andrew W. Moore. 2011. Theoretical justification of popular link prediction heuristics. In Proceedings of the Twenty-Second international joint conference on Artificial Intelligence - Volume Volume Three (IJCAI'11), Toby Walsh (Ed.), Vol. Volume Three. AAAI Press 2722-2727.