260 likes | 523 Vues
This study, based on Kleinberg and Liben-Nowell's Link Prediction Problem for Social Networks, explores predicting future links using the current state of a network. It delves into network evolution models and the prediction of likely interactions for applications such as friend suggestion and terrorist network monitoring. The research emphasizes the importance of intrinsic network features for accurate predictions and introduces various predictors like Adamic/Adar, Katz Score, and SimRank. Computational challenges, including the Katz score convergence requirement, are discussed along with methods such as Hitting/Commute Time and Rooted PageRank to enhance link prediction accuracy. The study showcases how link prediction can be a valuable tool in social network analysis and its practical implications in different contexts.
 
                
                E N D
Link Prediction in Social Networks Laks V.S. Lakshmanan (based on Kleinberg and Liben-Nowell. The Link Prediction Problem for Social Networks. CIKM 2003.)
The Problem • Examples: Co-authorship graphs, scientists serving on common PC/editorial board, business leaders working on common board, etc. • Can current state of network be used to predict future links? • Note: factors extraneous to n/w (need to be) ignored (e.g., they met on a beach).
Motivation & Significance • Model for Network evolution. • Predict likely interactions, not explicitly observed, based on observed links – useful for terrorist network monitoring. • Link prediction is one instance of Social Network Analysis. • Application for “Friend” suggestion in online SN. Notice, this takes on a flavor of link recommendation.
Main Contributions • Features intrinsic to n/w can offer a good measure of how likely a future link is. • Can significantly outperform chance. • Need more sophisticated measures than direct ones (e.g., geodesic distances). • Note: Extra-network features can significantly improve prediction accuracy: e.g., keywords describing interests of each scientist, or keywords extracted from their papers/abstracts/titles. Not considered here. • See Al Hasan et al. SDM workshop 2006. http://www.cs.rpi.edu/~zaki/PaperDir/LINK06.pdf • Ben Taskar et al. Link prediction in relational data. In Proceedings of Neural Information Processing Systems, 2004. • A. Popescul and L. Ungar. Statistical relational learning for link prediction. In Workshop on Learning Statistical Models from Relational Data at IJCAI, 2003.
Link Prediction Problem Formalized • G[t0,t0’] = the multi-graph of social interactions between actors (users) during the training interval [t0,t0’]. • Separate edge for each interaction (paper, phone call, email, ...) with timestamp. • Predict edges in test interval [t1,t1’], where t1>t0’. • E_old = edges present in G[t0,t0’]. • E_new = edges in G[t1,t1’] but not in G[t0,t0’]. • Refinement: Restrict to those nodes adjacent to >=κ edges in G[t0,t0’] and to >=κ’ edges in G[t1,t1’]. Call them the corenodes.
Example • Say κ = κ’ = 2. • Only evaluate how well we do on {John, Ram, Asif, Mary}. • E*_new = {(J,M), (R,M)}, when restricted to the “core” nodes above. • Evaluation: among the top-most likely edges predicted, how well we do on precision and recall. • Precision = analog of soundness. • Recall = analog of completeness. John Mary Hui Asif Ram Nita John Mary Hui Asif Ram
Some simple notions • Triadic closure [Easley & Kleinberg 2010]: if x and y have a common friend, then the likelihood of x, y being friends increases. • Clustering coefficient of node x = fraction of its neighbors that are connected by a link. • Bearman and Moody’s study: teenage girls who have a low clustering coefficient in their network of friends are significantly more likely to contemplate suicide than those whose clustering coefficient is high! (Ouch!)
Various predictors • Notation: Γ(u) = neighbors of u in G[t0,t0’]. • score(u,v) = score that indicates likelihood of (u,v) being in E*_new. • Neighborhood-based: • Shortest distance: (negated) length of shortest path (geodesic). • Common neighbors: Γ(u) пΓ(v). • Jaccard coefficient: |Γ(u)пΓ(v)|/|Γ(u)υΓ(v)|.
More predictors • Adamic/Adar: ∑zЄΓ(u)пΓ(v) 1/log|Γ(z)|. • Preferential attachment: |Γ(u)|x|Γ(v)|. • Based on ensemble of paths: • Katz, Hitting/Commute Time, Rooted PageRank, SimRank.
The Katz Score • Katz: ∑iβi|pathsi(u,v)|, where 0<β<1. • Dealing with multi-graphs: • Unweighted boolean. • Weighted  count #interactions b/w neighbors. • Shorter paths  more affinity. • More paths  more affinity. • K = βA + β2A2 + ... = I – (I – βA)-1, where A = adjacency matrix of graph. • Notice computational challenges. • Fact: For convergence, need β < [1/(largest eigenvalue of A)].
Hitting & Commute Time • Hitting & Commute Time: Perform a random walk, starting at u, follow neighbor of current node w.p. 1/deg(curr). • Adapted in the obvious way if edges are weighted. • Hu,v=expected time for walk to reach v. • Not symmetric. • Cu,v=Hu,v+Hv,u. • πx=stationary distribution weight of x. • Stationary normed versions: • Hu,vπv. • Hu,vπv+Hv,uπu. • Score, in all cases, negative of measure. (Why?)
Rooted PageRank • Rooted PR:πv in the following RW: At any step, • Jump to u with probability p. • Move to one of the neighbors of current node at random with probability 1-p. • Avoids some pitfalls of HT/CT by minimizing dependence on nodes far away from both u and v. • Computational challenge (of RPR, HT, CT) – similar to PageRank.
SimRank • SimRank: SR(u,u) = 1. When u != v, SR(u,v) = γ.∑wЄΓ(u)∑xЄΓ(v)SR(w,x)/|Γ(u)||Γ(v)|, where 0<γ<1. SR(u,v) = 0 when Γ(u) u Γ(v) = . RW interpretation: E[γi], where i=earliest time at which RWs started at u and v meet for the first time. Qn: Can the RW interpretation be leveraged to exploit PageRank-like techniques to compute SimRank?
Predictors Summary • Neighborhood-based (easy to compute) and Path-based (more accurate). • Output sorted in descending score order of pages. • Consider top-n pages among core nodes for evaluation. • A performance trick, which pays off with quality as well: use low rank approximationAkof adjacency matrix A. • Works very well since usually A is extremely sparse.
Meta-measures • score*(u,v) = |Γ(v) п Top-K(u)|, for some K. //how many of your neighbors are my nearest score-neighbors?// • score*(u,v) = ∑zЄΓ(v)пTop-K(u)score(u,z). //what is the total strength of my relationship with my nearest score-neighbors who are your neighbors?//  unseen bigrams.
Conclusions • While many predictors significantly outperform random, their precision can use considerable improvement (e.g., < 20%). • Need for better prediction algorithms: • Perhaps more recent training data is more important? • Combining measures (e.g., using Bayesian classifier). • Leveraging node properties. • Need fast algorithms for approximating measures.
Recommended Papers • Link Prediction: & SN evolution • Al Hasan et al. Link prediction using supervised learning. SDM 2006 Workshop. • Ben Taskar et al. Link prediction in relational data. In Proceedings of Neural Information Processing Systems, 2004. • A. Popescul and L. Ungar. Statistical relational learning for link prediction. In Workshop on Learning Statistical Models from Relational Data at IJCAI, 2003. • Jure Leskovec, Lars Backstrom, Ravi Kumar, and Andrew Tomkins. Microscopic evolution of social networks. In Proceedings of the 14th ACM International Conference on Knowledge Discovery and Data Mining (KDD'08). • Jure Leskovec, Jon M. Kleinberg, and Christos Faloutsos. Graphs over time: densification laws, shrinking diameters and possible explanations. In Proceedings of the 11th ACM International Conference on Knowledge Discovery and Data Mining (KDD'05).
Recommended Papers • Computational Aspects: • Kurt C. Foster, Stephen Q. Muth, John J. Potterat, and Richard B. Rothenberg. A faster katz status score algorithm. Computational & Mathematical Organization Theory, 7(4):275285, 2001. • PurnamritaSarkar and Andrew W. Moore. A tractable approach to finding closest truncated-commute-time neighbors in large graphs. In Proceedings of the 23rd Conference on Uncertainty in Artificial Intelligence (UAI'07). • PurnamritaSarkar, Andrew W. Moore, and AmitPrakash. Fast incremental proximity search in large graphs. In Proceedings of the 25th International Conference on Machine Learning (ICML'08). • Glen Jeh and Jennifer Widom. Simrank: a measure of structural-context similarity. In Proceedings of the Eighth ACM International Conference on Knowledge Discovery and Data Mining (KDD'02). • Dmitry Lizorkin et al. Accuracy estimate and optimization techniques for SimRank computation. VLDB Journal, 2008. • Daniel Fogaras, et al. Towards scaling fully personalized pagerank: Algorithms, lower bounds, and experiments. Internet Mathematics, 2(3), 2005. • Peixiang Zhao, Jiawei Han, and Yizhou Sun. P-Rank: a comprehensive structural similarity measure over information networks. CIKM 2009. • P. Zhao, J. Han, and Y. Sun. P-rank: a comprehensive structural similarity measure over information networks. InCIKM, pages 553{562, 2009. • Pei Lee, Laks V.S. Lakshmanan and Jeffrey Xu Yu. On Top-k Structural Similarity Search.ICDE 2012.
Recommended Papers • Surveys & Some Theory: • LiseGetoorand Christopher P. Diehl. Link Mining: A Survey. SIGKDD Explorations 2005. • LinyuanLü and Tao Zhou. Link prediction in complex networks: A survey. PhysicaA: Statistical Mechanics and its Applications. Volume 390, Issue 6, 15 March 2011, Pages 1150–1170. • Mohammad Al Hasanand Mohammed J. Zaki. A Survey of Link Prediction in Social Networks. Social Network Data Analytics 2011, pp. 243-275. • PurnamritaSarkar, DeepayanChakrabarti, and Andrew W. Moore. 2011. Theoretical justification of popular link prediction heuristics. In Proceedings of the Twenty-Second international joint conference on Artificial Intelligence - Volume Volume Three (IJCAI'11), Toby Walsh (Ed.), Vol. Volume Three. AAAI Press 2722-2727.