Link Prediction in Social Networks: Methods and Applications

Link Prediction in Social Networks Laks V.S. Lakshmanan (based on Kleinberg and Liben-Nowell. The Link Prediction Problem for Social Networks. CIKM 2003.)

The Problem • Examples: Co-authorship graphs, scientists serving on common PC/editorial board, business leaders working on common board, etc. • Can current state of network be used to predict future links? • Note: factors extraneous to n/w (need to be) ignored (e.g., they met on a beach).

Motivation & Significance • Model for Network evolution. • Predict likely interactions, not explicitly observed, based on observed links – useful for terrorist network monitoring. • Link prediction is one instance of Social Network Analysis. • Application for “Friend” suggestion in online SN. Notice, this takes on a flavor of link recommendation.

Main Contributions • Features intrinsic to n/w can offer a good measure of how likely a future link is. • Can significantly outperform chance. • Need more sophisticated measures than direct ones (e.g., geodesic distances). • Note: Extra-network features can significantly improve prediction accuracy: e.g., keywords describing interests of each scientist, or keywords extracted from their papers/abstracts/titles. Not considered here. • See Al Hasan et al. SDM workshop 2006. http://www.cs.rpi.edu/~zaki/PaperDir/LINK06.pdf • Ben Taskar et al. Link prediction in relational data. In Proceedings of Neural Information Processing Systems, 2004. • A. Popescul and L. Ungar. Statistical relational learning for link prediction. In Workshop on Learning Statistical Models from Relational Data at IJCAI, 2003.

Link Prediction Problem Formalized • G[t0,t0’] = the multi-graph of social interactions between actors (users) during the training interval [t0,t0’]. • Separate edge for each interaction (paper, phone call, email, ...) with timestamp. • Predict edges in test interval [t1,t1’], where t1>t0’. • E_old = edges present in G[t0,t0’]. • E_new = edges in G[t1,t1’] but not in G[t0,t0’]. • Refinement: Restrict to those nodes adjacent to >=κ edges in G[t0,t0’] and to >=κ’ edges in G[t1,t1’]. Call them the corenodes.

Example • Say κ = κ’ = 2. • Only evaluate how well we do on {John, Ram, Asif, Mary}. • E*_new = {(J,M), (R,M)}, when restricted to the “core” nodes above. • Evaluation: among the top-most likely edges predicted, how well we do on precision and recall. • Precision = analog of soundness. • Recall = analog of completeness. John Mary Hui Asif Ram Nita John Mary Hui Asif Ram

Some simple notions • Triadic closure [Easley & Kleinberg 2010]: if x and y have a common friend, then the likelihood of x, y being friends increases. • Clustering coefficient of node x = fraction of its neighbors that are connected by a link. • Bearman and Moody’s study: teenage girls who have a low clustering coefficient in their network of friends are significantly more likely to contemplate suicide than those whose clustering coefficient is high! (Ouch!)

Various predictors • Notation: Γ(u) = neighbors of u in G[t0,t0’]. • score(u,v) = score that indicates likelihood of (u,v) being in E*_new. • Neighborhood-based: • Shortest distance: (negated) length of shortest path (geodesic). • Common neighbors: Γ(u) пΓ(v). • Jaccard coefficient: |Γ(u)пΓ(v)|/|Γ(u)υΓ(v)|.

More predictors • Adamic/Adar: ∑zЄΓ(u)пΓ(v) 1/log|Γ(z)|. • Preferential attachment: |Γ(u)|x|Γ(v)|. • Based on ensemble of paths: • Katz, Hitting/Commute Time, Rooted PageRank, SimRank.

The Katz Score • Katz: ∑iβi|pathsi(u,v)|, where 0<β<1. • Dealing with multi-graphs: • Unweighted boolean. • Weighted  count #interactions b/w neighbors. • Shorter paths  more affinity. • More paths  more affinity. • K = βA + β2A2 + ... = I – (I – βA)-1, where A = adjacency matrix of graph. • Notice computational challenges. • Fact: For convergence, need β < [1/(largest eigenvalue of A)].

Hitting & Commute Time • Hitting & Commute Time: Perform a random walk, starting at u, follow neighbor of current node w.p. 1/deg(curr). • Adapted in the obvious way if edges are weighted. • Hu,v=expected time for walk to reach v. • Not symmetric. • Cu,v=Hu,v+Hv,u. • πx=stationary distribution weight of x. • Stationary normed versions: • Hu,vπv. • Hu,vπv+Hv,uπu. • Score, in all cases, negative of measure. (Why?)

Rooted PageRank • Rooted PR:πv in the following RW: At any step, • Jump to u with probability p. • Move to one of the neighbors of current node at random with probability 1-p. • Avoids some pitfalls of HT/CT by minimizing dependence on nodes far away from both u and v. • Computational challenge (of RPR, HT, CT) – similar to PageRank.

SimRank • SimRank: SR(u,u) = 1. When u != v, SR(u,v) = γ.∑wЄΓ(u)∑xЄΓ(v)SR(w,x)/|Γ(u)||Γ(v)|, where 0<γ<1. SR(u,v) = 0 when Γ(u) u Γ(v) = . RW interpretation: E[γi], where i=earliest time at which RWs started at u and v meet for the first time. Qn: Can the RW interpretation be leveraged to exploit PageRank-like techniques to compute SimRank?

Predictors Summary • Neighborhood-based (easy to compute) and Path-based (more accurate). • Output sorted in descending score order of pages. • Consider top-n pages among core nodes for evaluation. • A performance trick, which pays off with quality as well: use low rank approximationAkof adjacency matrix A. • Works very well since usually A is extremely sparse.

Meta-measures • score*(u,v) = |Γ(v) п Top-K(u)|, for some K. //how many of your neighbors are my nearest score-neighbors?// • score*(u,v) = ∑zЄΓ(v)пTop-K(u)score(u,z). //what is the total strength of my relationship with my nearest score-neighbors who are your neighbors?//  unseen bigrams.

Conclusions • While many predictors significantly outperform random, their precision can use considerable improvement (e.g., < 20%). • Need for better prediction algorithms: • Perhaps more recent training data is more important? • Combining measures (e.g., using Bayesian classifier). • Leveraging node properties. • Need fast algorithms for approximating measures.

Recommended Papers • Link Prediction: & SN evolution • Al Hasan et al. Link prediction using supervised learning. SDM 2006 Workshop. • Ben Taskar et al. Link prediction in relational data. In Proceedings of Neural Information Processing Systems, 2004. • A. Popescul and L. Ungar. Statistical relational learning for link prediction. In Workshop on Learning Statistical Models from Relational Data at IJCAI, 2003. • Jure Leskovec, Lars Backstrom, Ravi Kumar, and Andrew Tomkins. Microscopic evolution of social networks. In Proceedings of the 14th ACM International Conference on Knowledge Discovery and Data Mining (KDD'08). • Jure Leskovec, Jon M. Kleinberg, and Christos Faloutsos. Graphs over time: densification laws, shrinking diameters and possible explanations. In Proceedings of the 11th ACM International Conference on Knowledge Discovery and Data Mining (KDD'05).

Recommended Papers • Computational Aspects: • Kurt C. Foster, Stephen Q. Muth, John J. Potterat, and Richard B. Rothenberg. A faster katz status score algorithm. Computational & Mathematical Organization Theory, 7(4):275285, 2001. • PurnamritaSarkar and Andrew W. Moore. A tractable approach to finding closest truncated-commute-time neighbors in large graphs. In Proceedings of the 23rd Conference on Uncertainty in Artificial Intelligence (UAI'07). • PurnamritaSarkar, Andrew W. Moore, and AmitPrakash. Fast incremental proximity search in large graphs. In Proceedings of the 25th International Conference on Machine Learning (ICML'08). • Glen Jeh and Jennifer Widom. Simrank: a measure of structural-context similarity. In Proceedings of the Eighth ACM International Conference on Knowledge Discovery and Data Mining (KDD'02). • Dmitry Lizorkin et al. Accuracy estimate and optimization techniques for SimRank computation. VLDB Journal, 2008. • Daniel Fogaras, et al. Towards scaling fully personalized pagerank: Algorithms, lower bounds, and experiments. Internet Mathematics, 2(3), 2005. • Peixiang Zhao, Jiawei Han, and Yizhou Sun. P-Rank: a comprehensive structural similarity measure over information networks. CIKM 2009. • P. Zhao, J. Han, and Y. Sun. P-rank: a comprehensive structural similarity measure over information networks. InCIKM, pages 553{562, 2009. • Pei Lee, Laks V.S. Lakshmanan and Jeffrey Xu Yu. On Top-k Structural Similarity Search.ICDE 2012.

Recommended Papers • Surveys & Some Theory: • LiseGetoorand Christopher P. Diehl. Link Mining: A Survey. SIGKDD Explorations 2005. • LinyuanLü and Tao Zhou. Link prediction in complex networks: A survey. PhysicaA: Statistical Mechanics and its Applications. Volume 390, Issue 6, 15 March 2011, Pages 1150–1170. • Mohammad Al Hasanand Mohammed J. Zaki. A Survey of Link Prediction in Social Networks. Social Network Data Analytics 2011, pp. 243-275. • PurnamritaSarkar, DeepayanChakrabarti, and Andrew W. Moore. 2011. Theoretical justification of popular link prediction heuristics. In Proceedings of the Twenty-Second international joint conference on Artificial Intelligence - Volume Volume Three (IJCAI'11), Toby Walsh (Ed.), Vol. Volume Three. AAAI Press 2722-2727.

Link Prediction in Social Networks: Methods and Applications

Link Prediction in Social Networks: Methods and Applications

Presentation Transcript

Direct Link Networks

CoupledLP : Link Prediction in Coupled Networks

Negative Link Prediction and Its Applications in Online Political Networks

Direct Link Networks

Nonparametric Link Prediction in Dynamic Graphs

Link Prediction in Co-Authorship Network

Link Recommendation In P2P Social Networks

Behavior Prediction and Anomaly Detection in Large-Scale Social Networks

A glimpse on social influence and link prediction in OSNs

Contact Prediction, Routing and Fast Information Spreading in Social Networks

Prediction Networks

Link Prediction and Recommendation across Multiple Heterogeneous Networks

On Link Privacy in Randomizing Social Networks

Using Transactional Information to Predict Link Strength in Online Social Networks

Direct Link Networks

I LINK :Search and Routing in Social Networks

Diffusion in (Social) networks

Direct Link Networks

Transfer Learning for Link Prediction

Neural Networks in Social Networks