StaticGreedy: Solving the Scalability-Accuracy Dilemma in Influence Maximization. Suqi Cheng Research Center of Web Data Sciences & Engineering Institute of Computing Technology, Chinese Academy of Sciences chengsuqi@ict.ac.cn,chengsuqi@gmail.com http://www.nascgroup.org/~chengsuqi
 
                
StaticGreedy: Solving the Scalability-Accuracy Dilemma in Influence Maximization Suqi Cheng Research Center of Web Data Sciences & Engineering Institute of Computing Technology, Chinese Academy of Sciences chengsuqi@ict.ac.cn,chengsuqi@gmail.com http://www.nascgroup.org/~chengsuqi Authors: Suqi Cheng, Huawei Shen, Junming Huang, Guoqing Zhang, Xueqi Cheng
Outline • Background • Preliminaries • Motivation • StaticGreedy algorithm • Experiments
Information Cascade • An action or idea is adopted by one person after another due to social influence • It cascades through social relationships • Main applications • Word-of-mouth marketing • Outbreak detection • Popularity prediction [Figure: social network]
Word-of-Mouth Marketing • Promote a product by seeding a few users; users who adopt the product recommend it to others • Advantages: efficient; cost-effective • Key question: how to select the optimal seed users? [Figure: a company gives free products or discounts to seed users, whose influence activates follow-up users]
Influence Maximization for Viral Marketing • Objective function • Influence spread I(S): expected number of activated (influenced/adopted) nodes • Maximize I(S) • Input: • A social influence graph G=(V, E) • An information cascade model • An integer k, |S| ≤ k • Output: A seed set S
Information Cascade Model • Independent cascade (IC) model • each edge (u, v) has a propagation probability p(u, v) • each newly activated node u independently activates its out-neighbor v with probability p(u, v) • a discrete time model • Influence spread estimation on IC model • Monte Carlo simulation • Heuristic methods [Figure: social influence graph with a propagation probability on each edge [Leskovec, 2008]]
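The IC model and its Monte Carlo estimate can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the adjacency format (`node -> list of (neighbor, probability)`), the function names `simulate_ic` and `estimate_spread`, and the toy graph are all assumptions made for the example.

```python
import random

def simulate_ic(graph, seeds, rng):
    """Run one independent cascade and return the set of activated nodes."""
    active = set(seeds)
    frontier = list(seeds)
    while frontier:
        next_frontier = []
        for u in frontier:
            # Each newly activated u gets one chance to activate each out-neighbor v.
            for v, p in graph.get(u, []):
                if v not in active and rng.random() < p:
                    active.add(v)
                    next_frontier.append(v)
        frontier = next_frontier
    return active

def estimate_spread(graph, seeds, runs, seed=0):
    """Average the cascade size over `runs` independent simulations."""
    rng = random.Random(seed)
    return sum(len(simulate_ic(graph, seeds, rng)) for _ in range(runs)) / runs

# Hypothetical toy graph: probabilities of 1.0 and 0.0 make the outcome deterministic.
toy_graph = {"a": [("b", 1.0), ("c", 0.0)], "b": [("d", 1.0)], "c": [], "d": []}
print(estimate_spread(toy_graph, ["a"], runs=10))
```

With real propagation probabilities the estimate is noisy, and its variance shrinks as `runs` grows.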
Difficulties in Influence Maximization Difficulty 1: The influence maximization problem is NP-hard [Kempe, KDD’03]. Existing solutions • Heuristics (efficient but inaccurate) • Degree • PageRank • Betweenness • Greedy approximation algorithm [Kempe, KDD’03] (accurate but inefficient) • (1-1/e-ε)-approximation • iteratively selects the node with the largest marginal influence spread • guaranteed by the submodularity and monotonicity properties of the influence spread function
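The greedy selection loop can be sketched as follows. The `spread` argument is a hypothetical oracle standing in for an influence spread estimator; here it is replaced by a simple set-coverage function (which is also monotone and submodular) so that the example is deterministic. The names `greedy_seeds`, `coverage`, and `coverage_spread` are assumptions for illustration.

```python
def greedy_seeds(nodes, spread, k):
    """Pick up to k seeds, each time adding the node with the largest marginal gain."""
    seeds = []
    for _ in range(min(k, len(nodes))):
        base = spread(seeds)
        best = max(
            (v for v in nodes if v not in seeds),
            key=lambda v: spread(seeds + [v]) - base,
        )
        seeds.append(best)
    return seeds

# Hypothetical stand-in for I(S): the number of nodes covered by the chosen seeds.
coverage = {
    "a": {"a", "b", "c"},
    "b": {"b", "c"},
    "c": {"c"},
    "d": {"d", "e"},
    "e": {"e"},
}

def coverage_spread(seeds):
    covered = set()
    for s in seeds:
        covered |= coverage[s]
    return len(covered)

print(greedy_seeds(list(coverage), coverage_spread, 2))
```

Each iteration calls the oracle once per candidate node, which is why the cost of a single spread estimate dominates the running time.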
Difficulties in Influence Maximization Difficulty 2: Computing the influence spread exactly is #P-hard [Chen, KDD’10]. Existing solutions • Monte Carlo simulation (accurate but time-consuming) • CELF optimization [Leskovec, KDD’07] • NewGreedy [Chen, KDD’09] • CELF++ optimization [Goyal, WWW’11] • Heuristic methods (efficient but inaccurate) • DegreeDiscount [Chen, KDD’09] • CGA [Wang, KDD’10] • PMIA [Chen, KDD’10] • IRIE [Jung, ICDM’12] A scalability-accuracy dilemma!
Our work • Objective: propose an influence maximization algorithm that resolves the scalability-accuracy dilemma
Preliminaries-1 • Social influence graph: G=(V, E), n=|V|, m=|E| • Influence spread: I(S) • Marginal influence spread: M(v|S) = I(S ∪ {v}) − I(S) • Properties of I(S) under the independent cascade model • submodularity: I(S ∪ {v}) − I(S) ≥ I(T ∪ {v}) − I(T) for all v ∈ V and S ⊆ T ⊆ V • monotonicity: I(S ∪ {v}) ≥ I(S) • These properties guarantee the greedy approximation algorithm • iteratively selects the node with the largest marginal influence spread • provides a (1-1/e-ε)-approximation, where ε comes from influence spread estimation
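Preliminaries-2 • Monte Carlo simulation for influence spread estimation • approximates the true value of influence spread by averaging over random realizations (snapshots) of the influence graph [Figure: simulating cascades is equivalent to sampling snapshots and counting the nodes reachable from the seeds]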
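The snapshot view of Monte Carlo estimation can be sketched as below: keep each edge with its propagation probability, then count the nodes reachable from the seed set. The function names and the adjacency format (`node -> list of (neighbor, probability)`) are assumptions made for this illustration.

```python
import random
from collections import deque

def sample_snapshot(graph, rng):
    """Keep each edge (u, v) independently with probability p(u, v)."""
    return {u: [v for v, p in nbrs if rng.random() < p] for u, nbrs in graph.items()}

def reachable(snapshot, sources):
    """Breadth-first search: all nodes reachable from `sources` in one snapshot."""
    seen = set(sources)
    queue = deque(sources)
    while queue:
        u = queue.popleft()
        for v in snapshot.get(u, []):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen

def estimate_spread_snapshots(graph, seeds, R, seed=0):
    """Average the reachable-set size over R freshly sampled snapshots."""
    rng = random.Random(seed)
    total = sum(len(reachable(sample_snapshot(graph, rng), seeds)) for _ in range(R))
    return total / R

# Hypothetical toy graph with deterministic edges (probability 1.0 or 0.0).
toy_graph = {"a": [("b", 1.0), ("c", 0.0)], "b": [("d", 1.0)], "c": [], "d": []}
print(estimate_spread_snapshots(toy_graph, ["a"], R=10))
```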
Motivation • In existing greedy algorithms • submodularity and monotonicity of the estimated influence spread function are not guaranteed • cause: different Monte Carlo simulation results are used across different influence spread estimations • consequence: a very large value of R is required, e.g. R=20000 (R: number of Monte Carlo simulations per estimation) [Figure: an influence graph with snapshot 1 used in iteration 1 and snapshot 2 used in iteration 2; submodularity is broken]
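When every estimate reuses one fixed set of snapshots, the estimated spread is an average of reachability counts, so both properties hold exactly. The brute-force check below illustrates this on a small hypothetical graph; the graph, R=20, and all function names are assumptions chosen for the example, and the exhaustive check is only feasible for a handful of nodes.

```python
import random
from itertools import combinations

def sample_snapshot(graph, rng):
    """Keep each edge (u, v) independently with probability p(u, v)."""
    return {u: [v for v, p in nbrs if rng.random() < p] for u, nbrs in graph.items()}

def reachable(snapshot, sources):
    """Depth-first search from all sources in one snapshot."""
    seen, stack = set(sources), list(sources)
    while stack:
        u = stack.pop()
        for v in snapshot.get(u, []):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

def static_spread(snapshots, seeds):
    """Estimated spread using the same snapshots for every call."""
    return sum(len(reachable(s, seeds)) for s in snapshots) / len(snapshots)

def check_properties(nodes, spread, tol=1e-9):
    """Exhaustively test monotonicity and submodularity for all S ⊆ T and all v."""
    subsets = [frozenset(c) for r in range(len(nodes) + 1) for c in combinations(nodes, r)]
    for S in subsets:
        for T in subsets:
            if not S <= T:
                continue
            for v in nodes:
                gain_S = spread(S | {v}) - spread(S)
                gain_T = spread(T | {v}) - spread(T)
                if gain_S < -tol or gain_S < gain_T - tol:
                    return False
    return True

# Hypothetical 5-node graph with probability 0.5 on every edge.
graph = {"a": [("b", 0.5), ("c", 0.5)], "b": [("d", 0.5)],
         "c": [("d", 0.5)], "d": [("e", 0.5)], "e": []}
rng = random.Random(1)
snapshots = [sample_snapshot(graph, rng) for _ in range(20)]
holds = check_properties(list(graph), lambda S: static_spread(snapshots, S))
print(holds)
```

If `static_spread` instead drew fresh snapshots on each call, the same check could fail for small R, which is the risk described on this slide.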
StaticGreedy algorithm • Core idea: always use the same snapshots for influence spread estimation • the estimated influence spread function is then submodular and monotone • only a small value of R is required, e.g. R=100 • Part 1: Generate R static snapshots • Part 2: Greedy selection
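The two parts can be sketched together as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: it recomputes reachability from scratch for every candidate, uses a hypothetical adjacency format (`node -> list of (neighbor, probability)`, with every node present as a key), and the function names are invented for the example.

```python
import random

def sample_snapshot(graph, rng):
    """Keep each edge (u, v) independently with probability p(u, v)."""
    return {u: [v for v, p in nbrs if rng.random() < p] for u, nbrs in graph.items()}

def reachable(snapshot, sources):
    """Depth-first search from all sources in one snapshot."""
    seen, stack = set(sources), list(sources)
    while stack:
        u = stack.pop()
        for v in snapshot.get(u, []):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

def static_greedy(graph, k, R, seed=0):
    rng = random.Random(seed)
    # Part 1: generate R static snapshots once.
    snapshots = [sample_snapshot(graph, rng) for _ in range(R)]

    def spread(seeds):
        # Every estimate reuses the same snapshots.
        return sum(len(reachable(s, seeds)) for s in snapshots) / R

    # Part 2: greedy selection on the fixed snapshots.
    seeds = []
    for _ in range(min(k, len(graph))):
        base = spread(seeds)
        best = max(
            (v for v in graph if v not in seeds),
            key=lambda v: spread(seeds + [v]) - base,
        )
        seeds.append(best)
    return seeds, spread(seeds)

# Hypothetical toy graph with deterministic edges.
toy_graph = {"a": [("b", 1.0), ("c", 1.0)], "b": [], "c": [], "d": [("e", 1.0)], "e": []}
print(static_greedy(toy_graph, k=2, R=10))
```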
Performance analysis: Convergence rate • provides a (1-1/e-ε)-approximation with a small value of R • Setup: NetHEPT, a benchmark network; seed set size = 50 • uniform independent cascade (UIC) model: p(u, v) = p = 0.01 • weighted independent cascade (WIC) model: p(u, v) = 1/(# of in-neighbors of v) [Figure: d_{R,k} versus log R]
Performance analysis: Scalability • R is significantly reduced • Running time is significantly reduced [Figure: two panels, "Minimal R required" (log Rmin versus seed set size) and "Running time" (log running time in seconds versus seed set size), annotated with reductions of ≈10² times and ≈10³ times]
Performance analysis: Complexity • n: number of nodes in the social influence graph • m: number of edges in the social influence graph • m’: expected number of edges in a snapshot
Speed up StaticGreedy • A dynamic update strategy • calculates the marginal gain in an efficient incremental manner • at each step t, for each snapshot: M(v) ← M(v) − |R(v) ∩ R(v_t*)|, R(v) ← R(v) − R(v) ∩ R(v_t*), where R(v) is the set of nodes reachable from v in the snapshot and v_t* is the node selected at step t • trades space for time [Figure: example snapshot, initial state] M(v1)=4, M(v2)=3, M(v3)=2, M(v4)=1, M(v5)=1, M(v6)=1, M(v7)=2, M(v8)=1
Speed up StaticGreedy (example, continued) • A dynamic update strategy • calculates the marginal gain in an efficient incremental manner • at each step t, for each snapshot: M(v) ← M(v) − |R(v) ∩ R(v_t*)|, R(v) ← R(v) − R(v) ∩ R(v_t*), where R(v) is the set of nodes reachable from v in the snapshot • trades space for time [Figure: example snapshot after selecting v* = v1; marginal gains are updated directly] M(v1): 4 → 0, M(v2): 3 → 2, M(v3): 2 → 0, M(v4): 1 → 0, M(v5): 1 → 1, M(v6): 1 → 0, M(v7): 2 → 2, M(v8): 1 → 1
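The update rule can be traced on a single snapshot. The edge set below is an assumption: the edges in the slide's figure are not recoverable, so these were chosen only because they reproduce the M values listed on the slides. The function names are likewise hypothetical, and the loop visits every node for simplicity rather than only the nodes that can reach the newly covered ones.

```python
def reachable_sets(snapshot):
    """Compute R(v) for every node v; each node reaches itself."""
    result = {}
    for v in snapshot:
        seen, stack = {v}, [v]
        while stack:
            u = stack.pop()
            for w in snapshot[u]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        result[v] = seen
    return result

def select_and_update(R_sets, M, chosen):
    """Apply M(v) <- M(v) - |R(v) ∩ R(v*)| and R(v) <- R(v) - R(v) ∩ R(v*)."""
    covered = set(R_sets[chosen])  # copy, because R_sets[chosen] is modified below
    for v in R_sets:
        overlap = R_sets[v] & covered
        M[v] -= len(overlap)
        R_sets[v] -= overlap

# Assumed snapshot edges, picked to match the marginal gains on the slides.
snapshot = {
    "v1": ["v3", "v4"], "v2": ["v4", "v5"], "v3": ["v6"], "v4": [],
    "v5": [], "v6": [], "v7": ["v8"], "v8": [],
}
R_sets = reachable_sets(snapshot)
M = {v: len(r) for v, r in R_sets.items()}
initial = dict(M)
select_and_update(R_sets, M, "v1")
print(initial)
print(M)
```

Storing R(v) for every node and snapshot is the space cost that buys the faster marginal gain updates.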
Experiments: setup • Algorithms: • Our algorithms: StaticGreedyCELF, StaticGreedyDU • Baselines: CELFGreedy, SP1M, PMIA, Degree, DegreeDiscount • Tested datasets • Independent cascade models • uniform independent cascade (UIC) model: p(u, v) = p = 0.01 • weighted independent cascade (WIC) model: p(u, v) = 1/(# of in-neighbors of v) • Metrics: influence spread, running time
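The two probability assignments can be written out directly. This sketch assumes the graph is given as a list of distinct directed edges (u, v); the function names and the three-edge example are hypothetical.

```python
def uic_probabilities(edges, p=0.01):
    """UIC model: every edge gets the same propagation probability p."""
    return {(u, v): p for u, v in edges}

def wic_probabilities(edges):
    """WIC model: p(u, v) = 1 / (number of in-neighbors of v)."""
    in_degree = {}
    for _, v in edges:
        in_degree[v] = in_degree.get(v, 0) + 1
    return {(u, v): 1.0 / in_degree[v] for u, v in edges}

# Hypothetical example: node c has two in-neighbors, node b has one.
edges = [("a", "c"), ("b", "c"), ("a", "b")]
print(uic_probabilities(edges))
print(wic_probabilities(edges))
```

Under WIC the incoming probabilities of each node sum to 1, so high in-degree nodes are harder for any single neighbor to activate.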
Experiments: influence spread • StaticGreedy achieves better accuracy than the heuristic algorithms [Figure: influence spread on NetPHY and DBLP, each under the UIC model and the WIC model]
Experiments: running time • StaticGreedy runs >10³ times faster than CELFGreedy • StaticGreedy has comparable scalability to state-of-the-art heuristics • StaticGreedyDU always runs faster than StaticGreedyCELF [Figure: log running time (sec) under the UIC model and the WIC model]
Conclusion • Root cause of the inefficiency of existing greedy algorithms • submodularity and monotonicity are not guaranteed • caused by using different Monte Carlo simulations across different estimations • a very large value of R is required → guaranteed accuracy + inefficiency • StaticGreedy algorithm • guaranteed submodularity and monotonicity • uses the same Monte Carlo simulations across different estimations • only a small value of R is required → guaranteed accuracy + high scalability • runs >10³ times faster than conventional greedy algorithms • A dynamic update strategy to speed up StaticGreedy • about 10 times faster
Thank you! Q & A