Hidden Hazards: Finding Missing Nodes in Large Graph Epidemics

Shashidhar Sundareisan (Virginia Tech) • Jilles Vreeken (Max Planck Institute) • B. Aditya Prakash (Virginia Tech) • SDM, Vancouver, May 1, 2015

Contagions • Social collaboration • Information Diffusion • Viral Marketing • Epidemiology and Public Health • Cyber Security • Human mobility • Games and Virtual Worlds • Ecology • Localized effects: riots… Sundareisan, Vreeken, Prakash 2015

Virus Propagation • Diseases spread over contact networks • Susceptible-Infected (SI) model: an infected node infects each susceptible neighbor with probability β per time step, and infected nodes stay infected • Figure: CDC data, visualization of the first 35 tuberculosis (TB) patients and their 1039 contacts [AJPH 2007]
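
To make the SI model concrete, here is a minimal discrete-time simulation sketch. It assumes the graph is given as an adjacency dictionary and that every infected node independently tries to infect each susceptible neighbor with probability β in every step; the function name `simulate_si` and its parameters are illustrative, not taken from the talk.

```python
import random

def simulate_si(adj, seeds, beta, steps, rng=None):
    """Run a discrete-time SI process and return (infected set, infection times).

    adj   : dict mapping node -> iterable of neighbors (assumed input format)
    seeds : nodes infected at time 0
    beta  : per-step, per-edge infection probability
    """
    rng = rng or random.Random(0)
    infected = set(seeds)
    infection_time = {s: 0 for s in seeds}
    for t in range(1, steps + 1):
        newly = set()
        for u in infected:
            for v in adj[u]:
                # Each infected node attacks each susceptible neighbor once per step.
                if v not in infected and rng.random() < beta:
                    newly.add(v)
        for v in newly:
            infection_time[v] = t
        infected |= newly  # SI: nodes never recover
    return infected, infection_time
```

The returned infection times give one possible order of infections for the simulated cascade.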

Motivation: Culprits • Patient zeros: who started the epidemic? • Rumors: who started the rumor?

But: Real data is noisy! • We don't know exactly who is infected • Epidemiology: public-health surveillance • Surveillance pyramid [Nishiura+, PLoS ONE 2011]: CDC, lab, hospital, CNN headlines • Each level has a certain probability of missing some truly infected people

Real data is noisy! • Correcting missing data is by itself very important • Social media: Twitter • Because only a uniform sample of tweets is available [Morstatter+, ICWSM 2013], relevant 'infected' tweets may be missed • Figure: all tweets vs. sampled tweets, with the missing ones marked

Outline • Motivation / Introduction • Problem Definition • Our Approach • Experiments • Conclusion

The Problem • GIVEN: • Graph G(V, E) from historical data • Infected set D ⊆ V, sampled (p%) and incomplete • Infectivity β of the virus (assumed to follow the SI model) • FIND: • Seed set S, i.e., patient zeros/culprits • Set C- (the missing infected nodes) • Ripple R (the order of infections)
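
The input side of the problem can be sketched as follows: given a fully infected set, keep each infected node with probability p to obtain the observed set D, and call the rest C-. This is a sketch under the assumption that sampling is independent and uniform per node; the helper name `sample_observed` is hypothetical.

```python
import random

def sample_observed(infected, p, rng=None):
    """Split a true infected set into (observed D, missing C-).

    Each infected node is reported independently with probability p
    (assumed sampling scheme), so it is missed with probability 1 - p.
    """
    rng = rng or random.Random(0)
    observed = {v for v in infected if rng.random() < p}
    missing = set(infected) - observed
    return observed, missing
```

The task is then to recover the missing set, the seeds, and the infection order from the graph, the observed set, β, and p alone.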

Related Work – Culprits (Partial) • Shah and Zaman, IEEE TIT, 2011 • One seed • Provably finds the MLE seed for d-regular trees • SI process • Lappas et al., KDD, 2010 • k seeds (takes k as input) • Infected graph assumed to be in steady state • IC model • Prakash et al., ICDM, 2012 (NetSleuth) • Finds the number of seeds automatically • Assumes no mistakes in the infected set D

Related Work – Missing Nodes (Partial) • Costenbader and Valente 2003; Kossinets 2006; Borgatti et al. 2006 • Study the effect of sampling on macro-level network statistics • Adiga et al. 2013 • Sensitivity of total infections to noise in network structure • Sadikov et al., WSDM, 2011 • Correct for sampling in macro-level cascade statistics

Outline • Motivation / Introduction • Problem Definition • Our Approach • MDL • Decoupling • Finding S given C • Finding C given S • Experiments • Conclusion

MDL: Minimum Description Length Principle • Occam's Razor • Simplest model is the best model • "Induction by Compression" • Related to Bayesian approaches • MDL cost in bits = Model cost + Data cost • Best model = least cost in bits • Figure: a sender transmits model + data over a channel to a receiver

MDL Encoding For Our Problem • Sender and receiver both know: graph G(V, E), infectivity (β), sampling (p) • The model: seeds (S), ripple (R) • Data given model: missing nodes (C-) • Full infected set: D ∪ C-

Model (S, R) Cost • Scoring the seed set (S): cost of encoding the integer |S| + cost of identifying one of the possible |S|-sized sets • Scoring the ripple?
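
The two terms named on this slide can be sketched numerically. The slide's formula did not survive, so the following is an assumption: the integer |S| is encoded with Rissanen's universal code for integers (a common MDL choice), and the set is identified by an index among all |S|-sized subsets of the N nodes. Function names are illustrative.

```python
import math

def universal_int_bits(n):
    """Code length in bits of integer n >= 1 under Rissanen's universal code:
    log2(c0) + log2(n) + log2(log2(n)) + ..., keeping only positive terms."""
    if n < 1:
        raise ValueError("n must be >= 1")
    bits = math.log2(2.865064)  # normalizing constant c0
    term = math.log2(n)
    while term > 0:
        bits += term
        term = math.log2(term)
    return bits

def seed_set_bits(num_nodes, num_seeds):
    """Bits to send |S|, plus bits to pick one of C(N, |S|) candidate seed sets."""
    return universal_int_bits(num_seeds) + math.log2(math.comb(num_nodes, num_seeds))
```

Under this sketch, adding a seed always costs extra bits, so a larger seed set is only worthwhile if it shortens the rest of the description.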

Model (S, R) Cost • Scoring a ripple (R) • Figure: original graph, infected snapshot, and two candidate ripples R1 and R2

Model (S, R) Cost • Ripple cost for a ripple R = how long the ripple is + how the 'frontier' advances

Cost of the data (C-) • We have to transmit the missed nodes C- (green nodes in the figure) • So that the receiver can recover D • Detail: γ = 1 - p, i.e., the probability that a truly infected node is missing

Total MDL Cost • Finally, total cost = model cost of (S, R) + data cost of C- • Our problem is now: find S, R, C- to minimize it

Our Approach: Decoupling • The two problems are • Finding seeds/ripple (S, R) • Finding missing nodes (C-) • Can we decouple them?

Decoupling the problems (contd.) • Finding seeds depends on missing nodes • Figure: NetSleuth with no missing nodes as input vs. NetSleuth with the correct missing nodes filled in as input (legend: missing nodes, seed, infected node)

Decoupling the problems (contd.) • Finding missing nodes also depends on seeds • Figure: seed S and nodes A and B (legend: infected, not infected); most probably A was missed

Finding culprits (S) given missing nodes (C-) • Suppose an oracle gives us the missing nodes (C-) • We then have the complete infected set (D ∪ C-) • Apply NetSleuth [Prakash et al., ICDM 2012] directly: no sampling involved • This gives us the seed set • Figure: applying NetSleuth on the oracle's answer (legend: missing nodes, seed, infected node)

Missing Nodes (C-) given (S) • Oracle gives us S; find C- • Naïve approach? • Enumerate all possible C- • Pick the best one according to MDL • Infeasible! (exponentially many sets)

Sub-Problem 1: Best hidden hazard given one seed • The best node is the one that makes the seed s more likely • We use empirical risk as the measure • Sanity check: ideally the risk should be 0 • So the best hidden hazard is the node whose addition gives the lowest risk (formula in the paper)

Sub-Problem 1: Best Hidden Hazard • Using some results from Prakash et al. 2012 (see details in paper), we can rewrite it in terms of u1, the eigenvector corresponding to the smallest eigenvalue of the Laplacian submatrix of D

Detour: Laplacian Submatrix • Laplacian = Deg(G) - A(G), i.e., degree matrix minus adjacency matrix • L_D = keep only the rows and columns for nodes in D (Laplacian submatrix) • u1 = eigenvector of L_D for its smallest eigenvalue λ
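
The detour can be followed step by step on a small graph. This is a minimal sketch that assumes a dense, symmetric adjacency matrix and takes the submatrix over both the rows and the columns of the infected nodes so that it stays square; for large graphs a sparse eigensolver would be the practical choice. The function name is illustrative.

```python
import numpy as np

def smallest_eigvec_of_submatrix(adjacency, infected):
    """Return (smallest eigenvalue, its eigenvector u1, node order) for the
    Laplacian restricted to the infected nodes."""
    A = np.asarray(adjacency, dtype=float)
    L = np.diag(A.sum(axis=1)) - A            # Laplacian = Deg(G) - A(G)
    idx = np.asarray(sorted(infected))
    L_D = L[np.ix_(idx, idx)]                 # rows and columns of infected nodes
    eigvals, eigvecs = np.linalg.eigh(L_D)    # eigh: ascending eigenvalues, symmetric input
    u1 = eigvecs[:, 0]
    if u1.sum() < 0:                          # eigenvectors are sign-ambiguous; fix the sign
        u1 = -u1
    return eigvals[0], u1, idx
```

Entry `u1[i]` belongs to node `idx[i]`, so scores can be mapped back to node identifiers.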

Okay • How to solve this quickly? • Proof omitted: see paper

Best hidden hazard • Choose the node n* with the best Z-score (formula in the paper) • The score measures how connected a node n is to centrally located infected nodes w.r.t. s in D • Depends on the seed as well as the structure

Sub-Problem 2: How many missing nodes? • MDL? • Add nodes in order of Z-score until the MDL cost increases • MDL is not convex! • But it has convex-like behavior
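
The stopping rule on this slide can be sketched as a greedy loop. The sketch assumes the candidates are already sorted by Z-score and that a cost function is supplied by the caller; `total_cost` is a hypothetical callable standing in for the MDL score, and stopping at the first increase is a heuristic since the cost is not truly convex.

```python
def add_until_cost_rises(ranked_candidates, total_cost):
    """Add candidates in rank order while the cost keeps dropping.

    ranked_candidates : nodes sorted from best to worst score
    total_cost        : callable taking a list of chosen nodes, returning bits
    """
    chosen = []
    best = total_cost(chosen)
    for node in ranked_candidates:
        cost = total_cost(chosen + [node])
        if cost >= best:   # first non-improving step: stop
            break
        chosen.append(node)
        best = cost
    return chosen, best
```

With a cost that is roughly bowl-shaped in the number of added nodes, this picks the size of the missing set automatically.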

Sub-Problem 3: What if |Seeds| > 1 • Using Z-scores: missing nodes are near one seed • Ideal: missing nodes near both seeds

Sub-Problem 3: What if |Seeds| > 1 • Exonerate previous seeds • Make previous seeds uninfected and calculate u1 • The blame is transferred to the locality of the older seed • Complete Z-score(n) = max over all seeds of Z-score(n) • Maximum, as we need high-quality missing nodes • Take nodes with top-k complete Z-scores
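
The last three bullets (maximum over seeds, then top-k) can be sketched as below. The per-seed Z-scores are taken as given input here; the exoneration step that produces them is not implemented, and the function name and input layout are assumptions for illustration.

```python
def top_k_by_complete_score(per_seed_scores, k):
    """Combine per-seed scores by taking the maximum, then keep the top k nodes.

    per_seed_scores : dict seed -> dict node -> score for that seed
    Returns (top-k nodes, dict node -> complete score).
    """
    complete = {}
    for scores in per_seed_scores.values():
        for node, z in scores.items():
            complete[node] = max(z, complete.get(node, float("-inf")))
    # Sort by descending score; break ties by node name for determinism.
    ranked = sorted(complete, key=lambda n: (-complete[n], str(n)))
    return ranked[:k], complete
```

Taking the maximum means a node only needs to look suspicious with respect to one seed to be ranked highly.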

Finding missing nodes given seeds • Phew!

The complete algorithm – NetFill (Outline) • Running time: sub-quadratic in practice
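
The outline itself did not survive on this slide. The sketch below only illustrates the alternating pattern the talk describes (find seeds given missing nodes, then missing nodes given seeds), not the actual NetFill procedure; `find_seeds`, `find_missing`, and `total_cost` are hypothetical callables supplied by the caller.

```python
def alternate(observed, find_seeds, find_missing, total_cost, max_rounds=20):
    """Alternate between the two sub-problems until the cost stops improving.

    find_seeds(infected_set)      -> seed set
    find_missing(seeds, observed) -> candidate missing nodes
    total_cost(seeds, missing)    -> cost in bits (lower is better)
    """
    observed = set(observed)
    missing = set()
    seeds = find_seeds(observed | missing)
    best = total_cost(seeds, missing)
    for _ in range(max_rounds):
        new_missing = set(find_missing(seeds, observed))
        new_seeds = find_seeds(observed | new_missing)
        cost = total_cost(new_seeds, new_missing)
        if cost >= best:   # no improvement: keep the previous answer
            break
        seeds, missing, best = new_seeds, new_missing, cost
    return seeds, missing, best
```

The `max_rounds` cap is a safety net for the sketch; the talk does not state a bound on the number of rounds.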

Datasets • Real and synthetic graphs • Real and simulated cascades • Graphs • GRID • AS-OREGON • FLIXSTER: a friendship network with movie ratings; cascade: the same movie rating from friends • MEME-TRACKER: hl-mt and hl-hl [Gomez-Rodriguez et al. 2010]

Baselines • NETSLEUTH • Simulation • Simulate the SI process until we reach D, giving a simulated infected set I • Seeds = input • Missing nodes = I \ D • Frontier • Nodes "next in line" to be infected • At the boundary (frontier) of the infected set • Seeds = found given the missing nodes (NetSleuth on Frontier + data D)
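
The Frontier baseline's candidate set is simple to compute. A minimal sketch, assuming an adjacency dictionary as the graph representation (the function name is illustrative):

```python
def frontier(adj, infected):
    """Uninfected nodes that have at least one infected neighbor."""
    infected = set(infected)
    return {v for u in infected for v in adj[u] if v not in infected}
```

These are exactly the nodes an SI process could infect in its next step.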

Visualizing Performance (Grid connected) • Figure panels: NetSleuth, Simulation, Frontier, NetFill, each showing seeds and missing nodes • Legend: correct, FP, FN, seeds, infected

MDL Grid: Finding the correct size of Missing nodes (automatically)

Evaluation Metrics (Subtleties) • For the accuracy of C- (missing nodes) • Jaccard, precision, recall, and F-measure do not consider true negatives (TN) • MCC: Matthews correlation coefficient, computed from the confusion matrix • -1 <= MCC <= 1 • The closer to 1, the better
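
For reference, MCC is computed from the four confusion-matrix counts as (TP·TN - FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)). The sketch below assumes the candidate universe is the set of nodes that could have been reported as missing; the helper names and the convention of returning 0 when the denominator vanishes are choices made for this example.

```python
import math

def confusion_counts(predicted, actual, candidates):
    """Count TP, FP, FN, TN for a predicted missing set against the true one."""
    predicted, actual, candidates = set(predicted), set(actual), set(candidates)
    tp = len(predicted & actual)
    fp = len(predicted - actual)
    fn = len(actual - predicted)
    tn = len(candidates - predicted - actual)
    return tp, fp, fn, tn

def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient in [-1, 1]; 0.0 if undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom
```

Unlike precision or recall, the score changes when the number of correctly rejected nodes changes, which matters when most candidates are truly uninfected.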

Evaluation Metrics (contd.) • For seeds (S) and ripple (R) • Q = MDL(algorithm) / MDL(true) • From literature (see paper for details) • Again, the closer to 1, the better

Grid-connected (Synthetic Graph, Synthetic Cascades) • Closer to 1 the better

AS-Oregon (Real Graph, Synthetic Cascades) • Closer to 1 the better

Meme-Tracker HL-MT (Real Graph, Real Cascades) • Closer to 1 the better • See paper for more experiments, e.g., scalability and robustness

Meme-Tracker: case study • 96,000-node graph for the meme “State of the economy” • Found missing websites like “www.nbcbayarea.com”, “chicagotribune.com” and some blog posts

Conclusions • Given: graph and sampled infections • Find: missing infections and culprits • Formulated the problem using MDL • Two-stage alternating optimization • Find best seeds given missing nodes • Find best missing nodes given seeds • NetFill • Subquadratic (near-linear in many cases) • Outperforms baselines on real and synthetic data • Figure: NetFill on a grid
