Hidden Hazards: Finding Missing Nodes in Large Graph Epidemics
This research investigates the problem of finding missing infected nodes in large-scale graph epidemics using a minimum description length principle approach. The study focuses on the identification of patient zeros and the correction of missing data in order to improve epidemiology and public health surveillance.
Presentation Transcript
Hidden Hazards: Finding Missing Nodes in Large Graph Epidemics • Shashidhar Sundareisan (Virginia Tech) • Jilles Vreeken (Max Planck Institute) • B. Aditya Prakash (Virginia Tech) • SDM, Vancouver, May 1, 2015
Contagions • Social collaboration • Information Diffusion • Viral Marketing • Epidemiology and Public Health • Cyber Security • Human mobility • Games and Virtual Worlds • Ecology • Localized effects: riots… Sundareisan, Vreeken, Prakash 2015
Virus Propagation • Diseases spread over contact networks • Susceptible-Infected (SI) Model with infectivity β • CDC data [AJPH 2007]: visualization of the first 35 tuberculosis (TB) patients and their 1039 contacts
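A minimal sketch of a discrete-time SI process on a contact network may help make the model concrete. The function name `simulate_si`, the adjacency-dict representation, and the rule that every infected neighbor makes one independent attempt per time step are illustrative assumptions, not details taken from the slides.

```python
import random

def simulate_si(adj, seeds, beta, steps, rng=None):
    """Run a discrete-time SI process; return {node: infection time}."""
    rng = rng or random.Random(0)
    infected_at = {s: 0 for s in seeds}
    for t in range(1, steps + 1):
        newly = set()
        for u in infected_at:
            for v in adj[u]:
                # Assumption: each infected neighbor independently infects
                # a susceptible node with probability beta per step.
                if v not in infected_at and rng.random() < beta:
                    newly.add(v)
        # Infected nodes never recover in SI, so we only ever add nodes.
        for v in newly:
            infected_at[v] = t
    return infected_at
```

The recorded infection times give one possible ordering of infections, which is the kind of information the talk later calls a ripple.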
Motivation: Culprits • Patient zeros • Who started the epidemic? • Rumors • Who started the rumor?
But: Real data is noisy! • We don't know exactly who is infected • Epidemiology • Public-health surveillance • Each level of the surveillance pyramid has a certain probability of missing some truly infected people • [Figure: Surveillance Pyramid with levels including CDC, Lab, Hospital, and CNN headlines, with uncertain cases marked "Not sure"; Nishiura+, PLoS ONE 2011]
Real data is noisy! • Correcting missing data is by itself very important • Social Media • Twitter: due to uniform sampling [Morstatter+, ICWSM 2013], the relevant 'infected' tweets may be missed • [Figure: sampling reduces the full set of tweets to the sampled tweets; the missing ones are marked "?"]
Outline • Motivation---Introduction • Problem Definition • Our Approach • Experiments • Conclusion Sundareisan, Vreeken, Prakash 2015
The Problem • GIVEN: • Graph G(V, E) from historical data • Infected set D ⊆ V, sampled (p%) and incomplete • Infectivity β of the virus (assumed to follow the SI model) • FIND: • Seed set S, i.e. patient zeros/culprits • Set C- (the missing infected nodes) • Ripple R (the order of infections)
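To illustrate how an incomplete infected set D could arise, here is a small sketch that keeps each truly infected node independently with probability p. The helper name `sample_observed` and the independence assumption are illustrative; the slides only state that D is a p% sample.

```python
import random

def sample_observed(infected, p, rng=None):
    """Keep each truly infected node with probability p.

    Returns (observed set D, missing set C-). Assumes independent,
    uniform sampling, which is an illustrative simplification.
    """
    rng = rng or random.Random(0)
    observed = {v for v in infected if rng.random() < p}
    return observed, set(infected) - observed
```

Under this assumption a node is missed with probability 1 - p, which matches the quantity the talk later calls γ.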
Related Work: Culprits (Partial) • Shah and Zaman, IEEE TIT, 2011 • One seed • Provably finds the MLE seed for d-regular trees • SI process • Lappas et al., KDD, 2010 • k seeds (takes k as input) • Infected graph assumed to be in steady state • IC model • Prakash et al., ICDM, 2012 (NetSleuth) • Finds the number of seeds automatically • Assumes no mistakes in the infected set D
Related Work: Missing Nodes (Partial) • Costenbader and Valente 2003; Kossinets 2006; Borgatti et al. 2006 • Study the effect of sampling on macro-level network statistics • Adiga et al. 2013 • Sensitivity of total infections to noise in network structure • Sadikov et al., WSDM, 2011 • Correct for sampling in macro-level cascade statistics
Outline • Motivation---Introduction • Problem Definition • Our Approach • MDL • Decoupling • Finding S given C • Finding C given S • Experiments • Conclusion Sundareisan, Vreeken, Prakash 2015
MDL: Minimum Description Length Principle • Occam's Razor • Simplest model is the best model • "Induction by Compression" • Related to Bayesian approaches • MDL cost in bits = Model cost + Data cost • Best model = least cost in bits • [Figure: sender transmits Data + Model over a channel to the receiver]
MDL Encoding For Our Problem • Sender has: Graph G(V, E), Infectivity (β), Sampling (p), Seeds (S), Infected set (D ∪ C-), Ripple (R), Missing nodes (C-) • Receiver has: Graph G(V, E), Infectivity (β), Sampling (p) • The model: Seeds (S), Ripple (R) • Data given model: Missing nodes (C-)
Model (S, R) Cost • Scoring the seed set (S): cost of encoding the integer |S|, plus the cost of picking one of the possible |S|-sized sets • Scoring the ripple?
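The two parts of the seed-set cost can be sketched as follows. This is an illustration under stated assumptions: the integer |S| is encoded with Rissanen's universal code for integers, and the set is identified by an index among all |S|-sized subsets of a candidate pool whose size `num_candidates` is left to the caller, since the slide text does not say which pool is used.

```python
from math import comb, log2

def universal_int_bits(n):
    """Rissanen's universal code length for an integer n >= 1, in bits."""
    bits = log2(2.865064)  # normalizing constant of the universal code
    term = log2(n)
    while term > 0:  # sum log2(n) + log2(log2(n)) + ... over positive terms
        bits += term
        term = log2(term)
    return bits

def seed_set_bits(num_candidates, num_seeds):
    """Bits to encode |S| plus bits to pick one |S|-sized subset."""
    return universal_int_bits(num_seeds) + log2(comb(num_candidates, num_seeds))
```

Both terms grow with |S|, so under this scoring a larger seed set has to pay for itself by making the rest of the description cheaper.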
Model (S, R) Cost • Scoring a ripple (R) • [Figure: the original graph, the infected snapshot, and two candidate ripples R1 and R2]
Model (S, R) Cost • Ripple cost for a ripple R has two parts: • How long the ripple is • How the 'frontier' advances
Cost of the data (C-) • We have to transmit the missed nodes C- (green nodes) • So that the receiver can recover D • Detail: γ = 1 - p, i.e. the probability that a truly infected node is missing
Total MDL Cost • Finally: total cost = model cost of (S, R) + data cost of C- • And our problem is now: • Find S, R, C- to minimize it
Outline • Motivation---Introduction • Problem Definition • Our Approach • MDL • Decoupling • Finding S given C • Finding C given S • Experiments • Conclusion Sundareisan, Vreeken, Prakash 2015
Our Approach: Decoupling • The two problems are • Finding seeds/ripple (S, R) • Finding Missing nodes (C-) • Can we decouple them? Sundareisan, Vreeken, Prakash 2015
Decoupling the problems (contd.) • Finding seeds depends on missing nodes • [Figure: NetSleuth with no missing nodes as input vs. NetSleuth with the correct missing nodes filled in as input; legend: missing nodes, seed, infected node]
Decoupling the problems (contd.) • Finding missing nodes also depends on seeds • [Figure: seed S with candidate nodes A and B among infected and not-infected nodes; most probably A was missed]
Outline • Motivation---Introduction • Problem Definition • Our Approach • MDL • Decoupling • Finding S given C • Finding C given S • Experiments • Conclusion Sundareisan, Vreeken, Prakash 2015
Finding culprits (S) given missing nodes (C-) • Suppose an oracle gives us the missing nodes (C-) • We have the complete infected set (D ∪ C-) • Apply NetSleuth* directly • NO SAMPLING INVOLVED • Will give us the seed set • [Figure: applying NetSleuth* on the oracle's answer; legend: missing nodes, seed, infected node] • * Prakash et al., ICDM 2012
Outline • Motivation---Introduction • Problem Definition • Our Approach • MDL • Decoupling • Finding S given C • Finding C given S • Experiments • Conclusion Sundareisan, Vreeken, Prakash 2015
Missing Nodes (C-) given Seeds (S) • Oracle gives us S, find C- • Naïve approach? • Find all possible C- • Pick the best one according to MDL • Infeasible! (exponentially many candidate sets)
Our Approach • Sub-problem 1: |Seeds| = 1 • |Missing nodes| = 1 • Sub-problem 2: Finding the right number of missing nodes. • Sub-problem 3: |Seeds| > 1 Sundareisan, Vreeken, Prakash 2015
Sub Problem 1: Best hidden hazard given one seed • Best node is one which makes the Seed s more likely • We use empirical risk as the measure • Sanity Check: ideally risk should be 0 • So best hidden hazard, Sundareisan, Vreeken, Prakash 2015
Sub-Problem 1: Best Hidden Hazard • Using some results in Prakash et al. 2012 (see details in paper), we can rewrite it in terms of u1 • u1 is the eigenvector corresponding to the smallest eigenvalue of the Laplacian submatrix of D
Detour: Laplacian Submatrix • Laplacian = Deg(G) - A(G) • LD = take only the rows for nodes in D (Laplacian submatrix) • u1 (smallest eigenvalue's eigenvector) • [Figure: Laplacian = Degree - Adjacency; the Laplacian submatrix for D with its smallest eigenvalue λ and eigenvector]
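The construction above can be sketched in plain Python. Two assumptions are made for illustration: the submatrix keeps both the rows and the columns of nodes in D (so that it is square and has eigenvectors), and u1 is approximated by power iteration on a shifted matrix rather than by a library eigensolver. The function names are hypothetical, and the graph is assumed to be simple and undirected.

```python
def laplacian_submatrix(adj, infected):
    """Rows and columns of L = Deg - A restricted to the infected nodes."""
    nodes = sorted(infected)
    idx = {v: i for i, v in enumerate(nodes)}
    L = [[0.0] * len(nodes) for _ in nodes]
    for v in nodes:
        L[idx[v]][idx[v]] = float(len(adj[v]))  # degree in the full graph
        for w in adj[v]:
            if w in idx:
                L[idx[v]][idx[w]] = -1.0
    return nodes, L

def smallest_eigenvector(L, shift, iters=500):
    """Power iteration on (shift*I - L).

    With shift >= the largest eigenvalue of L (2 * max degree suffices),
    the dominant eigenvector of the shifted matrix is the eigenvector of
    L's smallest eigenvalue.
    """
    n = len(L)
    u = [1.0] * n
    for _ in range(iters):
        nxt = [shift * u[i] - sum(L[i][j] * u[j] for j in range(n))
               for i in range(n)]
        norm = sum(x * x for x in nxt) ** 0.5
        u = [x / norm for x in nxt]
    return u
```

For example, on the path 0-1-2-3-4 with D = {1, 2, 3}, calling `smallest_eigenvector(L, shift=4.0)` gives the largest entry to the middle node, reflecting that it is the most centrally located infected node.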
Okay • How to solve this quickly? Proof Omitted: see paper Sundareisan, Vreeken, Prakash 2015
Best hidden hazard • Choose the node n* with the best score (formula in paper) • The score measures: • how connected a node n is to centrally located infected nodes w.r.t. s in D • It depends on the seed as well as the structure
Sub-Problem 2: How many missing nodes? • MDL? • Add nodes based on Z-scores until the MDL cost increases • MDL is not convex! • But it has convex-like behavior...
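The stopping rule on this slide can be sketched as a simple greedy loop. The names `grow_missing_set`, `ranked_candidates`, and `total_cost` are hypothetical: candidates are assumed to arrive already sorted by Z-score, and `total_cost` stands in for whatever MDL score is used.

```python
def grow_missing_set(ranked_candidates, total_cost):
    """Add candidates in rank order; stop at the first addition that
    raises the cost. `total_cost` maps a list of nodes to a number."""
    chosen = []
    best = total_cost(chosen)
    for node in ranked_candidates:
        cost = total_cost(chosen + [node])
        if cost > best:
            break  # first increase: stop, relying on convex-like behavior
        chosen.append(node)
        best = cost
    return chosen, best
```

Because the cost is not truly convex, stopping at the first increase is a heuristic: it can miss a lower cost further down the ranking.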
Sub-Problem 3: What if |Seeds| > 1 • Using Z-scores directly: missing nodes are found near only one seed • Ideal: missing nodes near both seeds
Sub-Problem 3: What if |Seeds| > 1 • Exonerate previous seeds • Make previous seeds uninfected and calculate u1 • The blame is transferred to the locality of the older seed • Complete Z-score(n) = max over all seeds of Z-score(n) • Maximum, as we need high-quality missing nodes • Take the nodes with the top-k complete Z-scores
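The max-over-seeds combination and the top-k selection can be sketched as below. The input layout `{seed: {node: z-score}}` and the function name `top_k_missing` are assumptions for illustration; how each per-seed Z-score is computed is not shown here.

```python
def top_k_missing(z_by_seed, k):
    """z_by_seed: {seed: {node: z-score}}.

    Returns the k nodes with the highest complete Z-score, where the
    complete score of a node is its maximum score over all seeds.
    """
    complete = {}
    for scores in z_by_seed.values():
        for node, z in scores.items():
            complete[node] = max(z, complete.get(node, float("-inf")))
    # Ties are broken by node name so the result is deterministic.
    return sorted(complete, key=lambda n: (-complete[n], str(n)))[:k]
```

Taking the maximum means a node only needs strong support from one seed to be selected, so candidates near every seed can make the top-k list.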
Finding missing nodes given seeds Phew! Sundareisan, Vreeken, Prakash 2015
The complete algorithm – NetFill (Outline) Running time: sub-quadratic in practice Sundareisan, Vreeken, Prakash 2015
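Since the outline itself is not reproduced in this transcript, the following is only a rough sketch of a two-stage alternating scheme of the kind the talk describes: find seeds given missing nodes, find missing nodes given seeds, and repeat while the cost keeps dropping. All names (`alternate`, `find_seeds`, `find_missing`, `total_cost`, `max_rounds`) are hypothetical placeholders, and this should not be read as the authors' exact procedure.

```python
def alternate(observed, find_seeds, find_missing, total_cost, max_rounds=20):
    """Alternate between the two sub-problems until the cost stops improving.

    find_seeds(infected_set) -> seeds
    find_missing(seeds) -> set of missing nodes
    total_cost(seeds, missing) -> number (lower is better)
    """
    missing = set()
    seeds = find_seeds(observed | missing)
    best = total_cost(seeds, missing)
    for _ in range(max_rounds):
        new_missing = find_missing(seeds)
        new_seeds = find_seeds(observed | new_missing)
        cost = total_cost(new_seeds, new_missing)
        if cost >= best:
            break  # no improvement: keep the previous solution
        seeds, missing, best = new_seeds, new_missing, cost
    return seeds, missing, best
```

Passing the sub-solvers in as callables keeps the loop independent of how each stage is implemented.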
Outline • Motivation---Introduction • Problem Definition • Our Approach • Experiments • Conclusion Sundareisan, Vreeken, Prakash 2015
Datasets • Real and synthetic graphs • Real and simulated cascades • Graphs • GRID • AS-OREGON • FLIXSTER • a friendship network with movie ratings • Cascade: the same movie rating from friends • MEME-TRACKER • hl-mt and hl-hl (Gomez-Rodriguez et al. 2010)
Baselines • NETSLEUTH • Simulation • Simulate the SI process until we reach D • Seeds = Input • Missing nodes = I \ D • Frontier • Nodes "next in line" to be infected • At the boundary (frontier) of the infected set • Seeds = find seeds given missing nodes (NetSleuth on Frontier + data D)
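The Frontier baseline's notion of nodes "next in line" can be sketched as the set of uninfected nodes that have at least one infected neighbor. The function name and the adjacency-dict input are illustrative assumptions.

```python
def frontier(adj, infected):
    """Uninfected nodes with at least one infected neighbor."""
    infected = set(infected)
    return {v for u in infected for v in adj[u] if v not in infected}
```

In this baseline, these frontier nodes are taken as the guess for the missing nodes before the seeds are recomputed.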
Visualizing Performance (Grid connected) • [Figure: seeds and missing nodes found by NetSleuth, Simulation, Frontier, and NetFill; legend: correct, FP, FN, seeds, infected]
MDL Grid: Finding the correct size of Missing nodes (automatically) Sundareisan, Vreeken, Prakash 2015
Evaluation Metrics (Subtleties) • For the accuracy of C- (missing nodes) • Jaccard, precision, recall, and F-measure do not consider true negatives (TN) • MCC: Matthews correlation coefficient, computed from the confusion matrix • -1 <= MCC <= 1 • Closer to 1 the better
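For reference, the standard MCC formula over the four confusion-matrix counts can be computed as in this short sketch. Returning 0 when the denominator is zero is a common convention adopted here as an assumption.

```python
from math import sqrt

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: return 0 when a marginal is empty and the ratio is undefined.
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

Unlike precision or recall, the numerator includes the true negatives, which matters here because most nodes in a large graph are correctly left out of C-.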
Evaluation Metrics (contd.) • For seeds (S) and ripple (R) • Q = MDL(algorithm) / MDL(true) • From literature (see paper for details) • Again, closer to 1 the better Sundareisan, Vreeken, Prakash 2015
Grid-connected (Synthetic Graph, Synthetic Cascades) Closer to 1 the better Sundareisan, Vreeken, Prakash 2015
AS-Oregon (Real Graph, Synthetic Cascades) Closer to 1 the better Sundareisan, Vreeken, Prakash 2015
Meme-Tracker HL-MT (Real Graph, Real Cascades) • Closer to 1 the better • See paper for more experiments, e.g. scalability, robustness, etc.
Meme-Tracker: case study • 96,000-node graph for the meme "State of the economy" • Found missing websites like "www.nbcbayarea.com", "chicagotribune.com" and some blog posts
Outline • Motivation---Introduction • Problem Definition • Our Approach • Experiments • Conclusion Sundareisan, Vreeken, Prakash 2015
Conclusions • Given: graph and sampled infections • Find: missing infections and culprits • Formulated the problem • Using MDL • Two-stage alternating optimization • Find best seeds given missing nodes • Find best missing nodes given seeds • NetFill • Subquadratic (near-linear in many cases) • Outperforms baselines on real and synthetic data • [Figure: NetFill on a grid]