
Leveraging Information Theory for Mining Graphs and Sequences: From Propagation to Segmentation



Presentation Transcript


  1. Leveraging Information Theory for Mining Graphs and Sequences: From Propagation to Segmentation B. Aditya Prakash Computer Science Virginia Tech. ITA Workshop, San Diego, Feb 5, 2016

  2. Networks are everywhere! Facebook Network [2010] Gene Regulatory Network [Decourty 2008] Human Disease Network [Barabasi 2007] The Internet [2005] Prakash 2016

  3. Dynamical Processes over networks are also everywhere! Prakash 2016

  4. Why do we care? • Social collaboration • Information Diffusion • Viral Marketing • Epidemiology and Public Health • Cyber Security • Human mobility • Games and Virtual Worlds • Ecology ........ Prakash 2016

  5. Why do we care? (1: Epidemiology) • Dynamical Processes over networks [AJPH 2007] SI Model CDC data: Visualization of the first 35 tuberculosis (TB) patients and their 1039 contacts Diseases over contact networks Prakash 2016

  6. Why do we care? (1: Epidemiology) • Dynamical Processes over networks • Each circle is a hospital • ~3000 hospitals • More than 30,000 patients transferred [US-MEDICARE NETWORK 2005] Problem: Given k units of disinfectant, whom to immunize? Prakash 2016

  7. Why do we care? (1: Epidemiology) ~6x fewer! [US-MEDICARE NETWORK 2005] CURRENT PRACTICE vs. OUR METHOD Hospital-acquired infections took 99K+ lives, cost $5B+ (all per year) Prakash 2016

  8. Why do we care? (2: Online Diffusion) > 800m users, ~$1B revenue [WSJ 2010] ~100m active users > 50m users Prakash 2016

  9. Why do we care? (2: Online Diffusion) • Dynamical Processes over networks Buy Versace™! Followers Celebrity Social Media Marketing Prakash 2016

  10. Why do we care? (3: To change the world?) • Dynamical Processes over networks Social networks and Collaborative Action Prakash 2016

  11. High Impact – Multiple Settings epidemic outbreaks Q. How to squash rumors faster? Q. How do opinions spread? Q. How to market better? products/viruses transmit s/w patches Prakash 2016

  12. Research Theme ANALYSIS Understanding POLICY/ ACTION Managing/Utilizing DATA Large real-world networks & processes Prakash 2016

  13. Research Theme – Public Health ANALYSIS Will an epidemic happen? POLICY/ ACTION How to control outbreaks? DATA Modeling # patient transfers Prakash 2016

  14. Research Theme – Social Media ANALYSIS # cascades in future? POLICY/ ACTION How to market better? DATA Modeling Tweets spreading Prakash 2016

  15. In this talk Q1: How to find hidden culprits? Q2: How to segment multi-dimensional sequences? DATA Large real-world networks & processes Prakash 2016

  16. Outline • Motivation • Part 1: Learning Models (Empirical Studies) • Q1: How to find hidden culprits? • Q2: How to segment data sequences? • Conclusion Prakash 2016

  17. Culprits Motivation • Patient zeroes • Who started the epidemic? • Rumors • Who started the rumor? Prakash 2016

  18. But: Real data is noisy! We don’t know exactly who is infected • Epidemiology • Public-health surveillance CDC Lab Hospital Not sure ? CNN headlines Surveillance Pyramid [Nishiura+, PLoS ONE 2011] ? Not sure Each level has a certain probability of missing some truly infected people Prakash 2016

  19. Real data is noisy! Correcting missing data is by itself very important • Social Media • Twitter: due to uniform sampling of the stream [Morstatter+, ICWSM 2013], relevant ‘infected’ tweets may be missed Tweets Missing ? Sampled Tweets ? Missing Sampling Prakash 2016

  20. Outline • Motivation---Introduction • Problem Definition • Our Approach • Experiments • Conclusion Prakash 2016

  21. The Problem [Sundareisan, Vreeken, Prakash 2015] • GIVEN: • Graph G(V, E) from historical data • Infected set D ⊆ V, sampled (p%) and incomplete • Infectivity β of the virus (assumed to follow the SI model) • FIND: • Seed set S, i.e. patient zeroes/culprits • Set C− (the missing infected nodes) • Ripple R (the order of infections) Prakash 2016
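The generative setting above (SI spread over G followed by p-sampling of the infected set) can be sketched in a few lines. Everything here is illustrative: `si_sample`, the adjacency-dict representation, and the parameter names are not from the paper.

```python
import random

def si_sample(adj, seeds, beta, p, steps):
    """Simulate an SI cascade on a graph, then sample the infected set.

    adj:   dict mapping node -> list of neighbor nodes
    seeds: initial infected nodes (the hidden culprits S)
    beta:  per-edge infection probability per step (SI model)
    p:     sampling probability; observed set D keeps each infected
           node independently with probability p
    Returns (full infected set, observed set D)."""
    infected = set(seeds)
    for _ in range(steps):
        newly = set()
        for u in infected:
            for v in adj[u]:
                # SI model: an infected node infects each susceptible
                # neighbor with probability beta; nodes never recover
                if v not in infected and random.random() < beta:
                    newly.add(v)
        infected |= newly
    # Imperfect surveillance: each truly infected node is observed
    # only with probability p (the missing ones form C-)
    observed = {u for u in infected if random.random() < p}
    return infected, observed
```

With `beta=1.0` and `p=1.0` this degenerates to breadth-first spread with perfect observation, which is handy for sanity checks.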

  22. Related Work – Culprits (Partial) • Shah and Zaman, IEEE TIT, 2011 • One seed. • Provably finds MLE seed for d-regular trees • SI process • Lappas et al., KDD, 2010. • k seeds (takes k as input) • Infected graph assumed to be in steady state • IC model • Prakash et al., ICDM, 2012. (NetSleuth) • Finds number of seeds automatically. • Assumes no mistakes in infected set D. Prakash 2016

  23. Related Work – Missing Nodes (Partial) • Costenbader and Valente 2003; Kossinets 2006; Borgatti et al. 2006 • Study the effect of sampling on macro-level network statistics • Adiga et al. 2013 • Sensitivity of total infections to noise in network structure • Sadikov et al., WSDM, 2011 • Correct for sampling in macro-level cascade statistics Prakash 2016

  24. Outline • Motivation---Introduction • Problem Definition • Our Approach • MDL • Decoupling • Finding S given C • Finding C given S • Experiments • Conclusion Prakash 2016

  25. MDL: Minimum Description Length Principle • Occam’s Razor • Simplest model is the best model • “Induction by Compression” • Related to Bayesian approaches • MDL cost in bits = Model cost + Data cost • Best model = least cost in bits Sender → (Data + Model, over the Channel) → Receiver Prakash 2016

  26. MDL Encoding For Our Problem The Model: Seeds (S), Ripple (R), Missing nodes (C−) Sender knows: Graph G(V, E), Infectivity (β), Sampling (p), Seeds (S), Infected set (D ∪ C−), Ripple (R), Missing nodes (C−) Receiver knows: Graph G(V, E), Infectivity (β), Sampling (p) Data given model Prakash 2016

  27. Model (S, R) Cost • Scoring the seed set (S): number of possible |S|-sized sets, plus encoding the integer |S| • Scoring the ripple? Prakash 2016
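One plausible reading of the seed-set score above is a standard two-part code: a universal (Rissanen log-star) code for the integer |S|, plus log2 of the number of |S|-sized subsets of V. This is a sketch under that assumption; the function names are illustrative:

```python
import math

def log_star(n):
    """Rissanen's universal code length for an integer n >= 1:
    log2(c0) + log2(n) + log2(log2(n)) + ..., summing the positive
    terms, with normalizing constant c0 ~ 2.865064."""
    bits = math.log2(2.865064)
    x = math.log2(n)
    while x > 0:
        bits += x
        x = math.log2(x)
    return bits

def seed_set_bits(n_nodes, k):
    """Bits to identify which k of n_nodes are seeds:
    encode the integer k, then pick one of the C(n, k) possible
    k-sized sets (log2 of the count)."""
    return log_star(k) + math.log2(math.comb(n_nodes, k))
```

Note how the cost grows with |S|, so MDL automatically penalizes explanations with many culprits.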

  28. Model (S, R) Cost • Scoring a ripple (R) Infected Snapshot Original Graph Ripple R1 Ripple R2 Prakash 2016

  29. Model (S, R) Cost • Ripple cost Ripple R How the ‘frontier’ advances How long is the ripple Prakash 2016

  30. Cost of the data (C-) • We have to transmit the missed nodes C- (green nodes) • So that the receiver can recover D • Detail: γ = 1 – p, i.e. the probability of a node being truly missing Prakash 2016
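The data cost above can be sketched as a Bernoulli code with γ = 1 − p: each candidate node costs −log2 γ bits if it was truly missed and −log2 p bits otherwise. This is an illustrative reading of the slide, not necessarily the paper's exact encoding:

```python
import math

def missing_bits(n_candidates, n_missing, p):
    """Bits to transmit which of n_candidates nodes were missed
    by p-sampling, under a Bernoulli(gamma) code with gamma = 1 - p.
    Missed nodes cost -log2(gamma) bits each; the rest -log2(p)."""
    gamma = 1.0 - p
    kept = n_candidates - n_missing
    return -(n_missing * math.log2(gamma) + kept * math.log2(p))
```

For p = 0.5 every candidate costs exactly one bit, matching the intuition that a fair coin flip per node carries maximal uncertainty.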

  31. Total MDL Cost • Finally • And our problem is now • Find S, R, C- to minimize it Prakash 2016

  32. Outline • Motivation---Introduction • Problem Definition • Our Approach • MDL • Decoupling • Finding S given C • Finding C given S • Experiments • Conclusion Prakash 2016

  33. Our Approach: Decoupling • The two problems are • Finding seeds/ripple (S, R) • Finding Missing nodes (C-) • Can we decouple them? Prakash 2016

  34. Decoupling the problems (contd.) • Finding seeds depends on missing nodes. Legend Missing nodes Seed Infected node NetSleuth: correct missing nodes filled in as input NetSleuth: No missing nodes as Input Prakash 2016

  35. Decoupling the problems (cont.) • Finding missing nodes also depends on seeds. Not Infected Infected Most probably A was missed B Seed S A Prakash 2016

  36. Outline • Motivation---Introduction • Problem Definition • Our Approach • MDL • Decoupling • Finding S given C • Finding C given S • Experiments • Conclusion Prakash 2016

  37. Finding culprits (S) given missing nodes (C-) • Suppose an oracle gives us the missing nodes (C-) • We have the complete infected set (D U C-) • Apply NetSleuth directly • NO SAMPLING INVOLVED • Will give us the seed set Legend Missing nodes Seed Infected node * Prakash et al., ICDM 2012 Applying NetSleuth* on Oracle’s Answer Prakash 2016

  38. Outline • Motivation---Introduction • Problem Definition • Our Approach • MDL • Decoupling • Finding S given C • Finding C given S • Experiments • Conclusion Prakash 2016

  39. Missing Nodes (C-) given (S) • Oracle gives us S, find C- • Naïve Approach? • Find all possible C- • Pick the best one according to MDL • Infeasible! (exponentially many candidate sets) Prakash 2016

  40. Our Approach • Sub-problem 1: |Seeds| = 1 • |Missing nodes| = 1 • Sub-problem 2: Finding the right number of missing nodes. • Sub-problem 3: |Seeds| > 1 Prakash 2016

  41. Sub-Problem 1: Best hidden culprit given one seed • Best node is the one which makes the seed s more likely • We use empirical risk as the measure • Sanity check: ideally the risk should be 0 • So the best hidden culprit is the node minimizing this risk Prakash 2016

  42. Sub-Problem 1: Best Hidden Culprit • Using some results in Prakash et al. 2012 (see details in paper), we can rewrite it, where u1 is the eigenvector corresponding to the smallest eigenvalue of the Laplacian submatrix of D Prakash 2016

  43. Detour: Laplacian Submatrix • Laplacian = Deg(G) – A(G) • LD = take only rows for nodes in D (Laplacian submatrix) • u1 (smallest eigenvalue’s eigenvector) [Figure: Degree, Adjacency, Laplacian, Laplacian submatrix of D, eigenvector] Prakash 2016
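The detour above is easy to reproduce with NumPy. A small caveat: to take an eigendecomposition the submatrix must stay square, so this sketch restricts both rows and columns of L to the infected set D; `smallest_eigvec` is an illustrative name, not the paper's API:

```python
import numpy as np

def smallest_eigvec(adj, infected):
    """Laplacian-submatrix computation from the slide.

    adj:      dense symmetric 0/1 adjacency matrix A(G)
    infected: list of node indices forming the infected set D
    Returns (smallest eigenvalue, its eigenvector u1) of L_D."""
    deg = np.diag(adj.sum(axis=1))          # Deg(G)
    lap = deg - adj                         # L = Deg(G) - A(G)
    sub = lap[np.ix_(infected, infected)]   # L_D: rows and cols in D
    vals, vecs = np.linalg.eigh(sub)        # eigh: L_D is symmetric
    return vals[0], vecs[:, 0]              # eigh sorts ascending
```

On a 3-node path with D = {0, 1}, the submatrix is [[1, −1], [−1, 2]], whose smallest eigenvalue is (3 − √5)/2.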

  44. Okay • How to solve this quickly? (Proof omitted: see paper) Prakash 2016

  45. Best hidden hazard • Choose n* to maximize this score, which measures • how connected a node n is to centrally located infected nodes w.r.t. s in D • Depends on the seed as well as the structure Prakash 2016

  46. Sub-Problem 2: How many missing nodes? • MDL? • Add nodes based on Z-scores till MDL increases • MDL is not convex! • But it has convex-like behavior… Prakash 2016
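The stopping rule above (add Z-score-ranked nodes until the MDL cost rises, exploiting the convex-like behavior) can be sketched as a greedy loop. `mdl_of` is a hypothetical caller-supplied cost function, not an interface from the paper:

```python
def pick_missing(candidates, mdl_of):
    """Greedy sketch of sub-problem 2.

    candidates: nodes pre-sorted by Z-score, best first
    mdl_of:     function mapping a chosen node list to its total
                MDL cost in bits (hypothetical interface)
    Adds nodes while the cost keeps dropping; stops at the first
    increase, relying on the convex-like shape of the cost curve."""
    chosen = []
    best = mdl_of(chosen)
    for node in candidates:
        cost = mdl_of(chosen + [node])
        if cost >= best:
            break  # MDL went up: stop here
        chosen.append(node)
        best = cost
    return chosen
```

With a toy cost curve that bottoms out at two nodes, the loop stops exactly there.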

  47. Sub-Problem 3: What if |Seeds| > 1 SKIP! Using Z-scores: missing nodes are near one seed Ideal: missing nodes near both seeds Prakash 2016

  48. Sub-Problem 3: What if |Seeds| > 1 SKIP! • Exonerate previous seeds • Make previous seeds uninfected and calculate u1 • The blame is transferred to the locality of the older seed • Complete Z-score = max over all seeds of Z-score(n) • Maximum as we need high-quality missing nodes • Take nodes with top-k complete Z-scores Prakash 2016
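The multi-seed rule above reduces to: score each candidate against every seed separately, keep the maximum, then take the top-k. In this sketch `z_score` is a hypothetical per-seed scoring function standing in for the eigenvector-based score (with exoneration of the other seeds handled inside it):

```python
def complete_z(node, seeds, z_score):
    """Complete Z-score of a candidate missing node: the maximum
    per-seed score, since we want nodes that are high quality
    for at least one seed."""
    return max(z_score(node, s) for s in seeds)

def top_k_missing(nodes, seeds, z_score, k):
    """Return the k candidates with the highest complete Z-scores."""
    return sorted(nodes,
                  key=lambda n: complete_z(n, seeds, z_score),
                  reverse=True)[:k]
```

A toy score like negative distance to a seed shows the effect: candidates close to either seed outrank ones stuck in between.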

  49. Finding missing nodes given seeds Phew! Prakash 2016

  50. The complete algorithm – NetFill (Outline) Running time: sub-quadratic in practice Prakash 2016
