
Leveraging Propagation for Data Mining: Models, Algorithms, Applications

B. Aditya Prakash, Department of Computer Science. Social Computing Workshop, ARL, Sept 28, 2016.


Presentation Transcript


  1. Leveraging Propagation for Data Mining: Models, Algorithms, Applications • B. Aditya Prakash, Department of Computer Science • Social Computing Workshop, ARL, Sept 28, 2016

  2. Dynamical Processes over networks are also everywhere! Prakash 2016

  3. Why do we care? • Social collaboration • Information Diffusion • Viral Marketing • Epidemiology and Public Health • Cyber Security • Human mobility • Games and Virtual Worlds • Ecology • ........

  4. Why do we care? (1: Epidemiology) • Dynamical processes over networks: diseases over contact networks • CDC data: visualization of the first 35 tuberculosis (TB) patients and their 1039 contacts [AJPH 2007]

  5. Why do we care? (1: Epidemiology) • Dynamical processes over networks • Each circle is a hospital • ~3000 hospitals • More than 30,000 patients transferred [US-MEDICARE NETWORK 2005] • Problem: Given k units of disinfectant, whom to immunize?

  6. Why do we care? (1: Epidemiology) • [Figure: CURRENT PRACTICE vs. OUR METHOD on the US-MEDICARE NETWORK 2005: ~6x fewer!] • Hospital-acquired infections took 99K+ lives and cost $5B+ (all per year)

  7. Why do we care? (2: Online Diffusion) • > 800m users, ~$1B revenue [WSJ 2010] • ~100m active users • > 50m users

  8. Why do we care? (2: Online Diffusion) • Dynamical processes over networks: social media marketing • [Figure: a celebrity posts "Buy Versace™!" to followers]

  9. Why do we care? (3: To change the world?) • Dynamical processes over networks: social networks and collaborative action

  10. High Impact – Multiple Settings • epidemic outbreaks • products/viruses • transmit s/w patches • Q. How to squash rumors faster? • Q. How do opinions spread? • Q. How to market better?

  11. Research Theme • ANALYSIS: Understanding • POLICY/ACTION: Managing • DATA: Large real-world networks & processes

  12. Research Theme – Social Media • ANALYSIS: # cascades in future? • POLICY/ACTION: How to market better? • DATA: Modeling Tweets spreading

  13. Research Theme – Public Health • ANALYSIS: Will an epidemic happen? • POLICY/ACTION: How to control outbreaks? • DATA: Modeling # patient transfers

  14. In this talk • Applications: Using propagation for _________ • Q1: Syndromic Surveillance • Q2: Memes, Tweets, Blogs • Q3: Summarization & Communities • Large real-world networks & processes

  15. Applications Using propagation for _________ • Q1: Syndromic Surveillance • Q2: Memes, Tweets, Blogs • Q3: General Graph Mining

  16. Surveillance [Chen et al. ICDM 2014] • How to estimate and predict flu trends? • [Figure labels: Surveillance Report; Hospital record; Lab survey; Population survey]

  17. GFT & Twitter • Estimate flu trends using online electronic sources

  18. Flu forecasting • Twitter – a surrogate for flu forecasting? • Google Flu Trends: using keywords to track the flu season • Can we get more specific? • Consider:

  19. “Propagation” ideas • Can we develop better disease surveillance tools by leveraging: • How flu-related information propagates on Twitter • Epidemiological models

  20. Observation 1: States • There are different states in an infection cycle. • SEIR model: 1. Susceptible 2. Exposed 3. Infected 4. Recovered
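
The four SEIR states on slide 20 can be illustrated with a small compartmental simulation. This is a minimal sketch, assuming the standard textbook SEIR equations integrated with forward Euler steps; the function name `simulate_seir` and every parameter value are illustrative assumptions and do not come from the talk.

```python
def simulate_seir(beta, sigma, gamma, s0, e0, i0, r0, steps, dt=0.1):
    """Forward-Euler integration of the standard SEIR equations.

    beta  : transmission rate (S -> E), assumed value
    sigma : rate of leaving the exposed state (E -> I), assumed value
    gamma : recovery rate (I -> R), assumed value
    """
    s, e, i, r = float(s0), float(e0), float(i0), float(r0)
    n = s + e + i + r  # total population, constant in this model
    history = [(s, e, i, r)]
    for _ in range(steps):
        new_exposed = beta * s * i / n * dt
        new_infected = sigma * e * dt
        new_recovered = gamma * i * dt
        s -= new_exposed
        e += new_exposed - new_infected
        i += new_infected - new_recovered
        r += new_recovered
        history.append((s, e, i, r))
    return history


# Hypothetical population of 1000 with 10 initially infected people.
history = simulate_seir(beta=0.5, sigma=0.2, gamma=0.1,
                        s0=990, e0=0, i0=10, r0=0, steps=1000)
final_s, final_e, final_i, final_r = history[-1]
print(round(final_s), round(final_e), round(final_i), round(final_r))
```

Each person moves through the states in one direction only (S, then E, then I, then R), which is the ordering that the hidden flu-state model later in the deck also relies on.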

  21. Observation 2: Ep. & So. Gap (epidemiology vs. social media) • Infection cases drop exponentially in epidemiology (Hethcote 2000) • Keyword mentions drop in a power-law pattern in social media (Matsubara 2012)
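
The gap on slide 21 comes down to how fast the two curves fall. The following sketch compares a generic exponential decay with a generic power-law decay; the rate `0.5` and the exponent `1.5` are assumed values chosen only for illustration.

```python
import math


def exponential_decay(t, rate=0.5):
    """Exponential drop-off, as observed for infection cases (assumed rate)."""
    return math.exp(-rate * t)


def power_law_decay(t, exponent=1.5):
    """Power-law drop-off, as observed for keyword mentions (assumed exponent)."""
    return t ** (-exponent)


for t in (1, 10, 100):
    print(t, exponential_decay(t), power_law_decay(t))
```

For large `t` the power-law curve stays many orders of magnitude above the exponential one, so keyword mentions keep a much heavier tail than case counts do.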

  22. Flu Forecasting • Using a combination of propagation patterns, develop a hidden flu-state topic model • Learn “flu” vocabulary and transition probabilities

  23. Details: HFSTM Model • Hidden Flu-State from Tweet Model (HFSTM) • Each word (w) in a tweet (Oi) can be generated by: • A background topic • Non-flu related topics • State related topics • [Model diagram labels: latent state; initial prob.; transit. switch; transit. prob.; binary non-flu related switch; binary background switch; word distribution]

  24. Details: HFSTM Model • Generating tweets • Generate the state for a tweet: State: [S, E, I] • Generate the topic for a word: Topic: [Background, Non-flu, State] • S: "This restaurant is really good" • E: "The movie was good but it was freezing" • I: "I think I have flu"
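
The two generation steps on slide 24 (a state per tweet, then a topic per word) can be sketched as a toy sampler. This is a simplified illustration and not the authors' model: it omits the transition switch named on slide 23, and every probability, word list, and name below (`generate_tweets`, `params`) is a hypothetical placeholder.

```python
import random

STATES = ["S", "E", "I"]


def sample(dist, rng):
    """Draw one key from a {outcome: probability} dictionary."""
    outcomes = list(dist)
    weights = [dist[o] for o in outcomes]
    return rng.choices(outcomes, weights=weights, k=1)[0]


def generate_tweets(n_tweets, n_words, params, rng):
    """Generate (state, words) pairs for one user's tweet sequence."""
    tweets = []
    state = sample(params["initial"], rng)
    for t in range(n_tweets):
        if t > 0:
            # Step 1: generate the state for this tweet from the previous one.
            state = sample(params["transition"][state], rng)
        words = []
        for _ in range(n_words):
            # Step 2: generate the topic for each word with two binary switches.
            if rng.random() < params["p_background"]:
                topic = params["background"]
            elif rng.random() < params["p_non_flu"]:
                topic = params["non_flu"]
            else:
                topic = params["state_topics"][state]
            words.append(sample(topic, rng))
        tweets.append((state, words))
    return tweets


# All numbers and words below are made up for illustration.
params = {
    "initial": {"S": 0.8, "E": 0.15, "I": 0.05},
    "transition": {
        "S": {"S": 0.8, "E": 0.15, "I": 0.05},
        "E": {"S": 0.1, "E": 0.5, "I": 0.4},
        "I": {"S": 0.3, "E": 0.1, "I": 0.6},
    },
    "p_background": 0.3,
    "p_non_flu": 0.4,
    "background": {"the": 0.5, "is": 0.5},
    "non_flu": {"movie": 0.5, "restaurant": 0.5},
    "state_topics": {
        "S": {"good": 0.6, "great": 0.4},
        "E": {"freezing": 0.5, "tired": 0.5},
        "I": {"flu": 0.6, "fever": 0.4},
    },
}

tweets = generate_tweets(5, 4, params, random.Random(0))
for state, words in tweets:
    print(state, " ".join(words))
```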

  25. Details: Inference • EM-based algorithm: HFSTM-FIT • E-step: • $A_t(i) = P(O_1, O_2, \ldots, O_t, S_t = i)$ • $B_t(i) = P(O_{t+1}, \ldots, O_{T_u} \mid S_t = i)$ • $\gamma_t(i) = P(S_t = i \mid O_u)$ • M-step: • Other parameters such as state transition probabilities, topic distributions, etc. • Parameters learned:
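
The three E-step quantities on slide 25 have the same form as the forward, backward, and posterior terms of a hidden Markov model. The sketch below computes them with the generic forward-backward recursions, assuming the per-tweet likelihoods P(O_t | S_t = i) are already given. It is not HFSTM-FIT itself, which also has to handle the topic switches, and all numbers are hypothetical.

```python
def forward_backward(initial, transition, likelihoods):
    """Generic HMM forward-backward pass.

    initial[i]        = P(S_1 = i)
    transition[i][j]  = P(S_{t+1} = j | S_t = i)
    likelihoods[t][i] = P(O_t | S_t = i), assumed to be precomputed
    Returns (A, B, gamma) with gamma[t][i] = P(S_t = i | O_1..O_T).
    """
    T, K = len(likelihoods), len(initial)
    A = [[0.0] * K for _ in range(T)]
    B = [[1.0] * K for _ in range(T)]

    # Forward pass: A_t(i) = P(O_1..O_t, S_t = i)
    for i in range(K):
        A[0][i] = initial[i] * likelihoods[0][i]
    for t in range(1, T):
        for j in range(K):
            A[t][j] = likelihoods[t][j] * sum(
                A[t - 1][i] * transition[i][j] for i in range(K))

    # Backward pass: B_t(i) = P(O_{t+1}..O_T | S_t = i)
    for t in range(T - 2, -1, -1):
        for i in range(K):
            B[t][i] = sum(
                transition[i][j] * likelihoods[t + 1][j] * B[t + 1][j]
                for j in range(K))

    # Posterior: gamma_t(i) is proportional to A_t(i) * B_t(i)
    gamma = []
    for t in range(T):
        unnormalized = [A[t][i] * B[t][i] for i in range(K)]
        z = sum(unnormalized)
        gamma.append([u / z for u in unnormalized])
    return A, B, gamma


# Hypothetical three-state (S, E, I) example with three tweets.
initial = [0.8, 0.15, 0.05]
transition = [[0.8, 0.15, 0.05],
              [0.1, 0.5, 0.4],
              [0.3, 0.1, 0.6]]
likelihoods = [[0.6, 0.3, 0.1],
               [0.2, 0.5, 0.3],
               [0.1, 0.2, 0.7]]
A, B, gamma = forward_backward(initial, transition, likelihoods)
print(gamma[-1])
```

For long tweet sequences a practical implementation would rescale `A` and `B` at every step, or work in log space, to avoid numerical underflow.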

  26. A possible issue with HFSTM • Suffers from a large, noisy vocabulary. • Semi-supervision for improvement • Introduce weak supervision into HFSTM.

  27. Details: HFSTM-A [Chen et al. DAMI 2015] • HFSTM-A(spect) • Introduce an aspect variable y, expressing our belief about whether a word is flu-related or not. • The value of y biases the switch variables so that flu-related words are more likely to be explained by state topics. When the aspect value (y) is introduced, the switching probabilities are updated accordingly.

  28. Vocabulary & Dataset • Vocabulary (230 words): • Flu-related keyword list by Chakraborty SDM 2014 • Extra state-related keyword list • Dataset (34,000 tweets): • Identify infected users and collect their tweets • Train on data from Jun 20, 2013-Aug 06, 2013 • Test on two time periods: • Dec 01, 2012-July 08, 2013 • Nov 10, 2013-Jan 26, 2014

  29. Learned word distributions • The most probable words learned in each state • Probably healthy: S • Having symptoms: E • Definitely sick: I

  30. Learned state transition • [Figure panels: transition probabilities; transitions in real tweets] • Learned by HFSTM: not directly flu-related, yet correctly identified

  31. Flu trend fitting • Ground-truth: • The Pan American Health Organization (PAHO) • Algorithms: • Baseline: • Count the number of keywords weekly as features, and regress to the ground-truth curve. • Google Flu Trends: • Take the Google Flu Trends data as input, and regress to the PAHO curve. • HFSTM: • Distinguish the different states of keywords, and use only the number of keywords in the I state. Again regress to PAHO.
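
The "count, then regress" recipe on slide 31 can be sketched with a one-feature least-squares fit. This is a minimal illustration, assuming a single weekly feature (the count of I-state keywords). The data are synthetic and constructed to be exactly linear; they are not PAHO figures, and `fit_line` is a hypothetical helper.

```python
def fit_line(x, y):
    """Ordinary least squares for y ~ slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Synthetic weekly counts of I-state keywords (feature) and case counts (target).
i_state_keyword_counts = [12, 30, 45, 80, 60, 25]
reported_cases = [25, 61, 91, 161, 121, 51]

slope, intercept = fit_line(i_state_keyword_counts, reported_cases)
fitted = [slope * c + intercept for c in i_state_keyword_counts]
print(slope, intercept)
```

The baseline and Google Flu Trends variants on the slide differ only in the feature that is fed to the regression: raw keyword counts or the Google Flu Trends series instead of I-state counts.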

  32. Flu trend fitting • Linear regression to the case count reported by PAHO (the ground-truth)

  33. HFSTM-A • Results are qualitatively similar to those of HFSTM when the vocabulary is 10 times larger.

  34. Applications Using propagation for _________ • Q1: Syndromic Surveillance • Q2: Memes, Tweets, Blogs • Q3: General Graph Mining

  35. Memetracking • Meme: a virally transmitted cultural symbol or social idea (term first coined by Richard Dawkins in 1976) • Usually text (a phrase) and/or an image • [Figure: a viral meme from the 2012 Olympics, all the way to the White House]

  36. Patterns • Anomaly • Imputation • Compression • Extrapolation

  37. Google Search Volume • e.g., given (1) the first spike, (2) the release dates of two sequel movies, (3) the access volume before the release date • [Figure labels: (1) First spike; (2) Release date; (3) Two weeks before release; future spikes marked "?"]

  38. Rise and fall patterns in social media • Meme (# of mentions in blogs) • Short phrases sourced from U.S. politics in 2008: “you can put lipstick on a pig”, “yes we can”

  39. Rise and fall patterns in social media • Can we find a unifying model that includes these patterns? • four classes on YouTube [Crane et al. ’08] • six classes on Meme [Yang et al. ’11]

  40. Rise and fall patterns in social media • Answer: YES! • We can represent all patterns with a single model [Matsubara+ SIGKDD 2012]

  41. Main idea - SpikeM • 1. Un-informed bloggers (uninformed about the rumor): time $n = 0$ • 2. External shock at time $n_b$ (e.g., breaking news): time $n = n_b$ • 3. Infection (word-of-mouth), $\beta$: time $n = n_b + 1$ • Infectiveness of a blog post at age $n$: • Strength of infection (quality of news) • Decay function (how infective a blog posting is): power law
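
The three ingredients on slide 41 (a pool of un-informed bloggers, an external shock at time n_b, and word-of-mouth infection with power-law decay) can be turned into a small simulation. This is a sketch reconstructed from the description on the slide and from the published SpikeM formulation as best recalled, without periodicity. The decay exponent of -1.5 follows the slope quoted on slide 42; all parameter values and names are assumptions.

```python
def simulate_spikem(n_total, beta, n_b, s_b, epsilon, n_steps):
    """Simulate a rise-and-fall curve of newly informed bloggers per time tick.

    n_total : number of bloggers, all un-informed at time 0
    beta    : strength of infection (quality of the news)
    n_b     : time tick of the external shock
    s_b     : size of the external shock
    epsilon : background noise added at every tick
    """
    def infectiveness(age):
        # Power-law decay of a post's infectiveness with its age.
        return beta * age ** -1.5

    uninformed = float(n_total)
    newly_informed = [0.0]  # no one is informed at time 0
    for n in range(n_steps - 1):
        exposure = 0.0
        for t in range(n_b, n + 1):
            shock = s_b if t == n_b else 0.0
            exposure += (newly_informed[t] + shock) * infectiveness(n + 1 - t)
        new = uninformed * exposure + epsilon
        new = min(new, uninformed)  # cannot inform more bloggers than remain
        uninformed -= new
        newly_informed.append(new)
    return newly_informed


# Hypothetical setting: 1000 bloggers, shock of size 10 at tick 5, no noise.
curve = simulate_spikem(n_total=1000, beta=0.001, n_b=5, s_b=10.0,
                        epsilon=0.0, n_steps=100)
peak_tick = curve.index(max(curve))
print(peak_tick, round(max(curve), 1), round(curve[-1], 3))
```

The curve is flat before the shock, rises through word-of-mouth, and then falls as the pool of un-informed bloggers is used up, which is the rise-and-fall shape the deck describes.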

  42. Power-law decay with -1.5 slope: J. G. Oliveira et al., Human Dynamics: The Correspondence Patterns of Darwin and Einstein. Nature 437, 1251 (2005) (also in Leskovec, McGlohon+, SDM 2007)

  43. Details: SpikeM - with periodicity • Full equation of SpikeM, including periodicity • Bloggers change their activity over time (e.g., daily, weekly, yearly) • [Figure: activity vs. time n; peak activity at 12pm, low activity at 3am]
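
The periodic change in blogger activity on slide 43 can be modeled as a multiplicative factor between 0 and 1 applied to each time tick. The sketch below uses a sinusoidal factor of the kind used in the published SpikeM model; the exact parameterization (amplitude `p_a`, period `p_p`, phase shift `p_s`) is stated here as an assumption, since the full equation did not survive in the transcript.

```python
import math


def periodicity(n, p_a, p_p, p_s):
    """Activity factor at time tick n, always within [1 - p_a, 1].

    p_a : strength (amplitude) of the periodicity, between 0 and 1
    p_p : period in ticks (e.g., 24 for a daily cycle on hourly data)
    p_s : phase shift in ticks
    """
    return 1.0 - 0.5 * p_a * (math.sin(2.0 * math.pi / p_p * (n + p_s)) + 1.0)


# Hypothetical daily cycle on hourly ticks, with activity dropping by up to 40%.
hourly_factor = [periodicity(n, p_a=0.4, p_p=24, p_s=0) for n in range(48)]
print(min(hourly_factor), max(hourly_factor))
```

Multiplying the number of newly informed bloggers at each tick by this factor produces the daily or weekly ripples on top of the overall rise-and-fall curve.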

  44. Tail-part forecasts • SpikeM can capture the tail part

  45. “What-if” forecasting • e.g., given (1) the first spike, (2) the release dates of two sequel movies, (3) the access volume before the release date • [Figure labels: (1) First spike; (2) Release date; (3) Two weeks before release; future spikes marked "?"]

  46. “What-if” forecasting • SpikeM can forecast not only the tail part, but also the rise part! • SpikeM can forecast upcoming spikes • [Figure labels: (1) First spike; (2) Release date; (3) Two weeks before release]

  47. Bonus: Protest Predictions [Sundereisan et al. ASONAM 2014] [Jin et al. SIGKDD 2014] • Can Twitter provide a lead time? • South American Twitter dataset • Language: Spanish/Portuguese • Idea: • Look for trending keywords. • Predict event type for protest using SpikeM parameters! • [Figure labels: a political tweet; Violent Protest (VP); Non Violent Protest (P)]

  48. Propagation and Cyber-Security: Temporal Patterns [Papalexakis et al. ASONAM 2013] • Looks familiar?

  49. Propagation and Cyber-Security: Ensemble Models [Chan et al. WSDM 2016]

  50. Applications Using propagation for _________ • Q1: Syndromic Surveillance • Q2: Memes, Tweets, Blogs • Q3: General Graph Mining
