
EDA with Graphs

EDA with Graphs. Chris Volinsky Shannon Laboratory AT&T Labs-Research Workshop on Statistical Inference, Computing and Visualization for Graphs Stanford University August 2, 2003. Introduction. Some suggestions about looking at graphs Our way of analyzing graphs: COI


Presentation Transcript


  1. EDA with Graphs Chris Volinsky Shannon Laboratory AT&T Labs-Research Workshop on Statistical Inference, Computing and Visualization for Graphs Stanford University August 2, 2003

  2. Introduction • Some suggestions about looking at graphs • Our way of analyzing graphs: COI • Two motivating examples • Challenges for the room • Main point: sometimes EDA is all you need!

  3. Preaching to the choir… • Visualize, even when you can’t • Speech example • Learn a little graph theory, even if you don’t want to • Expand your toolbox with: • bridges • cutpoints • centroids • pseudo cliques • strongly connected components • Etc. • Look at node and edge variables, even if they are not there • Variables induced by the graph itself are often useful (in-out degree, centrality, boundary)
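As a small illustration of "variables induced by the graph itself", the sketch below derives in-degree and out-degree from a plain directed edge list. The function name `degree_variables` and the toy edge list are hypothetical; the slides do not show how these variables were actually computed.

```python
from collections import Counter


def degree_variables(edges):
    """Return (in_degree, out_degree) counters for a directed edge list."""
    in_deg, out_deg = Counter(), Counter()
    for src, dst in edges:
        out_deg[src] += 1  # one more edge leaving src
        in_deg[dst] += 1   # one more edge arriving at dst
    return in_deg, out_deg


# Toy graph (made up for illustration): a -> b, a -> c, b -> c, c -> a
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a")]
in_deg, out_deg = degree_variables(edges)
```

Even when the data carry no node attributes, counts like these can be attached to each node and explored like any other variable.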

  4. Our data • Huge! Hundreds of millions of nodes and edges, mostly connected • Modelling, or even EDA, on the entire graph may not be possible • COI – Communities of Interest are one way of analyzing these data • Storage - Break it down • Analysis – Build up from signatures • Updating - Through time via exponential smoothing

  5. Storage - Break it down • Consider the atomic units of the graph, which we call a COI signature: • For every node in the graph, store • Top k numbers inbound • Top k numbers outbound • Weights on each edge • overflow bin • In short, we are storing a huge graph as many little graphs, which are easily accessible (via indexed storage) for analysis.
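A minimal sketch of what one COI signature might look like in memory, assuming the overflow bin simply accumulates the weight of every edge that falls outside the top k. The class name `COISignature`, the helper names, and that reading of "overflow bin" are assumptions; the slide lists only the stored fields.

```python
from dataclasses import dataclass, field


@dataclass
class COISignature:
    """Per-node signature: top-k inbound/outbound edges plus overflow."""
    k: int
    inbound: dict = field(default_factory=dict)    # caller -> weight
    outbound: dict = field(default_factory=dict)   # callee -> weight
    overflow_in: float = 0.0    # assumed: total weight of dropped inbound edges
    overflow_out: float = 0.0   # assumed: total weight of dropped outbound edges


def truncate_top_k(weights, k):
    """Keep the k heaviest edges; return (kept_edges, overflow_weight)."""
    ranked = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
    kept = dict(ranked[:k])
    overflow = sum(w for _, w in ranked[k:])
    return kept, overflow


def build_signature(in_weights, out_weights, k):
    """Build a signature for one node from its full edge-weight maps."""
    inbound, over_in = truncate_top_k(in_weights, k)
    outbound, over_out = truncate_top_k(out_weights, k)
    return COISignature(k, inbound, outbound, over_in, over_out)


# Toy example with k = 2
sig = build_signature({"x": 5, "y": 1, "z": 3}, {"p": 2}, k=2)
```

Keyed by node, a collection of such records is the "many little graphs" view of the full graph.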

  6. Analysis – Build up from signatures • Fraud – we build signatures • When, how long, but not to whom • We use the COI signature to build a Community of Interest for everyone, and then use that for analysis • Example • Communities are everywhere (e.g. Amazon), but representing (and visualizing) as a graph gives a lot of insight.

  7. Updating through time • Our graph is dynamic • 3M new/old numbers per week! • We use an exponentially weighted moving average as a way to smoothly update through time…
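The update can be sketched as a standard exponentially weighted moving average applied edge by edge. The exact form used in production is not given in the slides; the version below assumes new weight = q × old weight + (1 − q) × this period's activity, with the parameter named `q` to match the "Selecting q" slide. The function name `ewma_update` is hypothetical.

```python
def ewma_update(old, new, q):
    """Blend stored edge weights with the latest period's activity.

    old, new: dicts mapping neighbor -> weight.
    q: smoothing parameter in [0, 1]; larger q keeps more history.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must lie in [0, 1]")
    neighbors = set(old) | set(new)
    return {
        n: q * old.get(n, 0.0) + (1.0 - q) * new.get(n, 0.0)
        for n in neighbors
    }


# Toy example: "x" was active before but silent this period; "y" is new.
updated = ewma_update({"x": 10.0}, {"x": 0.0, "y": 4.0}, q=0.5)
```

Under this assumed form, edges that stop appearing decay geometrically rather than vanishing at once, and new edges enter with a damped weight.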

  8. Two motivating examples • Two examples where looking at local network behavior via COI helped answer the questions of interest, without modeling • Viral Marketing • Fraud

  9. Viral Marketing plans • Viral Marketing – let your customers sell for you • COI was the perfect tool to throw at this…by capturing the local neighborhood of the enrollees, we can test the viral hypothesis • We can also track through time • What did we do? • For the enrollees, find the induced subgraph from their COI • Look at a control group
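One possible reading of "find the induced subgraph from their COI" is: keep only the edges whose two endpoints are both enrollees, then look at how the enrollees cluster. The sketch below follows that reading and uses connected components as clusters; the function names and toy data are hypothetical.

```python
from collections import defaultdict


def induced_subgraph(edges, nodes):
    """Keep only edges with both endpoints in `nodes`."""
    keep = set(nodes)
    return [(u, v) for u, v in edges if u in keep and v in keep]


def connected_components(nodes, edges):
    """Return a list of node sets, treating edges as undirected."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, components = set(), []
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        stack, comp = [start], set()
        while stack:  # iterative depth-first search
            n = stack.pop()
            comp.add(n)
            for m in adj[n]:
                if m not in seen:
                    seen.add(m)
                    stack.append(m)
        components.append(comp)
    return components


# Toy data: "x" is not an enrollee, so edges touching it are dropped.
edges = [("a", "b"), ("b", "x"), ("c", "d"), ("e", "x")]
enrollees = ["a", "b", "c", "d", "e"]
sub = induced_subgraph(edges, enrollees)
clusters = connected_components(enrollees, sub)
```

Running the same steps on a control group of non-enrollees gives a baseline for how much clustering to expect by chance.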

  10. Cluster results… Let’s look at some…

  11. What’s up with the big cluster?

  12. RDD: Repetitive Debtors Database • Lots of people can’t pay their bill, but they want phone service anyway:

  13. RDD Process [Diagram labels: Connect pool (30 Days); T restricts] • A big matching problem… • Every day • we get restricted TNs, 4K / day • we get connected TNs, 40K / day • Look over a 30-day period (possible 4B comparisons!) • Compare the COI graphs of the disconnected number and the new number… • We need a metric for graph distance

  14. Matching Strategy [Diagram labels: TN-1 Connect; TN-2 Restrict; TN-3; TN-4 Connect; TN-5] • We use a combination of: • Intersection > 2 (to pare down) • Name/address overlap (to weed out) • $$ owed (to prioritize) • Here’s where modeling could help…or maybe not
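The "intersection > 2" filter can be sketched without comparing every restricted TN to every connected TN. The version below builds an inverted index from COI member to connected TNs, so only pairs that share at least one member are ever counted. The function name, the data layout (TN mapped to a set of COI members), and the use of an inverted index are assumptions; the slides do not say how the comparison was implemented, and the name/address and amount-owed steps are omitted.

```python
from collections import defaultdict


def candidate_matches(restricted, connected, threshold=2):
    """Pair restricted TNs with connected TNs whose COI overlap > threshold.

    restricted, connected: dicts mapping TN -> set of COI members.
    Returns (restricted_tn, connected_tn, overlap) tuples, largest first.
    """
    # Inverted index: COI member -> connected TNs whose COI contains it.
    index = defaultdict(set)
    for tn, coi in connected.items():
        for member in coi:
            index[member].add(tn)

    matches = []
    for r_tn, r_coi in restricted.items():
        overlap = defaultdict(int)
        for member in r_coi:
            for c_tn in index.get(member, ()):
                overlap[c_tn] += 1  # one shared COI member
        for c_tn, n in overlap.items():
            if n > threshold:
                matches.append((r_tn, c_tn, n))
    return sorted(matches, key=lambda m: -m[2])


# Toy data using the TN labels from the slide's diagram
restricted = {"TN-2": {"a", "b", "c", "d"}}
connected = {"TN-1": {"a", "b", "c"}, "TN-4": {"a", "z"}}
pairs = candidate_matches(restricted, connected)
```

Here TN-1 shares three COI members with the restricted TN-2 and survives the filter, while TN-4 shares only one and is pared away.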

  15. Wrap up • Viral Marketing • Used connected components of reduced data as ‘clusters’ • Looked for ‘centers’ of clusters for retention • Visualized clusters for understanding • Used boundary to predict new customers • COI was the best predictive variable in a marketing study • Fraud • Attacked massive matching via simple measures of distance • Fraud reps use visualized clusters to work cases • We detected RDD with an 80% success rate • Is this EDA?

  16. Challenges • Viewing graphs through time • What if I don’t know what is coming next? • Graph distance metrics • What does “distance between graphs” mean? • Tools for looking at many graphs • what do union and intersection mean? • Modelling and EDA go hand in hand • Viral marketing models define network value, feed this into graph to do EDA….

  17. An answer for Duncan… • What do I want and who is going to do it? • Tools that combine: • Interactive capability • Graph operations • Statistical analysis • It’s happening • It’s great!! • It’s a little confusing • This model works for me…do you agree?

  18. What I want… • powerful ways to do union/intersection • unclear actually what that means • statistical measures of distances between graphs: what is the metric of interest, really? • use variables on nodes and edges to easily define new graphs, and automatically point me towards the interesting ones (largest, densest) • standard tools for finding graph-theoretic concepts like cliques, pseudo cliques, density, bridge edges, boundary • ability to visualize the temporal component of graphs: is there a paradigm other than plotting the ubergraph?

  19. Points to make • if each TN is a graph, and we are looking for similar graphs, we could be doing millions or billions of these comparisons…SNA stuff is great, but it doesn’t really work! • sometimes EDA is the answer: it is the best we can do, or perhaps it is sufficient for the user. • think graphs, and plot it! Even if you can’t plot the whole thing, plot some of it (do speech example…) • “network value” might be important; this might not be the same as density: it may be a sunburst, which is not a high-density subgraph, or highest value, and it may depend on time • Modelling can be great: find pseudo edges, use latent space models, etc.

  20. Visualize, even when you can’t • always a way to subset or threshold, or something • Speech example • learn some graph theoretics • bridge nodes/edges • Density, definitions of cliques and pseudo cliques • DFS/BFS, minimal spanning trees… • Strongly connected components • subset

  21. Storing COI Signatures • COI signatures are stored in Hancock, a C-based domain-specific language designed for large amounts of signature-type data (Rogers, Fisher, et al.) • Indexed by TN, so it is easy and fast to get the COI for large lists of TNs, and to use spiders for recursion. • e.g. cycling over all TNs to learn something about our customer base takes minutes. We could never do this before!

  22. Informative overlap score [Diagram: nodes A, B, O, Z with edge labels wao, wob, wo] • Calculate the “informative overlap” score [equation not captured in transcript] • Where: • wao = weight of edge from a to o • wob = weight of edge from o to b • wo = sum of the weights of edges to o • dao, dob are the graph distances from a and b to o

  23. Selecting q • Calls fade out over time • The larger q is, the longer a call has non-negligible weight
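To make the effect of q concrete, the sketch below assumes the usual exponentially weighted moving average form (new weight = q × old weight + (1 − q) × current activity), which the slides name but do not write out. Under that assumption a call made n periods ago contributes weight (1 − q)·qⁿ, and its contribution halves every ln(0.5)/ln(q) periods. Both function names are hypothetical.

```python
import math


def call_weight(q, periods_ago):
    """Weight still carried by one unit of activity from `periods_ago` periods back."""
    return (1.0 - q) * q ** periods_ago


def half_life(q):
    """Number of periods for a call's weight to halve; requires 0 < q < 1."""
    if not 0.0 < q < 1.0:
        raise ValueError("q must lie strictly between 0 and 1")
    return math.log(0.5) / math.log(q)


# Compare a fast-forgetting and a slow-forgetting choice of q.
fast, slow = half_life(0.5), half_life(0.9)
```

With q = 0.5 a call loses half its weight every period, while with q = 0.9 it takes between six and seven periods, which is the sense in which a larger q keeps calls "alive" longer.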
