
Mining Billion Node Graphs



Presentation Transcript


  1. Mining Billion Node Graphs Christos Faloutsos CMU

  2. CONGRATULATIONS! Welcome to CMU! C. Faloutsos

  3. Outline • Q+A • Problem definition / Motivation • Graphs and power laws • Streams, environment, data center monitoring • Conclusions

  4. Q+A • Are you recruiting? How many? • How many do you have? • How frequently do you meet them? • What is your advising style? • How do you feel about summer internships?

  5. Q+A • Are you recruiting? How many? Yes, 1-2 • How many do you have? 5+2 • How frequently do you meet them? 1/week • What is your advising style? • How do you feel about summer internships? Yes/Maybe (Y!, G, MSR, IBM, ++)

  6. Outline • Problem definition / Motivation • Graphs and power laws • Patterns and anomalies • Scalability and ‘hadoop’ • Influence/ virus propagation • Streams, environment, data center monitoring • Conclusions

  7. Motivation • Data mining: ~ find patterns (rules, outliers) • What do real graphs look like? Anomalies? • Virus/influence propagation • Time series / environmental monitoring [Figure: temperature in datacenter]

  8. Graphs - why should we care?

  9. Graphs - why should we care? • Friendship Network [Moody ’01]

  10. Graphs - why should we care? • Food Web [Martinez ’91] • Friendship Network [Moody ’01] • Internet Map [lumeta.com]

  11. Problem #1 - network and graph mining • What does the Internet look like? • What does Facebook look like? • What is ‘normal’/‘abnormal’? • Which patterns/laws hold? • To spot anomalies (rarities), we have to discover patterns • Large datasets reveal patterns/anomalies that may be invisible otherwise…

  12. Graph mining • Are real graphs random?

  13. Laws and patterns • Are real graphs random? NO!! • Diameter • In- and out-degree distributions • Other (surprising) patterns

  14. Outline • Problem definition / Motivation • Graphs and power laws • Patterns and anomalies • Scalability and ‘hadoop’ • Influence/ virus propagation • Streams, environment, data center monitoring • Conclusions

  15. S1 – degree distributions • Q: avg degree is ~3 - what is the most probable degree? [Plot: count vs. degree, with ‘??’ at degree 3]

  16. S1 – degree distributions • Q: avg degree is ~3 - what is the most probable degree? • A peak at degree 3? WRONG! [Plots: count vs. degree]

  17. Solution: The plot is linear in log-log scale [FFF’99] • freq = degree^(-2.15) • Exponent = slope, O = -2.15 [Plot: frequency vs. outdegree, Nov ’97]
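The log-log observation on this slide can be tried on a toy example. The sketch below is only an illustration, not the procedure used in [FFF’99]: it builds a synthetic degree list whose counts follow degree^(-2.15) (the constant 10000 and the degree range 1..20 are arbitrary assumptions), fits a least-squares line to log(count) versus log(degree), and reports the most frequent degree. The helper names `fit_loglog_slope` and `most_probable_degree` are hypothetical.

```python
import math
from collections import Counter

def fit_loglog_slope(degrees):
    """Least-squares slope of log(count) versus log(degree)."""
    counts = Counter(d for d in degrees if d > 0)
    xs = [math.log(d) for d in counts]
    ys = [math.log(c) for c in counts.values()]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

def most_probable_degree(degrees):
    """Degree value that occurs most often."""
    return Counter(degrees).most_common(1)[0][0]

# Synthetic data (an assumption for illustration): count(d) ~ 10000 * d^(-2.15).
degrees = []
for d in range(1, 21):
    degrees.extend([d] * round(10000 * d ** -2.15))

slope = fit_loglog_slope(degrees)
mode = most_probable_degree(degrees)
print(f"fitted slope: {slope:.2f}, most probable degree: {mode}")
```

On such skewed data the most frequent degree is the smallest one, which is why a guess centred on the average degree is misleading.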

  18. Solution #S.2: Triangle ‘Laws’ • Real social networks have a lot of triangles

  19. Solution #S.2: Triangle ‘Laws’ • Real social networks have a lot of triangles • Friends of friends are friends • Any patterns?

  20. Triangle Law: #S.2 [Tsourakakis ICDM 2008] • n friends -> ???? triangles [Plot: Reuters; X-axis: degree, Y-axis: mean # triangles]

  21. Triangle Law: #S.2 [Tsourakakis ICDM 2008] • n friends -> ~n^1.6 triangles [Plots: Reuters, SN, Epinions; X-axis: degree, Y-axis: mean # triangles]
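The quantity on the Y-axis (mean number of triangles per node of a given degree) can be computed directly on a small graph. The sketch below is a brute-force illustration under the assumption of an undirected graph given as an edge list; it is not the counting method of [Tsourakakis ICDM 2008], and the function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def triangles_per_node(edges):
    """Return {node: (degree, number of triangles through node)}."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:                       # ignore self-loops
            adj[u].add(v)
            adj[v].add(u)
    result = {}
    for node, nbrs in adj.items():
        # A triangle through `node` is a pair of its neighbours that are adjacent.
        tri = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
        result[node] = (len(nbrs), tri)
    return result

def mean_triangles_by_degree(edges):
    """Return {degree: mean triangle count over nodes of that degree}."""
    buckets = defaultdict(list)
    for degree, tri in triangles_per_node(edges).values():
        buckets[degree].append(tri)
    return {deg: sum(ts) / len(ts) for deg, ts in buckets.items()}

# Toy graph (an assumption): a 4-clique on nodes 0..3 plus a pendant node 4.
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (0, 4)]
by_degree = mean_triangles_by_degree(edges)
print(by_degree)
```

Plotting these means against degree on log-log axes for a real social network is what would reveal an exponent such as the ~1.6 quoted on the slide.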

  22. Triangle counting for large graphs? • Anomalous nodes in Twitter (~3 billion edges) [U Kang, Brendan Meeder, +, PAKDD’11]

  23. Triangle counting for large graphs? • Anomalous nodes in Twitter (~3 billion edges) [U Kang, Brendan Meeder, +, PAKDD’11]

  24. Triangle counting for large graphs? • Anomalous nodes in Twitter (~3 billion edges) [U Kang, Brendan Meeder, +, PAKDD’11]

  25. But: • Q1: How about graphs from other domains? • Q2: How about temporal evolution?

  26. Time evolution • with Jure Leskovec (CMU -> Stanford) • and Jon Kleinberg (Cornell) (‘best paper’ KDD’05)

  27. T1 - Evolution of the Diameter • Prior work on power-law graphs hints at slowly growing diameter: • diameter ~ O(log N) • diameter ~ O(log log N) • What is happening in real data?

  28. T1 - Evolution of the Diameter • Prior work on power-law graphs hints at slowly growing diameter: • diameter ~ O(log N) • diameter ~ O(log log N) • What is happening in real data? • Diameter shrinks over time • As the network grows, the distances between nodes slowly decrease

  29. Diameter – ArXiv citation graph • Citations among physics papers • 1992–2003 • One graph per year [Plot: diameter vs. time in years]

  30. Diameter – “Patents” • Patent citation network • 25 years of data [Plot: diameter vs. time in years]
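To track a diameter over yearly snapshots, one needs a diameter measure that can be recomputed per snapshot. The sketch below assumes the "effective diameter" definition (the smallest hop count within which 90% of connected node pairs can reach each other) and computes it exactly with breadth-first search on a small undirected graph. The 90% threshold, the function names, and the two toy snapshots are assumptions for illustration; the slides do not state how the plotted diameters were computed.

```python
from collections import deque

def hop_counts(edges):
    """Sorted list of shortest-path lengths over all connected ordered pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    dists = []
    for source in adj:
        seen = {source: 0}
        queue = deque([source])
        while queue:                      # plain BFS from `source`
            node = queue.popleft()
            for nb in adj[node]:
                if nb not in seen:
                    seen[nb] = seen[node] + 1
                    queue.append(nb)
        dists.extend(d for d in seen.values() if d > 0)
    return sorted(dists)

def effective_diameter(edges, num=9, den=10):
    """Smallest h such that at least num/den of connected pairs are within h hops."""
    dists = hop_counts(edges)
    if not dists:
        return 0
    needed = -(-len(dists) * num // den)  # ceiling of len * num / den
    return dists[needed - 1]

# Two toy snapshots (assumptions): a path, then the same nodes with extra edges.
sparse = [(0, 1), (1, 2), (2, 3), (3, 4)]
denser = sparse + [(0, 4), (0, 2)]
print(effective_diameter(sparse), effective_diameter(denser))
```

Running all-pairs BFS like this is only feasible for small graphs, which is part of the motivation for the scalable approach discussed later in the deck.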

  31. And many more patterns… • # nodes vs # edges (power law(!)) • # connected components (power law, too) • Contact/phone-call duration (log-logistic) • Total node weight vs # edges (super-linear/power law) • …

  32. Outline • Problem definition / Motivation • Graphs and power laws • Patterns and anomalies • Scalability and ‘hadoop’ • Influence/ virus propagation • Streams, environment, data center monitoring • Conclusions

  33. eBay fraud detection • w/ Polo Chau & Shashank Pandit, CMU [WWW’07]

  34. eBay fraud detection

  35. eBay fraud detection

  36. eBay fraud detection - NetProbe

  37. Popular press • And less desirable attention: e-mail from ‘Belgium police’ (‘copy of your code?’)

  38. Outline • Problem definition / Motivation • Graphs and power laws • Patterns and anomalies • Scalability and ‘hadoop’ • Influence/ virus propagation • Streams, environment, data center monitoring • Conclusions

  39. Scalability • Google: > 450,000 processors in clusters of ~2000 processors each [Barroso, Dean, Hölzle, “Web Search for a Planet: The Google Cluster Architecture” IEEE Micro 2003] • Yahoo: 5 PB of data [Fayyad, KDD’07] • Problem: machine failures, on a daily basis • How to parallelize data mining tasks, then? • A: map/reduce – hadoop (open-source clone) http://hadoop.apache.org/

  40. Details [Diagram: map/reduce data flow. The user program forks a master and workers; the master assigns map and reduce tasks; mappers read input splits (Split0, Split1, Split2) of the input data on HDFS and write locally; reducers do remote reads, sort, and write the output files (Output File0, Output File1)] • By default: 3-way replication • Late/dead machines: ignored, transparently (!)
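The map, shuffle/sort, and reduce stages in the diagram can be mimicked in a single process to see how a graph-mining task decomposes. The sketch below is plain Python and does not use the Hadoop API; it counts out-degrees from an edge list that is pre-divided into splits. The function names and the toy splits are assumptions for illustration.

```python
from collections import defaultdict

def map_edge(edge):
    """Map step: emit (source, 1) for every edge."""
    src, _dst = edge
    yield (src, 1)

def shuffle(pairs):
    """Shuffle step: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_counts(key, values):
    """Reduce step: sum the values of one key."""
    return (key, sum(values))

def run_job(splits):
    intermediate = []
    for split in splits:          # on a cluster, each split goes to its own mapper
        for edge in split:
            intermediate.extend(map_edge(edge))
    grouped = shuffle(intermediate)
    return dict(reduce_counts(k, v) for k, v in grouped.items())

# Toy input (an assumption): three splits of a directed edge list.
splits = [[("a", "b"), ("a", "c")], [("b", "c")], [("a", "d"), ("c", "a")]]
out_degree = run_job(splits)
print(out_degree)
```

Because each mapper only needs its own split and each reducer only its own keys, a failed or late machine can be handled by rerunning just that piece, which is what makes the model attractive when failures are a daily event.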

  41. HADI for diameter estimation • “Radius Plots for Mining Tera-byte Scale Graphs”, U Kang, Charalampos Tsourakakis, Ana Paula Appel, Christos Faloutsos, Jure Leskovec, SDM’10 • Naively: diameter needs O(N**2) space and up to O(N**3) time – prohibitive (N~1B)

  42. HADI for diameter estimation • “Radius Plots for Mining Tera-byte Scale Graphs”, U Kang, Charalampos Tsourakakis, Ana Paula Appel, Christos Faloutsos, Jure Leskovec, SDM’10 • Naively: diameter needs O(N**2) space and up to O(N**3) time – prohibitive (N~1B) • Our HADI: linear on E (~10B) • Near-linear scalability wrt # machines • Several optimizations -> 5x faster
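One way to see why an edge-linear pass can replace all-pairs shortest paths: every node keeps the set of nodes it can reach within h hops and, in each round, merges in its neighbours' sets; a node's radius is the last round in which its set grew. The sketch below uses exact Python sets, so its memory use is still quadratic in the worst case. It is an illustration of the propagation idea only, not the HADI implementation, which to the best of my understanding replaces the exact sets with small probabilistic counters and runs each round as a map/reduce job. The function name and the toy graph are assumptions.

```python
def radii_by_propagation(edges):
    """Return {node: radius} for an undirected graph, by iterative set merging."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    reach = {n: {n} for n in adj}     # nodes reachable within 0 hops
    radius = {n: 0 for n in adj}
    hop = 0
    changed = True
    while changed:
        changed = False
        hop += 1
        new_reach = {}
        for n in adj:                 # one round touches every edge once per direction
            merged = set(reach[n])
            for nb in adj[n]:
                merged |= reach[nb]
            if len(merged) > len(reach[n]):
                radius[n] = hop       # the set grew, so some node is exactly `hop` away
                changed = True
            new_reach[n] = merged
        reach = new_reach
    return radius

# Toy graph (an assumption): a path on five nodes.
radius = radii_by_propagation([(0, 1), (1, 2), (2, 3), (3, 4)])
diameter = max(radius.values())
print(radius, diameter)
```

The histogram of these per-node radii is the kind of "radius plot" shown on the following slides.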

  43. ???? • 19+ [Barabasi+], ~1999, ~1M nodes [Plot: count vs. radius]

  44. ???? • 19+ [Barabasi+], ~1999, ~1M nodes • ?? for the YahooWeb graph (120 GB, 1.4B nodes, 6.6B edges) • Largest publicly available graph ever studied. [Plot: count vs. radius]

  45. ???? • 14 (dir.), ~7 (undir.) • 19+? [Barabasi+] • YahooWeb graph (120 GB, 1.4B nodes, 6.6B edges) • Largest publicly available graph ever studied. [Plot: count vs. radius]

  46. ???? • 14 (dir.), ~7 (undir.) • 19+? [Barabasi+] • YahooWeb graph (120 GB, 1.4B nodes, 6.6B edges) • 7 degrees of separation (!) • Diameter: shrunk [Plot: count vs. radius]

  47. ???? • ~7 (undir.) • YahooWeb graph (120 GB, 1.4B nodes, 6.6B edges) • Q: Shape? [Plot: count vs. radius]

  48. YahooWeb graph (120 GB, 1.4B nodes, 6.6B edges) • Effective diameter: surprisingly small. • Multi-modality (?!)

  49. Radius Plot of GCC of YahooWeb.

  50. YahooWeb graph (120 GB, 1.4B nodes, 6.6B edges) • Effective diameter: surprisingly small. • Multi-modality: probably a mixture of cores.
