
Peta-Graph Mining



Presentation Transcript


  1. Peta-Graph Mining. Christos Faloutsos; Ana Appel, Polo Chau, Jure Leskovec, U Kang, Aditya Prakash, Suyash Shringarpure, Charalampos Tsourakakis. Yahoo/Hadoop, 2008

  2. Our goal: one-stop solution for mining huge graphs

  3. Outline • Datasets: • Synthetic (‘Kronecker’, ~300M nodes, 1B edges) • NetFlix (20K movies, ~500K users, 100M edges)

  4. Degree Distributions - NetFlix. [Plot: count vs. movie in-degree.] 100 machines, 8 min

  5. Degree Distributions - NetFlix. [Plot: count vs. movie in-degree, with a curve labeled "theoretically expected".] 100 machines, 8 min

  6. Degree Distributions - NetFlix. [Plot: count vs. user out-degree.] 100 machines, 8 min

  7. Degree Distributions - NetFlix. [Plot: count vs. user out-degree, with a curve labeled "theoretically expected"; annotation: sharp drop below 100 ratings.] 100 machines, 8 min

  8. Degree Distributions - Kronecker. [Plot: count vs. degree.] Nodes: 259M, edges: 1B. 100 machines, 6 h

  9. Degree Distributions - timings. [Plot: time (sec) vs. edge file size (MB), for 1, 24, and 48 tasks.]
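
The degree-distribution slides above boil down to two counting passes: first count edges per node, then count nodes per degree. The following is a minimal single-machine sketch of that idea in Python, not the Hadoop job used in the talk. The function name `degree_distribution` and the tiny `(user, movie)` edge list are hypothetical, chosen only for illustration.

```python
from collections import Counter

def degree_distribution(edges):
    """Return (out_dist, in_dist), each mapping degree -> number of nodes.

    edges: iterable of (source, target) pairs, e.g. (user, movie) ratings.
    """
    edges = list(edges)  # the edge list is scanned twice
    # Pass 1: degree of every node (analogous to a map/reduce keyed by node).
    out_deg = Counter(src for src, _ in edges)
    in_deg = Counter(dst for _, dst in edges)
    # Pass 2: how many nodes have each degree (keyed by degree).
    return Counter(out_deg.values()), Counter(in_deg.values())

# Hypothetical toy data: three users rating two movies.
edges = [("u1", "m1"), ("u1", "m2"), ("u2", "m1"), ("u3", "m1")]
out_dist, in_dist = degree_distribution(edges)
print(dict(out_dist))  # {2: 1, 1: 2}: one user with 2 ratings, two with 1
print(dict(in_dist))   # {3: 1, 1: 1}: one movie with 3 ratings, one with 1
```

Both passes are plain group-and-count operations, which is presumably why this analysis parallelizes well across tasks.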

  10. Outline • Datasets: • Synthetic (‘Kronecker’, ~300M nodes, 1B edges) • NetFlix (20K movies, ~500K users, 100M edges)

  11. Diameter • Diameter of a graph: the maximum shortest-path length over all node pairs • Normally, > O(N**2) • ANF: ‘Approximate Neighborhood Function’ [Palmer+02]: O(E) • Goal: calculate the neighborhood function • Neighborhood N(h): number of pairs of nodes within distance h

  12. Diameter. [Plot: time (min) vs. edge file size (MB), for 1, 28, and 48 nodes.] • For large jobs, parallelization helps • Unstable results due to shared machines

  13. Diameter / Hop Plot (Netflix). [Plot: # of reachable pairs within <= h hops vs. h, the # of hops.]

  14. Diameter / Hop Plot (Netflix). [Plot: # of reachable pairs within <= h hops vs. h, the # of hops.] Diameter: 3
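
To make the neighborhood function N(h) and the hop plot concrete, here is a small exact computation using breadth-first search from every node. This is not ANF: ANF approximates the same quantity with probabilistic counting in O(E), whereas this sketch is exact and only suitable for tiny graphs. The function name `neighborhood_function`, the convention of counting ordered pairs of distinct nodes, and the 4-node path graph are assumptions made for illustration.

```python
from collections import Counter, deque

def neighborhood_function(adj):
    """Return a list N where N[h] = number of ordered pairs (u, v), u != v,
    with shortest-path distance <= h.

    adj: dict mapping each node to an iterable of its neighbors.
    """
    exact = Counter()  # exact[d] = number of ordered pairs at distance exactly d
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:  # standard BFS from `source`
            u = queue.popleft()
            for v in adj.get(u, ()):
                if v not in dist:
                    dist[v] = dist[u] + 1
                    exact[dist[v]] += 1
                    queue.append(v)
    max_d = max(exact, default=0)
    N = [0] * (max_d + 1)
    for h in range(1, max_d + 1):
        N[h] = N[h - 1] + exact[h]  # cumulative count gives the hop plot
    return N

# Hypothetical toy graph: the undirected path a - b - c - d.
path = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
N = neighborhood_function(path)
print(N)           # [0, 6, 10, 12]
print(len(N) - 1)  # 3: N(h) stops growing at h = 3, the diameter
```

The hop plot is simply N(h) plotted against h; the diameter is the smallest h at which the curve flattens.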

  15. Outline • Datasets: • Synthetic (‘Kronecker’, ~300M nodes, 1B edges) • NetFlix (20K movies, ~500K users, 100M edges)

  16. Community detection • Cross-associations [Chakrabarti+ ’04]

  17. Community detection

  18. Outline • Datasets: • Synthetic (‘Kronecker’, ~300M nodes, 1B edges) • NetFlix (20K movies, ~500K users, 100M edges)

  19. Triangles • ‘friends of friends are friends’

  20. Triangles • ‘friends of friends are friends’

  21. Triangles • ‘friends of friends are friends’ • Naïve algorithm: 3-way join (slow) • [Tsourakakis ’08]: # triangles ~ sum of cubes of eigenvalues • Thus, super-fast computation of # triangles (100x - 25,000x faster than naïve; >95% accuracy)

  22. Triangles • Easy to implement on Hadoop: it only needs eigenvalues (to do, with Lanczos)
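
The eigenvalue claim rests on an exact identity for an undirected simple graph with adjacency matrix A: the number of triangles equals trace(A^3) / 6, and trace(A^3) equals the sum of the cubes of the eigenvalues of A. The sketch below checks the trace side of that identity against naïve enumeration in pure Python; it does not compute eigenvalues, and it is not the Hadoop or Lanczos implementation mentioned in the slides. The function names and the toy graph are hypothetical.

```python
from itertools import combinations

def triangles_naive(adj):
    """Count triangles by testing every unordered triple of nodes."""
    return sum(
        1
        for a, b, c in combinations(sorted(adj), 3)
        if b in adj[a] and c in adj[b] and a in adj[c]
    )

def triangles_trace(adj):
    """Count triangles as trace(A^3) / 6.

    trace(A^3) is the number of closed walks u -> v -> w -> u; each triangle
    yields 6 of them (3 starting nodes x 2 directions). The same trace equals
    the sum of the cubes of the eigenvalues of A.
    """
    closed_walks = sum(
        1 for u in adj for v in adj[u] for w in adj[v] if u in adj[w]
    )
    return closed_walks // 6

# Hypothetical toy graph: a 4-clique on nodes 0..3 plus a pendant node 4.
# adj must be symmetric (undirected) and have no self-loops.
adj = {
    0: {1, 2, 3},
    1: {0, 2, 3},
    2: {0, 1, 3},
    3: {0, 1, 2, 4},
    4: {3},
}
print(triangles_naive(adj))  # 4
print(triangles_trace(adj))  # 4
```

Summing the cubes of all eigenvalues reproduces the exact count. An approximation in the spirit of the slide would presumably keep only a few eigenvalues of largest magnitude, which an iterative method such as Lanczos can supply; that reading is an assumption here, not something the slides spell out.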

  23. Outline • Datasets: • Synthetic (‘Kronecker’, ~300M nodes, 1B edges) • NetFlix (20K movies, ~500K users, 100M edges)

  24. Visualization • Principled visualization of large graphs (show only the few most ‘important’ edges)

  25. Summary • Goal: one-stop solution for mining huge graphs
