Counting Triangles and the Curse of the Last Reducer

Counting Triangles and the Curse of the Last Reducer. Siddharth Suri, Sergei Vassilvitskii, Yahoo! Research. Presentation: Nikos Stasinopoulos. The Social Network. Using Clustering Coefficient to identify cliques.


Presentation Transcript


  1. Counting Triangles and the Curse of the Last Reducer. Siddharth Suri, Sergei Vassilvitskii, Yahoo! Research. Presentation: Nikos Stasinopoulos

  2. The Social Network

  3. Using Clustering Coefficient to identify cliques • In social networks (SN), nodes tend to cluster together (Holland and Leinhardt, 1971; Watts and Strogatz, 1998). • Assume an undirected graph G = (V, E); Γ(v) is v’s neighborhood and d_v = |Γ(v)| its degree. • The Clustering Coefficient of v is the fraction of pairs of v’s neighbors which are neighbors themselves: cc(v) = |{(u, w) ∈ E : u, w ∈ Γ(v)}| / (d_v (d_v − 1) / 2)

  4. Calculating CC by counting edges: cc(A) = 1, cc(B) = 1, cc(C) = 1/3, cc(D) = N/A

  5. Calculating CC by counting triangles: cc(A) = 1, cc(B) = 1, cc(C) = 1/3, cc(D) = N/A. Again,…
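The slide figure is not preserved in the transcript, so the following is only a minimal sketch. It assumes a four-node graph with edges A-B, A-C, B-C and C-D, chosen because it reproduces the values listed above; the names `EDGES`, `build_adjacency` and `clustering_coefficient` are illustrative, not taken from the presentation.

```python
from itertools import combinations

# Assumed example graph (the slide figure is missing); these edges are chosen
# so that the computed coefficients match the values quoted on the slides.
EDGES = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]


def build_adjacency(edges):
    """Build an undirected adjacency map {node: set of neighbors}."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def clustering_coefficient(adj, v):
    """Fraction of neighbor pairs of v that are joined by an edge.

    Returns None when v has fewer than two neighbors (the N/A case).
    """
    neighbors = adj[v]
    d = len(neighbors)
    if d < 2:
        return None
    # Each linked neighbor pair closes one triangle through v.
    linked = sum(1 for u, w in combinations(sorted(neighbors), 2) if w in adj[u])
    return linked / (d * (d - 1) / 2)


ADJ = build_adjacency(EDGES)
CC = {v: clustering_coefficient(ADJ, v) for v in sorted(ADJ)}
```

Counting linked neighbor pairs and counting triangles through v are the same computation, which is why both slides report identical values.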

  6. How to count triangles – Naive approach (NodeIterator, a sequential node algorithm) • Pivot around each node • Examine every pair of neighbors • Each triangle is counted 6 times • Quadratic in the degree, even for a single high-degree node • Running time: O(Σ_{v∈V} d_v²)
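A minimal sequential sketch of the naive pivot-around-every-node idea follows. The function name `node_iterator` and the edge-list input format are assumptions made for illustration; the loop over ordered neighbor pairs is what makes each triangle show up six times.

```python
def node_iterator(edges):
    """Count triangles by pivoting on every node (naive approach).

    Every ordered pair of neighbors is examined, so each triangle is hit
    6 times: 3 possible pivots times 2 orderings of the other two nodes.
    """
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    hits = 0
    for v in adj:                      # pivot node
        for u in adj[v]:               # first neighbor
            for w in adj[v]:           # second neighbor
                if u != w and w in adj[u]:
                    hits += 1          # 2-path u - v - w is closed by edge (u, w)
    return hits // 6
```

The two inner loops touch d_v² pairs for a pivot of degree d_v, which is where the quadratic cost for a single high-degree node comes from.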

  7. Improving upon the NodeIterator (NodeIterator++) • Pivot around low-degree nodes • Result: each triangle is counted only once and, more importantly, far fewer 2-paths are considered • Running time: O(m^(3/2)), which is optimal [Schank, Th., 2007]
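The improvement can be sketched as follows, assuming a total order on nodes by (degree, label) so that ties between equal degrees are broken consistently. The name `node_iterator_pp` and the tie-breaking rule are illustrative choices; node labels are assumed to be mutually comparable.

```python
def node_iterator_pp(edges):
    """Count each triangle once by pivoting on its lowest-ranked node."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    def rank(v):
        # Order by degree; break ties by label (assumes comparable labels).
        return (len(adj[v]), v)

    triangles = 0
    for v in adj:
        # Only neighbors ranked above the pivot are paired up.
        higher = [u for u in adj[v] if rank(u) > rank(v)]
        for i, u in enumerate(higher):
            for w in higher[i + 1:]:
                if w in adj[u]:
                    triangles += 1
    return triangles
```

Because a high-degree node only pairs up neighbors that rank even higher, it no longer generates a quadratic number of 2-paths on its own.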

  8. Why parallelize algorithms? • The graph data structure doesn’t fit in the memory of a single machine • A sample Twitter graph has 42 million nodes and 2.4 billion edges, ~4.5 GB of compressed data • Inside the algorithm, enumerating 2-paths blows the memory demand up to petabytes

  9. Map Reduce Framework

  10. Advantages of MapReduce • Runs on commodity hardware • Tolerates non-critical failures • Widely used at: Yahoo!, Google, Facebook, MS (12/10) • Provided by cloud services such as Amazon Web Services • Open source

  11. MR-NodeIterator • Round 1: Generate all possible 2-paths starting from each node • Round 2: Check if the 2-paths and the starting node form a triangle

  12. MR-NodeIterator • Round 1 (split input to reducers, formulate 2-paths): • Map 1: For each edge (u, v) emit ⟨u; v⟩ to reducer 1 • Reduce 1: Input ⟨v; Γ(v)⟩ • Output: all possible 2-paths ⟨(u, w); v⟩ where u, w ∈ Γ(v). Example: (A,B);C - (A,D);C - (B,D);C • Round 2 (the symbol $ denotes existence of a neighbor edge): • Map 2: Send 2-paths ⟨(u, w); v⟩ and edges ⟨(u, v); $⟩ to reducer 2 • Reduce 2: Input ⟨(u, w); S ⊆ V ∪ {$}⟩ • Output: if $ exists in S, then count • Example: (A,B);C, $
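The two rounds can be imitated in memory with a toy shuffle step. This is a sketch only: `shuffle` and `mr_node_iterator` are hypothetical helpers, no real MapReduce framework is involved, the graph is assumed to be simple (no duplicate edges), and no node may be labeled "$" because that string is used as the edge marker.

```python
from collections import defaultdict
from itertools import combinations


def shuffle(pairs):
    """Group (key, value) pairs by key, like the MapReduce shuffle phase."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def mr_node_iterator(edges):
    """In-memory imitation of the two-round naive MR-NodeIterator."""
    edges = [tuple(sorted(e)) for e in edges]

    # Round 1 map: send every edge to both of its endpoints.
    round1 = shuffle((a, b) for u, v in edges for a, b in ((u, v), (v, u)))

    # Round 1 reduce: each pivot v emits all 2-paths keyed by their end nodes.
    two_paths = [
        ((u, w), v)
        for v, neighbors in round1.items()
        for u, w in combinations(sorted(neighbors), 2)
    ]

    # Round 2 map: pass 2-paths through and tag every real edge with "$".
    round2 = shuffle(two_paths + [(e, "$") for e in edges])

    # Round 2 reduce: a key holding "$" is a real edge, so every pivot
    # stored under that key closes a triangle.
    closed = sum(len(values) - 1 for values in round2.values() if "$" in values)

    # Without the degree ordering, each triangle is found once per pivot.
    return closed // 3
```

On the assumed example graph with edges A-B, A-C, B-C, C-D, the pivot C emits (A,B);C, (A,D);C and (B,D);C, and only the key (A,B) also receives a "$".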

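  13. MR-NodeIterator++ • Pivot around the node with lower degree • Reducer 1 input and output shrink compared to MR-NodeIterator, since only neighbors of higher degree are paired • Reducer 2 input still contains the entire edge list • Result: each triangle is counted only once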

  14. Data Skew – The Curse • In real graphs there exist nodes with a very high degree. • The reducer handling node @BarackObama (~10M followers) has to check 100 trillion 2-paths using the naive approach. • Natural graphs commonly follow a power-law degree distribution! • The curse of “99% complete”: the job waits on the last reducer.
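The 100 trillion figure is just the square of the follower count. The short check below makes the arithmetic explicit; the second half uses made-up degrees (one hub plus a million nodes of degree 100) purely to illustrate how a single reducer can end up holding almost all of the work.

```python
def two_paths_naive(degree):
    """Ordered neighbor pairs examined when the naive approach pivots on one node."""
    return degree * degree


followers = 10_000_000                 # ~10M followers, as quoted on the slide
paths = two_paths_naive(followers)     # 10**14, i.e. 100 trillion

# Hypothetical degree sequence, for illustration only: one hub and
# one million ordinary nodes of degree 100.
other_degrees = [100] * 1_000_000
other_work = sum(two_paths_naive(d) for d in other_degrees)
hub_share = paths / (paths + other_work)
```

Under these assumed degrees the hub accounts for more than 99.9% of all 2-paths, so every other reducer finishes long before the one that received the hub.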

  15. Lifting the Curse • Split the nodes into low-degree (L, degree below √m) and high-degree (H), i.e. NodeIterator++ • |L| is at most n and each low-degree node v generates at most d_v² ≤ √m · d_v paths • |H| is at most 2√m and each high-degree node generates at most |H|² = O(m) paths • Finally, total work is O(m^(3/2))
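The bound can be worked out in a few lines. The derivation below is a sketch that assumes the usual degree threshold of √m for separating low-degree from high-degree nodes, and that a high-degree pivot only pairs up neighbors of even higher degree; the formulas on the original slide are not preserved, so the constants here are not necessarily the ones shown there.

```latex
\begin{gather*}
H = \{\, v : d_v \ge \sqrt{m} \,\}, \qquad L = V \setminus H \\[4pt]
\sum_{v \in V} d_v = 2m
  \;\Longrightarrow\; |H| \cdot \sqrt{m} \le 2m
  \;\Longrightarrow\; |H| \le 2\sqrt{m} \\[4pt]
\text{low-degree pivots:}\quad
  \sum_{v \in L} d_v^2 \;\le\; \sqrt{m} \sum_{v \in L} d_v \;\le\; 2\, m^{3/2} \\[4pt]
\text{high-degree pivots:}\quad
  \sum_{v \in H} |H|^2 \;=\; |H|^3 \;\le\; 8\, m^{3/2} \\[4pt]
\text{total work} \;=\; O\!\left(m^{3/2}\right)
\end{gather*}
```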

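  16. Tackling the Curse – Graph Partitioning • The authors suggest partitioning the graph: V is split into ρ subsets V_1, …, V_ρ. • For i < j < k, G_ijk is the induced subgraph on V_i ∪ V_j ∪ V_k • Each G_ijk contains a 3/ρ fraction of the vertices and, in expectation, O(m/ρ²) edges • Every triangle appears in at least one subgraph, possibly in more, so weights are introduced to scale the count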

  17. In how many subgraphs does a triangle appear? • Assume G is divided into ρ = 4 subsets • Case 1: The triangle’s vertices lie in distinct subsets; it appears once • Case 2: Two vertices lie in the same subset; the triangle appears ρ − 2 times • Case 3: All three nodes lie in one subset; see line #15
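The three cases can be checked by brute force. The sketch assumes that subgraphs are indexed by triples i < j < k of the ρ subsets and that a triangle appears in a subgraph exactly when all of its vertices' subsets belong to that triple; `appearances` is an illustrative helper, not part of the presentation.

```python
from itertools import combinations


def appearances(vertex_parts, rho):
    """Number of subset triples i < j < k that contain a given triangle.

    vertex_parts lists the subset index (0 .. rho - 1) of each of the
    triangle's three vertices.
    """
    needed = set(vertex_parts)
    return sum(
        1 for triple in combinations(range(rho), 3) if needed <= set(triple)
    )


RHO = 4
case1 = appearances((0, 1, 2), RHO)   # three distinct subsets
case2 = appearances((0, 0, 1), RHO)   # two vertices share a subset
case3 = appearances((0, 0, 0), RHO)   # all three vertices in one subset
```

For ρ = 4 this gives 1, 2 and 3 appearances respectively, so each found triangle has to be weighted by the reciprocal of its appearance count.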

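  18. MR-GraphPartition • A hash function distributes the vertices to ρ buckets • Total size of the Map output is O(ρm) • Input size of each Reducer is O(m/ρ²) • Pseudocode annotations: Case 1, Case 3, Case 2; scale each triangle by its number of appearances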
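Since the pseudocode on this slide is not preserved, here is a single-process sketch of the partitioning idea rather than the authors' exact algorithm. It assumes at least three buckets, uses Python's built-in `hash` as a stand-in bucket function, and lets the loop over bucket triples play the role of the reducers; `triangles` and `mr_graph_partition` are hypothetical names.

```python
from fractions import Fraction
from itertools import combinations
from math import comb


def triangles(edges):
    """Yield every triangle of a simple graph once (lowest-ranked pivot)."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    def rank(v):
        return (len(adj[v]), v)

    for v in adj:
        higher = [u for u in adj[v] if rank(u) > rank(v)]
        for u, w in combinations(higher, 2):
            if w in adj[u]:
                yield (v, u, w)


def mr_graph_partition(edges, rho, bucket=None):
    """Count triangles subgraph by subgraph, weighting repeated finds."""
    if rho < 3:
        raise ValueError("rho must be at least 3")
    if bucket is None:
        bucket = lambda v: hash(v) % rho   # assumed hash-based bucketing

    total = Fraction(0)
    for triple in combinations(range(rho), 3):      # one "reducer" per triple
        keep = set(triple)
        # "Map": an edge reaches this reducer if both endpoints fall in the triple.
        sub = [(u, v) for u, v in edges if bucket(u) in keep and bucket(v) in keep]
        for tri in triangles(sub):
            distinct = len({bucket(x) for x in tri})
            # Number of subgraphs in which this triangle shows up.
            copies = {3: 1, 2: rho - 2, 1: comb(rho - 1, 2)}[distinct]
            total += Fraction(1, copies)
    return int(total)
```

Exact fractions are used for the weights so that the scaled counts add up to an integer; the result does not depend on how the bucket function happens to spread the vertices.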

  19. The partitioning (ρ) parameter • Total size of the Map output is O(ρm) • Input size for each Reducer is O(m/ρ²) • This calls for a tradeoff: increasing the total disk space used by the Mappers greatly decreases the RAM required by the Reducers. • Again, total work is O(m^(3/2)), distributed across the Reducers.

  20. Results • Completion time distributes “normally” across the runtime spectrum

  21. ρ Tradeoff

  22. Contributions • Introduces MapReduce for counting triangles, even for the naive approach • Provides the Graph Partition MR algorithm, extendable to subgraphs other than triangles • Implements some of Schank’s work in the context of social networks • Explores challenges in real-world data (data skew) • Results are exact, not approximations

  23. Thank you!
