
Gluon: A Communication-Optimizing Substrate for Distributed Heterogeneous Graph Analytics




Presentation Transcript


  1. Roshan Dathathri, Gurbinder Gill, Loc Hoang, Hoang-Vu Dang, Alex Brooks, Nikoli Dryden, Marc Snir, Keshav Pingali. Gluon: A Communication-Optimizing Substrate for Distributed Heterogeneous Graph Analytics

  2. Distributed Graph Analytics • Applications: machine learning and network science • Datasets: unstructured graphs that need TBs of memory (Image credits: Wikipedia, SFL Scientific, MakeUseOf, Sentinel Visualizer)

  3. Gluon [PLDI’18] • Substrate: single address space applications on distributed, heterogeneous clusters • Provides: • Partitioner • High-level synchronization API • Communication-optimizing runtime

  4. How to use Gluon? • Programmers: • Write shared-memory applications • Interface with Gluon using API • Gluon transparently handles: • Graph partitioning • Communication and synchronization [Figure: CPU hosts running Galois/Ligra with a Gluon plugin, Gluon communication runtime, and partitioner, connected over the network (LCI/MPI)] Galois [SoSP’13], Ligra [PPoPP’13], IrGL [OOPSLA’16], LCI [IPDPS’18]

  5. How to use Gluon? • Programmers: • Write shared-memory applications • Interface with Gluon using API • Gluon transparently handles: • Graph partitioning • Communication and synchronization [Figure: GPU hosts running IrGL/CUDA with a Gluon plugin; a CPU-side Gluon communication runtime and partitioner connect over the network (LCI/MPI)] Galois [SoSP’13], Ligra [PPoPP’13], IrGL [OOPSLA’16], LCI [IPDPS’18]

  6. How to use Gluon? • Programmers: • Write shared-memory applications • Interface with Gluon using API • Gluon transparently handles: • Graph partitioning • Communication and synchronization [Figure: a heterogeneous cluster mixing a GPU host (IrGL/CUDA) and a CPU host (Galois/Ligra), each with a Gluon plugin and communication runtime over the network (LCI/MPI)] Galois [SoSP’13], Ligra [PPoPP’13], IrGL [OOPSLA’16], LCI [IPDPS’18]

  7. Contributions • Novel approach to build distributed and heterogeneous graph analytics systems out of plug-and-play components • Novel optimizations that reduce communication volume and time • Plug-and-play systems built with Gluon outperform the state-of-the-art [Figure: heterogeneous cluster architecture with Gluon plugins, communication runtimes, and partitioners over the network (LCI/MPI)]

  8. Outline • Gluon Synchronization Approach • Optimizing Communication • Exploiting Structural Invariants of Partitions • Exploiting Temporal Invariance of Partitions • Experimental Results

  9. Gluon Synchronization Approach

  10. Vertex Programming Model • Every node has one or more labels • e.g., distance in single-source shortest path (SSSP) • Apply an operator on an active node in the graph • e.g., relaxation operator in SSSP • Operator: computes labels on nodes • Push-style: reads its own label and writes to its neighbors’ labels • Pull-style: reads its neighbors’ labels and writes to its own label • Applications: breadth-first search, connected components, pagerank, single-source shortest path, betweenness centrality, k-core, etc. [Figure: read/write patterns of push-style and pull-style operators]
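To make the push-style pattern concrete, here is a minimal Python sketch of one round of SSSP relaxation (not Gluon's actual API; the graph representation and function name are illustrative): each node reads its own distance label and min-reduces into its out-neighbors' labels.

```python
def sssp_push_round(adj, dist):
    """One round of push-style SSSP relaxation.

    adj:  dict mapping node -> list of (neighbor, edge weight)
    dist: dict mapping node -> current distance label
    Returns the set of nodes whose labels changed (active next round).
    """
    changed = set()
    for u, edges in adj.items():
        for v, w in edges:
            relaxed = dist[u] + w
            if relaxed < dist[v]:   # min-reduction on the neighbor's label
                dist[v] = relaxed
                changed.add(v)
    return changed

# Tiny example graph: A -> B (1), B -> C (2), A -> C (5), source A.
INF = float("inf")
adj = {"A": [("B", 1), ("C", 5)], "B": [("C", 2)], "C": []}
dist = {"A": 0, "B": INF, "C": INF}
while sssp_push_round(adj, dist):   # iterate until no label changes
    pass
```

A pull-style operator would invert the access pattern: each node reads its in-neighbors' labels and updates only its own.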

  11. Partitioning [Figure: the original graph with nodes A–J, and its partitions on hosts h1 and h2]

  12. Partitioning • Each edge is assigned to a unique host [Figure: edges of the original graph distributed across hosts h1 and h2]

  13. Partitioning • Each edge is assigned to a unique host • All edges connect proxy nodes on the same host [Figure: proxy nodes created on h1 and h2 so that every edge is local to its host]

  14. Partitioning • Each edge is assigned to a unique host • All edges connect proxy nodes on the same host • A node can have multiple proxies: one is the master proxy; the rest are mirror proxies [Figure: partitions with master and mirror proxies marked]

  15. Partitioning • Each edge is assigned to a unique host • All edges connect proxy nodes on the same host • A node can have multiple proxies: one is the master proxy; the rest are mirror proxies [Figure: partitions annotated with global IDs A–J and per-host local IDs 0–7, with master and mirror proxies marked]
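The per-host bookkeeping this implies can be sketched as follows (a hypothetical data structure, not Gluon's implementation; the node assignments below are illustrative, not the exact partition in the figure):

```python
class Partition:
    """One host's partition: proxies with local IDs, a master/mirror
    flag per proxy, and a mapping back to global node IDs."""

    def __init__(self, host):
        self.host = host
        self.global_to_local = {}   # global ID -> local ID
        self.local_to_global = []   # local ID -> global ID
        self.is_master = []         # local ID -> True if master proxy
        self.edges = []             # (local src, local dst) pairs

    def add_proxy(self, gid, master):
        if gid not in self.global_to_local:
            self.global_to_local[gid] = len(self.local_to_global)
            self.local_to_global.append(gid)
            self.is_master.append(master)
        return self.global_to_local[gid]

    def add_edge(self, gsrc, gdst):
        # Every edge connects two proxies on this same host.
        self.edges.append((self.global_to_local[gsrc],
                           self.global_to_local[gdst]))

h1 = Partition("h1")
for gid, master in [("A", True), ("B", True), ("C", False),
                    ("F", True), ("G", False)]:
    h1.add_proxy(gid, master)
h1.add_edge("A", "B")
h1.add_edge("B", "C")
```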

  16. How to synchronize the proxies? • Distributed Shared Memory (DSM) protocols • Proxies act like cached copies • Difficult to scale out to distributed and heterogeneous clusters [Figure: partitions on hosts h1 and h2 with master and mirror proxies marked]

  17. How does Gluon synchronize the proxies? • Exploit domain knowledge • Cached copies can be stale as long as they are eventually synchronized [Figure: partitions with each proxy labeled with its distance from source A; mirror proxies may hold stale values]

  18. How does Gluon synchronize the proxies? • Exploit domain knowledge • Cached copies can be stale as long as they are eventually synchronized • Use all-reduce: • Reduce from mirror proxies to master proxy • Broadcast from master proxy to mirror proxies [Figure: distance labels from source A after the reduce and broadcast steps]
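The reduce-then-broadcast step can be sketched in Python (illustrative, not Gluon's C++ runtime; for SSSP the reduction operator is min over distance labels):

```python
INF = float("inf")

def synchronize(label, masters, reduce_op=min):
    """Gluon-style synchronization sketch for one label.

    label:   {host: {global ID: proxy value}} -- each host's proxy values
    masters: {global ID: host owning the master proxy}
    """
    # Reduce: fold every proxy's value into the master proxy.
    for host, values in label.items():
        for gid, val in values.items():
            owner = masters[gid]
            label[owner][gid] = reduce_op(label[owner][gid], val)
    # Broadcast: copy the reduced value back to every proxy.
    for values in label.values():
        for gid in values:
            values[gid] = label[masters[gid]][gid]

# Mirrors of B and G on h1 hold fresher distances than their masters on h2.
dist = {"h1": {"B": 1, "G": 1}, "h2": {"B": INF, "G": INF}}
synchronize(dist, {"B": "h2", "G": "h2"})
```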

  19. When to synchronize proxies? [Figure: on each host — read a partition of the graph, run distribution-agnostic computation on local proxies, then synchronize proxies with each other]

  20. Gluon Distributed Execution Model [Figure: on each host, the Gluon partitioner feeds Galois/Ligra on a multicore CPU or IrGL/CUDA on a GPU, with the Gluon communication runtime over MPI/LCI] Galois [SoSP’13], Ligra [PPoPP’13], IrGL [OOPSLA’16], LCI [IPDPS’18]

  21. Gluon Synchronization API • Application-specific: • What: label to synchronize • When: point of synchronization • How: reduction operator to use • Platform-specific: • Access functions for labels (specific to data layout)
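A hypothetical sketch of registering those pieces (Gluon's real API uses C++ synchronization structures; the class and field names here are illustrative only):

```python
class SyncStructure:
    """Bundles what the API asks for: platform-specific access functions
    for a label's data layout, plus the reduction operator to use."""

    def __init__(self, extract, reduce_op, set_value):
        self.extract = extract        # access function: read a proxy's label
        self.reduce_op = reduce_op    # "how": the reduction operator
        self.set_value = set_value    # access function: write a proxy's label

# "What": the SSSP distance label, stored here as a plain list indexed
# by local ID (the layout is platform-specific, hence the accessors).
dist = [0, 3, 7]
sync_dist = SyncStructure(
    extract=lambda lid: dist[lid],
    reduce_op=min,
    set_value=lambda lid, v: dist.__setitem__(lid, v),
)

# "When": the application invokes synchronization at chosen points,
# e.g. after each round; here we reduce an incoming value 5 into local ID 2.
sync_dist.set_value(2, sync_dist.reduce_op(sync_dist.extract(2), 5))
```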

  22. Exploiting Structural Invariants to Optimize Communication

  23. Structural invariants in the partitioning Structural invariants in this partitioning: • Mirror proxies do not have outgoing edges As a consequence, for SSSP: • Mirror proxies do not read their distance label • Broadcast from master proxy to mirror proxies is not required [Figure: partitions with distance labels from source A; mirror proxies have no outgoing edges]

  24. Graph as an adjacency matrix [Figure: the same graph shown as a graph view and as an adjacency matrix view]

  25. Partitioning an adjacency matrix [Figure: graph view and adjacency matrix view of the partitions on hosts h1 and h2, with master and mirror proxies marked]

  26. Partitioning strategies with 4 partitions • Incoming Edge Cut (IEC) • Outgoing Edge Cut (OEC) • Cartesian Vertex Cut (CVC) • Unconstrained Vertex Cut (UVC) [Figure: adjacency-matrix view of each strategy]

  27. Partitioning: strategies, constraints, invariants • Algorithm invariant in SSSP: the operator reads a node’s own label and writes its neighbors’ labels [Figure: read/write pattern of the SSSP operator]

  28. Exploiting Temporal Invariance to Optimize Communication

  29. Bulk-communication • Proxies of millions of nodes need to be synchronized in a round • Not every node is updated in every round • Address spaces (local IDs) of different hosts are different • Existing systems: use address translation and communicate global IDs along with updated values [Figure: partitions annotated with global IDs A–J, local IDs 0–7, and distance labels from source A]

  30. Bulk-communication in existing systems [Figure: in each round, updated labels are paired with global IDs, translated between the address spaces of hosts h1 and h2, communicated, translated again, and reduced]

  31. Bulk-communication in Gluon • Elides address translation during communication in each round • Exploits temporal invariance in partitioning • Mirrors and masters are static • e.g., only labels of C, G, and J can be reduced from h1 to h2 • Memoize address translation after partitioning [Figure: partitions annotated with global IDs, local IDs, and distance labels from source A]

  32. Optimization I: memoization of address translation [Figure: after partitioning, hosts h1 and h2 translate the global IDs of shared nodes (C, G, J) to each other’s local IDs once; the resulting order is reused in every round]
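The one-time handshake can be sketched as follows (the local IDs and node names below are illustrative, not the exact values in the figure):

```python
def memoize(shared_gids, sender_g2l, receiver_g2l):
    """One-time address translation after partitioning: both hosts fix
    the same order for the shared nodes, so later rounds can exchange
    values only, with no global IDs or translation on the critical path."""
    order = sorted(shared_gids)
    return ([sender_g2l[g] for g in order], [receiver_g2l[g] for g in order])

# h1 sends labels of its mirrors C, G, J to their masters on h2 each round.
send_ids, recv_ids = memoize({"C", "G", "J"},
                             {"C": 2, "G": 4, "J": 6},   # h1: global -> local
                             {"C": 1, "G": 3, "J": 5})   # h2: global -> local

def send_round(labels):
    # Values only, in the memoized order; no per-round translation.
    return [labels[lid] for lid in send_ids]

def recv_round(payload, labels):
    for lid, val in zip(recv_ids, payload):
        labels[lid] = min(labels[lid], val)   # reduce at the master

h1_labels = [9, 9, 8, 9, 2, 9, 5]   # h1's label array, indexed by local ID
h2_labels = [9, 9, 9, 9, 9, 9]      # h2's label array, indexed by local ID
recv_round(send_round(h1_labels), h2_labels)
```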

  33. Communication after memoization • Memoization establishes the order of local IDs shared between hosts • Send and receive labels in each round in the same order • Only some labels are updated in a round: • Send a simple encoding that captures which of the local IDs in the order were updated [Figure: the memoized order of local IDs shared between hosts h1 and h2]

  34. Optimization II: encoding of updated local IDs [Figure: in each round, updated labels are sent as a bitvector over the memoized order plus the updated values only, then reduced at the receiver]
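A sketch of the bitvector encoding (illustrative; the actual runtime packs the bits and values into a single message buffer):

```python
def encode_updates(order, updated, labels):
    """One bit per shared node in the memoized order, plus values
    for only the nodes that were updated this round."""
    bits = [lid in updated for lid in order]
    values = [labels[lid] for lid in order if lid in updated]
    return bits, values

def decode_updates(order, bits, values, labels, reduce_op=min):
    """Walk the receiver's memoized order; consume a value wherever
    the corresponding bit is set, and reduce it into the local label."""
    it = iter(values)
    for lid, bit in zip(order, bits):
        if bit:
            labels[lid] = reduce_op(labels[lid], next(it))

order_h1 = [2, 4, 6]            # h1's memoized local IDs (illustrative)
order_h2 = [1, 3, 5]            # h2's local IDs for the same nodes, same order
h1_labels = {2: 8, 4: 2, 6: 5}
bits, vals = encode_updates(order_h1, {2, 4}, h1_labels)  # only 2 and 4 updated
h2_labels = {1: 9, 3: 9, 5: 9}
decode_updates(order_h2, bits, vals, h2_labels)
```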

  35. Experimental Results

  36. Experimental setup • Benchmarks: • Breadth-first search (bfs) • Connected components (cc) • Pagerank (pr) • Single-source shortest path (sssp) • Systems: • D-Ligra (Gluon + Ligra) • D-Galois (Gluon + Galois) • D-IrGL (Gluon + IrGL) • Gemini (state-of-the-art) [OSDI’16]

  37. Strong scaling on Stampede (68 cores on each host) D-Galois performs best and scales well

  38. Strong scaling on Bridges (4 GPUs share a physical node) D-IrGL scales well

  39. Impact of Gluon’s communication optimizations D-Galois on 128 hosts of Stampede: clueweb12 Improvement (geometric mean): • Communication volume: 2x • Execution time: 2.6x

  40. Fastest execution time (sec) using the best-performing number of hosts/GPUs D-Galois and D-IrGL are faster than Gemini by factors of 3.9x and 4.9x

  41. Subsequent work using Gluon • [EuroPar’18] Abelian compiler: compiles shared-memory Galois apps to distributed, heterogeneous (D-Galois + D-IrGL) apps • [VLDB’18] Partitioning policies: Cartesian Vertex Cut performs best at scale • [PPoPP’19] Min-Rounds Betweenness Centrality (MRBC) algorithm • [ASPLOS’19] Phoenix: fault tolerance without overhead during fault-free execution • [IPDPS’19] Fast Customizable Streaming Edge Partitioner (CuSP)

  42. Conclusions • Novel approach to build distributed, heterogeneous graph analytics systems: scales out to 256 multicore CPUs and 64 GPUs • Novel communication optimizations: improve execution time by 2.6x • Gluon, D-Galois, and D-IrGL: publicly available in Galois v4.0 http://iss.ices.utexas.edu/?p=projects/galois • Use Gluon to scale out your shared-memory graph analytics applications

  43. Backup slides

  44. Graph construction time (sec)

  45. Execution time (sec) on a single node of Stampede (“-” means out-of-memory)

  46. Execution time (sec) on a single node of Bridges with 4 K80 GPUs (“-” means out-of-memory)

  47. Fastest execution time (sec) of all systems using the best-performing number of hosts/GPUs D-Galois and D-IrGL are faster than Gemini by factors of 3.9x and 4.9x on average

  48. Impact of Gluon’s communication optimizations D-Galois on 128 hosts of Stampede: clueweb12 with CVC Improvement (geometric mean): • Communication volume: 2x • Execution time: 2.6x

  49. Impact of Gluon’s communication optimizations D-Galois on 128 hosts of Stampede: clueweb12 with CVC Improvement (geometric mean): • Communication volume: 2x • Execution time: 2.6x

  50. Fastest execution time (sec) of all systems using the best-performing number of hosts/GPUs clueweb12 D-Galois and D-IrGL are faster than Gemini by factors of 3.9x and 4.9x
