
Graph Analysis with High-Performance Computing


Presentation Transcript


  1. Graph Analysis with High-Performance Computing
  Jonathan Berry, Scalable Algorithms Department, Sandia National Laboratories
  Plenary at SIAM-PP 2010 (Seattle), February 2010

  2. Outline
  • Graph analysis
    • Non-HPC Theory & Algorithms
    • Role in “Informatics”
  • HPC Challenges
    • Traditional HPC applications
    • Informatics/Graph applications
  • Performance
    • MPP
    • Massive Multithreading
    • Hadoop (Cloud)
  • Thoughts and Conclusions

  3. Graph Analysis: What is it?
  [Figure: a small example graph with vertices a–i and weighted edges]
  • A graph is a set of vertices and a set of edges
  • This abstraction is a powerful modeling tool
  • Edges are “meta-data” – storing this is economical and often sufficient
  • Motivations in homeland security, computer security, biology, etc.
  • Typical graph algorithm behavior:
    • Nodes visit their neighbors
    • Algorithms traverse and compute
    • There’s a large amount of communication relative to computation
    • Often this is asynchronous
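  To make the abstraction and the “nodes visit their neighbors” pattern above concrete, here is a minimal adjacency-list sketch in C++ (a hypothetical stand-in, not the MTGL or PBGL data structures): vertices are integer ids, edges are entries in neighbor lists, and a traversal does very little arithmetic per memory reference.

    #include <algorithm>
    #include <iostream>
    #include <vector>

    // A graph is a set of vertices and a set of edges; an adjacency list
    // stores, for each vertex, the ids of its neighbors.
    struct Graph {
      std::vector<std::vector<int>> adj;   // adj[v] = neighbors of v
      void add_edge(int u, int v) {        // undirected edge {u, v}
        int hi = std::max(u, v);
        if ((int)adj.size() <= hi) adj.resize(hi + 1);
        adj[u].push_back(v);
        adj[v].push_back(u);
      }
    };

    int main() {
      Graph g;
      g.add_edge(0, 1); g.add_edge(1, 2); g.add_edge(0, 2); g.add_edge(2, 3);
      // "Nodes visit their neighbors": almost no arithmetic, mostly memory traffic.
      for (int v = 0; v < (int)g.adj.size(); ++v) {
        std::cout << v << ":";
        for (int w : g.adj[v]) std::cout << " " << w;
        std::cout << "\n";
      }
      return 0;
    }

  The inner loop over g.adj[v] is essentially pointer chasing; that high ratio of memory traffic to computation is what the HPC comparisons later in the talk hinge on.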

  4. Graph Analysis: What is it not? (Graph Theory!)
  • Bondy & Murty Chapter 1 starred exercise example: Show that if G is simple and the eigenvalues of A are distinct, then the automorphism group of G is abelian.
  • However, we’ll certainly use the results of good theory:
    • If it leads to good algorithms,
    • These have application,
    • And can be implemented,
    • And can be made to run efficiently on HPC!
  • Often, the algorithms we want to run are simple:
    • S-T connectivity
    • Connected components
    • Single-source shortest paths
  • These may be simple, but their implications for HPC are not.

  5. “Informatics”
  • What it means in Europe: “Computer Science”
  • What it means in some MIS departments: “Microsoft Office”
  • What it means to me: the analysis of datasets arising from “information” sources such as the WWW or fragments of a genome (not physics simulation)
  • What it means to you: in all probability, something different

  6. Some Non-Trivial Informatics Algorithms
  • Connection subgraphs (Faloutsos, et al., KDD 2004)
  • Community detection (100’s of papers)
  • Inexact subgraph isomorphism (several)

  7. Application Characteristics: Traditional HPC Physical Simulations (System Software Person’s View)
  • Large-scale physics and engineering applications
    • Able to utilize the entire Red Storm machine
    • Basis in physical world leads to natural 3-D data distribution
  • Communication to nearest neighbors, in 3 dimensions
    • Peers limited even for adaptive mesh refinement
  • Ghost cell update messages sent from packed buffers
    • MPI historically bad at sending derived datatypes
    • Poor message rates led to buffering non-contiguous slices of the 3-D data space
  • Point-to-point communication largely ghost cell updates, ranging in size from word-length to megabytes
  • Collective communication
    • Double precision floating point all-reduce
    • Varied sized broadcasts, particularly during startup
  • Ghost cell updates implicitly synchronize time-steps
  Slide credit: Brian Barrett (Sandia)
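  To illustrate the packed-buffer ghost-cell exchange and the collectives described above, here is a hedged MPI sketch in C++ (hypothetical block size NX×NY×NZ and a 1-D ring of neighbors; not the actual Red Storm application codes):

    #include <mpi.h>
    #include <vector>

    int main(int argc, char** argv) {
      MPI_Init(&argc, &argv);
      int rank, size;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);

      const int NX = 32, NY = 32, NZ = 32;            // hypothetical local block
      std::vector<double> u(NX * NY * NZ, rank);      // field values
      std::vector<double> sendbuf(NY * NZ), recvbuf(NY * NZ);

      // Pack the x = NX-1 face (strided in memory) into a contiguous buffer,
      // rather than sending an MPI derived datatype directly.
      for (int j = 0; j < NY; ++j)
        for (int k = 0; k < NZ; ++k)
          sendbuf[j * NZ + k] = u[(NX - 1) * NY * NZ + j * NZ + k];

      int right = (rank + 1) % size, left = (rank + size - 1) % size;
      MPI_Sendrecv(sendbuf.data(), NY * NZ, MPI_DOUBLE, right, 0,
                   recvbuf.data(), NY * NZ, MPI_DOUBLE, left, 0,
                   MPI_COMM_WORLD, MPI_STATUS_IGNORE);
      // (a real code would now unpack recvbuf into its ghost layer)

      // Typical collective: a double-precision all-reduce of a residual norm.
      double local_norm = 1.0, global_norm = 0.0;
      MPI_Allreduce(&local_norm, &global_norm, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

      MPI_Finalize();
      return 0;
    }

  The point is the shape of the communication: a predictable nearest-neighbor exchange of a packed contiguous buffer plus a small all-reduce, which together implicitly synchronize the time step.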

  8. Application Characteristics Simplified
  The HPC community is used to datasets with geometry, like these [image: from Attaway ’00]
  But datasets in graph analysis often look like this [image: from UCSD ’08]

  9. Informatics Datasets Are Different
  “One of the interesting ramifications of the fact that the PageRank calculation converges rapidly is that the web is an expander-like graph” – Page, Brin, Motwani, Winograd 1999
  [images: from UCSD ’08 and Broder, et al. ’00]
  Primary HPC implications:
  1) Any partitioning is “bad”
  2) Data must inform the algorithm

  10. Informatics Problems Suggest New Architectures

  11. HPC Applications: Locality Challenges
  [Chart comparing locality across application classes: “What we traditionally care about”, “Emerging Codes”, “What industry cares about”, and “Graph Analysis Codes”]
  From: Murphy and Kogge, “On The Memory Access Patterns of Supercomputer Applications: Benchmark Selection and Its Implications,” IEEE Trans. on Computers, July 2007
  Slide credit: Richard Murphy (SNL) and Peter Kogge (Notre Dame)

  12. HPC Applications: Locality and Instruction Mix Slide credit: Arun Rodrigues (SNL)

  13. Pause for Thought
  • The locality slide suggests that informatics applications are fundamentally different from traditional computational science applications
  • The instruction breakdown slide suggests that they may not be that different
  Which is it? Do we need to start over with different computer architectures? Would we have to do this even if nobody cared about graphs?

  14. Space Usage in the AMD Opteron
  [Annotated die diagram of the AMD Opteron: memory/latency-avoidance structures (L1 I-Cache, L1 D-Cache, L2 Cache), latency-tolerance structures (out-of-order execution, load/store unit, memory/coherency), memory and I/O interfaces (memory controller, DDR, HT bus), I-fetch/scan/align logic, and the comparatively small FPU and integer execution units (labeled “COMPUTER”)]
  Slide credit: Thomas Sterling (LSU/CalTech) and Richard Murphy (SNL)

  15. The Burton Smith/Cray “Threadstorm” Processor
  • No processor cache; chip real estate goes to thread contexts
  • Latency tolerant: important for graph algorithms
  • Hashed memory
  • The Cray XMT combines up to 8192 Threadstorm processors with the Red Storm (Cray XT4/5) network.

  16. Performance Comparisons
  • We’ll look at three simple graph algorithms:
    • S-T connectivity
    • Connected components
    • Single-source shortest paths
  • We’ll find the following:
    • S-T: multithreading looks good (but we’ve used the wrong type of data)
    • Connected components: if we learn from the data, traditional HPC still scales
    • Single-source shortest paths: multithreading looks good again
    • ALL: for the amount of work done, traditional processor efficiency is low

  17. A Simple Graph Algorithm: S-T Connectivity
  • Find a shortest path between vertices s and t
  • Imagine how this would be done in an SQL database
  • With a graph representation, it’s simple and quick
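  A minimal sketch of S-T connectivity as a breadth-first search, assuming an adjacency-list graph (hypothetical code, not the XMT or PBGL implementations):

    #include <queue>
    #include <vector>

    // Returns the number of edges on a shortest path from s to t, or -1 if
    // s and t are not connected.
    int st_distance(const std::vector<std::vector<int>>& adj, int s, int t) {
      std::vector<int> dist(adj.size(), -1);
      std::queue<int> q;
      dist[s] = 0;
      q.push(s);
      while (!q.empty()) {
        int u = q.front(); q.pop();
        if (u == t) return dist[u];          // first time t is reached => shortest
        for (int v : adj[u]) {
          if (dist[v] == -1) {               // visit each neighbor once
            dist[v] = dist[u] + 1;
            q.push(v);
          }
        }
      }
      return -1;                             // no path between s and t
    }

  Each BFS level expands a frontier of vertices; on distributed memory those level-by-level expansions become the bulk-synchronous supersteps referred to on the next slide.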

  18. S-T Performance Results
  • The following slide compares performance results on easy data (E-R graphs); studies on realistic data are upcoming
  • The 2005 LLNL/IBM Gordon Bell finalist entry used Erdos-Renyi graphs
  • “PBGL” is the Parallel Boost Graph Library (Indiana U.)
    • Inside track to be the standard for graph analysis on clusters
    • “Generic programming” model is flexible
    • We are working to provide a similar model on the XMT (“MTGL”)
  • This is a Bulk-Synchronous Parallel (BSP) computation

  19. S-T Connectivity Performance on HPC
  [Chart comparing S-T connectivity runs: XMT at 4, 16, and 512 processors (strong scaling), BlueGene/L at 32,000 processors (2005), and PBGL with ghost nodes on a 2 GHz cluster at 96 and 120 processors]
  Thanks to David Mizell (Cray) for the 512p XMT runs

  20. Parallelized Shiloach-Vishkin (Connected Components)

    edge **edges = g->edges.getStore();
    while (graft) {
      numiter++;
      graft = 0;
      // “Graft”/“Hook” step: for each edge, hook the root of the
      // higher-numbered representative onto the lower-numbered one.
      #pragma mta assert parallel
      for (int i = 0; i < m; i++) {
        int u = edges[i]->vert1->id;
        int v = edges[i]->vert2->id;
        if (D[u] < D[v] && D[v] == D[D[v]]) { D[D[v]] = D[u]; graft = 1; }
        if (D[v] < D[u] && D[u] == D[D[u]]) { D[D[u]] = D[v]; graft = 1; }
      }
      // Pointer jumping: compress each vertex’s chain of representatives.
      #pragma mta assert parallel
      for (int i = 0; i < n; i++) {
        while (D[i] != D[D[i]]) D[i] = D[D[i]];
      }
    }

  Each vertex chooses a representative; representatives are merged in “graft” steps.
  MTA-2 implementation: Bader, Feo, Cong 2006

  21. Connected Components: Look at the Data
  [Plot of the component size distribution: one Giant Connected Component (GCC), annotated “Matvec!”, plus millions of small components]
  Find the millions of small components via Shiloach-Vishkin or MapReduce (Cohen, 2008; Plimpton/Devine, 2008)

  22. Connected Components Discussion
  • The XMT version uses the MultiThreaded Graph Library (MTGL)
    • Open source: https://software.sandia.gov/trac/mtgl
  • MapReduceMPI is a software framework to run MapReduce in distributed memory (allows out-of-core, maximizes in-core)
    • Open source: http://www.sandia.gov/~sjplimp/mapreduce.html
  • “Hybrid” does a Breadth-First Search (matvec) of the giant component, then finishes with MapReduce
    • Breadth-First Search is a Bulk-Synchronous Parallel (BSP) operation
    • Which will scale on power-law data on traditional HPC, if randomized
  “R-MAT” graphs: Chakrabarti, Zhan, Faloutsos, 2004
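  A hedged serial sketch of the hybrid idea (hypothetical code; the talk’s version does the BFS on HPC and finishes with MapReduce, whereas the stand-in below just repeats the BFS for the leftovers):

    #include <queue>
    #include <vector>

    // Phase 1: sweep the giant component with a BFS from a seed vertex.
    // Phase 2: label everything that remains (the millions of small components).
    std::vector<int> components(const std::vector<std::vector<int>>& adj, int gcc_seed) {
      std::vector<int> comp(adj.size(), -1);
      int next_label = 0;
      auto bfs = [&](int src, int label) {
        std::queue<int> q;
        comp[src] = label;
        q.push(src);
        while (!q.empty()) {
          int u = q.front(); q.pop();
          for (int v : adj[u])
            if (comp[v] == -1) { comp[v] = label; q.push(v); }
        }
      };
      bfs(gcc_seed, next_label++);                     // phase 1: the giant component
      for (int v = 0; v < (int)adj.size(); ++v)        // phase 2: the small leftovers
        if (comp[v] == -1) bfs(v, next_label++);
      return comp;
    }

  The BFS of the giant component is the BSP-friendly part that scales on traditional HPC; the long tail of tiny components is what MapReduce or Shiloach-Vishkin mops up.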

  23. Connected Components Results
  [Chart of connected components results on an R-MAT (0.57, 0.19, 0.19, 0.05) graph with ~250 million edges]
  Source: K. Devine, S. Plimpton, et al., MapReduceMPI (in preparation)

  24. Single-Source Shortest Paths (SSSP)
  [Diagram of a Map/Exchange/Reduce cycle between memory and disk, and a small weighted example graph]
  Google demonstrates Map/Reduce on SSSP; what is the cost?
  • Case study: Dijkstra’s single-source shortest paths algorithm
  • Basic algorithm (much detail omitted):
    1. Start with the source and “settle” it
    2. “Relax” edges leading out of the most recently settled vertex
    3. Find the next closest vertex, settle it, and go to 2
  • Map/Reduce algorithm (much detail omitted):
    1. Do a Map/Reduce pass to relax all edges (near the source or not)
    2. If any node’s distance from the source decreased, go to 1
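  A hedged serial stand-in for the Map/Reduce formulation above (essentially Bellman-Ford; not Google’s or Hadoop’s code): each iteration of the while loop plays the role of one Map/Exchange/Reduce round and relaxes every edge, near the current frontier or not, which is where the extra cost comes from.

    #include <limits>
    #include <tuple>
    #include <vector>

    // edges: (u, v, w) triples; n: vertex count; returns tentative distances.
    std::vector<double> mapreduce_sssp(const std::vector<std::tuple<int,int,double>>& edges,
                                       int n, int source) {
      const double INF = std::numeric_limits<double>::infinity();
      std::vector<double> dist(n, INF);
      dist[source] = 0.0;
      bool changed = true;
      while (changed) {                          // one pass == one Map/Exchange/Reduce round
        changed = false;
        for (const auto& [u, v, w] : edges) {    // "map": emit a candidate distance for v
          if (dist[u] + w < dist[v]) {           // "reduce": keep the minimum per vertex
            dist[v] = dist[u] + w;
            changed = true;
          }
        }
      }
      return dist;
    }

  Dijkstra with a priority queue touches only the edges out of each settled vertex once; the Map/Reduce rounds keep touching all edges until the distances stop changing.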

  25. ∆-Stepping Algorithm [Meyer, et al. 2003]
  • Label-correcting algorithm: can relax edges from unsettled vertices
  • ∆-stepping: “approximate bucket implementation of Dijkstra’s algorithm”
  • ∆: bucket width
  • Vertices are ordered using buckets representing priority ranges of size ∆
  • Each bucket may be processed in parallel
  • Process “light” (red) edges in parallel, then “heavy” edges in parallel
  [Figure: small example graph with light and heavy weighted edges]
  Slide credit: Kamesh Madduri, Lawrence Berkeley Laboratories
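  A hedged serial sketch of ∆-stepping (following the published algorithm, taking “light” to mean weight at most ∆; this is not Madduri’s XMT C code, where the vertices of a bucket and their relaxations are processed in parallel):

    #include <cstddef>
    #include <limits>
    #include <unordered_set>
    #include <vector>

    struct Edge { int to; double w; };

    std::vector<double> delta_stepping(const std::vector<std::vector<Edge>>& adj,
                                       int source, double delta) {
      const double INF = std::numeric_limits<double>::infinity();
      std::vector<double> dist(adj.size(), INF);
      std::vector<std::unordered_set<int>> buckets(1);

      // Move vertex v to the bucket matching an improved tentative distance d.
      auto relax = [&](int v, double d) {
        if (d < dist[v]) {
          std::size_t nb = static_cast<std::size_t>(d / delta);
          if (nb >= buckets.size()) buckets.resize(nb + 1);
          if (dist[v] != INF) {
            std::size_t ob = static_cast<std::size_t>(dist[v] / delta);
            buckets[ob].erase(v);                  // no-op if v was already removed
          }
          buckets[nb].insert(v);
          dist[v] = d;
        }
      };

      relax(source, 0.0);
      for (std::size_t i = 0; i < buckets.size(); ++i) {
        std::unordered_set<int> settled;           // vertices removed from bucket i
        while (!buckets[i].empty()) {              // light-edge phase (may refill bucket i)
          std::unordered_set<int> frontier;
          frontier.swap(buckets[i]);
          settled.insert(frontier.begin(), frontier.end());
          for (int u : frontier)
            for (const Edge& e : adj[u])
              if (e.w <= delta) relax(e.to, dist[u] + e.w);
        }
        for (int u : settled)                      // heavy-edge phase, once per bucket
          for (const Edge& e : adj[u])
            if (e.w > delta) relax(e.to, dist[u] + e.w);
      }
      return dist;
    }

  The bucket width ∆ trades work for parallelism: a light-edge relaxation can only re-enter the current bucket, so everything removed from that bucket can be relaxed together, which is what the XMT exploits.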

  26. SSSP Discussion
  • Hadoop is open-source software for cloud computations
  • MapReduceMPI keeps the problem mostly in memory
  • The Parallel Boost Graph Library implements ∆-stepping on traditional (distributed-memory) HPC
  • The XMT version of ∆-stepping (a C code by Kamesh Madduri) is the fastest known parallel implementation of SSSP
  • Our experiments here are on realistic data with power-law degree distributions
  K. Madduri, D.A. Bader, J.W. Berry, and J.R. Crobak, “An Experimental Study of a Parallel Shortest Path Algorithm for Solving Large-Scale Graph Instances,” Workshop on Algorithm Engineering and Experiments (ALENEX), New Orleans, LA, January 6, 2007.

  27. SSSP Results
  [Charts of SSSP runtime in seconds vs. number of processors, for a ~0.5 billion edge power-law graph and a ~30 million edge power-law graph]

  28. Performance Summary and Lessons
  • Petabyte++ data-massaging operations: cloud
    • E.g., create a graph from data
    • E.g., select a subgraph with a linear-time (or sub-linear) algorithm
  • 10-100+ terabyte Bulk-Synchronous Parallel (BSP) algorithms would run best (price/performance) on traditional HPC
    • PBGL
    • MapReduceMPI
  • Massive multithreading offers large advantages for 1-10 terabyte graph problems
    • Especially if the algorithm isn’t BSP
    • Especially if sharing a database of persistent graphs among many users matters
    • Possible power advantage

  29. Final Thoughts
  • Look for “MEGRAPHS” open-source software for the XMT this year
    • Graph database
    • Linear algebra primitives
    • (Eventually) user-defined algorithm engines
  • Will lessons from HPC graph analysis apply to adaptive, traditional scientific computing?
    • Lots of people think so! Look for new architectures
  • Consider attending MS29: Analysis of Petascale Data via Networks
    • Where will big graphs come from?
    • What analyses will we want to apply?
  • Consider attending MS40: Parallel Algorithms and Software…Graphs
    • More detail on algorithms

  30. Acknowledgements Simon Kahan (Self-employed (formerly Google (formerly Cray))) Petr Konecny (Google (formerly Cray)) Kristi Maschhoff (Cray) David Mizell (Cray) Mike Ringenburg (Cray) Bruce Hendrickson (SNL) Douglas Gregor (Indiana U.) Andrew Lumsdaine (Indiana U.) Nick Edmonds (Indiana U.) Jeremiah Willcock (Indiana U.) Vitus Leung (1415) Kamesh Madduri (LBL) William McLendon (SNL) Cynthia Phillips (SNL) Kyle Wheeler (Notre Dame/SNL) Brian Barrett (SNL) Steve Plimpton (SNL) Karen Devine (SNL) David Bader (Ga. Tech) John Feo (PNNL) Thomas Sterling (LSU/CalTech) Arun Rodrigues (SNL) (It was MS29 and MS40)
