GraphLab: A New Framework for Parallel Machine Learning
Yucheng Low, Aapo Kyrola, Carlos Guestrin, Joseph Gonzalez, Danny Bickson, Joe Hellerstein
Exponential Parallelism
Sequential processor performance has flattened while parallel performance keeps growing exponentially, and so does the data: 13 million Wikipedia pages, 3.6 billion photos on Flickr.
[Plot: processor speed (GHz) vs. release date, contrasting exponentially increasing sequential performance in the past, constant sequential performance today, and exponentially increasing parallel performance]
Parallel Programming is Hard
• Designing efficient parallel algorithms is hard
• Race conditions and deadlocks
• Parallel memory bottlenecks
• Architecture-specific concurrency
• Difficult to debug
• ML experts (and their graduate students) repeatedly address the same parallel design challenges
Avoid these problems by using high-level abstractions.
MapReduce – Map Phase
[Figure: independent records processed in parallel by CPU 1–CPU 4]
Embarrassingly parallel: independent computation, no communication needed.
MapReduce – Reduce Phase
[Figure: per-CPU partial results combined by CPU 1 and CPU 2]
Fold/Aggregation.
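To make the two phases concrete, here is a minimal sketch (not from the talk; the record values and the map_record function are made up) of an embarrassingly parallel map followed by a fold/aggregation using standard C++ threads:

#include <algorithm>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Hypothetical per-record computation for the map phase.
double map_record(double record) { return record * record; }

int main() {
    std::vector<double> records = {4.2, 2.1, 2.5, 1.2, 8.4, 1.8, 8.4, 2.4};
    std::vector<double> mapped(records.size());

    // Map phase: each chunk is processed independently; no communication.
    const unsigned num_cpus = 4;
    const std::size_t chunk = (records.size() + num_cpus - 1) / num_cpus;
    std::vector<std::thread> workers;
    for (unsigned c = 0; c < num_cpus; ++c) {
        workers.emplace_back([&, c] {
            const std::size_t begin = c * chunk;
            const std::size_t end = std::min(records.size(), begin + chunk);
            for (std::size_t i = begin; i < end; ++i)
                mapped[i] = map_record(records[i]);
        });
    }
    for (auto& w : workers) w.join();

    // Reduce phase: fold/aggregate the mapped values into one result.
    const double total = std::accumulate(mapped.begin(), mapped.end(), 0.0);
    return total > 0.0 ? 0 : 1;
}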
Related data, interdependent computation: not MapReduce-able.
Parallel Computing and ML
• Not all algorithms are efficiently data parallel
Data-Parallel: Feature Extraction, Cross Validation
Complex Parallel Structure: Kernel Methods, Belief Propagation, SVM, Tensor Factorization, Sampling, Deep Belief Networks, Neural Networks, Lasso
Common Properties
1) Sparse data dependencies: sparse primal SVM, tensor/matrix factorization
2) Local computations: sampling, belief propagation
3) Iterative updates: expectation maximization, optimization
Gibbs Sampling
Gibbs sampling has all three properties:
1) Sparse data dependencies
2) Local computations
3) Iterative updates
[Figure: 3x3 grid of variables X1–X9]
GraphLab is the Solution • Designed specifically for ML needs • Express data dependencies • Iterative • Simplifies the design of parallel programs: • Abstract away hardware issues • Automatic data synchronization • Addresses multiple hardware architectures • Implementation here is multi-core • Distributed implementation in progress
GraphLab: A New Framework for Parallel Machine Learning
The GraphLab Model: Data Graph, Shared Data Table, Update Functions and Scopes, Scheduling
Data Graph
A graph with data associated with every vertex and edge.
[Figure: vertices X1–X11 with attached data, e.g., x3: sample value, C(X3): sample counts, Φ(X6,X9): binary potential]
Update Functions
Update functions are operations that are applied to a vertex and transform the data in the scope of that vertex.
Gibbs update:
- Read samples on adjacent vertices
- Read edge potentials
- Compute a new sample for the current vertex
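As a rough illustration of the shape of such an update (not the actual GraphLab API; the scope struct, its fields, and gibbs_update are assumptions for this sketch):

#include <cstddef>
#include <random>
#include <vector>

// Hypothetical data attached to vertices and edges.
struct vertex_data { int sample = 0; };          // current sample in {0, 1}
struct edge_data   { double potential[2][2]; };  // binary potential Φ

// Hypothetical scope of one vertex: the vertex itself plus read access to
// the data on adjacent vertices and edges (edges[i] connects to neighbors[i]).
struct scope {
    vertex_data* center;
    std::vector<const vertex_data*> neighbors;
    std::vector<const edge_data*> edges;
};

// Gibbs update sketch: read adjacent samples and edge potentials, form the
// conditional distribution of the center vertex, and resample it.
void gibbs_update(scope& s, std::mt19937& rng) {
    double unnorm[2] = {1.0, 1.0};
    for (std::size_t i = 0; i < s.neighbors.size(); ++i) {
        const int nbr_sample = s.neighbors[i]->sample;
        for (int v = 0; v < 2; ++v)
            unnorm[v] *= s.edges[i]->potential[v][nbr_sample];
    }
    std::discrete_distribution<int> conditional{unnorm[0], unnorm[1]};
    s.center->sample = conditional(rng);
}

int main() {
    std::mt19937 rng(0);
    vertex_data a, b, c;
    edge_data ab{{{1.0, 0.5}, {0.5, 1.0}}}, bc{{{1.0, 0.5}, {0.5, 1.0}}};
    scope scope_b{&b, {&a, &c}, {&ab, &bc}};  // b's scope in the chain a - b - c
    for (int iter = 0; iter < 10; ++iter) gibbs_update(scope_b, rng);
    return b.sample;
}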
Update Function Schedule
[Figure: a shared scheduler queue of vertex updates (a, b, d, i, ...) being consumed by CPU 1 and CPU 2]
Static Schedule
The scheduler determines the order of update function evaluations.
Synchronous schedule: every vertex is updated simultaneously.
Round-robin schedule: every vertex is updated sequentially.
Need for Dynamic Scheduling
[Figure: part of the graph has converged while another part is slowly converging; effort should be focused on the slowly converging region]
Dynamic Schedule
[Figure: update functions running on CPU 1 and CPU 2 insert new vertex tasks into the scheduler queue as they execute]
Dynamic Schedule
Update functions can insert new tasks into the schedule:
• FIFO queue: Wildfire BP [Selvatici et al.]
• Priority queue: Residual BP [Elidan et al.]
• Splash schedule: Splash BP [Gonzalez et al.]
Obtain different algorithms simply by changing a flag:
--scheduler=fifo --scheduler=priority --scheduler=splash
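As a sketch of what "inserting new tasks" can look like in a residual-style schedule (illustrative only; the scheduler struct, add_task, and the toy scalar belief update are assumptions, not GraphLab's API):

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical scheduler interface: an update function can push
// follow-up tasks while it runs.
struct scheduler {
    std::vector<std::size_t> queue;  // vertex ids waiting to be updated
    void add_task(std::size_t vertex_id) { queue.push_back(vertex_id); }
};

// Residual-BP-flavored dynamic scheduling sketch: after recomputing this
// vertex's (toy, scalar) belief, reschedule its neighbors if it changed a lot.
void bp_update(std::size_t vertex_id,
               std::vector<double>& belief,
               const std::vector<std::vector<std::size_t>>& adjacency,
               scheduler& sched,
               double threshold = 1e-3) {
    const double old_belief = belief[vertex_id];
    double sum = 0.0;
    for (std::size_t nbr : adjacency[vertex_id]) sum += belief[nbr];
    const std::size_t degree = std::max<std::size_t>(1, adjacency[vertex_id].size());
    belief[vertex_id] = sum / static_cast<double>(degree);

    const double residual = std::fabs(belief[vertex_id] - old_belief);
    if (residual > threshold)
        for (std::size_t nbr : adjacency[vertex_id])
            sched.add_task(nbr);  // dynamic scheduling: insert new work
}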
Global Information
What if we need global information, such as algorithm parameters, sufficient statistics, or the sum over all vertices?
Shared Data Table (SDT)
• Global constant parameters (e.g., Constant: Temperature, Constant: Total # Samples)
Sync Operation
• Sync is a fold/reduce operation over the graph
• Accumulate performs an aggregation over vertices
• Apply makes a final modification to the accumulated data
• Example: compute the average of all the vertices
  Accumulate function: add. Apply function: divide by |V|.
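A minimal sketch of that example in C++ (illustrative; the accumulate/apply names mirror the slide, but the surrounding structure is assumed rather than the real GraphLab interface):

#include <cstddef>
#include <vector>

// Sync = fold/reduce over the graph, expressed as an accumulate step over
// vertices followed by a final apply step on the accumulated value.
struct average_sync {
    // Accumulate: fold one vertex's value into the running accumulator.
    static void accumulate(double& acc, double vertex_value) { acc += vertex_value; }
    // Apply: final modification once every vertex has been folded in.
    static double apply(double acc, std::size_t num_vertices) {
        return num_vertices ? acc / static_cast<double>(num_vertices) : 0.0;
    }
};

// Example: computing the average of all vertex values.
double run_sync(const std::vector<double>& vertex_values) {
    double acc = 0.0;
    for (double v : vertex_values) average_sync::accumulate(acc, v);
    return average_sync::apply(acc, vertex_values.size());
}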
Shared Data Table (SDT)
• Global constant parameters (Constant: Temperature, Constant: Total # Samples)
• Global computation via the Sync operation (Sync: Log-likelihood, Sync: Sample Statistics)
Write-Write Race
If adjacent update functions write to shared data simultaneously, the left and right updates race and the final value depends on which write lands last.
Race Conditions + Deadlocks
• This is just one of many possible races
• Race-free code is extremely difficult to write
The GraphLab design ensures race-free operation.
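For intuition, a minimal (intentionally racy) sketch of such a write-write race in plain threaded C++, not taken from the talk:

#include <iostream>
#include <thread>

int main() {
    double shared_value = 0.0;  // data shared by two "adjacent" updates

    // Two update functions writing to the same location with no locking:
    // this is a data race, and the final value depends on thread scheduling.
    std::thread left([&] { shared_value = 1.0; });
    std::thread right([&] { shared_value = 2.0; });
    left.join();
    right.join();

    std::cout << "final value: " << shared_value << "\n";  // nondeterministic
    return 0;
}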
Scope Rules
Full consistency: guaranteed safety for all update functions.
Full Consistency
Only update functions at least two vertices apart are allowed to run in parallel, which reduces the opportunities for parallelism.
Obtaining More Parallelism
Not all update functions will modify the entire scope! Edge consistency is enough for many algorithms:
• Belief propagation: only uses edge data
• Gibbs sampling: only needs to read adjacent vertices
Edge Consistency
Obtaining More Parallelism
Full Consistency → Edge Consistency → Vertex Consistency
Vertex consistency suits "map"-style operations, e.g., feature extraction on vertex data.
Vertex Consistency
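To summarize the three scope rules, here is a sketch of how they could be realized with reader-writer locks on vertices (the enum names and the locks_for_update helper are assumptions for illustration, not the actual GraphLab implementation):

#include <cstddef>
#include <utility>
#include <vector>

enum class consistency { full, edge, vertex };
enum class lock_type { read, write };

// Sketch: the per-vertex locks an update on vertex v could take under each
// consistency model (locks would be acquired in increasing id order to
// avoid deadlock). Names and structure are illustrative only.
std::vector<std::pair<std::size_t, lock_type>> locks_for_update(
        std::size_t v,
        const std::vector<std::size_t>& neighbors,
        consistency model) {
    std::vector<std::pair<std::size_t, lock_type>> plan;
    plan.emplace_back(v, lock_type::write);          // the center vertex is always written
    if (model == consistency::vertex) return plan;   // vertex consistency: center only
    for (std::size_t nbr : neighbors) {
        // Edge consistency: adjacent vertices are only read; full consistency
        // protects the whole scope, so neighbors get write locks too.
        const lock_type t =
            (model == consistency::full) ? lock_type::write : lock_type::read;
        plan.emplace_back(nbr, t);
    }
    return plan;
}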
Sequential Consistency
GraphLab guarantees sequential consistency: for every parallel execution, there exists a sequential execution of update functions which produces the same result.
[Figure: a parallel execution across CPU 1 and CPU 2 over time, and an equivalent sequential execution on CPU 1]
The GraphLab Model: Data Graph, Shared Data Table, Update Functions and Scopes, Scheduling
Experiments
• Shared-memory implementation in C++ using Pthreads
• Tested on a 16-processor machine: 4x quad-core AMD Opteron 8384, 64 GB RAM
• Applications: belief propagation + parameter learning, Gibbs sampling, CoEM, Lasso, compressed sensing, SVM, PageRank, tensor factorization
Graphical Model Learning
3D retinal image denoising. Data graph: 256x64x64 vertices.
Update function: belief propagation.
Sync on edge potentials: Accumulate computes inference statistics; Apply takes a gradient step.
Graphical Model Learning
15.5x speedup on 16 CPUs.
[Speedup plot: splash schedule and approximate priority schedule versus the optimal line; higher is better]
Graphical Model Learning
Standard parameter learning takes a gradient step only after inference is complete (iterated inference and gradient steps: 2100 sec). With GraphLab, gradient steps are taken while inference is running (simultaneous parallel inference + gradient steps: 700 sec), about 3x faster.
Gibbs Sampling
• Two methods for sequential consistency:
  Scopes (edge scope): graphlab(gibbs, edge, sweep);
  Scheduling (graph coloring): graphlab(gibbs, vertex, colored);
[Figure: CPUs 1–3 sampling vertices of one color per time step t0–t3]
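To see why a coloring gives sequential consistency with only vertex scopes, here is a minimal greedy-coloring sketch (illustrative, not GraphLab code): vertices that share a color are never adjacent, so a whole color class can be sampled in parallel.

#include <cstddef>
#include <set>
#include <vector>

// Greedy graph coloring: assign each vertex the smallest color not used by
// an already-colored neighbor. Vertices sharing a color are non-adjacent,
// so one color class can be Gibbs-sampled in parallel with vertex scopes only.
std::vector<int> greedy_coloring(const std::vector<std::vector<std::size_t>>& adjacency) {
    std::vector<int> color(adjacency.size(), -1);
    for (std::size_t v = 0; v < adjacency.size(); ++v) {
        std::set<int> used;
        for (std::size_t nbr : adjacency[v])
            if (color[nbr] >= 0) used.insert(color[nbr]);
        int c = 0;
        while (used.count(c)) ++c;
        color[v] = c;
    }
    return color;
}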
Gibbs Sampling
• Protein-protein interaction networks [Elidan et al. 2006]
• Pairwise MRF: 14K vertices, 100K edges
• 10x speedup
• Colored scheduling reduces locking overhead
[Speedup plot: round-robin schedule versus colored schedule, compared to the optimal line; higher is better]
CoEM (Rosie Jones, 2005)
Named entity recognition task: is "dog" an animal? Is "Catalina" a place?
[Figure: bipartite graph linking noun phrases ("the dog", "Australia", "Catalina Island") to the contexts they appear in ("<X> ran quickly", "travelled to <X>", "<X> is pleasant")]
CoEM (Rosie Jones, 2005)
6x fewer CPUs! 15x faster!
[Speedup plot for the small and large datasets, compared to the optimal line; higher is better]
Lasso
L1-regularized linear regression, solved with the shooting algorithm (coordinate descent). Due to the properties of the update, full consistency is needed.
Finance dataset from Kogan et al. [2009].
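For reference, one sweep of the shooting algorithm (coordinate descent with soft-thresholding) might look like the following self-contained sketch; it is illustrative and not the GraphLab implementation:

#include <cmath>
#include <cstddef>
#include <vector>

// Soft-thresholding operator used by the Lasso shooting algorithm.
static double soft_threshold(double z, double lambda) {
    if (z >  lambda) return z - lambda;
    if (z < -lambda) return z + lambda;
    return 0.0;
}

// One full sweep of the shooting algorithm (coordinate descent) for
// min_w 0.5*||y - Xw||^2 + lambda*||w||_1, with X given row-major as x[i][j].
void shooting_sweep(const std::vector<std::vector<double>>& x,
                    const std::vector<double>& y,
                    std::vector<double>& w,
                    double lambda) {
    const std::size_t n = x.size(), d = w.size();
    for (std::size_t j = 0; j < d; ++j) {
        double rho = 0.0, norm_sq = 0.0;
        for (std::size_t i = 0; i < n; ++i) {
            // Prediction excluding feature j's contribution.
            double pred = 0.0;
            for (std::size_t k = 0; k < d; ++k)
                if (k != j) pred += x[i][k] * w[k];
            rho += x[i][j] * (y[i] - pred);
            norm_sq += x[i][j] * x[i][j];
        }
        // Closed-form coordinate update with soft-thresholding.
        w[j] = norm_sq > 0.0 ? soft_threshold(rho, lambda) / norm_sq : 0.0;
    }
}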
Full Consistency
[Speedup plot for the sparse and dense datasets, compared to the optimal line; higher is better]