GraphLab: A New Framework for Parallel Machine Learning
Yucheng Low, Aapo Kyrola, Carlos Guestrin, Joseph Gonzalez, Danny Bickson, Joe Hellerstein
Exponential Parallelism
Sequential processor performance has flattened while parallel performance keeps growing exponentially, and so does the data: 13 million Wikipedia pages, 3.6 billion photos on Flickr.
[Plot: processor speed (GHz) vs. release date, contrasting exponentially increasing sequential performance in the past, constant sequential performance today, and exponentially increasing parallel performance]
Parallel Programming is Hard
• Designing efficient parallel algorithms is hard
• Race conditions and deadlocks
• Parallel memory bottlenecks
• Architecture-specific concurrency
• Difficult to debug
• ML experts (and their graduate students) repeatedly address the same parallel design challenges
Avoid these problems by using high-level abstractions.
MapReduce – Map Phase
[Figure: independent records processed in parallel by CPU 1–CPU 4]
Embarrassingly parallel: independent computation, no communication needed.
MapReduce – Reduce Phase
[Figure: per-CPU partial results combined by CPU 1 and CPU 2]
Fold/Aggregation.
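To make the two phases concrete, here is a minimal sketch (not from the talk; the record values and the map_record function are made up) of an embarrassingly parallel map followed by a fold/aggregation using standard C++ threads:

#include <algorithm>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Hypothetical per-record computation for the map phase.
double map_record(double record) { return record * record; }

int main() {
    std::vector<double> records = {4.2, 2.1, 2.5, 1.2, 8.4, 1.8, 8.4, 2.4};
    std::vector<double> mapped(records.size());

    // Map phase: each chunk is processed independently; no communication.
    const unsigned num_cpus = 4;
    const std::size_t chunk = (records.size() + num_cpus - 1) / num_cpus;
    std::vector<std::thread> workers;
    for (unsigned c = 0; c < num_cpus; ++c) {
        workers.emplace_back([&, c] {
            const std::size_t begin = c * chunk;
            const std::size_t end = std::min(records.size(), begin + chunk);
            for (std::size_t i = begin; i < end; ++i)
                mapped[i] = map_record(records[i]);
        });
    }
    for (auto& w : workers) w.join();

    // Reduce phase: fold/aggregate the mapped values into one result.
    const double total = std::accumulate(mapped.begin(), mapped.end(), 0.0);
    return total > 0.0 ? 0 : 1;
}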
Related data, interdependent computation: not MapReduce-able.
Parallel Computing and ML
• Not all algorithms are efficiently data parallel
Data-Parallel: Feature Extraction, Cross Validation
Complex Parallel Structure: Kernel Methods, Belief Propagation, SVM, Tensor Factorization, Sampling, Deep Belief Networks, Neural Networks, Lasso
Common Properties
1) Sparse data dependencies: sparse primal SVM, tensor/matrix factorization
2) Local computations: sampling, belief propagation
3) Iterative updates: expectation maximization, optimization
Gibbs Sampling
Gibbs sampling has all three properties:
1) Sparse data dependencies
2) Local computations
3) Iterative updates
[Figure: 3x3 grid of variables X1–X9]
GraphLab is the Solution • Designed specifically for ML needs • Express data dependencies • Iterative • Simplifies the design of parallel programs: • Abstract away hardware issues • Automatic data synchronization • Addresses multiple hardware architectures • Implementation here is multi-core • Distributed implementation in progress
GraphLab: A New Framework for Parallel Machine Learning
The GraphLab Model: Data Graph, Shared Data Table, Update Functions and Scopes, Scheduling
Data Graph
A graph with data associated with every vertex and edge.
[Figure: vertices X1–X11 with attached data, e.g., x3: sample value, C(X3): sample counts, Φ(X6,X9): binary potential]
Update Functions
Update functions are operations that are applied to a vertex and transform the data in the scope of that vertex.
Gibbs update:
- Read samples on adjacent vertices
- Read edge potentials
- Compute a new sample for the current vertex
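As a rough illustration of the shape of such an update (not the actual GraphLab API; the scope struct, its fields, and gibbs_update are assumptions for this sketch):

#include <cstddef>
#include <random>
#include <vector>

// Hypothetical data attached to vertices and edges.
struct vertex_data { int sample = 0; };          // current sample in {0, 1}
struct edge_data   { double potential[2][2]; };  // binary potential Φ

// Hypothetical scope of one vertex: the vertex itself plus read access to
// the data on adjacent vertices and edges (edges[i] connects to neighbors[i]).
struct scope {
    vertex_data* center;
    std::vector<const vertex_data*> neighbors;
    std::vector<const edge_data*> edges;
};

// Gibbs update sketch: read adjacent samples and edge potentials, form the
// conditional distribution of the center vertex, and resample it.
void gibbs_update(scope& s, std::mt19937& rng) {
    double unnorm[2] = {1.0, 1.0};
    for (std::size_t i = 0; i < s.neighbors.size(); ++i) {
        const int nbr_sample = s.neighbors[i]->sample;
        for (int v = 0; v < 2; ++v)
            unnorm[v] *= s.edges[i]->potential[v][nbr_sample];
    }
    std::discrete_distribution<int> conditional{unnorm[0], unnorm[1]};
    s.center->sample = conditional(rng);
}

int main() {
    std::mt19937 rng(0);
    vertex_data a, b, c;
    edge_data ab{{{1.0, 0.5}, {0.5, 1.0}}}, bc{{{1.0, 0.5}, {0.5, 1.0}}};
    scope scope_b{&b, {&a, &c}, {&ab, &bc}};  // b's scope in the chain a - b - c
    for (int iter = 0; iter < 10; ++iter) gibbs_update(scope_b, rng);
    return b.sample;
}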
Update Function Schedule
[Figure: a shared scheduler queue of vertex updates (a, b, d, i, ...) being consumed by CPU 1 and CPU 2]
Static Schedule
The scheduler determines the order of update function evaluations.
Synchronous schedule: every vertex is updated simultaneously.
Round-robin schedule: every vertex is updated sequentially.
Need for Dynamic Scheduling
[Figure: part of the graph has converged while another part is slowly converging; effort should be focused on the slowly converging region]
Dynamic Schedule
[Figure: update functions running on CPU 1 and CPU 2 insert new vertex tasks into the scheduler queue as they execute]
Dynamic Schedule
Update functions can insert new tasks into the schedule:
• FIFO queue: Wildfire BP [Selvatici et al.]
• Priority queue: Residual BP [Elidan et al.]
• Splash schedule: Splash BP [Gonzalez et al.]
Obtain different algorithms simply by changing a flag:
--scheduler=fifo --scheduler=priority --scheduler=splash
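As a sketch of what "inserting new tasks" can look like in a residual-style schedule (illustrative only; the scheduler struct, add_task, and the toy scalar belief update are assumptions, not GraphLab's API):

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical scheduler interface: an update function can push
// follow-up tasks while it runs.
struct scheduler {
    std::vector<std::size_t> queue;  // vertex ids waiting to be updated
    void add_task(std::size_t vertex_id) { queue.push_back(vertex_id); }
};

// Residual-BP-flavored dynamic scheduling sketch: after recomputing this
// vertex's (toy, scalar) belief, reschedule its neighbors if it changed a lot.
void bp_update(std::size_t vertex_id,
               std::vector<double>& belief,
               const std::vector<std::vector<std::size_t>>& adjacency,
               scheduler& sched,
               double threshold = 1e-3) {
    const double old_belief = belief[vertex_id];
    double sum = 0.0;
    for (std::size_t nbr : adjacency[vertex_id]) sum += belief[nbr];
    const std::size_t degree = std::max<std::size_t>(1, adjacency[vertex_id].size());
    belief[vertex_id] = sum / static_cast<double>(degree);

    const double residual = std::fabs(belief[vertex_id] - old_belief);
    if (residual > threshold)
        for (std::size_t nbr : adjacency[vertex_id])
            sched.add_task(nbr);  // dynamic scheduling: insert new work
}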
Global Information
What if we need global information, such as algorithm parameters, sufficient statistics, or the sum over all vertices?
Shared Data Table (SDT)
• Global constant parameters (e.g., Constant: Temperature, Constant: Total # Samples)
Sync Operation
• Sync is a fold/reduce operation over the graph
• Accumulate performs an aggregation over vertices
• Apply makes a final modification to the accumulated data
• Example: compute the average of all the vertices
  Accumulate function: add. Apply function: divide by |V|.
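A minimal sketch of that example in C++ (illustrative; the accumulate/apply names mirror the slide, but the surrounding structure is assumed rather than the real GraphLab interface):

#include <cstddef>
#include <vector>

// Sync = fold/reduce over the graph, expressed as an accumulate step over
// vertices followed by a final apply step on the accumulated value.
struct average_sync {
    // Accumulate: fold one vertex's value into the running accumulator.
    static void accumulate(double& acc, double vertex_value) { acc += vertex_value; }
    // Apply: final modification once every vertex has been folded in.
    static double apply(double acc, std::size_t num_vertices) {
        return num_vertices ? acc / static_cast<double>(num_vertices) : 0.0;
    }
};

// Example: computing the average of all vertex values.
double run_sync(const std::vector<double>& vertex_values) {
    double acc = 0.0;
    for (double v : vertex_values) average_sync::accumulate(acc, v);
    return average_sync::apply(acc, vertex_values.size());
}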
Shared Data Table (SDT)
• Global constant parameters (Constant: Temperature, Constant: Total # Samples)
• Global computation via the Sync operation (Sync: Log-likelihood, Sync: Sample Statistics)
Write-Write Race
If adjacent update functions write to shared data simultaneously, the left and right updates race and the final value depends on which write lands last.
Race Conditions + Deadlocks
• This is just one of many possible races
• Race-free code is extremely difficult to write
The GraphLab design ensures race-free operation.
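For intuition, a minimal (intentionally racy) sketch of such a write-write race in plain threaded C++, not taken from the talk:

#include <iostream>
#include <thread>

int main() {
    double shared_value = 0.0;  // data shared by two "adjacent" updates

    // Two update functions writing to the same location with no locking:
    // this is a data race, and the final value depends on thread scheduling.
    std::thread left([&] { shared_value = 1.0; });
    std::thread right([&] { shared_value = 2.0; });
    left.join();
    right.join();

    std::cout << "final value: " << shared_value << "\n";  // nondeterministic
    return 0;
}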
Scope Rules
Full consistency: guaranteed safety for all update functions.
Full Consistency
Only update functions at least two vertices apart are allowed to run in parallel, which reduces the opportunities for parallelism.
Obtaining More Parallelism
Not all update functions will modify the entire scope! Edge consistency is enough for many algorithms:
• Belief propagation: only uses edge data
• Gibbs sampling: only needs to read adjacent vertices
Edge Consistency
Obtaining More Parallelism
Full Consistency → Edge Consistency → Vertex Consistency
Vertex consistency suits "map"-style operations, e.g., feature extraction on vertex data.
Vertex Consistency
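To summarize the three scope rules, here is a sketch of how they could be realized with reader-writer locks on vertices (the enum names and the locks_for_update helper are assumptions for illustration, not the actual GraphLab implementation):

#include <cstddef>
#include <utility>
#include <vector>

enum class consistency { full, edge, vertex };
enum class lock_type { read, write };

// Sketch: the per-vertex locks an update on vertex v could take under each
// consistency model (locks would be acquired in increasing id order to
// avoid deadlock). Names and structure are illustrative only.
std::vector<std::pair<std::size_t, lock_type>> locks_for_update(
        std::size_t v,
        const std::vector<std::size_t>& neighbors,
        consistency model) {
    std::vector<std::pair<std::size_t, lock_type>> plan;
    plan.emplace_back(v, lock_type::write);          // the center vertex is always written
    if (model == consistency::vertex) return plan;   // vertex consistency: center only
    for (std::size_t nbr : neighbors) {
        // Edge consistency: adjacent vertices are only read; full consistency
        // protects the whole scope, so neighbors get write locks too.
        const lock_type t =
            (model == consistency::full) ? lock_type::write : lock_type::read;
        plan.emplace_back(nbr, t);
    }
    return plan;
}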
Sequential Consistency
GraphLab guarantees sequential consistency: for every parallel execution, there exists a sequential execution of update functions which produces the same result.
[Figure: a parallel execution across CPU 1 and CPU 2 over time, and an equivalent sequential execution on CPU 1]
The GraphLab Model: Data Graph, Shared Data Table, Update Functions and Scopes, Scheduling
Experiments
• Shared-memory implementation in C++ using Pthreads
• Tested on a 16-processor machine: 4x quad-core AMD Opteron 8384, 64 GB RAM
• Applications: belief propagation + parameter learning, Gibbs sampling, CoEM, Lasso, compressed sensing, SVM, PageRank, tensor factorization
Graphical Model Learning
3D retinal image denoising. Data graph: 256x64x64 vertices.
Update function: belief propagation.
Sync on edge potentials: Accumulate computes inference statistics; Apply takes a gradient step.
Graphical Model Learning
15.5x speedup on 16 CPUs.
[Speedup plot: splash schedule and approximate priority schedule versus the optimal line; higher is better]
Graphical Model Learning
Standard parameter learning takes a gradient step only after inference is complete (iterated inference and gradient steps: 2100 sec). With GraphLab, gradient steps are taken while inference is running (simultaneous parallel inference + gradient steps: 700 sec), about 3x faster.
Gibbs Sampling
• Two methods for sequential consistency:
  Scopes (edge scope): graphlab(gibbs, edge, sweep);
  Scheduling (graph coloring): graphlab(gibbs, vertex, colored);
[Figure: CPUs 1–3 sampling vertices of one color per time step t0–t3]
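To see why a coloring gives sequential consistency with only vertex scopes, here is a minimal greedy-coloring sketch (illustrative, not GraphLab code): vertices that share a color are never adjacent, so a whole color class can be sampled in parallel.

#include <cstddef>
#include <set>
#include <vector>

// Greedy graph coloring: assign each vertex the smallest color not used by
// an already-colored neighbor. Vertices sharing a color are non-adjacent,
// so one color class can be Gibbs-sampled in parallel with vertex scopes only.
std::vector<int> greedy_coloring(const std::vector<std::vector<std::size_t>>& adjacency) {
    std::vector<int> color(adjacency.size(), -1);
    for (std::size_t v = 0; v < adjacency.size(); ++v) {
        std::set<int> used;
        for (std::size_t nbr : adjacency[v])
            if (color[nbr] >= 0) used.insert(color[nbr]);
        int c = 0;
        while (used.count(c)) ++c;
        color[v] = c;
    }
    return color;
}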
Gibbs Sampling
• Protein-protein interaction networks [Elidan et al. 2006]
• Pairwise MRF: 14K vertices, 100K edges
• 10x speedup
• Colored scheduling reduces locking overhead
[Speedup plot: round-robin schedule versus colored schedule, compared to the optimal line; higher is better]
CoEM (Rosie Jones, 2005)
Named entity recognition task: is "dog" an animal? Is "Catalina" a place?
[Figure: bipartite graph linking noun phrases ("the dog", "Australia", "Catalina Island") to the contexts they appear in ("<X> ran quickly", "travelled to <X>", "<X> is pleasant")]
CoEM (Rosie Jones, 2005)
6x fewer CPUs! 15x faster!
[Speedup plot for the small and large datasets, compared to the optimal line; higher is better]
Lasso
L1-regularized linear regression, solved with the shooting algorithm (coordinate descent). Due to the properties of the update, full consistency is needed.
Finance dataset from Kogan et al. [2009].
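For reference, one sweep of the shooting algorithm (coordinate descent with soft-thresholding) might look like the following self-contained sketch; it is illustrative and not the GraphLab implementation:

#include <cmath>
#include <cstddef>
#include <vector>

// Soft-thresholding operator used by the Lasso shooting algorithm.
static double soft_threshold(double z, double lambda) {
    if (z >  lambda) return z - lambda;
    if (z < -lambda) return z + lambda;
    return 0.0;
}

// One full sweep of the shooting algorithm (coordinate descent) for
// min_w 0.5*||y - Xw||^2 + lambda*||w||_1, with X given row-major as x[i][j].
void shooting_sweep(const std::vector<std::vector<double>>& x,
                    const std::vector<double>& y,
                    std::vector<double>& w,
                    double lambda) {
    const std::size_t n = x.size(), d = w.size();
    for (std::size_t j = 0; j < d; ++j) {
        double rho = 0.0, norm_sq = 0.0;
        for (std::size_t i = 0; i < n; ++i) {
            // Prediction excluding feature j's contribution.
            double pred = 0.0;
            for (std::size_t k = 0; k < d; ++k)
                if (k != j) pred += x[i][k] * w[k];
            rho += x[i][j] * (y[i] - pred);
            norm_sq += x[i][j] * x[i][j];
        }
        // Closed-form coordinate update with soft-thresholding.
        w[j] = norm_sq > 0.0 ? soft_threshold(rho, lambda) / norm_sq : 0.0;
    }
}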
Full Consistency
[Speedup plot for the sparse and dense datasets, compared to the optimal line; higher is better]