The Next Generation of the GraphLab Abstraction
Joseph Gonzalez, joint work with Yucheng Low, Aapo Kyrola, Danny Bickson, Carlos Guestrin, Guy Blelloch, Joe Hellerstein, David O’Hallaron, and Alex Smola


Presentation Transcript


  1. The Next Generation of the GraphLab Abstraction. Joseph Gonzalez, joint work with Yucheng Low, Aapo Kyrola, Danny Bickson, Carlos Guestrin, Guy Blelloch, Joe Hellerstein, David O’Hallaron, and Alex Smola.

  2. How will we design and implement parallel learning systems?

  3. ... a popular answer: Map-Reduce / Hadoop. Build learning algorithms on top of high-level parallel abstractions.

  4. MapReduce – Map Phase. [Figure: image feature vectors partitioned across CPU 1–4.] Embarrassingly parallel, independent computation; no communication needed.

  5. MapReduce – Map Phase. [Figure: each CPU independently extracts image features from its next batch of images.]

  6. MapReduce – Map Phase. [Figure: feature extraction continues independently on each CPU.] Embarrassingly parallel, independent computation; no communication needed.

  7. MapReduce – Reduce Phase. [Figure: CPU 1 and CPU 2 aggregate the per-image features into attractive-face and ugly-face statistics.]
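
The numbers scattered on slides 4–7 stand in for per-image feature vectors being extracted and then aggregated. Below is a minimal, single-process sketch of that map/reduce pattern; the FeatureVector type, the labels, and the mean-statistic reducer are illustrative assumptions, not Hadoop code or anything taken from the deck.

    #include <iostream>
    #include <map>
    #include <string>
    #include <utility>
    #include <vector>

    struct FeatureVector { double brightness; double symmetry; };

    // Map: in a real job this would turn raw pixels into features; here it
    // simply tags an already-extracted feature vector with its label as the key.
    std::pair<std::string, FeatureVector> map_image(const std::string& label,
                                                    const FeatureVector& features) {
      return {label, features};
    }

    // Reduce: aggregate all feature vectors sharing a key into a mean statistic.
    FeatureVector reduce_features(const std::vector<FeatureVector>& vals) {
      FeatureVector mean{0.0, 0.0};
      for (const FeatureVector& v : vals) {
        mean.brightness += v.brightness;
        mean.symmetry += v.symmetry;
      }
      mean.brightness /= vals.size();
      mean.symmetry /= vals.size();
      return mean;
    }

    int main() {
      std::vector<std::pair<std::string, FeatureVector>> images = {
          {"attractive", {4.2, 3.2}}, {"attractive", {1.3, 2.5}},
          {"ugly", {8.4, 3.1}},       {"ugly", {8.4, 4.1}}};

      // Map phase: embarrassingly parallel, no communication needed.
      std::map<std::string, std::vector<FeatureVector>> shuffled;
      for (const auto& [label, img] : images)
        shuffled[map_image(label, img).first].push_back(img);

      // Reduce phase: one aggregation per key.
      for (const auto& [label, vals] : shuffled) {
        FeatureVector stats = reduce_features(vals);
        std::cout << label << " faces: " << stats.brightness << ", "
                  << stats.symmetry << "\n";
      }
      return 0;
    }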

  8. Map-Reduce for Data-Parallel ML. Excellent for large data-parallel tasks! But is there more to machine learning? Data-Parallel (Map Reduce): feature extraction, cross validation, computing sufficient statistics. Graph-Parallel: label propagation, Lasso, belief propagation, kernel methods, tensor factorization, PageRank, neural networks, deep belief networks.

  9. Concrete Example: Label Propagation.

  10. Label Propagation Algorithm. Social Arithmetic: my interest estimate is a weighted sum of my profile and my friends' estimates: 50% what I list on my profile (50% cameras, 50% biking), 40% what Sue Ann likes (80% cameras, 20% biking), and 10% what Carlos likes (30% cameras, 70% biking), giving "I Like: 60% cameras, 40% biking". Recurrence Algorithm: each user's estimate is the weighted sum of their neighbors' estimates; iterate until convergence. Parallelism: compute all Likes[i] in parallel.
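
As a quick check of the social arithmetic above, here is the weighted sum worked out in a few lines of C++. The weights and estimates are the ones on the slide; the code itself is only a sketch.

    #include <cstdio>

    int main() {
      double w_profile = 0.5, w_sueann = 0.4, w_carlos = 0.1;
      double profile[2] = {0.5, 0.5};   // {cameras, biking} from my profile
      double sueann[2]  = {0.8, 0.2};   // Sue Ann's current estimate
      double carlos[2]  = {0.3, 0.7};   // Carlos's current estimate
      double me[2];
      for (int k = 0; k < 2; ++k)
        me[k] = w_profile * profile[k] + w_sueann * sueann[k] + w_carlos * carlos[k];
      // Prints: I like: 60% cameras, 40% biking
      std::printf("I like: %.0f%% cameras, %.0f%% biking\n", 100 * me[0], 100 * me[1]);
      return 0;
    }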

  11. Properties of Graph-Parallel Algorithms: a dependency graph, factored computation (what I like depends on what my friends like), and iterative computation.

  12. Map-Reduce for Data-Parallel ML. Excellent for large data-parallel tasks! Data-Parallel (Map Reduce): feature extraction, cross validation, computing sufficient statistics. Graph-Parallel (Map Reduce?): label propagation, Lasso, belief propagation, kernel methods, tensor factorization, PageRank, neural networks, deep belief networks.

  13. Why not use Map-Reduce for Graph-Parallel Algorithms?

  14. Data Dependencies. Map-Reduce does not efficiently express dependent data: the user must code substantial data transformations, and the result is costly data replication. (Map-Reduce assumes independent data rows.)

  15. Iterative Algorithms. Map-Reduce does not efficiently express iterative algorithms. [Figure: data is re-partitioned across CPU 1–3 at every iteration, and a barrier at the end of each iteration forces everyone to wait for the slowest processor.]

  16. MapAbuse: Iterative MapReduce. Often only a subset of the data needs computation in each iteration, yet every map/reduce pass touches all of it. [Figure: iterations separated by barriers, with all data blocks reprocessed each time.]

  17. MapAbuse: Iterative MapReduce. The system is not optimized for iteration: each pass pays a job startup penalty and a disk penalty for re-reading and re-writing the data. [Figure: iterations separated by barriers, with startup and disk penalties at each stage.]
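
A sketch of what slides 15–17 describe: an iterative algorithm driven by repeated MapReduce jobs. The driver functions here (read_all_from_disk, run_mapreduce_job, and so on) are hypothetical stubs, not a real Hadoop API; they exist only to make the startup, disk, and barrier penalties visible in code.

    #include <vector>

    struct Record { double value = 0.0; };

    // Hypothetical stand-ins for a real driver, stubbed so the sketch compiles.
    std::vector<Record> read_all_from_disk() { return std::vector<Record>(8); }
    void write_all_to_disk(const std::vector<Record>&) {}
    std::vector<Record> run_mapreduce_job(const std::vector<Record>& in) { return in; }
    bool converged(const std::vector<Record>&, const std::vector<Record>&) { return true; }

    void iterate_until_convergence() {
      std::vector<Record> data = read_all_from_disk();       // disk penalty
      while (true) {
        // Each iteration pays the job startup penalty, touches ALL records even
        // if only a few still need work, and ends at a barrier where fast
        // workers wait for the slowest processor.
        std::vector<Record> next = run_mapreduce_job(data);  // startup penalty
        write_all_to_disk(next);                             // disk penalty
        if (converged(data, next)) break;                    // barrier, then test
        data = read_all_from_disk();                         // disk penalty again
      }
    }

    int main() { iterate_until_convergence(); }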

  18. Map-Reduce for Data-Parallel ML. Excellent for large data-parallel tasks! Data-Parallel (Map Reduce): feature extraction, cross validation, computing sufficient statistics. Graph-Parallel (Pregel/Giraph? Map Reduce?): SVM, Lasso, belief propagation, kernel methods, tensor factorization, PageRank, neural networks, deep belief networks.

  19. Pregel (Giraph). Bulk Synchronous Parallel model: compute, communicate, barrier.
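
A minimal sketch of the Bulk Synchronous Parallel pattern behind Pregel/Giraph: each superstep is compute on all active vertices, exchange messages, then a global barrier. The Vertex, Message, and compute() interface below is an illustrative assumption, not the Pregel or Giraph API.

    #include <cstddef>
    #include <vector>

    struct Message { int dst; double value; };

    struct Vertex {
      bool active = true;
      double state = 0.0;
      // User compute(): read the inbox, update state, emit messages, or halt.
      std::vector<Message> compute(const std::vector<Message>& inbox) {
        for (const Message& m : inbox) state += m.value;
        if (inbox.empty()) active = false;   // vote to halt when nothing arrives
        return {};                           // no outgoing messages in this toy
      }
    };

    void run_bsp(std::vector<Vertex>& vertices) {
      std::vector<std::vector<Message>> inbox(vertices.size());
      std::vector<std::vector<Message>> next_inbox(vertices.size());
      bool any_active = true;
      while (any_active) {
        // Compute phase: every active vertex runs once per superstep.
        for (std::size_t v = 0; v < vertices.size(); ++v)
          if (vertices[v].active)
            for (const Message& m : vertices[v].compute(inbox[v]))
              next_inbox[m.dst].push_back(m);   // communicate phase
        // Barrier: no vertex starts superstep t+1 until all finish superstep t.
        inbox.swap(next_inbox);
        for (auto& box : next_inbox) box.clear();
        any_active = false;
        for (const Vertex& v : vertices) any_active = any_active || v.active;
      }
    }

    int main() {
      std::vector<Vertex> g(4);
      run_bsp(g);
      return 0;
    }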

  20. Problem: bulk synchronous computation can be highly inefficient. Example: loopy belief propagation.

  21. Problem with Bulk Synchronous. Example algorithm: if a neighbor is red, turn red. Bulk synchronous computation evaluates the condition on all vertices in every phase: 4 phases with 9 computations each gives 36 computations. Asynchronous (wave-front) computation evaluates the condition only when a neighbor changes: 4 phases with 2 computations each gives 8 computations. [Figure: the red front spreading across the graph from time 0 to time 4.]
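
A sketch of the two evaluation strategies for the rule "if a neighbor is red, turn red", on an assumed 9-vertex chain. The exact counts depend on the graph and on what is counted as one evaluation, so they will not match the slide's 36 vs. 8, but the bulk synchronous sweep does strictly more work than the asynchronous wave-front.

    #include <cstddef>
    #include <cstdio>
    #include <deque>
    #include <vector>

    using Graph = std::vector<std::vector<int>>;   // adjacency lists

    // Bulk synchronous: every phase re-evaluates the rule on every vertex.
    int sweep_all(std::vector<bool> red, const Graph& adj) {
      int evaluations = 0;
      bool changed = true;
      while (changed) {
        changed = false;
        std::vector<bool> next = red;
        for (std::size_t v = 0; v < adj.size(); ++v) {
          ++evaluations;                       // counted even if nothing changes
          for (int u : adj[v]) if (red[u]) next[v] = true;
          if (next[v] != red[v]) changed = true;
        }
        red = next;
      }
      return evaluations;
    }

    // Asynchronous wave-front: evaluate a vertex only when a neighbor changed.
    int wavefront(std::vector<bool> red, const Graph& adj) {
      int evaluations = 0;
      std::deque<int> work;
      for (std::size_t v = 0; v < adj.size(); ++v) if (red[v]) work.push_back(v);
      while (!work.empty()) {
        int v = work.front(); work.pop_front();
        for (int u : adj[v]) {
          ++evaluations;
          if (!red[u]) { red[u] = true; work.push_back(u); }  // reschedule on change
        }
      }
      return evaluations;
    }

    int main() {
      // A chain of 9 vertices with vertex 0 initially red.
      Graph adj(9);
      for (int v = 0; v + 1 < 9; ++v) { adj[v].push_back(v + 1); adj[v + 1].push_back(v); }
      std::vector<bool> red(9, false);
      red[0] = true;
      std::printf("bulk synchronous evaluations: %d\n", sweep_all(red, adj));
      std::printf("wave-front evaluations:       %d\n", wavefront(red, adj));
      return 0;
    }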

  22. Data-Parallel Algorithms Can Be Inefficient. [Figure: runtime comparison of optimized in-memory bulk synchronous BP versus asynchronous Splash BP.] The limitations of the Map-Reduce abstraction can lead to inefficient parallel algorithms.

  23. The Need for a New Abstraction. Map-Reduce is not well suited for graph-parallelism. Data-Parallel (Map Reduce): feature extraction, cross validation, computing sufficient statistics. Graph-Parallel (Pregel/Giraph): belief propagation, kernel methods, SVM, tensor factorization, PageRank, Lasso, neural networks, deep belief networks.

  24. What is GraphLab?

  25. The GraphLab Framework: graph-based data representation, update functions (user computation), scheduler, and consistency model.

  26. Data Graph. A graph with arbitrary data (C++ objects) associated with each vertex and edge. Graph: a social network. Vertex data: user profile text and current interest estimates. Edge data: similarity weights.
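
A sketch of what the vertex and edge data for this example might look like as C++ objects. The commented graph-construction calls are assumptions about a GraphLab-like API, not code from a specific release.

    #include <string>
    #include <vector>

    // Vertex data: arbitrary C++ object stored on each user in the social network.
    struct vertex_data {
      std::string profile_text;           // user profile text
      std::vector<double> interests;      // current interest estimates, e.g. {cameras, biking}
    };

    // Edge data: arbitrary C++ object stored on each friendship edge.
    struct edge_data {
      double similarity;                  // similarity weight W_ij between two users
    };

    int main() {
      // With a GraphLab-style graph type this would look roughly like:
      //   graphlab::graph<vertex_data, edge_data> g;
      //   auto me  = g.add_vertex(vertex_data{"...", {0.5, 0.5}});
      //   auto sue = g.add_vertex(vertex_data{"...", {0.8, 0.2}});
      //   g.add_edge(me, sue, edge_data{0.4});
      // (the method names above are assumptions, not a specific GraphLab release)
      vertex_data me{"loves photography", {0.5, 0.5}};
      vertex_data sue{"avid cyclist", {0.8, 0.2}};
      edge_data w{0.4};
      (void)me; (void)sue; (void)w;
      return 0;
    }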

  27. Update Functions. An update function is a user-defined program which, when applied to a vertex, transforms the data in the scope of the vertex.

    label_prop(i, scope) {
      // Get neighborhood data
      (Likes[i], Wij, Likes[j]) ← scope;
      // Update the vertex data
      // Reschedule neighbors if needed
      if Likes[i] changes then
        reschedule_neighbors_of(i);
    }
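
A filled-in, runnable sketch of the label_prop update above, with the vertex-update body spelled out as the weighted sum from slide 10. The Neighbor type and the convention of returning "should reschedule" to the caller are assumptions, not the GraphLab scope API.

    #include <algorithm>
    #include <cmath>
    #include <cstddef>
    #include <cstdio>
    #include <vector>

    struct Neighbor { double weight; std::vector<double> likes; };

    // Returns true if vertex i changed enough that the caller should invoke
    // reschedule_neighbors_of(i) on the scheduler.
    bool label_prop_update(std::vector<double>& likes_i,
                           const std::vector<Neighbor>& neighbors,
                           double epsilon = 1e-5) {
      std::vector<double> old = likes_i;
      std::fill(likes_i.begin(), likes_i.end(), 0.0);
      for (const Neighbor& n : neighbors)            // gather weighted neighbor estimates
        for (std::size_t k = 0; k < likes_i.size(); ++k)
          likes_i[k] += n.weight * n.likes[k];
      double change = 0.0;
      for (std::size_t k = 0; k < likes_i.size(); ++k)
        change += std::fabs(likes_i[k] - old[k]);
      return change > epsilon;
    }

    int main() {
      // Vertex "me", with the profile treated as a fixed pseudo-neighbor (slide 10).
      std::vector<double> me = {0.5, 0.5};
      std::vector<Neighbor> nbrs = {{0.5, {0.5, 0.5}},    // my profile
                                    {0.4, {0.8, 0.2}},    // Sue Ann
                                    {0.1, {0.3, 0.7}}};   // Carlos
      bool changed = label_prop_update(me, nbrs);
      std::printf("Likes[me] = {%.2f, %.2f}, reschedule = %d\n", me[0], me[1], changed);
      return 0;
    }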

  28. The Scheduler. The scheduler determines the order in which vertices are updated. [Figure: a queue of scheduled vertices feeding CPU 1 and CPU 2; updated vertices may push their neighbors back onto the queue.] The process repeats until the scheduler is empty.
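
A sketch of that loop: pull the next scheduled vertex, run the update on it, and push whatever the update wants rescheduled. The FIFO queue here is only one possible scheduling policy, and the interface is an assumption rather than GraphLab's scheduler API.

    #include <deque>
    #include <vector>

    using vertex_id = int;

    // The update function returns the neighbors it wants rescheduled.
    using update_fn = std::vector<vertex_id> (*)(vertex_id);

    void run_scheduler(std::deque<vertex_id> schedule, update_fn update) {
      while (!schedule.empty()) {            // repeat until the scheduler is empty
        vertex_id v = schedule.front();
        schedule.pop_front();
        for (vertex_id u : update(v))        // the update may reschedule neighbors
          schedule.push_back(u);
      }
    }

    // Toy update: vertices below 3 reschedule their successor once.
    std::vector<vertex_id> toy_update(vertex_id v) {
      return v < 3 ? std::vector<vertex_id>{v + 1} : std::vector<vertex_id>{};
    }

    int main() {
      run_scheduler({0, 5, 2}, &toy_update);
      return 0;
    }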

  29. The GraphLab Framework: graph-based data representation, update functions (user computation), scheduler, and consistency model.

  30. Ensuring Race-Free Code. How much can computation overlap?

  31. GraphLab Ensures Sequential Consistency. For each parallel execution, there exists a sequential execution of update functions which produces the same result. [Figure: a parallel schedule on CPU 1 and CPU 2 alongside an equivalent sequential schedule on a single CPU.]

  32. Consistency Rules. [Figure: update-function scopes over the data graph.] Full consistency guarantees sequential consistency for all update functions.

  33. Full Consistency. [Figure: full-consistency scopes; concurrently executing update functions may not share any vertex or edge data.]

  34. Obtaining More Parallelism. [Figure: relaxing from full consistency to edge consistency allows more update functions to run concurrently.]

  35. Edge Consistency. [Figure: under edge consistency, CPU 1 and CPU 2 can safely read adjacent vertices while each updates its own center vertex and adjacent edges.]

  36. Consistency Through R/W Locks. Read/write locks implement both models: full consistency write-locks the entire scope, while edge consistency write-locks the center vertex and read-locks the adjacent vertices. Locks are acquired in a canonical ordering to avoid deadlock.
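
A sketch of edge consistency via read/write locks with a canonical ordering, under the assumptions that vertices are identified by integer ids and that "canonical" means increasing id order; the locking discipline mirrors the slide, but the code is not GraphLab's implementation.

    #include <algorithm>
    #include <shared_mutex>
    #include <vector>

    // Write-lock the center vertex, read-lock its neighbors, always acquiring in
    // increasing vertex-id order (the canonical ordering) to avoid deadlock.
    void lock_edge_consistent_scope(int center, std::vector<int> neighbors,
                                    std::vector<std::shared_mutex>& locks) {
      neighbors.push_back(center);
      std::sort(neighbors.begin(), neighbors.end());   // canonical lock ordering
      for (int v : neighbors) {
        if (v == center) locks[v].lock();              // write lock on the center
        else             locks[v].lock_shared();       // read locks on neighbors
      }
    }

    void unlock_edge_consistent_scope(int center, const std::vector<int>& neighbors,
                                      std::vector<std::shared_mutex>& locks) {
      for (int v : neighbors) locks[v].unlock_shared();
      locks[center].unlock();
    }

    int main() {
      std::vector<std::shared_mutex> locks(8);
      lock_edge_consistent_scope(3, {1, 5, 7}, locks);
      // ... run the update function on vertex 3's scope ...
      unlock_edge_consistent_scope(3, {1, 5, 7}, locks);
      return 0;
    }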

  37. GraphLab for Natural Graphs: the Achilles' heel.

  38. Problem: High-Degree Vertices. Graphs with high-degree vertices are common: power-law graphs (social networks), which affect algorithms like label propagation, and probabilistic graphical models, where hyper-parameters couple large sets of data and connectivity structure is induced by natural phenomena. High-degree vertices kill parallelism: they pull a large amount of state, require heavy locking, and are processed sequentially.

  39. Four Proposed Solutions. (1) Associative commutative update functors: transition from stateless to stateful update functions. (2) Decomposable update functors: expose greater parallelism by further factoring update functions. (3) Abelian group caching (concurrent revisions): allows for controllable races through diff operations. (4) Stochastic scopes: reduce degree through sampling.

  40. Associative Commutative Update Functors. Associative, commutative functions, i.e., 1 + (2 + 3) = (1 + 2) + 3. Subsumes message passing, the traveler model, and Pregel (Giraph); function composition is equivalent to a combiner in Pregel.

    LabelPropUF(i, scope, Δ) {
      // Get neighborhood data
      (OldLikes[i], Likes[i], Wij) ← scope;
      // Update the vertex data
      Likes[i] ← Likes[i] + Δ;
      // Reschedule neighbors if needed
      if |Likes[i] - OldLikes[i]| > ε then
        for all neighbors j:
          Δ ← (Likes[i] - OldLikes[i]) * Wij;
          reschedule(j, Δ);
        OldLikes[i] ← Likes[i];
    }

Update functor composition is achieved by defining an operator+ for update functors, and the composition of update functors may be computed in parallel.
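
A sketch of such an update functor in C++, showing the state (an accumulated delta) and the operator+ used to compose two functors queued for the same vertex. The interface is an assumption for illustration, not GraphLab's functor API.

    #include <cstddef>
    #include <vector>

    // Stateful update functor for label propagation. Functors queued for the same
    // vertex can be merged with +=, which is what makes composition,
    // de-duplication, and parallel combination possible.
    struct label_prop_functor {
      std::vector<double> delta;   // accumulated change pushed from neighbors

      label_prop_functor& operator+=(const label_prop_functor& other) {
        if (delta.size() < other.delta.size()) delta.resize(other.delta.size(), 0.0);
        for (std::size_t k = 0; k < other.delta.size(); ++k)
          delta[k] += other.delta[k];        // associative and commutative
        return *this;
      }

      // Applying the functor adds the accumulated delta to the vertex estimate;
      // the caller decides whether to reschedule neighbors with fresh deltas.
      void apply(std::vector<double>& likes_i) const {
        for (std::size_t k = 0; k < delta.size() && k < likes_i.size(); ++k)
          likes_i[k] += delta[k];
      }
    };

    int main() {
      label_prop_functor a{{0.10, -0.10}};
      label_prop_functor b{{0.02, 0.05}};
      a += b;                                // composed before the vertex runs once
      std::vector<double> likes = {0.6, 0.4};
      a.apply(likes);
      return 0;
    }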

  41. Advantages of Update Functors. Eliminates the need to read neighbor values and reduces locking overhead. Enables task de-duplication and hinting: the user-defined (+) can choose to eliminate or compose functors, and can dynamically control execution parameters such as priority(), consistency_level(), and is_decomposable(). Needed to enable further extensions (decomposable update functors, etc.). Simplifies schedulers: many update functions become a single update functor.

  42. Disadvantages of Update Functors. Typically constructed sequentially and requires access to edge weights (problems that exist in Pregel/Giraph as well). The change required substantial engineering effort and affected 2/3 of the GraphLab library.

    lpF(i, scope, Δ) {
      // Update the vertex data
      Likes[i] ← Likes[i] + Δ;
      // Reschedule neighbors if needed
      if |Likes[i] - OldLikes[i]| > ε then
        for all neighbors j:
          Δ ← (Likes[i] - OldLikes[i]) * Wij;
          reschedule(j, Δ);
        OldLikes[i] ← Likes[i];
    }

  43. Decomposable Update Functors: breaking computation over the edges of the graph.

  44. Decomposable Update Functors. Decompose update functions into three user-defined phases: Gather (computed per edge, producing a Δ, combined with a parallel sum Δ1 + Δ2 + ... + Δn), Apply (apply the accumulated value Δ to the center vertex), and Scatter (update adjacent edges and vertices). Locks are acquired only for the region within each phase's scope, giving relaxed consistency.

  45. Decomposable Update Functors. Implementing label propagation with factorized update functions: implemented using the update functor's state as the accumulator, and using the same (+) operator to merge partial gathers.

    Gather(i, j, scope) {
      // Get neighborhood data (Likes[i], Wij, Likes[j]) from scope
      // Emit accumulator
      emit Wij * Likes[j] as Δ;
    }

    Apply(i, scope, Δ) {
      // Get neighborhood data (Likes[i]) from scope
      // Update the vertex
      Likes[i] ← Δ;
    }

    Scatter(i, scope) {
      // Get neighborhood data (Likes[i], Wij, Likes[j]) from scope
      // Reschedule if changed
      if Likes[i] changed then reschedule(j);
    }
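
A C++ sketch of the same gather/apply/scatter decomposition, with an explicit accumulator type whose (+) merges partial gathers. The signatures and names are illustrative assumptions; only the three-phase structure and the merge-by-(+) idea come from the slides.

    #include <cstddef>
    #include <vector>

    using Estimate = std::vector<double>;

    // Accumulator merged with (+), mirroring the slide's parallel sum of partial gathers.
    struct Accum {
      Estimate sum;
      Accum operator+(const Accum& other) const {
        Accum out = *this;
        if (out.sum.size() < other.sum.size()) out.sum.resize(other.sum.size(), 0.0);
        for (std::size_t k = 0; k < other.sum.size(); ++k) out.sum[k] += other.sum[k];
        return out;
      }
    };

    // Gather: evaluated once per edge, possibly in parallel.
    Accum gather(double w_ij, const Estimate& likes_j) {
      Accum a{Estimate(likes_j.size())};
      for (std::size_t k = 0; k < likes_j.size(); ++k) a.sum[k] = w_ij * likes_j[k];
      return a;
    }

    // Apply: the merged accumulator becomes the center vertex's new estimate.
    void apply(Estimate& likes_i, const Accum& acc) { likes_i = acc.sum; }

    // Scatter: evaluated per edge; decide whether the neighbor should be rescheduled.
    bool scatter_reschedule(const Estimate& old_likes, const Estimate& new_likes,
                            double epsilon = 1e-5) {
      double change = 0.0;
      for (std::size_t k = 0; k < old_likes.size(); ++k)
        change += new_likes[k] > old_likes[k] ? new_likes[k] - old_likes[k]
                                              : old_likes[k] - new_likes[k];
      return change > epsilon;
    }

    int main() {
      Estimate me = {0.5, 0.5}, old_me = me;
      Accum acc = gather(0.5, {0.5, 0.5}) + gather(0.4, {0.8, 0.2}) + gather(0.1, {0.3, 0.7});
      apply(me, acc);
      bool resched = scatter_reschedule(old_me, me);
      (void)resched;
      return 0;
    }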

  46. Decomposable Functors. Fits many algorithms: loopy belief propagation, label propagation, PageRank, and more. Addresses the earlier concerns: large state becomes a parallel gather and scatter, heavy locking becomes fine-grained locking, and sequential processing becomes distributed gather and scatter. Problem: high-degree vertices will still get locked frequently and will need to be up to date.

  47. Abelian Group Caching: enabling eventually consistent data races.

  48. Abelian Group Caching. Issue: all earlier methods maintain a sequentially consistent view of data across all processors. Proposal: try to split the data instead of the computation. How can we split the graph without changing the update function? [Figure: a high-degree vertex replicated across CPU 1–4.]

  49. Insight from the WSDM Paper. Answer: allow eventually consistent data races. High-degree vertices admit slightly "stale" values: changes in a few elements have a negligible effect. High-degree vertex updates are typically a form of "sum" operation, which has an "inverse" (examples: counts, averages, sufficient statistics; counter-example: max). Goal: lazily synchronize duplicate data, similar to a version control system; intermediate values are partially consistent, but the final value at termination must be consistent.
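
A sketch of the caching scheme this suggests for a single "sum-like" value: each processor works against a local replica and lazily reconciles with a master by pushing the difference since its last synchronization. The MasterValue and CachedReplica types are illustrative assumptions, not the WSDM paper's or GraphLab's implementation.

    #include <mutex>

    // Master copy of a high-degree vertex's value; accepts deltas from replicas.
    struct MasterValue {
      double value = 0.0;
      std::mutex m;
      double merge(double delta) {            // apply a replica's delta, return the fresh value
        std::lock_guard<std::mutex> g(m);
        value += delta;
        return value;
      }
    };

    // Per-processor cached replica: updated freely on its own processor,
    // reconciled lazily with the master.
    struct CachedReplica {
      double cached = 0.0;      // possibly stale local copy
      double snapshot = 0.0;    // what the cache held at the last synchronization

      void add(double x) { cached += x; }     // local update, no global locking

      void sync(MasterValue& master) {
        double delta = cached - snapshot;     // "diff" computed with the group inverse
        cached = master.merge(delta);         // push our changes, pull everyone else's
        snapshot = cached;
      }
    };

    int main() {
      MasterValue master;
      master.value = 10.0;
      CachedReplica p1{10.0, 10.0}, p2{10.0, 10.0};
      p1.add(3.0);              // processor 1 works against its stale copy
      p2.add(-1.0);             // processor 2 does the same concurrently
      p1.sync(master);          // master becomes 13
      p2.sync(master);          // master becomes 12; values agree at termination
      return 0;
    }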

  50. Example. Every processor initially has a copy of the same central value. [Figure: a master value of 10 with current copies of 10 on processors 1–3.]
