
Optimizing Graph Algorithms for Improved Cache Performance


Presentation Transcript


  1. Optimizing Graph Algorithms for Improved Cache Performance Aya Mire & Amir Nahir Based on: Optimizing Graph Algorithms for Improved Cache Performance – Joon-Sang Park, Michael Penner, Viktor K. Prasanna

  2. The Problem with Graphs… Graph problems pose unique challenges to improving cache performance due to their irregular data access patterns.

  3. Agenda • A recursive implementation of the Floyd-Warshall Algorithm. • A tiled implementation of the Floyd-Warshall Algorithm. • Efficient data structures for general graph problems. • Optimizations for the maximum matching algorithm.

  4. Analysis model • All proofs and complexity analysis will be based on the I/O model, i.e., the goal of the improved algorithm is to minimize the number of cpu-memory transactions. (Figure: the hierarchy CPU – Cache – Main Memory, with transfer costs labeled A, B, C satisfying cost(A) ≪ cost(B) and cost(C) ≪ cost(B); cache-memory transfers are the expensive ones.)

  5. Analysis model All proofs will assume total control of the cache, i.e., if the cache is big enough to hold two data blocks, then the two can be held in the cache without evicting each other (no conflict misses).

  6. The Floyd Warshall Algorithm • An ‘all pairs shortest path’ algorithm. • Works by iteratively calculating D(k), where D(k) is the matrix of all-pairs shortest paths using only intermediate vertices from {1, 2, … k}. • Each iteration depends on the result of the previous one. • Time complexity: Θ(|V|^3).

  7. The Floyd Warshall Algorithm Pseudo Code:
  for k from 1 to |V|
    for i from 1 to |V|
      for j from 1 to |V|
        Di,j(k) ← min {Di,j(k-1), Di,k(k-1) + Dk,j(k-1)}
  return D(|V|)
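The pseudo code above can be written directly as a short Python sketch (illustrative, not part of the slides; the function name and the use of float('inf') for missing edges are choices made here):

```python
# Minimal in-place Floyd-Warshall. `dist` is a square list-of-lists with
# dist[i][j] = w(i,j), float('inf') if there is no edge, 0 on the diagonal.
def floyd_warshall(dist):
    n = len(dist)
    for k in range(n):              # allow vertex k as an intermediate
        for i in range(n):
            dik = dist[i][k]        # row k / column k do not change in pass k
            for j in range(n):
                # D_ij(k) = min(D_ij(k-1), D_ik(k-1) + D_kj(k-1))
                if dik + dist[k][j] < dist[i][j]:
                    dist[i][j] = dik + dist[k][j]
    return dist
```

Working in place on a single matrix is valid because entries in row k and column k cannot improve during pass k.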

  8. The Floyd Warshall Algorithm The algorithm accesses the entire matrix in each iteration. The dependency of the kth iteration on the results of the (k-1)th iteration eliminates the ability to reuse data across iterations.

  9. Lemma 1 The traditional computation is Di,j(k) ← min {Di,j(k-1), Di,k(k-1) + Dk,j(k-1)}. Lemma 1: Suppose Di,j(k) is instead computed as Di,j(k) ← min {Di,j(k-1), Di,k(k’) + Dk,j(k’’)} for k-1 ≤ k’, k’’ ≤ |V|; then upon termination the FW algorithm correctly computes the all-pairs shortest paths.

  10. Lemma 1 - Proof (Lemma 1: Suppose Di,j(k) is computed as Di,j(k) ← min {Di,j(k-1), Di,k(k’) + Dk,j(k’’)} for k-1 ≤ k’, k’’ ≤ |V|; then upon termination the FW algorithm correctly computes the all-pairs shortest paths.) To distinguish from the traditional FW algorithm, we use Ti,j(k) to denote the results calculated using the “new” computation rule: ⇒ Ti,j(k) ← min {Ti,j(k-1), Ti,k(k’) + Tk,j(k’’)} for k-1 ≤ k’, k’’ ≤ |V|

  11. Lemma 1 - Proof First, we’ll show that for 1 ≤ k ≤ |V| the following inequality holds: Ti,j(k) ≤ Di,j(k). We prove this by induction. Base case: by definition we have Ti,j(0) = Di,j(0).

  12. Lemma 1 - Proof Claim: for 1 ≤ k ≤ |V|: Ti,j(k) ≤ Di,j(k), where Ti,j(k) ← min {Ti,j(k-1), Ti,k(k’) + Tk,j(k’’)}. Induction step: suppose Ti,j(k) ≤ Di,j(k) holds for k = m-1. Then:
  Ti,j(m) ← min {Ti,j(m-1), Ti,m(m’) + Tm,j(m’’)}
  ≤ min {Di,j(m-1), Ti,m(m’) + Tm,j(m’’)}   (by the induction hypothesis)
  ≤ min {Di,j(m-1), Ti,m(m-1) + Tm,j(m-1)}   (limiting the choices of intermediate vertices makes a path the same or longer, so Ti,m(m’) ≤ Ti,m(m-1))
  ≤ min {Di,j(m-1), Di,m(m-1) + Dm,j(m-1)}   (by the induction hypothesis)
  = Di,j(m)   (by definition)

  13. Lemma 1 - Proof We have shown that for 1 ≤ k ≤ |V|: Ti,j(k) ≤ Di,j(k). On the other hand, since the traditional algorithm computes the shortest paths at termination, and since Ti,j(|V|) is the length of some actual path, we have Di,j(|V|) ≤ Ti,j(|V|). ⇒ Di,j(|V|) = Ti,j(|V|), which proves Lemma 1.

  14. FW’s Algorithm – Recursive Implementation We first consider the base case of a two-node graph (a figure here showed two nodes, 1 and 2, joined by edges of weights w1 and w2):
  Floyd-Warshall (T){
    T11 = min {T11, T11 + T11}
    T12 = min {T12, T11 + T12}
    T21 = min {T21, T21 + T11}
    T22 = min {T22, T21 + T12}
    T22 = min {T22, T22 + T22}
    T21 = min {T21, T22 + T21}
    T12 = min {T12, T12 + T22}
    T11 = min {T11, T12 + T21}
  }

  15. FW’s Algorithm – Recursive Implementation The general case, with the matrix split into quadrants I, II, III, IV:
  Floyd-Warshall (T){
    If (not base case){
      TI = min {TI , TI + TI}
      TII = min {TII , TI + TII}
      TIII = min {TIII , TIII + TI}
      TIV = min {TIV , TIII + TII}
      TIV = min {TIV , TIV + TIV}
      TIII = min {TIII , TIV + TIII}
      TII = min {TII , TII + TIV}
      TI = min {TI , TII + TIII}
    } else { … }
  }
  Each line is itself a recursive call on quadrant-sized matrices, with ‘+’ the min-plus (distance) product.
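The recursive scheme can be sketched in Python (an illustration, not the paper’s code; NumPy views, the power-of-two dimension, and the names FWR/base are assumptions made here). The generalized call FWR(A, B, C) performs A ← min(A, B ⊗ C) in the min-plus sense; the top-level call is FWR(D, D, D):

```python
import numpy as np

def FWR(A, B, C, base=16):
    """Update A <- min(A, B (min,+) C) in the recursive order of the slides."""
    n = A.shape[0]
    if n <= base:
        # base case: plain generalized FW triple loop
        for k in range(n):
            for i in range(n):
                for j in range(n):
                    if B[i, k] + C[k, j] < A[i, j]:
                        A[i, j] = B[i, k] + C[k, j]
        return
    h = n // 2  # split every operand into quadrants (NumPy views, no copies)
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    C11, C12, C21, C22 = C[:h, :h], C[:h, h:], C[h:, :h], C[h:, h:]
    # the eight sub-calls, in the order listed on the slide
    FWR(A11, B11, C11, base); FWR(A12, B11, C12, base)
    FWR(A21, B21, C11, base); FWR(A22, B21, C12, base)
    FWR(A22, B22, C22, base); FWR(A21, B22, C21, base)
    FWR(A12, B12, C22, base); FWR(A11, B12, C21, base)
```

Calling FWR(D, D, D) makes the quadrant views alias each other exactly as TI…TIV do above; Lemma 1 is what justifies operating on partially updated, aliased operands.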

  16. FW’s Recursive Algorithm – Correctness It can be shown that for each action Di,j(k) ← min {Di,j(k-1), Di,k(k-1) + Dk,j(k-1)} in FW’s traditional implementation, there is a corresponding action Ti,j(k) ← min {Ti,j(k-1), Ti,k(k’) + Tk,j(k’’)}, where k-1 ≤ k’, k’’ ≤ |V|. Hence the algorithm’s correctness follows from Lemma 1.

  17. FW’s Recursive Algorithm – How does it actually work…
  Floyd-Warshall (T){
    If (not base case){
      TI = min {TI , TI + TI}
      TII = min {TII , TI + TII}
      TIII = min {TIII , TIII + TI}
      TIV = min {TIV , TIII + TII}
      TIV = min {TIV , TIV + TIV}
      TIII = min {TIII , TIV + TIII}
      TII = min {TII , TII + TIV}
      TI = min {TI , TII + TIII}
    } else { … }
  }
  (Figure: the recursion carries the matrix from T(0) to T(|V|); each quadrant TX starts at TX(0), passes through an intermediate state TX(|V|/2), and ends at TX(|V|).)

  18. FW’s Recursive Algorithm - Example (Figure: an example graph on vertices 1–8 with weighted edges.)

  19. FW’s Recursive Algorithm – Example Applying the base-case updates to the example: the entry holding 50 drops to 18 via the path 1-3-4, and the entry holding 20 drops to 16 via the path 7-6-8.

  20. FW’s Recursive Algorithm – Example (Figure: the remaining updates of the example, with shortest paths routed through vertex 6, e.g. 1-6-4, 1-6-5, 1-6-8, 2-6-4, 2-6-5, 7-6-4, 7-6-5, and through vertex 2, e.g. 7-2-4, 7-2-8.)

  21. Representing the Matrix in an efficient way We usually store matrices in memory in one of two ways: row-major layout or column-major layout. Using either of these will not improve performance, since the algorithm breaks the matrix into quadrants rather than rows or columns.

  22. Representing the Matrix in an efficient way The Z-Morton layout: perform the following operations recursively until the quadrant size is a single data unit: (1) divide the matrix into four quadrants; (2) store quadrants I, II, III, IV contiguously in memory, in that order.
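For power-of-two dimensions, this recursive quadrant order is equivalent to interleaving the bits of the row and column indices, with each row bit in the more significant position of its pair so that quadrants come out in the order I, II, III, IV. A small illustrative helper (names chosen here, not from the slides):

```python
def z_morton_index(row, col, bits=16):
    """Offset of element (row, col) in the Z-Morton layout of a 2^bits x 2^bits matrix."""
    idx = 0
    for b in range(bits):
        idx |= ((row >> b) & 1) << (2 * b + 1)  # row bit -> odd bit position
        idx |= ((col >> b) & 1) << (2 * b)      # col bit -> even bit position
    return idx
```

For a 4x4 matrix this visits (0,0), (0,1), (1,0), (1,1) first — quadrant I — then quadrant II, and so on.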

  23. Complexity Analysis The running time of the algorithm is given by T(|V|) = 8·T(|V|/2) + O(1) = Θ(|V|^3). Without considering the cache, the number of cpu-memory transactions is of the same order as the running time.

  24. Complexity Analysis - Theorem There exists some B, where B = O(|cache|^1/2), such that, when using the FW-Recursive implementation, with the matrix stored in the Z-Morton layout, the number of cpu-memory transactions will be reduced by a factor of B. ⇒ there will be O(|V|^3/B) cpu-memory transactions.

  25. Complexity Analysis After k recursive calls, the size of a quadrant’s dimension is |V|/2^k. There exists some k such that B = |V|/2^k satisfies 3·B^2 ≤ |cache|. Once this condition is fulfilled, the 3 matrices of size B^2 involved in an update can be placed in the cache, and no further cpu-memory transactions are required within the sub-problem. ⇒ B = O(|cache|^1/2)

  26. Complexity Analysis Therefore we get: O((|V|/B)^3) · O(B^2) = O(|V|^3/B), where O((|V|/B)^3) is the transaction complexity of FW on a matrix of dimension |V|/B with no cache, and O(B^2) is the number of transactions required to bring a B×B quadrant into the cache. ⇒ the number of cpu-memory transactions is reduced by a factor of B.

  27. Complexity Analysis – lower bound In “I/O Complexity: The Red-Blue Pebble Game”, J.-W. Hong and H. T. Kung showed that the lower bound on cpu-memory transactions for multiplying matrices is Ω(N^3/B), where B = O(|cache|^1/2).

  28. Complexity Analysis – lower bound – Theorem The lower bound on cpu-memory transactions for the Floyd Warshall algorithm is Ω(|V|^3/B), where B = O(|cache|^1/2). Proof: by reduction from matrix multiplication.

  29. Complexity Analysis – lower bound theorem - Proof Matrix multiplication, with N = |V|:
  for k from 1 to N
    for i from 1 to N
      for j from 1 to N
        Ck,i += Ak,j · Bj,i
  has the same loop and data-access structure as the FW update Di,j(k) ← min {Di,j(k-1), Di,k(k-1) + Dk,j(k-1)}, so matrix multiplication reduces to FW and the Ω(|V|^3/B) lower bound carries over.

  30. Complexity Analysis - Conclusion The algorithm’s complexity: O(|V|^3/B). Lower bound for FW: Ω(|V|^3/B). The recursive implementation is asymptotically optimal among all implementations of the Floyd Warshall algorithm (with respect to cpu-memory transactions).

  31. FW’s Algorithm – Recursive Implementation - Comments Note that the size of the cache is not one of the algorithm’s parameters, nor is it needed in order to store the matrix in the Z-Morton layout. Therefore: the algorithm is cache-oblivious.

  32. FW’s Algorithm – Recursive Implementation - Comments Though the analysis model included only a single level of cache, since no cache-specific attributes were assumed, the proofs can be generalized to multiple levels of cache. (Figure: a hierarchy of L0, L1 and L2 caches above Main Memory.)

  33. FW’s Algorithm – Recursive Implementation - Comments Since cache parameters have been disregarded, the best (and simplest) way to find the optimal size B is by experiment.

  34. FW’s Algorithm – Recursive Implementation - Improvement The algorithm can be further improved by making it cache-conscious: perform the recursive calls only until the problem size is reduced to B, and solve the B-size problem in the traditional way (saving the recursive calls’ overhead). This modification showed up to a 2x improvement in running time on some of the machines.

  35. Coffee Break

  36. FW’s Algorithm – Tiled Implementation Consider the special case of Lemma 1 in which k’, k’’ are restricted such that k-1 ≤ k’, k’’ ≤ k-1+B, where 3·B^2 ≤ |cache| (so B = O(|cache|^1/2)). (Lemma 1: Suppose Di,j(k) is computed as Di,j(k) ← min {Di,j(k-1), Di,k(k’) + Dk,j(k’’)} for k-1 ≤ k’, k’’ ≤ |V|; then upon termination the FW algorithm correctly computes the all-pairs shortest paths.) This leads to the following tiled implementation of FW’s algorithm.

  37. FW’s Algorithm – Tiled Implementation Divide the matrix into B×B tiles. Perform |V|/B iterations; during the tth iteration: I. update the (t,t)th tile; II. update the remainder of the tth tile-row and tile-column; III. update the rest of the matrix.

  38. FW’s Algorithm – Tiled Implementation Each iteration consists of three phases. Phase I: perform FW’s algorithm on the (t,t)th tile (which is self-dependent).

  39. FW’s Algorithm – Tiled Implementation Phase II: update the remainder of tile-row t: Ai,j(k) ← min {Ai,j(k-1), Ai,k(tB) + Ak,j(k-1)}; update the remainder of tile-column t: Ai,j(k) ← min {Ai,j(k-1), Ai,k(k-1) + Ak,j(tB)}. During the tth iteration, k goes from (t-1)·B + 1 to t·B.

  40. FW’s Algorithm – Tiled Implementation Phase III: update the rest of the matrix: Ai,j(k) ← min {Ai,j(k-1), Ai,k(tB) + Ak,j(tB)}. During the tth iteration, k goes from (t-1)·B + 1 to t·B.
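The three phases can be sketched as follows (an illustration, not the paper’s implementation; the names fw_tile/fw_tiled and the use of NumPy views in place of a real Z-Morton layout are choices made here). fw_tile applies the generalized Lemma-1 update A ← min(A, B ⊗ C) with k swept over one tile:

```python
import numpy as np

def fw_tile(A, B, C):
    # generalized FW kernel on BxB tiles: A <- min(A, B (min,+) C),
    # sweeping k over the tile's local indices (valid by Lemma 1)
    n = A.shape[0]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if B[i, k] + C[k, j] < A[i, j]:
                    A[i, j] = B[i, k] + C[k, j]

def fw_tiled(D, B):
    n = D.shape[0]
    assert n % B == 0
    tile = lambda r, c: D[r * B:(r + 1) * B, c * B:(c + 1) * B]  # view of tile (r, c)
    for t in range(n // B):
        fw_tile(tile(t, t), tile(t, t), tile(t, t))        # phase I: diagonal tile
        for m in range(n // B):                             # phase II: row/column t
            if m != t:
                fw_tile(tile(t, m), tile(t, t), tile(t, m))
                fw_tile(tile(m, t), tile(m, t), tile(t, t))
        for i in range(n // B):                             # phase III: the rest
            for j in range(n // B):
                if i != t and j != t:
                    fw_tile(tile(i, j), tile(i, t), tile(t, j))
    return D
```

Each fw_tile call touches at most three B×B tiles, which is exactly the 3·B^2 ≤ |cache| working-set condition used in the analysis.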

  41. FW’s Algorithm – Tiled Example (Figure: the same example graph on vertices 1–8 with weighted edges.)

  42. FW’s Algorithm – Tiled Example (In the example, the entry holding 20 drops to 6 via the path 7-2-8.)

  43. FW’s Algorithm – Tiled Example (The entry holding 50 drops to 18 via the path 1-3-4.)

  44. FW’s Algorithm – Tiled Example (Further entries improve via paths through vertex 6, e.g. 1-6-4, 1-6-5, 1-6-8, 2-6-4, 2-6-5, 7-6-4, 7-6-5.)

  45. FW’s Algorithm – Tiled Example (Phase III updates the rest of the matrix, completing the iteration.)

  46. Representing the Matrix in an efficient way In order to match the data access pattern, a tile must be stored in contiguous memory. Therefore, the Z-Morton layout is used.

  47. FW’s Tiled Algorithm – Correctness Let Di,j(k) be the result of the kth iteration of the traditional FW implementation. Even though Di,j(k) and Ai,j(k) may not be equal during the “inner” iterations, it can be shown by induction that at the end of each iteration Di,j(k) = Ai,j(k) (where k = t·B).

  48. Complexity Analysis - Theorem There exists some B, where B = O(|cache|^1/2), such that, when using the FW-Tiled implementation, the number of cpu-memory transactions will be reduced by a factor of B. ⇒ there will be O(|V|^3/B) cpu-memory transactions.

  49. Complexity Analysis There are (|V|/B) × (|V|/B) tiles in the matrix. The algorithm performs |V|/B iterations, and in each iteration all tiles are accessed. Updating a tile requires holding at most 3 tiles in the cache ⇒ 3·B^2 ≤ |cache|.

  50. Complexity Analysis Therefore we get: (|V|/B) · [(|V|/B) × (|V|/B)] · O(B^2) = O(|V|^3/B), where |V|/B is the number of iterations, (|V|/B) × (|V|/B) is the number of tiles, and O(B^2) is the number of transactions required to bring a B×B tile into the cache. ⇒ the number of cpu-memory transactions is reduced by a factor of B.
