
Sparse LA


Presentation Transcript


  1. Sparse LA Sathish Vadhiyar

  2. Motivation • Sparse computations much more challenging than dense due to complex data structures and memory references • Many physical systems produce sparse matrices

  3. Sparse Matrix-Vector Multiplication

  4. Cache in GPUs • Fermi has a configurable 16KB/48KB L1 cache per SM, and a 768 KB L2 cache shared across the GPU • Each cache line is 128 bytes, to provide high memory bandwidth • Thus, when using the 48KB configuration, only 48KB/128 bytes = 384 cache lines can be resident • GPUs execute up to 1536 threads per SM • If all threads access different cache lines, 384 of them will get cache hits and the others will get cache misses • Thus threads in the same block should work on the same cache lines

  5. Compressed Sparse Row (CSR) Format • SpMV (Sparse Matrix-Vector Multiplication) using CSR
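A minimal sketch of the CSR layout and the corresponding sequential SpMV loop (the array names vals, colIdx and rowPtr are illustrative, not taken from the slides):

    // y = A*x for a CSR matrix: rowPtr has numRows+1 entries, and row i owns
    // the nonzeros in positions [rowPtr[i], rowPtr[i+1]) of vals/colIdx.
    void spmv_csr_host(int numRows, const int *rowPtr, const int *colIdx,
                       const float *vals, const float *x, float *y)
    {
        for (int row = 0; row < numRows; ++row) {
            float dot = 0.0f;
            for (int j = rowPtr[row]; j < rowPtr[row + 1]; ++j)
                dot += vals[j] * x[colIdx[j]];
            y[row] = dot;
        }
    }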

  6. Naïve CUDA Implementation • Assign one thread to each row • Define block size as a multiple of the warp size, e.g., 128 • Can handle a maximum of 128x65535=8388480 rows, where 65535 is the maximum number of blocks in one grid dimension
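A hedged sketch of the one-thread-per-row kernel described above (block size 128 as in the slide; kernel and array names are illustrative):

    __global__ void spmv_csr_naive(int numRows, const int *rowPtr, const int *colIdx,
                                   const float *vals, const float *x, float *y)
    {
        int row = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per row
        if (row < numRows) {
            float dot = 0.0f;
            for (int j = rowPtr[row]; j < rowPtr[row + 1]; ++j)
                dot += vals[j] * x[colIdx[j]];             // neighbouring threads touch far-apart cache lines
            y[row] = dot;
        }
    }

    // Launch: spmv_csr_naive<<<(numRows + 127) / 128, 128>>>(numRows, rowPtr, colIdx, vals, x, y);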

  7. Naïve CUDA Implementation - Drawbacks • For large matrices with several elements per row, the implementation suffers from a high cache miss rate, since the cache can't hold all the cache lines being used • Thus coalescing/caching is poor for long rows • If the number of nonzeros (nnz) per row has high variance, warp divergence will occur

  8. Thread Cooperation • Multiple threads can be assigned to work on the same row • Cooperating threads act on adjacent elements of the row, perform multiplication with elements of vector x, and add up their results in shared memory using a reduction • The first thread of the group writes the result to vector y • If the number of cooperating threads, coop, does not exceed the warp size, the synchronization between cooperating threads is implicit
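A hedged sketch of the cooperative kernel, assuming coop (COOP below) is a power of two no larger than the warp size, so each group stays inside one warp; the shared-memory array size blockDim.x*sizeof(float) must be passed at launch:

    template <int COOP>
    __global__ void spmv_csr_coop(int numRows, const int *rowPtr, const int *colIdx,
                                  const float *vals, const float *x, float *y)
    {
        extern __shared__ float partial[];                 // blockDim.x floats
        int tid  = threadIdx.x;
        int lane = tid % COOP;                             // position inside the cooperating group
        int row  = (blockIdx.x * blockDim.x + tid) / COOP; // one group per row

        float dot = 0.0f;
        if (row < numRows)
            for (int j = rowPtr[row] + lane; j < rowPtr[row + 1]; j += COOP)
                dot += vals[j] * x[colIdx[j]];             // adjacent threads read adjacent nonzeros
        volatile float *vp = partial;
        vp[tid] = dot;

        // Shared-memory reduction; synchronization is implicit while COOP <= warp size
        // (newer architectures with independent thread scheduling would need __syncwarp()).
        for (int s = COOP / 2; s > 0; s >>= 1)
            if (lane < s) vp[tid] += vp[tid + s];

        if (lane == 0 && row < numRows)
            y[row] = vp[tid];                              // first thread of the group writes y
    }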

  9. Analysis • The same cache lines are used by cooperating threads • Improves coalescing/caching • If the length of the row is not a multiple of 32, this can lead to warp divergence and loss in performance
[Figure: coop = 4 example showing coalesced access and good caching, but warp divergence on uneven row lengths]

  10. Granularity • If a group of cooperating threads acts on only one row, then the number of blocks required for the entire matrix may exceed 65535 • Thus, more than one row per cooperating group can be processed • The number of rows processed by a cooperating group is denoted repeat • A thread block processes repeat*blockSize/coop consecutive rows • The algorithm can thus be parametrized by blockSize, coop and repeat, as in the sketch after the next slide

  11. Parametrized Algorithm
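The original slide's listing is not reproduced in the transcript; the following is only a sketch of a kernel parametrized by blockSize, coop and repeat in the spirit of the previous slide, not the paper's actual code:

    // Blocks needed: ceil(numRows / (REPEAT * blockDim.x / COOP)).
    template <int COOP, int REPEAT>
    __global__ void spmv_csr_param(int numRows, const int *rowPtr, const int *colIdx,
                                   const float *vals, const float *x, float *y)
    {
        extern __shared__ float partial[];
        int tid   = threadIdx.x;
        int lane  = tid % COOP;
        int group = (blockIdx.x * blockDim.x + tid) / COOP;    // global group index
        volatile float *vp = partial;

        for (int r = 0; r < REPEAT; ++r) {                     // REPEAT consecutive rows per group
            int row = group * REPEAT + r;
            float dot = 0.0f;
            if (row < numRows)
                for (int j = rowPtr[row] + lane; j < rowPtr[row + 1]; j += COOP)
                    dot += vals[j] * x[colIdx[j]];
            vp[tid] = dot;
            for (int s = COOP / 2; s > 0; s >>= 1)             // implicit sync while COOP <= 32
                if (lane < s) vp[tid] += vp[tid + s];
            if (lane == 0 && row < numRows)
                y[row] = vp[tid];
        }
    }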

  12. References • Efficient Sparse Matrix-Vector Multiplication on cache-based GPUs. Reguly, Giles. InPar 2012.

  13. Tridiagonal Matrices

  14. Parallel solution of linear systems with special matrices: Tridiagonal Matrices

\[
\begin{pmatrix}
a_1 & h_1 &        &        &        \\
g_2 & a_2 & h_2    &        &        \\
    & g_3 & a_3    & h_3    &        \\
    &     & \ddots & \ddots & \ddots \\
    &     &        & g_n    & a_n
\end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_n \end{pmatrix}
=
\begin{pmatrix} b_1 \\ b_2 \\ b_3 \\ \vdots \\ b_n \end{pmatrix}
\]

In general: \( g_i x_{i-1} + a_i x_i + h_i x_{i+1} = b_i \). Substituting for \( x_{i-1} \) and \( x_{i+1} \) in terms of \( \{x_{i-2}, x_i\} \) and \( \{x_i, x_{i+2}\} \) respectively: \( G_i x_{i-2} + A_i x_i + H_i x_{i+2} = B_i \)
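Carrying out the substitution explicitly (a filled-in step; these are the standard odd-even reduction formulas, not shown on the slide):

\[
G_i = -\frac{g_i\, g_{i-1}}{a_{i-1}}, \qquad
H_i = -\frac{h_i\, h_{i+1}}{a_{i+1}}, \qquad
A_i = a_i - \frac{g_i\, h_{i-1}}{a_{i-1}} - \frac{h_i\, g_{i+1}}{a_{i+1}}, \qquad
B_i = b_i - \frac{g_i\, b_{i-1}}{a_{i-1}} - \frac{h_i\, b_{i+1}}{a_{i+1}}
\]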

  15. Tridiagonal Matrices. The substituted system:

\[
\begin{pmatrix}
A_1 & 0   & H_1    &     &        &        \\
0   & A_2 & 0      & H_2 &        &        \\
G_3 & 0   & A_3    & 0   & H_3    &        \\
    & G_4 & 0      & A_4 & 0      & \ddots \\
    &     & \ddots &     & \ddots &        \\
    &     &        & G_n & 0      & A_n
\end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_n \end{pmatrix}
=
\begin{pmatrix} B_1 \\ B_2 \\ B_3 \\ \vdots \\ B_n \end{pmatrix}
\]

Reordering (even-numbered unknowns first, then odd-numbered):

  16. Tridiagonal Matrices

\[
\begin{pmatrix}
A_2 & H_2    &        &     &     &        &         &         \\
G_4 & A_4    & H_4    &     &     &        &         &         \\
    & \ddots & \ddots & \ddots &  &        &         &         \\
    &        & G_n    & A_n &     &        &         &         \\
    &        &        &     & A_1 & H_1    &         &         \\
    &        &        &     & G_3 & A_3    & H_3     &         \\
    &        &        &     &     & \ddots & \ddots  & \ddots  \\
    &        &        &     &     &        & G_{n-1} & A_{n-1}
\end{pmatrix}
\begin{pmatrix} x_2 \\ x_4 \\ \vdots \\ x_n \\ x_1 \\ x_3 \\ \vdots \\ x_{n-1} \end{pmatrix}
=
\begin{pmatrix} B_2 \\ B_4 \\ \vdots \\ B_n \\ B_1 \\ B_3 \\ \vdots \\ B_{n-1} \end{pmatrix}
\]

  17. Tridiagonal Systems • Thus the problem of size n has been split into even and odd equations, each a system of size n/2 • This is odd-even reduction • For parallelization, each process can divide the problem into smaller subproblems and solve the subproblems • This is a divide-and-conquer technique; a sketch of one reduction step follows
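A hedged host-side sketch of one odd-even reduction step using the coefficient formulas above (0-based arrays with g[0] = h[n-1] = 0; the function name and boundary handling are assumptions):

    // Reduce the n-equation system (g, a, h, b) to the even-numbered equations
    // (G, A, H, B); the odd-numbered equations are reduced the same way.
    void odd_even_step(int n, const float *g, const float *a, const float *h, const float *b,
                       float *G, float *A, float *H, float *B)
    {
        for (int i = 1; i < n; i += 2) {                       // even equations (1-based) = odd indices (0-based)
            float gm = g[i] / a[i - 1];                        // multiplier for equation i-1
            float hp = (i + 1 < n) ? h[i] / a[i + 1] : 0.0f;   // multiplier for equation i+1
            int k = i / 2;                                     // index in the reduced system
            G[k] = -gm * g[i - 1];
            H[k] = (i + 1 < n) ? -hp * h[i + 1] : 0.0f;
            A[k] = a[i] - gm * h[i - 1] - ((i + 1 < n) ? hp * g[i + 1] : 0.0f);
            B[k] = b[i] - gm * b[i - 1] - ((i + 1 < n) ? hp * b[i + 1] : 0.0f);
        }
    }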

  18. Tridiagonal Systems - Parallelization • At each stage, one representative process is chosen from the domain of processes • This representative performs the odd-even reduction of a problem of size i into two problems of size i/2 • The two subproblems are then distributed to 2 representatives
[Figure: recursion tree of subproblem sizes n, n/2, n/4, n/8 distributed over processes 1-8]

  19. Sparse Cholesky • To solve Ax = b • Most of the research, and the base case, is on sparse symmetric positive definite matrices • A = LL^T; Ly = b; L^T x = y • Cholesky factorization introduces fill-in

  20. Column oriented left-looking Cholesky

  21. Fill-in • Fill: new nonzeros in the factor
[Figure: elimination graphs before and after factorization, with fill edges highlighted]

  22. Permutation Matrix or Ordering • Ordering is used to reduce fill or to enhance numerical stability • Choose a permutation matrix P so that the Cholesky factor L' of PAP^T has less fill than L • Triangular solves: L'y = Pb; L'^T z = y; x = P^T z • The fill can be predicted in advance • A static data structure can be used: symbolic factorization

  23. Steps • Ordering: find a permutation P of matrix A • Symbolic factorization: set up a data structure for the Cholesky factor L of PAP^T • Numerical factorization: decompose PAP^T into LL^T • Triangular system solution: Ly = Pb; L^T z = y; x = P^T z

  24. Sparse Matrices and Graph Theory
[Figure: a 7x7 sparse matrix and its graph G(A), shown step by step as vertices are eliminated]

  25. Sparse and Graph
[Figure: the filled matrix and its filled graph F(A)]

  26. Ordering • The above order of elimination is “natural” • The first heuristic is minimum degree ordering • Simple and effective • But efficiency depends on tie breaking strategy • Difficult to parallelize!
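A hedged sketch of the basic greedy heuristic on a small dense adjacency matrix: eliminating a vertex pairwise-connects its remaining neighbours (modelling the fill its elimination would cause), and ties are broken here by lowest index, which is exactly the tie-breaking choice the slide says the quality depends on.

    #include <vector>

    // adj is an n*n symmetric 0/1 adjacency matrix of G(A); order receives the elimination order.
    void minimum_degree_order(int n, std::vector<char> &adj, std::vector<int> &order)
    {
        std::vector<char> gone(n, 0);
        order.assign(n, -1);
        for (int step = 0; step < n; ++step) {
            int best = -1, bestDeg = n + 1;
            for (int v = 0; v < n; ++v) {                      // pick a remaining vertex of minimum degree
                if (gone[v]) continue;
                int deg = 0;
                for (int u = 0; u < n; ++u)
                    if (!gone[u] && adj[v * n + u]) ++deg;
                if (deg < bestDeg) { bestDeg = deg; best = v; }   // ties: lowest index wins
            }
            order[step] = best;
            gone[best] = 1;
            for (int u = 0; u < n; ++u)                        // connect the eliminated vertex's neighbours
                if (!gone[u] && adj[best * n + u])
                    for (int w = 0; w < n; ++w)
                        if (w != u && !gone[w] && adj[best * n + w])
                            adj[u * n + w] = 1;
        }
    }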

  27. Minimum degree ordering for the previous matrix • Ordering: {2, 4, 5, 7, 3, 1, 6} • No fill-in!
[Figure: the 7-vertex graph labelled with this elimination order]

  28. Ordering • Another ordering is nested dissection (divide-and-conquer) • Find a separator S of nodes whose removal (along with incident edges) divides the graph into 2 disjoint pieces • Variables in each piece are numbered contiguously and variables in S are numbered last • Leads to a bordered block diagonal non-zero pattern • Can be applied recursively, as in the sketch below • Can be parallelized using a divide-and-conquer approach
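A hedged sketch of the recursive numbering, with the separator finder left as a caller-supplied function (a real implementation would back it with a graph partitioning heuristic, as slide 44 describes):

    #include <vector>

    // find_sep splits 'verts' into two disjoint pieces and a separator.
    typedef void (*SeparatorFn)(const std::vector<int> &verts,
                                std::vector<int> &piece1, std::vector<int> &piece2,
                                std::vector<int> &sep);

    void nested_dissection(const std::vector<int> &verts, SeparatorFn find_sep,
                           std::vector<int> &order)
    {
        if (verts.size() <= 2) {                               // small piece: number it directly
            order.insert(order.end(), verts.begin(), verts.end());
            return;
        }
        std::vector<int> p1, p2, sep;
        find_sep(verts, p1, p2, sep);
        nested_dissection(p1, find_sep, order);                // each piece numbered contiguously
        nested_dissection(p2, find_sep, order);
        order.insert(order.end(), sep.begin(), sep.end());     // separator numbered last
    }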

  29. Nested Dissection Illustration
[Figure: a 5x5 grid graph numbered 1-25 column by column, with the middle column forming the separator S]

  30. Nested Dissection Illustration
[Figure: the same 5x5 grid renumbered by nested dissection: the two halves are numbered 1-10 and 11-20, and the separator S is numbered last (21-25)]

  31. Numerical Factorization • cmod(j, k): modification of column j by column k, k < j • cdiv(j): division of column j by a scalar

  32. Algorithms
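The original listing is not reproduced in the transcript; here is a hedged sketch of column-oriented left-looking Cholesky written in terms of cmod and cdiv on a dense column-major lower triangle. The sparse version has the same structure but applies cmod(j, k) only for columns k with L(j, k) != 0.

    #include <math.h>

    // cmod(j, k): subtract column k's contribution from column j (k < j).
    static void cmod(int n, double *L, int j, int k)
    {
        double ljk = L[j + k * n];
        for (int i = j; i < n; ++i)
            L[i + j * n] -= L[i + k * n] * ljk;
    }

    // cdiv(j): take the square root of the diagonal and scale column j by it.
    static void cdiv(int n, double *L, int j)
    {
        double d = sqrt(L[j + j * n]);
        L[j + j * n] = d;
        for (int i = j + 1; i < n; ++i)
            L[i + j * n] /= d;
    }

    // Left-looking: column j is updated by all earlier columns, then scaled.
    void cholesky_left_looking(int n, double *L)   // lower triangle of A on entry, L on exit
    {
        for (int j = 0; j < n; ++j) {
            for (int k = 0; k < j; ++k)
                if (L[j + k * n] != 0.0)           // skip zero entries (sparsity)
                    cmod(n, L, j, k);
            cdiv(n, L, j);
        }
    }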

  33. Elimination Tree • T(A) has an edge between two vertices i and j, with i > j, if i = p(j), i.e., L(i, j) is the first non-zero entry in the j-th column below the diagonal; i is the parent of j
[Figure: example elimination tree on 10 nodes]
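A hedged sketch of building the parent array directly from this definition, given the column structure of L in compressed sparse column form (row indices within each column assumed sorted):

    // parent[j] = row index of the first nonzero strictly below the diagonal in
    // column j of L, or -1 if column j has none (j is then a root of T(A)).
    void elimination_tree(int n, const int *colPtr, const int *rowIdx, int *parent)
    {
        for (int j = 0; j < n; ++j) {
            parent[j] = -1;
            for (int p = colPtr[j]; p < colPtr[j + 1]; ++p)
                if (rowIdx[p] > j) { parent[j] = rowIdx[p]; break; }
        }
    }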

  34. Parallelization of Sparse Cholesky • Most of the parallel algorithms are based on elimination trees • Work associated with two disjoint subtrees can proceed independently • Same steps associated with sequential sparse factorization • One additional step: assignment of tasks to processors

  35. Ordering • 2 issues: performing the ordering in parallel, and finding an ordering that will help parallelization in the subsequent steps

  36. Ordering in Parallel – Nested dissection • Nested dissection can be carried out in parallel • It also leads to elimination trees that can be parallelized during subsequent factorizations • But parallelization is possible only in the later levels of dissection • Can be applied to only a limited class of problems • More later….

  37. Ordering for Parallel Factorization • There is no agreed objective for ordering for parallel factorization: not all orderings that reduce fill-in provide scope for parallelization
[Figure: an example 7-vertex graph in natural order and its elimination tree, a chain: no fill, but no scope for parallelization]

  38. Example (Contd.)
[Figure: the same graph in nested dissection order and its elimination tree, now balanced: some fill, but scope for parallelization]

  39. Ordering for parallel factorization – Tree restructuring • Decouple the fill-reducing ordering and the ordering for parallel elimination • Determine a fill-reducing ordering P of G(A) • Form the elimination tree T(PAP^T) • Transform this tree T(PAP^T) to one with smaller height and record the corresponding equivalent reordering, P'

  40. Ordering for parallel factorization – Tree restructuring • Efficiency depends on whether such an equivalent reordering can be found • It also depends on the limitations of the initial ordering, P • Only minor modifications are made to the initial ordering, hence only limited improvement in parallelism • Algorithm by Liu (1989) based on elimination tree rotations to reduce the height • Algorithm by Jess and Kees (1982) based on chordal graphs to reduce the height

  41. Height and Parallel Completion time • Not all elimination trees with minimum heights give rise to small parallel completion times • Let each node v in the elimination tree be associated with a pair (x, y) • x – time[v], the time for factorization of column v • y – level[v] • level[v] = time[v] if v is the root of the elimination tree, time[v] + level[parent of v] otherwise • level[v] represents the minimum time to completion starting at node v • Parallel completion time – the maximum level value among all nodes
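A hedged sketch of computing level[] and the parallel completion time from the parent array of the elimination tree; columns are processed in decreasing order, which visits every parent before its children since parent[v] > v:

    // colTime[v] = factorization time of column v; returns the parallel completion time.
    float parallel_completion_time(int n, const int *parent, const float *colTime, float *level)
    {
        float tmax = 0.0f;
        for (int v = n - 1; v >= 0; --v) {
            level[v] = colTime[v] + (parent[v] >= 0 ? level[parent[v]] : 0.0f);
            if (level[v] > tmax) tmax = level[v];
        }
        return tmax;
    }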

  42. Height and Parallel Completion time
[Figure: two elimination trees on nodes a-i annotated with (time, level) pairs, illustrating that the tree of smaller height does not necessarily give the smaller parallel completion time]

  43. Minimization of Cost • Thus some algorithms (Weng-Yang Lin, J. of Supercomputing, 2003) pick, at each ordering step, the node with the minimum cost (a greedy approach)

  44. Nested Dissection Algorithms • Use a graph partitioning heuristic to obtain a small edge separator of the graph • Transform the small edge separator into a small node separator • Number nodes of the separator last and recursively apply

  45. K-L for ND • Form a random initial partition • Form the edge separator by applying K-L to form partitions P1 and P2 • Let V1 be the set of nodes in P1 incident on at least one edge in the separator set; similarly V2 in P2 • V1 U V2 gives a wide node separator • V1 or V2 alone gives a narrow node separator, by Gilbert and Zmijewski (1987)
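A hedged sketch of turning the K-L edge separator into node separators; the edge list, the part[] labelling and the function name are illustrative:

    #include <vector>
    #include <utility>

    // part[v] is 1 or 2 (the K-L partition). V1/V2 collect the endpoints of
    // separator edges on each side: V1 U V2 is the wide node separator, and
    // either V1 or V2 alone is a narrow one.
    void node_separators(int n, const std::vector<std::pair<int,int> > &edges,
                         const std::vector<int> &part,
                         std::vector<int> &V1, std::vector<int> &V2)
    {
        std::vector<char> in1(n, 0), in2(n, 0);
        for (size_t e = 0; e < edges.size(); ++e) {
            int u = edges[e].first, v = edges[e].second;
            if (part[u] == part[v]) continue;                  // not a separator edge
            if (part[u] == 1) { in1[u] = 1; in2[v] = 1; }
            else              { in1[v] = 1; in2[u] = 1; }
        }
        for (int v = 0; v < n; ++v) {
            if (in1[v]) V1.push_back(v);
            if (in2[v]) V2.push_back(v);
        }
    }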

  46. Step 2: Mapping Problems on to processors • Based on elimination trees • Various strategies to map columns to processors based on elimination trees. • Two algorithms: • Subtree-to-Subcube • Bin-Pack by Geist and Ng
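A hedged sketch of the bin-packing idea behind the second strategy (assign independent subtrees to the least-loaded processor, largest first); the workload estimates and names are illustrative, not Geist and Ng's actual algorithm:

    #include <vector>
    #include <algorithm>

    // work[s] = estimated factorization work of independent subtree s.
    // Returns proc[s], the processor each subtree is mapped to.
    std::vector<int> bin_pack_subtrees(const std::vector<double> &work, int numProcs)
    {
        std::vector<int> idx(work.size());
        for (size_t s = 0; s < idx.size(); ++s) idx[s] = (int)s;
        std::sort(idx.begin(), idx.end(),
                  [&](int a, int b) { return work[a] > work[b]; });   // largest subtree first

        std::vector<double> load(numProcs, 0.0);
        std::vector<int> proc(work.size(), 0);
        for (size_t k = 0; k < idx.size(); ++k) {
            int p = (int)(std::min_element(load.begin(), load.end()) - load.begin());
            proc[idx[k]] = p;                                         // give it to the least-loaded processor
            load[p] += work[idx[k]];
        }
        return proc;
    }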

  47. Naïve Strategy
[Figure: example elimination tree with each column assigned to one of processors 0-3]
