
BeBOP

A Computationally Efficient Triple Matrix Product for a Class of Sparse Schur-complement Matrices. BeBOP: Berkeley Benchmarking and OPtimization Group, bebop.cs.berkeley.edu.


Presentation Transcript


1. A Computationally Efficient Triple Matrix Product for a Class of Sparse Schur-complement Matrices
BeBOP: Berkeley Benchmarking and OPtimization Group, bebop.cs.berkeley.edu
Eun-Jin Im, Kookmin University, Seoul, Korea, ejim@eecs.berkeley.edu
Ismail Bustany, Barcelona Design Inc., Ismail.Bustany@barcelonadesign.com
Cleve Ashcraft, Livermore Software Technology Corp., cleveashcraft@earthlink.net
James W. Demmel, U.C. Berkeley, demmel@eecs.berkeley.edu
Katherine A. Yelick, U.C. Berkeley, yelick@eecs.berkeley.edu

[Poster figures: the triple product P = A H At with A partitioned into column blocks against the block-diagonal H, the row-based accumulation of Pk*, and the value/index arrays of the sparse accumulator used to add sparse vectors.]

Problem Context
• In solving a primal-dual optimization problem for circuit design, the computation of P = A H At is executed repeatedly (100-120 times).
• H has a symmetric block-diagonal structure, with diagonal blocks Hi = Di + ri rit.

We compare two approaches to computing the triple product. While the one-phase scheme has an advantage over the two-phase scheme because it exploits knowledge of the structure of the matrix, the summation of sparse matrices becomes the bottleneck. Hence, we propose a row-based one-phase scheme, in which the summation of sparse matrices is replaced by the summation of sparse vectors, which can be computed efficiently using a sparse accumulator. We also improve the performance of the row-based one-phase scheme through the use of additional data structures.

Two Implementations: One-Phase vs. Two-Phase Schemes

One-Phase Scheme
• This scheme can take advantage of the known structure of H and the symmetry of P. Partitioning A into column blocks Ai conformally with the diagonal blocks Hi of H gives
  P = A H At = sum_i Ai Hi Ait = sum_i (Ai Di Ait + (Ai ri)(Ai ri)t) = A D At + B Bt,
  where D is the diagonal matrix assembled from the Di and B is the matrix whose columns are B*i = Ai ri.
• Drawback: the summation of sparse matrices is slow.

Two-Phase Scheme
• P = mult(A, Q = mult(H, At)), i.e., two general sparse matrix-matrix products.
• In computing C = mult(A, B): for B*i = each column of B, for each nonzero bji of B*i, add bji A*j into C*i, accumulating the column in a sparse accumulator.

Memory Performance Modeling
• Memory access: the dominant factor in the one-phase scheme is the access of elements of A in A*H*At; the dominant factor in the two-phase scheme is the access of elements of A in A*B. [The access-count formulas are not legible in this transcript.]
• Cache misses: for sequentially accessed elements, spatial locality is assumed to be exploited.

Performance
• Modeled and Measured Execution Time: we predict lower and upper bounds on the execution time of the one-phase and two-phase schemes using our memory model, and measurements confirm that the one-phase scheme has an advantage over the two-phase scheme in both execution time and memory. In addition, the preprocessing cost is lower in the one-phase scheme.
• [Plots and tables omitted from this transcript: measured performance, speedup, execution time, achieved Mflop rate, the example matrix set from the circuit design application, and the overhead of preprocessing relative to execution time in the two-phase scheme.]

Row-based One-phase Scheme
• Instead of adding sparse matrices, add sparse vectors for each row (column) of P. Consider row k of P (let B*i = Ai ri):
  Pk* = sum over j with akj != 0 of akj dj A*jt + sum over i with bki != 0 of bki B*it.
• Compute the matrix B, and create row-major structures of A and B; they are needed to access {j: akj != 0} and {i: bki != 0} efficiently.
• For each row (column) k of P: for j with akj != 0, accumulate akj dj A*jt; for i with bki != 0, do the same with bki B*it, without the scaling factor di.

Efficient Sparse Vector Addition using a sparse accumulator
• The sparse vectors for a row are added into a sparse accumulator: a dense array of values indexed by row index, together with a list of the occupied indices, so each addition costs one access per nonzero.
• A sparse accumulator is used in both the two-phase scheme and the row-based one-phase scheme (illustrative sketches of both schemes follow below).
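To make the one-phase identity concrete, here is a minimal SciPy sketch on a small random instance. Everything in it is illustrative: the block sizes, the density, and the names A_blocks, d_blocks, r_blocks, and B are assumptions rather than values or code from the poster, and SciPy's general-purpose sparse products stand in for the tuned kernels measured above.

```python
# A minimal sketch (not the poster's implementation): check that
# P = A H At equals A D At + B Bt when H = blockdiag(Hi), Hi = Di + ri ri^t.
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)
m, block_sizes = 50, [10, 15, 20]                 # assumed, illustrative sizes
A_blocks = [sp.random(m, n, density=0.1, random_state=rng, format="csc")
            for n in block_sizes]
A = sp.hstack(A_blocks, format="csc")             # A = [A1 | A2 | A3]

d_blocks = [rng.random(n) for n in block_sizes]   # diagonals of the Di
r_blocks = [rng.random(n) for n in block_sizes]   # rank-one vectors ri

# H = blockdiag(H1, ..., HN) with Hi = Di + ri ri^t
H = sp.block_diag([np.diag(di) + np.outer(ri, ri)
                   for di, ri in zip(d_blocks, r_blocks)], format="csc")

# Two-phase scheme: two general sparse products, P = A * (H * At).
P_two_phase = A @ (H @ A.T)

# One-phase scheme: exploit the structure of H.
# P = A D At + B Bt, where D = diag(d) and column i of B is B*i = Ai ri.
D = sp.diags(np.concatenate(d_blocks))
B = sp.csc_matrix(np.column_stack([Ai @ ri for Ai, ri in zip(A_blocks, r_blocks)]))
P_one_phase = A @ D @ A.T + B @ B.T

assert np.allclose(P_two_phase.toarray(), P_one_phase.toarray())
```

Both sides compute the same P; the poster's point is that the right-hand form can be evaluated more cheaply, since D is diagonal and B has only one column per diagonal block of H.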
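The row-based one-phase scheme itself can be sketched directly from the recipe above. This is a hedged reading, not the authors' implementation: the function name row_based_one_phase, the use of SciPy CSR/CSC arrays for the row-major and column-major structures, and the marker-array form of the sparse accumulator are all assumptions.

```python
# Illustrative sketch of the row-based one-phase scheme (assumed details):
#   row k of P = sum_j akj*dj*A[:,j]^t + sum_i bki*B[:,i]^t,
# accumulated in a sparse accumulator (dense value array + occupied-index list).
import numpy as np
import scipy.sparse as sp

def row_based_one_phase(A, B, d):
    m = A.shape[0]
    A_csr, A_csc = A.tocsr(), A.tocsc()   # rows give {j: akj != 0}, columns give A[:,j]
    B_csr, B_csc = B.tocsr(), B.tocsc()   # likewise for B
    spa = np.zeros(m)                     # sparse accumulator: values ...
    mark = np.full(m, -1)                 # ... and "last row that touched this slot"
    P = sp.lil_matrix((m, m))

    def scatter(csc, j, scale, k, nz):
        """Add scale * (column j of csc) into the accumulator for row k."""
        lo, hi = csc.indptr[j], csc.indptr[j + 1]
        for i, v in zip(csc.indices[lo:hi], csc.data[lo:hi]):
            if mark[i] != k:              # first touch of index i in this row
                mark[i] = k
                spa[i] = 0.0
                nz.append(i)
            spa[i] += scale * v

    for k in range(m):
        nz = []                           # indices occupied in row k
        lo, hi = A_csr.indptr[k], A_csr.indptr[k + 1]
        for j, akj in zip(A_csr.indices[lo:hi], A_csr.data[lo:hi]):
            scatter(A_csc, j, akj * d[j], k, nz)      # akj * dj * A[:,j]
        lo, hi = B_csr.indptr[k], B_csr.indptr[k + 1]
        for i, bki in zip(B_csr.indices[lo:hi], B_csr.data[lo:hi]):
            scatter(B_csc, i, bki, k, nz)             # bki * B[:,i], no d factor
        for i in nz:                      # gather the accumulated row of P
            P[k, i] = spa[i]
    return P.tocsr()
```

The output can be cross-checked against a direct A @ sp.diags(d) @ A.T + B @ B.T; the point of the row-based formulation is that each row of P is finished with sparse-vector additions into the accumulator instead of repeated sparse-matrix additions.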
Conclusion
• Performance tuning of a higher-level sparse matrix operation than matrix-vector multiplication.
• Speedup of up to 2.1x, with less than half the memory requirement.
• An example of an algebraic transformation used for performance tuning; knowledge of the special structure of the matrix is what enables the transformation.

Preprocessing
• In the one-phase scheme: counting the number of nonzeros in B and P (to determine the amount of memory to allocate), computing the structure of the matrix B, and constructing the row-major structures of A and B.
• In the two-phase scheme: generating At, and counting the number of nonzeros in P and Q.

Improved One-phase Scheme: Utilizing the Symmetry of P
• Since P is symmetric, only the entries Pki with i >= k need to be formed for row k: in computing akj dj A*jt, it suffices to compute akj dj Ak:m,jt.
• By keeping an array of indices pointing to each A*j's next nonzero element, unnecessary accesses to entries of A*j above row k are avoided (# of accesses to A*j = # of nonzeros of A*j); a sketch of this refinement follows below.
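As a companion to the "Improved One-phase Scheme" panel above, here is a hedged sketch of the two refinements together: forming only the entries Pki with i >= k, and keeping a per-column pointer to the next unread nonzero of each A*j (extended here, as an assumption, to the columns of B as well) so that entries above row k are never rescanned. The names one_phase_symmetric, nxt_a, and nxt_b are illustrative, not from the poster.

```python
# Illustrative sketch (assumed details, not the authors' code): exploit the
# symmetry of P and per-column "next nonzero" pointers into A's and B's
# column storage, so entries above row k are skipped at most once each.
import numpy as np
import scipy.sparse as sp

def one_phase_symmetric(A, B, d):
    m = A.shape[0]
    A_csr, A_csc = A.tocsr(), A.tocsc()
    B_csr, B_csc = B.tocsr(), B.tocsc()
    A_csc.sort_indices(); B_csc.sort_indices()      # row indices ascending per column
    nxt_a = A_csc.indptr[:-1].copy()                 # next unread nonzero of A[:,j]
    nxt_b = B_csc.indptr[:-1].copy()                 # next unread nonzero of B[:,i]
    spa, mark = np.zeros(m), np.full(m, -1)
    P = sp.lil_matrix((m, m))

    def scatter_tail(csc, nxt, j, scale, k, nz):
        # Advance the pointer past entries with row index < k; since k only
        # increases, each stored entry of the column is passed over at most once.
        while nxt[j] < csc.indptr[j + 1] and csc.indices[nxt[j]] < k:
            nxt[j] += 1
        for p in range(nxt[j], csc.indptr[j + 1]):   # only the subcolumn rows k..m-1
            i = csc.indices[p]
            if mark[i] != k:
                mark[i] = k
                spa[i] = 0.0
                nz.append(i)
            spa[i] += scale * csc.data[p]

    for k in range(m):
        nz = []
        lo, hi = A_csr.indptr[k], A_csr.indptr[k + 1]
        for j, akj in zip(A_csr.indices[lo:hi], A_csr.data[lo:hi]):
            scatter_tail(A_csc, nxt_a, j, akj * d[j], k, nz)
        lo, hi = B_csr.indptr[k], B_csr.indptr[k + 1]
        for i, bki in zip(B_csr.indices[lo:hi], B_csr.data[lo:hi]):
            scatter_tail(B_csc, nxt_b, i, bki, k, nz)
        for i in nz:
            P[k, i] = spa[i]    # only P[k, i] with i >= k; P[i, k] follows by symmetry
    return P.tocsr()
```

Relative to the unrestricted row-based sketch, roughly half of the scatter work is avoided, and the pointer advance touches each stored entry of a column at most once over the whole computation, which is one reading of the panel's access-count remark.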
