
BeBOP

A Computationally Efficient Triple Matrix Product for a Class of Sparse Schur-complement Matrices. BeBOP: Berkeley Benchmarking and OPtimization Group, bebop.cs.berkeley.edu.


Presentation Transcript


1. A Computationally Efficient Triple Matrix Product for a Class of Sparse Schur-complement Matrices
BeBOP: Berkeley Benchmarking and OPtimization Group, bebop.cs.berkeley.edu
Eun-Jin Im, Kookmin University, Seoul, Korea, ejim@eecs.berkeley.edu
Ismail Bustany, Barcelona Design Inc., Ismail.Bustany@barcelonadesign.com
Cleve Ashcraft, Livermore Software Technology Corp., cleveashcraft@earthlink.net
James W. Demmel, U.C. Berkeley, demmel@eecs.berkeley.edu
Katherine A. Yelick, U.C. Berkeley, yelick@eecs.berkeley.edu

[Poster figures: the triple product P = A H At with A partitioned into column blocks against the block-diagonal H, the row-based accumulation of Pk*, and the value/index arrays of the sparse accumulator used to add sparse vectors.]

Problem Context
• In solving a primal-dual optimization problem for circuit design, the computation of P = A H At is executed repeatedly (100-120 times).
• H has a symmetric block-diagonal structure, with diagonal blocks Hi = Di + ri rit.

We compare two approaches to computing the triple product. While the one-phase scheme has an advantage over the two-phase scheme because it exploits knowledge of the structure of the matrix, the summation of sparse matrices becomes the bottleneck. Hence, we propose a row-based one-phase scheme, in which the summation of sparse matrices is replaced by the summation of sparse vectors, which can be computed efficiently using a sparse accumulator. We also improve the performance of the row-based one-phase scheme through the use of additional data structures.

Two Implementations: One-Phase vs. Two-Phase Schemes

One-Phase Scheme
• This scheme can take advantage of the known structure of H and the symmetry of P. Partitioning A into column blocks Ai conformally with the diagonal blocks Hi of H gives
  P = A H At = sum_i Ai Hi Ait = sum_i (Ai Di Ait + (Ai ri)(Ai ri)t) = A D At + B Bt,
  where D is the diagonal matrix assembled from the Di and B is the matrix whose columns are B*i = Ai ri.
• Drawback: the summation of sparse matrices is slow.

Two-Phase Scheme
• P = mult(A, Q = mult(H, At)), i.e., two general sparse matrix-matrix products.
• In computing C = mult(A, B): for B*i = each column of B, for each nonzero bji of B*i, add bji A*j into C*i, accumulating the column in a sparse accumulator.

Memory Performance Modeling
• Memory access: the dominant factor in the one-phase scheme is the access of elements of A in A*H*At; the dominant factor in the two-phase scheme is the access of elements of A in A*B. [The access-count formulas are not legible in this transcript.]
• Cache misses: for sequentially accessed elements, spatial locality is assumed to be exploited.

Performance
• Modeled and Measured Execution Time: we predict lower and upper bounds on the execution time of the one-phase and two-phase schemes using our memory model, and measurements confirm that the one-phase scheme has an advantage over the two-phase scheme in both execution time and memory. In addition, the preprocessing cost is lower in the one-phase scheme.
• [Plots and tables omitted from this transcript: measured performance, speedup, execution time, achieved Mflop rate, the example matrix set from the circuit design application, and the overhead of preprocessing relative to execution time in the two-phase scheme.]

Row-based One-phase Scheme
• Instead of adding sparse matrices, add sparse vectors for each row (column) of P. Consider row k of P (let B*i = Ai ri):
  Pk* = sum over j with akj != 0 of akj dj A*jt + sum over i with bki != 0 of bki B*it.
• Compute the matrix B, and create row-major structures of A and B; they are needed to access {j: akj != 0} and {i: bki != 0} efficiently.
• For each row (column) k of P: for j with akj != 0, accumulate akj dj A*jt; for i with bki != 0, do the same with bki B*it, without the scaling factor di.

Efficient Sparse Vector Addition using a sparse accumulator
• The sparse vectors for a row are added into a sparse accumulator: a dense array of values indexed by row index, together with a list of the occupied indices, so each addition costs one access per nonzero.
• A sparse accumulator is used in both the two-phase scheme and the row-based one-phase scheme (illustrative sketches of both schemes follow below).
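To make the one-phase identity concrete, here is a minimal SciPy sketch on a small random instance. Everything in it is illustrative: the block sizes, the density, and the names A_blocks, d_blocks, r_blocks, and B are assumptions rather than values or code from the poster, and SciPy's general-purpose sparse products stand in for the tuned kernels measured above.

```python
# A minimal sketch (not the poster's implementation): check that
# P = A H At equals A D At + B Bt when H = blockdiag(Hi), Hi = Di + ri ri^t.
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)
m, block_sizes = 50, [10, 15, 20]                 # assumed, illustrative sizes
A_blocks = [sp.random(m, n, density=0.1, random_state=rng, format="csc")
            for n in block_sizes]
A = sp.hstack(A_blocks, format="csc")             # A = [A1 | A2 | A3]

d_blocks = [rng.random(n) for n in block_sizes]   # diagonals of the Di
r_blocks = [rng.random(n) for n in block_sizes]   # rank-one vectors ri

# H = blockdiag(H1, ..., HN) with Hi = Di + ri ri^t
H = sp.block_diag([np.diag(di) + np.outer(ri, ri)
                   for di, ri in zip(d_blocks, r_blocks)], format="csc")

# Two-phase scheme: two general sparse products, P = A * (H * At).
P_two_phase = A @ (H @ A.T)

# One-phase scheme: exploit the structure of H.
# P = A D At + B Bt, where D = diag(d) and column i of B is B*i = Ai ri.
D = sp.diags(np.concatenate(d_blocks))
B = sp.csc_matrix(np.column_stack([Ai @ ri for Ai, ri in zip(A_blocks, r_blocks)]))
P_one_phase = A @ D @ A.T + B @ B.T

assert np.allclose(P_two_phase.toarray(), P_one_phase.toarray())
```

Both sides compute the same P; the poster's point is that the right-hand form can be evaluated more cheaply, since D is diagonal and B has only one column per diagonal block of H.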
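The row-based one-phase scheme itself can be sketched directly from the recipe above. This is a hedged reading, not the authors' implementation: the function name row_based_one_phase, the use of SciPy CSR/CSC arrays for the row-major and column-major structures, and the marker-array form of the sparse accumulator are all assumptions.

```python
# Illustrative sketch of the row-based one-phase scheme (assumed details):
#   row k of P = sum_j akj*dj*A[:,j]^t + sum_i bki*B[:,i]^t,
# accumulated in a sparse accumulator (dense value array + occupied-index list).
import numpy as np
import scipy.sparse as sp

def row_based_one_phase(A, B, d):
    m = A.shape[0]
    A_csr, A_csc = A.tocsr(), A.tocsc()   # rows give {j: akj != 0}, columns give A[:,j]
    B_csr, B_csc = B.tocsr(), B.tocsc()   # likewise for B
    spa = np.zeros(m)                     # sparse accumulator: values ...
    mark = np.full(m, -1)                 # ... and "last row that touched this slot"
    P = sp.lil_matrix((m, m))

    def scatter(csc, j, scale, k, nz):
        """Add scale * (column j of csc) into the accumulator for row k."""
        lo, hi = csc.indptr[j], csc.indptr[j + 1]
        for i, v in zip(csc.indices[lo:hi], csc.data[lo:hi]):
            if mark[i] != k:              # first touch of index i in this row
                mark[i] = k
                spa[i] = 0.0
                nz.append(i)
            spa[i] += scale * v

    for k in range(m):
        nz = []                           # indices occupied in row k
        lo, hi = A_csr.indptr[k], A_csr.indptr[k + 1]
        for j, akj in zip(A_csr.indices[lo:hi], A_csr.data[lo:hi]):
            scatter(A_csc, j, akj * d[j], k, nz)      # akj * dj * A[:,j]
        lo, hi = B_csr.indptr[k], B_csr.indptr[k + 1]
        for i, bki in zip(B_csr.indices[lo:hi], B_csr.data[lo:hi]):
            scatter(B_csc, i, bki, k, nz)             # bki * B[:,i], no d factor
        for i in nz:                      # gather the accumulated row of P
            P[k, i] = spa[i]
    return P.tocsr()
```

The output can be cross-checked against a direct A @ sp.diags(d) @ A.T + B @ B.T; the point of the row-based formulation is that each row of P is finished with sparse-vector additions into the accumulator instead of repeated sparse-matrix additions.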
Conclusion
• Performance tuning of a higher-level sparse matrix operation than matrix-vector multiplication.
• Speedup of up to 2.1x, with less than half the memory requirement.
• An example of an algebraic transformation used for performance tuning; knowledge of the special structure of the matrix is what enables the transformation.

Preprocessing
• In the one-phase scheme: counting the number of nonzeros in B and P (to determine the amount of memory to allocate), computing the structure of the matrix B, and constructing the row-major structures of A and B.
• In the two-phase scheme: generating At, and counting the number of nonzeros in P and Q.

Improved One-phase Scheme: Utilizing the Symmetry of P
• Since P is symmetric, only the entries Pki with i >= k need to be formed for row k: in computing akj dj A*jt, it suffices to compute akj dj Ak:m,jt.
• By keeping an array of indices pointing to each A*j's next nonzero element, unnecessary accesses to entries of A*j above row k are avoided (# of accesses to A*j = # of nonzeros of A*j); a sketch of this refinement follows below.
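As a companion to the "Improved One-phase Scheme" panel above, here is a hedged sketch of the two refinements together: forming only the entries Pki with i >= k, and keeping a per-column pointer to the next unread nonzero of each A*j (extended here, as an assumption, to the columns of B as well) so that entries above row k are never rescanned. The names one_phase_symmetric, nxt_a, and nxt_b are illustrative, not from the poster.

```python
# Illustrative sketch (assumed details, not the authors' code): exploit the
# symmetry of P and per-column "next nonzero" pointers into A's and B's
# column storage, so entries above row k are skipped at most once each.
import numpy as np
import scipy.sparse as sp

def one_phase_symmetric(A, B, d):
    m = A.shape[0]
    A_csr, A_csc = A.tocsr(), A.tocsc()
    B_csr, B_csc = B.tocsr(), B.tocsc()
    A_csc.sort_indices(); B_csc.sort_indices()      # row indices ascending per column
    nxt_a = A_csc.indptr[:-1].copy()                 # next unread nonzero of A[:,j]
    nxt_b = B_csc.indptr[:-1].copy()                 # next unread nonzero of B[:,i]
    spa, mark = np.zeros(m), np.full(m, -1)
    P = sp.lil_matrix((m, m))

    def scatter_tail(csc, nxt, j, scale, k, nz):
        # Advance the pointer past entries with row index < k; since k only
        # increases, each stored entry of the column is passed over at most once.
        while nxt[j] < csc.indptr[j + 1] and csc.indices[nxt[j]] < k:
            nxt[j] += 1
        for p in range(nxt[j], csc.indptr[j + 1]):   # only the subcolumn rows k..m-1
            i = csc.indices[p]
            if mark[i] != k:
                mark[i] = k
                spa[i] = 0.0
                nz.append(i)
            spa[i] += scale * csc.data[p]

    for k in range(m):
        nz = []
        lo, hi = A_csr.indptr[k], A_csr.indptr[k + 1]
        for j, akj in zip(A_csr.indices[lo:hi], A_csr.data[lo:hi]):
            scatter_tail(A_csc, nxt_a, j, akj * d[j], k, nz)
        lo, hi = B_csr.indptr[k], B_csr.indptr[k + 1]
        for i, bki in zip(B_csr.indices[lo:hi], B_csr.data[lo:hi]):
            scatter_tail(B_csc, nxt_b, i, bki, k, nz)
        for i in nz:
            P[k, i] = spa[i]    # only P[k, i] with i >= k; P[i, k] follows by symmetry
    return P.tocsr()
```

Relative to the unrestricted row-based sketch, roughly half of the scatter work is avoided, and the pointer advance touches each stored entry of a column at most once over the whole computation, which is one reading of the panel's access-count remark.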
