410 likes | 560 Vues
This talk addresses advanced compiler and runtime techniques for optimizing numerical codes through symbolic program transformation. Key topics include sparse matrix code generation, memory hierarchy optimizations, and enhancing MATLAB performance. We dive into LU factorization, exploring techniques such as automatic blocking and dependence analysis to mitigate poor cache behavior. The discussion emphasizes the legality of program transformations, ensuring that transformations preserve the correctness of matrix operations. The presented frameworks, including Fractal Symbolic Analysis, pave the way for efficient code generation and optimizations.
E N D
Symbolic Program Transformation for Numerical Codes Vijay Menon Cornell University
Background • Compiler/Runtime Support for Numerical Programs (under Keshav Pingali) • Sparse matrix code generation • Memory hierarchy optimizations • Compiler/runtime techniques for MATLAB • MAJIC (with University of Illinois) • MultiMATLAB
Motivation • Using matrix properties to generate efficient code • e.g., Associativity, Commutativity of matrix operations • Problem: loop notation • Matrix operations hidden within loop nests
Example: LU Factorization with Partial Pivoting • Also called: Gaussian Elimination • Key algorithm for solving systems of linear equations: • To solve A x = b for x: • => Factor A into L U • => Solve L y = b for y • => Solve U x = y for x
LU Factorization do j = 1,N p(j) = j; do i = j+1,N if (A(i,j)>A(p(j),j)) p(j) = i; do k = 1,N tmp = A(j,k); A(j,k) = A(p(j),k); A(p(j),k) = tmp; do i = j+1,N A(i,j) = A(i,j)/A(j,j); do k = j+1,N do i = j+1,N A(i,k) = A(i,k) - A(i,j)*A(j,k); Select pivot row: Swap pivot row with current: Scale column (to store L): Update(to compute partial U): • Problem: Poor cache behavior x x x x x x 0 x x x x x 0 0 5 x x x 0 0 3 x x x 0 0 7x x x 0 0 2 x x x
Automatic Blocking do jB = 1,N,B do j = jB,jB+B-1 p(j) = j; do i = j+1,N if (A(i,j)>A(p(j),j)) p(j) = i; do k = 1,N tmp = A(j,k); A(j,k) = A(p(j),k); A(p(j),k) = tmp; do i = j+1,N A(i,j) = A(i,j)/A(j,j); do k = j+1,jB+B-1 do i = j+1,N A(i,k) = A(i,k) - A(i,j)*A(j,k); do j = jB,jB+B-1 do k = jB+B,N do i = j+1,N A(i,k) = A(i,k) - A(i,j)*A(j,k); • Compiler transformations: • Stripmine • Index-Set Split • Loop Distribute • Tile/Map to BLAS • Problems: • Establishing legality • Mapping to BLAS
Key Points • Conventional techniques (dependence analysis) are insufficient to establish legality • Detection of matrix operations buys extra performance
Overview of this Talk • Fractal Symbolic Analysis • Framework for Symbolically determining Legality of Program Transformations • Matrixization • Generalization of Vectorization to Matrix Operations
Fractal Symbolic Analysis • Symbolic test to establish legality of program transformations for i = 1 : n S1(i); for i = 1 : n S2(i); for i = 1 : n S1(i); S2(i);
Dependence Analysis • Independent operations may be reordered • Legality: No data dependence from S2(m) to S1(l) where l > m for i = 1 : n S1(i); for i = 1 : n S2(i); for i = 1 : n S1(i); S2(i);
Symbolic Comparison • Dependence Analysis is conservative: • Symbolic execution shows equality: • aout = 2*(ain + bin) • bout = 2*bin • But, intractable for recurrent loops s1: a = 2*a s2: b = 2*b s3: a = a+b s3: a = a+b s1: a = 2*a s2: b = 2*b ?
Distillation of LU • Given p(j) j, prove: • Dependence analysis: too conservative • Symbolic comparison: intractable for j = 1:n tmp = a(j) a(j) = a(p(j)) a(p(j)) = tmp for j = 1:n for i = j+1:n a(i) = a(i)/a(j) for j = 1:n tmp = a(j) a(j) = a(p(j)) a(p(j)) = tmp for i = j+1:n a(i) = a(i)/a(j) B1(j): B1(j): B2(j): B2(j):
Idea • Simplify • Prove equality of complex programs via equality of simpler programs • Repeat if necessary • Sufficient, but not necessary • equality of simpler programs -> equality of original programs
Commutativity S1, S2, S3, …,Sn = Sf(1), Sf(2), Sf(3), …,Sf(n) i,j 1..n | i < j f(i) > f(j). Si; Sj = Sj; Si • Permutation Sequence of Adjacent Transpositions
Commutative Legality • Loop Distribution • Legality: 1 m < l N. S2(m); S1(l) = S1(l); S2(m) for i = 1 : n S1(i); S2(i); for i = 1 : n S1(i); for i = 1 : n S2(i);
Commutative Legality • Similar conditions: • Statement reordering • Loop interchange • Loop reversal • Loop tiling • Compiler establishes legality of transformation by testing commute condition
Back to Example • Given p(j) j, prove: • Simplified comparison => for j = 1:n tmp = a(j) a(j) = a(p(j)) a(p(j)) = tmp for j = 1:n for i = j+1:n a(i) = a(i)/a(j) for j = 1:n tmp = a(j) a(j) = a(p(j)) a(p(j)) = tmp for i = j+1:n a(i) = a(i)/a(j) B1(j): B1(j): B2(j): B2(j):
Simplified Comparison • Given p(j) j l > m prove: • Further simplification? for i = m+1:n a(i) = a(i)/a(m) tmp = a(l) a(l) = a(p(l)) a(p(l)) = tmp tmp = a(l) a(l) = a(p(l)) a(p(l)) = tmp for i = m+1:n a(i) = a(i)/a(m) B2(m): B1(l): B1(l): B2(m):
Too Simple! • Given p(j) j l > m i > m prove: • Too conservative - no longer legal! • Hypothesis: • Repeated application -> dependence analysis tmp = a(l) a(l) = a(p(l)) a(p(l)) = tmp a(i) = a(i)/a(m) a(i) = a(i)/a(m) tmp = a(l) a(l) = a(p(l)) a(p(l)) = tmp
Symbolic Comparison • Can we symbolically prove?: • Effect can be summarized: • non-recurrent loops • affine indices/loop bounds tmp = a(l) a(l) = a(p(l)) a(p(l)) = tmp for i = m+1:n a(i) = a(i)/a(m) for i = m+1:n a(i) = a(i)/a(m) tmp = a(l) a(l) = a(p(l)) a(p(l)) = tmp B2(m): B1(l): B1(l): B2(m):
Guarded Expressions • Compare symbolic values of live output variable in terms of input variables: • aout (k)= • Each • guard : affine expression defining part of aout • expr : symbolic expression on input data guard1 -> expr1 guard2 -> expr2 …… guardn -> exprn
Guarded Expressions • For both program blocks: • aout(k) = • Omega Library (integer programming tool) to manipulate/compare affine guards k m => ain(k) k = l => ain(p(l))/ain(m) k = p(l) => ain(l)/ain(m) else => ain(k)/ain(m)
Summary of Fractal Symbolic Analysis • Powerful legality test • Explores tradeoffs between • tractability (dependence analysis) • accuracy (symbolic comparison) • Similar application for LU w/ pivoting • 2 recursive simplification steps • 6 guarded regions/expressions • Prototype implemented in OCAML
Overview of this Talk • Fractal Symbolic Analysis • Framework for Symbolically determining Legality of Program Transformations • Matrixization • Generalization of Vectorization to Matrix Operations
Matrixization • Detect Matrix Operations • Map to • hand-optimized libraries (BLAS) • special-purpose hardware • Eliminate loop overhead • MATLAB: type/bounds checks • Exploit Matrix Semantics • Ring Properties of Matrix (+,*)
Galerkin Example • In Galerkin, 98% of time spent in: • for i = 1:N • for j = 1:N • phi(k) += A(i,j)*x(i)*y(i); • end • end • A - Matrix • x, y - Vector
Vectorized Code • In Optimized Galerkin: phi(k) += x*A*y’; • Fragment Speedup: 260 • Program Speedup: 110
Conventional Approaches • Syntactic Pattern Replacement (KAPF,VAST,FALCON) • Can encode high-level properties • Limited use on loops • Vector Code Generation • Can detect array operations in loops • Cannot detect/exploit matrix products
Our Approach • Map code to: Abstract Matrix Form (AMF) • Convert to Symmetric AMF formulation • Optimize AMF expressions • factorization • invariant removal • Detect Matrix Products • Map AMF to MATLAB/BLAS
Expansion • Array expressions: • Forall loops: a(1:m,1:n) -> i j ai,j for i = 1:n x(i) = x(i) + y(i); -> ixi = i(xi+yi) end
Reduction • Sum Reduction Loops for i = 1:n k = k + x(i); -> k = k + î i xi end
Product • Matrix Product C = A*B -> ijCi,j= P(ijAi,h,ijBh,j) • Product-Reduction Equivalence: • e1 and e2 are scalar • e1 is constant w.r.t. ik+1…in • e2 is constant w.r.t. i1…ik-1 îk (i1…ine1 * i1…ine2) = Pîk(i1…ike1, ik…ine2)
Other AMF Properties • Distributive Properties • e1·î e2 = î((ie1) ·e2) • when e1 is constant w.r.t. i • Interchange Properties • if(e1, e2,…, en) = f(ie1, ie2,…, ien) • i e = i e • where î= • i j e = j i e • î e = î e
Back to Galerkin • Original: • k = k + îi j (ai,j * xi * yj) • Convert to Symmetric Form: • k = k + î(i j ai,j * i j xi * i j yj) • Optimize: • k = k + îi xi * (i j ai,j * i j yj) • Map to Matrix Operations: • k = k + Pî (i xi , P(i j ai,j,j yj)) • => k = k + x’*A*y;
Other Vectorization Examples • Taken from USENET: for i = 1:n for j = 1:n C(i,j) = A(i,j)*x(i); C(1:n,1:n) = A(1:n,1:n) .* repmat(x(1,n),1,n) for i = 1:n x(i) = y(:,i)*A*y(:,i)’; x(1:n) = sum(y.*(y*A’),2);
Summary/Status of Matrixization • General technique to detect matrix/vector operations • Implemented in MAJIC • MATLAB interpreter/compiler • Future work: • Extend to more ops: recurrences • JIT Matrixization
Related Work • Commutativity Analysis • Rinard & Diniz • Parallelization of OO programs • High-Level Pattern Replacement • DeRose, Marsolf, … • Exploit high-level matrix properties • Structure • Symmetry • Orthogonality
Conclusion • High-level symbolic techniques: • Fractal Symbolic Analysis • Matrixization • New techniques to analyze & exploit underlying computation and achieve substantial performance gains