Optimizing Data Permutations for SIMD Devices

Gang Ren, Peng Wu (1), David Padua
University of Illinois at Urbana-Champaign
(1) IBM T.J. Watson Research Center

SIMD Is Everywhere
[Figure: SIMD architecture with memory, a register file, and four parallel ALU lanes]

SIMD Compilation
Scalar source:
    float a[16], b[16], c[16];
    for (i = 0; i < 16; i++) c[i] = a[i] + b[i];
Explore Data Parallelism:
• Vectorization
• Instruction Packing
• If Conversion
• ……
Resulting array statement:
    c[0:15] = a[0:15] + b[0:15];
Generating Efficient SIMD Code:
• Data Permutation Optimization
• Idiom Recognition
• Execution Mapping
• Type Promotion Elimination
• ……
Resulting SIMD code:
    ... vr1 = vec_load(a); vr2 = vec_load(b); vr3 = vec_add(vr1, vr2); ...

Strict SIMD Architecture (1)
• Most SIMD devices support memory accesses only on contiguous and aligned memory sections
    Array statement: ... = ...a[0:3:1]...;
    SIMD code:       vr1 = vec_load(a);
[Figure: a0 a1 a2 a3 are loaded from memory (a0 a1 a2 a3 a4 a5 a6 a7 ……) into one register of the register file and fed to the four ALU lanes]

Strict SIMD Architecture (2)
• Additional permutation instructions are needed for non-contiguous and/or misaligned memory references
    Array statement: ... = ...a[0:6:2]...;
    SIMD code:       vr1 = vec_load(a); vr2 = vec_load(a+4); vr4 = vperm(vr1, vr2, <0,2,4,6>);
• Strict SIMD devices: all data reorganization must be accomplished with permutation instructions
[Figure: two registers holding a0 a1 a2 a3 and a4 a5 a6 a7 are combined by vperm <0,2,4,6> into a0 a2 a4 a6 before reaching the ALU lanes]
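The stride-2 access on this slide can be simulated in a few lines. This is a minimal sketch, assuming four-element registers and assuming that vperm selects from the concatenation of its two input registers; `vec_load` and `vperm` below are plain Python stand-ins for the slide's pseudo-instructions, not a real intrinsic API.

```python
VL = 4  # assumed vector length: four elements per register


def vec_load(mem, offset=0):
    """Stand-in for an aligned, contiguous load of VL elements."""
    assert offset % VL == 0, "strict SIMD: loads must be aligned"
    return mem[offset:offset + VL]


def vperm(vr1, vr2, pattern):
    """Stand-in for a two-input permute: index i selects from vr1 followed by vr2."""
    pool = vr1 + vr2
    return [pool[i] for i in pattern]


a = [f"a{i}" for i in range(8)]
vr1 = vec_load(a, 0)                 # a0 a1 a2 a3
vr2 = vec_load(a, 4)                 # a4 a5 a6 a7
vr4 = vperm(vr1, vr2, [0, 2, 4, 6])  # gathers the stride-2 section a[0:6:2]
print(vr4)
```

Two loads plus one permute replace what would be a single strided load on a less strict device, which is the overhead the rest of the talk tries to reduce.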

Overview of the Optimization Framework
Input:
    c[0:15] = a[0:31:2] + b[0:15];
Stages: Normalization → Optimization → Code Generation
Output:
    ... vr1 = vec_load(a); vr2 = vec_load(a+4); vr3 = vperm(vr1,vr2,…); vr4 = vec_load(b); ...

Example: An 8-point FFT Program
0. Input program:
    t0[0:6:2] = x[0:3] + x[4:7];
    t0[1:7:2] = x[0:3] - x[4:7];
    t1[0:7] = T8[0:7] * t0[0:7];
    for (i = 0; i < 2; i++) {
        t2[0:2:2] = t1[i:i+2:2] + t1[i+4:i+6:2];
        t2[1:3:2] = t1[i:i+2:2] - t1[i+4:i+6:2];
        t3[0:3] = T4[0:3] * t2[0:3];
        y[i+0:i+2:2] = t3[0:1] + t3[2:3];
        y[i+4:i+6:2] = t3[0:1] - t3[2:3];
    }
1. Program with all data reorganizations written as explicit Permute operations:
    v1[0:3] = x[0:3] + x[4:7];
    v1[4:7] = x[0:3] - x[4:7];
    t0[0:7] = Permute(v1[0:7], P1);
    t1[0:7] = T8[0:7] * t0[0:7];
    v2[0:7] = Permute(t1[0:7], P2);
    u1[0:7] = Permute(v2[0:7], P3);
    u2[0:3] = u1[0:3] + u1[4:7];
    u2[4:7] = u1[0:3] - u1[4:7];
    v3[0:7] = Permute(u2[0:7], P4);
    t2[0:7] = Permute(v3[0:7], P5);
    t3[0:7] = T4_2[0:7] * t2[0:7];
    u3[0:7] = Permute(t3[0:7], P6);
    u4[0:3] = u3[0:3] + u3[4:7];
    u4[4:7] = u3[0:3] - u3[4:7];
    v4[0:7] = Permute(u4[0:7], P7);
    y[0:7] = Permute(v4[0:7], P8);
2. Program after the Permute operations are minimized:
    v1[0:3] = x[0:3] + x[4:7];
    v1[4:7] = x[0:3] - x[4:7];
    t1[0:7] = T8[0:7] * v1[0:7];
    u1[0:7] = Permute(t1[0:7], Q1);
    u2[0:3] = u1[0:3] + u1[4:7];
    u2[4:7] = u1[0:3] - u1[4:7];
    t3[0:7] = T4_2[0:7] * u2[0:7];
    u3[0:7] = Permute(t3[0:7], Q2);
    y[0:3] = u3[0:3] + u3[4:7];
    y[4:7] = u3[0:3] - u3[4:7];
3. Generating native permutation instructions from Permute operations

Overview of the Optimization Framework
Stages: Normalization → Optimization → Code Generation
Normalization:
• Use generic Permute to represent:
  - Non-unit strides
  - Misalignment
  - Other reorganizations

Data Permutations on Vectors
• Permute(Xn, Pn): Xn is a vector and Pn is a permutation matrix
• Use Permute to represent all data reorganizations explicitly
Example:
    b[0:3] = Permute(a[0:3], <2,1,0,3>)    gives b = (a2, a1, a0, a3)
Two stride-2 accesses on the right-hand side:
    ... = a[0:6:2] + a[1:7:2];
become one Permute followed by contiguous accesses:
    t[0:7] = Permute(a[0:7], <0,2,4,6,1,3,5,7>);
    ... = t[0:3] + t[4:7];
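The Permute notation can be read as a gather: element i of the result is the source element named by entry i of the pattern. The sketch below assumes that reading (it is the one under which the stride-2 example on this slide works out) and uses a hypothetical helper `permute`; array sections such as a[0:6:2] are taken to have an inclusive upper bound, as in the slides.

```python
def permute(x, p):
    """Gather form of Permute: result[i] = x[p[i]]."""
    assert sorted(p) == list(range(len(x))), "p must be a permutation"
    return [x[i] for i in p]


a = [10, 11, 12, 13, 14, 15, 16, 17]

# Direct form: ... = a[0:6:2] + a[1:7:2]
direct = [a[i] + a[i + 1] for i in range(0, 8, 2)]

# Normalized form: one Permute, then two contiguous halves.
t = permute(a, [0, 2, 4, 6, 1, 3, 5, 7])
normalized = [t[i] + t[i + 4] for i in range(4)]

# The small example from the slide.
small = permute(["a0", "a1", "a2", "a3"], [2, 1, 0, 3])

print(direct, normalized, small)
```

Both forms produce the same four sums; the normalized form simply makes the data movement visible as a single operation that later stages can rewrite.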

Overview of the Optimization Framework
Stages: Normalization → Optimization → Code Generation
Optimization:
• Minimize Permute ops in a basic block
  - Based on two rules of Permute
• An NP-complete problem
• Propagation-based algorithm

Two Important Rules on Permutations
• Composition Rule
    Permute(Permute(a[0:3:1], <1,0,3,2>), <2,1,0,3>)
      = Permute(a[0:3:1], <3,0,1,2>)
• Distributive Rule
    Permute(a[0:3:1], <1,0,3,2>) + Permute(b[0:3:1], <1,0,3,2>)
      = Permute(a[0:3:1] + b[0:3:1], <1,0,3,2>)
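Both rules can be checked numerically on the slide's own patterns. This is a small sketch, assuming the gather reading of Permute (result[i] = x[p[i]]); `permute` and `compose` are hypothetical helper names, and the input values are arbitrary.

```python
def permute(x, p):
    """Gather form of Permute: result[i] = x[p[i]]."""
    return [x[i] for i in p]


def compose(p_inner, p_outer):
    """Pattern q such that permute(permute(x, p_inner), p_outer) == permute(x, q)."""
    return [p_inner[i] for i in p_outer]


a = [1.0, 2.0, 3.0, 4.0]
b = [10.0, 20.0, 30.0, 40.0]
p1, p2 = [1, 0, 3, 2], [2, 1, 0, 3]

# Composition rule: two nested Permutes collapse into one.
merged = compose(p1, p2)
lhs_comp = permute(permute(a, p1), p2)
rhs_comp = permute(a, merged)

# Distributive rule: the same Permute on both operands moves past the addition.
lhs_dist = [x + y for x, y in zip(permute(a, p1), permute(b, p1))]
rhs_dist = permute([x + y for x, y in zip(a, b)], p1)

print(merged, lhs_comp == rhs_comp, lhs_dist == rhs_dist)
```

The merged pattern comes out as <3,0,1,2>, matching the slide. The distributive check holds for any elementwise operation, not only addition.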

Propagation-Based Optimization Algorithm
• Overview: propagating permutation to permutation
• Step 1: Pick up an unvisited permutation statement
• Step 2: Propagate the permutation from the definition to the uses
• Step 3: If a use is a permutation, go to (a); otherwise go to (b)
  - (a) Merge it with the propagated permutation pattern. Go to Step 1
  - (b) Propagate the permutation from the right-hand side to the left-hand side. Go to Step 2
Before optimization:
    v1[0:3] = x[0:3] + x[4:7];
    v1[4:7] = x[0:3] - x[4:7];
    t0[0:7] = Permute(v1[0:7], P1);
    t1[0:7] = T8[0:7] * t0[0:7];
    v2[0:7] = Permute(t1[0:7], P2);
    u1[0:7] = Permute(v2[0:7], P3);
    u2[0:3] = u1[0:3] + u1[4:7];
    u2[4:7] = u1[0:3] - u1[4:7];
    v3[0:7] = Permute(u2[0:7], P4);
    t2[0:7] = Permute(v3[0:7], P5);
    t3[0:7] = T4_2[0:7] * t2[0:7];
    u3[0:7] = Permute(t3[0:7], P6);
    u4[0:3] = u3[0:3] + u3[4:7];
    u4[4:7] = u3[0:3] - u3[4:7];
    v4[0:7] = Permute(u4[0:7], P7);
    y[0:7] = Permute(v4[0:7], P8);
Intermediate states (only the rewritten statements are shown; a prime marks an updated operand):
  First, P1 is propagated through the multiplication and reaches P2:
    t1[0:7] = T8'[0:7] * v1[0:7];
    v2[0:7] = Permute(t1[0:7], P2');
  Next, P2' reaches P3:
    u1[0:7] = Permute(t1[0:7], P3');
  Later, the permutations around the second multiplication are rewritten in the same way:
    t3[0:7] = T4_2'[0:7] * u2[0:7];
    u3[0:7] = Permute(t3[0:7], P6');
    y[0:3] = u3[0:3] + u3[4:7];
    y[4:7] = u3[0:3] - u3[4:7];
After optimization:
    v1[0:3] = x[0:3] + x[4:7];
    v1[4:7] = x[0:3] - x[4:7];
    t1[0:7] = T8[0:7] * v1[0:7];
    u1[0:7] = Permute(t1[0:7], Q1);
    u2[0:3] = u1[0:3] + u1[4:7];
    u2[4:7] = u1[0:3] - u1[4:7];
    t3[0:7] = T4_2[0:7] * u2[0:7];
    u3[0:7] = Permute(t3[0:7], Q2);
    y[0:3] = u3[0:3] + u3[4:7];
    y[4:7] = u3[0:3] - u3[4:7];
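One propagation step can be sketched as follows: a Permute on one operand of a multiplication by a constant table is pushed past the multiplication by reordering the table, and then merged with the next Permute. This is an illustration under assumptions, not the authors' implementation: the values of `T8`, `P1`, and `P2` are hypothetical placeholders (the slides do not give them), the gather reading of Permute is assumed, and `push_through_table` is one plausible way to obtain the primed table.

```python
def permute(x, p):
    """Gather form of Permute: result[i] = x[p[i]]."""
    return [x[i] for i in p]


def compose(p_inner, p_outer):
    """Merge two nested Permutes into one pattern (composition rule)."""
    return [p_inner[i] for i in p_outer]


def push_through_table(table, p):
    """Return table2 with permute(v, p) * table == permute(v * table2, p)."""
    table2 = [None] * len(table)
    for i, src in enumerate(p):
        table2[src] = table[i]
    return table2


def mul(x, y):
    return [u * v for u, v in zip(x, y)]


v1 = [1, 2, 3, 4, 5, 6, 7, 8]
T8 = [2, 3, 5, 7, 11, 13, 17, 19]   # placeholder constants
P1 = [0, 4, 1, 5, 2, 6, 3, 7]       # placeholder pattern
P2 = [1, 0, 3, 2, 5, 4, 7, 6]       # placeholder pattern

# Before: t0 = Permute(v1, P1); t1 = T8 * t0; v2 = Permute(t1, P2)
v2_before = permute(mul(T8, permute(v1, P1)), P2)

# After: t1 = T8' * v1; v2 = Permute(t1, P2')
T8_prime = push_through_table(T8, P1)
P2_prime = compose(P1, P2)
v2_after = permute(mul(T8_prime, v1), P2_prime)

print(v2_before == v2_after)
```

Reordering the table costs nothing at run time when the table is a compile-time constant, so the step trades one run-time Permute for a changed constant.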

Propagating Permutations to Partial Uses
b[0:3] and b[4:7] are two partial uses of b[0:7].
A permutation that can be partitioned:
    b[0:7] = Permute(a[0:7], <3,2,1,0,7,6,5,4>);
    c[0:3] = b[0:3] + b[4:7];
is equivalent to:
    b[0:3] = Permute(a[0:3], <3,2,1,0>);
    b[4:7] = Permute(a[4:7], <3,2,1,0>);
    c[0:3] = b[0:3] + b[4:7];
A permutation that cannot be partitioned this way, because it moves elements between the two halves:
    b[0:7] = Permute(a[0:7], <0,4,1,5,2,6,3,7>);
    c[0:3] = b[0:3] + b[4:7];
• Not all permutations can be partitioned and propagated to partial uses
• Improvements over the partial-use boundary
  - Permutation decomposition
    • Register-wise decomposition
    • Shuffle instruction decomposition
  - Permutation reshaping
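Whether a pattern can be split along a partial-use boundary can be tested mechanically. The sketch below uses a hypothetical helper, `split_by_blocks`, and a deliberately narrow criterion that is an assumption of this example: every block of the result must draw only from the block of the source at the same position.

```python
def split_by_blocks(p, block):
    """Split pattern p into per-block patterns, or return None if p mixes blocks."""
    parts = []
    for start in range(0, len(p), block):
        chunk = p[start:start + block]
        if any(not (start <= i < start + block) for i in chunk):
            return None  # some element crosses the block boundary
        parts.append([i - start for i in chunk])
    return parts


ok = split_by_blocks([3, 2, 1, 0, 7, 6, 5, 4], 4)
mixed = split_by_blocks([0, 4, 1, 5, 2, 6, 3, 7], 4)
print(ok, mixed)
```

The first pattern splits into two copies of <3,2,1,0>, as on the slide; the second returns None, which is the case the decomposition and reshaping techniques listed on this slide are meant to address.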

Optimization: Permutation Reshaping
• For permutations used in commutative operations
Before reshaping:
    b[0:7] = Permute(a[0:7], <0,5,2,7,4,1,6,3>);
    c[0:3] = b[0:3] + b[4:7];
After reshaping:
    b[0:7] = Permute(a[0:7], <0,1,2,3,4,5,6,7>);
    c[0:3] = b[0:3] + b[4:7];
Both versions compute c = (a0+a4, a1+a5, a2+a6, a3+a7), because a5+a1 = a1+a5 and a7+a3 = a3+a7.
A second example on 16 elements:
    b[0:15] = Permute(a[0:15], <0,…,11,10,9,8>);
    c[0:7] = b[0:7] + b[8:15];
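The reshaping example can be reproduced with a short script. This is a sketch under assumptions: the gather reading of Permute, and a simple rule chosen for this illustration only (order each pair of sources that meet in the same addition so that the smaller index comes first). The slides do not say how the reshaped pattern is actually selected; `reshape_halves` and `add_halves` are hypothetical names.

```python
def permute(x, p):
    """Gather form of Permute: result[i] = x[p[i]]."""
    return [x[i] for i in p]


def reshape_halves(p):
    """Swap p[i] and p[i+h] where needed so the smaller source index comes first."""
    h = len(p) // 2
    q = list(p)
    for i in range(h):
        if q[i] > q[i + h]:
            q[i], q[i + h] = q[i + h], q[i]
    return q


def add_halves(b):
    """c[0:h-1] = b[0:h-1] + b[h:2h-1], a commutative use of the two halves."""
    h = len(b) // 2
    return [b[i] + b[i + h] for i in range(h)]


a = [1, 2, 4, 8, 16, 32, 64, 128]
p = [0, 5, 2, 7, 4, 1, 6, 3]
q = reshape_halves(p)

c_before = add_halves(permute(a, p))
c_after = add_halves(permute(a, q))
print(q, c_before == c_after)
```

Here the reshaped pattern is the identity, so the Permute disappears entirely. The swap is only legal because addition is commutative; it would not be valid for the subtraction statements in the FFT example.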

Overview of the Optimization Framework
Stages: Normalization → Optimization → Code Generation
Code Generation:
• “Strip-mine” Permute to vperm instructions
• Map vperm to native permutation instructions

Generating Permutation Instructions (1)
    a[0:15] = Permute(b[0:15], <0,4,8,12,1,5,9,13,2,6,10,14,3,7,11,15>);
[Figure: b[0:15] is turned into a[0:15] by a network of vperm instructions with patterns such as <0,4,*,*>, <0,1,4,*>, and <0,1,2,4>]

Generating Permutation Instructions (2)
    a[0:15] = Permute(b[0:15], <0,4,8,12,1,5,9,13,2,6,10,14,3,7,11,15>);
• Two steps:
  - Maximize empty slots when generating vperm instructions
  - Fill empty slots with data elements that go to the same target
[Figure: the same permutation generated with vperm patterns <0,4,*,*>, <0,4,1,5>, <0,1,4,5>, and <2,3,6,7>]
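The 16-element permutation on these slides is a 4x4 transpose, and one way to build it from two-input, four-element vperm instructions can be simulated directly. This is a sketch under assumptions: `vperm` is a plain Python stand-in that selects from the concatenation of its inputs, the patterns <0,4,1,5>, <0,1,4,5>, and <2,3,6,7> come from the slide, and the pattern <2,6,3,7> is added here by analogy and does not appear in the transcript.

```python
def permute(x, p):
    """Gather form of Permute: result[i] = x[p[i]]."""
    return [x[i] for i in p]


def vperm(vr1, vr2, pattern):
    """Stand-in for a two-input permute over four-element registers."""
    pool = vr1 + vr2
    return [pool[i] for i in pattern]


b = list(range(16))
r0, r1, r2, r3 = (b[i:i + 4] for i in range(0, 16, 4))

# Stage 1: every slot is filled with an element that a later vperm needs.
t0 = vperm(r0, r1, [0, 4, 1, 5])   # b0  b4  b1  b5
t1 = vperm(r2, r3, [0, 4, 1, 5])   # b8  b12 b9  b13
t2 = vperm(r0, r1, [2, 6, 3, 7])   # b2  b6  b3  b7   (assumed pattern)
t3 = vperm(r2, r3, [2, 6, 3, 7])   # b10 b14 b11 b15  (assumed pattern)

# Stage 2: combine the half-finished registers into the final rows.
a0 = vperm(t0, t1, [0, 1, 4, 5])   # b0 b4 b8  b12
a1 = vperm(t0, t1, [2, 3, 6, 7])   # b1 b5 b9  b13
a2 = vperm(t2, t3, [0, 1, 4, 5])   # b2 b6 b10 b14
a3 = vperm(t2, t3, [2, 3, 6, 7])   # b3 b7 b11 b15

a = a0 + a1 + a2 + a3
expected = permute(b, [0, 4, 8, 12, 1, 5, 9, 13, 2, 6, 10, 14, 3, 7, 11, 15])
print(a == expected)
```

Filling the slots that <0,4,*,*> would leave empty lets each stage-1 result feed two stage-2 instructions, so the whole transpose takes eight vperm operations in this simulation.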

Experiment Setup
• Two SIMD devices: VMX (AltiVec) & SSE2
• Tested applications
  - Group I: Applications with relatively simple permutation patterns
    • C-Saxpy: Complex version of saxpy ( y = alpha*x + y )
    • R-Color, C-Dot, R-FIR, …
  - Group II: Applications with complicated permutation patterns
    • FFT: Fast Fourier transform programs generated by the SPIRAL system
    • WHT: Walsh-Hadamard transform routines generated by the SPIRAL system
    • Bitonic sorting: One of the fastest sorting networks
  - Group III: Reorganization-only applications
    • Matrix transpose
    • Bit-reversal reordering

Static Evaluation: # of Permutation Inst.

Run-time Performance of FFT & Bitonic Sorting

Overall Speedups

Related Work
• Optimizing permutation instructions introduced by misalignment
  - A. Eichenberger, P. Wu, K. O'Brien, Vectorization for SIMD architectures with alignment constraints, PLDI ’04
  - P. Wu, A. Eichenberger, A. Wang, Efficient SIMD Code Generation for Runtime Alignment and Length Conversion, CGO ’05
• Efficient permutation instruction generation
  - A. Kudriavtsev, P. Kogge, Generation of permutations for SIMD processors, LCTES ’05
  - M. Narayanan, K. Yelick, Generating permutation instructions from a high-level description, MSP ’04
  - D. Nuzman, I. Rosen, A. Zaks, Auto-vectorization of interleaved data for SIMD, PLDI ’06
• Similar idea, different applications
  - A. Solar-Lezama, R. Rabbah, R. Bodik, K. Ebcioglu, Programming by sketching for bit-streaming programs, PLDI ’05
  - S. Chatterjee, J. Gilbert, R. Schreiber, S. Teng, Automatic array alignment in data-parallel programs, POPL ’93
  - G. Hwang, J. K. Lee, D. Ju, An array operation synthesis scheme to optimize FORTRAN 90 programs, PPOPP ’95

Conclusion
• Reducing the overhead introduced by permutation instructions is a performance-critical problem for SIMD compilation
• A unified framework is proposed to optimize data permutations
  - Putting all forms of data permutations into a unified representation
  - Propagating permutations across statements and merging them together
  - Generating efficient permutation instructions natively supported by devices
• Experiments were conducted on different applications
  - Up to 77% of permutation instructions are eliminated
  - Average performance improves by 48% on VMX and 68% on SSE2
  - Near-peak overall speedups are achieved on some applications

Thank You! June 2006
