
Optimizing Data Permutations for SIMD Devices

Gang Ren, Peng Wu (1), David Padua. University of Illinois at Urbana-Champaign; (1) IBM T.J. Watson Research Center.


Presentation Transcript


  1. Optimizing Data Permutations for SIMD Devices. Gang Ren, Peng Wu (1), David Padua. University of Illinois at Urbana-Champaign; (1) IBM T.J. Watson Research Center

  2. SIMD Is Everywhere [Figure: SIMD architecture, with memory, register file, and a four-lane ALU]

  3. SIMD Compilation
     Source loop:
       int a[16], b[16], c[16];
       for (i = 0; i < 16; i++) c[i] = a[i] + b[i];
     Explore data parallelism (• Vectorization • Instruction Packing • If Conversion • ...):
       c[0:15] = a[0:15] + b[0:15];
     Generating efficient SIMD code (• Data Permutation Optimization • Idiom Recognition • Execution Mapping • Type Promotion Elimination • ...):
       vr1 = vec_load(a); vr2 = vec_load(b); vr3 = vec_add(vr1, vr2);

  4. Strict SIMD Architecture (1)
     • Most SIMD devices only support memory accesses on contiguous and aligned memory sections
       ... = ...a[0:3:1]...; maps to vr1 = vec_load(a);
     [Figure: a0..a3 are loaded from memory (a0 a1 a2 a3 a4 a5 a6 a7 ...) into one register and fed to the four-lane ALU]

  5. Strict SIMD Architecture (2)
     • Additional permutation instructions are needed for non-contiguous and/or misaligned memory references
       ... = ...a[0:6:2]...; maps to vr1 = vec_load(a); vr2 = vec_load(a+4); vr4 = vperm(vr1, vr2, <0,2,4,6>);
     • Strict SIMD devices: All data reorganization must be accomplished with permutation instructions.
     [Figure: a0..a7 are loaded from memory into two registers; vperm <0,2,4,6> selects a0, a2, a4, a6 for the ALU]
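
The stride-2 access on this slide can be mimicked in a few lines of Python. This is only a software model under stated assumptions: `VLEN`, the list-based registers, and the alignment check are illustrative choices, while `vec_load` and `vperm` borrow the slide's names without claiming to match any real intrinsic.

```python
VLEN = 4  # assumed register width in elements (illustrative)

def vec_load(mem, offset=0):
    """Model an aligned, contiguous load of VLEN elements."""
    assert offset % VLEN == 0, "strict SIMD model: loads must be aligned"
    return mem[offset:offset + VLEN]

def vperm(v1, v2, pattern):
    """Select elements from the concatenation of two registers."""
    both = v1 + v2
    return [both[i] for i in pattern]

a = [f"a{i}" for i in range(8)]          # memory: a0 .. a7
vr1 = vec_load(a, 0)                     # a0 a1 a2 a3
vr2 = vec_load(a, 4)                     # a4 a5 a6 a7
vr4 = vperm(vr1, vr2, [0, 2, 4, 6])      # the slide's <0,2,4,6>
print(vr4)                               # ['a0', 'a2', 'a4', 'a6']
```

Two loads plus one permutation replace a single strided access, which is the overhead the rest of the talk tries to reduce.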

  6. Overview of the Optimization Framework
     Input: c[0:15] = a[0:31:2] + b[0:15];
     Phases: Normalization -> Optimization -> Code Generation
     Output: ... vr1 = vec_load(a); vr2 = vec_load(a+4); vr3 = vperm(vr1,vr2,…); vr4 = vec_load(b); ...

  7. Example: An 8-point FFT Program
     Input program:
       t0[0:6:2] = x[0:3] + x[4:7];
       t0[1:7:2] = x[0:3] - x[4:7];
       t1[0:7] = T8[0:7] * t0[0:7];
       for (i = 0; i < 2; i++) {
         t2[0:2:2] = t1[i:i+2:2] + t1[i+4:i+6:2];
         t2[1:3:2] = t1[i:i+2:2] - t1[i+4:i+6:2];
         t3[0:3] = T4[0:3] * t2[0:3];
         y[i+0:i+2:2] = t3[0:1] + t3[2:3];
         y[i+4:i+6:2] = t3[0:1] - t3[2:3];
       }
     With all data reorganizations written as explicit Permute operations:
       v1[0:3] = x[0:3] + x[4:7];
       v1[4:7] = x[0:3] - x[4:7];
       t0[0:7] = Permute(v1[0:7], P1);
       t1[0:7] = T8[0:7] * t0[0:7];
       v2[0:7] = Permute(t1[0:7], P2);
       u1[0:7] = Permute(v2[0:7], P3);
       u2[0:3] = u1[0:3] + u1[4:7];
       u2[4:7] = u1[0:3] - u1[4:7];
       v3[0:7] = Permute(u2[0:7], P4);
       t2[0:7] = Permute(v3[0:7], P5);
       t3[0:7] = T4_2[0:7] * t2[0:7];
       u3[0:7] = Permute(t3[0:7], P6);
       u4[0:3] = u3[0:3] + u3[4:7];
       u4[4:7] = u3[0:3] - u3[4:7];
       v4[0:7] = Permute(u4[0:7], P7);
       y[0:7] = Permute(v4[0:7], P8);
     With Permute operations minimized:
       v1[0:3] = x[0:3] + x[4:7];
       v1[4:7] = x[0:3] - x[4:7];
       t1[0:7] = T8[0:7] * v1[0:7];
       u1[0:7] = Permute(t1[0:7], Q1);
       u2[0:3] = u1[0:3] + u1[4:7];
       u2[4:7] = u1[0:3] - u1[4:7];
       t3[0:7] = T4_2[0:7] * u2[0:7];
       u3[0:7] = Permute(t3[0:7], Q2);
       y[0:3] = u3[0:3] + u3[4:7];
       y[4:7] = u3[0:3] - u3[4:7];
     Final step: generating native permutation instructions from Permute operations

  8. Overview of the Optimization Framework (Normalization)
     • Use generic Permute to represent:
       - Non-unit strides
       - Misalignment
       - Other reorganizations

  9. Data Permutations on Vectors
     • Permute(Xn, Pn): Xn is a vector and Pn is a permutation matrix
       b[0:3] = Permute(a[0:3], <2,1,0,3>) gives b = (a2, a1, a0, a3)
     • Use Permute to represent all data reorganizations explicitly
       Two stride-2 accesses at right-hand side: ... = a[0:6:2] + a[1:7:2];
       Explicit form: t[0:7] = Permute(a[0:7], <0,2,4,6,1,3,5,7>); ... = t[0:3] + t[4:7];
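
A minimal sketch of the Permute semantics implied by the slide's examples: result element i is taken from source position P[i] (a "gather" reading). The helper name `permute` and the sample data are assumptions made for illustration; note that the slide's inclusive sections such as a[0:6:2] correspond to Python's half-open `a[0:7:2]`.

```python
def permute(x, p):
    """Gather form of Permute: result[i] = x[p[i]]."""
    assert sorted(p) == list(range(len(x))), "p must be a permutation"
    return [x[i] for i in p]

# The 4-element example from the slide.
b = permute(["a0", "a1", "a2", "a3"], [2, 1, 0, 3])
print(b)  # ['a2', 'a1', 'a0', 'a3']

# Two stride-2 accesses rewritten as one explicit Permute.
a = list(range(10, 18))                        # stand-in data for a[0:7]
t = permute(a, [0, 2, 4, 6, 1, 3, 5, 7])
via_permute = [t[i] + t[i + 4] for i in range(4)]
via_strides = [x + y for x, y in zip(a[0:7:2], a[1:8:2])]
print(via_permute == via_strides)              # True
```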

  10. Overview of the Optimization Framework (Optimization)
     • Minimize Permute ops in a basic block
       - Based on two rules of Permute
     • An NP-complete problem
     • Propagation-based algorithm

  11. Two Important Rules on Permutations
     • Composition Rule: Permute(Permute(a[0:3:1], <1, 0, 3, 2>), <2, 1, 0, 3>) = Permute(a[0:3:1], <3, 0, 1, 2>)
     • Distributive Rule: Permute(a[0:3:1], <1, 0, 3, 2>) + Permute(b[0:3:1], <1, 0, 3, 2>) = Permute(a[0:3:1] + b[0:3:1], <1, 0, 3, 2>)
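
Both rules can be checked numerically. The sketch below assumes the gather reading of Permute (result[i] = x[P[i]]); `permute` and `compose` are hypothetical helper names, and the data values are arbitrary.

```python
def permute(x, p):
    """Gather form of Permute: result[i] = x[p[i]]."""
    return [x[i] for i in p]

def compose(inner, outer):
    """Single pattern equivalent to applying `inner` first, then `outer`."""
    return [inner[i] for i in outer]

a = [10, 20, 30, 40]
b = [1, 2, 3, 4]

# Composition rule: two nested Permutes collapse into one.
merged = compose([1, 0, 3, 2], [2, 1, 0, 3])
print(merged)  # [3, 0, 1, 2]
nested = permute(permute(a, [1, 0, 3, 2]), [2, 1, 0, 3])
single = permute(a, merged)

# Distributive rule: an element-wise op commutes with a shared Permute.
p = [1, 0, 3, 2]
lhs = [x + y for x, y in zip(permute(a, p), permute(b, p))]
rhs = permute([x + y for x, y in zip(a, b)], p)
print(nested == single, lhs == rhs)  # True True
```

The composition rule removes one permutation outright; the distributive rule trades two permutations for one by moving them past the addition.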

  12. Propagation-Based Optimization Algorithm
     • Overview: Propagating permutation to permutation
     • Step 1: Pick up an unvisited permutation statement
     • Step 2: Propagate the permutation from the definition to the uses
     • Step 3: If a use is a permutation, go to (a); otherwise go to (b)
       (a) Merge it with the propagated permutation pattern. Go to Step 1
       (b) Propagate the permutation from right-hand side to left-hand side. Go to Step 2
     Applied to the 16-statement form of the 8-point FFT example (permutations P1 to P8), the statements rewritten at successive stages are:
       t1[0:7] = T8’[0:7] * v1[0:7]; v2[0:7] = Permute(t1[0:7], P2’);
       u1[0:7] = Permute(t1[0:7], P3’);
       t3[0:7] = T4_2’[0:7] * u2[0:7]; u3[0:7] = Permute(t3[0:7], P6’); y[0:3] = u3[0:3] + u3[4:7]; y[4:7] = u3[0:3] - u3[4:7];
     Result:
       v1[0:3] = x[0:3] + x[4:7];
       v1[4:7] = x[0:3] - x[4:7];
       t1[0:7] = T8[0:7] * v1[0:7];
       u1[0:7] = Permute(t1[0:7], Q1);
       u2[0:3] = u1[0:3] + u1[4:7];
       u2[4:7] = u1[0:3] - u1[4:7];
       t3[0:7] = T4_2[0:7] * u2[0:7];
       u3[0:7] = Permute(t3[0:7], Q2);
       y[0:3] = u3[0:3] + u3[4:7];
       y[4:7] = u3[0:3] - u3[4:7];
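
One propagation step can be sketched as follows. This is a hedged illustration, not the authors' implementation: it assumes the gather reading of Permute, reads T8’ as the constant table T8 reordered by the inverse of P1 (one plausible interpretation of the primed name), and uses made-up patterns for P1 and P2, since the transcript does not give their values.

```python
def permute(x, p):
    """Gather form of Permute: result[i] = x[p[i]]."""
    return [x[i] for i in p]

def compose(inner, outer):
    """Single pattern equivalent to applying `inner` first, then `outer`."""
    return [inner[i] for i in outer]

def inverse(p):
    """Pattern that undoes p: permute(permute(x, p), inverse(p)) == x."""
    inv = [0] * len(p)
    for i, src in enumerate(p):
        inv[src] = i
    return inv

v1 = [1, 2, 3, 4, 5, 6, 7, 8]
T8 = [2, 3, 5, 7, 11, 13, 17, 19]      # stand-in constants
P1 = [0, 4, 1, 5, 2, 6, 3, 7]          # hypothetical pattern
P2 = [0, 2, 4, 6, 1, 3, 5, 7]          # hypothetical pattern

# Before: t0 = Permute(v1, P1); t1 = T8 * t0; v2 = Permute(t1, P2)
t0 = permute(v1, P1)
t1 = [c * x for c, x in zip(T8, t0)]
v2 = permute(t1, P2)

# After: push P1 past the multiply (step 3b), then merge it into P2 (step 3a).
T8_prime = permute(T8, inverse(P1))    # constants reordered once, offline
t1_new = [c * x for c, x in zip(T8_prime, v1)]
P2_prime = compose(P1, P2)
v2_new = permute(t1_new, P2_prime)

print(v2 == v2_new)   # True
print(P2_prime)       # [0, 1, 2, 3, 4, 5, 6, 7]
```

With these particular patterns the merged P2’ is the identity, so both Permute statements disappear; in general the merge leaves one permutation where there were two.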

  13. Propagating Permutations to Partial Uses
     b[0:3] and b[4:7] are two partial uses of b[0:7].
     A permutation that can be partitioned:
       b[0:7] = Permute(a[0:7], <3,2,1,0,7,6,5,4>); c[0:3] = b[0:3] + b[4:7];
     Partitioned form:
       b[0:3] = Permute(a[0:3], <3,2,1,0>); b[4:7] = Permute(a[4:7], <3,2,1,0>); c[0:3] = b[0:3] + b[4:7];
     Not all permutations can be partitioned and propagated to partial uses:
       b[0:7] = Permute(a[0:7], <0,4,1,5,2,6,3,7>); c[0:3] = b[0:3] + b[4:7];
     • Improvements over partial use boundary
       - Permutation decomposition
         Register-wise decomposition
         Shuffle instruction decomposition
       - Permutation reshaping
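
Whether a permutation can be split along partial-use boundaries is easy to test mechanically. The sketch below uses a deliberately narrow criterion, assumed here for illustration: each block of the result must read only from the same block of the source. The function name `split_by_block` is hypothetical.

```python
def split_by_block(p, width):
    """Return one local pattern per block if every width-sized block of the
    result reads only from the matching block of the source, else None."""
    blocks = []
    for start in range(0, len(p), width):
        chunk = p[start:start + width]
        if any(not (start <= src < start + width) for src in chunk):
            return None                      # an element crosses the boundary
        blocks.append([src - start for src in chunk])
    return blocks

print(split_by_block([3, 2, 1, 0, 7, 6, 5, 4], 4))  # [[3, 2, 1, 0], [3, 2, 1, 0]]
print(split_by_block([0, 4, 1, 5, 2, 6, 3, 7], 4))  # None
```

The first pattern splits into the two <3,2,1,0> permutations shown on the slide; the interleaving pattern mixes both halves of a into b[0:3], so it cannot be propagated to the partial uses without the decomposition or reshaping improvements the slide lists.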

  14. Optimization: Permutation Reshaping
     • For permutations used in commutative operations
     Before: b[0:7] = Permute(a[0:7], <0,5,2,7,4,1,6,3>); c[0:3] = b[0:3] + b[4:7];
       gives c = (a0+a4, a5+a1, a2+a6, a7+a3)
     After: b[0:7] = Permute(a[0:7], <0,1,2,3,4,5,6,7>); c[0:3] = b[0:3] + b[4:7];
       gives c = (a0+a4, a1+a5, a2+a6, a3+a7)
     Second example: b[05] = Permute(a[05], <0,11,10,9,8>);c[0:7] = b[0:7] + b[85];
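
The reshaping idea can be illustrated with a small heuristic: when b[i] and b[i+h] only ever meet in a commutative operation, the two source indices may be swapped freely, so each pair can be put in ascending order. This is an assumed simplification for illustration, not the procedure used in the talk; `reshape_for_commutative_add` is a made-up name.

```python
def permute(x, p):
    """Gather form of Permute: result[i] = x[p[i]]."""
    return [x[i] for i in p]

def reshape_for_commutative_add(p):
    """For c[i] = b[i] + b[i+h], order each (p[i], p[i+h]) pair ascending."""
    h = len(p) // 2
    q = list(p)
    for i in range(h):
        if q[i] > q[i + h]:
            q[i], q[i + h] = q[i + h], q[i]
    return q

def halves_sum(b):
    h = len(b) // 2
    return [b[i] + b[i + h] for i in range(h)]

a = [3, 1, 4, 1, 5, 9, 2, 6]
p = [0, 5, 2, 7, 4, 1, 6, 3]
q = reshape_for_commutative_add(p)
print(q)  # [0, 1, 2, 3, 4, 5, 6, 7]
print(halves_sum(permute(a, p)) == halves_sum(permute(a, q)))  # True
```

Here the reshaped pattern is the identity, so the Permute can be dropped entirely without changing c.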

  15. Overview of the Optimization Framework (Code Generation)
     • “Strip-mine” Permute to vperm inst.
     • Map vperm to native permutation inst.

  16. Generating Permutation Instructions (1)
     a[0:15] = Permute(b[0:15], <0,4,8,12,1,5,9,13,2,6,10,14,3,7,11,15>);
     [Figure: b[0:15] is mapped to a[0:15] through a network of vperm instructions; labeled patterns include <0,1,4,*>, <0,4,*,*>, and <0,1,2,4>, where * marks an empty slot]

  17. Generating Permutation Instructions (2)
     a[0:15] = Permute(b[0:15], <0,4,8,12,1,5,9,13,2,6,10,14,3,7,11,15>);
     • Two Steps:
       - Maximize empty slots when generating vperm instructions;
       - Fill empty slots with data elements that go to the same target;
     [Figure: b[0:15] is mapped to a[0:15] with eight vperm instructions; labeled patterns are <0,4,*,*>, <0,4,*,*>, <0,4,1,5>, <0,4,1,5>, <0,1,4,5>, and <2,3,6,7>]
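
The 16-element Permute on this slide is a 4x4 transpose, and it can be realized with eight two-input, four-wide vperm operations. The sketch below uses the textbook two-stage shuffle sequence rather than the exact sequence the slide derives; the patterns <0,4,1,5>, <0,1,4,5>, and <2,3,6,7> appear on the slide, while <2,6,3,7> and the software `vperm` model are assumptions added here.

```python
def vperm(v1, v2, pattern):
    """Select elements from the concatenation of two 4-wide registers."""
    both = v1 + v2
    return [both[i] for i in pattern]

b = list(range(16))                              # element k holds the value k
r0, r1, r2, r3 = (b[i:i + 4] for i in range(0, 16, 4))

# Stage 1: interleave pairs of registers.
t0 = vperm(r0, r1, [0, 4, 1, 5])                 # b0  b4  b1  b5
t1 = vperm(r2, r3, [0, 4, 1, 5])                 # b8  b12 b9  b13
t2 = vperm(r0, r1, [2, 6, 3, 7])                 # b2  b6  b3  b7
t3 = vperm(r2, r3, [2, 6, 3, 7])                 # b10 b14 b11 b15

# Stage 2: combine halves that go to the same target register.
a0 = vperm(t0, t1, [0, 1, 4, 5])                 # b0 b4 b8  b12
a1 = vperm(t0, t1, [2, 3, 6, 7])                 # b1 b5 b9  b13
a2 = vperm(t2, t3, [0, 1, 4, 5])                 # b2 b6 b10 b14
a3 = vperm(t2, t3, [2, 3, 6, 7])                 # b3 b7 b11 b15

a = a0 + a1 + a2 + a3
print(a == [0, 4, 8, 12, 1, 5, 9, 13, 2, 6, 10, 14, 3, 7, 11, 15])  # True
```

Each stage-1 result packs two elements bound for one target register next to two bound for another, which is the same intuition as the slide's "fill empty slots with data elements that go to the same target".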

  18. Experiment Setups
     • Two SIMD devices: VMX (AltiVec) & SSE2
     • Tested applications
       - Group I: Applications with relatively simple permutation patterns
         C-Saxpy: Complex version of saxpy (y = alpha*x + y)
         R-Color, C-Dot, R-FIR, ...
       - Group II: Applications with complicated permutation patterns
         FFT: Fast Fourier transform programs generated by the SPIRAL system
         WHT: Walsh-Hadamard transform routines generated by the SPIRAL system
         Bitonic sorting: One of the fastest sorting networks
       - Group III: Reorganization-only applications
         Matrix transpose
         Bit-reversal reordering

  19. Static Evaluation: # of Permutation Inst.

  20. Run-time Performance of FFT & Bitonic Sorting

  21. Overall Speedups

  22. Related Work
     • Optimizing permutation instructions introduced by misalignment
       - A. Eichenberger, P. Wu, K. O'Brien, Vectorization for SIMD architectures with alignment constraints, PLDI ’04
       - P. Wu, A. Eichenberger, A. Wang, Efficient SIMD Code Generation for Runtime Alignment and Length Conversion, CGO ’05
     • Efficient permutation instruction generation
       - A. Kudriavtsev, P. Kogge, Generation of permutations for SIMD processors, LCTES ’05
       - M. Narayanan, K. Yelick, Generating permutation instructions from a high-level description, MSP ’04
       - D. Nuzman, I. Rosen, A. Zaks, Auto-vectorization of interleaved data for SIMD, PLDI ’06
     • Similar idea, different applications
       - A. Solar-Lezama, R. Rabbah, R. Bodik, K. Ebcioglu, Programming by sketching for bit-streaming programs, PLDI ’05
       - S. Chatterjee, J. Gilbert, R. Schreiber, S. Teng, Automatic array alignment in data-parallel programs, POPL ’93
       - G. Hwang, J. K. Lee, D. Ju, An array operation synthesis scheme to optimize FORTRAN 90 programs, PPOPP ’95

  23. Conclusion
     • Reducing the overhead introduced by permutation instructions is a performance-critical problem for SIMD compilation
     • A unified framework is proposed to optimize data permutations
       - Putting all forms of data permutations into a unified representation
       - Propagating permutations across statements and merging them together
       - Generating efficient permutation instructions natively supported by devices
     • Experiments were conducted on different applications
       - Up to 77% of permutation instructions are eliminated
       - Average performance improves by 48% on VMX and 68% on SSE2
       - Near-peak overall speedups are achieved on some applications

  24. Thank You! June 2006
