Optimizing the Performance of Streaming Numerical Kernels on the IBM Blue Gene/P PowerPC 450
Tareq Malas
Advisors: Prof. David Keyes, Dr. Aron Ahmadia
Collaborators: Jed Brown, Dr. John Gunnels
King Abdullah University of Science and Technology
November 2011
Motivation
[Figures: 7-point stencil operator; 27-point stencil operator]
• PowerPC 450: representative of exascale architectures
  • Increased parallelism: vectorization and a multi-issue pipeline
  • Silicon and power savings: in-order execution
• Streaming numerical kernels:
  • At the heart of many scientific applications
  • A bottleneck in scientific codes
Why is tuning computation on the BG/P PowerPC 450 difficult?
    for (i = 0; i < N; i++)
        A[i] = B[i] + B[i+1];    /* not aligned: B[i] and B[i+1] cannot both start on a SIMD boundary */
• The processor uses features to improve efficiency:
  • SIMDized fused floating-point units
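The alignment problem in the loop above can be illustrated with a small Python model. This is only a sketch under stated assumptions: `aligned_pair` and `stencil_pairs` are hypothetical names, a "SIMD load" is modeled as a two-element read that must start on an even index, and `B` is assumed to be padded to at least `n + 2` elements. It is not the authors' implementation.

```python
def aligned_pair(B, k):
    """Model a 2-wide SIMD load: only even (aligned) start indices are allowed."""
    assert k % 2 == 0, "SIMD loads must start on an aligned (even) index"
    return (B[k], B[k + 1])


def stencil_pairs(B, n):
    """Compute A[i] = B[i] + B[i+1] for i < n, two results per step.

    Assumes n is even and len(B) >= n + 2 (padding for the last aligned load).
    """
    A = []
    lo = aligned_pair(B, 0)
    for i in range(0, n, 2):
        hi = aligned_pair(B, i + 2)      # next aligned pair
        # The pair (B[i+1], B[i+2]) is NOT aligned, so it has to be
        # assembled from two aligned pairs instead of loaded directly.
        shifted = (lo[1], hi[0])
        A.extend([lo[0] + shifted[0], lo[1] + shifted[1]])
        lo = hi
    return A


B = [float(k) for k in range(10)]
print(stencil_pairs(B, 8))
```

The extra work of building `shifted` is the cost that a naive scalar loop hides and that a SIMD implementation has to pay explicitly.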
Why is tuning computation on the BG/P PowerPC 450 difficult?
Program order: 1 load A, 2 add B, 3 load C, 4 load D, 5 add D, 6 add E, 7 add F
Reordered: 1 load A, 2 add B, 3 load C, 6 add E, 4 load D, 7 add F, 5 add D
• The processor uses features to improve efficiency:
  • SIMDized fused floating-point units
  • Superscalar processor with in-order execution at the core level
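Why the instruction order matters on an in-order, multi-issue core can be sketched with a toy model in Python. The model below is an assumption made for illustration, not a description of the real pipeline: each cycle issues the next instruction in program order and, if the following instruction uses the other unit and names a different operand, issues it in the same cycle. Latencies are ignored, and `issue_cycles`, `ORIGINAL`, and `REORDERED` are hypothetical names.

```python
def issue_cycles(program):
    """Count cycles on a toy in-order, dual-issue core.

    program: list of (op, operand) tuples, with op in {"load", "add"}.
    Two adjacent instructions share a cycle only if they use different
    units and name different operands (a stand-in for "no dependency").
    """
    cycle, i = 0, 0
    while i < len(program):
        op, operand = program[i]
        i += 1
        if i < len(program):
            next_op, next_operand = program[i]
            if next_op != op and next_operand != operand:
                i += 1  # the second instruction issues in the same cycle
        cycle += 1
    return cycle


ORIGINAL = [("load", "A"), ("add", "B"), ("load", "C"), ("load", "D"),
            ("add", "D"), ("add", "E"), ("add", "F")]
REORDERED = [("load", "A"), ("add", "B"), ("load", "C"), ("add", "E"),
             ("load", "D"), ("add", "F"), ("add", "D")]

print(issue_cycles(ORIGINAL), issue_cycles(REORDERED))
```

Under these assumptions the reordered sequence pairs a load with an add in almost every cycle, which an in-order core cannot arrange by itself.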
Engineering tactics
• Divide and conquer: 3-point stencil
  • Optimize, then replicate into larger stencils
• Design focus: computer architecture
  • Fully utilize SIMD capabilities
  • Reduce pipeline stalls: unroll-and-jam and instruction interleaving (reordering)
• Technique: assembly synthesis in Python
  • Accelerates prototyping
  • Simplifies source
3-point stencil SIMDization
r3 = a2*W0 + a3*W1 + a4*W2
[Figure: registers shown as Primary | Secondary pairs]
Utilizing the SIMD-like unit features:
• Regular SIMD
• Cross
• Copy-primary
• And more …
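As a reference for the formula above, the 3-point stencil can be written in Python both in scalar form and in a form that produces two outputs at a time, mirroring the primary and secondary halves of a register pair. This is a minimal sketch: the function names are hypothetical, the indexing generalizes `r3 = a2*W0 + a3*W1 + a4*W2` to `r[i] = a[i-1]*w[0] + a[i]*w[1] + a[i+1]*w[2]`, and the paired version assumes an even number of interior points.

```python
def stencil3(a, w):
    """Scalar reference: r[i] = a[i-1]*w[0] + a[i]*w[1] + a[i+1]*w[2]."""
    return [a[i - 1] * w[0] + a[i] * w[1] + a[i + 1] * w[2]
            for i in range(1, len(a) - 1)]


def stencil3_pairs(a, w):
    """Same result, computed two outputs at a time as (primary, secondary)."""
    out = []
    for i in range(1, len(a) - 2, 2):
        primary = secondary = 0.0
        for k in range(3):
            # One 2-wide multiply-add: both lanes use the same weight w[k],
            # but the secondary lane reads the input shifted by one element.
            primary += a[i - 1 + k] * w[k]
            secondary += a[i + k] * w[k]
        out.extend([primary, secondary])
    return out


a = [float(k * k) for k in range(8)]
w = (1.0, -2.0, 1.0)
print(stencil3_pairs(a, w))
```

The shifted read in the secondary lane is where instructions such as the cross and copy-primary variants listed above become useful.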
Mutate-mutate vs. load-copy
• Mutate-mutate
  • Fully utilizes the FPU
  • Requires fewer registers
• Load-copy
  • Requires fewer load cycles
Unroll-and-jam: reducing data hazards
Before (2 sources, 1 destination):
    A[0] += q*B[0]
    (stall)
    A[0] += p*B[1]
    (stall)
    A[0] += q*B[2]
    (stall)
    A[0] += p*B[3]
    ...
After (2 sources, 2 destinations):
    A[0] += q*B[0]
    A[1] += q*B[6]
    A[0] += p*B[1]
    A[1] += p*B[7]
    ...
Original loop:
    for (i = 0; i < 4; i++)
        for (j = 0; j < 5; j++)
            A[i] += q*B[i][j] + p*B[i][j+1];
Unrolled and jammed:
    for (i = 0; i < 4; i += 2)
        for (j = 0; j < 5; j++) {
            A[i]   += q*B[i][j]   + p*B[i][j+1];
            A[i+1] += q*B[i+1][j] + p*B[i+1][j+1];
        }
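The transformation can be checked with a direct Python transcription of the two loop nests. This is a sketch for verifying equivalence only; the function names are hypothetical, and Python itself gains nothing from the transformation, since the benefit comes from giving the hardware two independent accumulation chains to interleave.

```python
def kernel(B, q, p, rows=4, cols=5):
    """Original loop nest: one accumulator is updated at a time."""
    A = [0.0] * rows
    for i in range(rows):
        for j in range(cols):
            A[i] += q * B[i][j] + p * B[i][j + 1]
    return A


def kernel_unroll_and_jam(B, q, p, rows=4, cols=5):
    """Outer loop unrolled by 2 and the two inner loops fused (jammed).

    Assumes rows is even. A[i] and A[i+1] are independent accumulators,
    so their updates can be interleaved to hide dependency stalls.
    """
    A = [0.0] * rows
    for i in range(0, rows, 2):
        for j in range(cols):
            A[i] += q * B[i][j] + p * B[i][j + 1]
            A[i + 1] += q * B[i + 1][j] + p * B[i + 1][j + 1]
    return A


# 4 rows of 6 elements each, so B[i][j+1] is valid for j < 5.
B = [[float(6 * i + j) for j in range(6)] for i in range(4)]
print(kernel(B, 2.0, 3.0) == kernel_unroll_and_jam(B, 2.0, 3.0))
```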
Pythonic code synthesis: overview
[Diagram components]
• Python code
• Instructions (list of objects)
• Instruction scheduler and simulator
• PowerPC 450 simulator: GPR, FPR, memory
• Register allocation
• Simulation log and debugging information
• C code generator and C code template
• Documented C code
Pythonic code synthesis: instruction scheduling
• Goal:
  • Run load/store and FMA instructions each cycle
  • Reduce read-after-write (RAW) data dependency hazards
• Technique (greedy), per cycle:
  • Create a list of instructions with no RAW hazards
  • Execute the instruction(s) that will require the minimal stall
  • Repeat until all instructions are executed
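The greedy steps above can be sketched in Python as follows. This is an illustrative model, not the framework's actual code: the `Instr` class, the unit names `"ls"` and `"fpu"`, and the latencies are all assumptions, every destination register is assumed to be written exactly once, and an instruction is issued only when its stall has dropped to zero.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    name: str
    unit: str           # "ls" (load/store) or "fpu" (hypothetical unit names)
    dest: str
    srcs: tuple = ()
    latency: int = 1    # assumed cycles until dest is readable


def greedy_schedule(instrs):
    """Return [(cycle, name)], issuing at most one instruction per unit per cycle."""
    ready_at = {}               # register -> cycle at which its value is available
    pending = list(instrs)
    schedule, cycle = [], 0

    def stall(ins):
        latest = max((ready_at.get(s, 0) for s in ins.srcs), default=0)
        return max(0, latest - cycle)

    while pending:
        unwritten = {p.dest for p in pending}
        # Step 1: candidates have no RAW hazard on a not-yet-issued producer.
        candidates = [p for p in pending
                      if not any(s in unwritten for s in p.srcs)]
        issued = []
        for unit in ("ls", "fpu"):
            unit_candidates = [c for c in candidates if c.unit == unit]
            if unit_candidates:
                # Step 2: pick the candidate with the minimal stall.
                best = min(unit_candidates, key=stall)
                if stall(best) == 0:
                    issued.append(best)
        for ins in issued:
            pending.remove(ins)
            ready_at[ins.dest] = cycle + ins.latency
            schedule.append((cycle, ins.name))
        cycle += 1              # Step 3: repeat until nothing is pending
    return schedule


program = [
    Instr("l1", "ls", "a", latency=3),
    Instr("f1", "fpu", "x", ("a",)),
    Instr("l2", "ls", "b", latency=3),
    Instr("f2", "fpu", "y", ("b",)),
    Instr("f3", "fpu", "z", ("x", "y")),
]
print(greedy_schedule(program))
```

In this example the second load is hoisted ahead of the first FPU instruction, which keeps the load/store unit busy while the first loaded value is still in flight.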
Conclusion
• SIMDizing the computations of streaming numerical kernels is challenging
• Assembly programming is important for "peak" hardware utilization
• We introduced a code synthesis and simulation framework that facilitates:
  • A faster development-testing loop
  • Instruction reordering for improved efficiency
  • Cycle-accurate performance modeling