Improving Streaming Numerical Kernel Performance on IBM Blue Gene/P PowerPC 450

Optimizing the Performance of Streaming Numerical Kernels on the IBM Blue Gene/P PowerPC 450
Tareq Malas
Advisors: Prof. David Keyes, Dr. Aron Ahmadia
Collaborators: Jed Brown, Dr. John Gunnels
King Abdullah University of Science and Technology
November 2011

Motivation
• PowerPC 450: representative of exascale architectures
  • Increased parallelism: vectorization and multi-issue pipeline
  • Silicon and power savings: in-order execution
• Streaming numerical kernels:
  • At the heart of many scientific applications
  • Bottleneck in scientific codes
[Figures: 7-point stencil operator; 27-point stencil operator]

Why is tuning computation on the BG/P PowerPC 450 difficult?
• The processor uses features that improve efficiency but complicate tuning:
  • SIMDized fused floating point units
Example (operands not aligned):
    for (i = 0; i < N; i++)
        A[i] = B[i] + B[i+1];

Why is tuning computation on the BG/P PowerPC 450 difficult?
• The processor uses features that improve efficiency but complicate tuning:
  • SIMDized fused floating point units
  • Superscalar processor with in-order execution at the core level
Instruction order as written: 1 load A, 2 add B, 3 load C, 4 load D, 5 add D, 6 add E, 7 add F
Instruction order after reordering: 1 load A, 2 add B, 3 load C, 6 add E, 4 load D, 7 add F, 5 add D
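Why instruction order matters on an in-order core can be sketched with a toy cycle counter. This is a minimal model, not the simulator from the talk: the function name `in_order_cycles`, the single-issue assumption, and the latencies in `LATENCY` are all hypothetical choices made for illustration.

```python
def in_order_cycles(program, latency):
    """Count cycles needed to issue `program` on a single-issue in-order core.

    program: list of (op, dest, sources) tuples, issued strictly in order.
    latency: dict mapping op -> cycles until its result can be used.
    """
    ready = {}   # register -> first cycle at which its value is available
    cycle = 0    # earliest cycle at which the next instruction may issue
    for op, dest, sources in program:
        # In-order issue: stall until every source operand is ready.
        issue = max([cycle] + [ready.get(src, 0) for src in sources])
        ready[dest] = issue + latency[op]
        cycle = issue + 1
    return cycle


# Hypothetical latencies, chosen only for illustration.
LATENCY = {"load": 4, "add": 1}

# Each add directly follows the load it depends on, so it stalls.
original = [
    ("load", "d", []),
    ("add", "x", ["d"]),
    ("load", "e", []),
    ("add", "y", ["e"]),
]

# Hoisting the second load hides part of its latency behind the first.
reordered = [
    ("load", "d", []),
    ("load", "e", []),
    ("add", "x", ["d"]),
    ("add", "y", ["e"]),
]

print(in_order_cycles(original, LATENCY))   # 10
print(in_order_cycles(reordered, LATENCY))  # 6
```

Both sequences compute the same values; only the issue order differs, which is the effect the reordering on this slide exploits.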

Engineering tactics
• Divide and conquer: 3-point stencil
  • Optimize then replicate into larger stencils
• Design focus: computer architecture
  • Fully utilize SIMD capabilities
  • Reduce pipeline stalls: unroll-and-jam and instruction interleaving (reordering)
• Technique: assembly synthesis in Python
  • Accelerates prototyping
  • Simplifies source

3-point stencil SIMDization
r3 = a2*W0 + a3*W1 + a4*W2
Utilizing the SIMD-like unit features:
• Regular SIMD
• Cross
• Copy-primary
• And more …
[Figure: primary | secondary register halves for each operation]
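The formula r3 = a2*W0 + a3*W1 + a4*W2 generalizes to r[i] = a[i-1]*W0 + a[i]*W1 + a[i+1]*W2. The sketch below compares a scalar version with one that produces two neighbouring outputs per step, mimicking a primary and a secondary lane. It is plain Python, not PowerPC assembly, and it does not reproduce the cross or copy-primary instructions; the function names are hypothetical.

```python
def stencil3_scalar(a, w):
    """Reference: r[i] = a[i-1]*w[0] + a[i]*w[1] + a[i+1]*w[2] for interior i."""
    return [a[i - 1] * w[0] + a[i] * w[1] + a[i + 1] * w[2]
            for i in range(1, len(a) - 1)]


def stencil3_paired(a, w):
    """Same result, computed two outputs at a time (primary/secondary lanes)."""
    n_out = len(a) - 2
    r = [0.0] * n_out
    i = 1
    while i + 1 <= n_out:
        # One "SIMD" multiply-add per weight; each updates both lanes.
        pri, sec = a[i - 1] * w[0], a[i] * w[0]
        pri, sec = pri + a[i] * w[1], sec + a[i + 1] * w[1]
        pri, sec = pri + a[i + 1] * w[2], sec + a[i + 2] * w[2]
        r[i - 1], r[i] = pri, sec
        i += 2
    if i <= n_out:
        # Odd number of outputs: finish the last one with scalar code.
        r[i - 1] = a[i - 1] * w[0] + a[i] * w[1] + a[i + 1] * w[2]
    return r


a = [1, 2, 3, 4, 5, 6, 7]
w = (1, 10, 100)
print(stencil3_paired(a, w))  # [321, 432, 543, 654, 765]
```

Note that the secondary lane reads inputs shifted by one element relative to the primary lane, which is why plain aligned SIMD loads are not enough for this kernel.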

Mutate-mutate vs. load-copy
• Mutate-mutate
  • Fully utilizes the FPU
  • Requires fewer registers
• Load-copy
  • Requires fewer load cycles

Unroll-and-jam: reducing data hazards
Without unroll-and-jam (2 sources, 1 destination), every update waits on the previous one:
    A[0] += q*B[0]
    (stall)
    A[0] += p*B[1]
    (stall)
    A[0] += q*B[2]
    (stall)
    A[0] += p*B[3]
    ...
With unroll-and-jam (2 sources, 2 destinations), independent updates are interleaved:
    A[0] += q*B[0]
    A[1] += q*B[6]
    A[0] += p*B[1]
    A[1] += p*B[7]
    ...
Original loop nest:
    for (i = 0; i < 4; i++)
        for (j = 0; j < 5; j++)
            A[i] += q*B[i][j] + p*B[i][j+1];
After unroll-and-jam by a factor of 2:
    for (i = 0; i < 4; i += 2)
        for (j = 0; j < 5; j++) {
            A[i]   += q*B[i][j]   + p*B[i][j+1];
            A[i+1] += q*B[i+1][j] + p*B[i+1][j+1];
        }
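That the transformed loop nest computes the same result can be checked with a direct Python transcription of both versions. This is only an equivalence check, assuming an even number of rows; the function names are hypothetical and Python itself gains no performance from the transformation.

```python
def row_sums(B, q, p):
    """Original loop nest: one destination A[i] updated per inner iteration."""
    A = [0.0] * len(B)
    for i in range(len(B)):
        for j in range(len(B[i]) - 1):
            A[i] += q * B[i][j] + p * B[i][j + 1]
    return A


def row_sums_unroll_and_jam(B, q, p):
    """Outer loop unrolled by 2 and jammed: two independent destinations."""
    assert len(B) % 2 == 0, "this sketch assumes an even number of rows"
    A = [0.0] * len(B)
    for i in range(0, len(B), 2):
        for j in range(len(B[i]) - 1):
            # The two updates do not depend on each other, so on hardware
            # one can proceed while the other waits for its result.
            A[i] += q * B[i][j] + p * B[i][j + 1]
            A[i + 1] += q * B[i + 1][j] + p * B[i + 1][j + 1]
    return A


# 4 rows of 6 values; row i, column j holds 6*i + j, so row 1 starts at
# flattened index 6 as in the slide.
B = [[float(6 * i + j) for j in range(6)] for i in range(4)]
print(row_sums(B, 2.0, 3.0) == row_sums_unroll_and_jam(B, 2.0, 3.0))  # True
```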

Unroll-and-jam: data reuse

Pythonic code synthesis: overview
[Diagram components:]
• Python code
• Register allocation
• Instructions (list of objects)
• Instruction scheduler and simulator
• PowerPC 450 simulator: GPR, FPR, memory
• Simulation log and debugging information
• C code template
• C code generator
• Documented C code
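The path from a list of instruction objects through a C template to documented C code can be sketched as follows. Everything here is a hypothetical illustration, not the framework from the talk: the `Instruction` class, the template text, and the operand strings are invented, the mnemonics are treated as opaque strings that are not checked against the ISA, and a real inline-assembly block would also need operand constraints and clobber lists.

```python
from dataclasses import dataclass
from string import Template


@dataclass
class Instruction:
    mnemonic: str     # opaque assembly mnemonic (placeholder)
    operands: tuple   # operand strings, emitted verbatim
    comment: str      # documentation carried into the generated C

    def to_c(self):
        """Render as one line of an inline-assembly string with a C comment."""
        text = f"{self.mnemonic} {', '.join(self.operands)}"
        return f'        "{text}\\n\\t"  /* {self.comment} */'


# Hypothetical C template; $body is replaced by the rendered instructions.
C_TEMPLATE = Template(
    "void kernel(double *a, double *r) {\n"
    "    __asm__ volatile(\n"
    "$body\n"
    "    );\n"
    "}\n"
)


def generate_c(instructions):
    """Fill the C template with one documented line per instruction."""
    body = "\n".join(ins.to_c() for ins in instructions)
    return C_TEMPLATE.substitute(body=body)


program = [
    Instruction("lfpdx", ("0", "3", "4"), "load a pair of inputs"),
    Instruction("fpmadd", ("1", "0", "2", "1"), "accumulate weighted inputs"),
]
print(generate_c(program))
```

Keeping instructions as objects until the last step is what lets the same list be reordered or simulated before any C text is produced.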

Pythonic code synthesis: instruction scheduling
• Goal:
  • Run load/store and FMA instructions each cycle
  • Reduce read-after-write (RAW) data dependency hazards
• Technique (greedy), per cycle:
  • Create a list of instructions with no RAW hazards
  • Execute the instruction(s) that will require the minimal stall
  • Repeat until all instructions are executed
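The greedy per-cycle technique can be sketched as a small list scheduler. This is a simplified stand-in for the scheduler in the talk, under several assumptions: one load/store slot and one FPU slot per cycle, every register written at most once (so only RAW hazards arise), hypothetical latencies, and "minimal stall" approximated by issuing the first hazard-free instruction for each unit and otherwise waiting a cycle.

```python
def greedy_schedule(instrs, latency):
    """Return a list of cycles, each a list of instruction names issued.

    instrs: dicts with keys name, unit ("ls" or "fpu"), dest, srcs.
    latency: dict mapping unit -> cycles until the result is usable.
    """
    written = {ins["dest"] for ins in instrs if ins["dest"] is not None}
    ready_at = {}            # register -> cycle at which its value is usable
    pending = list(instrs)   # not yet issued, kept in program order
    schedule = []
    cycle = 0

    def operand_ready(reg):
        if reg not in written:
            return True      # live-in value, available from the start
        return reg in ready_at and ready_at[reg] <= cycle

    while pending:
        slot = []
        for unit in ("ls", "fpu"):
            # Pick the first pending instruction for this unit with no RAW hazard.
            for ins in pending:
                if ins["unit"] == unit and all(operand_ready(s) for s in ins["srcs"]):
                    slot.append(ins)
                    break
        for ins in slot:
            pending.remove(ins)
            if ins["dest"] is not None:
                ready_at[ins["dest"]] = cycle + latency[ins["unit"]]
        schedule.append([ins["name"] for ins in slot])  # empty list = stall
        cycle += 1
    return schedule


LATENCY = {"ls": 4, "fpu": 5}   # hypothetical values
PROGRAM = [
    {"name": "load a", "unit": "ls", "dest": "a", "srcs": []},
    {"name": "fma x", "unit": "fpu", "dest": "x", "srcs": ["a", "w"]},
    {"name": "load b", "unit": "ls", "dest": "b", "srcs": []},
    {"name": "fma y", "unit": "fpu", "dest": "y", "srcs": ["b", "w"]},
]
for cycle, names in enumerate(greedy_schedule(PROGRAM, LATENCY)):
    print(cycle, names)
```

In this example the second load is hoisted above the first FMA, so the two load latencies overlap instead of adding up.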

Unroll-and-jam effects: 27-point stencil

Kernel and L2 effects: 7-point stencil

Unroll-and-jam effects: 3-point stencil

Instruction scheduling optimization formulation

Conclusion
• SIMDizing the computations of streaming numerical kernels is challenging
• Assembly programming is important for “peak” hardware utilization
• We introduced a code synthesis and simulation framework that facilitates:
  • A faster development-testing loop
  • Instruction reordering for improved efficiency
  • Cycle-accurate performance modeling
