Online Adaptive Code Generation and Tuning

Jeffrey K. Hollingsworth and Ananta Tiwari

Why Automated Performance Tuning?
• Software parameters impact performance.
• Optimal parameter values vary and are unpredictable.
• Parameters come from:
  • User code
  • Libraries
  • Compiler choices
• For long-running programs at scale, the performance of the machine will change during one “execution”.
Automated parameter tuning can be used for adaptive tuning in complex software.

The framework – Active Harmony
• Search-based, feedback-driven empirical optimization
• Provides a mechanism for applications to become adaptable by exporting tuning options
• Monitors program performance and suggests adaptation decisions
• Decisions are made by a central controller

Online tuning challenges & requirements
• Minimal overhead
• Avoid “bad” regions
• Precise specification of the search space – Constraint Specification Language (CSL)
  • Express relationships between tunable parameters

Parallel Rank Ordering Algorithm
• All but the best point of the simplex move
• Computations can be done in parallel
• N parallel evaluations for an (N+1)-point simplex
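The simplex update described above can be sketched in a few lines. This is a simplified illustration, not Active Harmony's implementation: it assumes a reflect-or-shrink rule (every non-best point is reflected through the best point, and the simplex shrinks toward the best point if no reflection improves on it) and omits any expansion check. The names `pro_step`, `tune`, and `bowl` are hypothetical, and `bowl` is a stand-in for a real performance measurement.

```python
from concurrent.futures import ThreadPoolExecutor


def pro_step(simplex, cost, pool):
    """One simplified rank-ordering step on a list of points (lists of floats)."""
    # Rank the simplex so that the first point is the best (lowest cost).
    # A real tuner would reuse cached measurements instead of re-evaluating.
    ranked = sorted(simplex, key=cost)
    best, rest = ranked[0], ranked[1:]

    # Reflect every non-best point through the best point: r = 2*best - v.
    reflected = [[2 * b - x for b, x in zip(best, v)] for v in rest]
    # The N reflected points are independent, so evaluate them in parallel.
    reflected_costs = list(pool.map(cost, reflected))

    if min(reflected_costs) < cost(best):
        # At least one reflection beats the current best: accept them all.
        return [best] + reflected
    # Otherwise shrink every non-best point halfway toward the best point.
    return [best] + [[(b + x) / 2 for b, x in zip(best, v)] for v in rest]


def tune(simplex, cost, steps=60):
    """Run a fixed number of steps and return the best point found."""
    with ThreadPoolExecutor(max_workers=len(simplex) - 1) as pool:
        for _ in range(steps):
            simplex = pro_step(simplex, cost, pool)
    return min(simplex, key=cost)


def bowl(p):
    """Hypothetical cost surface with its minimum at (3, -1)."""
    return (p[0] - 3.0) ** 2 + (p[1] + 1.0) ** 2


best_point = tune([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]], bowl)
```

With a 3-point simplex in two dimensions, each step issues two independent evaluations, matching the "N parallel evaluations for an (N+1)-point simplex" bullet.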

Parameter tuning algorithm
• Initial simplex construction
• User-guided (via CSL construct)
• Exploratory iterations
• Constraining to allowable regions – penalization technique
  • Add a penalty factor to configurations that violate constraints
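The penalization idea can be sketched as a wrapper around the measured cost. This is a minimal sketch under assumptions: the slides do not give the penalty formula, so the version below simply returns a large fixed penalty per violated constraint and skips the measurement for infeasible configurations. The constraint names, the parameter names (`tile_i`, `tile_j`, `unroll`), and `fake_runtime` are all hypothetical.

```python
def penalized_cost(config, measure, constraints, penalty=1.0e6):
    """Return the measured cost, or a large penalty if constraints are violated."""
    violated = [name for name, holds in constraints.items() if not holds(config)]
    if violated:
        # Infeasible point: do not run it, just steer the search away.
        return penalty * len(violated)
    return measure(config)


# Hypothetical constraints relating tunable parameters to each other.
constraints = {
    "tile fits in cache": lambda c: c["tile_i"] * c["tile_j"] <= 4096,
    "unroll divides tile": lambda c: c["tile_i"] % c["unroll"] == 0,
}


def fake_runtime(c):
    """Stand-in for timing a real code variant."""
    return 100.0 / c["unroll"] + abs(c["tile_i"] - 32)


feasible = {"tile_i": 32, "tile_j": 64, "unroll": 4}
infeasible = {"tile_i": 128, "tile_j": 128, "unroll": 3}

feasible_cost = penalized_cost(feasible, fake_runtime, constraints)
infeasible_cost = penalized_cost(infeasible, fake_runtime, constraints)
```

Because the penalty dominates any realistic measurement, a simplex search ranks infeasible points last and moves back toward the allowable region.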

System design overview
• Run-time management of:
  • Cost of search
  • Generating and compiling a set of code variants
  • Multiple code sections to tune
• Code-generation utility
  • Allows selection of external code-generation tools
• Overhead reduction
  • Non-blocking relationship between code generation and application execution

System design
[Figure: Along the Harmony timeline, Active Harmony sends code transformation parameters for the outlined code section to the code server at each search step (SS1, SS2, …, SSN). Code generation tools produce variants v1s, v2s, …, vNs, which compilers turn into v1s.so, v2s.so, …, vNs.so, followed by a READY signal. The application execution timeline shows the stall_phase and the performance measurements (PM1, PM2, …, PMN) returned to Active Harmony.]
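The non-blocking relationship between code generation and application execution can be illustrated with a small simulation. This is only a sketch under assumptions: a background thread stands in for the code server, `time.sleep` stands in for generation and compilation, and the application keeps running its current variant until the new one is ready. The names `generate_variant` and `run_application` are hypothetical and do not correspond to the Active Harmony API.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def generate_variant(params):
    """Stand-in for the code server: generate and compile one variant."""
    time.sleep(0.05)  # pretend that code generation and compilation take a while
    name = "unroll%d" % params["unroll"]
    return name, lambda data: sum(data)


def run_application(pool, steps_after_swap=2):
    """Keep executing the current variant while a new one is being built."""
    current = ("baseline", lambda data: sum(data))
    pending = pool.submit(generate_variant, {"unroll": 4})
    history = []
    remaining = steps_after_swap
    while remaining > 0:
        if pending is not None and pending.done():
            current = pending.result()  # READY: switch to the new variant
            pending = None
        name, kernel = current
        kernel(range(1000))  # one application time step
        history.append(name)
        if pending is None:
            remaining -= 1
    return history


with ThreadPoolExecutor(max_workers=1) as pool:
    history = run_application(pool)
```

The application never waits on the compile: `history` holds zero or more `baseline` steps followed by the steps that used the new variant.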

Code generation utility – Utah’s CHiLL
• Polyhedral representation of loop nests
  • Built upon Omega Library Plus
• CHiLL features
  • Provides a rich set of loop transformations
  • High-level script interface that lets programmers and compilers describe a set of code transformations
• Recipe library
  • Recipes can be evolved and reused over time

Experimental Results
• Two platforms
  • umd-cluster (64 nodes, dual-core Intel Xeon), Myrinet interconnect
  • Carver (1120 compute nodes, Intel Nehalem, two quad-core processors per node), InfiniBand interconnect
• Code servers
  • umd-cluster: local idle machines
  • Carver: outsourced to a machine at UMD
• Three programs
  • SMG2000
  • Poisson’s equation solver (PES)
  • PMLB

SMG2000 benchmark
• Semi-coarsening multigrid on structured grids
• Residual computation contains a sparse matrix-vector multiply bottleneck, expressed as a 4-deep loop nest
• Key computation identified by HPCToolkit and outlined by the ROSE compiler

    for si = 0 to NS-1
      for k = 0 to NZ-1
        for j = 0 to NY-1
          for i = 0 to NX-1
            r[i + j*JR + k*KR] -= A[i + j*JA + k*KA + SA[si]] * x[i + j*JX + k*KX + Sx[si]]

This loop nest accounts for 46% of execution time.
Part of the PERI auto-tuning effort.
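To make the transformation concrete, the sketch below writes the residual loop nest in Python and then tiles the j and i loops. It is an illustration only: the real variants are C code produced by the code generation tools, and the problem sizes, strides, stencil offsets, and tile sizes `TJ` and `TI` used here are hypothetical values chosen so that the example runs and both versions can be compared.

```python
# Hypothetical problem sizes and array layout (not taken from SMG2000).
NX, NY, NZ, NS = 6, 5, 4, 3
N = NX * NY * NZ
JR = JA = JX = NX           # stride between consecutive j
KR = KA = KX = NX * NY      # stride between consecutive k
SA = [si * N for si in range(NS)]   # one coefficient block per stencil entry
Sx = [0, 1, NX]                     # stencil offsets into x


def make_inputs():
    """Build small integer-valued inputs so results compare exactly."""
    r = [1.0] * N
    A = [float(t % 7) for t in range(NS * N)]
    x = [float(t % 5) for t in range(N + max(Sx))]
    return r, A, x


def residual_reference(r, A, x):
    """The original 4-deep loop nest."""
    for si in range(NS):
        for k in range(NZ):
            for j in range(NY):
                for i in range(NX):
                    r[i + j*JR + k*KR] -= (A[i + j*JA + k*KA + SA[si]]
                                           * x[i + j*JX + k*KX + Sx[si]])


def residual_tiled(r, A, x, TJ, TI):
    """Same computation with the j and i loops tiled by TJ and TI."""
    for si in range(NS):
        for k in range(NZ):
            for jj in range(0, NY, TJ):
                for ii in range(0, NX, TI):
                    # min() handles tiles that run past the loop bounds.
                    for j in range(jj, min(jj + TJ, NY)):
                        for i in range(ii, min(ii + TI, NX)):
                            r[i + j*JR + k*KR] -= (A[i + j*JA + k*KA + SA[si]]
                                                   * x[i + j*JX + k*KX + Sx[si]])


r_ref, A, x = make_inputs()
residual_reference(r_ref, A, x)
r_tiled, _, _ = make_inputs()
residual_tiled(r_tiled, A, x, TJ=2, TI=4)
```

Tiling changes only the order in which (j, i) pairs are visited, so both versions produce the same `r`. In a tuning run, `TJ` and `TI` would be among the parameters the search explores.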

SMG2000 results
• Online tuning for the smg_residual function
• Tiling and unrolling
• Up to 1.4x speedup within a single run

Poisson’s equation solver (PES)
• Uses the red-black successive over-relaxation method
• Optimization method
  • Relaxation function: 7-point stencil
    • Triply nested loop: tiling the outermost two loops
  • Error function: sweeps through the local grid to calculate the L2-norm
    • Triply nested loop: tiling all loops and unrolling the innermost
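Unrolling the innermost loop of the error function can be illustrated on a flattened version of the sweep. The sketch below is an assumption-laden simplification: it treats the local grid as a one-dimensional list of error values, uses a fixed unroll factor of 4, and adds a cleanup loop for leftover elements. The function names are hypothetical, and the actual tuned code is generated C.

```python
import math


def l2_norm_reference(err):
    """Straightforward sweep: square root of the sum of squares."""
    total = 0.0
    for e in err:
        total += e * e
    return math.sqrt(total)


def l2_norm_unrolled4(err):
    """Same sweep with the loop body unrolled by a factor of 4."""
    total = 0.0
    n = len(err)
    main = n - n % 4  # largest multiple of 4 that fits
    for i in range(0, main, 4):
        total += err[i] * err[i]
        total += err[i + 1] * err[i + 1]
        total += err[i + 2] * err[i + 2]
        total += err[i + 3] * err[i + 3]
    # Cleanup loop for the elements that do not fill a full group of 4.
    for i in range(main, n):
        total += err[i] * err[i]
    return math.sqrt(total)


errors = [0.5 * t for t in range(10)]
norm_ref = l2_norm_reference(errors)
norm_unrolled = l2_norm_unrolled4(errors)
```

The additions happen in the same order in both versions, so the results match exactly. The unroll factor is the kind of parameter the tuner would vary.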

Parallel Multi-block Lattice Boltzmann (PMLB)
• Lattice Boltzmann method
  • Widely used method to study fluid dynamic systems
• Six kernels
  • Initialization, collision, communication, streaming and physical
• Streaming kernel accounts for more than 75% of computation time
  • Consists of five triply nested loops
  • Loop-nest outlining
  • Lots of memory copy operations

PMLB Optimization
• Two phases of tuning
  • Loop fusion
    • First few iterations of the application evaluate different possibilities
  • Tiling and unrolling
• Comparison of the original untuned version (compiled with -O3) to:
  • Harmonized application
  • Post-harmony runs

Code server sensitivity
• How many parallel code servers are needed to ensure that the stall_phase is not dominant?
• Control parameters: problem size (1024^3), number of processors (128)
• Up to 128 new variants are generated at each search step
• A typical setup for the remaining experiments presented in this talk
• Future plans for a more robust study
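A back-of-the-envelope model helps frame the question of how many code servers are enough. The sketch below is not from the talk. It assumes every variant takes the same time to generate and compile, each server builds one variant at a time, and the application can overlap a fixed amount of useful work with code generation. The timing values (5 s per variant, 30 s of overlap) are hypothetical; only the count of 128 variants per search step comes from the slide.

```python
import math


def stall_seconds(variants, servers, compile_s, overlap_s):
    """Time the application would wait for a full set of variants."""
    rounds = math.ceil(variants / servers)  # batches of parallel builds
    return max(0.0, rounds * compile_s - overlap_s)


def min_servers_without_stall(variants, compile_s, overlap_s):
    """Smallest server count for which the model predicts no stall."""
    for servers in range(1, variants + 1):
        if stall_seconds(variants, servers, compile_s, overlap_s) == 0.0:
            return servers
    return None  # even one server per variant is not enough


stall_with_8 = stall_seconds(128, 8, 5.0, 30.0)
stall_with_32 = stall_seconds(128, 32, 5.0, 30.0)
needed = min_servers_without_stall(128, 5.0, 30.0)
```

Under these made-up timings, 8 servers leave a 50 s stall per search step, and the stall disappears once 22 servers are available. A measured study would replace both timing constants with observed values.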

Search evolution

PMLB (Carver Results)
• The harmonized application runs, on average, 1.14 times faster than the original.
• Best speedup for all PMLB runs on Carver: 1.48.
• Post-harmony runs are, on average, 1.37 times faster.

Cross-platform studies
• Study how parameters differ for the two systems
• Take Harmony-suggested parameters from one system and do a post-harmony run on the other

Future work
• Exploit the spatial locality of PRO
  • A priori generation of code variants
• Retire “unreachable” code variants
  • Reachability measured in terms of the number of search steps
• Run experiments on larger processor counts
  • Hopper (Cray XT5 machine at NERSC)
• Auto-tuning for larger applications
  • PFLOTRAN (collaboration with PERI folks)
