Online Adaptive Code Generation and Tuning Jeffrey K. Hollingsworth and Ananta Tiwari
Why Automated Performance Tuning? • Software parameters affect performance. • Optimal parameter values vary and are unpredictable. • Parameters come from: • User code • Libraries • Compiler choices • For long-running programs at scale: • Performance of the machine will change during one “execution” Automated parameter tuning can be used for adaptive tuning in complex software.
The framework – Active Harmony • Search-based, feedback-driven empirical optimization • Provides a mechanism for applications to become adaptable by exporting tuning options • Monitors the program performance and suggests adaptation decisions • Decisions made by a central controller
Online tuning challenges & requirements • Minimal overhead • Avoid “bad” regions • Precise specification of search space – Constraint Specification Language (CSL) • Express relationships between tunable parameters
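The kind of relationship a constraint specification captures can be illustrated in plain Python. This is not CSL syntax; the parameter names (tile_i, tile_j, unroll), their value ranges, and both constraints are assumptions made up for illustration.

```python
from itertools import product

# Hypothetical tunable parameters and candidate values (illustrative only).
SPACE = {
    "tile_i": [8, 16, 32, 64],
    "tile_j": [8, 16, 32, 64],
    "unroll": [1, 2, 4, 8, 16],
}

def is_allowed(cfg):
    """Assumed example constraints that relate the parameters to each other."""
    # The unroll factor must evenly divide the innermost tile size.
    divides = cfg["tile_i"] % cfg["unroll"] == 0
    # The tile footprint must stay within an assumed budget of 1024 elements.
    fits = cfg["tile_i"] * cfg["tile_j"] <= 1024
    return divides and fits

def allowed_configs(space):
    """Enumerate only the configurations that satisfy every constraint."""
    names = list(space)
    for values in product(*(space[name] for name in names)):
        cfg = dict(zip(names, values))
        if is_allowed(cfg):
            yield cfg

configs = list(allowed_configs(SPACE))
total = 1
for values in SPACE.values():
    total *= len(values)
print(len(configs), "of", total, "configurations satisfy the constraints")
```

Stating the relationships up front means the search never has to measure a configuration that could not have been valid.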
Parallel Rank Ordering Algorithm • All points of the simplex except the best one move • Computations can be done in parallel • N parallel evaluations for an (N+1)-point simplex
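A simplified sketch of one such step is given below. The reflect, expand, and shrink coefficients and the acceptance rule are assumptions chosen for illustration, not necessarily the exact rules used in Active Harmony. The point to notice is that every non-best point moves relative to the best one, so the N new points can be evaluated independently.

```python
def pro_step(simplex, f):
    """One simplified reflect/expand/shrink step; returns the new simplex.

    simplex: list of points (tuples of numbers); f: objective, lower is better.
    """
    simplex = sorted(simplex, key=f)
    best, rest = simplex[0], simplex[1:]

    def move(point, coeff):
        # best + coeff * (best - point), applied per coordinate
        return tuple(b + coeff * (b - p) for b, p in zip(best, point))

    # Reflect every non-best point through the best point: 2*best - p.
    # These N evaluations are independent and could run in parallel.
    reflected = [move(p, 1.0) for p in rest]
    if min(f(p) for p in reflected) < f(best):
        # Reflection helped, so try going further: 3*best - 2*p.
        expanded = [move(p, 2.0) for p in rest]
        if min(f(p) for p in expanded) < min(f(p) for p in reflected):
            return [best] + expanded
        return [best] + reflected
    # Reflection did not help: shrink toward the best point, 0.5*(best + p).
    return [best] + [move(p, -0.5) for p in rest]
```

Because the best point is always kept, the best objective value in the simplex never gets worse from one step to the next.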
Parameter tuning algorithm • Initial simplex construction • User-guided (via CSL construct) • Exploratory iterations • Constraining to allowable regions – penalization technique • Add penalty factor to the configurations that violate constraints
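The penalization idea can be sketched as a wrapper around the performance measurement. The penalty constant, the violation measure, and the cost model below are all hypothetical stand-ins; the actual penalty used by the framework is not specified here.

```python
def make_penalized_objective(measure, violation, penalty=1.0e6):
    """Wrap a performance measurement so infeasible points look expensive.

    measure(cfg)   -> measured run time of a feasible configuration
    violation(cfg) -> 0 if cfg satisfies the constraints, else a positive amount
    penalty        -> assumed constant, larger than any real measurement
    """
    def objective(cfg):
        amount = violation(cfg)
        if amount == 0:
            return measure(cfg)
        # Infeasible: skip the run and return a cost that grows with the
        # size of the violation, steering the search back to allowed regions.
        return penalty * (1.0 + amount)
    return objective

# Hypothetical constraint: the unroll factor may not exceed the tile size.
def violation(cfg):
    return max(0, cfg["unroll"] - cfg["tile"])

# Stand-in for a real timing run (made-up cost model).
def measure(cfg):
    return 100.0 / cfg["tile"] + 1.0 / cfg["unroll"]

objective = make_penalized_objective(measure, violation)
print(objective({"tile": 8, "unroll": 4}))   # feasible: measured cost
print(objective({"tile": 4, "unroll": 8}))   # infeasible: penalized cost
```

Scaling the penalty with the amount of violation gives the search a slope to follow back toward the allowed region, rather than a flat plateau of equally bad values.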
System design overview • Run-time management of: • Cost of search • Generating and compiling a set of code-variants • Multiple code-sections to tune • Code-generation utility • Allow selection of external code generation tools • Overhead reduction • Non-blocking relationship between code-generation and application execution
System design [Figure: system timeline diagram. Active Harmony issues code transformation parameters for the outlined code-section at each search step (SS1, SS2, …, SSN). The code server runs code generation tools and compilers to turn variants v1s, v2s, …, vNs into shared objects v1s.so, v2s.so, …, vNs.so, then raises a READY signal. The application execution timeline shows a stall_phase followed by performance measurements (PM1, PM2, …, PMN) on the Harmony timeline.]
Code generation utility – Utah’s CHiLL • Polyhedral representation of loop-nests • Built Upon Omega Library Plus • CHiLL features • Provides a rich set of loop transformations • High-Level script interface to allow programmers and compilers to describe a set of code transformations • Recipe Library • Recipes can be evolved and reused over time
Experimental Results • Two platforms • umd-cluster (64 nodes, Intel Xeon dual-core nodes) – Myrinet interconnect • Carver (1120 compute nodes, Intel Nehalem, two quad-core processors) – InfiniBand interconnect • Code servers • umd-cluster – local idle machines • Carver – outsourced to a machine at UMD • Three programs • SMG2000 • Poisson’s equation solver (PES) • PMLB
SMG2000 benchmark • Semi-coarsening multigrid on structured grids • Residual computation contains sparse matrix-vector multiply bottleneck, expressed in 4-deep loop nest • Key computation identified by HPCToolkit and outlined by ROSE Compiler
    for si = 0 to NS-1
      for k = 0 to NZ-1
        for j = 0 to NY-1
          for i = 0 to NX-1
            r[i + j*JR + k*KR] -= A[i + j*JA + k*KA + SA[si]] * x[i + j*JX + k*KX + Sx[si]]
46% of execution time • Part of PERI auto-tuning effort.
SMG2000 results • Online tuning for smg_residual function • Tiling and unrolling • Up to 1.4x speedup within a single run
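To make the tiling and unrolling concrete, the sketch below transcribes the SMG2000 residual loop nest into Python and pairs it with one tiled, unrolled variant. The tile sizes, the unroll factor of 2, the array layout, and the test data are assumptions for illustration; the variants actually generated in the experiments come from the code generation tools, not from this code.

```python
def residual_reference(r, A, x, SA, Sx, dims, strides):
    """Direct transcription of the 4-deep residual loop nest."""
    NS, NZ, NY, NX = dims
    JR, KR, JA, KA, JX, KX = strides
    for si in range(NS):
        for k in range(NZ):
            for j in range(NY):
                for i in range(NX):
                    r[i + j*JR + k*KR] -= (A[i + j*JA + k*KA + SA[si]]
                                           * x[i + j*JX + k*KX + Sx[si]])

def residual_tiled(r, A, x, SA, Sx, dims, strides, tj=2, ti=4):
    """Same computation with the j and i loops tiled and i unrolled by 2."""
    NS, NZ, NY, NX = dims
    JR, KR, JA, KA, JX, KX = strides
    for si in range(NS):
        for k in range(NZ):
            for jj in range(0, NY, tj):          # tile loop over j
                for ii in range(0, NX, ti):      # tile loop over i
                    i_end = min(ii + ti, NX)
                    for j in range(jj, min(jj + tj, NY)):
                        rb = j*JR + k*KR
                        ab = j*JA + k*KA + SA[si]
                        xb = j*JX + k*KX + Sx[si]
                        i = ii
                        while i + 1 < i_end:     # body unrolled by 2
                            r[i + rb] -= A[i + ab] * x[i + xb]
                            r[i + 1 + rb] -= A[i + 1 + ab] * x[i + 1 + xb]
                            i += 2
                        if i < i_end:            # remainder iteration
                            r[i + rb] -= A[i + ab] * x[i + xb]

# Tiny made-up problem; sizes chosen so tiles and unrolling leave remainders.
NS, NZ, NY, NX = 3, 2, 3, 5
n = NX * NY * NZ
dims = (NS, NZ, NY, NX)
strides = (NX, NX * NY, NX, NX * NY, NX, NX * NY)
SA = [s * n for s in range(NS)]   # one coefficient block per stencil entry
Sx = [0, 1, 2]                    # assumed stencil offsets into x
A = [(7 * t) % 11 for t in range(NS * n)]
x = [(3 * t) % 5 for t in range(n + 2)]
r_ref = [100] * n
r_tiled = list(r_ref)
residual_reference(r_ref, A, x, SA, Sx, dims, strides)
residual_tiled(r_tiled, A, x, SA, Sx, dims, strides)
print(r_ref == r_tiled)
```

Keeping si as the outermost loop preserves the order in which each element of r is updated, so the tiled variant gives the same result as the reference; only the traversal order within a fixed si and k changes.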
Poisson’s equation solver (PES) • Uses red-black successive over-relaxation method • Optimization method • Relaxation function: 7-point stencil • Triply nested loop – tiling the outermost two loops • Error function: sweeps through local grid to calculate L2-norm • Triply nested loop – tiling all loops and unrolling the innermost
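For reference, the two tuned functions can be sketched in serial Python as follows. This is a minimal, untuned version under stated assumptions: a cubic grid stored as nested lists, fixed boundary values, the equation written as ∇²u = f, and a relaxation factor of 1.5. It is not the PES source.

```python
import math

def red_black_sor_sweep(u, f, h, omega):
    """One in-place red-black SOR sweep with the 7-point stencil."""
    n = len(u)
    for colour in (0, 1):                 # update one colour, then the other
        for i in range(1, n - 1):
            for j in range(1, n - 1):
                for k in range(1, n - 1):
                    if (i + j + k) % 2 != colour:
                        continue
                    neighbours = (u[i-1][j][k] + u[i+1][j][k] +
                                  u[i][j-1][k] + u[i][j+1][k] +
                                  u[i][j][k-1] + u[i][j][k+1])
                    gauss_seidel = (neighbours - h * h * f[i][j][k]) / 6.0
                    u[i][j][k] += omega * (gauss_seidel - u[i][j][k])

def l2_norm_diff(u, v):
    """Sweep the whole grid and return the L2-norm of u - v."""
    total = 0.0
    for plane_u, plane_v in zip(u, v):
        for row_u, row_v in zip(plane_u, plane_v):
            for a, b in zip(row_u, row_v):
                total += (a - b) ** 2
    return math.sqrt(total)

# Made-up test problem: f = 0 with boundary value 1, so the solution is u = 1.
n = 6
h = 1.0 / (n - 1)
def on_boundary(i, j, k):
    return 0 in (i, j, k) or n - 1 in (i, j, k)
u = [[[1.0 if on_boundary(i, j, k) else 0.0 for k in range(n)]
      for j in range(n)] for i in range(n)]
f = [[[0.0] * n for _ in range(n)] for _ in range(n)]
exact = [[[1.0] * n for _ in range(n)] for _ in range(n)]
for _ in range(200):
    red_black_sor_sweep(u, f, h, omega=1.5)
print(l2_norm_diff(u, exact))
```

Points of one colour depend only on points of the other colour, which is what makes each half-sweep safe to reorder by tiling.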
Parallel Multi-block Lattice Boltzmann • Lattice Boltzmann method • Widely used method to study fluid dynamic systems • Six kernels • Initialization, collision, communication, streaming and physical • Streaming kernel accounts for more than 75% of computation time • Consists of five triply nested loops • Loop-nest outlining • Lots of memory copy operations
PMLB Optimization • Two phases of tuning • Loop fusion • First few iterations of the application evaluate different possibilities • Tiling and unrolling • Comparison of the original untuned version (compiled with -O3) to: • Harmonized application • Post-harmony runs
Code server sensitivity • How many parallel code-servers are needed to ensure that the stall_phase is not dominant? • Control parameters: problem-size (1024^3), number of processors (128) • Up to 128 new variants are generated at each search step • A typical setup for remaining experiments presented in this talk • Future plans for more robust study
PMLB (Carver Results) Harmonized application runs, on average, 1.14 times faster than the original. Best speedup for all PMLB runs on Carver: 1.48. Post-harmony runs are, on average, 1.37 times faster.
Cross-platform studies • Study how parameters differ for the two systems • Take Harmony-suggested parameters from one system and use them for a post-harmony run on the other
Future work • Exploit the spatial locality of PRO • A-priori generation of code-variants • Retire “unreachable” code variants • Reachability measured in terms of the # of search steps • Run experiments on larger processor counts • Hopper (Cray XT5 machine at NERSC) • Auto-tuning for larger applications • PFLOTRAN (collaboration with PERI folks)