
Exhaustive Phase Order Search Space Exploration and Evaluation


Presentation Transcript


  1. Exhaustive Phase Order Search Space Exploration and Evaluation by Prasad Kulkarni (Florida State University)

  2. Compiler Optimizations • To improve efficiency of compiler generated code • Optimization phases require enabling conditions • need specific patterns in the code • many also need available registers • Phases interact with each other • Applying optimizations in different orders generates different code

  3. Phase Ordering Problem • To find an ordering of optimization phases that produces optimal code with respect to all possible phase orderings • Evaluating each sequence involves compiling, assembling, linking, executing, and verifying results • The best optimization phase ordering depends on • source application • target platform • implementation of optimization phases • A long-standing problem in compiler optimization!

  4. Phase Ordering Space • Current compilers incorporate numerous different optimization phases • 15 distinct phases in our compiler backend • 15! = 1,307,674,368,000 • Phases can enable each other • any phase can be active multiple times • 15^15 = 437,893,890,380,859,375 • cannot restrict sequence length to 15 • 15^44 = 5.598 * 10^51
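The counts on this slide are easy to sanity-check; a quick Python sketch reproducing each figure:

```python
import math

# 15 distinct phases, each applied exactly once: 15! orderings
print(math.factorial(15))        # 1307674368000

# Phases may repeat; sequences of length 15: 15^15 orderings
print(15**15)                    # 437893890380859375

# Sequences of length 44 (lengths this long arise in practice): 15^44 orderings
print(f"{15**44:.3e}")           # 5.598e+51
```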

  5. Addressing Phase Ordering • Exhaustive Search • universally considered intractable • We are now able to exhaustively evaluate the optimization phase order space.

  6. Re-stating of Phase Ordering • Earlier approach • explicitly enumerate all possible optimization phase orderings • Our approach • explicitly enumerate all function instances that can be produced by any combination of phases

  7. Outline • Experimental framework • Exhaustive phase order space evaluation • Faster conventional compilation • Conclusions • Summary of my other work • Future research directions

  8. Outline • Experimental framework • Exhaustive phase order space evaluation • Faster conventional compilation • Conclusions • Summary of my other work • Future research directions

  9. Experimental Framework • We used the VPO compilation system • established compiler framework, started development in 1988 • comparable performance to gcc –O2 • VPO performs all transformations on a single representation (RTLs), so it is possible to perform most phases in an arbitrary order • Experiments use all the 15 re-orderable optimization phases in VPO • Target architecture was the StrongARM SA-100 processor

  10. VPO Optimization Phases

  11. Disclaimers • Did not include optimization phases normally associated with compiler front ends • no memory hierarchy optimizations • no inlining or other interprocedural optimizations • Did not vary how phases are applied • Did not include optimizations that require profile data

  12. Benchmarks • 12 MiBench benchmarks; 244 functions

  13. Outline • Experimental framework • Exhaustive phase order space evaluation • Faster conventional compilation • Conclusions • Summary of my other work • Future research directions

  14. Terminology • Active phase – An optimization phase that modifies the function representation • Dormant phase – A phase that is unable to find any opportunity to change the function • Function instance – any semantically, syntactically, and functionally correct representation of the source function (that can be produced by our compiler)

  15. Naïve Optimization Phase Order Space • All combinations of optimization phase sequences are attempted [Figure: complete search tree over phases a, b, c, d, expanded at levels L0 through L2]

  16. Eliminating Consecutively Applied Phases • A phase just applied in our compiler cannot be immediately active again [Figure: search tree with immediate repetitions of the same phase removed]

  17. Eliminating Dormant Phases • Get feedback from the compiler indicating if any transformations were successfully applied in a phase [Figure: search tree with dormant-phase branches pruned]

  18. Identical Function Instances • Some optimization phases are independent • example: branch chaining & register allocation • Different phase sequences can produce the same code:

      r[2]=1; r[3]=r[4]+r[2];  --instruction selection-->  r[3]=r[4]+1;

      r[2]=1; r[3]=r[4]+r[2];  --constant propagation-->  r[2]=1; r[3]=r[4]+1;
                               --dead assignment elimination-->  r[3]=r[4]+1;

  19. Equivalent Function Instances

      Source Code:
          sum = 0;
          for (i = 0; i < 1000; i++)
              sum += a[i];

      Register Allocation before Code Motion:
          r[10]=0; r[12]=HI[a]; r[12]=r[12]+LO[a]; r[1]=r[12]; r[9]=4000+r[12];
          L3: r[8]=M[r[1]]; r[10]=r[10]+r[8]; r[1]=r[1]+4; IC=r[1]?r[9]; PC=IC<0,L3;

      Code Motion before Register Allocation:
          r[11]=0; r[10]=HI[a]; r[10]=r[10]+LO[a]; r[1]=r[10]; r[9]=4000+r[10];
          L5: r[8]=M[r[1]]; r[11]=r[11]+r[8]; r[1]=r[1]+4; IC=r[1]?r[9]; PC=IC<0,L5;

      After Mapping Registers (both versions):
          r[32]=0; r[33]=HI[a]; r[33]=r[33]+LO[a]; r[34]=r[33]; r[35]=4000+r[33];
          L01: r[36]=M[r[34]]; r[32]=r[32]+r[36]; r[34]=r[34]+4; IC=r[34]?r[35]; PC=IC<0,L01;
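The register mapping shown on this slide can be sketched as a renaming pass that assigns canonical names in order of first appearance, so instances that differ only in register allocation compare equal. A minimal sketch (the `map_registers` helper and the simplified RTL syntax are illustrative, not VPO's implementation):

```python
import re

def map_registers(rtls):
    """Rename registers in order of first use so that function instances
    that differ only in register numbering become identical."""
    mapping = {}
    def rename(match):
        reg = match.group(0)
        if reg not in mapping:
            mapping[reg] = f"r[{32 + len(mapping)}]"  # next canonical name
        return mapping[reg]
    return [re.sub(r"r\[\d+\]", rename, rtl) for rtl in rtls]

# Two loop bodies that differ only in which registers were allocated:
a = ["r[10]=0;", "r[8]=M[r[1]];", "r[10]=r[10]+r[8];"]
b = ["r[11]=0;", "r[8]=M[r[1]];", "r[11]=r[11]+r[8];"]
print(map_registers(a) == map_registers(b))  # True
```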

  20. Efficient Detection of Unique Function Instances • After pruning dormant phases there may be tens or hundreds of thousands of unique instances • Use a CRC (cyclic redundancy check) checksum on the bytes of the RTLs representing the instructions • Used a hash table to check if an identical or equivalent function instance already exists in the DAG
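A sketch of that detection step, with Python's `zlib.crc32` standing in for the CRC used in VPO and a plain dict standing in for the hash table (the RTL strings are illustrative):

```python
import zlib

seen = {}  # CRC checksum -> previously enumerated function instance

def is_new_instance(rtls):
    """Return True the first time this function instance is seen;
    duplicates are pruned from the search space."""
    key = zlib.crc32("\n".join(rtls).encode())
    if key in seen:
        return False   # identical instance already in the DAG
    seen[key] = rtls
    return True

print(is_new_instance(["r[3]=r[4]+1;"]))   # True: first occurrence
print(is_new_instance(["r[3]=r[4]+1;"]))   # False: duplicate pruned
```

Checksumming the instruction bytes avoids storing and comparing every full function instance when the space grows to hundreds of thousands of instances.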

  21. Eliminating Identical/Equivalent Function Instances • Resulting search space is a DAG of function instances [Figure: search space collapsed into a DAG — different phase orderings converge on shared function instances]

  22. Static Enumeration Results

  23. Exhaustively enumerated the optimization phase order space to find an optimal phase ordering with respect to code size [Published in CGO ’06]

  24. Determining Program Performance • Almost 175,000 distinct function instances per function, on average • largest enumerated function has 2,882,021 instances • Too time consuming to execute each distinct function instance • assemble → link → execute is more expensive than compilation • Many embedded development environments use simulation • simulation is orders of magnitude more expensive than native execution • Use data obtained from a few executions to estimate the performance of all remaining function instances

  25. Determining Program Performance (cont...) • Function instances having identical control-flow graphs execute each block the same number of times • Execute application once for each control-flow structure • Statically estimate the number of cycles required to execute each basic block • dynamic frequency measure = Σ (static cycles * block frequency)
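The measure reduces to a single sum over basic blocks; the cycle estimates and block frequencies below are invented for illustration:

```python
# dynamic frequency measure = sum over basic blocks of
#   (statically estimated cycles for the block) * (block execution frequency)
def dynamic_frequency(blocks):
    return sum(cycles * freq for cycles, freq in blocks)

# (static cycles, execution frequency) per basic block -- illustrative values
instance = [(4, 20), (5, 1), (2, 15), (5, 2)]
print(dynamic_frequency(instance))  # 4*20 + 5*1 + 2*15 + 5*2 = 125
```

Because all instances sharing a control-flow graph reuse the same frequencies, one execution per control-flow structure prices every instance with that structure.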

  26. Predicting Relative Performance – I [Figure: two control-flow graphs annotated with per-block static cycle estimates and execution frequencies; Total cycles = 789 vs. Total cycles = 744]

  27. Dynamic Frequency Results

  28. Correlation – Dynamic Frequency Counts Vs. Simulator Cycles • Static performance estimation is inaccurate • ignored cache/branch misprediction penalties • Dynamic frequency counts may be sufficiently accurate • simplification of the estimation problem • most embedded systems have simpler architectures • We show strong correlation between our measure of performance and simulator cycles

  29. Complete Function Correlation • Example: init_search in stringsearch

  30. Leaf Function Correlation • Leaf function instances are generated when no additional phases can be successfully applied • Leaf instances provide a good sampling • represent the only code that can be generated by an aggressive compiler, like VPO • at least one leaf instance represents an optimal phase ordering for over 86% of functions • a significant percent of leaf instances are among the optimal

  31. Leaf Function Correlation Statistics • Pearson’s correlation coefficient:

      Pcorr = (Σxy − (Σx * Σy)/n) / sqrt( (Σx² − (Σx)²/n) * (Σy² − (Σy)²/n) )

  • Accuracy of our estimate of optimal performance:

      Lcorr = (cycle count for best leaf) / (cycle count for leaf with best dynamic frequency count)
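In code, Pearson's coefficient looks like this; the sample data is invented, with `x` playing the dynamic frequency estimate and `y` the simulator cycle count per leaf instance:

```python
import math

def pearson(x, y):
    """Pearson's correlation coefficient between two equal-length samples."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sx2, sy2 = sum(a * a for a in x), sum(b * b for b in y)
    return (sxy - sx * sy / n) / math.sqrt((sx2 - sx**2 / n) * (sy2 - sy**2 / n))

# Estimates that are a perfect linear function of cycle counts correlate at 1.0
x = [100, 120, 140, 180]
y = [205, 245, 285, 365]    # y = 2x + 5
print(pearson(x, y))        # 1.0
```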

  32. Leaf Function Correlation Statistics (cont…)

  33. Exhaustively evaluated the optimization phase order space to find a near-optimal phase ordering with respect to simulator cycles [Published in LCTES ’06]

  34. Outline • Experimental framework • Exhaustive phase order space evaluation • Faster conventional compilation • Conclusions • Summary of my other work • Future research directions

  35. Phase Enabling Interaction • b enables a along the path a-b-a [Figure: DAG fragment showing phase b enabling phase a]

  36. Phase Enabling Probabilities

  37. Phase Disabling Interaction • b disables a along the path b-c-d [Figure: DAG fragment showing phase b disabling phase a]

  38. Disabling Probabilities

  39. Faster Conventional Compiler • Modified VPO to use enabling and disabling phase probabilities to decrease compilation time

      # p[i]    - current probability of phase i being active
      # e[i][j] - probability of phase j enabling phase i
      # d[i][j] - probability of phase j disabling phase i
      foreach phase i do
          p[i] = e[i][st]
      while (any p[i] > 0) do
          select j as the current phase with highest probability of being active
          apply phase j
          if phase j was active then
              for each phase i, where i != j do
                  p[i] += ((1 - p[i]) * e[i][j]) - (p[i] * d[i][j])
          p[j] = 0
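A runnable version of the pseudocode above, with made-up probability tables and a stub apply_phase (in VPO these come from measured phase interactions). One reading assumed here: p[j] is reset whether or not phase j was active, otherwise a dormant phase would be re-selected forever:

```python
def probabilistic_compile(phases, e, d, apply_phase, start="st"):
    """Apply phases in decreasing order of their probability of being active.
    e[i][j]: probability that phase j enables phase i
    d[i][j]: probability that phase j disables phase i"""
    p = {i: e[i][start] for i in phases}
    while any(prob > 0 for prob in p.values()):
        j = max(p, key=p.get)              # phase most likely to be active
        if apply_phase(j):                 # phase j was active
            for i in phases:
                if i != j:
                    p[i] += (1 - p[i]) * e[i][j] - p[i] * d[i][j]
        p[j] = 0   # j stays off until another active phase re-enables it

# Hypothetical probabilities for two phases (not measured from VPO)
e = {"a": {"st": 0.9, "a": 0.0, "b": 0.5},
     "b": {"st": 0.4, "a": 0.7, "b": 0.0}}
d = {"a": {"st": 0.0, "a": 0.0, "b": 0.1},
     "b": {"st": 0.0, "a": 0.0, "b": 0.0}}

applied = []
def apply_once(phase):
    applied.append(phase)
    return applied.count(phase) == 1   # active first time, dormant afterwards

probabilistic_compile(["a", "b"], e, d, apply_once)
print(applied)  # ['a', 'b', 'a']
```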

  40. Probabilistic Compilation Results

  41. Outline • Experimental framework • Exhaustive phase order space evaluation • Faster conventional compilation • Conclusions • Summary of my other work • Future research directions

  42. Conclusions • Phase ordering problem • long-standing problem in compiler optimization • exhaustive evaluation always considered infeasible • Exhaustively evaluated the phase order space • re-interpretation of the problem • novel application of search algorithms • fast pruning techniques • accurate prediction of relative performance • Analyzed properties of the phase order space to speed up conventional compilation • published in CGO ’06, LCTES ’06, submitted to TOPLAS

  43. Challenges • Exhaustive phase order search is a severe stress test for the compiler • isolate analysis required and invalidated by each phase • produce correct code for all phase orderings • eliminate all memory leaks • Search algorithm needs to be highly efficient • used CRCs and hashes for function comparisons • stored intermediate function instances to reduce disk access • maintained logs to restart search after crash

  44. Outline • Experimental framework • Exhaustive phase order space evaluation • Faster conventional compilation • Conclusions • Summary of my other work • Future research directions

  45. VISTA • Provides an interactive code improvement paradigm • view low-level program representation • apply existing phases and manual changes in any order • browse and undo previous changes • automatically obtain performance information • automatically search for effective phase sequences • Useful as a research as well as a teaching tool • employed at three universities • published in LCTES ’03, TECS ‘06

  46. VISTA – Main Window

  47. Faster Genetic Algorithm Searches • Improving performance of genetic algorithms • avoid redundant executions of the application • over 87% of executions were avoided • reduce search time by 62% • modify search to obtain comparable results in fewer generations • reduced GA generations by 59% • reduce search time by 35% • published in PLDI ’04, TACO ’05
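The "avoid redundant executions" idea above can be sketched as memoization keyed on a checksum of the generated code, so phase sequences that happen to produce identical code trigger only one expensive run (the `measure` helper and stand-in cycle count are illustrative):

```python
import zlib

execution_cache = {}  # checksum of generated code -> measured performance

def measure(code, run):
    """Execute the application only if this exact code has not been
    measured before; otherwise reuse the cached result."""
    key = zlib.crc32(code.encode())
    if key not in execution_cache:
        execution_cache[key] = run(code)  # expensive: assemble, link, execute
    return execution_cache[key]

runs = 0
def run_app(code):
    global runs
    runs += 1
    return 42  # stand-in for a measured cycle count

measure("r[3]=r[4]+1;", run_app)   # executes the application
measure("r[3]=r[4]+1;", run_app)   # cache hit: no second execution
print(runs)  # 1
```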

  48. Heuristic Search Algorithms • Analyzing the phase order space to improve heuristic algorithms • detailed performance and cost comparison of different heuristic algorithms • demonstrated the importance and difficulty of selecting the correct sequence length • illustrated the importance of leaf function instances • proposed modifications to existing algorithms, and new search algorithms • published in CGO ‘07

  49. Dynamic Compilation • Explored asynchronous dynamic compilation in a virtual machine • demonstrated shortcomings of the current popular compilation strategy • described the importance of minimum compiler utilization • discussed new compilation strategies • explored the changes needed to current compilation strategies to exploit free cycles • To be published in VEE ‘07

  50. Outline • Experimental framework • Exhaustive phase order space evaluation • Faster conventional compilation • Conclusions • Summary of my other work • Future research directions
