Dynamic Binary Optimization
Presentation Transcript
Dynamic Binary Optimization Presenter Kim Jin Chul
Contents
1. Overview of Applying Optimization on VMs
2. Dynamic Program Behavior
3. Profiling
4. Optimizing Translation Blocks
Classical Optimizations
• Translation from IA-32 to PowerPC code
  IA-32 source:
    addl %edx, 4(%eax)
    movl 4(%eax), %edx
  PowerPC translation:
    addi r16, r4, 4    ; add 4 to %eax
    lwzx r17, r2, r16  ; load operand from memory
    add  r7, r17, r7   ; perform add of %edx
    addi r16, r4, 4    ; add 4 to %eax
    stwx r7, r2, r16   ; store %edx value into memory
• After applying common subexpression elimination (the second addi recomputes a value that r16 already holds)
    addi r16, r4, 4    ; add 4 to %eax
    lwzx r17, r2, r16  ; load operand from memory
    add  r7, r17, r7   ; perform add of %edx
    stwx r7, r2, r16   ; store %edx value into memory
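The elimination step on this slide can be sketched in a few lines. This is a minimal illustration, not the translator's actual pass: the tuple encoding `(opcode, dest, sources)`, the `PURE_OPS` set, and the function name are assumptions made for the example, and only recomputations into the same register are removed.

```python
# Hypothetical encoding: (opcode, destination register or None, source operands).
PURE_OPS = {"addi", "add"}  # register-only operations whose results can be reused


def eliminate_common_subexpressions(block):
    """Drop an instruction when its destination already holds the same value."""
    available = {}  # (opcode, sources) -> register currently holding that value
    result = []
    for opcode, dest, sources in block:
        key = (opcode, sources)
        if opcode in PURE_OPS and available.get(key) == dest:
            continue  # redundant: dest was computed from unchanged operands
        result.append((opcode, dest, sources))
        if dest is not None:
            # Writing dest invalidates values held in dest or computed from dest.
            available = {k: r for k, r in available.items()
                         if r != dest and dest not in k[1]}
            if opcode in PURE_OPS and dest not in sources:
                available[key] = dest
    return result


translated = [
    ("addi", "r16", ("r4", 4)),          # add 4 to %eax
    ("lwzx", "r17", ("r2", "r16")),      # load operand from memory
    ("add", "r7", ("r17", "r7")),        # perform add of %edx
    ("addi", "r16", ("r4", 4)),          # recomputes the same address
    ("stwx", None, ("r7", "r2", "r16")), # store %edx value into memory
]
optimized = eliminate_common_subexpressions(translated)
```

On the five-instruction sequence above, the second `addi` is dropped because neither `r4` nor `r16` is written between the two occurrences.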
Optimization Based on Profiling
• Original code
  Basic Block A:
    ...
    R3 ← ...
    R7 ← ...
    R1 ← R2 + R3
    Br L1 if R3 == 0
  Basic Block B:
    ...
    R6 ← R1 + R6
    ...
  Basic Block C:
    L1: R1 ← 0
    ...
• Block B uses R1; block C redefines R1 (def)
• Step 1: R1 ← R2 + R3 is removed from Basic Block A
• Step 2: compensation code R1 ← R2 + R3 is added ahead of Basic Block B
Optimization Based on Profiling
• Original code
  Basic Block A:
    ...
    R3 ← ...
    R7 ← ...
    R1 ← R2 + R3
    Br L1 if R3 == 0
  Basic Block B:
    ...
    R6 ← R1 + R6
    ...
  Basic Block C:
    L1: R1 ← 0
    ...
• After forming a superblock (blocks A and C merged, branch sense inverted)
  Superblock:
    ...
    R3 ← ...
    R7 ← ...
    Br L2 if R3 != 0
    R1 ← 0
    ...
  Compensation code:
    R1 ← R2 + R3
  Basic Block B:
    L2: ...
    R6 ← R1 + R6
    ...
A staged optimization system
• Stages: Interpret → Basic translation → Optimized blocks → Highly optimized blocks
• Startup: fast → very slow
• Steady state: slow → fast
• Profiling: simple → extensive
• Components: Interpreter, Binary memory image, Basic block cache, Code cache, Profile data, Optimizer, Translator, Emulation manager
Dynamic Program Behavior
• Dynamic control flow is highly predictable
         ...
         R3 ← 100
  loop:  R1 ← mem(R2)
         Br found if R1 == -1
         R2 ← R2 + 4
         R3 ← R3 - 1
         Br loop if R3 != 0
         ...
  found: ...
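To see why the backward branch in this loop is so predictable, the loop can be simulated directly. The sketch below is an illustration only; the function name, the word-indexed `memory` list, and the default trip count of 100 (taken from `R3 ← 100`) are assumptions made for the example.

```python
def run_search_loop(memory, trip_count=100, sentinel=-1):
    """Simulate the slide's loop and count outcomes of 'Br loop if R3 != 0'."""
    taken = not_taken = 0
    r2 = 0            # byte offset into memory, advanced by 4 per iteration
    r3 = trip_count   # loop counter
    while True:
        r1 = memory[r2 // 4]      # R1 <- mem(R2)
        if r1 == sentinel:        # Br found if R1 == -1
            break
        r2 += 4
        r3 -= 1
        if r3 != 0:               # Br loop if R3 != 0
            taken += 1
        else:
            not_taken += 1
            break
    return taken, not_taken


# No sentinel present: the backward branch is taken on every iteration but the last.
taken, not_taken = run_search_loop([0] * 100)
```

When the sentinel is absent, the branch goes the same way 99 times out of 100, so a simple "same as last time" guess is almost always right.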
Dynamic Program Behavior
• Distribution of taken conditional branches
• [Figure: histogram; x-axis: percent taken (0-10% through >90%); y-axis: fraction of static conditional branches (0% to 50%)]
• Predominantly not taken: 28%
• Predominantly taken: 42%
Dynamic Program Behavior
• Consistency of conditional branches
• Backward branches account for much of the high percentage
• [Figure: bar chart; x-axis: SPEC benchmark (176.gcc, 181.mcf, 197.parser, 252.eon, 256.bzip2, 171.swim, 173.applu, 177.mesa, 187.facerec, 189.lucas); y-axis: dynamic branches decided same as previous time (0% to 100%)]
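The metric on this slide, the share of dynamic branches decided the same way as the previous time, can be computed from a branch trace as follows. This is a sketch under assumptions: the trace format (a list of `(pc, taken)` pairs) and the choice to skip the first execution of each branch, which has no previous outcome, are not specified by the slide.

```python
def same_as_previous_fraction(trace):
    """Fraction of dynamic branches whose outcome matches that branch's last outcome."""
    last_outcome = {}  # branch PC -> outcome on its previous execution
    same = total = 0
    for pc, taken in trace:
        if pc in last_outcome:
            total += 1
            if last_outcome[pc] == taken:
                same += 1
        last_outcome[pc] = taken
    return same / total if total else 0.0


# Hypothetical trace: one loop-closing branch, taken 99 times and then not taken.
loop_trace = [(0x20, True)] * 99 + [(0x20, False)]
fraction = same_as_previous_fraction(loop_trace)
```

For the loop-closing branch in the example, only the final exit from the loop disagrees with the previous outcome.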
Dynamic Program Behavior
• The predictability of indirect jumps
• Some jump destination addresses seldom change
• [Figure: histogram; x-axis: number of different destinations (1 through >9); y-axis: percent of indirect jumps (0% to 25%)]
Dynamic Program Behavior
• The predictability of data values
• Static: static instructions that always compute the same value
• Dynamic: dynamic instructions that execute those static instructions
• [Figure: bar chart with Static and Dynamic bars; x-axis: instruction type (All, Add/Sub, Load, Logic, Shift, Set); y-axis: fraction with constant value (0 to 0.7)]
Profiling
• The process of collecting instruction and data statistics for an executing program
• How optimization based on profiling works
• [Figure: staged emulation system; components: Interpreter, Binary memory image, Basic block cache, Code cache, Profile data, Optimizer, Translator, Emulation manager]
The Role of Profiling
• Traditional profiling
• [Figure: components: HLL Program, Compiler Frontend, Compiler Backend, Instrumented Code, Test Data, Program Execution, Program Statistics, Optimizing Compiler, Optimized Binary; inset: control flow graph with blocks A-F]
The Role of Profiling
• On-the-fly profiling in a dynamic optimizing VM
• [Figure: components: Program Binary, Program Data, Interpreter, Partial Program Statistics, Translator/Optimizer; inset: control flow graph with blocks A, B, D, E]
Types of Profiles
• Several types of profile data
• How frequently are different code regions executed?
  • This can be used to decide the level of optimization
• Is control flow predictable?
  • This may be used as the basis for gathering and rearranging basic blocks
  • Rearranged basic blocks get a chance to be merged into a superblock
Types of Profiles
• A basic block profile
• An edge profile
• [Figure: the same control flow graph (blocks A-F) shown twice: once with an execution count on each block, once with an execution count on each edge]
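The two profile types are related: an edge profile determines the basic block profile, but not the other way round. The sketch below illustrates this with made-up numbers; the edge counts, the entry count, and the function name are assumptions for the example and are not read from the figure.

```python
from collections import Counter


def block_profile_from_edges(edge_counts, entry, entry_count):
    """Derive per-block execution counts by summing the counts of incoming edges."""
    counts = Counter({entry: entry_count})  # executions that enter from outside
    for (_src, dst), n in edge_counts.items():
        counts[dst] += n
    return dict(counts)


# Hypothetical edge profile: (source block, destination block) -> execution count.
edge_profile = {
    ("A", "B"): 30, ("A", "D"): 70,
    ("B", "C"): 29, ("B", "F"): 1,
    ("D", "E"): 68, ("D", "F"): 2,
    ("C", "G"): 29, ("E", "G"): 68, ("F", "G"): 3,
    ("G", "A"): 97,
}
block_profile = block_profile_from_edges(edge_profile, entry="A", entry_count=3)
```

The reverse derivation is not possible in general, because a block count says how often a block ran but not which successor followed it. That is why an edge profile is the more useful input when rearranging blocks.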
Collecting Profiles
• Instrumentation-based profiling
  • Targets specific program-related events and counts all instances of the events being profiled
  • Software-based vs. hardware-based
  • Speed? Support? Flexibility?
• Sampling-based profiling
  • The program runs in its unmodified form; at intervals it is interrupted and an event is captured
• Instrumentation vs. sampling
  • Overhead per captured event: instrumentation < sampling
  • Sampling causes traps!
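The difference between the two collection methods can be illustrated on a recorded event trace. This is only a sketch: the trace of block names, the fixed sampling interval of 7, and both function names are assumptions for the example, and a real sampler would be driven by timer or counter interrupts rather than by list slicing.

```python
from collections import Counter


def instrumented_profile(trace):
    """Instrumentation: every instance of the profiled event is counted."""
    return Counter(trace)


def sampled_profile(trace, interval):
    """Sampling: only every interval-th event is captured."""
    return Counter(trace[interval - 1::interval])


# Hypothetical trace of executed blocks: A is the hottest, C the coldest.
trace = ["A", "B", "A", "B", "A", "C"] * 100

exact = instrumented_profile(trace)
sampled = sampled_profile(trace, interval=7)
estimate = {block: count * 7 for block, count in sampled.items()}
```

Sampling touches far fewer events, so the scaled estimate is approximate but still ranks the blocks in the same order as the exact counts here. A fixed interval can line up with periodic program behavior and skew the result, which is one reason intervals are sometimes randomized.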
Profiling During Interpretation
• PowerPC Branch Conditional Interpreter Routine (reached through the instruction function list)
  branch_conditional(inst) {
      BO = extract(inst, 25, 5);
      BI = extract(inst, 20, 5);
      displacement = extract(inst, 15, 14) * 4;
      ...
      // code to compute whether branch should be taken
      ...
      profile_addr = lookup(PC);   // find this branch's profile table entry
      if (branch_taken) {
          profile_cnt(profile_addr, taken);
          PC = PC + displacement;
      } else {
          profile_cnt(profile_addr, nottaken);
          PC = PC + 4;
      }
  }
• Profile Table for Collecting an Edge Profile During Interpretation
  • The branch PC is hashed to select a table entry
  • Each entry holds: PC, Taken count, Not-taken count
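The profile table and the profiling calls in the interpreter routine can be approximated as follows. This is a sketch, not the presenter's implementation: `EdgeProfileTable` and `interpret_branch` are hypothetical names, and a Python dictionary stands in for the hashed table.

```python
class EdgeProfileTable:
    """Maps a branch PC to its taken and not-taken counts."""

    def __init__(self):
        self._entries = {}  # branch PC -> [taken count, not-taken count]

    def record(self, pc, taken):
        entry = self._entries.setdefault(pc, [0, 0])
        entry[0 if taken else 1] += 1

    def counts(self, pc):
        taken, not_taken = self._entries.get(pc, (0, 0))
        return taken, not_taken


def interpret_branch(table, pc, displacement, branch_taken):
    """Profile one conditional branch, then return the next PC."""
    table.record(pc, branch_taken)
    return pc + displacement if branch_taken else pc + 4


# Hypothetical loop-closing branch at 0x100 that jumps back 16 bytes.
table = EdgeProfileTable()
pc = 0x100
for iteration in range(100):
    next_pc = interpret_branch(table, pc, -16, branch_taken=(iteration < 99))
```

Each branch contributes two edge counts, one per outcome, which is what makes this an edge profile rather than a basic block profile.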
Profiling Translated Code
• Edge Profiling Code Inserted into Stubs of a Binary Translated Basic Block
  Fall-through stub:
    increment edge counter (i)
    if (counter (i) > trigger) then invoke optimizer
    else branch to fall-through basic block
  Target stub:
    increment edge counter (j)
    if (counter (j) > trigger) then invoke optimizer
    else branch to target basic block
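The stub logic amounts to a counter with a threshold. The sketch below shows that behavior in isolation; the function name, the edge labels, the trigger value of 3, and the string return values are all assumptions for the example.

```python
def run_stub(counters, edge, trigger):
    """Execute one pass through a profiling stub for the given edge.

    Returns "optimize" when the edge counter exceeds the trigger,
    otherwise "branch" to continue to the next basic block.
    """
    counters[edge] = counters.get(edge, 0) + 1
    if counters[edge] > trigger:
        return "optimize"
    return "branch"


counters = {}
# Hypothetical edge label: the fall-through edge out of a translated block.
outcomes = [run_stub(counters, ("block_A", "fall_through"), trigger=3)
            for _ in range(5)]
```

With a trigger of 3, the first three executions branch normally and the fourth invokes the optimizer. In a real system the stub would normally be replaced once the optimized code is installed, so it would not keep firing as it does in this simplified loop.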
Profiling Overhead
• Profiling during interpretation incurs roughly 10-20% overhead
• Profiling overhead can be reduced
  • Reduce the number of instrumentation points by selecting a smaller set of key points
Optimizing Translation Blocks
• Two-part strategy for optimizing
  • Using dominant control flow to enhance memory locality
  • Making translation blocks larger
    • Traces, superblocks, tree groups
• The two parts of the strategy are actually relatively independent
Improving Locality
• Two kinds of memory locality
  • Spatial locality
    • An access to a memory location is soon followed by an access to an adjacent memory location
  • Temporal locality
    • A memory location that is accessed is likely to be accessed again in the near future
Improving Locality
• Example code sequence
    A
    Br cond1 == true
    B
    Br cond2 == false
    C
    Br uncond
    D
    Br cond3 == true
    E
    Br uncond
    F
    G
    Br cond4 == true
• [Figure: control flow graph of blocks A-G annotated with edge execution counts]
Improving Locality
• Rearrange the blocks in memory
    A
    Br cond1 == false
    D
    Br cond3 == true
    E
    G
    Br cond4 == true
    Br uncond
    B
    Br cond2 == false
    C
    Br uncond
    F
    Br uncond
• [Figure: control flow graph of blocks A-G annotated with edge execution counts]
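One simple way to arrive at a layout like this is to follow the hottest outgoing edge from each block. The sketch below is a greedy heuristic offered as an illustration, not the method used in the presentation; the function name is hypothetical and the edge counts are made up, chosen so that the result matches the block order shown on the slide.

```python
def layout_blocks(entry, edge_counts):
    """Greedy layout: from each block, place its hottest unplaced successor next."""
    successors = {}
    blocks = [entry]
    for (src, dst), count in edge_counts.items():
        successors.setdefault(src, []).append((count, dst))
        for block in (src, dst):
            if block not in blocks:
                blocks.append(block)

    placed = []
    for start in blocks:  # start a new chain at each block that is still unplaced
        current = start
        while current is not None and current not in placed:
            placed.append(current)
            candidates = [(c, d) for c, d in successors.get(current, [])
                          if d not in placed]
            current = max(candidates)[1] if candidates else None
    return placed


# Hypothetical edge profile: (source block, destination block) -> execution count.
edge_profile = {
    ("A", "B"): 30, ("A", "D"): 70,
    ("B", "C"): 29, ("B", "F"): 1,
    ("D", "E"): 68, ("D", "F"): 2,
    ("C", "G"): 29, ("E", "G"): 68, ("F", "G"): 3,
    ("G", "A"): 97,
}
order = layout_blocks("A", edge_profile)
```

With these counts the hot path A, D, E, G is laid out contiguously, followed by B, C and then F, so the most frequent transitions become fall-throughs and the rarely executed blocks move out of the way.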
Improving Locality
• Procedure inlining
• Positive and negative effects?
• [Figure: caller blocks A, B, K, L containing two "Call proc xyz" sites, and procedure xyz (blocks X, Y, Z, Return), shown before and after inlining]
Traces
• Trace
  • A contiguous sequence of basic blocks
  • May have both side entrances and side exits
• [Figure: control flow graph of blocks A-G with edge execution counts, divided into Trace 1, Trace 2 and Trace 3]
• [Figure: relations between superblocks and traces]
Superblocks
• Superblocks
  • Regions of code with only one entry and one or more exit points
• [Figure: control flow graph of blocks A-G with edge execution counts, before and after superblock formation; block G is duplicated in the second graph]
Superblocks
• Rearranged code before duplication
    A
    Br cond1 == false
    D
    Br cond3 == true
    E
    G
    Br cond4 == true
    Br uncond
    B
    Br cond2 == false
    C
    Br uncond
    F
    Br uncond
• Code after duplicating block G to form superblocks
    A
    Br cond1 == false
    D
    Br cond3 == true
    E
    G
    Br cond4 == true
    Br uncond
    B
    Br cond2 == false
    C
    G
    Br cond4 == true
    Br uncond
    F
    G
    Br cond4 == true
    Br uncond
Tree Groups
• Tree groups
  • Regions of code with only one entry and one or more exit points
• Figure 4.7: tree group over blocks A-G, in which block G appears three times
SPEC benchmarks
• Integer SPEC benchmarks
  • 176.gcc – GNU Compiler
  • 181.mcf – Combinatorial Optimization
  • 197.parser – Word Processor
  • 252.eon – Computer Visualization
  • 256.bzip2 – Compression
• Floating-Point SPEC benchmarks
  • 171.swim – Shallow Water Modeling
  • 173.applu – Parabolic/Elliptic Partial Differential Equations
  • 187.facerec – Image Processing
  • 189.lucas – Number Theory