Code Optimization

Outline • Optimization Blockers • Memory aliasing • Side effects in function calls • Understanding Modern Processors • Superscalar • Out-of-order execution • More Code Optimization Techniques • Performance Tuning • Suggested reading • 5.1, 5.7 ~ 5.16

5.1 Capabilities and Limitations of Optimizing Compilers • Review of • 5.3 Program Example • 5.4 Eliminating Loop Inefficiencies • 5.5 Reducing Procedure Calls • 5.6 Eliminating Unneeded Memory References

Example P387
void combine1(vec_ptr v, data_t *dest)
{
    int i;
    *dest = IDENT;
    for (i = 0; i < vec_length(v); i++) {
        data_t val;
        get_vec_element(v, i, &val);
        *dest = *dest OPER val;
    }
}

Example P388
void combine2(vec_ptr v, int *dest)
{
    int i;
    int length = vec_length(v);
    *dest = IDENT;
    for (i = 0; i < length; i++) {
        int val;
        get_vec_element(v, i, &val);
        *dest = *dest OPER val;
    }
}

Example P392
void combine3(vec_ptr v, int *dest)
{
    int i;
    int length = vec_length(v);
    int *data = get_vec_start(v);
    *dest = IDENT;
    for (i = 0; i < length; i++) {
        *dest = *dest OPER data[i];
    }
}

Example P394
void combine4(vec_ptr v, int *dest)
{
    int i;
    int length = vec_length(v);
    int *data = get_vec_start(v);
    int x = IDENT;
    for (i = 0; i < length; i++)
        x = x OPER data[i];
    *dest = x;
}
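The versions above can be tried out with a small stand-in for the vector abstraction. This is a minimal sketch, assuming a hypothetical `vec_rec` struct and fixing `OPER` to `+` and `IDENT` to `0`; the slides do not show the real definitions, so these are illustrative only. It contrasts the first and last versions.

```c
/* Assumed stand-ins: the slides do not show these definitions. */
#define IDENT 0
#define OPER  +

typedef struct {
    int  len;
    int *data;
} vec_rec, *vec_ptr;

int vec_length(vec_ptr v) { return v->len; }

int *get_vec_start(vec_ptr v) { return v->data; }

/* Bounds-checked element access; returns 0 on a bad index, 1 on success. */
int get_vec_element(vec_ptr v, int index, int *dest)
{
    if (index < 0 || index >= v->len)
        return 0;
    *dest = v->data[index];
    return 1;
}

/* combine1: calls vec_length and get_vec_element on every iteration. */
void combine1(vec_ptr v, int *dest)
{
    int i;
    *dest = IDENT;
    for (i = 0; i < vec_length(v); i++) {
        int val;
        get_vec_element(v, i, &val);
        *dest = *dest OPER val;
    }
}

/* combine4: length hoisted, direct data access, local accumulator. */
void combine4(vec_ptr v, int *dest)
{
    int i;
    int length = vec_length(v);
    int *data = get_vec_start(v);
    int x = IDENT;
    for (i = 0; i < length; i++)
        x = x OPER data[i];
    *dest = x;
}
```

When `dest` does not point into the vector, both functions store the same result; they differ only in how many calls and memory references the loop performs.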

Machine Independent Opt. Results • Optimizations • Reduce function calls and memory references within loop

Machine Independent Opt. Results • Performance Anomaly • Computing the FP product of all elements is exceptionally slow • Very large speedup when accumulating in a temporary • Memory uses a 64-bit format, registers use 80 bits • Benchmark data caused overflow of 64 bits, but not 80 • [Results table not recovered; rows: Combine1, P385 Combine1, P388 Combine2, P392 Combine3, P394 Combine4]

Optimization Blockers P394
void combine4(vec_ptr v, int *dest)
{
    int i;
    int length = vec_length(v);
    int *data = get_vec_start(v);
    int sum = 0;
    for (i = 0; i < length; i++)
        sum += data[i];
    *dest = sum;
}

Optimization Blocker: Memory Aliasing P394 • Aliasing • Two different memory references specify single location • Example • v: [3, 2, 17] • combine3(v, get_vec_start(v)+2) --> ? • combine4(v, get_vec_start(v)+2) --> ?
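The two calls in the example can be worked through with plain-array versions of combine3 and combine4. This is a sketch under the assumption that the operation is integer addition with identity 0, as in the sum version of combine4; the function names `combine3_sum` and `combine4_sum` are hypothetical.

```c
/* combine3 style: accumulates directly in *dest, so every iteration
   reads and writes memory through the destination pointer. */
void combine3_sum(int *data, int length, int *dest)
{
    int i;
    *dest = 0;
    for (i = 0; i < length; i++)
        *dest = *dest + data[i];
}

/* combine4 style: accumulates in a local variable and writes once. */
void combine4_sum(int *data, int length, int *dest)
{
    int i;
    int x = 0;
    for (i = 0; i < length; i++)
        x += data[i];
    *dest = x;
}
```

With v = [3, 2, 17] and dest pointing at the last element, the combine3 style overwrites v[2] as it goes: [3, 2, 0], then [3, 2, 3], [3, 2, 5], and finally [3, 2, 10]. The combine4 style reads the original 17 before storing, so it ends with [3, 2, 22]. Because the two results differ, a compiler cannot turn combine3 into combine4 on its own.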

Optimization Blocker: Memory Aliasing • Observations • Easily happens in C • Since address arithmetic is allowed • Direct access to storage structures • Get in the habit of introducing local variables • Accumulating within loops • Your way of telling the compiler not to check for aliasing

Optimizing Compilers • Provide efficient mapping of program to machine • register allocation • code selection and ordering • eliminating minor inefficiencies

Optimizing Compilers • Don’t (usually) improve asymptotic efficiency • up to programmer to select best overall algorithm • big-O savings are (often) more important than constant factors • but constant factors also matter • Have difficulty overcoming “optimization blockers” • potential memory aliasing • potential procedure side-effects

Limitations of Optimizing Compilers • Operate Under Fundamental Constraint • Must not cause any change in program behavior under any possible condition • Often prevents optimizations that would affect behavior only under pathological conditions

Limitations of Optimizing Compilers • Behavior that may be obvious to the programmer can be obfuscated by languages and coding styles • e.g., data ranges may be more limited than variable types suggest • e.g., using an “int” in C for what could be an enumerated type

Limitations of Optimizing Compilers • Most analysis is performed only within procedures • whole-program analysis is too expensive in most cases • Most analysis is based only on static information • compiler has difficulty anticipating run-time inputs • When in doubt, the compiler must be conservative

Optimization Blockers P380 • Memory aliasing
void twiddle1(int *xp, int *yp)
{
    *xp += *yp;
    *xp += *yp;
}
void twiddle2(int *xp, int *yp)
{
    *xp += 2 * *yp;
}
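The difference between the two functions only shows up when both pointers refer to the same location. The sketch below repeats the two functions and adds two hypothetical helpers, `twiddle1_aliased` and `twiddle2_aliased`, that call them with xp equal to yp.

```c
void twiddle1(int *xp, int *yp)
{
    *xp += *yp;   /* with xp == yp this doubles *xp */
    *xp += *yp;   /* and this doubles it again      */
}

void twiddle2(int *xp, int *yp)
{
    *xp += 2 * *yp;   /* with xp == yp this triples *xp */
}

/* Hypothetical helpers: apply each function with both pointers aliased. */
int twiddle1_aliased(int x)
{
    twiddle1(&x, &x);
    return x;
}

int twiddle2_aliased(int x)
{
    twiddle2(&x, &x);
    return x;
}
```

Starting from 5, the aliased twiddle1 yields 20 while the aliased twiddle2 yields 15. With distinct pointers the two functions agree, which is why the compiler may not replace one with the other unless it can rule out aliasing.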

Optimization Blockers P381 • Function call and side effect
int f(int);
int func1(int x)
{
    return f(x) + f(x) + f(x) + f(x);
}
int func2(int x)
{
    return 4 * f(x);
}

Optimization Blockers P381 • Function call and side effect
int counter = 0;
int f(int x)
{
    return counter++;
}
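Putting this definition of f together with func1 and func2 from the previous slide shows why the compiler cannot merge the four calls into one. The following is a small self-contained sketch of that combination.

```c
int counter = 0;

/* f has a side effect: each call changes the global counter. */
int f(int x)
{
    (void)x;          /* the argument is unused in this example */
    return counter++;
}

/* Four calls: returns 0 + 1 + 2 + 3 when counter starts at 0. */
int func1(int x)
{
    return f(x) + f(x) + f(x) + f(x);
}

/* One call: returns 4 * 0 when counter starts at 0. */
int func2(int x)
{
    return 4 * f(x);
}
```

Starting with counter at 0, func1 returns 6 and leaves counter at 4, while func2 returns 0 and leaves counter at 1. The order in which the four calls in func1 are evaluated is not fixed by C, but the sum is the same in any order.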

5.7 Understanding Modern Processors

Modern CPU Design Figure 5.11 P396 [Block diagram: Instruction Control (Fetch Control, Instruction Cache, Instruction Decode, Retirement Unit, Register File) sends operations to Execution; Execution contains the functional units (Integer/Branch, General Integer, FP Add, FP Mult/Div, Load, Store), exchanges addresses and data with the Data Cache, and returns operation results, register updates, and the "Prediction OK?" signal]

[Figure 5.11 repeated with numbered callouts: 1) Instruction Control Unit, 2) Fetch Control, 4) Retirement Unit; callouts 3) and 5) mark other instruction control blocks (placement not recoverable) • Functional units: (1) Integer/Branch, (2) General Integer, (3) FP Add, (4) FP Mult/Div, (5) Load, (6) Store • (7) Data Cache]

Modern Processor P396 • Superscalar • Perform multiple operations on every clock cycle • Out-of-order execution • The order in which the instructions execute need not correspond to their ordering in the assembly program

Modern Processor P396 • Two main parts • Instruction Control Unit • Responsible for reading a sequence of instructions from memory • Generates from these instructions a set of primitive operations to perform on program data • Execution Unit

1) Instruction Control Unit • Instruction Cache • A special, high speed memory containing the most recently accessed instructions.

1) Instruction Control Unit • Instruction Decoding Logic • Takes actual program instructions • Converts them into a set of primitive operations • Each primitive operation performs some simple task • Simple arithmetic, load, store • addl %eax, 4(%edx) becomes three operations: load 4(%edx) → t1; addl %eax, t1 → t2; store t2, 4(%edx) • Register renaming P397 P398
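The three-operation split can be mirrored in C to make the data flow explicit. This is only an illustration, assuming a hypothetical model in which memory is an int array and `addr` stands for the word addressed by 4(%edx); it says nothing about how the hardware is built.

```c
/* Models addl %eax, 4(%edx) as the three primitive operations. */
int addl_to_memory(int eax, int *mem, int addr)
{
    int t1 = mem[addr];   /* load  4(%edx)  -> t1 */
    int t2 = eax + t1;    /* addl  %eax, t1 -> t2 */
    mem[addr] = t2;       /* store t2, 4(%edx)    */
    return t2;
}
```

The temporaries t1 and t2 correspond to the tags that connect the producer of a value to its consumer.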

2) Fetch Control • Fetch Ahead P396 • Fetches well ahead of currently accessed instructions • ICU has enough time to decode these • ICU has enough time to send decoded operations down to the EU

Fetch Control • Branch Prediction P397 • Branch taken or fall through • Guess whether branch is taken or not • Speculative Execution P397 • Fetch, decode, and execute instructions along the predicted path • Before the actual branch outcome has been determined

Multi-functional Units • Multiple Instructions Can Execute in Parallel • 1 load • 1 store • 2 integer (one may be branch) • 1 FP Addition • 1 FP Multiplication or Division

Multi-functional Units Figure 5.12 P400 • Some Instructions Take > 1 Cycle, but Can be Pipelined
Instruction                  Latency   Cycles/Issue
Load / Store                 3         1
Integer Multiply             4         1
Integer Divide               36        36
Double/Single FP Multiply    5         2
Double/Single FP Add         3         1
Double/Single FP Divide      38        38
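The two columns matter in different situations: latency governs a chain of operations that each need the previous result, while cycles/issue governs independent operations streaming through one unit. The sketch below is a simplified model with hypothetical function names; it ignores every other resource limit.

```c
/* Cycles for n operations when each one needs the previous result. */
long dependent_cycles(long n, long latency)
{
    return n * latency;
}

/* Cycles for n independent operations on one pipelined unit:
   the first result appears after `latency` cycles, and each
   further operation can start `issue` cycles after the last. */
long pipelined_cycles(long n, long latency, long issue)
{
    if (n <= 0)
        return 0;
    return latency + (n - 1) * issue;
}
```

Under this model, 100 dependent integer multiplies (latency 4) take 400 cycles, while 100 independent ones (issue 1) take 103. For integer divide, where latency and issue time are both 36, the two cases cost the same 3600 cycles because the unit is not pipelined.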

Execution Unit • Receives operations from ICU • Each cycle it may receive more than one operation • Operations are queued in buffer

Execution Unit • An operation is dispatched to one of the functional units whenever • All the operands of the operation are ready • A suitable functional unit is available • Execution results are passed among functional units • (7) Data Cache P398 • A high speed memory containing the most recently accessed data values

4) Retirement Unit P398 • Instructions need to commit in serial order • Misprediction • Exception • Updates architectural state • Memory and register values

Translation Example P401
Assembly for the loop:
.L24:                            # Loop:
    imull (%eax,%edx,4),%ecx     # t *= data[i]
    incl %edx                    # i++
    cmpl %esi,%edx               # i:length
    jl .L24                      # if < goto Loop
Operations generated for the first iteration:
    load (%eax,%edx.0,4) → t.1
    imull t.1, %ecx.0 → %ecx.1
    incl %edx.0 → %edx.1
    cmpl %esi, %edx.1 → cc.1
    jl-taken cc.1

Understanding Translation Example P401 • imull (%eax,%edx,4),%ecx becomes: load (%eax,%edx.0,4) → t.1; imull t.1, %ecx.0 → %ecx.1 • Split into two operations • Load reads from memory to generate temporary result t.1 • Multiply operation just operates on registers

Understanding Translation Example P401 • imull (%eax,%edx,4),%ecx becomes: load (%eax,%edx.0,4) → t.1; imull t.1, %ecx.0 → %ecx.1 • Operands • Register %eax does not change in the loop, so its value is retrieved from the register file during decoding

Understanding Translation Example P401 • imull (%eax,%edx,4),%ecx becomes: load (%eax,%edx.0,4) → t.1; imull t.1, %ecx.0 → %ecx.1 • Operands • Register %ecx changes on every iteration • Uniquely identify different versions as %ecx.0, %ecx.1, %ecx.2, … • Register renaming • Values passed directly from producer to consumers

Understanding Translation Example P402 • incl %edx becomes: incl %edx.0 → %edx.1 • Register %edx changes on each iteration • Renamed as %edx.0, %edx.1, %edx.2, …

Understanding Translation Example P402 • cmpl %esi,%edx becomes: cmpl %esi, %edx.1 → cc.1 • Condition codes are treated similarly to registers • Assign tag to define connection between producer and consumer

Understanding Translation Example P402 • jl .L24 becomes: jl-taken cc.1 • Instruction control unit determines destination of jump • Predicts whether the branch will be taken • Starts fetching instructions at the predicted destination

Understanding Translation Example P401 jl .L24 jl-taken cc.1 • Execution unit simply checks whether or not prediction was OK • If not, it signals instruction control • Instruction control then “invalidates” any operations generated from misfetched instructions • Begins fetching and decoding instructions at correct target

Visualizing Operations Figure 5.13 P403
    load (%eax,%edx.0,4) → t.1
    imull t.1, %ecx.0 → %ecx.1
    incl %edx.0 → %edx.1
    cmpl %esi, %edx.1 → cc.1
    jl-taken cc.1
[Figure: data-flow diagram of these operations along a vertical time axis]
• Operations • Vertical position denotes time at which executed • Cannot begin operation until operands available • Height denotes latency • Operands • Arcs shown only for operands that are passed within execution unit

Visualizing Operations Figure 5.14 P403
    load (%eax,%edx.0,4) → t.1
    addl t.1, %ecx.0 → %ecx.1
    incl %edx.0 → %edx.1
    cmpl %esi, %edx.1 → cc.1
    jl-taken cc.1
[Figure: data-flow diagram of these operations along a vertical time axis]
• Operations • Same as before, except that add has latency of 1

3 Iterations of Combining Product Figure 5.15 P404 • Unlimited Resource Analysis • Assume operation can start as soon as operands available • Operations for multiple iterations overlap in time • Performance • Limiting factor becomes latency of integer multiplier • Gives CPE of 4.0

4 Iterations of Combining Sum Figure 5.16 P405 • Unlimited Resource Analysis • Performance • Can begin a new iteration on each clock cycle • Should give CPE of 1.0 • Would require executing 4 integer operations in parallel

Combining Sum: Resource Constraints Figure 5.18 P408

Combining Sum: Resource Constraints • Only have two integer functional units • Some operations delayed even though operands available • Set priority based on program order • Performance • Sustain CPE of 2.0
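The CPE figures on these slides can be reproduced with a rough bound: the loop cannot run faster than its loop-carried dependency chain, nor faster than the integer units can absorb the integer operations of one iteration. The function below is a simplified sketch with a hypothetical name, and it assumes both loops issue four integer operations per iteration.

```c
/* Rough lower bound on cycles per element (CPE): the larger of
   the loop-carried chain latency and the throughput limit of
   the integer functional units. */
double cpe_bound(int chain_latency, int int_ops_per_iter, int int_units)
{
    double throughput = (double)int_ops_per_iter / int_units;
    return chain_latency > throughput ? chain_latency : throughput;
}
```

For the sum loop (add latency 1, four integer operations, two integer units) the bound is 2.0, limited by the functional units. For the product loop (multiply latency 4) it is 4.0, limited by the multiplier latency.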

5.9 Converting to Pointer Code

Example P413
void combine4p(vec_ptr v, int *dest)
{
    int length = vec_length(v);
    int *data = get_vec_start(v);
    int *dend = data + length;
    int x = IDENT;
    for (; data < dend; data++)
        x = x OPER *data;
    *dest = x;
}
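The array and pointer forms compute the same result; only the way the loop steps through memory differs. The sketch below shows both on a plain int array, assuming the operation is addition with identity 0. The names `sum_array` and `sum_pointer` are hypothetical.

```c
/* Array form: an index selects each element. */
int sum_array(int *data, int length)
{
    int i;
    int x = 0;
    for (i = 0; i < length; i++)
        x += data[i];
    return x;
}

/* Pointer form: the pointer itself walks from data to dend. */
int sum_pointer(int *data, int length)
{
    int *dend = data + length;   /* one past the last element */
    int x = 0;
    for (; data < dend; data++)
        x += *data;
    return x;
}
```

Which form runs faster depends on the compiler and the machine, so it is worth measuring both rather than assuming one is better.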
