
De-optimization


Presentation Transcript


  1. De-optimization: Finding the Limits of Hardware Optimization through Software De-optimization. Presented by: Derek Kern, Roqyah Alalqam, Ahmed Mehzer, Mohammed Mohammed

  2. Outline: • Introduction • Project Structure • Judging de-optimizations • What does a de-op look like? • General Areas of Focus • Instruction Fetching and Decoding • Instruction Scheduling • Instruction Type Usage (e.g. Integer vs. FP) • Branch Prediction • Conclusion

  3. What are we doing? • De-optimization? That's crazy! Why??? • In the world of hardware development, when optimizations are compared, the comparisons often concern just how fast a piece of hardware can run an algorithm • Yet, in the world of software development, the hardware is often a distant afterthought • Given this dichotomy, how relevant are these standard analyses and comparisons?

  4. What are we doing? • So, why not find out how bad it can get? • By de-optimizing software, we can see how bad algorithmic performance can be if hardware isn't considered • At a minimum, we want to be able to answer two questions: • How good a compiler writer must someone be? • How good a programmer must someone be?

  5. Our project structure • For our research project: • We have been studying instruction fetching/decoding/scheduling and branch optimization • We have been using knowledge of optimizations to design and predict de-optimizations • We have been studying the Opteron in detail

  6. Our project structure • For our implementation project: • We will choose de-optimizations to implement • We will choose algorithms that may best reflect our de-optimizations • We will implement the de-optimizations • We will report the results

  7. Judging de-optimizations (de-ops) • We need to decide on an overall metric for comparison • Whether the de-op affects scheduling, caching, branching, etc., its impact will be felt in the clocks needed to execute an algorithm • So, our metric of choice will be CPU clock cycles (a measurement sketch follows)
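Since clock cycles are the metric, the sketch below shows one way they might be counted, assuming GCC on an x86-64 machine and the __rdtsc() intrinsic from <x86intrin.h>; the loop is a placeholder workload. A real harness would also serialize (e.g., with __rdtscp()) and pin the thread to one core.

  #include <stdio.h>
  #include <x86intrin.h>

  int main(void) {
      unsigned long long start = __rdtsc();   /* read the time-stamp counter */
      volatile long sum = 0;                  /* volatile: keep the loop alive */
      for (long i = 0; i < 1000000; i++)      /* placeholder algorithm under test */
          sum += i;
      unsigned long long end = __rdtsc();
      printf("approx. clock cycles: %llu\n", end - start);
      return 0;
  }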

  8. Judging de-optimizations (de-ops) • With our metric, we can compare de-ops, but should we? • Inevitably, we will ask which de-ops had the greater impact, i.e., which caused the greatest jump in clocks. So, yes, we should • But this has to be done very carefully, since an intended de-op may not be the actual or full cause of a bump in clocks. It could be a side effect caused by the new code combination • Of course, this would still be some kind of de-op, just not the intended de-op

  9. What does a de-op look like? • Definition: A de-op is a change to an optimal implementation of an algorithm that increases the clock cycles needed to execute the algorithm and that demonstrates some interesting fact about the CPU in question • Is an infinite loop a de-op? NO. Why not? It tells us nothing about the hardware • Is a loop that executes more cycles than necessary a de-op? NO. Again, it tells us nothing about the CPU • Is a combination of instructions that causes increased branch mispredictions a de-op? YES

  10. General Areas of Focus • Given some CPU, what aspects can we optimize code for? These aspects will be our focus for de-optimization. • In general, when optimizing software, the following are the areas to focus on: • Instruction Fetching and Decoding • Instruction Scheduling • Instruction Type Usage (e.g. Integer vs. FP) • Branch Prediction • These will be our areas for de-optimization

  11. Some General Findings • In class, when we discussed dynamic scheduling, for example, our team was not sanguine about being able to truly de-optimize code • In fact, we even imagined that our result might be that CPUs are now generally so good that true de-optimization is very difficult to achieve. In principle, we still believe this • In retrospect, we should have been wiser. Just like Plato’s Forms, there is a significant, if not absolute, difference between something imagined in the abstract and its worldly representation. There can be no perfect circles in the real world • Thus, in practice, as Gita has stressed, CPU designers made choices in their designs that were driven by cost, energy consumption, aesthetics, etc.

  12. Some General Findings • These choices, when it comes time to write software for a CPU, become idiosyncrasies that must be accounted for when optimizing • For those writing optimal code, they are hassles that one must pay attention to • For our project team, these idiosyncrasies are potential "gold mines" for de-optimization • In fact, the AMD Opteron (K10 architecture) exhibits a number of idiosyncrasies. You will see some of these today

  13. Examples of idiosyncrasies • AMD Opteron (K10) • The dynamic scheduling pick window is 32 bytes long, while instructions can be 1 - 16 bytes in length. So, scheduling can be adversely affected by instruction length • The branch target buffer (BTB) can only maintain 3 branch history entries per 16 bytes • Branch indicators are aligned at odd-numbered positions within 16-byte code blocks. So, 1-byte branches, like return instructions, will be mispredicted if misaligned

  14. Examples of idiosyncrasies • Intel i7 (Nehalem) • The number of read ports for the register file is too small. This can result in stalls when reading registers • Instruction fetch/decode bandwidth is limited to 16 bytes per cycle. Instruction density can overwhelm the predecoder, which can only manage 6 instructions (per 16 bytes) per cycle

  15. Format of the de-op discussion • In the upcoming discussion of de-optimization techniques, we will present... • ...an area of the CPU that it derives from • ...some, hopefully, illuminating title • ...a general characterization of the de-op.  This characterization may apply to many different CPU architectures.  Generally, each of these represents a choice that may be made by a hardware designer • ...a specific characterization of the de-op on the AMD Opteron.  This characterization will apply only to the Opterons on Hydra

  16. So, without further ado... The De-optimizations

  17. Instruction Fetching and Decoding • Decoding Bandwidth • Execution Latency

  18. Instruction Fetching and Decoding • De-optimization #1 - Decrease Decoding Bandwidth [AMD05] Scenario #1 • Many CISC architectures offer combined load-and-execute instructions as well as the typical discrete versions • Often, using the discrete versions can decrease the instruction decoding bandwidth. Example (the combined form):
  add rax, QWORD PTR [foo]

  19. Instruction Fetching and Decoding • De-optimization #1 - Decrease Decoding Bandwidth (cont'd) In Practice #1 - The Opteron • The Opteron can decode 3 combined load-execute (LE) instructions per cycle • Using the discrete equivalents instead will allow us to decrease the decode rate. Example (discrete equivalent):
  mov rbx, QWORD PTR [foo]
  add rax, rbx

  20. Instruction Fetching and Decoding • De-optimization #1 - Decrease Decoding Bandwidth (cont'd) Scenario #2 • Use instructions with longer encodings rather than those with shorter encodings to decrease the average decode rate by decreasing the number of instructions that can fit into the L1 instruction cache • This also effectively “shrinks” the scheduling pick window • For example, use 32-bit displacements instead of 8-bit displacements, and the 2-byte opcode form instead of the 1-byte opcode form of simple integer instructions

  21. Instruction Fetching and Decoding • De-optimization #1 - Decrease Decoding Bandwidth (cont'd) In Practice #2 - The Opteron • The Opteron has short and long variants of a number of its instructions, like indirect add, for example. We can use the long variants of these instructions in order to drive down the decode rate • This will also have the effect of “shrinking” the Opteron’s 32-byte pick window for instruction scheduling. Example of long variants:
  81 C0 78 56 34 12    add eax, 12345678h   ; 32-bit immediate value
  81 C3 FB FF FF FF    add ebx, -5          ; 32-bit immediate value
  0F 84 05 00 00 00    jz  label1           ; 2-byte opcode, 32-bit immediate value

  22. Instruction Fetching and Decoding • De-optimization #1 - Decrease Decoding Bandwidth (cont'd) A balancing act • The scenarios for this de-optimization have flip sides that could make them difficult to implement • For example, scenario #1 describes using discrete load-execute instructions in order to decrease the average decode rate. However, sometimes discrete load-execute instructions are called for: • The discrete load-execute instructions can provide the scheduler with more flexibility when scheduling • In addition, on the Opteron, they consume less of the 32-byte pick window, thereby giving the scheduler more options

  23. Instruction Fetching and Decoding • De-optimization #1 - Decrease Decoding Bandwidth (cont'd) When could this happen? • This de-optimization could occur naturally when: • A compiler does a very poor job • The memory model forces long version encodings of instructions, e.g. 32-bit displacements Our prediction for implementation • We predict mixed results when trying to implement this de-optimization

  24. Instruction Fetching and Decoding • De-optimization #2 - Increase execution latency [AMD05] Scenario • CPUs often have instructions that perform almost the same operation • Yet, in spite of their seeming similarity, they can have very different latencies. By choosing the high-latency version when the low-latency version would suffice, code can be de-optimized

  25. Instruction Fetching and Decoding • De-optimization #2 - Increase execution latency In Practice - The Opteron • We can use the 16-bit LEA instruction, which is a VectorPath instruction, to reduce decode bandwidth and increase execution latency • The LOOP instruction on the Opteron has a latency of 8 cycles, while a decrement (like DEC) plus a jump (like JNZ) has a latency of less than 4 cycles • Therefore, substituting LOOP instructions for DEC/JNZ combinations will be a de-optimization (see the sketch below)
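A hedged sketch of the two loop forms, written as GCC extended inline assembly for x86-64; the function names are ours, and n must be nonzero:

  /* LOOP form: ~8-cycle latency on the Opteron (per the slide above) */
  static inline void spin_loop(unsigned long n) {
      __asm__ volatile (
          "1: loop 1b"        /* implicit: dec rcx, then jump if nonzero */
          : "+c" (n)          /* LOOP requires the counter in rcx */
          :
          : "cc");
  }

  /* DEC/JNZ form: the discrete, lower-latency equivalent */
  static inline void spin_dec_jnz(unsigned long n) {
      __asm__ volatile (
          "1: dec %0\n\t"
          "jnz 1b"
          : "+r" (n)
          :
          : "cc");
  }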

  26. Instruction Fetching and Decoding • De-optimization #2 - Increase execution latency (cont'd) When could this happen? • This de-optimization could occur if the user simply writes: float a, b; b = a / 100.0; instead of: float a, b; b = a * 0.01; Our prediction for implementation • We expect this de-op to be clearly reflected in an increase in clock cycles

  27. Instruction Scheduling • Address-Generation Interlocks • Register Pressure • Loop Re-rolling

  28. Instruction Scheduling • De-optimization #1 - Address-generation interlocks [AMD05] Scenario • Scheduling loads and stores whose addresses require a long dependency chain to compute ahead of loads and stores whose addresses can be calculated quickly can create address-generation interlocks. Example:
  add ebx, ecx                  ; Instruction 1
  mov eax, DWORD PTR [10h]      ; Instruction 2
  mov edx, DWORD PTR [24h]      ; Place load above instruction 3 to avoid AGI stall
  mov ecx, DWORD PTR [eax+ebx]  ; Instruction 3

  29. Instruction Scheduling • De-optimization #1 - Address-generation interlocks (cont'd) In Practice - The Opteron • The processor schedules instructions that access the data cache (loads and stores) in program order • By choosing the order of loads and stores carelessly, we can induce address-generation interlocks. Example:
  add ebx, ecx                  ; Instruction 1
  mov eax, DWORD PTR [10h]      ; Instruction 2 (fast address calc.)
  mov ecx, DWORD PTR [eax+ebx]  ; Instruction 3 (slow address calc.)
  mov edx, DWORD PTR [24h]      ; This load is stalled from accessing the data
                                ; cache due to the long latency caused by
                                ; generating the address for instruction 3

  30. Instruction Scheduling • De-optimization #1 - Address-generation interlocks (cont'd) When could this happen? • This happens when a load or store whose address depends on a long dependency chain is scheduled ahead of one whose address can be calculated quickly Our prediction for implementation: • We expect an increase in the number of clock cycles from this de-optimization technique

  31. Instruction Scheduling • De-optimization #2 - Increase register pressure [AMD05] Scenario • Instead of pushing memory data directly onto the stack, first load it into a register and then push the register; this increases register pressure and creates data dependencies In Practice - The Opteron • Write code that first loads the memory data into a register and then pushes it onto the stack. Example:
  Instead of:   push mem
  Use:          mov rax, mem
                push rax

  32. Instruction Scheduling • De-optimization #2 - Increase register pressure When could this happen? • This could take place when code loads memory data into a register and then pushes the register, rather than pushing the memory operand directly Our prediction for implementation: • We expect performance to suffer as register pressure increases

  33. Instruction Scheduling • De-optimization #3 - Loop Re-rolling Scenario • Loops not only affect branch prediction. They can also affect dynamic scheduling • How? • Let instructions 1 and 2 be within loops A and B, respectively. 1 and 2 could be part of a unified loop. If they were, then they could be scheduled together. Yet, they are separate and cannot be In Practice - The Opteron • Given that the Opteron is 3-way superscalar, this de-optimization could significantly reduce IPC

  34. Instruction Scheduling • De-optimization #3 - Loop Re-rolling When could this happen? • Easily. In C, this would be two consecutive loops, each containing one or more instructions, such that the loops could be combined Our prediction for implementation • We expect this de-op to be clearly reflected in an increase in clock cycles Example (sketched below in C):
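Since the slide's example did not survive the transcript, here is a minimal C sketch of what loop re-rolling might look like; the function names and arrays are hypothetical:

  /* Fused form: the two statements can be scheduled together */
  void fused(int *a, int *b, int *c, int n) {
      for (int i = 0; i < n; i++) {
          a[i] = b[i] + 1;    /* instruction stream 1 */
          c[i] = b[i] * 2;    /* instruction stream 2 */
      }
  }

  /* Re-rolled (de-optimized) form: the bodies can no longer be scheduled
     together, and the loop overhead (compare/branch) is paid twice */
  void rerolled(int *a, int *b, int *c, int n) {
      for (int i = 0; i < n; i++)
          a[i] = b[i] + 1;
      for (int i = 0; i < n; i++)
          c[i] = b[i] * 2;
  }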

  35. Instruction Type Usage • Store-to-Load Dependency • Costly Instructions

  36. Instruction Type Usage • De-optimization #1 – Store-to-load dependency Scenario • A store-to-load dependency takes place when stored data needs to be used shortly after it is stored • This pattern is common • This type of dependency increases the pressure on the load-store unit and might cause the CPU to stall, especially when it occurs frequently. Example (sketched below in C):
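The slide's example is missing from the transcript; the following is a minimal C sketch of a store-to-load dependency (the function name and signature are ours):

  /* The load of *p immediately follows the store to *p, so it must wait on
     (or forward from) the in-flight store; 'volatile' keeps the compiler
     from optimizing the reload away. */
  int store_then_load(volatile int *p, int x) {
      *p = x + 1;       /* store */
      return *p * 2;    /* dependent load issued right after the store */
  }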

  37. Instruction Type Usage • De-optimization #1 – Store-to-load dependency When could this happen? • Whenever code loads data shortly after storing it Our prediction for implementation: • We expect this de-optimization to lower the performance of the load-store unit

  38. Instruction Type Usage • De-optimization #2 – Using equivalent, more costly instructions Scenario • Some instructions can do the same job but at a higher cost in terms of the number of cycles In Practice - The Opteron • Integer division on the Opteron costs 22-47 cycles for signed and 17-41 cycles for unsigned operands • Multiplication, by contrast, takes only 3-8 cycles, whether signed or unsigned

  39. Instruction Type Usage • De-optimization #2 – Using equivalent, more costly instructions (cont'd) When could this happen? • Wherever code mixes integer division and multiplication Example:
  int i, j, k, m;
  (a) m = i / j / k;
  (b) m = i / (j * k);
Our prediction for implementation: • We expect this de-optimization to significantly increase the number of cycles

  40. Branch Prediction • Branch Density • Branch Patterns • Non-predictable Instructions

  41. Branch Prediction • De-optimization #1 - Branch Density Scenario
  Compare R1, 10
  Jump-if-equal handle_10
  Jump-if-less-than handle_lt_10
  Call set_up_for_gt10
• There are 3 consecutive branch instructions that must be predicted • Whether or not a bubble is created depends upon the hardware • However, at some point, the hardware can only predict so much and pre-load so much code • This de-optimization attempts to overwhelm the CPU's ability to predict branches

  42. Branch Prediction • De-optimization #1 - Branch Density (cont'd) In Practice - The Opteron Scenario
  DEC R1
  JZ handle_n1
  DEC R1
  JZ handle_n2
  DEC R1
  JZ handle_n3
  DEC R1
  JZ handle_n4
• Most branch instructions are two bytes long • These 8 instructions can take up as little as 16 bytes on an Opteron

  43. Branch Prediction
  401399: 8b 44 24 10    mov    0x10(%esp),%eax
  40139d: 48             dec    %eax
  40139e: 74 7a          je     40141a <_mod_ten_counter+0x8a>
  4013a0: 8b 0f          mov    (%edi),%ecx
  4013a2: 74 1b          je     4013bf <_mod_ten_counter+0x2f>
  4013a4: 49             dec    %ecx
  4013a5: 74 1f          je     4013c6 <_mod_ten_counter+0x36>
  4013a7: 49             dec    %ecx
  4013a8: 74 25          je     4013cf <_mod_ten_counter+0x3f>
  4013aa: 49             dec    %ecx
  4013ab: 74 2b          je     4013d8 <_mod_ten_counter+0x48>
  4013ad: 49             dec    %ecx
  4013ae: 74 31          je     4013e1 <_mod_ten_counter+0x51>
  4013b0: 49             dec    %ecx
  4013b1: 74 37          je     4013ea <_mod_ten_counter+0x5a>
  4013b3: 49             dec    %ecx
  4013b4: 74 3d          je     4013f3 <_mod_ten_counter+0x63>
  4013b6: 49             dec    %ecx
  4013b7: 74 43          je     4013fc <_mod_ten_counter+0x6c>
  4013b9: 49             dec    %ecx
  4013ba: 74 49          je     401405 <_mod_ten_counter+0x75>
  4013bc: 49             dec    %ecx
  4013bd: 74 4f          je     40140e <_mod_ten_counter+0x7e>

  44. Branch Prediction • De-optimization #1 - Branch Density (cont'd) In Practice - The Opteron • However, the Opteron's BTB (Branch Target Buffer) can only maintain 3 (used) branch entries per (aligned) 16 bytes of code [AMD05] • Thus, the Opteron cannot successfully maintain predictions for all of the branches within the previous sequence of instructions • Why? There are 9 branch indicators, associated with bytes 0, 1, 3, 5, 7, 9, 11, 13, and 15, but only 3 branch selectors

  45. Branch Prediction • De-optimization #1 - Branch Density (cont'd) When could this happen? • Having dense branches is not that unusual. Most compilers translate case/switch statements into a comparison chain, which is implemented as a dec/jz sequence Our prediction for implementation • By packing branches densely, we expect the branch prediction unit to produce more mispredictions
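For context, here is a hypothetical C reconstruction of a function like the _mod_ten_counter in the disassembly on slide 43; a compiler may lower such a switch to exactly the kind of dense dec/jz chain shown there:

  int mod_ten_counter(int counter) {
      switch (counter) {     /* each case may become a dec/jz pair */
      case 0: return 1;
      case 1: return 2;
      case 2: return 3;
      case 3: return 4;
      case 4: return 5;
      case 5: return 6;
      case 6: return 7;
      case 7: return 8;
      case 8: return 9;
      case 9: return 0;
      default: return -1;
      }
  }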

  46. Branch Prediction • De-optimization #2 - Branch Patterns Scenario Consider the following algorithm: Algorithm Even-Number-Sieve • Input: An array of random numbers • Output: An array of numbers where the odd numbers have been replaced with zero • Even-Number-Sieve must have a branch within it that depends upon whether the current array entry is even or odd • Given an even probability distribution, there will be no pattern that can be selected that will yield better than 50% success
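A minimal C sketch of Even-Number-Sieve as described above (the name and signature are assumptions): with uniformly random input, the parity test is taken about half the time with no exploitable pattern:

  void even_number_sieve(int *a, int n) {
      for (int i = 0; i < n; i++)
          if (a[i] & 1)     /* odd entry: unpredictable branch */
              a[i] = 0;     /* replace odd numbers with zero */
  }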

  47. Branch Prediction • De-optimization #2 - Branch Patterns • Let the word “parity” refer to a branch that has an even chance of being taken as of not being taken • The odd/even branch within Even-Number-Sieve has parity • Furthermore, it has no simple pattern that can be predicted • Yet, data need not be random. All we need is a pattern whose repetition outstrips the hardware bits used to predict it • In fact, given the right pattern, branch prediction can be forced to perform with a success rate well below 50%
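As a sketch of such a pattern, the deterministic branch below repeats with period 37, a figure we assume exceeds the predictor's history length; the bit sequence itself is arbitrary:

  #define PERIOD 37
  static const unsigned char pattern[PERIOD] = {
      1,0,1,1,0,0,1,1,1,0, 0,0,1,0,1,1,0,1,0,0,
      1,1,1,0,0,0,1,0,1,1, 0,1,0,0,1,1,0
  };

  int pattern_sum(const int *a, int n) {
      int sum = 0;
      for (int i = 0; i < n; i++)
          if (pattern[i % PERIOD])   /* taken/not-taken repeats every 37 iterations */
              sum += a[i];
      return sum;
  }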

  48. Branch Prediction • De-optimization #3 - Unpredictable Instructions Scenario • Some CPUs restrict how many branch instructions can be predicted within a certain number of bytes • If this limit is exceeded, or if branch instructions are not aligned properly, then branches cannot be predicted • Misprediction can also take place with undesirable usage of recursion [AMD05] • Far control transfers are usually mispredicted across different types of architecture

  49. Branch Prediction • De-optimization #3 - Unpredictable Instructions In Practice - The Opteron • A RET instruction may take up only one byte • If a branch instruction immediately precedes a one-byte RET instruction, then the RET cannot be predicted • For example, a one-byte RET instruction can cause a misprediction even if there is only one branch instruction per 16 bytes

  50. Branch Prediction • De-optimization #3 - Unpredictable Instructions When could this happen? • When branch instructions are not aligned properly, then branches cannot be predicted • A branch ending at an even address, followed by a single-byte return instruction, will cause a conflict over the branch selector and will cause a misprediction most of the time
