In-Order vs. Out-of-Order

In-Order vs. Out-of-Order • Topics • Branch prediction (wrap-up) • Course schedule and planning • Technology overview • Hardware vs. Compiler • In-Order vs. Out-of-Order • Precise exception handling

Pentium II- Branch Prediction • Two-Level Scheme • Yeh & Patt, ISCA ‘93 • Keep shift register showing past k outcomes for branch • Use to index 2k entry table • Each entry provides 2-bit, saturating counter predictor • Very effective for any deterministic branching pattern Microprocessor Report March 27, 1995

21264 Branch Prediction Logic • Purpose: Predict whether or not branch taken • 35Kb of prediction information • 8% of total die size • Claim 0.7--1.0% misprediction

Future Directions

Course Project • Start thinking about project ideas • Survey some aspect of computer architecture • Office hours: Monday and Wednesday 3-5pm, or other • Find group members (encourage 3 per group) • Project abstract (due Sept 26) • Project proposal (due Oct 24) • Project report (due Dec 7) • Project status (due Nov 21) • Project presentations (due Dec 12 and Dec 14)

The Product • Computers are becoming ubiquitous • Mainframes -> Workstations -> Desktops -> Laptop • Handheld-> Worn ->Implanted • Key metrics for evaluating technological progress: • Power • Size • Performance • Cost • Reliability

Compiler: Software product division Developed after ISA is defined Computer scientists Hardware: Hardware product division Development defines ISA Computer engineers The Artificial Partition • Algorithm | Language | Compiler • | ISA | • Architecture | Microarchitecture | Circuit | Physical • What are the power, size, performance, cost, and reliability reasons for this partition

Instruction-Level Parallelism (ILP) • Instruction-Level Parallelism (ILP), the concurrent execution of independent assembly instructions, is a cost effective way to extract performance from programs • Hardware scheduling (and speculation) have had diminishing returns • Processors such as Pentium III, DEC Alpha 21264 can execute 3 to 6 instructions per cycle, but generally sustain less than 2 instructions per cycle • Reasons: • Scheduling instruction window is small • Code transformations are not possible • Run-time scheduling adds complexity and latency to instruction execution • Cache and branch penalties

Compiler: Hardware: Technique Overview • Instruction scheduling:

Compiler: Hardware: Technique Overview • Branch resolution:

Hardware vs. Compiler • In ILP extraction, Branch Prediction, the compiler and hardware should each do what they do best • Hardware: • Architecture can see patterns in input data, cache or branch prediction behavior • Less is more: Power, Cost, Performance, Size, Reliability • Reliability: difficult to fix after deployment (FDIV Bug) • Compiler: • Compiler can transform code significantly • Compiler can see entire program • Do the slow and/or complex analysis run once • Compiler cannot deal with input variation • Reliability: can be fixed after deployment (gcc 2.8.2)

Superscalar Terminology • Basic Superscalar Able to issue > 1 instruction / cycle Superpipelined Deep, but not superscalar pipeline. E.g., MIPS R5000 has 8 stages • Advanced Out-of-order Able to issue instructions out of program order Speculation Execute instructions beyond branch points, possibly nullifying later Register renaming Able to dynamically assign physical registers to instructions Retire unit Logic to keep track of instructions as they complete.

Superscalar Processors

Data Flow $f2 $f4 $f6 y + + v Critical Path = 9 cycles $f4 $f8 w * z + z x + $f10 $f12 Superscalar Execution Example • Assumptions • Single FP adder takes 2 cycles • Single FP multipler takes 5 cycles • Can issue add & multiply together • Must issue in-order (Single adder, data dependence) (In order) v: addt $f2, $f4, $f10 w: mult $f10, $f6, $f10 x: addt $f10, $f8, $f12 y: addt $f4, $f6, $f4 z: addt $f4, $f8, $f10 v w x (inorder) y z

v w x y z Adding Advanced Features • Out Of Order Issue • Can start y as soon as adder available • Must hold back z until $f10 not busy & adder available • With Register Renaming v: addt $f2, $f4, $f10 w: mult $f10, $f6, $f10 x: addt $f10, $f8, $f12 y: addt $f4, $f6, $f4 z: addt $f4, $f8, $f10 v w x y z v: addt $f2, $f4, $f10a w: mult $f10a, $f6, $f10a x: addt $f10a, $f8, $f12 y: addt $f4, $f6, $f4 z: addt $f4, $f8, $f10

Issue Criteria • Instruction availability (from I-Cache) • Operand availability (dataflow issues) • Functional Unit Availability (Int/Float/Mem/other) • Writeback availability (Register File Ports)

Read-after-Write (RAW) Dependences • Also known as a “true” dependence • Example: S1: addq r1, r2, r3 S2: addq r3, r4, r4 • How to optimize? • cannot be optimized away

Write-after-Read (WAR) Dependences • Also known as an “anti” dependence • Example: S1: addq r1, r2, r3 S2: addq r4, r5, r1 ... addq r1, r6, r7 • How to optimize? • rename dependent register (e.g., r1 in S2 -> r8) S1: addq r1, r2, r3 S2: addq r4, r5, r8 ... addq r8, r6, r7

Write-after-Write (WAW) Dependences • Also known as an “output” dependence • Example: S1: addq r1, r2, r3 S2: addq r4, r5, r3 ... addq r3, r6, r7 • How to optimize? • rename dependent register (e.g., r3 in S2 -> r8) S1: addq r1, r2, r3 S2: addq r4, r5, r8 ... addq r8, r6, r7

Scoreboarding to Detect RAW Hazards • Associate “presence” bit with each register • Mark bit empty when instruction that writes register issues • Mark full when instruction completes (or gets far enough for bypassing to provide result) • Instruction can’t issue if any of its source registers are empty

Scoreboarding Example

Pipelining for WAR Hazards (In-Order Issue)

WAW Hazards a Problem With Uneven Pipelines

Brain-Dead WAW Hazard Resolution

A Better Idea – Register Renaming (Reservation Stations) • Insight: WAR and WAW hazards occur because architecture has a limited number of registers. • Solution: build additional registers and rename in hardware to break dependencies • More useful in out-of-order processors.

ld r2<- r3 ld hw1<- r3 sub r7<- r2, r9 sub r7 <-hw1, r9 add r2<- r1, r4 add hw2<- r1, r4 IF Breaks dependence ID ID add no longer has to wait for EX EX ld to complete EX WB MEM WB

Tomasulo Algorithm

Reservation Stations • Contain source, dest, and opcode for an operation • Source, dest may be value, load buffer, store buffer, reservation station -- renaming Ex: r3 = 25 add r2<- r3, r4 r4 = 37 sub r5<- r6, r2 r6 = 41 mul r2<- r7, r8 r7 = 1 r8 = 52 op dest source1 source2 3 mul r2 1 52 2 sub r5 41 res1dest 1 add res2s2 25 37

General Principles • Must be Able to Flush Partially-Executed Instructions • Branch mispredictions • Earlier instruction generates exception • Special Treatment of “Architectural State” • Programmer-visible registers • Memory locations • Don’t do actual update until certain instruction should be executed • Emulate “Data Flow” Execution Model • Instruction can execute whenever operands available

PentiumII Block Diagram Microprocessor Report

Dispatching Actions • Generate Entry in Retirement Buffer (Reorder buffer) • Buffer tracking instructions currently “in flight” • Dispatched but not yet completed • Circular buffer in program order • Instruction tagged with branches they depend on • Easy to flush if mispredicted • Assign Rename Register as Target • Additional registers used as targets for in-flight instructions • Instruction updates this register • Update of actual architectural register occurs only when instruction retired

Reorder Buffer • When instruction issues, reserve entry in reorder buffer • Place results in reorder buffer out-of-order as instructions complete. • Move results from reorder buffer to register file in order.

PentiumII Block Diagram Microprocessor Report

Hazard Handling with Renaming • Dispatch Unit Maintains Mapping • From register ID to actual register • Could be the actual architectural register • Not target of currently-executing instruction • Could be rename register • Perhaps already written by instruction that has not been retired • E.g., still waiting for confirmation of branch prediction • Perhaps instruction result not yet computed • Grab later when available • Hazards • RAW: Mapping identifies operand source • WAR: Write will be to different rename register • WAW: Writes will be to different rename register

Moving Instructions Around • Reservation Stations • Buffers associated with execution units • Hold instructions prior to execution • Plus those operands that are available • May be waiting for one or more operands • Operand mapped to rename register that is not yet available • May be waiting for unit to be available • Completion Busses • Results generated by execution units • Tagged by rename register ID • Monitored by reservation stations • So they can get needed operands • Effectively implements bypassing • Supply results to completion unit

Retiring Instructions • Retire in Program Order • When instruction is at head of buffer • Up to 4 per cycle • Enable change of architectural state • Transfer from rename register to architectural • Free rename register for use by another instruction • Allow pending store operation to take place • Flush if Should not be Executed • Tagged by branch that was mispredicted • Follows instruction that raised exception • As if instructions had never been fetched

Problems With Out-of-Order • Exceptions/interrupts -- Want to provide “precise” exceptions • “Precise exceptions/interrupts” -- system state when interrupt taken corresponds to execution of the program on an in-order processor • Architectural process state (context): memory space contents, process register contents, processor status • Sequential Architecture Model (SAM): program counter sequences through instructions, finishes one before starting another

Problems With Out-of-Order • Consistent State (CS): An architecture state for which a dynamic program counter can be found: • All instructions before this program counter are done • No instruction at and after this program counter is processed • Properties of CS: • A strictly sequential architecture implementation model brings the process from one CS to another • A CS can be activated without any extra implementation state to resume execution • Is possible but... architecture generation and operating system coordination

Exceptions • An exception is a transfer of control to the OS in response to some event (i.e. change in processor state) User Process Operating System event exception exception processing by exception handler exception return (optional)

Exception: Fault and Trap • Conditions: • Not expected by routine programming logic • Have to be handled before resuming execution • Fault: the violating instruction should be retired after exception handling • Trap: the execution resumes after the violating instruction (like calls) • Precise State (PS) for an exception: • Fault PS: CS corresponds to dynamic PC just before the faulting instruction • Trap PS: CS corresponds to dynamic PC just after the trapping instruction

CPU Exceptions • Internal exceptions occur as a result of events generated by executing instructions • -Errors during instruction execution • arithmetic overflow, address error, parity error, undefined instruction • -Events that require OS intervention • virtual memory page fault • External exceptions occur as a result of events generated by devices external to the processor • -I/O interrupts • -Hard reset interrupt • -Soft reset interrupt

User Process A ? C Precise vs. Imprecise Exceptions • In the Alpha architecture: • arithmetic exceptions may be imprecise (similar to the CRAY-1) • motivation: simplifies pipeline design, helping to increase performance • all other exceptions are precise • Imprecise exceptions: • all instructions before the excepting instruction complete • the excepting instruction and instructions after it may or may not be complete • What if precise exceptions are needed? • insert a TRAPB (trap barrier) instruction immediately after • stalls until certain that no earlier insts take exceptions

Why Care About Precise Interrupts? • Want to be able to resume program execution after page faults, I/O interrupts • Debugging -- want to present CS to user after faults • Machine emulation -- some architectures simulate opcodes in software after taking unimplemented instruction exception • Basic Idea -- Issue instructions out-of-order, but provide way to construct Consistent State (CS) when exceptions occur • Out-of-order issue with in-order completion -- buffer out-of-order results and writeback memory&registers in order • Rollback -- allow operations to complete out-of-order but keep history information to undo out-of-order ops if exception occurs

Reorder Buffer • Requires complex bypassing to allow instructions to use values in reorder buffer as inputs.

History Buffer • Results go into register file out-of-order, but keep old state in history buffer until all previous ops done • Requires hardware to “roll back” history buffer on exception

Future File • Keep two register files • “Architectural File” has in-order results • “Future File” has out-of-order results for use as operands • Use Reorder Buffer to keep Architectural File in-order

Comparison • Reorder Buffer -- requires complex bypass paths, which may increase register read latency • History Buffer -- requires rollback logic to put old results back in register file on exceptions • Future file -- Fast, no bypass paths, simple rollback, but duplicates register file

In-Order vs. Out-of-Order

In-Order vs. Out-of-Order

Presentation Transcript

Batting Out of Order

Higher Order vs. Lower Order Concerns

Order out of Chaos

Chaos vs. Order

Batting Out of Order

Out-of-Order Execution Structures

First Order vs Second Order Transitions in Quantum Magnets

Order vs. Chaos

Out-of-Order Commit Processors

Out-of-Order Execution

Order vs. Chaos

Out-of-Order Speculative Execution

Is Out-Of-Order Out Of Date ?

Lecture: Out-of-order Processors

Batting Out of Order

Lecture: Out-of-order Processors

Lecture: Out-of-order Processors

In-order vs. Out-of-order Execution

Out-of-Order Speculative Execution