This lecture covers the fundamentals of implementing an Out-of-Order Core, focusing on concepts like register renaming, commit, Load/Store Queue (LSQ), and issue queue. The Alpha 21264 Out-of-Order Implementation is discussed, along with topics such as branch prediction, instruction fetching, and reorder buffer. Examples and scenarios are provided to illustrate the core concepts discussed. The lecture highlights the importance of speculative register mapping, memory dependence checking, and out-of-order execution for efficient core design in processors.
 
                
Lecture 18: Core Design
• Today: basics of implementing a correct out-of-order (ooo) core:
  • register renaming, commit, LSQ, issue queue
The Alpha 21264 Out-of-Order Implementation
[Figure: block diagram of the out-of-order pipeline; labels and contents as follows]
• Branch prediction and instr fetch
• Instr Fetch Queue: R1 ← R1+R2; R2 ← R1+R3; BEQZ R2; R3 ← R1+R2; R1 ← R3+R2
• Decode & Rename
• Issue Queue (IQ): P33 ← P1+P2; P34 ← P33+P3; BEQZ P34; P35 ← P33+P34; P36 ← P35+P34
• Reorder Buffer (ROB): Instr 1, Instr 2, Instr 3, Instr 4, Instr 5, Instr 6
• Committed Reg Map: R1→P1, R2→P2
• Speculative Reg Map: R1→P36, R2→P34
• Register File: P1-P64
• ALU, ALU, ALU: results written to regfile and tags broadcast to IQ
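Rename
Before renaming:
  A  lr1 ← lr2 + lr3
  B  lr2 ← lr4 + lr5
  C  lr6 ← lr1 + lr3
  D  lr6 ← lr1 + lr2
  Dependences: RAR lr3, RAW lr1, WAR lr2, WAW lr6
  Schedule: A ; BC ; D
After renaming:
  A  pr7 ← pr2 + pr3
  B  pr8 ← pr4 + pr5
  C  pr9 ← pr7 + pr3
  D  pr10 ← pr7 + pr8
  Dependences: RAR pr3, RAW pr7, WAR x, WAW x (x: removed by renaming)
  Schedule: AB ; CD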
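The renaming step on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the 21264's actual logic: it assumes that lrN initially maps to prN and that new destinations are handed out sequentially starting at pr7, which is what the slide's numbers suggest. The helper names `rename` and `name_hazards` are made up for this example.

```python
def rename(instrs, num_logical=6):
    """Rename (dest, src1, src2) triples from logical to physical registers.

    Assumption: lrN starts out mapped to prN, and fresh physical registers
    are allocated in order starting at pr(num_logical + 1).
    """
    reg_map = {f"lr{i}": f"pr{i}" for i in range(1, num_logical + 1)}
    next_free = num_logical + 1
    renamed = []
    for dest, src1, src2 in instrs:
        # Read the source mappings before updating the destination mapping,
        # so an instruction that reads and writes the same logical register
        # still sees the older value.
        p1, p2 = reg_map[src1], reg_map[src2]
        pd = f"pr{next_free}"
        next_free += 1
        reg_map[dest] = pd
        renamed.append((pd, p1, p2))
    return renamed


def name_hazards(instrs):
    """List WAR and WAW pairs as (kind, earlier index, later index, register)."""
    found = []
    for j in range(len(instrs)):
        dest_j = instrs[j][0]
        for i in range(j):
            dest_i, *srcs_i = instrs[i]
            if dest_j in srcs_i:
                found.append(("WAR", i, j, dest_j))
            if dest_j == dest_i:
                found.append(("WAW", i, j, dest_j))
    return found


program = [
    ("lr1", "lr2", "lr3"),  # A
    ("lr2", "lr4", "lr5"),  # B
    ("lr6", "lr1", "lr3"),  # C
    ("lr6", "lr1", "lr2"),  # D
]
renamed = rename(program)
hazards_before = name_hazards(program)
hazards_after = name_hazards(renamed)
```

Here `hazards_before` holds the WAR on lr2 and the WAW on lr6, while `hazards_after` is empty because every renamed destination is a fresh register. Only the true (RAW) dependences remain to constrain the schedule.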
Commit Example
Assume a processor with 6 logical regs and 10 physical regs

Instr  Original           Renamed              Map   Old   New
A      lr1 ← lr2 + lr3    pr7 ← pr2 + pr3      lr1   pr1   pr7
B      lr2 ← lr4 + lr5    pr8 ← pr4 + pr5      lr2   pr2   pr8
C      lr6 ← lr1 + lr3    pr9 ← pr7 + pr3      lr6   pr6   pr9
D      lr6 ← lr1 + lr2    pr10 ← pr7 + pr8     lr6   pr9   pr10
E      lr3 ← lr6 + lr2    pr1 ← pr10 + pr8     lr3   pr3   pr1
F      lr4 ← lr3 + lr4    pr2 ← pr1 + pr4      lr4   pr4   pr2
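The commit example can be replayed with a small model of the speculative map, free list, and ROB. This is a sketch under stated assumptions: rename stalls when no physical register is free, and committing an instruction returns the old mapping of its destination to the free list. That is the policy the slide's numbers imply, since E reuses pr1 and F reuses pr2. The `Renamer` class and its method names are hypothetical.

```python
from collections import deque


class Renamer:
    """Toy rename/commit model: 6 logical and 10 physical registers."""

    def __init__(self, num_logical=6, num_physical=10):
        # Assumption: lrN starts out mapped to prN; the rest are free.
        self.spec_map = {f"lr{i}": f"pr{i}" for i in range(1, num_logical + 1)}
        self.free = deque(
            f"pr{i}" for i in range(num_logical + 1, num_physical + 1)
        )
        self.rob = deque()  # in-flight entries: (logical dest, old, new)
        self.log = []       # every (logical dest, old, new) ever renamed

    def rename(self, dest, src1, src2):
        """Return the renamed triple, or None if rename must stall."""
        if not self.free:
            return None
        p1, p2 = self.spec_map[src1], self.spec_map[src2]
        new = self.free.popleft()
        old = self.spec_map[dest]
        self.spec_map[dest] = new
        self.rob.append((dest, old, new))
        self.log.append((dest, old, new))
        return (new, p1, p2)

    def commit(self):
        """Commit the oldest instruction and free its old mapping."""
        _dest, old, _new = self.rob.popleft()
        self.free.append(old)
        return old


program = {
    "A": ("lr1", "lr2", "lr3"),
    "B": ("lr2", "lr4", "lr5"),
    "C": ("lr6", "lr1", "lr3"),
    "D": ("lr6", "lr1", "lr2"),
    "E": ("lr3", "lr6", "lr2"),
    "F": ("lr4", "lr3", "lr4"),
}

r = Renamer()
for name in "ABCD":
    r.rename(*program[name])         # consumes pr7..pr10
stalled = r.rename(*program["E"])    # free list is empty, so E stalls
freed_by_a = r.commit()              # A commits and releases pr1
e_renamed = r.rename(*program["E"])  # E now gets pr1
freed_by_b = r.commit()              # B commits and releases pr2
f_renamed = r.rename(*program["F"])  # F now gets pr2
```

After this run, `r.log` reproduces the Map / Old / New columns of the slide row by row.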
Out-of-Order Loads/Stores
  Ld R1 ← [R2]
  Ld R3 ← [R4]
  St R5 → [R6]
  Ld R7 ← [R8]
  Ld R9 ← [R10]
Memory Dependence Checking
• The issue queue checks for register dependences and executes instructions as soon as registers are ready
• Loads/stores access memory as well; must check for RAW, WAW, and WAR hazards for memory as well
• Hence, first check for register dependences to compute effective addresses; then check for memory dependences
[Figure: queue of loads/stores, as extracted: Ld 0xabcdef; Ld; St; Ld; Ld 0xabcdef; St 0xabcd00; Ld 0xabc000; Ld 0xabcd00]
Memory Dependence Checking
• Load and store addresses are maintained in program order in the Load/Store Queue (LSQ)
• Loads can issue if they are guaranteed to not have true dependences with earlier stores
• Stores can issue only if we are ready to modify memory (cannot recover if an earlier instr raises an exception)
[Figure: queue of loads/stores, as extracted: Ld 0xabcdef; Ld; St; Ld; Ld 0xabcdef; St 0xabcd00; Ld 0xabc000; Ld 0xabcd00]
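The load-issue rule above can be illustrated with a conservative check over an LSQ held in program order. This is only a sketch. It assumes that an entry without an address has not computed one yet, that a load must wait while any earlier store address is unknown, and that addresses conflict only on an exact match (access sizes and store-to-load forwarding are not modeled). The queue contents follow the slide's figure as extracted; the resolved store address `0xabc100` is invented for the example.

```python
def issuable_loads(lsq):
    """Return indices of loads that may issue.

    lsq is a list of (kind, addr) in program order, where kind is "Ld" or
    "St" and addr is an int, or None if not yet computed.
    """
    ready = []
    earlier_stores = []  # addresses of all earlier stores (None if unknown)
    for idx, (kind, addr) in enumerate(lsq):
        if kind == "St":
            earlier_stores.append(addr)
        elif addr is not None:
            # Safe only if every earlier store has a known address
            # that differs from this load's address.
            if all(s is not None and s != addr for s in earlier_stores):
                ready.append(idx)
    return ready


lsq = [
    ("Ld", 0xABCDEF),
    ("Ld", None),
    ("St", None),
    ("Ld", None),
    ("Ld", 0xABCDEF),
    ("St", 0xABCD00),
    ("Ld", 0xABC000),
    ("Ld", 0xABCD00),
]
before = issuable_loads(lsq)

# Hypothetical: the first store's address resolves to 0xabc100.
lsq[2] = ("St", 0xABC100)
after = issuable_loads(lsq)
```

Before the store address is known, only the first load can go. Once it resolves, the loads at indices 4 and 6 become safe, while the last load still conflicts with the store to 0xabcd00.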
The Alpha 21264 Out-of-Order Implementation
[Figure: block diagram of the out-of-order pipeline including loads/stores; labels and contents as follows]
• Branch prediction and instr fetch
• Instr Fetch Queue: R1 ← R1+R2; R2 ← R1+R3; BEQZ R2; R3 ← R1+R2; R1 ← R3+R2; LD R4 ← 8[R3]; ST R4 → 8[R1]
• Decode & Rename
• Issue Queue (IQ): P33 ← P1+P2; P34 ← P33+P3; BEQZ P34; P35 ← P33+P34; P36 ← P35+P34; P37 ← 8[P35]; P37 → 8[P36]
• Reorder Buffer (ROB): Instr 1, Instr 2, Instr 3, Instr 4, Instr 5, Instr 6, Instr 7
• Committed Reg Map: R1→P1, R2→P2
• Speculative Reg Map: R1→P36, R2→P34
• Register File: P1-P64
• ALU, ALU, ALU: results written to regfile and tags broadcast to IQ
• ALU, LSQ, D-Cache: P37 ← [P35 + 8]; P37 → [P36 + 8]
Speculative Issue
• Instr I1 leaves the issue queue at the start of cycle 6; the instr then reads operands from the regfile, wires are traversed, the instruction executes, and the result is available at the end of cycle 8
• If operand availability is broadcast to the issue queue in cycle 9, the dependent instruction leaves in cycle 10
• This causes a 4-cycle gap between successive instrs
• Hence, if we know that the instruction takes a cycle to execute, the operand is broadcast to the issue queue in cycle 6 and the dependent instr leaves the issue queue in cycle 7; the input operand is correctly bypassed at the FU
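The cycle arithmetic on this slide can be written out as a tiny function. It is a sketch of the timing only, using the slide's numbers; the function name and the generalization to `exec_latency` greater than 1 (broadcast delayed by one cycle per extra execute cycle) are assumptions, not something the slide states.

```python
def dependent_issue_cycle(producer_issue, result_ready, speculative,
                          exec_latency=1):
    """Cycle in which a dependent instruction leaves the issue queue.

    producer_issue: cycle in which the producer leaves the issue queue
    result_ready:   cycle at whose end the producer's result is available
    speculative:    True if wakeup is broadcast based on known latency
    """
    if speculative:
        # Broadcast early so the consumer arrives at the FU just as the
        # result can be bypassed to it.
        broadcast = producer_issue + exec_latency - 1
    else:
        # Broadcast only after the result actually exists.
        broadcast = result_ready + 1
    return broadcast + 1


without_spec = dependent_issue_cycle(6, 8, speculative=False)
with_spec = dependent_issue_cycle(6, 8, speculative=True)
gap_without = without_spec - 6
gap_with = with_spec - 6
```

With the slide's numbers this gives a dependent issue in cycle 10 (a 4-cycle gap) without speculation and in cycle 7 (back-to-back issue) with it.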
Load Hit Speculation
• The previous optimization assumes that we know the exact latency for every operation
• This is true for all ops except loads (cache hit or miss?)
• Assume hit and schedule accordingly; on a cache miss, must squash all speculatively issued instructions; an instruction therefore sits in the queue until load hits are determined
Register Rename Logic
[Figure: rename logic block diagram; labels: Logical Source Regs, Logical Dest Regs, Logical Source Reg, Map Table, Free Pool, Dependence Check Logic, Mux, Physical Source Regs, Physical Dest Regs]
Map Table – RAM
[Figure: RAM-based map table; each entry is a 7-bit phys reg id]
• Num entries = Num logical regs
• Shadow copies (shift register)
Map Table – CAM
[Figure: CAM-based map table; field widths 5-bits, 1-bit, 1-bit; labels: Logical reg id, valid]
• Num entries = Num phys regs
• Shadow copies
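The difference between the two map-table organizations can be sketched as two small classes with the same interface. This is an illustration only: the sizes (32 logical, 128 physical registers) are inferred from the 5-bit and 7-bit field widths, the initial identity mapping is assumed, only the valid bit of the CAM entry is modeled, and shadow copies are left out. The class names are hypothetical.

```python
class RamMapTable:
    """One entry per logical register, holding a physical register id."""

    def __init__(self, num_logical=32):
        # Assumption: logical register i starts out mapped to physical i.
        self.table = list(range(num_logical))

    def lookup(self, lr):
        return self.table[lr]  # direct index, like a RAM read

    def update(self, lr, pr):
        self.table[lr] = pr


class CamMapTable:
    """One entry per physical register: logical reg id plus a valid bit."""

    def __init__(self, num_logical=32, num_physical=128):
        self.logical = [None] * num_physical
        self.valid = [False] * num_physical
        for lr in range(num_logical):
            self.logical[lr] = lr
            self.valid[lr] = True

    def lookup(self, lr):
        # Associative search; hardware compares all entries in parallel.
        for pr, (tag, ok) in enumerate(zip(self.logical, self.valid)):
            if ok and tag == lr:
                return pr
        raise KeyError(lr)

    def update(self, lr, pr):
        # Invalidate the previous mapping, then claim the new entry.
        self.valid[self.lookup(lr)] = False
        self.logical[pr] = lr
        self.valid[pr] = True


ram, cam = RamMapTable(), CamMapTable()
for table in (ram, cam):
    table.update(3, 40)  # logical reg 3 is renamed to physical reg 40
ram_result = (ram.lookup(3), ram.lookup(4))
cam_result = (cam.lookup(3), cam.lookup(4))
```

Both tables answer the same queries; they differ in how many entries they hold and in whether a lookup is an index or a search, which is what drives their delay and checkpointing cost.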
Wakeup Logic
[Figure: wakeup circuit; result tags tag1 … tagIW are compared (=) with tagL and tagR of each issue-queue entry, and the comparator outputs are ORed into rdyL and rdyR]
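A behavioural sketch of the wakeup step may help read the figure. Each issue-queue entry holds two operand tags (tagL, tagR) and two ready bits (rdyL, rdyR); every broadcast result tag is compared against both operand tags, and any match sets the corresponding ready bit. The `IQEntry` class and `wakeup` function are illustrative names, and the example entries reuse the renamed instructions from the Alpha 21264 slide with physical register numbers as tags.

```python
from dataclasses import dataclass


@dataclass
class IQEntry:
    tagL: int
    tagR: int
    rdyL: bool = False
    rdyR: bool = False

    def ready(self):
        """An entry may request issue once both operands are ready."""
        return self.rdyL and self.rdyR


def wakeup(queue, result_tags):
    """Compare each broadcast tag with both operand tags of every entry."""
    tags = set(result_tags)
    for entry in queue:
        # The OR keeps a ready bit set once it has been set.
        entry.rdyL = entry.rdyL or entry.tagL in tags
        entry.rdyR = entry.rdyR or entry.tagR in tags


queue = [
    IQEntry(tagL=33, tagR=3, rdyR=True),  # P34 <- P33 + P3 (P3 already ready)
    IQEntry(tagL=33, tagR=34),            # P35 <- P33 + P34
]
wakeup(queue, [33])  # P33's producer broadcasts its tag
after_first = [e.ready() for e in queue]
wakeup(queue, [34])  # then P34's producer broadcasts
after_second = [e.ready() for e in queue]
```

The loop over the queue stands in for comparators that operate on all entries at once in hardware, which is why wakeup delay grows with window size and issue width.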
Selection Logic
[Figure: selection circuit; labels: Issue window, Arbiter cell, req, grant, enable, anyreq]
• For multiple FUs, will need sequential selectors
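The selector's behaviour, and the reason multiple FUs need selectors in sequence, can be sketched as follows. This is a simplified model: it flattens the arbiter into a single priority pick, and it assumes that the lowest index in the issue window has the highest priority. The function names are hypothetical.

```python
def select(req, enable=True):
    """Grant at most one request; return (grant vector, anyreq).

    Assumption: lower index means higher priority.
    """
    anyreq = any(req)
    grant = [False] * len(req)
    if enable and anyreq:
        grant[req.index(True)] = True
    return grant, anyreq


def select_multi(req, num_fus):
    """Chain selectors: each one sees the requests the previous ones left."""
    remaining = list(req)
    granted = []
    for _ in range(num_fus):
        grant, anyreq = select(remaining)
        if not anyreq:
            break
        idx = grant.index(True)
        granted.append(idx)
        remaining[idx] = False  # mask the winner for the next selector
    return granted


requests = [False, True, False, True]
single_grant, single_anyreq = select(requests)
disabled_grant, disabled_anyreq = select(requests, enable=False)
two_fus = select_multi(requests, num_fus=2)
```

Because the second selector depends on the first one's grant, the selectors operate one after another, so selection delay grows with the number of FUs.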
Structure Complexities
• Critical structures: register map tables, issue queue, LSQ, register file, register bypass
• Cycle time is heavily influenced by: window size (physical register size), issue width (#FUs)
• Conflict between the desire to increase IPC and clock speed