OMSE 510: Computing Foundations 4: The CPU! Chris Gilmore <grimjack@cs.pdx.edu> Systems Software Lab Portland State University/OMSE
Today • Caches • DLX Assembly • CPU Overview
Introduction to RISC • Reduced Instruction Set Computer • 1975 John Cocke IBM 801 • IBM started working on a RISC-type computer in 1975 without calling it by this name • used as an I/O processor for IBM mainframes • Patterson and Hennessy • RISC was first introduced by Patterson and Ditzel in 1980 • Produced first RISC chip in early 1980s • RISC I and RISC II from Berkeley and MIPS from Stanford
RISC Chips • RISC II • Had 39 instructions, 2 addressing modes, and 3 data types • 39 x 2 x 3 = 234 combinations • Compared to VAX: 304 instructions, 16 addressing modes, 14 data types • 304 x 16 x 14 = 68,096 combinations • Found that • Compiled programs were 30% larger than CISC (VAX 11/780) • Ran up to 5 times faster than 68000 • Assembler-Compiler ratio (execution time of the assembler program divided by the execution time of the compiled version) • Ratio < 50% for CISC • 90% for RISC
RISC Definition 1. Single cycle operation 2. Load / store design 3. Hardwired control unit 4. Few instructions and addressing modes 5. Fixed instruction format 6. More compile time effort to avoid pipeline penalties
Disadvantages of CISC • Large, complicated, and time-consuming instruction set • Complex CU to decode and execute • Not necessarily faster than a sequence of several RISC instructions • Complexity of the CISC CU • A large number of design errors • Longer design time • Too large a choice for the compiler • Very difficult to design the optimal compiler • Does not always yield the most efficient code • Specialized to fit certain HLL instructions • May be redundant for another HLL • Relatively low cost/benefit factor
The Advantage of RISC • RISC and VLSI realization • Relatively small and simple C.U. hardware • Control unit share of chip area: RISC I: 6%, RISC II: 10%, MC68020: 68% • Higher chance of fitting other features on a chip • Can fit a large number of CPU registers • Enhances the throughput and HLL support • Increase the regularization factor
The Advantage of RISC • RISC and Computing Speed • Faster decoding process • Small instruction set, few addressing modes, fixed instruction format • Reduced memory access • A large number of CPU registers permits R-R operations • Faster parameter passing • Register windows in RISC I and RISC II • Streamlined instruction handling • All instructions have the same length • All execute in one cycle • Suitable for pipelined implementation
The Advantage of RISC • RISC and design costs and reliability • Shorter time to design and reduction of overall design costs • Reduce the probability that the end product will be obsolete • Reduced number of design errors • Virtual Memory Management System enhancement • Instructions will not cross word boundaries and can’t wind up on two separate pages
The Advantage of RISC • RISC and HLL Support • Shorter and simpler compiler • Usually only a single choice rather than several choices as in CISC • Large number of CPU registers • More efficient code optimization • Fast parameter passing between procedures • “register windows” • Reduced burden on compiler writer
The Disadvantages and Criticism of RISC ('80s) • RISC code tends to be longer • Extra burden on the machine and assembly language programmer • Several instructions required per single CISC instruction • More memory locations for their storage • Floating Point Support and VMM support
RISC Characteristics • Pipelined operation • Compiler responsible for pipeline conflict resolution • Delayed branch • Delayed load
Question #1: Why do microcoding? • If simple instructions could execute at a very high clock rate… • If you could even write compilers to produce microinstructions… • If most programs use simple instructions and addressing modes… • If microcode is kept in RAM instead of ROM so as to fix bugs… • If the same memory used for control memory could be used instead as cache for “macroinstructions”… • Then why not skip instruction interpretation by a microprogram and simply compile directly into the lowest language of the machine? (microprogramming is overkill when ISA matches datapath 1-1)
Pipelining is Natural! • Laundry Example • Ann, Brian, Cathy, Dave (A, B, C, D) each have one load of clothes to wash, dry, and fold • Washer takes 30 minutes • Dryer takes 40 minutes • “Folder” takes 20 minutes
Sequential Laundry • [Figure: timeline from 6 PM to midnight; loads A, B, C, D in task order, run one after another, each taking 30 + 40 + 20 minutes] • Sequential laundry takes 6 hours for 4 loads • If they learned pipelining, how long would laundry take?
Pipelined Laundry: Start work ASAP • [Figure: timeline from 6 PM; loads A, B, C, D in task order, overlapped, with stage times 30, 40, 40, 40, 40, 20 minutes] • Pipelined laundry takes 3.5 hours for 4 loads
Pipelining Lessons • [Figure: pipelined laundry timeline, 6 PM to 9:30 PM, loads A, B, C, D in task order] • Pipelining doesn’t help latency of single task, it helps throughput of entire workload • Pipeline rate limited by slowest pipeline stage • Multiple tasks operating simultaneously using different resources • Potential speedup = Number pipe stages • Unbalanced lengths of pipe stages reduce speedup • Time to “fill” pipeline and time to “drain” it reduce speedup • Stall for Dependences
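The laundry numbers can be checked with a short calculation. The following is a minimal sketch in Python, not part of the original slides; the function names are made up for illustration, and it assumes a load can wait between stages for as long as needed.

```python
STAGES = [30, 40, 20]  # washer, dryer, folder times in minutes (from the slides)

def sequential_minutes(loads, stages=STAGES):
    """Each load finishes every stage before the next load starts."""
    return loads * sum(stages)

def pipelined_minutes(loads, stages=STAGES):
    """Each load enters a stage as soon as both the load and the stage are free."""
    stage_free = [0] * len(stages)   # time at which each stage becomes idle
    finish = 0
    for _ in range(loads):
        t = 0                        # every load is ready at time 0 (6 PM)
        for i, duration in enumerate(stages):
            start = max(t, stage_free[i])
            t = start + duration
            stage_free[i] = t
        finish = t
    return finish

print(sequential_minutes(4) / 60)  # 6.0 hours
print(pipelined_minutes(4) / 60)   # 3.5 hours
```

The pipelined total is 30 + 4 x 40 + 20 = 210 minutes: the 40-minute dryer sets the rate, which is the "slowest stage" lesson in numbers.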
Execution Cycle • Instruction Fetch: obtain instruction from program storage • Instruction Decode: determine required actions and instruction size • Operand Fetch: locate and obtain operand data • Execute: compute result value or status • Result Store: deposit results in storage for later use • Next Instruction: determine successor instruction
The Five Stages of Load • [Figure: Load passes through Ifetch, Reg/Dec, Exec, Mem, Wr in cycles 1 to 5] • Ifetch: Instruction Fetch • Fetch the instruction from the Instruction Memory • Reg/Dec: Registers Fetch and Instruction Decode • Exec: Calculate the memory address • Mem: Read the data from the Data Memory • Wr: Write the data back to the register file
Note: These 5 stages were there all along! • [Figure: multicycle state diagram, grouped here by stage with state codes] • Fetch (0000): IR <= MEM[PC]; PC <= PC + 4 • Decode (0001): ALUout <= PC + SX • Execute: R-type (0100) ALUout <= A fun B; ORi (0110) ALUout <= A op ZX; LW (1000) ALUout <= A + SX; SW (1011) ALUout <= A + SX; BEQ (0010) If A = B then PC <= ALUout • Memory: LW (1001) M <= MEM[ALUout]; SW (1100) MEM[ALUout] <= B • Write-back: R-type (0101) R[rd] <= ALUout; ORi (0111) R[rt] <= ALUout; LW (1010) R[rt] <= M
Pipelining • Improve performance by increasing throughput • Ideal speedup is the number of stages in the pipeline. Do we achieve this?
Basic Idea • What do we need to add to split the datapath into stages?
Graphically Representing Pipelines • Can help with answering questions like: • how many cycles does it take to execute this code? • what is the ALU doing during cycle 4? • use this representation to help understand datapaths
Conventional Pipelined Execution Representation • [Figure: six instructions in program-flow order, each passing through IFetch, Dcd, Exec, Mem, WB and offset by one cycle from the previous one; time runs left to right]
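The staggered representation is easy to generate, which also makes it easy to answer questions such as "how many cycles does this code take?" or "what is happening in cycle 4?". The sketch below is illustrative Python with hypothetical helper names; it assumes an ideal pipeline with one instruction issued per cycle and no stalls.

```python
STAGES = ["IFetch", "Dcd", "Exec", "Mem", "WB"]

def pipeline_diagram(num_instructions, stages=STAGES):
    """Return one text row per instruction; instruction i enters in cycle i + 1."""
    width = max(len(s) for s in stages) + 1
    rows = []
    for i in range(num_instructions):
        cells = [""] * i + stages          # shift each row right by one cycle
        rows.append("".join(cell.ljust(width) for cell in cells).rstrip())
    return rows

def total_cycles(num_instructions, stages=STAGES):
    """Cycles until the last instruction leaves the final stage (no stalls)."""
    return len(stages) + num_instructions - 1

def stage_in_cycle(instruction, cycle, stages=STAGES):
    """Stage occupied by a 0-based instruction in a 1-based cycle, or None."""
    index = cycle - 1 - instruction
    return stages[index] if 0 <= index < len(stages) else None

for row in pipeline_diagram(3):
    print(row)
print(total_cycles(6))        # 10
print(stage_in_cycle(1, 4))   # Exec
```

For the six instructions in the figure this gives 10 cycles, and in cycle 4 the second instruction is the one in the Exec stage.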
Single Cycle, Multiple Cycle, vs. Pipeline • [Figure: clock timing diagrams for the three implementations] • Single Cycle Implementation: Load and Store each take one long cycle (cycles 1 and 2); part of the Store cycle is wasted • Multiple Cycle Implementation: Load (Ifetch, Reg, Exec, Mem, Wr), then Store (Ifetch, Reg, Exec, Mem), then R-type (Ifetch ...), one stage per cycle across cycles 1 to 10 • Pipeline Implementation: Load, Store, and R-type (each Ifetch, Reg, Exec, Mem, Wr) overlap, one starting per cycle
Why Pipeline? • Suppose we execute 100 instructions • Single Cycle Machine • 45 ns/cycle x 1 CPI x 100 inst = 4500 ns • Multicycle Machine • 10 ns/cycle x 4.6 CPI (due to inst mix) x 100 inst = 4600 ns • Ideal pipelined machine • 10 ns/cycle x (1 CPI x 100 inst + 4 cycle drain) = 1040 ns
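The three totals follow from time = cycle time x cycles. A minimal Python sketch of the same arithmetic, with hypothetical function names and the slide's figures as defaults (the 5-stage depth is assumed from the "4 cycle drain"):

```python
def single_cycle_ns(instructions, cycle_ns=45):
    """One long cycle per instruction (CPI = 1)."""
    return cycle_ns * instructions

def multicycle_ns(instructions, cycle_ns=10, cpi=4.6):
    """Short cycles, but several of them per instruction."""
    return cycle_ns * cpi * instructions

def pipelined_ns(instructions, cycle_ns=10, stages=5):
    """One instruction completes per cycle, plus stages - 1 cycles to drain."""
    return cycle_ns * (instructions + stages - 1)

n = 100
print(f"{single_cycle_ns(n):.0f} ns")   # 4500 ns
print(f"{multicycle_ns(n):.0f} ns")     # 4600 ns
print(f"{pipelined_ns(n):.0f} ns")      # 1040 ns
print(f"{single_cycle_ns(n) / pipelined_ns(n):.2f}x")  # 4.33x
```

The speedup over the single-cycle machine is about 4.3x rather than 5x because of the drain cycles and because five 10 ns stages add up to more than the 45 ns single cycle.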
Why Pipeline? Because we can! • [Figure: Inst 0 to Inst 4 in instruction order against time in clock cycles; each uses Im, Reg, ALU, Dm, Reg in successive cycles, offset by one cycle from the previous instruction]
Can pipelining get us into trouble? • Yes: Pipeline Hazards • structural hazards: attempt to use the same resource two different ways at the same time • E.g., a combined washer/dryer would be a structural hazard, or the folder busy doing something else (watching TV) • control hazards: attempt to make a decision before the condition is evaluated • E.g., washing football uniforms and needing to get the proper detergent level; need to see the result after the dryer before putting the next load in • branch instructions • data hazards: attempt to use an item before it is ready • E.g., one sock of a pair in the dryer and one in the washer; can’t fold until the sock from the washer gets through the dryer • instruction depends on the result of a prior instruction still in the pipeline • Can always resolve hazards by waiting • pipeline control must detect the hazard • take action (or delay action) to resolve hazards
Single Memory is a Structural Hazard • [Figure: Load, Instr 1, Instr 2, Instr 3, Instr 4 in instruction order against time in clock cycles, each using Mem, Reg, ALU, Mem, Reg; the Load's data access and the fetch of Instr 3 need the single memory in the same cycle] • Detection is easy in this case! (right half highlight means read, left half write)
Structural Hazards limit performance • Example: if 1.3 memory accesses per instruction and only one memory access per cycle then • average CPI >= 1.3 • otherwise resource is more than 100% utilized
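The bound comes from dividing the demand on the resource by its capacity. A small Python sketch of that reasoning; the function name and the `ports_per_cycle` parameter are assumptions made for illustration:

```python
def min_cpi(accesses_per_instr, ports_per_cycle=1, base_cpi=1.0):
    """Lower bound on CPI when a shared resource serves a fixed number of
    accesses per cycle; the resource cannot be more than 100% utilized."""
    return max(base_cpi, accesses_per_instr / ports_per_cycle)

print(min_cpi(1.3))      # 1.3 with a single memory port
print(min_cpi(1.3, 2))   # 1.0 once a second port removes the bottleneck
```

With a second port (for example, separate instruction and data memories) the structural hazard no longer limits CPI in this model.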
Control Hazard Solution #1: Stall • [Figure: Add, Beq, Load in instruction order against time in clock cycles; the Load after the branch starts late, with the gap marked "Lost potential"] • Stall: wait until decision is clear • Impact: 2 lost cycles (i.e. 3 clock cycles per branch instruction) => slow • Move decision to end of decode • save 1 cycle per branch
Control Hazard Solution #2: Predict • [Figure: Add, Beq, Load in instruction order against time in clock cycles] • Predict: guess one direction then back up if wrong • Impact: 0 lost cycles per branch instruction if right, 1 if wrong (right 50% of time) • Need to “Squash” and restart following instruction if wrong • Produce CPI on branch of (1 * .5 + 2 * .5) = 1.5 • Total CPI might then be: 1.5 * .2 + 1 * .8 = 1.1 (20% branch) • More dynamic scheme: history of 1 branch (about 90%)
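The CPI figures are weighted averages, so they are easy to recompute for other prediction accuracies. A minimal sketch in Python; the function names are hypothetical, and the 20% branch frequency and 1-cycle misprediction penalty are the slide's own numbers:

```python
def branch_cpi(accuracy, penalty_cycles=1):
    """Average cycles per branch: 1 if predicted right, 1 + penalty if wrong."""
    return 1 + (1 - accuracy) * penalty_cycles

def total_cpi(branch_fraction, accuracy, penalty_cycles=1):
    """Weighted CPI, assuming every non-branch instruction takes 1 cycle."""
    return (branch_fraction * branch_cpi(accuracy, penalty_cycles)
            + (1 - branch_fraction) * 1)

print(round(branch_cpi(0.5), 2))        # 1.5
print(round(total_cpi(0.2, 0.5), 2))    # 1.1
print(round(total_cpi(0.2, 0.9), 2))    # 1.02 with a 90%-accurate predictor
print(round(total_cpi(0.2, 0.0, 2), 2)) # 1.4 if every branch stalls 2 cycles
```

The last line models always stalling (Solution #1) as a predictor that is never right with a 2-cycle penalty.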
Control Hazard Solution #3: Delayed Branch • [Figure: Add, Beq, Misc, Load in instruction order against time in clock cycles; Misc occupies the slot after Beq] • Delayed Branch: Redefine branch behavior (takes place after next instruction) • Impact: 0 clock cycles per branch instruction if can find instruction to put in “slot” (about 50% of time) • As we launch more instructions per clock cycle, less useful
Delayed/Predicted Branch • Where to get instructions to fill branch delay slot? • Before branch instruction • From the target address: only valuable when branch taken • From fall through: only valuable when branch not taken • Cancelling branches allow more slots to be filled • Compiler effectiveness for single branch delay slot: • Fills about 60% of branch delay slots • About 80% of instructions executed in branch delay slots useful in computation • About 50% (60% x 80%) of slots usefully filled • Delayed Branch downside: 7-8 stage pipelines, multiple instructions issued per clock (superscalar)
Data Hazard on r1 • add r1,r2,r3 • sub r4,r1,r3 • and r6,r1,r7 • or r8,r1,r9 • xor r10,r1,r11
Data Hazard on r1: • Dependencies backwards in time are hazards • [Figure: add r1,r2,r3; sub r4,r1,r3; and r6,r1,r7; or r8,r1,r9; xor r10,r1,r11 in instruction order against time in clock cycles, with stages IF, ID/RF, EX, MEM, WB]
Data Hazard Solution: • “Forward” result from one stage to another • “or” OK if define read/write properly • [Figure: add r1,r2,r3; sub r4,r1,r3; and r6,r1,r7; or r8,r1,r9; xor r10,r1,r11 in instruction order against time in clock cycles, with stages IF, ID/RF, EX, MEM, WB]
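Which of the four consumers of r1 actually need forwarding depends only on how far they sit behind the add. The following Python sketch is an illustration, not the slides' method: the helper names are made up, the producer is assumed to be an ALU instruction in a 5-stage pipeline that reads registers in stage 2 and writes them in stage 5, and it ignores any later instruction that redefines the register.

```python
ID_CYCLE, WB_CYCLE = 2, 5   # register read and write cycles for an instruction issued in cycle 1

def parse(instr):
    """Split 'add r1,r2,r3' into (opcode, destination, [sources])."""
    op, operands = instr.split(None, 1)
    regs = [r.strip() for r in operands.split(",")]
    return op, regs[0], regs[1:]

def classify(program, producer_index=0):
    """For each later instruction that reads the producer's destination,
    report how the dependence can be satisfied."""
    _, dest, _ = parse(program[producer_index])
    result = {}
    for distance, instr in enumerate(program[producer_index + 1:], start=1):
        _, _, sources = parse(instr)
        if dest not in sources:
            continue
        read_cycle = ID_CYCLE + distance   # cycle in which the consumer reads registers
        if read_cycle < WB_CYCLE:
            result[instr] = "needs forwarding"
        elif read_cycle == WB_CYCLE:
            result[instr] = "ok if write happens before read in the same cycle"
        else:
            result[instr] = "no hazard"
    return result

program = ["add r1,r2,r3", "sub r4,r1,r3", "and r6,r1,r7",
           "or r8,r1,r9", "xor r10,r1,r11"]
for instr, verdict in classify(program).items():
    print(f"{instr:16s} {verdict}")
```

Under these assumptions `sub` and `and` need the forwarded value, `or` works if the register file writes in the first half of the cycle and reads in the second, and `xor` is far enough behind to read the register normally.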
Forwarding (or Bypassing): What about Loads? • Dependencies backwards in time are hazards • Can’t solve with forwarding: • Must delay/stall instruction dependent on loads • [Figure: lw r1,0(r2) followed by sub r4,r1,r3 against time in clock cycles, with stages IF, ID/RF, EX, MEM, WB]
Forwarding (or Bypassing): What about Loads? • Dependencies backwards in time are hazards • Can’t solve with forwarding: • Must delay/stall instruction dependent on loads • [Figure: lw r1,0(r2) followed by sub r4,r1,r3 against time in clock cycles, with stages IF, ID/RF, EX, MEM, WB; a Stall is inserted before sub]
Conflicts/Problems • I-cache and D-cache are accessed in the same cycle, so it helps to implement them separately • Registers are read and written in the same cycle, which is easy to deal with if register read/write time equals cycle time/2 (else, use bypassing) • Branch target changes only at the end of the second stage; what do you do in the meantime? • Data between stages get latched into registers (overhead that increases latency per instruction)
Control Hazards • Simple techniques to handle control hazard stalls: • for every branch, introduce a stall cycle (note: every 6th instruction is a branch!) • assume the branch is not taken and start fetching the next instruction – if the branch is taken, need hardware to cancel the effect of the wrong-path instruction • fetch the next instruction (branch delay slot) and execute it anyway – if the instruction turns out to be on the correct path, useful work was done – if the instruction turns out to be on the wrong path, hopefully program state is not lost
Slowdowns from Stalls • Perfect pipelining with no hazards: an instruction completes every cycle (total cycles ~ num instructions) • speedup = increase in clock speed = num pipeline stages • With hazards and stalls, some cycles (= stall time) go by during which no instruction completes, and then the stalled instruction completes • Total cycles = number of instructions + stall cycles • Slowdown because of stalls = 1/ (1 + stall cycles per instr)
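Combining the ideal speedup with the slowdown factor gives an effective speedup of stages / (1 + stall cycles per instruction). A minimal Python sketch with hypothetical function names; the example numbers (5 stages, 100 instructions, 25 stall cycles) are made up for illustration:

```python
def total_cycles(instructions, stall_cycles):
    """Total cycles = number of instructions + stall cycles (fill time ignored)."""
    return instructions + stall_cycles

def pipeline_speedup(stages, stalls_per_instr):
    """Ideal speedup (= number of stages) scaled by 1 / (1 + stalls per instruction)."""
    return stages / (1 + stalls_per_instr)

print(total_cycles(100, 25))          # 125
print(pipeline_speedup(5, 0.0))       # 5.0, perfect pipelining
print(pipeline_speedup(5, 25 / 100))  # 4.0
```

A quarter of a stall cycle per instruction is enough to drop a 5-stage pipeline from 5x to 4x in this model.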
Control and Datapath: Split state diag into 5 pieces • Fetch: IR <- Mem[PC]; PC <- PC + 4 • Decode: A <- R[rs]; B <- R[rt] • Execute: S <- A + B; S <- A or ZX; S <- A + SX; S <- A + SX; If Cond PC <- PC + SX • Memory: M <- Mem[S]; Mem[S] <- B • Write-back: R[rd] <- S; R[rt] <- S; R[rt] <- M • [Figure: datapath with PC, Next PC, Inst. Mem, IR, Reg File, Exec, Mem Access, Data Mem, Reg. File; pipeline registers A, B, S, M, D; Equal signal]
Three Generic Data Hazards • InstrI followed by InstrJ • Read After Write (RAW): InstrJ tries to read operand before InstrI writes it
Three Generic Data Hazards • InstrI followed by InstrJ • Write After Read (WAR): InstrJ tries to write operand before InstrI reads it • Gets wrong operand • Can’t happen in DLX 5 stage pipeline because: • All instructions take 5 stages, and • Reads are always in stage 2, and • Writes are always in stage 5
Three Generic Data Hazards • InstrI followed by InstrJ • Write After Write (WAW): InstrJ tries to write operand before InstrI writes it • Leaves wrong result (InstrI not InstrJ) • Can’t happen in DLX 5 stage pipeline because: • All instructions take 5 stages, and • Writes are always in stage 5 • Can have WAR and WAW in more complicated pipes
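The three cases can be told apart from the register sets each instruction reads and writes. The sketch below is illustrative Python with a made-up representation (each instruction as a pair of sets). It reports the dependences that could become hazards; whether they actually do depends on the pipeline, as the slides note for the DLX 5-stage case.

```python
def data_hazards(instr_i, instr_j):
    """instr_i precedes instr_j; each is (registers_written, registers_read).
    Returns the set of potential hazard types between the two."""
    writes_i, reads_i = instr_i
    writes_j, reads_j = instr_j
    hazards = set()
    if writes_i & reads_j:
        hazards.add("RAW")   # J reads what I writes
    if reads_i & writes_j:
        hazards.add("WAR")   # J writes what I reads
    if writes_i & writes_j:
        hazards.add("WAW")   # both write the same register
    return hazards

add = ({"r1"}, {"r2", "r3"})    # add r1,r2,r3
sub = ({"r4"}, {"r1", "r3"})    # sub r4,r1,r3
mov = ({"r2"}, {"r5"})          # hypothetical instruction writing r2
print(data_hazards(add, sub))   # {'RAW'}
print(data_hazards(add, mov))   # {'WAR'}
print(data_hazards(add, add))   # {'WAW'}
```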
Software Scheduling to Avoid Load Hazards • Try producing fast code for a = b + c; d = e - f; assuming a, b, c, d, e, and f in memory. • Slow code: LW Rb,b; LW Rc,c; ADD Ra,Rb,Rc; SW a,Ra; LW Re,e; LW Rf,f; SUB Rd,Re,Rf; SW d,Rd • Fast code: LW Rb,b; LW Rc,c; LW Re,e; ADD Ra,Rb,Rc; LW Rf,f; SW a,Ra; SUB Rd,Re,Rf; SW d,Rd
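The difference between the two sequences can be counted mechanically. The following Python sketch is a simplified model with hypothetical helper names: it assumes full forwarding, so the only stall is one cycle when an instruction reads a register loaded by the instruction immediately before it.

```python
def parse(line):
    """Return (opcode, register written or None, list of registers read)."""
    op, operands = line.split(None, 1)
    args = [a.strip() for a in operands.split(",")]
    if op == "LW":                 # LW Rd,addr : writes Rd, reads memory only
        return op, args[0], []
    if op == "SW":                 # SW addr,Rs : reads Rs, writes memory only
        return op, None, [args[1]]
    return op, args[0], args[1:]   # ALU op: first operand is the destination

def load_use_stalls(program):
    """Count one stall per instruction that uses the result of the load just before it."""
    stalls = 0
    previous_load_dest = None
    for line in program:
        op, dest, sources = parse(line)
        if previous_load_dest in sources:
            stalls += 1
        previous_load_dest = dest if op == "LW" else None
    return stalls

slow = ["LW Rb,b", "LW Rc,c", "ADD Ra,Rb,Rc", "SW a,Ra",
        "LW Re,e", "LW Rf,f", "SUB Rd,Re,Rf", "SW d,Rd"]
fast = ["LW Rb,b", "LW Rc,c", "LW Re,e", "ADD Ra,Rb,Rc",
        "LW Rf,f", "SW a,Ra", "SUB Rd,Re,Rf", "SW d,Rd"]
print(load_use_stalls(slow))  # 2
print(load_use_stalls(fast))  # 0
```

In the slow code, ADD and SUB each follow the load of one of their operands; the fast code moves an independent load into each of those positions, so the same eight instructions run without load stalls in this model.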
Summary: Pipelining • Reduce CPI by overlapping many instructions • Average throughput of approximately 1 CPI with fast clock • Utilize capabilities of the Datapath • start next instruction while working on the current one • limited by length of longest stage (plus fill/flush) • detect and resolve hazards • What makes it easy • all instructions are the same length • just a few instruction formats • memory operands appear only in loads and stores • What makes it hard? • structural hazards: suppose we had only one memory • control hazards: need to worry about branch instructions • data hazards: an instruction depends on a previous instruction
Some Issues for your consideration • Won’t be tested • We’ll talk about modern processors and what’s really hard: • exception handling • trying to improve performance with out-of-order execution, etc. • Trying to get CPI < 1 (Superscalar execution)