CS 161 Review for Test 2
Instructor: L.N. Bhuyan, www.cs.ucr.edu/~bhuyan
Adapted from notes by Dave Patterson (http.cs.berkeley.edu/~patterson)
How to Study for Test 2: Chap 5
• Single-cycle (CPI=1) processor
  • know how to reason about processor organization (datapath, control)
  • e.g., how to add another instruction? (must modify control, datapath, or both)
  • how to add multiplexors in the datapath
  • how to design the hardware control unit
• Multicycle (CPI>1) processor
  • changes to the single-cycle datapath
  • control design through an FSM
  • how to add a new instruction to the multicycle design?
Putting Together a Datapath for MIPS
[Figure: datapath drawn as Steps 1 to 5, with PC, instruction memory (Imem), Registers, ALU, and data memory (Dmem)]
• Question: Which instruction uses which steps and what is the execution time?
Datapath Timing: Single-cycle vs. Pipelined
• Suppose the following delays for major functional units:
  • 2 ns for a memory access or ALU operation
  • 1 ns for register file read or write
• Total datapath delay for single-cycle:
  Insn     Insn    Reg     ALU     Data     Reg     Total
  Type     Fetch   Read    Oper    Access   Write   Time
  beq      2ns     1ns     2ns     -        -       5ns
  R-form   2ns     1ns     2ns     -        1ns     6ns
  sw       2ns     1ns     2ns     2ns      -       7ns
  lw       2ns     1ns     2ns     2ns      1ns     8ns
• What about multi-cycle datapath?
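The timing question can be checked with a short calculation. This is a minimal sketch in Python using only the delays listed on the slide; the function names are illustrative, and the multi-cycle part assumes that every step an instruction uses takes one clock whose period equals the slowest step (2 ns).

```python
# Per-step delays in ns, taken from the slide's table (0 = step not used).
DELAYS = {
    "beq":    {"fetch": 2, "reg_read": 1, "alu": 2, "mem": 0, "reg_write": 0},
    "R-form": {"fetch": 2, "reg_read": 1, "alu": 2, "mem": 0, "reg_write": 1},
    "sw":     {"fetch": 2, "reg_read": 1, "alu": 2, "mem": 2, "reg_write": 0},
    "lw":     {"fetch": 2, "reg_read": 1, "alu": 2, "mem": 2, "reg_write": 1},
}

def total_time(insn):
    """Sum of the delays of the steps this instruction actually uses."""
    return sum(DELAYS[insn].values())

def single_cycle_clock():
    """A single-cycle design must fit the slowest instruction in one clock."""
    return max(total_time(insn) for insn in DELAYS)

def multicycle_clock():
    """Assumption: one step per clock, so the clock fits the slowest step."""
    return max(d for steps in DELAYS.values() for d in steps.values())

def multicycle_time(insn):
    """Number of steps used times the multi-cycle clock period."""
    steps_used = sum(1 for d in DELAYS[insn].values() if d > 0)
    return steps_used * multicycle_clock()
```

Under these assumptions the single-cycle clock is set by lw, while in the multi-cycle design a beq finishes in fewer clocks than an lw.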
Implementing Main Control
• Main Control has one 6-bit input (op) and 9 output bits (7 outputs are 1-bit, ALUOp is 2 bits): RegDst, Branch, MemRead, MemtoReg, ALUOp, MemWrite, ALUSrc, RegWrite
• To build Main Control as sum-of-products:
  (1) Construct a minterm for each different instruction (or R-type); each minterm corresponds to a single instruction (or all of the R-type instructions), e.g., M_R-format, M_lw
  (2) Determine each main control output by forming the logical OR of relevant minterms (instructions), e.g., RegWrite: M_R-format OR M_lw
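The two-step recipe can be sketched in Python. This is an illustration only: the opcode values are the standard MIPS encodings for R-format, lw, sw and beq, the function names are made up for the sketch, and MemtoReg is left out because its polarity depends on how the write-back mux is drawn.

```python
# Standard MIPS opcodes (bits 31:26) for the four MIPS-lite instruction classes.
OPCODES = {"R-format": 0b000000, "lw": 0b100011, "sw": 0b101011, "beq": 0b000100}

def minterms(op):
    """Step (1): one minterm per instruction class, 1 only if all 6 bits match."""
    return {
        name: int(all(((op >> i) & 1) == ((code >> i) & 1) for i in range(6)))
        for name, code in OPCODES.items()
    }

def main_control(op):
    """Step (2): each output is the OR of the minterms that assert it."""
    m = minterms(op)
    return {
        "RegDst":   m["R-format"],
        "ALUSrc":   m["lw"] | m["sw"],
        "RegWrite": m["R-format"] | m["lw"],
        "MemRead":  m["lw"],
        "MemWrite": m["sw"],
        "Branch":   m["beq"],
        # ALUOp is 2 bits: bit 1 set for R-format, bit 0 set for beq.
        "ALUOp":    (m["R-format"] << 1) | m["beq"],
    }
```

For example, `main_control(0b100011)` (lw) asserts RegWrite, ALUSrc and MemRead and leaves ALUOp at 0.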
Single-Cycle MIPS-lite CPU
[Figure: single-cycle datapath with PC, Imem, Regs, ALU, Dmem, SignExtend, shift-left-2, two adders, Main Control (input op = [31:26]) and ALU Control (input bits 5:0). Instruction fields: 25:21 to ReadReg1, 20:16 to ReadReg2, 20:16 or 15:11 to WriteReg, 15:0 to SignExtend. Control signals: PCSrc, Branch, MemRead, MemWrite, MemtoReg, RegDst, ALUSrc, RegWrite, ALUOp]
R-format Execution Illustration (step 4)
[Figure: the single-cycle MIPS-lite datapath during R-format execution, with the ALU computing [r1] + [r2]. Signal values shown: PCSrc=0, ALUSrc=0, RegDst=1, MemtoReg=1]
MIPS-lite Multicycle Version: Multicycle Datapath (overview)
[Figure: PC, Memory (instruction or data), Instruction Register, Memory Data Register, Registers, A and B registers, ALU, ALUOut]
• One ALU (no extra adders)
• One Memory (no separate Imem, Dmem)
• New Temporary Registers (“clocked”/require clock input)
MIPS-lite Multicycle Version: Cycle 3 Datapath (R-format)
[Figure: multicycle datapath (PC, Mem, IR, MDR, Regs, A, B, ALU, ALUOut, sign extend, shift-left-2) with ALU Control driven by funct (bits 5:0)]
• ALUOut = A op B
FSM diagram for Multicycle Machine
• cycle 1, state 0 (start new instruction): MemRead, ALUSrcA = 0, IorD = 0, IRWrite, ALUSrcB = 1, ALUOp = 0, PCWrite, PCSrc = 0
• cycle 2, state 1: ALUSrcA = 0, ALUSrcB = 3, ALUOp = 0; next state depends on the instruction (lw/sw, R-format, beq)
• cycle 3, state 2 (lw/sw, Memory Access): ALUSrcA = 1, ALUSrcB = 2, ALUOp = 0
• cycle 3, state 6 (R-format execution): ALUSrcA = 1, ALUSrcB = 0, ALUOp = 2
• cycle 3, state 8 (beq, Branch Completion): ALUSrcA = 1, ALUSrcB = 0, ALUOp = 1, PCWriteCond, PCSrc = 1
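The first three cycles of this FSM can be written down as two lookup tables. The sketch below is in Python and covers only the states shown on this slide; the names `OUTPUTS`, `DISPATCH` and `next_state` are illustrative, and the return from state 8 to state 0 follows the textbook FSM.

```python
# Control outputs asserted in each state shown on the slide.
OUTPUTS = {
    0: {"MemRead": 1, "ALUSrcA": 0, "IorD": 0, "IRWrite": 1,
        "ALUSrcB": 1, "ALUOp": 0, "PCWrite": 1, "PCSrc": 0},
    1: {"ALUSrcA": 0, "ALUSrcB": 3, "ALUOp": 0},
    2: {"ALUSrcA": 1, "ALUSrcB": 2, "ALUOp": 0},
    6: {"ALUSrcA": 1, "ALUSrcB": 0, "ALUOp": 2},
    8: {"ALUSrcA": 1, "ALUSrcB": 0, "ALUOp": 1, "PCWriteCond": 1, "PCSrc": 1},
}

# State entered in cycle 3, chosen by the instruction class decoded in state 1.
DISPATCH = {"lw": 2, "sw": 2, "R-format": 6, "beq": 8}

def next_state(state, insn):
    """Next-state function for the part of the FSM drawn on this slide."""
    if state == 0:
        return 1               # fetch is always followed by decode
    if state == 1:
        return DISPATCH[insn]  # decode dispatches on the opcode
    if state == 8:
        return 0               # branch completion starts a new instruction
    raise ValueError("later states are not shown on this slide")
```

A PLA or ROM implementation stores the same two functions, next state and outputs, indexed by the opcode and current state bits.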
Implementing the FSM controller (C.3)
• PLA or ROM implementation of both next-state and output functions
• Inputs: instruction register opcode field (Op5 to Op0) and state register (S3 to S0)
• Outputs: datapath control points (PCWrite, PCWriteCond, IorD, MemRead, MemWrite, IRWrite, MemtoReg, PCSrc, ALUOp, ALUSrcB, ALUSrcA, RegWrite, RegDst) and next state (NS3 to NS0)
Micro-programmed Control (Chap. 5.5)
• In microprogrammed control, FSM states become microinstructions of a microprogram (“microcode”)
  • one FSM state = one microinstruction
  • usually represent each microinstruction textually, like an assembly instruction
• FSM current state register becomes the microprogram counter (micro-PC)
  • normal sequencing: add 1 to micro-PC to get next microinstruction
  • microprogram branch: separate logic determines next microinstruction
Micro-program for Multi-cycle Machine
          ALU                   Reg    Mem          PC     Next
Label     Op      In1   In2     File   Op    Src    Writ   -Instr
-------   -------------------   ----   ----------   ----   ------
Fetch:    Add     PC    4       -      Rd    PC     ALU    -
          Add     PC    SE*4    Rd     -     -      -      [D1]
Mem:      Add     A     SE      -      -     -      -      [D2]
LW:       -       -     -       -      Rd    ALU    -      -
          -       -     -       Wr     -     -      -      Fetch
SW:       -       -     -       -      Wr    ALU    -      Fetch
Rform:    funct   A     B       -      -     -      -      -
          -       -     -       Wr     -     -      -      Fetch
BEQ:      Sub     A     B       -      -     -      Equ    Fetch
D1 = { Mem, Rform, BEQ }
D2 = { LW, SW }
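The sequencing field of this microprogram can be traced with a few lines of Python. This is a sketch under one reading of the table above: an empty Next-Instr field is taken to mean "add 1 to the micro-PC", and the micro-addresses 0 to 8 are simply the row order, which is an assumption made for illustration.

```python
# (label, sequencing) for each microinstruction, in row order.
# "seq" = add 1 to micro-PC; "fetch" = branch to Fetch; "D1"/"D2" = dispatch.
MICROPROGRAM = [
    ("Fetch", "seq"), (None, "D1"),
    ("Mem", "D2"),
    ("LW", "seq"), (None, "fetch"),
    ("SW", "fetch"),
    ("Rform", "seq"), (None, "fetch"),
    ("BEQ", "fetch"),
]
LABELS = {label: addr for addr, (label, _) in enumerate(MICROPROGRAM) if label}

# Dispatch tables: instruction class -> target label.
D1 = {"lw": "Mem", "sw": "Mem", "R-format": "Rform", "beq": "BEQ"}
D2 = {"lw": "LW", "sw": "SW"}

def next_upc(upc, insn):
    """Next micro-PC value given the current one and the instruction class."""
    sequencing = MICROPROGRAM[upc][1]
    if sequencing == "seq":
        return upc + 1
    if sequencing == "fetch":
        return LABELS["Fetch"]
    table = D1 if sequencing == "D1" else D2
    return LABELS[table[insn]]

def trace(insn):
    """Micro-addresses visited while executing one instruction."""
    upc, visited = LABELS["Fetch"], [LABELS["Fetch"]]
    while True:
        upc = next_upc(upc, insn)
        if upc == LABELS["Fetch"]:
            return visited
        visited.append(upc)
```

The length of `trace(insn)` is the number of clock cycles that instruction takes under this reading, for example five microinstructions for lw and three for beq.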
How to Study for Test 2: Chap 6
• Pipelined Processor
  • how do the pipelined datapath and control differ from the architectures of Chapter 5?
  • all instructions execute in the same 5 cycles (stages)
  • pipeline registers separate the stages of datapath & control
• Problems for Pipelining
  • pipeline hazards: structural, data, control (how is each solved?)
Pipelining Lessons
[Figure: laundry timeline from 6 PM to 9 PM; tasks A, B, C, D in task order, built from 30-minute steps]
• Pipelining doesn’t help latency (execution time) of a single task; it helps throughput of the entire workload
• Multiple tasks operate simultaneously using different resources
• Potential speedup = number of pipe stages
• What is the real speedup?
• Time to “fill” the pipeline and time to “drain” it reduce speedup
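The gap between potential and real speedup can be made concrete. The sketch below is in Python and assumes an idealized pipeline in which every stage takes the same time and there are no hazards; `pipeline_speedup` is an illustrative name.

```python
def pipeline_speedup(num_tasks, num_stages):
    """Speedup of an ideal pipeline over running the tasks one at a time.

    Unpipelined: each task occupies all stages before the next starts.
    Pipelined: the first task needs num_stages steps to fill the pipeline,
    then one more task completes every step.
    """
    unpipelined_steps = num_tasks * num_stages
    pipelined_steps = num_stages + (num_tasks - 1)
    return unpipelined_steps / pipelined_steps
```

With only a few tasks the fill and drain time dominates, so the speedup stays well below the number of stages; it approaches the number of stages only as the number of tasks grows.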
Space-Time Diagram
• To simplify pipeline, every instruction takes same number of steps, called stages
• One clock cycle per stage
[Figure: five instructions in program flow order, each passing through IFtch, Dcd, Exec, Mem, WB, staggered along the time axis]
Problems for Pipelining
• Hazards prevent next instruction from executing during its designated clock cycle, limiting speedup
  • Structural hazards: HW cannot support this combination of instructions (single person to fold and put clothes away)
  • Control hazards: conditional branches & other instructions may stall the pipeline, delaying later instructions (must check detergent level before washing next load)
  • Data hazards: instruction depends on result of prior instruction still in the pipeline (matching socks in later load)
Control Hazard: Solution 1
• Guess the branch outcome, then back up if wrong: “branch prediction”
• For example, predict not taken
• Impact: 1 clock per branch instruction if right, 2 if wrong (static: right ~50% of time)
• More dynamic scheme: keep history of the branch instruction (right ~90% of time)
[Figure: pipeline diagram (IM, Reg, ALU, DM, Reg) over time in clock cycles for add, beq, Load in instruction order]
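The impact figures translate directly into an expected cost per branch. This Python sketch uses the slide's 1-clock and 2-clock costs as defaults; the function name is illustrative.

```python
def avg_branch_clocks(accuracy, clocks_if_right=1, clocks_if_wrong=2):
    """Expected clocks per branch for a predictor that is right `accuracy` of the time."""
    return accuracy * clocks_if_right + (1 - accuracy) * clocks_if_wrong
```

With the static scheme (right about 50% of the time) this gives 1.5 clocks per branch; with the history-based scheme (about 90%) it drops to roughly 1.1.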
Control Hazard: Solution 2
• Redefine branch behavior (takes place after next instruction): “delayed branch”
• Impact: 1 clock cycle per branch instruction if an instruction can be found to put in the “delay slot” (~50% of time)
[Figure: pipeline diagram (IM, Reg, ALU, DM, Reg) over time in clock cycles for add, beq, Misc, Load in instruction order]
Data Hazard on $1: Illustration
• Dependencies backwards in time are hazards
[Figure: pipeline diagram with stages IF, ID/RF, EX, MEM, WB over time in clock cycles for the following instructions, in order:]
  add $1,$2,$3
  sub $4,$1,$3
  and $6,$1,$7
  or $8,$1,$9
  xor $10,$1,$11
Data Hazard: Solution
• “Forward” result from one stage to another
• “or” is OK if the register file is implemented properly (write in the first half of the clock cycle, read in the second half)
[Figure: pipeline diagram with stages IF, ID/RF, EX, MEM, WB over time in clock cycles for the following instructions, in order:]
  add $1,$2,$3
  sub $4,$1,$3
  and $6,$1,$7
  or $8,$1,$9
  xor $10,$1,$11
Data Hazard Even with Forwarding
• Must stall pipeline 1 cycle (insert 1 bubble)
[Figure: pipeline diagram with stages IF, ID/RF, EX, MEM, WB over time in clock cycles, with a bubble inserted after the load, for the following instructions, in order:]
  lw $1, 0($2)
  sub $4,$1,$6
  and $6,$1,$7
  or $8,$1,$9
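The load-use case can be detected by comparing each load with the instruction right after it. The sketch below is in Python and assumes the 5-stage pipeline with full forwarding discussed on these slides; `Insn` and `count_bubbles` are hypothetical names used only for illustration.

```python
from typing import NamedTuple, Optional, Sequence, Tuple

class Insn(NamedTuple):
    op: str                  # mnemonic, e.g. "lw" or "sub"
    dest: Optional[str]      # register written, or None
    srcs: Tuple[str, ...]    # registers read

def count_bubbles(program: Sequence[Insn]) -> int:
    """Bubbles needed with forwarding: one per load whose result is read
    by the very next instruction (the load-use hazard)."""
    bubbles = 0
    for prev, cur in zip(program, program[1:]):
        if prev.op == "lw" and prev.dest is not None and prev.dest in cur.srcs:
            bubbles += 1
    return bubbles

# The sequence from the slide: sub reads $1 right after lw writes it.
LOAD_USE = [
    Insn("lw", "$1", ("$2",)),
    Insn("sub", "$4", ("$1", "$6")),
    Insn("and", "$6", ("$1", "$7")),
    Insn("or", "$8", ("$1", "$9")),
]
```

For `LOAD_USE` the function reports a single bubble: only sub needs $1 before the load has read it from memory, while and and or can receive it by forwarding.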
How to Study for Test 2: Chap 7
• Processor-Memory performance gap: problem for hardware designers and software developers alike
• Memory Hierarchy, the goal: want to create the illusion of a single large, fast memory
  • accesses that hit in the highest level are processed most quickly
  • exploit Principle of Locality to obtain a high hit rate
• Caches vs. Virtual Memory: how are they similar? Different?
Memory Hierarchy: Terminology
• Hit Time: time to access the upper level, which consists of
  • time to determine hit/miss + memory access time
• Miss Penalty: time to replace a block in the upper level + time to deliver the block to the processor
• Note: Hit Time << Miss Penalty [Note: “<<” here means “much less than”]
Issues with Direct-Mapped
  tttttttttttttttttiiiiiiiiiioooo
  • t = tag: to check if have correct block
  • i = index: to select block
  • o = byte offset: within block
• If block size > 1, rightmost bits of index are really the offset within the indexed block
• Q: How do Set-Associative and Fully-Associative Designs Look?
Read from cache at offset, return word b
• Address: Tag field = 000000000000000000, Index field = 0000000001, Offset = 0100
  Index   Valid   Tag   0x0-3   0x4-7   0x8-b   0xc-f
  0       0
  1       1       0     a       b       c       d
  2       0
  3       0
  4       0
  5       0
  6       0
  7       0
  ...     ...
  1022    0
  1023    0
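The lookup in this example can be reproduced in a few lines of Python. The field widths (18-bit tag, 10-bit index, 4-bit offset) are read off the example address; the dictionary-based cache and the function names are assumptions made for the sketch.

```python
TAG_BITS, INDEX_BITS, OFFSET_BITS = 18, 10, 4

def split_address(addr):
    """Split a 32-bit address into (tag, index, byte offset)."""
    offset = addr & ((1 << OFFSET_BITS) - 1)
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

# Only index 1 holds a valid block, as in the example: words a, b, c, d.
CACHE = {1: {"valid": True, "tag": 0, "words": ["a", "b", "c", "d"]}}

def read_word(cache, addr):
    """Return the cached word for addr, or None on a miss."""
    tag, index, offset = split_address(addr)
    line = cache.get(index)
    if line is None or not line["valid"] or line["tag"] != tag:
        return None                      # invalid entry or tag mismatch
    return line["words"][offset // 4]    # 4 bytes per word
```

The example address has index 1 and offset 4, so `read_word(CACHE, 0b10100)` selects the second word of the block at index 1, which is b.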
Miss Rate Versus Block Size
• Figure 7.12 - for direct mapped cache
[Plot: miss rate (0% to 40%) versus block size (4, 16, 64, 256 bytes), one curve per total cache size: 1 KB, 8 KB, 16 KB, 64 KB, 256 KB]
Compromise: N-way Set Associative Cache
• N-way set associative: N cache blocks for each Cache Index
  • like having N direct mapped caches operating in parallel
• Example: 2-way set associative cache
  • Cache Index selects a “set” of 2 blocks from the cache
  • the 2 tags in the set are compared in parallel
  • data is selected based on the tag result (which one matched the address)
• Where is data written? Based on the replacement policy: FIFO, LRU, Random
Improving Cache Performance
• In general, want to minimize Average Access Time = Hit Time x (1 - Miss Rate) + Miss Penalty x Miss Rate (recall Hit Time << Miss Penalty)
• Generally, two ways to look at it:
  • Reduce Miss Rate: larger block size, larger cache, higher associativity
  • Reduce Miss Penalty: reducing DRAM latency, L2 cache approach
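The Average Access Time formula is easy to experiment with. The Python sketch below implements it exactly as written on the slide; the function name and the numbers in the usage note are made up for illustration.

```python
def average_access_time(hit_time, miss_rate, miss_penalty):
    """Average Access Time as defined on the slide:
    Hit Time x (1 - Miss Rate) + Miss Penalty x Miss Rate."""
    return hit_time * (1 - miss_rate) + miss_penalty * miss_rate
```

With a hypothetical hit time of 1 cycle and a miss rate of 5%, cutting the miss penalty from 100 cycles to 20 cycles (for instance by adding an L2 cache) lowers the average from about 5.95 to about 1.95 cycles, while halving the miss rate at a 100-cycle penalty gives about 3.475.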
Virtual Memory has its own terminology
• Each process has its own private “virtual address space” (e.g., 2^32 bytes); the CPU actually generates “virtual addresses”
• Each computer has a “physical address space” (e.g., 128 megabytes of DRAM); also called “real memory”
• Library analogy:
  • virtual address is like the title of a book
  • physical address is the location of the book in the library, as given by its Library of Congress call number
Mapping Virtual to Physical Address (1KB page size)
• Virtual Address (bits 31 to 0): Virtual Page Number = bits 31 to 10, Page Offset = bits 9 to 0
• Translation maps the Virtual Page Number to a Physical Page Number; the Page Offset is unchanged
• Physical Address (bits 29 to 0): Physical Page Number = bits 29 to 10, Page Offset = bits 9 to 0
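The mapping amounts to replacing the upper bits of the address and copying the lower ones. This Python sketch assumes the slide's 1KB page size (10 offset bits) and models the page table as a plain dictionary from virtual to physical page number, which is a simplification; the names are illustrative.

```python
PAGE_OFFSET_BITS = 10                    # 1KB pages: offset is bits 9 to 0
OFFSET_MASK = (1 << PAGE_OFFSET_BITS) - 1

def translate(vaddr, page_table):
    """Translate a virtual address using a {VPN: PPN} page table.

    A missing entry raises KeyError, standing in for a page fault.
    """
    vpn = vaddr >> PAGE_OFFSET_BITS      # bits 31 to 10
    offset = vaddr & OFFSET_MASK         # copied through untranslated
    ppn = page_table[vpn]
    return (ppn << PAGE_OFFSET_BITS) | offset
```

For instance, with a hypothetical table `{3: 7}`, virtual address `3 * 1024 + 5` maps to physical address `7 * 1024 + 5`: the page number changes and the offset 5 is preserved.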
How Translate Fast?
• Observation: since there is locality in pages of data, there must be locality in the virtual addresses of those pages!
• Why not create a cache of virtual to physical address translations to make translation fast? (smaller is faster)
• For historical reasons, such a “page table cache” is called a Translation Lookaside Buffer, or TLB
• TLB organization is the same as Icache or Dcache: direct-mapped or set associative
Access TLB and Cache in Parallel?
• Recall: address translation is only for the virtual page number, not the page offset
• If the cache index bits of the PA “fit within” the page offset of the VA, then the index is not translated, so the cache block can be read while simultaneously accessing the TLB
• “Virtually indexed, physically tagged cache” (avoids aliasing problem)
[Figure: VA split into virtual page number and page offset; PA split into tag, index, ofs, with index and ofs lying within the page offset]
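The “fit within” condition can be checked from the cache geometry alone. The Python sketch below assumes power-of-two sizes and that the index and block-offset bits together address `cache_bytes / associativity` bytes; the function name is illustrative.

```python
def index_fits_in_page_offset(cache_bytes, associativity, page_bytes):
    """True if the cache index + block offset bits lie inside the page offset,
    so the cache can be indexed before the TLB finishes translating."""
    bytes_per_way = cache_bytes // associativity
    return bytes_per_way <= page_bytes
```

For example, with 4KB pages a 16KB direct-mapped cache does not qualify, but a 16KB 4-way set associative cache does, because each way covers only 4KB.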