

  1. William Stallings, Computer Organization and Architecture, 5th Edition. Chapter 11: CPU Structure and Function

  2. Topics • Processor Organization • Register Organization • Instruction Cycle • Instruction Pipelining • The Pentium Processor

  3. CPU Structure • The CPU must be able to: • Fetch instructions (read instructions from memory) • Interpret instructions (decode them) • Fetch data (read the data an instruction needs) • Process data (perform the required operations on it) • Write data (write results back to their destination) • To do this, the CPU needs a small internal memory to hold data and instructions temporarily

  4. CPU With Systems Bus

  5. CPU Internal Structure

  6. Registers • The CPU must have some working space (temporary storage), called registers • Their number and function vary between processor designs • This is one of the major design decisions • Registers form the top level of the memory hierarchy • Two categories: • User-visible registers • Control and status registers

  7. User Visible Registers • General purpose registers • Data registers • Address registers • Condition code registers

  8. User Visible Registers • General purpose registers • May be true general purpose • May be restricted to certain uses • Data registers • e.g. accumulator register • Addressing registers • Segment pointers • Index registers • Stack pointer

  9. General or Special? • Make them general purpose: • Increases flexibility and programmer options • Increases instruction size and complexity • Make them specialized: • Smaller (faster) instructions • Less flexibility • The trend seems to be toward the use of specialized registers

  10. How Many GP Registers? • Usually between 8 and 32 • Fewer registers mean more memory references • Adding more does not noticeably reduce memory references

  11. How Big? • Large enough to hold the largest address • Large enough to hold most data types • It is often possible to combine two data registers to hold a longer value, e.g. in C: • double a; • long int a;

  12. Condition Code Registers • Condition codes are bits set by the CPU hardware as the result of operations • They are sets of individual flag bits, e.g. "the result of the last operation was zero" • At least partially visible to the user • Can be read (implicitly) by programs, e.g. "jump if zero" • Cannot (usually) be set directly by programs
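
As a rough illustration (not part of the original slides), here is a minimal C sketch of how an ALU operation might set a zero flag as a side effect and how a "jump if zero" would read it implicitly; the flag names and the alu_sub helper are invented for this example.

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical condition-code bits (names invented for this example). */
    #define FLAG_Z 0x1u   /* zero: last result was 0       */
    #define FLAG_N 0x2u   /* negative: last result was < 0 */

    static uint8_t flags;                  /* the condition code register */

    /* The ALU operation updates the flags as a side effect. */
    static int32_t alu_sub(int32_t a, int32_t b) {
        int32_t r = a - b;
        flags = 0;
        if (r == 0) flags |= FLAG_Z;
        if (r < 0)  flags |= FLAG_N;
        return r;
    }

    int main(void) {
        alu_sub(5, 5);
        if (flags & FLAG_Z)                /* "jump if zero" reads the flag implicitly */
            printf("branch taken: last result was zero\n");
        return 0;
    }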

  13. Control & Status Registers • Program Counter (PC) • Instruction Register (IR) • Memory Address Register (MAR) • Memory Buffer Register (MBR) • Revision: what do these all do?

  14. Program Status Word (PSW) • A set of bits that includes the condition codes: • Sign: sign of the last arithmetic result • Zero: set when the result is zero • Carry: set on a carry or borrow • Equal: set when a logical compare finds equality • Overflow: indicates arithmetic overflow • Interrupt enable/disable • Supervisor: indicates whether the CPU is executing in supervisor mode or user mode

  15. Program Status Word - Example • The Motorola 68000's PSW is a 16-bit word split into a system byte (bits 15-8) and a user byte (bits 7-0): • bit 15: T (trace mode) • bit 13: S (supervisor status) • bits 10-8: I2 I1 I0 (interrupt mask) • bits 4-0: X N Z V C (extend, negative, zero, overflow, carry condition codes)
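
A minimal C sketch of pulling these fields out of a 16-bit status word, assuming the bit positions listed above; the helper names and the example value are invented for illustration.

    #include <stdint.h>
    #include <stdio.h>

    /* Field extraction for a 68000-style 16-bit status word,
       using the bit positions shown on the slide above. */
    static unsigned trace_bit(uint16_t sr)  { return (sr >> 15) & 1u; }   /* T         */
    static unsigned supervisor(uint16_t sr) { return (sr >> 13) & 1u; }   /* S         */
    static unsigned int_mask(uint16_t sr)   { return (sr >> 8) & 7u; }    /* I2 I1 I0  */
    static unsigned ccr(uint16_t sr)        { return sr & 0x1Fu; }        /* X N Z V C */

    int main(void) {
        uint16_t sr = 0x2704;   /* example: supervisor mode, interrupt mask 7, Z set */
        printf("T=%u S=%u mask=%u CCR=0x%02X\n",
               trace_bit(sr), supervisor(sr), int_mask(sr), ccr(sr));
        return 0;
    }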

  16. Other Registers • May have registers pointing to: • Process control blocks (see O/S) • Interrupt vectors (see O/S) • N.B. CPU design and operating system design are closely linked

  17. Example Register Organizations

  18. Instruction Cycle • An instruction cycle includes the following subcycles: • Fetch cycle: fetch the next instruction • Execute cycle: execute the instruction • Interrupt cycle: check for an interrupt and process it • (State diagram: START → fetch next instruction → execute instruction → if interrupts are enabled, check for and process an interrupt before the next fetch; if interrupts are disabled, return directly to the fetch cycle; HALT ends the cycle)

  19. Indirect Addressing Cycle • Execution may require a memory access to fetch operands • Indirect addressing requires additional memory accesses • It can be thought of as an additional instruction subcycle

  20. Instruction Cycle with Indirect

  21. Instruction Cycle State Diagram

  22. Data Flow (Instruction Fetch) • The exact sequence of events during the instruction cycle depends on the CPU design • In general, for the fetch cycle: • The PC contains the address of the next instruction • This address is moved to the MAR and placed on the address bus • The control unit requests a memory read • The result is placed on the data bus, copied to the MBR, and then moved to the IR • Meanwhile the PC is incremented by 1
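
The register transfers above can be sketched in C roughly as follows (not from the slides); the word-addressed toy memory and the fixed +1 increment are simplifying assumptions.

    #include <stdint.h>

    #define MEM_WORDS 1024
    static uint16_t memory[MEM_WORDS];     /* toy word-addressed main memory   */
    static uint16_t PC, IR, MAR, MBR;      /* the registers named on the slide */

    /* Fetch cycle: PC -> MAR -> (memory read) -> MBR -> IR, then PC + 1. */
    static void fetch(void) {
        MAR = PC;               /* address of the next instruction moved to MAR   */
        MBR = memory[MAR];      /* control unit requests a memory read; the word
                                   arrives over the data bus and is copied to MBR */
        IR  = MBR;              /* instruction moved to IR for decoding           */
        PC  = PC + 1;           /* meanwhile the PC is incremented                */
    }

    int main(void) {
        memory[0] = 0x1234;     /* pretend this word is the next instruction */
        PC = 0;
        fetch();                /* afterwards IR == 0x1234 and PC == 1       */
        return 0;
    }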

  23. Data Flow (Fetch Diagram)

  24. Data Flow (Indirect Cycle) • After the fetch cycle, the control unit examines the IR • If an operand uses indirect addressing, an indirect cycle is performed: • The rightmost N bits of the MBR (the address reference) are transferred to the MAR • The control unit requests a memory read • The result (the address of the operand) is moved to the MBR • (Instruction format: opcode | address)
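
Continuing in the same toy-machine style (again an illustration, not the book's code), the indirect cycle might look like this; the 12-bit address field is an assumed instruction format.

    #include <stdint.h>
    #include <stdio.h>

    #define ADDR_MASK 0x0FFFu   /* assume the low 12 bits of a word hold the address field */

    static uint16_t memory[1024];
    static uint16_t MAR, MBR;

    /* Indirect cycle: the address field in MBR names a memory word that
       holds the actual operand address. */
    static void indirect(void) {
        MAR = MBR & ADDR_MASK;  /* rightmost N bits of MBR transferred to MAR     */
        MBR = memory[MAR];      /* memory read; MBR now holds the operand address */
    }

    int main(void) {
        memory[0x010] = 0x0200; /* the pointer stored at address 0x010                */
        MBR = 0x3010;           /* fetched instruction: opcode 3, address field 0x010 */
        indirect();
        printf("operand address = 0x%04X\n", MBR);   /* prints 0x0200 */
        return 0;
    }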

  25. Data Flow (Indirect Diagram)

  26. Data Flow (Execute Cycle) • The execute cycle may take many forms • It depends on which instruction is being executed • It may include: • Memory read/write • Input/Output • Register transfers • ALU operations

  27. Data Flow (Interrupt Cycle) • The current contents of the PC must be saved so that the CPU can resume normal operation after the interrupt: • The contents of the PC are copied to the MBR • A special memory location (e.g. the one designated by the stack pointer) is loaded into the MAR • The MBR is written to memory • The PC is then loaded with the address of the interrupt-handling routine • The next instruction fetched will be the first instruction of the interrupt handler
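
A rough C sketch of these steps (not from the slides); the SP register used as the "special memory location" and the handler address are assumptions made for the example.

    #include <stdint.h>

    static uint16_t memory[1024];
    static uint16_t PC, MAR, MBR;
    static uint16_t SP = 1023;             /* assumed stack pointer register */
    #define INTR_HANDLER_ADDR 0x0040u      /* assumed address of the handler */

    /* Interrupt cycle: save the PC, then redirect execution to the handler. */
    static void interrupt_cycle(void) {
        MBR = PC;                 /* contents of PC copied to MBR             */
        MAR = SP;                 /* special memory location loaded into MAR  */
        memory[MAR] = MBR;        /* MBR written to memory (old PC saved)     */
        SP  = SP - 1;             /* simple downward-growing stack            */
        PC  = INTR_HANDLER_ADDR;  /* PC loaded with the handler address, so
                                     the next fetch gets the handler's first
                                     instruction                              */
    }

    int main(void) {
        PC = 0x0123;
        interrupt_cycle();
        return 0;
    }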

  28. Data Flow (Interrupt Diagram)

  29. Pipelining • Laundry example: Ann, Brian, Cathy, and Dave each have one load of clothes to wash, dry, and fold • The washer takes 30 minutes • The dryer takes 40 minutes • The "folder" takes 20 minutes

  30. Sequential Laundry • (Timeline: starting at 6 PM, the four loads are done one after another, each taking 30 + 40 + 20 minutes) • Sequential laundry takes 6 hours for 4 loads • If they learned pipelining, how long would laundry take?

  31. Pipelined Laundry • (Timeline: starting at 6 PM, the washer, dryer, and folder work on different loads at the same time; the dryer, the slowest stage, runs back-to-back: 30 + 40 + 40 + 40 + 40 + 20 minutes) • Pipelined laundry takes 3.5 hours for 4 loads
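
To check these figures (arithmetic added here, not on the slides): sequentially, each of the 4 loads takes 30 + 40 + 20 = 90 minutes, so 4 × 90 = 360 minutes = 6 hours. Pipelined, the first wash finishes after 30 minutes, the dryer (the slowest stage) then runs back-to-back on the 4 loads for 4 × 40 = 160 minutes, and the last fold adds 20 minutes: 30 + 160 + 20 = 210 minutes = 3.5 hours.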

  32. Pipelining Lessons (1) • Pipelining doesn't help the latency of a single task; it helps the throughput of the entire workload • The pipeline rate is limited by the slowest pipeline stage • Multiple tasks operate simultaneously • Potential speedup = number of pipe stages • Unbalanced lengths of pipe stages reduce the speedup • The time to "fill" the pipeline and the time to "drain" it also reduce the speedup

  33. Pipelining Lessons (2) • Pipelining does not speed up a single task, but it raises the throughput of the whole system • The improvement is limited by the slowest pipeline stage • The idea: multiple tasks proceed at the same time • Ideal speedup = number of pipeline stages • Unbalanced stage times lower the speedup • The initial fill time and the final drain time also lower the speedup

  34. Instruction Pipelining • Similar to an assembly line in a manufacturing plant: products at various stages can be worked on simultaneously, so performance improves • First attempt: split the instruction cycle into 2 stages: • Fetch • Execution

  35. Prefetch • Fetch accesses main memory • Execution usually does not access main memory • So the next instruction can be fetched during execution of the current instruction • This is called instruction prefetch • Ideally the instruction cycle time would be halved (if the fetch and execute stages took equal time)

  36. Improved Performance (1) • But performance is not doubled: • Fetch is usually shorter than execution • Any jump or branch means that the prefetched instructions are not the required instructions • e.g. ADD A, B ; BEQ NEXT ; ADD B, C ; NEXT: SUB C, D (if the branch is taken, the prefetched ADD B, C is wasted)

  37. Two Stage Instruction Pipeline

  38. Improved Performance (2) • The time lost to branching can be reduced by guessing • Prefetch the instruction that follows the branch instruction in memory • If the branch is not taken, use the prefetched instruction (no time lost) • If the branch is taken, discard the prefetched instruction and fetch the new one

  39. Pipelining • Add more stages to improve performance: more stages, more speedup • FI: Fetch instruction • DI: Decode instruction • CO: Calculate operands (effective addresses) • FO: Fetch operands • EI: Execute instruction • WO: Write result • The various stages are of nearly equal duration • These operations can then be overlapped
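
As an added illustration (assuming every stage takes exactly one time unit and there are no branches or conflicts), this C sketch prints when each of n instructions occupies each of the k = 6 stages; the total comes out to k + (n - 1) time units, which is the basis of the speedup figures on the following slides.

    #include <stdio.h>

    int main(void) {
        const char *stages[] = {"FI", "DI", "CO", "FO", "EI", "WO"};
        const int k = 6;        /* number of pipeline stages */
        const int n = 9;        /* number of instructions    */

        /* With one time unit per stage, instruction i (1-based) occupies
           stage j during time unit i + j - 1. */
        for (int i = 1; i <= n; i++) {
            printf("I%-2d:", i);
            for (int j = 1; j <= k; j++)
                printf(" %s@t%-2d", stages[j - 1], i + j - 1);
            printf("\n");
        }
        printf("total time = k + (n - 1) = %d time units\n", k + n - 1);
        return 0;
    }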

  40. Timing of Pipeline

  41. Speedup of Pipelining (1) • 9 instructions, 6 stages: without pipelining: __ time units; with pipelining: __ time units; speedup = ____ • Q: 100 instructions, 6 stages: speedup = ____ • Q: n → ∞ instructions, k stages: speedup = ____ • Can you prove it (formally)?
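
Worked answers (computed here from the formulas on slide 47, not given on this slide): without pipelining, 9 instructions through 6 stages take 9 × 6 = 54 time units; with pipelining they take 6 + (9 - 1) = 14 time units, so the speedup is 54 / 14 ≈ 3.86. For 100 instructions and 6 stages the speedup is 600 / 105 ≈ 5.71. In general, n instructions on k stages give a speedup of nk / [k + (n - 1)], which approaches k as n grows without bound.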

  42. Pipelining - Discussion • Not every instruction needs all stages, e.g. a LOAD does not need WO • Assume all stages can be performed in parallel, e.g. no memory conflicts among FI, FO, and WO (which all access memory) • To simplify the pipeline hardware, the timing is set up assuming every instruction requires all stages • Assume there are no conditional branch instructions

  43. Limitation by Branching • Conditional branch instructions can invalidate several instruction prefetches • In our example (see the next slide): • Instruction 3 is a conditional branch to instruction 15 • The next instruction's address is not known until instruction 3 has executed (at time unit 7) • The pipeline must be cleared • No instruction finishes during time units 9 to 12, which is the performance penalty

  44. Branch in a Pipeline

  45. Limitation by Data Dependencies • Data needed by the current instruction may be produced by a previous instruction that is still in the pipeline • E.g. A ← B + C followed by D ← A + E (the second instruction must wait until A has been written)

  46. Limitation by Stage Overhead • Ideally, more stages means more speedup • However: • There is more overhead in moving data between stage buffers • More overhead in the preparation and delivery functions • And the pipeline hardware requires more complex circuitry

  47. Pipeline Performance • Cycle time: τ = max[τ_i] + d = τ_m + d, where τ_m is the maximum stage delay, k is the number of stages, and d is the time delay of a latch between stages • Time to execute n instructions with pipelining: Tk = [k + (n - 1)]τ • Time to execute n instructions without pipelining: T1 = nkτ • Speedup: Sk = T1 / Tk = nkτ / [k + (n - 1)]τ = nk / [k + (n - 1)]
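
A minimal C sketch of these formulas (example values chosen here, not from the slides); since equal stage delays are assumed, the τ factor cancels in the speedup.

    #include <stdio.h>

    /* Speedup of a k-stage pipeline over no pipelining for n instructions,
       assuming equal stage delays (the tau factor cancels out). */
    static double speedup(double n, double k) {
        double t1 = n * k;           /* T1 = n * k * tau, in units of tau  */
        double tk = k + (n - 1.0);   /* Tk = [k + (n - 1)] * tau           */
        return t1 / tk;              /* Sk = T1 / Tk = nk / [k + (n - 1)]  */
    }

    int main(void) {
        printf("n = 9,   k = 6: S = %.2f\n", speedup(9, 6));      /* ~3.86 */
        printf("n = 100, k = 6: S = %.2f\n", speedup(100, 6));    /* ~5.71 */
        printf("n = 1e6, k = 6: S = %.2f\n", speedup(1e6, 6));    /* approaches k = 6 */
        return 0;
    }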

  48. Pipeline Performance • Speedup of a k-stage pipeline compared to no pipelining: Sk = nk / [k + (n - 1)] • Q: n → ∞ instructions, k stages: speedup = ____
