CSC 2405 Computer Systems II

CSC 2405 Computer Systems II Advanced Topics

Instruction Set Architecture

Instruction Set Architecture
[Figure: abstraction layers: Application Program, Compiler, OS, ISA, CPU Design, Circuit Design, Chip Layout]
• Assembly Language View
  • Processor state
    • Registers, memory, …
  • Instructions
    • addl, movl, leal, …
    • How instructions are encoded as bytes
• Layer of Abstraction
  • Above: how to program machine
    • Processor executes instructions in a sequence
  • Below: what needs to be built
    • Use variety of tricks to make it run fast
    • E.g., execute multiple instructions simultaneously
Chapter 4

Instruction Set Architectures: Basic ISA Classes
The effect of the different ISA classes is easiest to see in the examples, each of which implements the sequence for C = A + B. Register-based architectures are the class that won out. In general, the more registers on the CPU, the better.
[Figure: example code sequences for each class not reproduced]

80x86 Instruction Frequency
[Figure: instruction frequency chart not reproduced]

Relative Frequency of Control Instructions
[Figure: frequency chart not reproduced]
• Design hardware to handle branches quickly, since branches are the most frequent control instructions

CISC Instruction Sets
• Complex Instruction Set Computer
• Dominant style through mid-80’s
• Stack-oriented instruction set
  • Use stack to pass arguments, save program counter
  • Explicit push and pop instructions
• Arithmetic instructions can access memory
  • addl %eax, 12(%ebx,%ecx,4)
    • Requires memory read and write
    • Complex address calculation
• Condition codes
  • Set as side effect of arithmetic and logical instructions
• Philosophy
  • Add instructions to perform “typical” programming tasks
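The "complex address calculation" for an operand such as 12(%ebx,%ecx,4) can be sketched as displacement + base + index × scale. The function name and the register contents below are hypothetical and chosen only for illustration.

```python
def effective_address(disp, base, index, scale):
    """Address named by the AT&T-syntax operand disp(base,index,scale)."""
    if scale not in (1, 2, 4, 8):
        raise ValueError("scale must be 1, 2, 4, or 8")
    # Wrap to 32 bits, matching a 32-bit address space.
    return (disp + base + index * scale) & 0xFFFFFFFF


# Hypothetical register contents, for illustration only.
ebx, ecx = 0x1000, 3
print(hex(effective_address(12, ebx, ecx, 4)))  # prints 0x1018
```

The addl instruction on the slide would read the word at this address, add %eax to it, and write the sum back, which is why it needs both a memory read and a memory write.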

RISC Instruction Sets
• Reduced Instruction Set Computer
• Internal project at IBM, later popularized by Hennessy (Stanford) and Patterson (Berkeley)
• Fewer, simpler instructions
  • Might take more to get given task done
  • Can execute them with small and fast hardware
• Register-oriented instruction set
  • Many more (typically 32) registers
  • Use for arguments, return pointer, temporaries
• Only load and store instructions can access memory
  • Similar to Y86 mrmovl and rmmovl
• No condition codes
  • Test instructions return 0/1 in register

Example RISC Instruction Formats
• Register-Register (R-type): ADD R1, R2, R3
  • Fields, high bits to low: Op (31-26), rs1 (25-21), rs2 (20-16), rd (15-11), func (low-order bits; the figure marks further boundaries at bits 10, 6, 5, and 0)
  • Used for ALU reg. operations, read/write of special registers, and moves
• Register-Immediate (I-type): SUB R1, R2, #3
  • Fields, high bits to low: Op (31-26), rs1 (25-21), rd (20-16), immediate (15-0)
  • Used for ALU imm. operations, loads and stores, conditional branch, jump (and link)
• Jump / Call (J-type): JUMP end
  • Fields, high bits to low: Op (31-26), offset added to PC (25-0)
  • Used for jump, jump and link, trap and return from exception
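As a rough sketch of how the I-type layout packs into a 32-bit word (Op in bits 31-26, rs1 in 25-21, rd in 20-16, immediate in 15-0), the functions below encode and decode one instruction. The opcode value 0x0A for SUB is hypothetical, and treating the immediate as a signed 16-bit value is an assumption; the slide specifies neither.

```python
def encode_i_type(op, rs1, rd, imm):
    """Pack an I-type instruction into a 32-bit word."""
    if not (0 <= op < 64 and 0 <= rs1 < 32 and 0 <= rd < 32):
        raise ValueError("field out of range")
    if not (-32768 <= imm <= 32767):
        raise ValueError("immediate does not fit in 16 bits")
    return (op << 26) | (rs1 << 21) | (rd << 16) | (imm & 0xFFFF)


def decode_i_type(word):
    """Unpack a 32-bit word into (op, rs1, rd, imm)."""
    op = (word >> 26) & 0x3F
    rs1 = (word >> 21) & 0x1F
    rd = (word >> 16) & 0x1F
    imm = word & 0xFFFF
    if imm & 0x8000:  # assumed: immediate is sign-extended
        imm -= 0x10000
    return op, rs1, rd, imm


SUB_OPCODE = 0x0A  # hypothetical opcode, not taken from any real ISA
word = encode_i_type(SUB_OPCODE, rs1=2, rd=1, imm=3)  # SUB R1, R2, #3
print(decode_i_type(word))  # prints (10, 2, 1, 3)
```

Because every field sits at a fixed bit position, decoding needs only shifts and masks. This is one reason fixed-format RISC instructions are easy to decode in hardware.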

CISC vs. RISC
• Original Debate
  • Strong opinions!
  • CISC proponents: easy for compiler, fewer code bytes
  • RISC proponents: better for optimizing compilers, can make run fast with simple chip design
• Current Status
  • For desktop processors, choice of ISA not a technical issue
    • With enough hardware, can make anything run fast
    • Code compatibility more important
  • For embedded processors, RISC makes sense
    • Smaller, cheaper, less power

Logic Design

Overview of Logic Design
• Fundamental Hardware Requirements
  • Communication
    • How to get values from one place to another
  • Computation
  • Storage
• Bits are Our Friends
  • Everything expressed in terms of values 0 and 1
  • Communication
    • Low or high voltage on wire
  • Computation
    • Compute Boolean functions
  • Storage
    • Store bits of information

Digital Signals
[Figure: voltage versus time, with the signal read as 0, 1, 0]
• Use voltage thresholds to extract discrete values from continuous signal
• Simplest version: 1-bit signal
  • Either high range (1) or low range (0)
  • With guard range between them
• Not strongly affected by noise or low quality circuit elements
  • Can make circuits simple, small, and fast

Computing with Logic Gates
[Figure: voltage versus time for inputs a and b and output a && b, showing rising delay and falling delay]
• Outputs are Boolean functions of inputs
• Respond continuously to changes in inputs
  • With some small delay

Combinational Circuits
[Figure: acyclic network connecting primary inputs to primary outputs]
• Acyclic Network of Logic Gates
  • Continuously responds to changes on primary inputs
  • Primary outputs become (after some delay) Boolean functions of primary inputs

Bit Equality
[Figure: "Bit equal" block with inputs a and b and output eq]
• Generate 1 if a and b are equal
• Hardware Control Language (HCL)
  • Very simple hardware description language
    • Boolean operations have syntax similar to C logical operations
  • We’ll use it to describe control logic for processors
• HCL Expression
  bool eq = (a&&b)||(!a&&!b)

Word Equality
[Figure: 32 "Bit equal" blocks comparing a31/b31 down to a0/b0, with outputs eq31 to eq0 combined into Eq; word-level representation shows A and B feeding one block labeled = with output Eq]
• Word-Level Representation
  • 32-bit word size
• HCL representation
  • Equality operation
  • Generates Boolean value
• HCL Representation
  bool Eq = (A == B)
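The relation between the bit-level and word-level views can be sketched in Python. Here bit_eq mirrors the HCL expression (a&&b)||(!a&&!b), and word_eq ANDs together the 32 per-bit results. The function names are illustrative, not part of HCL.

```python
def bit_eq(a, b):
    """1-bit equality, written the same way as the HCL expression."""
    return bool((a and b) or (not a and not b))


def word_eq(a, b, width=32):
    """Word equality: every pair of corresponding bits must be equal."""
    return all(bit_eq((a >> i) & 1, (b >> i) & 1) for i in range(width))


print(word_eq(0x12345678, 0x12345678))  # prints True
print(word_eq(0x12345678, 0x12345679))  # prints False
```

For values that fit in the word size, word_eq(A, B) agrees with the HCL expression A == B. The bit-level version shows what the single word-level block stands for.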

1-Bit Latch
D Latch
[Figure: D latch with Data (D) and Clock (C) inputs, internal R and S signals, and outputs Q+ and Q–; two panels show the latch in Latching mode and in Storing mode]

Registers
[Figure: structure of an 8-bit register built from eight latches, inputs i7 to i0, outputs o7 to o0, shared Clock; word-level symbol with input I, output O, and Clock]
• Stores word of data
  • Different from program registers seen in assembly code
• Collection of edge-triggered latches
• Loads input on rising edge of clock
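A minimal behavioral sketch of "loads input on rising edge of clock": the output changes only when the clock goes from 0 to 1, and otherwise holds its previous value. The class and method names are hypothetical.

```python
class Register:
    """Behavioral model of an edge-triggered hardware register."""

    def __init__(self, value=0):
        self.output = value
        self._prev_clock = 0

    def tick(self, clock, data_in):
        """Present clock level and input; return the current output."""
        if self._prev_clock == 0 and clock == 1:
            # Rising edge: capture the input.
            self.output = data_in
        self._prev_clock = clock
        return self.output


r = Register()
print(r.tick(0, 5))  # prints 0 (clock low, input ignored)
print(r.tick(1, 5))  # prints 5 (rising edge loads the input)
print(r.tick(1, 9))  # prints 5 (clock still high, no new edge)
```

The model ignores setup and hold times and propagation delay; it only captures the logical behavior described on the slide.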

Random-Access Memory
[Figure: register file with read ports A (srcA, valA) and B (srcB, valB), write port W (dstW, valW), and Clock]
• Stores multiple words of memory
  • Address input specifies which word to read or write
• Register file
  • Holds values of program registers
    • %eax, %esp, etc.
  • Register identifier serves as address
    • ID 8 implies no read or write performed
• Multiple Ports
  • Can read and/or write multiple words in one cycle
    • Each has separate address and data input/output

Basic Logic Gates
[Figure: gate symbols not reproduced]
NOTE: it is okay to use just a circle for NOT.

More than 2 Inputs?
• AND/OR can take any number of inputs.
  • AND = 1 if all inputs are 1.
  • OR = 1 if any input is 1.
  • Similar for NAND/NOR.
• Can implement with multiple two-input gates

Logical Completeness
• Can implement ANY truth table with AND, OR, NOT.
  1. AND the input combinations that yield a "1" in the truth table.
  2. OR the results of the AND gates.
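The two-step recipe is the sum-of-products construction. As a sketch, the function below builds a circuit from a truth table using only and, or, and not. The function name and the XOR example table are illustrative.

```python
def sum_of_products(truth_table):
    """Build a function from a truth table using only AND, OR, NOT.

    truth_table maps tuples of 0/1 inputs to a 0/1 output.
    """
    # Rows whose output is 1; each one becomes an AND gate.
    one_rows = [row for row, out in truth_table.items() if out == 1]

    def circuit(*inputs):
        # Step 1: one AND gate per row that yields 1. An input is
        # inverted (NOT) where the row has a 0 in that position.
        and_outputs = []
        for row in one_rows:
            term = True
            for want, got in zip(row, inputs):
                term = term and (bool(got) if want else (not got))
            and_outputs.append(term)
        # Step 2: OR the results of the AND gates.
        result = False
        for term in and_outputs:
            result = result or term
        return int(result)

    return circuit


xor_table = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}
xor = sum_of_products(xor_table)
print([xor(a, b) for a, b in xor_table])  # prints [0, 1, 1, 0]
```

The result is correct for any truth table but is usually not the smallest circuit; minimization is a separate step.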

DeMorgan's Law
• Converting AND to OR (with some help from NOT)
• Consider the following gate:
  [Figure: gate diagram not reproduced]
• To convert AND to OR (or vice versa), invert inputs and output.
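The rule "invert inputs and output" can be checked exhaustively, since two Boolean inputs give only four cases. The helper names below are illustrative.

```python
from itertools import product


def and_via_or(a, b):
    """AND built from an OR gate with inverted inputs and output."""
    return not ((not a) or (not b))


def or_via_and(a, b):
    """OR built from an AND gate with inverted inputs and output."""
    return not ((not a) and (not b))


for a, b in product([False, True], repeat=2):
    assert and_via_or(a, b) == (a and b)
    assert or_via_and(a, b) == (a or b)
print("DeMorgan's law holds for all four input combinations")
```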

Decoder
• n inputs, 2^n outputs
• Exactly one output is 1 for each possible input pattern
[Figure: 2-bit decoder]
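A behavioral sketch of an n-input decoder: the input bits select which one of the 2^n outputs is 1. Listing the input bits most significant first is an assumption made for this example.

```python
def decode(bits):
    """n-input decoder: returns 2**n outputs, exactly one of which is 1.

    bits is a sequence of 0/1 values, most significant bit first
    (an ordering chosen for this sketch).
    """
    index = 0
    for b in bits:
        index = (index << 1) | b
    return [1 if i == index else 0 for i in range(2 ** len(bits))]


print(decode([1, 0]))  # prints [0, 0, 1, 0]
```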

Sequential Processors

Sequential HW Structure
[Figure: sequential datapath, bottom to top: Fetch (instruction memory, PC increment), Decode (register file), Execute (ALU, CC), Memory (data memory), Write back, and PC update; signals shown include icode, ifun, rA, rB, valC, valP, srcA, srcB, valA, valB, aluA, aluB, valE, Bch, Addr, Data, valM, newPC]
• State
  • Program counter register (PC)
  • Condition code register (CC)
  • Register File
  • Memories
    • Access same memory space
    • Data: for reading/writing program data
    • Instruction: for reading instructions
• Instruction Flow
  • Read instruction at address specified by PC
  • Process through stages
  • Update program counter

Sequential Stages
[Figure: same sequential datapath as on the previous slide]
• Fetch
  • Read instruction from instruction memory
• Decode
  • Read program registers
• Execute
  • Compute value or address
• Memory
  • Read or write data
• Write Back
  • Write program registers
• PC
  • Update program counter

Instruction Decoding
[Figure: instruction bytes split into icode, ifun, rA, rB, and valC; the register byte and the constant word are marked Optional]
• Instruction Format
  • Instruction byte icode:ifun
  • Optional register byte rA:rB
  • Optional constant word valC
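The format above can be sketched as a byte-splitting function. Which instructions carry the optional parts depends on icode, so the sketch takes that as two flags rather than assuming a table. It also assumes a 4-byte little-endian valC, as in 32-bit Y86. The function name and the sample bytes are illustrative.

```python
def split_instruction(raw, has_regs, has_valc):
    """Split raw instruction bytes into icode, ifun, rA, rB, valC.

    has_regs / has_valc say whether the optional register byte and
    constant word are present (in hardware this follows from icode).
    """
    icode, ifun = raw[0] >> 4, raw[0] & 0xF  # high and low 4 bits
    pos = 1
    rA = rB = valC = None
    if has_regs:
        rA, rB = raw[pos] >> 4, raw[pos] & 0xF
        pos += 1
    if has_valc:
        # Assumed: 4-byte constant stored least significant byte first.
        valC = int.from_bytes(raw[pos:pos + 4], "little")
        pos += 4
    return {"icode": icode, "ifun": ifun, "rA": rA, "rB": rB,
            "valC": valC, "length": pos}


# Illustrative bytes: icode 3, ifun 0, rA 8, rB 3, constant 0x100.
fields = split_instruction(bytes.fromhex("308300010000"), True, True)
print(fields["icode"], fields["rA"], fields["rB"], hex(fields["valC"]))
# prints 3 8 3 0x100
```

The returned length is what a fetch stage would add to the PC to get the address of the next sequential instruction.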

Sequential Summary
• Implementation
  • Express every instruction as series of simple steps
  • Follow same general flow for each instruction type
  • Assemble registers, memories, predesigned combinational blocks
  • Connect with control logic
• Limitations
  • Too slow to be practical
  • In one cycle, must propagate through instruction memory, register file, ALU, and data memory
  • Would need to run clock very slowly
  • Hardware units only active for fraction of clock cycle

Pipelined Processors

What is Pipelining
• Computers execute billions of instructions, so instruction throughput is what matters
• IDEA: Divide instruction execution up into several pipeline stages. For example: IF, ID, EX, MEM, WB
• Simultaneously have different instructions in different pipeline stages
• The length of the longest pipeline stage determines the cycle time
• Desirable pipeline features (e.g., RISC):
  • All instructions same length
  • Registers located in same place in instruction format
  • Memory operands only in loads or stores

What Is Pipelining
Laundry Example
• Ann, Brian, Cathy, Dave (A, B, C, D) each have one load of clothes to wash, dry, and fold
• Washer takes 30 minutes
• Dryer takes 40 minutes
• “Folder” takes 20 minutes

What Is Pipelining
[Figure: timeline from 6 PM to midnight; loads A, B, C, D in task order, each taking 30, 40, and 20 minutes, run one after another]
• Sequential laundry takes 6 hours for 4 loads
• If they learned pipelining, how long would laundry take?

What Is Pipelining
Start work ASAP
[Figure: timeline starting at 6 PM; loads A, B, C, D overlap, with segments of 30, 40, 40, 40, 40, and 20 minutes]
• Pipelined laundry takes 3.5 hours for 4 loads

What Is Pipelining
Pipelining Lessons
[Figure: pipelined laundry timeline from 6 PM to 9 PM for loads A, B, C, D, with segments of 30, 40, 40, 40, 40, and 20 minutes]
• Pipelining doesn’t help latency of single task, it helps throughput of entire workload
• Pipeline rate limited by slowest pipeline stage
• Multiple tasks operating simultaneously
• Potential speedup = number of pipe stages
• Unbalanced lengths of pipe stages reduce speedup
• Time to “fill” pipeline and time to “drain” it reduces speedup
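The laundry numbers can be reproduced with a small timing model. The sketch assumes each machine handles one load at a time and that a finished load can wait until the next machine is free. The function names are illustrative.

```python
def sequential_time(stages, n_tasks):
    """Total minutes when each task finishes before the next starts."""
    return n_tasks * sum(stages)


def pipelined_time(stages, n_tasks):
    """Total minutes when each stage starts as soon as it can."""
    free_at = [0] * len(stages)  # time at which each stage becomes free
    finish = 0
    for _ in range(n_tasks):
        t = 0  # time at which this task is ready for its next stage
        for s, duration in enumerate(stages):
            start = max(t, free_at[s])  # wait for the stage if busy
            t = start + duration
            free_at[s] = t
        finish = t
    return finish


laundry = [30, 40, 20]  # washer, dryer, folder (minutes)
print(sequential_time(laundry, 4) / 60)  # prints 6.0
print(pipelined_time(laundry, 4) / 60)   # prints 3.5
```

The speedup here is 6 / 3.5, about 1.7, well below the three-stage ideal of 3. The 40-minute dryer sets the rate, and filling and draining the pipeline add the leading 30 and trailing 20 minutes.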

Real-World Pipelines: Car Washes
[Figure: three car wash layouts labeled Sequential, Parallel, and Pipelined]
• Idea
  • Divide process into independent stages
  • Move objects through stages in sequence
  • At any given time, multiple objects are being processed

Pipeline Diagrams
[Figure: timing diagrams over Time for OP1, OP2, OP3, each passing through stages A, B, C]
• Unpipelined
  • Cannot start new operation until previous one completes
• 3-Way Pipelined
  • Up to 3 operations in process simultaneously

Data Dependencies
[Figure: combinational logic feeding a register (Reg) driven by Clock, with the register output fed back; timing of OP1, OP2, OP3 over Time]
• System
  • Each operation depends on result from preceding one

Data Hazards
[Figure: three-stage pipeline of Comb. logic A, B, C, each followed by a register (Reg) driven by Clock; timing of OP1 to OP4 over Time, each passing through A, B, C]
• Result does not feed back around in time for next operation
• Pipelining has changed behavior of system

One Memory Port/Structural Hazards
[Figure: pipeline diagram over Time (clock cycles), Cycle 1 to Cycle 7, in instruction order: Load, Instr 1, Instr 2, Instr 3, Instr 4, each passing through Ifetch, Reg, ALU, DMem, Reg]

One Memory Port/Structural Hazards
[Figure: pipeline diagram over Time (clock cycles), Cycle 1 to Cycle 7, in instruction order: Load, Instr 1, Instr 2, Stall, Instr 3; the stall appears as a row of Bubbles]
• How do you “bubble” the pipe?

Data Hazard on R1
[Figure: pipeline diagram over Time (clock cycles) with stages IF, ID/RF, EX, MEM, WB for the instruction sequence below]
  add r1,r2,r3
  sub r4,r1,r3
  and r6,r1,r7
  or r8,r1,r9
  xor r10,r1,r11

Three Generic Data Hazards
• Read After Write (RAW): InstrJ tries to read operand before InstrI writes it
    I: add r1,r2,r3
    J: sub r4,r1,r3
• Caused by a “Dependence” (in compiler nomenclature). This hazard results from an actual need for communication.

Three Generic Data Hazards
• Write After Read (WAR): InstrJ writes operand before InstrI reads it
    I: sub r4,r1,r3
    J: add r1,r2,r3
    K: mul r6,r1,r7
• Called an “anti-dependence” by compiler writers. This results from reuse of the name “r1”.

Three Generic Data Hazards
• Write After Write (WAW): InstrJ writes operand before InstrI writes it
    I: sub r1,r4,r3
    J: add r1,r2,r3
    K: mul r6,r1,r7
• Called an “output dependence” by compiler writers. This also results from the reuse of name “r1”.
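The three hazard types can be told apart by comparing the registers of an earlier instruction I with those of a later instruction J. The sketch below represents each instruction as a (destination, sources) pair, following the slides' convention that the first operand is the destination. The function name and this representation are illustrative.

```python
def classify_hazards(instr_i, instr_j):
    """Return the hazard types a later instruction J can have with I.

    Each instruction is a (dest, sources) pair of register names.
    """
    dest_i, srcs_i = instr_i
    dest_j, srcs_j = instr_j
    hazards = set()
    if dest_i in srcs_j:
        hazards.add("RAW")  # J reads what I writes
    if dest_j in srcs_i:
        hazards.add("WAR")  # J writes what I reads
    if dest_i == dest_j:
        hazards.add("WAW")  # J writes what I writes
    return hazards


# The I/J pairs from the three slides above.
print(classify_hazards(("r1", ["r2", "r3"]), ("r4", ["r1", "r3"])))  # {'RAW'}
print(classify_hazards(("r4", ["r1", "r3"]), ("r1", ["r2", "r3"])))  # {'WAR'}
print(classify_hazards(("r1", ["r4", "r3"]), ("r1", ["r2", "r3"])))  # {'WAW'}
```

This only identifies potential hazards from register names. Whether one actually causes a wrong result depends on the pipeline's stage timing.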

Data Forwarding
• Naïve Pipeline
  • Register isn’t written until completion of write-back stage
  • Source operands read from register file in decode stage
    • Needs to be in register file at start of stage
• Observation
  • Value generated in execute or memory stage
• Trick
  • Pass value directly from generating instruction to decode stage
  • Needs to be available at end of decode stage
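The forwarding trick amounts to a selection step in decode: if an instruction further down the pipeline is about to write the source register, take its value instead of the stale one in the register file. The sketch is a simplification. The function name, the (register, value) representation, and checking the execute stage before the memory stage are assumptions; execute is checked first because it holds the more recent instruction.

```python
def select_operand(src, regfile, exec_result, mem_result):
    """Pick the value for source register src during decode.

    exec_result and mem_result are (dest_reg, value) pairs for the
    instructions currently in the execute and memory stages, or None.
    """
    # The execute stage holds the more recent instruction, so it
    # takes priority when both stages target the same register.
    for forwarded in (exec_result, mem_result):
        if forwarded is not None and forwarded[0] == src:
            return forwarded[1]
    # No pending write to src: the register file value is current.
    return regfile[src]


regfile = {"r1": 0, "r2": 5, "r3": 7}
# add r1,r2,r3 is in execute and has just computed 12 for r1.
print(select_operand("r1", regfile, ("r1", 12), None))  # prints 12
print(select_operand("r2", regfile, ("r1", 12), None))  # prints 5
```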

Forwarding to Avoid Data Hazard
[Figure: pipeline diagram over Time (clock cycles) for the instruction sequence below, with forwarding paths between stages]
  add r1,r2,r3
  sub r4,r1,r3
  and r6,r1,r7
  or r8,r1,r9
  xor r10,r1,r11
