
EECS 470


Presentation Transcript


  1. EECS 470 Computer Architecture Lecture 3 Coverage: Appendix A

  2. Role of the Compiler • The primary user of the instruction set • Exceptions: getting less common • Some device drivers; specialized library routines • Some small embedded systems (synthesized arch) • Compilers must: • generate a correct translation into machine code • Compilers should: • fast compile time; generate fast code • While we are at it: • generate reasonable code size; good debug support

  3. Structure of Compilers • Front-end: translate high level semantics to some generic intermediate form • Intermediate form does not have any resource constraints, but uses simple instructions. • Back-end: translates intermediate form into assembly/machine code for target architecture • Resource allocation; code optimization under resource constraints • Architects are mostly concerned with optimization

  4. Typical optimizations: CSE • Common sub-expression elimination c = array1[d+e] / array2[d+e]; becomes i = d+e; c = array1[i] / array2[i]; • Purpose: • reduce instructions / faster code • Architectural issues: • more register pressure
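The before/after forms can be sketched as two equivalent Python functions (an illustrative sketch; the function names and test values are made up, not from the slides):

```python
def divide_unoptimized(array1, array2, d, e):
    # The sub-expression d + e is evaluated twice, once per index.
    return array1[d + e] / array2[d + e]

def divide_cse(array1, array2, d, e):
    # After CSE the sum is computed once into a temporary i.
    # Fewer instructions, but i is now live across both loads --
    # that longer lifetime is the extra register pressure the
    # slide mentions.
    i = d + e
    return array1[i] / array2[i]
```

Both versions return the same result; only the instruction count and register lifetimes differ.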

  5. Typical optimization: LICM • Loop invariant code motion for (i=0; i<100; i++) { t = 5; array1[i] = t; } • Purpose: • remove statements or expressions from loops that compute the same value on every iteration (loop-invariant), so they execute only once • Architectural issues: • more register pressure
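The hoisting step looks like this in Python (an illustrative sketch mirroring the slide's 100-iteration loop; the function names are made up):

```python
def fill_original():
    array1 = [0] * 100
    for i in range(100):
        t = 5          # loop-invariant: assigns the same value every iteration
        array1[i] = t
    return array1

def fill_after_licm():
    array1 = [0] * 100
    t = 5              # hoisted out of the loop: executed once
    for i in range(100):
        array1[i] = t
    return array1
```

After hoisting, t must stay live (in a register) for the entire loop, which is the register-pressure cost the slide notes.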

  6. Other transformations • Procedure inlining: better inst schedule • greater code size, more register pressure • Loop unrolling: better loop schedule • greater code size, more register pressure • Software pipelining: better loop schedule • greater code size; more register pressure • In general – “global” optimization: faster code • greater code size; more register pressure
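Of these, loop unrolling is the easiest to show in a few lines. A hedged Python sketch (the function name and unroll factor are illustrative, not from the slides):

```python
def sum_unrolled_by_4(a):
    # Unrolling by 4 pays one loop-branch per four additions and
    # exposes four independent adds to the instruction scheduler --
    # at the cost of larger code and more live temporaries
    # (register pressure), exactly the tradeoff listed above.
    total = 0
    i = 0
    n = len(a)
    while i + 4 <= n:
        total += a[i] + a[i + 1] + a[i + 2] + a[i + 3]
        i += 4
    while i < n:       # cleanup loop for the leftover elements
        total += a[i]
        i += 1
    return total
```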

  7. Compiled code characteristics • Optimized code has different characteristics than unoptimized code. • Fewer memory references, but it is generally the “easy ones” that are eliminated • Example: Better register allocation retains active data in register file – these would be cache hits in unoptimized code. • Removing redundant memory and ALU operations leaves a higher ratio of branches in the code • Branch prediction becomes more important • Many optimizations provide better instruction scheduling at the cost of an increase in hardware resource pressure

  8. What do compiler writers want in an instruction set architecture? • More resources: better optimization tradeoffs • Regularity: same behavior in all contexts • no special cases (flags set differently for immediates) • Orthogonality: • data type independent of addressing mode • addressing mode independent of operation performed • Primitives, not solutions: • keep instructions simple • it is easier to compose than to fit. (ex. MMX operations)

  9. What do architects want in an instruction set architecture? • Simple instruction decode: • tends to increase orthogonality • Small structures: • more resource constraints • Small data bus fanout: • tends to reduce orthogonality and regularity • Small instructions: • Make things implicit • non-regular; non-orthogonal; non-primitive

  10. To make faster processors • Make the compiler team unhappy • More aggressive optimization over the entire program • More resource constraints; caches; HW schedulers • Higher expectations: increase IPC • Make hardware design team unhappy • Tighter design constraints (clock) • Execute optimized code with more complex execution characteristics • Make all stages bottlenecks (Amdahl’s law)

  11. Review of basic pipelining • 5 stage “RISC” load-store architecture • About as simple as things get • Instruction fetch: • get instruction from memory/cache • Instruction decode: • translate opcode into control signals and read regs • Execute: • perform ALU operation • Memory: • Access memory if load/store • Writeback/retire: • update register file

  12. Pipelined implementation • Break the execution of the instruction into cycles (5 in this case). • Design a separate datapath stage for the execution performed during each cycle. • Build pipeline registers to communicate between the stages.

  13. Stage 1: Fetch • Design a datapath that can fetch an instruction from memory every cycle. • Use PC to index memory to read instruction • Increment the PC (assume no branches for now) • Write everything needed to complete execution to the pipeline register (IF/ID) • The next stage will read this pipeline register. • Note that the pipeline register must be edge-triggered

  14. [Diagram: Stage 1 (Fetch) datapath — the PC indexes the instruction memory/cache; an adder computes PC + 1, which a MUX feeds back to the PC (en = enable); the instruction bits and PC + 1 are written into the IF/ID pipeline register, which feeds the rest of the pipelined datapath.]

  15. Stage 2: Decode • Design a datapath that reads the IF/ID pipeline register, decodes the instruction, and reads the register file (specified by regA and regB of the instruction bits). • Decode can be easy: just pass on the opcode and let later stages figure out their own control signals for the instruction. • Write everything needed to complete execution to the pipeline register (ID/EX) • Pass on the offset field and both destination register specifiers (or simply pass on the whole instruction!). • Include PC+1 even though decode doesn’t use it.

  16. [Diagram: Stage 2 (Decode) datapath — the IF/ID register supplies the instruction bits; regA and regB index the register file; the contents of regA, the contents of regB, PC + 1, and the control signals are written into the ID/EX pipeline register; the destination register and write data arrive later from writeback.]

  17. Stage 3: Execute • Design a datapath that performs the proper ALU operation for the instruction specified and the values present in the ID/EX pipeline register. • The inputs are the contents of regA and either the contents of regB or the offset field of the instruction. • Also, calculate PC+1+offset in case this is a branch. • Write everything needed to complete execution to the pipeline register (EX/Mem) • ALU result, contents of regB and PC+1+offset • Instruction bits for opcode and destReg specifiers

  18. [Diagram: Stage 3 (Execute) datapath — the ALU takes the contents of regA and a MUX-selected second operand (contents of regB or the offset); an adder computes PC + 1 + offset; the ALU result, contents of regB, PC + 1 + offset, and control signals are written into the EX/Mem pipeline register.]

  19. Stage 4: Memory Operation • Design a datapath that performs the proper memory operation for the instruction specified and the values present in the EX/Mem pipeline register. • ALU result contains address for ld and st instructions. • Opcode bits control memory R/W and enable signals. • Write everything needed to complete execution to the pipeline register (Mem/WB) • ALU result and MemData • Instruction bits for opcode and destReg specifiers

  20. [Diagram: Stage 4 (Memory) datapath — the ALU result addresses the data memory, with R/W and en controlled by the opcode bits; the memory read data, ALU result, and control signals are written into the Mem/WB pipeline register; PC + 1 + offset and the MUX control for the PC input go back to the MUX before the PC in stage 1.]

  21. Stage 5: Write back • Design a datapath that completes the execution of this instruction, writing to the register file if required. • Write MemData to destReg for ld instruction • Write ALU result to destReg for add or nand instructions. • Opcode bits also control register write enable signal.

  22. [Diagram: Stage 5 (Writeback) datapath — a MUX selects either the ALU result or the memory read data from the Mem/WB pipeline register and sends it back to the data input of the register file; another MUX on bits 16-18 and bits 0-2 selects the destination register specifier; the control signals drive the register write enable.]

  23. Sample Code (Simple) • Run the following code on a pipelined datapath:
add 1 2 3    ; reg 3 = reg 1 + reg 2
nand 4 5 6   ; reg 6 = ~(reg 4 & reg 5)
lw 2 4 20    ; reg 4 = Mem[reg 2 + 20]
add 2 5 5    ; reg 5 = reg 2 + reg 5
sw 3 7 10    ; Mem[reg 3 + 10] = reg 7
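This particular sequence has no hazards that change its results, so a simple sequential interpreter predicts the architectural state the pipeline arrives at in the following slides. A Python sketch (the instruction semantics are from the comments above; the initial register and memory values follow slide 25, with Mem[29] = 99 per the lw in slide 31; everything else is illustrative):

```python
def run(program, regs, mem):
    # add a b c:  regs[c] = regs[a] + regs[b]
    # nand a b c: regs[c] = ~(regs[a] & regs[b])   (bitwise NOT-AND)
    # lw a b off: regs[b] = mem[regs[a] + off]
    # sw a b off: mem[regs[a] + off] = regs[b]
    for op, a, b, c in program:
        if op == 'add':
            regs[c] = regs[a] + regs[b]
        elif op == 'nand':
            regs[c] = ~(regs[a] & regs[b])
        elif op == 'lw':
            regs[b] = mem[regs[a] + c]
        elif op == 'sw':
            mem[regs[a] + c] = regs[b]
    return regs, mem

# Initial state from slide 25:
regs = {0: 0, 1: 36, 2: 9, 3: 12, 4: 18, 5: 7, 6: 41, 7: 22}
mem = {29: 99}
program = [('add', 1, 2, 3), ('nand', 4, 5, 6),
           ('lw', 2, 4, 20), ('add', 2, 5, 5), ('sw', 3, 7, 10)]
run(program, regs, mem)
# Final: R3 = 45, R6 = -3, R4 = 99, R5 = 16, Mem[55] = 22
```

These are exactly the values that appear in the datapath snapshots below as each instruction reaches writeback.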

  24. [Diagram: the complete pipelined datapath — PC and instruction memory (stage 1), register file (stage 2), ALU with branch-target adder and eq? comparison (stage 3), data memory (stage 4), and writeback MUX (stage 5), connected through the IF/ID, ID/EX, EX/Mem, and Mem/WB pipeline registers; op, dest, offset, valA, valB, and mdata fields flow between stages.]

  25. Initial State [Diagram: PC = 0; all pipeline registers hold noops; register file holds R0 = 0, R1 = 36, R2 = 9, R3 = 12, R4 = 18, R5 = 7, R6 = 41, R7 = 22.]

  26. Time 1 — Fetch: add 1 2 3 [Diagram: add 1 2 3 enters the IF/ID register; PC advances to 1; the rest of the pipeline still holds noops.]

  27. Time 2 — Fetch: nand 4 5 6 [Diagram: add 1 2 3 is in decode, reading R1 = 36 and R2 = 9; PC advances to 2.]

  28. Time 3 — Fetch: lw 2 4 20 [Diagram: nand 4 5 6 is in decode, reading R4 = 18 and R5 = 7; add 1 2 3 is in execute, computing 36 + 9 = 45; PC advances to 3.]

  29. Time 4 — Fetch: add 2 5 5 [Diagram: lw 2 4 20 is in decode; nand 4 5 6 is in execute, computing ~(18 & 7) = -3; add 1 2 3 is in memory with ALU result 45; PC advances to 4.]

  30. Time 5 — Fetch: sw 3 7 10 [Diagram: add 2 5 5 is in decode; lw 2 4 20 is in execute, computing address 9 + 20 = 29; nand is in memory with result -3; add writes back R3 = 45; PC advances to 5.]

  31. Time 6 — No more instructions [Diagram: sw 3 7 10 is in decode; add 2 5 5 is in execute, computing 9 + 7 = 16; lw is in memory, reading Mem[29] = 99; nand writes back R6 = -3.]

  32. Time 7 [Diagram: sw 3 7 10 is in execute, computing address 45 + 10 = 55; add 2 5 5 is in memory with result 16; lw writes back R4 = 99.]

  33. Time 8 [Diagram: sw 3 7 10 is in memory, writing R7 = 22 to Mem[55]; add 2 5 5 writes back R5 = 16.]

  34. Time 9 [Diagram: sw 3 7 10 completes (stores do not write the register file); the pipeline has drained.]

  35. Time graphs

  Time:  1   2   3   4   5   6   7   8   9
  add    F   D   X   M   W
  nand       F   D   X   M   W
  lw             F   D   X   M   W
  add                F   D   X   M   W
  sw                     F   D   X   M   W

  (F = fetch, D = decode, X = execute, M = memory, W = writeback)
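A quick sanity check on the chart: with k pipeline stages and n instructions (and no stalls), the last instruction finishes at cycle n + k - 1. This is the standard formula, not stated on the slide:

```python
def total_cycles(n_instructions, n_stages):
    # The first instruction takes n_stages cycles to drain through;
    # each subsequent instruction completes one cycle after the
    # previous one (one new completion per cycle once the pipe fills).
    return n_instructions + n_stages - 1

print(total_cycles(5, 5))  # 5 instructions, 5 stages -> 9, matching the chart
```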

  36. What can go wrong? • Data hazards: since register reads occur in stage 2 and register writes occur in stage 5, it is possible to read the wrong value if it is about to be written. • Control hazards: A branch instruction may change the PC, but not until stage 4. What do we fetch before that? • Exceptions: How do you handle exceptions in a pipelined processor with 5 instructions in flight?
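The data-hazard case can be made concrete with a toy model (this is not the lecture's datapath, just a sketch of the timing): registers are read in stage 2 but written in stage 5, so with no forwarding of any kind an instruction sees a producer's result only if the producer is at least three instructions ahead of it:

```python
def run_without_forwarding(program, regs):
    # program: list of (dest, src1, src2) add instructions.
    # A write commits 3 cycles after its decode-stage read,
    # modeling the stage-2-read / stage-5-write gap.
    pending = []                       # (commit_cycle, reg, value)
    for cycle, (dest, s1, s2) in enumerate(program):
        for _, r, v in [p for p in pending if p[0] <= cycle]:
            regs[r] = v                # writeback finally lands
        pending = [p for p in pending if p[0] > cycle]
        value = regs[s1] + regs[s2]    # decode-stage read: may be stale!
        pending.append((cycle + 3, dest, value))
    for _, r, v in pending:            # drain the pipeline
        regs[r] = v
    return regs

regs = run_without_forwarding([(3, 1, 2), (4, 3, 3)],
                              {1: 10, 2: 20, 3: 0, 4: 0})
# R3 correctly becomes 30, but R4 = 0 + 0 = 0 instead of 60:
# the second add read R3 before the first add wrote it back.
```

Forwarding, stalling, and branch-resolution strategies that fix these hazards are the subject of the next lectures.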
