Advanced Computer Architecture 5MD00

RISC Instruction Set Implementation Alternatives == using MIPS as example == Advanced Computer Architecture 5MD00 Nov 2013 Henk Corporaal

Topics • MIPS ISA: Instruction Set Architecture • MIPS single cycle implementation • MIPS multi-cycle implementation • MIPS pipelined implementation • Pipeline hazards • Recap of RISC principles • Other architectures • Based on the book: Computer Organization and Design, ch. 2-4 (3rd, 4th or 5th ed.) • Many slides; I'll go quickly and skip some

Main Types of Instructions • Arithmetic • Integer • Floating Point • Memory access instructions • Load & Store • Control flow • Jump • Conditional Branch • Call & Return

MIPS arithmetic • Most instructions have 3 operands • Operand order is fixed (destination first) • Example: C code: A = B + C; MIPS code: add $s0, $s1, $s2 ($s0, $s1 and $s2 are associated with variables by the compiler)

MIPS arithmetic
C code:
    A = B + C + D;
    E = F - A;
MIPS code:
    add $t0, $s1, $s2
    add $s0, $t0, $s3
    sub $s4, $s5, $s0
• Operands must be registers, only 32 registers provided
• Design Principle: smaller is faster. Why?

Registers vs. Memory • Arithmetic instruction operands must be registers; only 32 registers are provided • Compiler associates variables with registers • What about programs with lots of variables? [Figure: CPU with register file, connected to memory and I/O]

Register allocation • Compiler tries to keep as many variables in registers as possible • Some variables cannot be allocated to registers • large arrays (too few registers) • aliased variables (variables accessible through pointers in C) • dynamically allocated variables • heap • stack • Compiler may run out of registers => spilling (register contents are temporarily stored in memory)

Memory Organization • Viewed as a large, single-dimension array, with an address • A memory address is an index into the array • "Byte addressing" means that successive addresses are one byte apart [Figure: memory as an array of bytes; addresses 0, 1, 2, ... each hold 8 bits of data]

Memory Organization • Bytes are nice, but most data items use larger "words" • For MIPS, a word is 32 bits or 4 bytes • 2^32 bytes with byte addresses from 0 to 2^32-1 • 2^30 words with byte addresses 0, 4, 8, ... 2^32-4 • Registers hold 32 bits of data [Figure: memory as an array of words; byte addresses 0, 4, 8, 12, ... each hold 32 bits of data]

Memory layout: Alignment • Words are aligned • What are the 2 least significant bits of a word address? [Figure: memory rows of 4 bytes (bit positions 31 down to 0) at addresses 0, 4, 8, ... 24; one marked word is aligned, the others are not]
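The alignment question can be checked with a few lines of Python. This is a minimal sketch, not part of the original slides; the function names `is_word_aligned` and `word_index` are made up for illustration.

```python
def is_word_aligned(address: int) -> bool:
    """A 32-bit word is aligned when its byte address is a multiple of 4,
    i.e. when the two least significant address bits are both 0."""
    return (address & 0b11) == 0


def word_index(address: int) -> int:
    """Index of the word that contains this byte address (address // 4)."""
    return address >> 2


# Byte addresses 0..12 that can hold an aligned word.
aligned = [a for a in range(13) if is_word_aligned(a)]
```

Only addresses 0, 4, 8 and 12 pass the check, which matches the word addresses listed on the previous slide.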

Instructions: load and store
Example:
    C code: A[8] = h + A[8];
    MIPS code:
        lw $t0, 32($s3)
        add $t0, $s2, $t0
        sw $t0, 32($s3)
• Store word operation has no destination (reg) operand
• Remember arithmetic operands are registers, not memory!
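The offset 32 in the example is the array index times the word size (8 * 4). The sketch below mimics the three instructions in Python, assuming a byte-addressed memory modelled as a dictionary; `memory`, `base` and `run_example` are illustrative names, not part of the slides.

```python
WORD_SIZE = 4  # bytes per MIPS word


def element_offset(index: int) -> int:
    """Byte offset of A[index] from the base address of an int array."""
    return index * WORD_SIZE


def run_example(memory: dict, base: int, h: int) -> None:
    """Mimic lw / add / sw for A[8] = h + A[8]."""
    address = base + element_offset(8)  # base + 32
    t0 = memory[address]                # lw  $t0, 32($s3)
    t0 = h + t0                         # add $t0, $s2, $t0
    memory[address] = t0                # sw  $t0, 32($s3)


memory = {100 + 32: 5}  # A starts at (hypothetical) address 100, A[8] == 5
run_example(memory, base=100, h=7)
```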

Let's translate some C-code
• Can we figure out the code?
void swap(int v[], int k)
{
    int temp;
    temp = v[k];
    v[k] = v[k+1];
    v[k+1] = temp;
}
swap:
    sll $2, $5, 2      # $2 = 4 * k (shift left by 2)
    add $2, $4, $2     # $2 = address of v[k]
    lw  $15, 0($2)     # $15 = v[k]
    lw  $16, 4($2)     # $16 = v[k+1]
    sw  $16, 0($2)     # v[k] = $16
    sw  $15, 4($2)     # v[k+1] = $15
    jr  $31            # return to caller
Explanation: index k: $5; base address of v: $4; address of v[k] is $4 + 4 * $5

Machine Language
• Instructions, like registers and words of data, are also 32 bits long
• Example: add $t0, $s1, $s2
• Registers have numbers: $t0=8, $s1=17, $s2=18
• Instruction Format:
    op      rs      rt      rd      shamt   funct
    000000  10001   10010   01000   00000   100000
    6 bits  5 bits  5 bits  5 bits  5 bits  6 bits
• Can you guess what the field names stand for?
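The R-type layout can be packed into a 32-bit word with shifts. The following is a sketch under the standard MIPS register numbering ($t0 = 8, $s1 = 17, $s2 = 18); `encode_r_type` is a hypothetical helper, not an existing tool.

```python
def encode_r_type(rs: int, rt: int, rd: int, shamt: int, funct: int, op: int = 0) -> int:
    """Pack the six R-type fields (6+5+5+5+5+6 bits) into one 32-bit word."""
    fields = [(op, 6), (rs, 5), (rt, 5), (rd, 5), (shamt, 5), (funct, 6)]
    word = 0
    for value, width in fields:
        if not 0 <= value < (1 << width):
            raise ValueError(f"field value {value} does not fit in {width} bits")
        word = (word << width) | value
    return word


# add $t0, $s1, $s2  ->  rd = 8, rs = 17, rt = 18, funct = 100000
add_word = encode_r_type(rs=17, rt=18, rd=8, shamt=0, funct=0b100000)
add_bits = format(add_word, "032b")
```

Note that the assembly order (destination first) differs from the field order in the machine word, where rd comes after rs and rt.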

Machine Language
• Consider the load-word and store-word instructions
• What would the regularity principle have us do?
• New principle: Good design demands a compromise
• Introduce a new type of instruction format
• I-type for data transfer instructions
• other format was R-type for register
• Example: lw $t0, 32($s2)
    op  rs  rt  16 bit number
    35  18  8   32
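The I-type format can be packed the same way. This sketch again assumes the standard register numbering ($t0 = 8, $s2 = 18) and a signed 16-bit offset; `encode_i_type` is an illustrative name.

```python
def encode_i_type(op: int, rs: int, rt: int, imm: int) -> int:
    """Pack op (6 bits), rs (5), rt (5) and a signed 16-bit immediate."""
    if not -(1 << 15) <= imm < (1 << 15):
        raise ValueError("immediate does not fit in 16 signed bits")
    for value, width in ((op, 6), (rs, 5), (rt, 5)):
        if not 0 <= value < (1 << width):
            raise ValueError(f"field value {value} does not fit in {width} bits")
    # Negative immediates are stored in two's complement in the low 16 bits.
    return (op << 26) | (rs << 21) | (rt << 16) | (imm & 0xFFFF)


# lw $t0, 32($s2)  ->  op = 35, rs = 18 (base register), rt = 8 (destination)
lw_word = encode_i_type(op=35, rs=18, rt=8, imm=32)
```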

Control flow
• Decision making instructions
• alter the control flow,
• i.e., change the "next" instruction to be executed
• MIPS conditional branch instructions:
    bne $t0, $t1, Label
    beq $t0, $t1, Label
• Example: if (i==j) h = i + j;
    bne $s0, $s1, Label
    add $s3, $s0, $s1
    Label: ....

Control flow
• MIPS unconditional branch instructions:
    j label
• Example:
    if (i!=j)             beq $s4, $s5, Lab1
        h=i+j;            add $s3, $s4, $s5
    else                  j Lab2
        h=i-j;      Lab1: sub $s3, $s4, $s5
                    Lab2: ...
• Can you build a simple for loop?

So far:
• Instruction           Meaning
    add $s1,$s2,$s3     $s1 = $s2 + $s3
    sub $s1,$s2,$s3     $s1 = $s2 - $s3
    lw $s1,100($s2)     $s1 = Memory[$s2+100]
    sw $s1,100($s2)     Memory[$s2+100] = $s1
    bne $s4,$s5,L       Next instr. is at Label if $s4 ≠ $s5
    beq $s4,$s5,L       Next instr. is at Label if $s4 = $s5
    j Label             Next instr. is at Label
• Formats:
    R:  op | rs | rt | rd | shamt | funct
    I:  op | rs | rt | 16 bit address
    J:  op | 26 bit address

Control Flow
• We have: beq, bne, what about Branch-if-less-than?
• New instruction:
    slt $t0, $s1, $s2
  meaning: if $s1 < $s2 then $t0 = 1 else $t0 = 0
• Can use this instruction to build "blt $s1, $s2, Label"; can now build general control structures
• Note that the assembler needs a register to do this; use conventions for registers

MIPS compiler/assembler Conventions

Constants
• Small constants are used quite frequently (50% of operands), e.g., A = A + 5; B = B + 1; C = C - 18;
• Solutions? Why not?
• put 'typical constants' in memory and load them
• create hard-wired registers (like $zero) for constants like 0, 1, 2, …
• or …….
• MIPS Instructions:
    addi $29, $29, 4
    slti $8, $18, 10
    andi $29, $29, 6
    ori  $29, $29, 4

How about larger constants?
• We'd like to be able to load a 32 bit constant into a register
• Must use two instructions; new "load upper immediate" instruction:
    lui $t0, 1010101010101010
    $t0 = 1010101010101010 0000000000000000   (lower 16 bits filled with zeros)
• Then must get the lower order bits right, i.e.,
    ori $t0, $t0, 1010101010101010
          1010101010101010 0000000000000000
    ori   0000000000000000 1010101010101010
      =   1010101010101010 1010101010101010
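The lui/ori pair can be modelled with two small Python functions. This is an illustrative sketch of the arithmetic only; `lui`, `ori` and `split_constant` are local helper names that return register values, not a simulator API.

```python
def lui(imm16: int) -> int:
    """Load upper immediate: place imm16 in bits 31..16, zero the lower half."""
    return (imm16 & 0xFFFF) << 16


def ori(reg_value: int, imm16: int) -> int:
    """OR immediate: the 16-bit immediate is zero-extended, then ORed in."""
    return reg_value | (imm16 & 0xFFFF)


def split_constant(value: int) -> tuple:
    """Split a 32-bit constant into (upper 16 bits, lower 16 bits)."""
    return (value >> 16) & 0xFFFF, value & 0xFFFF


upper, lower = split_constant(0xAAAAAAAA)  # 1010...1010 in binary
t0 = lui(upper)     # lui $t0, 1010101010101010
t0 = ori(t0, lower) # ori $t0, $t0, 1010101010101010
```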

Assembly Language vs. Machine Language • Assembly provides convenient symbolic representation • much easier than writing down numbers • e.g., destination first • Machine language is the underlying reality • e.g., destination is no longer first • Assembly can provide 'pseudoinstructions' • e.g., “move $t0, $t1” exists only in Assembly • would be implemented using “add $t0,$t1,$zero” • Another pseudo instr: blt $t1, $t2, label • When considering performance you should count real instructions
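To see why real instructions should be counted, here is a small sketch of how an assembler might expand the two pseudoinstructions mentioned above. The `expand` function is hypothetical; the use of $at as the assembler's temporary register for blt follows the usual MIPS convention and is an assumption beyond what the slide states.

```python
def expand(instr: str) -> list:
    """Expand a pseudoinstruction into real MIPS instructions (sketch)."""
    mnemonic, _, rest = instr.partition(" ")
    operands = [operand.strip() for operand in rest.split(",")]
    if mnemonic == "move":
        rd, rs = operands
        return [f"add {rd}, {rs}, $zero"]
    if mnemonic == "blt":
        rs, rt, label = operands
        # slt sets $at to 1 when rs < rt; bne then branches if $at != 0.
        return [f"slt $at, {rs}, {rt}", f"bne $at, $zero, {label}"]
    return [instr]  # already a real instruction


program = ["move $t0, $t1", "blt $t1, $t2, label", "add $s0, $s1, $s2"]
real = [line for instr in program for line in expand(instr)]
```

Three lines of assembly become four real instructions here, which is the number that matters for performance.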

Addresses in Branches and Jumps
• Instructions:
    bne $t4,$t5,Label    Next instruction is at Label if $t4 ≠ $t5
    beq $t4,$t5,Label    Next instruction is at Label if $t4 = $t5
    j Label              Next instruction is at Label
• Formats:
    I:  op | rs | rt | 16 bit address
    J:  op | 26 bit address
• Addresses are not 32 bits. How do we handle this with load and store instructions?

What's the next address?
• Instructions:
    bne $t4,$t5,Label    Next instruction is at Label if $t4 ≠ $t5
    beq $t4,$t5,Label    Next instruction is at Label if $t4 = $t5
• Formats:
    I:  op | rs | rt | 16 bit address
• Could specify a register (like lw and sw) and add it to address
• use Instruction Address Register (PC = program counter)
• most branches are local (principle of locality)
• Jump instructions just use high order bits of PC
• address boundaries of 256 MB
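A sketch of both address calculations, assuming the usual MIPS rules: a branch adds the sign-extended word offset to PC + 4, and a jump keeps the upper 4 bits of PC + 4 and fills in the 26-bit word address. The function names and example addresses are illustrative.

```python
def sign_extend16(value: int) -> int:
    """Interpret the low 16 bits of value as a two's complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def branch_target(pc: int, offset16: int) -> int:
    """PC-relative: offset counts words relative to the next instruction."""
    return (pc + 4 + (sign_extend16(offset16) << 2)) & 0xFFFFFFFF


def jump_target(pc: int, target26: int) -> int:
    """Pseudo-direct: upper 4 bits from PC + 4, lower 28 bits from target << 2."""
    return ((pc + 4) & 0xF0000000) | ((target26 & 0x03FFFFFF) << 2)


forward = branch_target(0x00400000, 3)        # 3 words past the next instruction
backward = branch_target(0x00400010, 0xFFFF)  # offset -1: branch to itself
far = jump_target(0x10000000, 0x0100004)      # stays inside the same 256 MB region
```

Because only 28 address bits come from the jump instruction, the target always lies in the same 2^28-byte (256 MB) region as the instruction after the jump.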

To summarize:

MIPS (3+2) addressing modes overview

MIPS Datapath • Building a datapath • support a subset of the MIPS-I instruction-set • A single cycle processor datapath • all instruction actions in one (long) cycle • A multi-cycle processor datapath • each instruction takes multiple (shorter) cycles • For details see book (ch. 5 in 3rd ed., or ch. 4 in 4th ed. + app. B)

Datapath and Control • Datapath: registers & memories, multiplexors, buses, ALUs • Control: FSM or microprogramming

The Processor: Datapath & Control • Simplified MIPS implementation to contain only: • memory-reference instructions: lw, sw • arithmetic-logical instructions: add, sub, and, or, slt • control flow instructions: beq, j • Generic Implementation: • use the program counter (PC) to supply instruction address • get the instruction from memory • read registers • use the instruction to decide exactly what to do • All instructions use the ALU after reading the registers. Why? • memory-reference? • arithmetic? • control flow?

More Implementation Details • Abstract / Simplified View: [Figure: PC, instruction memory, register file, ALU and data memory] • Two types of functional units: • elements that operate on data values (combinational) • elements that contain state (sequential)

State Elements • Unclocked vs. Clocked • Clocks used in synchronous logic • when should an element that contains state be updated? [Figure: clock waveform showing rising edge, falling edge and cycle time]

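An unclocked state element
• The set-reset (SR) latch
• output depends on present inputs and also on past inputs
[Figure: SR latch built from two cross-coupled NOR gates; inputs R and S, outputs Q and its complement]
Truth table:
    R  S  Q
    0  0  Q   (keeps previous value)
    0  1  1   (state change)
    1  0  0   (state change)
    1  1  ?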
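The truth table can be reproduced by simulating the two cross-coupled NOR gates until their outputs stop changing. This is a behavioural sketch with illustrative names; it assumes Q = NOR(R, Q') and Q' = NOR(S, Q), and it ignores real gate delays.

```python
def nor(a: int, b: int) -> int:
    return int(not (a or b))


def sr_latch(s: int, r: int, q: int, q_bar: int) -> tuple:
    """Return the settled (Q, Q') for inputs S, R and the previous state."""
    for _ in range(4):  # a few passes are enough for the feedback loop to settle
        new_q = nor(r, q_bar)
        new_q_bar = nor(s, new_q)
        if (new_q, new_q_bar) == (q, q_bar):
            break
        q, q_bar = new_q, new_q_bar
    return q, q_bar


state = (0, 1)                       # start with Q = 0
after_set = sr_latch(1, 0, *state)   # S = 1: Q becomes 1
held = sr_latch(0, 0, *after_set)    # S = R = 0: Q keeps its value
after_reset = sr_latch(0, 1, *held)  # R = 1: Q becomes 0
```

With S = R = 1 this model drives both outputs to 0, so they are no longer complements; that is the "?" row in the table.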

Latches and Flip-flops • Output is equal to the stored value inside the element (don't need to ask for permission to look at the value) • Change of state (value) is based on the clock • Latches: state changes whenever the inputs change and the clock is asserted (level-sensitive) • Flip-flop: state changes only on a clock edge (edge-triggered) • A clocking methodology defines when signals can be read and written: we wouldn't want to read a signal at the same time it was being written

D-latch (level-sensitive) • Two inputs: • the data value to be stored (D) • the clock signal (C) indicating when to read & store data (D) • Two outputs: • the value of the internal state (Q) and its complement

D flip-flop (edge-triggered) • Output changes only on the clock edge [Figure: D flip-flop built from two D latches in series, clocked by C; input D, outputs Q and its complement]
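The difference between level-sensitive and edge-triggered behaviour can be sketched in Python. The flip-flop below chains two latches in a master-slave arrangement and, as an assumption of this sketch, inverts the clock for the second latch, so the output changes on the falling edge. Class names are illustrative.

```python
class DLatch:
    """Level-sensitive: Q follows D while C is 1, and holds while C is 0."""

    def __init__(self) -> None:
        self.q = 0

    def update(self, d: int, c: int) -> int:
        if c:
            self.q = d
        return self.q


class DFlipFlop:
    """Master-slave pair: master is open while C = 1, slave while C = 0,
    so the visible output only changes when C falls from 1 to 0."""

    def __init__(self) -> None:
        self.master = DLatch()
        self.slave = DLatch()

    def update(self, d: int, c: int) -> int:
        m = self.master.update(d, c)
        return self.slave.update(m, int(not c))


ff = DFlipFlop()
# (D, C) pairs over time; the output is sampled after each step.
trace = [ff.update(d, c) for d, c in [(1, 1), (1, 0), (0, 0), (0, 1), (0, 0)]]
```

In the trace, the output picks up D = 1 only when the clock falls, and ignores the change of D to 0 until the next falling edge.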

Our Implementation • An edge triggered methodology • Typical execution: • read contents of some state elements, • send values through some combinational logic, • write results to one or more state elements [Figure: state element 1, combinational logic, state element 2, all within one clock cycle]

Register File • 3-ported: one write, two read ports [Figure: register file with inputs Read reg. #1, Read reg. #2, Write reg. #, Write data and Write; outputs Read data 1 and Read data 2]

Register file: read ports • Register file built using D flip-flops • Implementation of the read ports [Figure: registers 0 to n feed two multiplexors, selected by read register numbers 1 and 2, producing Read data 1 and Read data 2]

Register file: write port • Note: we still use the real clock to determine when to write [Figure: a decoder on the register number, combined with the Write signal, drives the clock input C of each register 0 to n; the register data feeds every D input]
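A behavioural sketch of the 3-ported register file: reads are combinational, and a write only takes effect when the clock edge arrives with the Write signal asserted. The class and method names are illustrative, and keeping register 0 at zero is the MIPS $zero convention rather than something shown in the figure.

```python
class RegisterFile:
    """32 registers, two read ports, one write port (behavioural model)."""

    def __init__(self, count: int = 32) -> None:
        self.regs = [0] * count

    def read(self, read_reg1: int, read_reg2: int) -> tuple:
        """Combinational read: both ports return values immediately."""
        return self.regs[read_reg1], self.regs[read_reg2]

    def clock_edge(self, write: int, write_reg: int, write_data: int) -> None:
        """Called once per clock edge; writes only if the Write signal is set."""
        if write and write_reg != 0:  # assumption: register 0 stays zero
            self.regs[write_reg] = write_data & 0xFFFFFFFF


rf = RegisterFile()
rf.clock_edge(write=1, write_reg=8, write_data=42)
rf.clock_edge(write=0, write_reg=9, write_data=99)  # Write not asserted: ignored
values = rf.read(8, 9)
```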

Building the Datapath • Use multiplexors to stitch them together [Figure: single-cycle datapath with PC, instruction memory, register file, sign extend, ALU, data memory, two adders and shift left 2; control signals PCSrc, RegWrite, ALUSrc, ALU operation, MemRead, MemWrite, MemtoReg]

Our Simple Control Structure • All of the logic is combinational • We wait for everything to settle down, and the right thing to be done • ALU might not produce "right answer" right away • we use write signals along with clock to determine when to write • Cycle time determined by length of the longest path • We are ignoring some details like setup and hold times! [Figure: state element 1, combinational logic, state element 2, all within one clock cycle]

Control
• Selecting the operations to perform (ALU, read/write, etc.)
• Controlling the flow of data (multiplexor inputs)
• Information comes from the 32 bits of the instruction
• Example: add $8, $17, $18
  Instruction Format:
    000000  10001  10010  01000  00000  100000
    op      rs     rt     rd     shamt  funct
• ALU's operation based on instruction type and function code

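ALU Control: 2 level implementation
• Control 2: takes the 6-bit Opcode (instruction register bits 31-26) and produces the 2-bit ALUop
    00: lw, sw
    01: beq
    10: add, sub, and, or, slt
• Control 1: takes ALUop and the 6-bit Funct. field (instruction register bits 5-0) and produces the 3-bit ALUcontrol
    000: and
    001: or
    010: add
    110: sub
    111: set on less than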

Datapath with Control [Figure: single-cycle datapath plus a Control unit driven by Instruction[31-26], producing RegDst, Branch, MemRead, MemtoReg, ALUOp, MemWrite, ALUSrc and RegWrite; an ALU control unit driven by Instruction[5-0] and ALUOp]

ALU Control1
• What should the ALU do with this instruction?
  example: lw $1, 100($2)
    op  rs  rt  16 bit offset
    35  2   1   100
• ALU control input
    000 AND
    001 OR
    010 add
    110 subtract
    111 set-on-less-than
• Why is the code for subtract 110 and not 011?

ALU Control1 • Must describe hardware to compute 3-bit ALU control input • given instruction type (ALU operation class, computed from instruction type): 00 = lw, sw; 01 = beq; 10 = arithmetic • function code for arithmetic • Describe it using a truth table (can turn into gates) [Truth table with columns: inputs, outputs]

ALU Control1 • Simple combinational logic (truth tables)
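The ALU control truth table can be written as a small lookup. This sketch uses the 2-bit ALUop classes and 3-bit control codes from the slides; the funct values for sub, and, or and slt are the standard MIPS encodings and are an addition not shown on the slides.

```python
ALU_AND, ALU_OR, ALU_ADD, ALU_SUB, ALU_SLT = 0b000, 0b001, 0b010, 0b110, 0b111

# Standard MIPS funct codes for the five arithmetic-logical instructions.
FUNCT_TO_ALU = {
    0b100000: ALU_ADD,  # add
    0b100010: ALU_SUB,  # sub
    0b100100: ALU_AND,  # and
    0b100101: ALU_OR,   # or
    0b101010: ALU_SLT,  # slt
}


def alu_control(alu_op: int, funct: int) -> int:
    """Map (ALUop, funct) to the 3-bit ALU control input."""
    if alu_op == 0b00:   # lw, sw: add base register and offset
        return ALU_ADD
    if alu_op == 0b01:   # beq: subtract and test the Zero output
        return ALU_SUB
    if alu_op == 0b10:   # arithmetic: decided by the funct field
        return FUNCT_TO_ALU[funct]
    raise ValueError(f"unsupported ALUop {alu_op:02b}")
```

For lw and sw the funct bits are ignored, since those bits belong to the 16-bit offset in the I-type format.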

Deriving Control2 signals • Input: 6 bits (the opcode) • Output: 9 control signals • Determine these control signals directly from the opcodes: R-format: 0, lw: 35, sw: 43, beq: 4
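One way to picture this unit is as a table indexed by opcode. The sketch below assumes the signal names from the datapath figure and the signal values of the standard single-cycle control table in the course textbook; the table itself is not reproduced on the slide, so treat the values as an assumption. `None` marks a don't-care output.

```python
SIGNALS = ("RegDst", "ALUSrc", "MemtoReg", "RegWrite",
           "MemRead", "MemWrite", "Branch", "ALUOp1", "ALUOp0")

# opcode -> the 9 output signals, in the order of SIGNALS (assumed values).
CONTROL_TABLE = {
    0:  (1,    0, 0,    1, 0, 0, 0, 1, 0),  # R-format
    35: (0,    1, 1,    1, 1, 0, 0, 0, 0),  # lw
    43: (None, 1, None, 0, 0, 1, 0, 0, 0),  # sw
    4:  (None, 0, None, 0, 0, 0, 1, 0, 1),  # beq
}


def main_control(opcode: int) -> dict:
    """Return the control signals for one of the four supported opcodes."""
    if opcode not in CONTROL_TABLE:
        raise ValueError(f"unsupported opcode {opcode}")
    return dict(zip(SIGNALS, CONTROL_TABLE[opcode]))


lw_signals = main_control(35)
```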

Control 2 • PLA example implementation

Single Cycle Implementation • Calculate cycle time assuming negligible delays except: • memory (2ns), ALU and adders (2ns), register file access (1ns) [Figure: complete single-cycle datapath with control]
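A sketch of the requested calculation. The delays are the ones given on the slide; which units lie on the critical path of each instruction class is an assumption of this sketch (the PC + 4 adder is taken to work in parallel with the instruction fetch).

```python
DELAY_NS = {"instr_mem": 2, "reg_read": 1, "alu": 2, "data_mem": 2, "reg_write": 1}

# Units on the critical path of each instruction class (assumed).
PATHS = {
    "R-format": ["instr_mem", "reg_read", "alu", "reg_write"],
    "lw":       ["instr_mem", "reg_read", "alu", "data_mem", "reg_write"],
    "sw":       ["instr_mem", "reg_read", "alu", "data_mem"],
    "beq":      ["instr_mem", "reg_read", "alu"],
    "j":        ["instr_mem"],
}

latency_ns = {name: sum(DELAY_NS[unit] for unit in units)
              for name, units in PATHS.items()}

# In a single-cycle design every instruction takes one cycle,
# so the clock must be long enough for the slowest instruction.
cycle_time_ns = max(latency_ns.values())
```

Under these assumptions lw is the slowest instruction and sets the cycle time for all others, which motivates the multi-cycle implementation.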
