1 / 77

CMPE382 Fall 2002

CMPE382 Fall 2002. Instruction Sets section 2.1 – 2.12. Based on Hennessy & Patterson CA:AQA Adapted from Patterson’s CS252 lecture slides at Berkeley http://www.cs.berkeley.edu/~pattrsn/ with additions from Culler http://www.cs.berkeley.edu/~culler/

Télécharger la présentation

CMPE382 Fall 2002

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. CMPE382 Fall 2002 Instruction Sets section 2.1 – 2.12 Based on Hennessy & Patterson CA:AQA Adapted from Patterson’s CS252 lecture slides at Berkeley http://www.cs.berkeley.edu/~pattrsn/ with additions from Culler http://www.cs.berkeley.edu/~culler/ and Amaral http://www.cs.ualberta.ca/~amaral/

  2. Mental Homework In some machines, R0 will always contain zero, regardless of what is written to it. How many 68000 (or other CISC) instructions can you implement using ADD with “R0”?

  3. Mental Homework Solution MOVE R1,R2 CLR R1 NOP as well as INC R1 DEC R1 ADD R1,R2 (two operand) Flexible! Why invent new instructions when a few instructions can do a lot?

  4. Homework assignment • Hennesy & PattersonQuestions 1.16, 2.4Due Sep 20

  5. Flavours of Instruction Set Architectures

  6. Examples of use • {0,1,2, or 3} operands on ALU operations try in each ISA int a,b,c,d; a = b + c + 42; and a = b + c + 42; d = b – c;

  7. Memory addressing • Byte • formerly the minimum unit you can access in computer memory • Now 8bits, thanks to IBM • Byte size on 64bit Alpha? 1 byte = byte (char) 2 bytes = half word 16bits 4 bytes = word 32bits (int, float) 8 bytes = double word 64bits (64b int, double) 16 bytes = quad word 128bits

  8. Which Endian? • Arabic numerals are Big EndianBig end first, e.g “1984” • Only distinguishable when same memory contents are accessed as different data types, or data is passed between machines of different endian-ness • Little endian: byte 0 is least significant byteinterpreted [7 6 5 4 3 2 1 0] • Big endian: byte 0 is most significant byteinterpreted [0 1 2 3 4 5 6 7]

  9. Data alignment Options for interpreting data misaligned to machine word boundaries: • Abort (illegal alignment) • Trap to OS to fix in software • Provide hardware to hide the two necessary memory accesses

  10. H&P assume you already read P&HComputer Organization & Design So, we’ll mention the basics of pipelining now and look at the details in Appendix A after Chapter 2

  11. Instruction Fetch Instruction Decode Operand Fetch Execute Result Store Next Instruction Execution Cycle Obtain instruction from program storage Determine required actions and instruction size Locate and obtain operand data Compute result value or status Deposit results in storage for later use Determine successor instruction

  12. What’s a Clock Cycle? • Old days: 10 levels of gates • Today: determined by numerous time-of-flight issues + gate delays • clock propagation, wire lengths, drivers D-flipflop or register combinational logic

  13. Next Instruction NI NI NI NI NI IF IF IF IF IF D D D D D Instruction Fetch E E E E E W W W W W Decode & Operand Fetch Execute Store Results Fast, Pipelined Instruction Interpretation Instruction Address Instruction Register Time Operand Registers Result Registers Registers or Mem

  14. Alternative 5-stage pipeline • IF/ID instruction fetch and decode • RD register read • EX execute = ALU operation • MEM read or write memory • WB register write back results

  15. A B C D Sequential Laundry 6 PM Midnight 7 8 9 11 10 Time • Sequential laundry takes 6 hours for 4 loads • If they learned pipelining, how long would laundry take? 30 40 20 30 40 20 30 40 20 30 40 20 T a s k O r d e r

  16. 30 40 40 40 40 20 A B C D Pipelined LaundryStart work ASAP 6 PM Midnight 7 8 9 11 10 • Pipelined laundry takes 3.5 hours for 4 loads (4h synchronous) Time T a s k O r d e r

  17. 30 40 40 40 40 20 A B C D Pipelining Lessons • Pipelining doesn’t help latency of single task, it helps throughput of entire workload • Pipeline rate limited by slowest pipeline stage • Multiple tasks operating simultaneously • Potential speedup = Number pipe stages • Unbalanced lengths of pipe stages reduces speedup • Time to “fill” pipeline and time to “drain” it reduces speedup 6 PM 7 8 9 Time T a s k O r d e r

  18. Instruction Pipelining • Execute billions of instructions, so throughput is what matters • except when? • What is desirable in instruction sets for pipelining? • Variable length instructions vs. all instructions same length? • Memory operands part of any operation vs. memory operands only in loads or stores? • Register operand many places in instruction format vs. registers located in same place?

  19. Example: MIPS (Note register location) Register-Register 6 5 11 10 31 26 25 21 20 16 15 0 Op Rs1 Rs2 Rd Opx Register-Immediate 31 26 25 21 20 16 15 0 immediate Op Rs1 Rd Branch 31 26 25 21 20 16 15 0 immediate Op Rs1 Rs2/Opx Jump / Call 31 26 25 0 target Op

  20. Adder 4 Address Inst ALU 5 Steps of MIPS Datapath Instruction Fetch Instr. Decode Reg. Fetch Execute Addr. Calc Memory Access Write Back Next PC MUX Next SEQ PC Zero? RS1 Reg File MUX RS2 Memory Data Memory L M D RD MUX MUX Sign Extend Imm WB Data

  21. MEM/WB ID/EX EX/MEM IF/ID Adder 4 Address ALU 5 Steps of MIPS Datapath Instruction Fetch Execute Addr. Calc Memory Access Instr. Decode Reg. Fetch Write Back Next PC MUX Next SEQ PC Next SEQ PC Zero? RS1 Reg File MUX Memory RS2 Data Memory MUX MUX Sign Extend WB Data Imm RD RD RD • Data stationary control • local decode for each instruction phase / pipeline stage

  22. Reg Reg Reg Reg Reg Reg Reg Reg Ifetch Ifetch Ifetch Ifetch DMem DMem DMem DMem ALU ALU ALU ALU Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Visualizing Pipelining Time (clock cycles) I n s t r. O r d e r

  23. How to design an ISA that resists pipelining and superscalar • Use a processor status word • Variable length • Opcodes • Variable length operands • Variable length operands who’s length can’t be determined by the first word of the instruction • how can we determine where the next instruction is to put it in the pipeline? • Lots of addressing modes on every instruction • Anything else to help make the CPI highly variable

  24. Architecture MethodologyDynamic usage “A program executes 90% of its instructions in 10% of its code.” • Can’t determine dynamic usage from static code or compiler information • Require execution traces or simulation using real data

  25. Numerous CISC addressing modes Addressing mode Example Meaning Register Add R4,R3 R4¬ R4+R3 Immediate Add R4,#3 R4 ¬ R4+3 Displacement Add R4,100(R1) R4 ¬ R4+Mem[100+R1] Register indirect Add R4,(R1) R4 ¬ R4+Mem[R1] Indexed / Base Add R3,(R1+R2) R3 ¬ R3+Mem[R1+R2] Direct or absolute Add R1,(1001) R1 ¬ R1+Mem[1001] Memory indirect Add R1,@(R3) R1 ¬ R1+Mem[Mem[R3]] Auto-increment Add R1,(R2)+ R1 ¬ R1+Mem[R2]; R2 ¬ R2+d Auto-decrement Add R1,–(R2) R2 ¬ R2–d; R1 ¬ R1+Mem[R2] Scaled Add R1,100(R2)[R3] R1 ¬ R1+Mem[100+R2+R3*d] Why Auto-increment/decrement? Scaled?

  26. Cost to implement addressing modes operations (and hardware) to move memory to register: Dmem+Imem(imm)+ALU+reg 0+0+0+2 Register 1+0+0+1 Immediate 1+1+1+2 Displacement 1+0+0+2 Register indirect 1+0+1+3 Indexed / Base 1+1+0+1 Direct or absolute 2+0+0+2 Memory indirect 1+0+1+3 Auto-increment (two reg write ports) 1+0+1+3 Auto-decrement 1+1+3+3 Scaled

  27. Are those addressing modes used?VAX running SPEC89 - dynamic usage

  28. Addressing mode usage 3 desktop programs: •Displacement: 42% avg, 32% to 55% •Immediate: 33% avg, 17% to 43% •Register deferred (indirect): 13% avg, 3% to 24% •Scaled: 7% avg, 0% to 16% •Memory indirect: 3% avg, 1% to 6% •Misc: 2% avg, 0% to 3% 75% displacement & immediate 88% displacement, immediate & register indirect optimize the common case

  29. Size of displacement and offset • Too many bits increases size of all instructions • Too few bits requires piecing together literals • Instruction should be a “nice” size

  30. Typical RISC addressing modes • Less than up to 16 CISC addressing modes • Register and immediate addressing mode on all ALU instructions • Load/store architecture (only route to data memory via these instructions) • Implement displacement addressing modeLD R1,314(R2) • for free: Register indirectLD R1,0(R2) • also for free: absoluteLD R1,42(R0)

  31. A "Typical" RISC • 32-bit fixed format instruction (3 formats) • Memory access only via load/store instructions • 32 32-bit GPR (R0 contains zero, DP take pair) – 64-bit addresses => 64-bit registers; all desktop RISCs today • 3-address, reg-reg arithmetic instruction; registers in same place • Single address mode for load/store: base + displacement – no indirection (except via registers) • Simple branch conditions

  32. Categories of instructionsH&P fig 2.15 • Integer ALU (arithmetic & logical) • Data transfer (load, store) • Control (branch, jump, call, return, trap)look at these 3 first • System (interrupts, IO, VM, protection) • Floating Point • Decimal (any COBOL programmers in the room?) • String • Graphics (fast pixel & vertex operations)

  33. Example: MIPS (Note register location) Register-Register 6 5 11 10 31 26 25 21 20 16 15 0 Op Rs1 Rs2 Rd Opx Register-Immediate 31 26 25 21 20 16 15 0 immediate Op Rs1 Rd Branch 31 26 25 21 20 16 15 0 immediate Op Rs1 Rs2/Opx Jump / Call 31 26 25 0 target Op

  34. MIPS32 data transfer instructions SW 500(R4), R3 Store word SH 502(R2), R3 Store half SB 41(R3), R2 Store byte LW R1, 30(R2) Load word LH R1, 40(R3) Load halfword LHU R1, 40(R3) Load halfword unsigned LB R1, 40(R3) Load byte LBU R1, 40(R3) Load byte unsigned LUI R1, 40 Load Upper Immediate (16 bits shifted left by 16)

  35. MIPS32 arithmetic instructions add add $1,$2,$3 $1 = $2 + $3 3 operands; subtract sub $1,$2,$3 $1 = $2 – $3 3 operands; add immediate addi $1,$2,100 $1 = $2 + 100 + constant; add unsigned addu $1,$2,$3 $1 = $2 + $3 3 operands; subtract unsigned subu $1,$2,$3 $1 = $2 – $3 3 operands; add imm. unsign. addiu $1,$2,100 $1 = $2 + 100 + constant; multiply mult $2,$3 Hi, Lo = $2 x $3 64-bit signed product multiply unsigned multu$2,$3 Hi, Lo = $2 x $3 64-bit unsigned product divide div $2,$3 Lo = $2 ÷ $3, Lo = quotient, Hi = remainder Hi = $2 mod $3 divide unsigned divu $2,$3 Lo = $2 ÷ $3, Unsigned Hi = $2 mod $3 Move from Hi mfhi $1 $1 = Hi Used to get copy of Hi Move from Lo mflo $1 $1 = Lo Used to get copy of Lo why the departure from 32 elegant registers?

  36. MIPS logical instructions and and $1,$2,$3 $1 = $2 & $3 3 reg. operands; Logical AND or or $1,$2,$3 $1 = $2 | $3 3 reg. operands; Logical OR xor xor $1,$2,$3 $1 = $2 XOR $3 3 reg. operands; Logical XOR nor nor $1,$2,$3 $1 = ~($2 |$3) 3 reg. operands; Logical NOR and immediate andi $1,$2,10 $1 = $2 & 10 Logical AND reg, constant or immediate ori $1,$2,10 $1 = $2 | 10 Logical OR reg, constant xor immediate xori $1, $2,10 $1 = $2 XOR10 Logical XOR reg, constant shift left logical sll $1,$2,10 $1 = $2 << 10 Shift left by constant shift right logical srl $1,$2,10 $1 = $2 >> 10 Shift right by constant shift right arithm. sra $1,$2,10 $1 = $2 >> 10 Shift right (sign extend) shift left logical sllv $1,$2,$3 $1 = $2 << $3 Shift left by variable shift right logical srlv $1,$2, $3 $1 = $2 >> $3 Shift right by variable shift right arithm. srav $1,$2, $3 $1 = $2 >> $3 Shift right arith. by variable

  37. Branch offset size vs. frequencyH&P fig 20

  38. MIPS jump, branch, compare instructionsNo processor status word branch on equal beq $1,$2,100 if ($1 == $2) go to PC+4+100*4 Equal test; PC relative branch branch on not eq. bne $1,$2,100 if ($1!= $2) go to PC+4+100*4 Not equal test; PC relative set on less than slt $1,$2,$3 if ($2 < $3) $1=1; else $1=0 Compare less than; 2’s comp. set less than imm. slti $1,$2,100 if ($2 < 100) $1=1; else $1=0 Compare < constant; 2’s comp. set less than uns. sltu $1,$2,$3 if ($2 < $3) $1=1; else $1=0 Compare less than; natural numbers set l. t. imm. uns. sltiu $1,$2,100 if ($2 < 100) $1=1; else $1=0 Compare < constant; natural numbers jump j 10000 go to 10000*4 Jump to target address jump register jr $31 go to $31 For switch, procedure return jump and link jal 10000 $31 = PC + 4; go to 10000*4 For procedure call

  39. Test and branch methods • Compute condition, leave in GPRMIPS, Alpha • Compare and branchVAX, PA-RISC • Condition code bits in PSW80x86, 68000, SPARC, PowerPC

  40. MEM/WB ID/EX EX/MEM IF/ID Adder 4 Address ALU Utilization of hardware resources? Instruction Fetch Execute Addr. Calc Memory Access Instr. Decode Reg. Fetch Write Back Next PC MUX Next SEQ PC Next SEQ PC Zero? RS1 Reg File MUX Memory RS2 Data Memory MUX MUX Sign Extend WB Data Imm RD RD RD • Data stationary control • local decode for each instruction phase / pipeline stage

  41. MIPS integer datapath includes: • “3-port” register file: 2 read, 1 write • ALU (2 operands, 1 result) • comparator (equal / not-equal) • read/write data memory port • instruction memory port & decode

  42. MIPS Floating Point • 64 and 32 bit precision (double, single) • 32 FP 64-bit registers F0..F31 • + - * /, transfers, compares • multiply accumulate • paired operation (pairs of 32-bit single precision operands simultaneously in a 64-bit datapath)

  43. Embedded Processors • Big market,“Cars have more value in Silicon than steel” • ~No new applications after manufacture • Single address space, no virtual memory • Floating point rare • Memory (+ code size), cost, size, power sensitive • Controllers • simple RISC processors • Digital Signal Processors • special instructions for DSP inner loops • Multiplier accumulator at the heart of it all

  44. Are you a computer architect yet?Towards DSP processors • What do DSP processors do? • What HW do we need to execute the inner loop of an FIR filter with throughput 1/cycle? • What is an FIR filter? • How would you code it?

  45. Finite Impulse Response Filter #define SIZE 1024 float weight[SIZE]; float fir(float in) /* What datatype? */ { float out; int j; static float cbuffer[SIZE]; /*circular buffer */ static int start; out = 0.0; cbuffer[start] = in; for(j=0; j<SIZE; j++) out += cbuffer[(start-j)%SIZE] * weight[j]; start = (start + 1) % SIZE; return (out); }

  46. Optimized MIPS64 R4000 Assembler – FIR inner loop $L6: subu $3,$7,$5 bgez $3,$L7 move $2,$3 addu $2,$3,1023 $L7: sra $2,$2,10 sll $2,$2,10 subu $2,$3,$2 dsll $3,$2,2 dsll $4,$5,2 daddu $3,$3,$9 daddu $4,$4,$8 l.s $f0,0($3) l.s $f1,0($4) addu $5,$5,1 slt $2,$5,1024 mul.s $f0,$f0,$f1 bne $2,$0,$L6 add.s $f2,$f2,$f0

  47. All of the FIR code – for the curious .file 1 "fir.c" .abicalls .section .text .version "01.01" gcc2_compiled.: .lcomm cbuffer.3,4096 .lcomm start.4,4 .comm weight,4096 .text .align 2 .globl fir .ent fir fir: .frame $sp,32,$31 # vars= 0, regs= 1/0, args= 0, extra= 16 .mask 0x10000000,-16 .fmask 0x00000000,0 dsubu $sp,$sp,32 sd $28,16($sp) .set noat lui $1,%hi(%neg(%gp_rel(fir))) addiu $1,$1,%lo(%neg(%gp_rel(fir))) daddu $gp,$1,$25 .set at lw $2,start.4 dla $4,cbuffer.3 mtc1 $0,$f2 dsll $3,$2,2 move $6,$2 daddu $3,$3,$4 move $9,$4 move $7,$6 s.s $f12,0($3) move $5,$0 dla $8,weight $L6: subu $3,$7,$5 .set noreorder .set nomacro bgez $3,$L7 move $2,$3 .set macro .set reorder addu $2,$3,1023 $L7: sra $2,$2,10 sll $2,$2,10 subu $2,$3,$2 dsll $3,$2,2 dsll $4,$5,2 daddu $3,$3,$9 daddu $4,$4,$8 l.s $f0,0($3) l.s $f1,0($4) addu $5,$5,1 slt $2,$5,1024 mul.s $f0,$f0,$f1 .set noreorder .set nomacro bne $2,$0,$L6 add.s $f2,$f2,$f0 .set macro .set reorder addu $3,$6,1 .set noreorder .set nomacro bgez $3,$L9 move $2,$3 .set macro .set reorder addu $2,$6,1024 $L9: sra $2,$2,10 sll $2,$2,10 subu $2,$3,$2 sw $2,start.4 mov.s $f0,$f2 ld $28,16($sp) #nop .set noreorder .set nomacro j $31 daddu $sp,$sp,32 .set macro .set reorder .end fir

  48. DSP Data Path: Multiplier • Specialized hardware performs all key arithmetic operations in 1 cycle • ~ 50% of instructions can involve multiplier=> single cycle throughput multiplier • Need to perform multiply-accumulate (MAC) • n-bit multiplier => 2n-bit product

  49. DSP addressing modes • 4 common RISC addressing modes account for 70% of dynamic usage • immediate • displacement • register indirect • direct (or absolute) • autoincrement/decrement with/without circular buffer (modulo) features account for 25% • autoincrement with bit reversing rarely used but makes FFT faster • e.g. memory sequence order {0,4,2,6,1,5,3,7}

  50. DSP Memory • FIR Tap implies multiple memory accesses • DSPs want multiple data ports • May have multiple data memories Some DSPs have ad hoc techniques to reduce memory bandwidth demand – Instruction repeat buffer: do 1 instruction 256 times – Often disables interrupts, thereby increasing interrupt response time • Some recent DSPs have instruction caches – Even then may allow programmer to “lock in” instructions into cache – Option to turn cache into fast program memory • No DSPs have data caches

More Related