This section reviews Quiz 1 for CS 152: writing microcode for a new JALM instruction, the bypassing and precise-exception consequences of adding an early writeback port to a 6-stage pipeline, and Iron Law tradeoffs such as shrinking the register file, adding a branch delay slot, merging the Execute and Memory stages, and moving from a microcoded CISC to a pipelined RISC. It also covers practical fixes for common pipeline-design pitfalls and how ISA choices affect performance.
CS 152, Spring 2010, Section 5
Andrew Waterman, University of California, Berkeley
Mystery Die
• NVIDIA GTX280
• 240 cores * 1.296 GHz * 3 flops/cycle
• 933 GFLOPS (Nehalem comparison: 8 cores * 3 GHz * 8 flops/cycle = 192 GFLOPS)
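Worked out, the peak rates quoted above are 240 × 1.296 GHz × 3 flops/cycle = 933.12 GFLOPS for the GTX280, versus 8 cores × 3 GHz × 8 flops/cycle = 192 GFLOPS for the Nehalem comparison, roughly a 4.9× gap in raw peak throughput.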
Agenda
• Quiz 1 Post-Mortem
• VM & Caches
• Return PS1
  • Graded only for completeness
Quiz 1, Q1
• Microcode for JALM offset(rs)
• Corner case didn’t hurt performance
• Straightforward sol’n (27/29 points):
    A <- R[rs]
    B <- sExt16(imm)
    MA <- A + B
    A <- PC          // PC = PC+4 already happened
    R[31] <- A
    PC <- M[MA]
Quiz 1, Q1
• Cleverer sol’n:
    B <- R[rs]       // use commutative property
    R[31] <- A + 4   // A still has old PC
    A <- sExt16(imm)
    MA <- A + B
    PC <- M[MA]
• AFAIK, this is the only 5-line solution
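For reference, a minimal C sketch of the architectural effect both microcode sequences implement; R, pc, and load_word are assumed names for illustration, not part of the handout:

    #include <stdint.h>

    extern uint32_t R[32];                  /* architectural registers (assumed) */
    extern uint32_t pc;                     /* assumed to hold the address of the JALM itself */
    extern uint32_t load_word(uint32_t a);  /* hypothetical memory-read helper */

    /* Architectural effect of JALM offset(rs), as implied by the microcode:
       link the return address into r31, then jump to the word fetched from
       R[rs] + sign-extended offset. */
    void jalm(int rs, int16_t offset)
    {
        uint32_t ea = R[rs] + (uint32_t)(int32_t)offset;  /* MA <- R[rs] + sExt16(imm) */
        R[31] = pc + 4;                                   /* R[31] <- address of next insn */
        pc = load_word(ea);                               /* PC <- M[MA] */
    }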
Quiz 1, Q1
• Common problems:
  • Forgetting that A already had the old PC, so taking an extra cycle
  • Forgetting that PC was already incremented, so doing R[31] <- oldPC + 8
  • Being overly conservative with don’t-cares
    • Can destroy IR as soon as you’ve read rs and imm
    • Can set a register’s load-enable to don’t-care in the cycle its value is used
• Almost all points deducted were nit-picks
Quiz 1, Q2
• 6-stage pipeline; new writeback port at end of EX
• When an ALU op has proceeded to M1, its writeback value is available to the insn in ID
• The second write port doesn’t help the immediately subsequent insn, just the one after it
• Example insn sequence that benefits from it:
    add r1, r2, r3
    sub r11, r12, r13
    add r21, r1, r23
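A cycle-by-cycle sketch of why only the second-following insn benefits, assuming stages IF/ID/EX/M1/M2/WB, register reads in ID, and the new port writing at the end of EX:

    cycle:               1    2    3    4    5    6    7    8
    add  r1, r2, r3      IF   ID   EX*  M1   M2   WB
    sub  r11, r12, r13        IF   ID   EX*  M1   M2   WB
    add  r21, r1, r23              IF   ID   EX   M1   M2   WB
    (* value written to the register file at the end of EX)

The third insn reads r1 in ID during cycle 4, after the write at the end of cycle 3, so it needs no bypass. The sub reads its sources in ID during cycle 3, while the first add is still in EX, so an immediately subsequent consumer of r1 would still need the bypass.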
Quiz 1, Q2
• 6-stage pipeline; new writeback port at end of EX
• When an ALU op has proceeded to M1, its writeback value is available to the insn in ID
• Can remove the bypass from end of M1 to end of ID
  • Equivalently, from start of M2 to start of EX
• Can also remove the *ALU* bypasses from end of M2 to end of ID and from end of WB to end of ID
  • Still needed for bypassing load results
  • Didn’t require this answer
Quiz 1, Q2
• 6-stage pipeline; new writeback port at end of EX
• Problem with precise state:
  • Memory address exceptions aren’t detected until M2
  • By then, a subsequent ALU op has already written back
      lw  r1, -1(r0)   // misaligned address
      xor r2, r3, r4   // r2 modified anyway
• Fix with an interlock:
  • Stall any ALU op immediately following any load/store
  • Actually reduces control logic (the interlock is already there for a load followed by a dependent ALU op)
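A hedged C sketch of the condition such an interlock might check; the helper names are assumptions, and how many cycles to stall depends on exactly when the address check completes:

    #include <stdbool.h>
    #include <stdint.h>

    extern bool is_mem_op(uint32_t insn);  /* load or store (assumed helper) */
    extern bool is_alu_op(uint32_t insn);  /* early-writeback ALU op (assumed helper) */

    /* Hold the ALU op in ID while the load/store just ahead of it is still in
       EX, so the ALU op cannot write back (at the end of its EX) before the
       memory op's address has been checked. */
    bool stall_ID(uint32_t insn_in_ID, uint32_t insn_in_EX)
    {
        return is_alu_op(insn_in_ID) && is_mem_op(insn_in_EX);
    }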
Quiz 1, Q2
• 6-stage pipeline; new writeback port at end of EX
• Problem with precise state:
  • Memory address exceptions aren’t detected until M2
  • By then, a subsequent ALU op has already written back
      lw  r1, -1(r0)   // misaligned address
      xor r2, r3, r4   // r2 modified anyway
• Fix with an additional read port:
  • Use the read port to read *rd* (r2 in the example above)
  • If the lw causes a trap, the old value of rd can then be restored
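A hedged C sketch of that idea; the names are illustrative only, not the handout's datapath:

    #include <stdint.h>

    extern uint32_t R[32];  /* architectural registers (assumed) */

    /* The extra read port captures rd's old value before the ALU op's early
       writeback; if the older load/store traps in M2, the saved value is
       written back so the state seen by the trap handler stays precise. */
    uint32_t early_writeback(int rd, uint32_t alu_result)
    {
        uint32_t old_rd = R[rd];   /* extra read port reads rd */
        R[rd] = alu_result;        /* writeback at the end of EX */
        return old_rd;             /* carried down the pipeline for possible rollback */
    }

    void rollback_on_trap(int rd, uint32_t old_rd)
    {
        R[rd] = old_rd;            /* undo the early writeback */
    }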
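The Q3 tradeoffs that follow are all judged against the Iron Law of processor performance:

    Time/Program = (Instructions/Program) × (Cycles/Instruction) × (Seconds/Cycle)

Each change below is scored by its effect on these three factors.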
Quiz 1, Q3
• Reducing the number of registers in the ISA
• Increases instructions/program because more registers must be spilled to the stack
• Increases CPI because of load-use delays (these loads will be harder to schedule around)
• Little penalty for “no effect”
• Subtle: could decrease CPI for some programs with bad D$ hit rates; stack accesses will almost always hit
• A smaller RF could shorten the critical path
Quiz 1, Q3
• Adding a branch delay slot
• The compiler can’t always fill the delay slot usefully, so more NOPs => more insns/program
• CPI decreases because fewer control hazards are possible; also, the new NOPs have low CPI
• Small critical-path reduction: no control signal needed to squash instructions after a taken branch
• Credit still given for “no effect”
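A minimal C sketch of delayed-branch semantics, assuming one delay slot and 4-byte instructions (the function name is illustrative):

    #include <stdbool.h>
    #include <stdint.h>

    /* With one branch delay slot, the instruction at branch_pc + 4 executes no
       matter what; only afterwards does control go to the target (taken) or to
       branch_pc + 8 (not taken). */
    uint32_t pc_after_delay_slot(uint32_t branch_pc, bool taken, uint32_t target)
    {
        return taken ? target : branch_pc + 8;
    }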
Quiz 1, Q3
• Merging the Execute and Memory stages
• No effect on insns/program: the change isn’t ISA-visible
• Decreases CPI: eliminates the load-use delay, since a load now delivers its result in the same stage an ALU op would, so a dependent insn can be bypassed without a stall
  • NOT just because the pipeline depth is reduced
• Address calculation is added to the critical path
Quiz 1, Q3
• Microcoded CISC -> pipelined RISC
• Increases insns/program: CISCs take fewer insns to encode a given program
• Decreases CPI: RISC pipelines can sustain CPIs close to 1, whereas microcoded machines take several clocks per insn
• Toss-up on seconds/cycle:
  • Bypasses and extra control signals in the pipeline are slow
  • The shared bus in the microcoded machine could be slow, too
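Purely illustrative numbers (not from the quiz) to show how the Iron Law trades these off: if the RISC code needs 1.3× the dynamic instruction count but sustains a CPI near 1, while the microcoded machine averages around 4 cycles per instruction at a comparable cycle time, the pipelined RISC comes out roughly 4 / 1.3 ≈ 3× faster.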