COSC 3430 Computer Architecture Lecture 09: Single cycle control and Multicycle Implementation PH 3: Chapter 5 sectio

COSC 3430 Computer Architecture Lecture 09: Single cycle control and Multicycle ImplementationPH 3: Chapter 5 sections 5.4 and 5.5

Single cycle datapath control

Control • Selecting the operations to perform (ALU, read/write, etc.) • Controlling the flow of data (multiplexor inputs) • Information comes from the 32 bits of the instruction • Example: add $8, $17, $18 • Instruction Format:000000 10001 10010 01000 00000 100000 op rs rt rd shamt func • ALU's operation based on instruction type and function code

Control • e.g., what should the ALU do with this instruction • Example: lw $1, 100($2) • 35 2 1 100 op rs rt 16 bit offset • ALU control inputs as developed in B.6 0000 AND 0001 OR 0010 add 0110 subtract 0111 set-on-less-than 1100 NOR Not all of the above are used in this simplified datapath development

ALUOp is a 2 bit output computed from instruction type Control • Must describe hardware to compute 4-bit ALU control input from • given instruction type 00 = lw, sw 01 = beq, 10 = arithmetic • function code for arithmetic • Describe it using a truth table (can turn into gates):

Single cycle with control

Settings of the control lines from the opcode INSTRUCTION OPCODES Binary R-type 0 000000 Beq 4 000100 Lw 35 100011 Sw 43 101011

Control. Generation of the control signals from opcode

Truth table for the ALU 4 bit operation We show an implementation of this truth table with gates on the next slide. This could be considered a 3 bit output since the MSB is always 0 for our problem.

Control. Generating the 4 bit ALU operation from the ALUOp0 and ALUOp1 and the function code (bits 0-5) Example: Suppose ALUOP1 = 1 and F1 = 1 All others = 0 except ALUOP0 is X. Output should be 0110

Cycle 1 Cycle 2 Clk lw sw Waste Single Cycle Disadvantages & Advantages • Uses the clock cycle inefficiently – the clock cycle must be timed to accommodate the slowest instruction • especially problematic for more complex instructions like floating point multiply • May be wasteful of area since some functional units (e.g., adders) must be duplicated since they can not be shared during a clock cycle but • Is simple and easy to understand

Single Cycle Implementation (an example) • Calculate cycle time assuming negligible delays except: • memory (200ps), ALU and adders (100ps), register file access (50ps) • Assuming only the above delays, which of the following implementations would be faster and by how much? • An implementation in which every instruction operates in 1 clock cycle of a fixed length, or • An implementation where every instruction executes in 1 clock cycle using a variable-length clock, which for each instruction is only as long as it needs to be. (Such an approach is not practical, but it will allow us to see what is being sacrificed when all the instructions must execute in a single clock of the same length.)

Example continued • To compare performance, assume the following instruction mix: 25% loads, 10% stores, 45% ALU instructions, 15% branches, and 5% jumps. • First compare the CPU execution times using the equation • CPU time = Instr count × CPI × Clock cycle time, so • CPU time = IC × Clock cycle time, since CPI = 1 for both cases

Steps and times for various instructions

Example continued • The clock cycle for a machine with a single clock cycle time for all instructions will be determined by the longest instruction, which is 600ps, so CPU time = 600ps (IC). • A machine with a variable clock cycle time has an average time per instruction of CPU cycle = 600(25%) + 550(10%) + 400(45%) + 350(15%0 + 200(5%) = 447.5ps. • Since the variable clock has a shorter average clock cycle, it’s CPU time = 447.5ps (IC). The performance improvement is then 600/447.5 = 1.34.

Example continued • Hence the variable clock implementation is 1.34 times faster. • Unfortunately, implementing a variable speed clock for each instruction class is extremely difficult, and the overhead for such an approach could be larger than any advantage gained. As we will later see, an alternative is to use a shorter clock cycle that does less work and then vary the number of clock cycles for the different instruction classes. • The penalty for using a single-cycle design with a fixed clock cycle is significant, but might be acceptable for the small instruction set we are using. Early computers did exactly this. However, implementing a floating point unit for example, or an ISA with more complex instructions, wouldn’t work well at all.

Example continued • Because we must assume the clock cycle is equal to the worst-case delay for all instructions, we can’t use implementations that reduce the delay of the common case unless they also improve the worst case time. • A single cycle implementation thus violates one of our key design principles of making the common case fast.

Single Cycle Datapath with Control Unit 0 Add Add 1 4 Shift left 2 PCSrc ALUOp Branch MemRead Instr[31-26] Control Unit MemtoReg MemWrite ALUSrc RegWrite RegDst ovf Instr[25-21] Read Addr 1 Instruction Memory Read Data 1 Address Register File Instr[20-16] zero Read Addr 2 Data Memory Read Address PC Instr[31-0] 0 Read Data 1 ALU Write Addr Read Data 2 0 1 Write Data 0 Instr[15 -11] Write Data 1 Instr[15-0] Sign Extend ALU control 16 32 Instr[5-0]

Where we are headed • Single Cycle Problems: • what if we had a more complicated instruction like floating point? • The clock cycle is set by the longest instruction execution time. • Even with our simplified implementation, the clock cycle time will be determined by the time for a load instruction which uses the instruction memory, register file, the ALU, data memory, and the register file again. • One Solution: A multicycle datapath • use a “smaller” cycle time • have different instructions take different numbers of cycles

Multicycle Datapath Approach • Let an instruction take more than 1 clock cycle to complete • Break up instructions into steps where each step takes a cycle while trying to • balance the amount of work to be done in each step • restrict each cycle to use only one major functional unit • Not every instruction takes thesame number of clock cycles • In addition to faster clock rates, multicycle allows functional units that can be used more than once per instruction as long as they are used on different clock cycles, as a result • only need one memory – but only one memory access per cycle • need only one ALU/adder – but only one ALU operation per cycle

IR Address Memory A Read Addr 1 PC Read Data 1 Register File Read Addr 2 Read Data (Instr. or Data) ALUout ALU Write Addr Write Data Read Data 2 B Write Data MDR Multicycle Datapath Approach, con’t • At the end of a cycle • Store values needed in a later cycle by the current instruction in an internal register (not visible to the programmer). All (except IR) hold data only between a pair of adjacent clock cycles (no write control signal needed) IR – Instruction Register MDR – Memory Data Register A, B – regfile read data registers ALUout – ALU output register • Data used by subsequent instructions are stored in programmer visible registers (i.e., register file, PC, or memory)

Next Lecture and Reminders • Next lecture • MIPS multicycle datapath and control • Reading assignment – PH, Chapter 5.5

COSC 3430 Computer Architecture Lecture 09: Single cycle control and Multicycle Implementation PH 3: Chapter 5 sectio

COSC 3430 Computer Architecture Lecture 09: Single cycle control and Multicycle Implementation PH 3: Chapter 5 sectio

Presentation Transcript

COSC 4349 and 5349 Computer Architecture

COMP541 Multicycle MIPS

The Multicycle Implementation

Computer Architecture and Design – ELEN 350

Chapter 4 Computer Design Basics

Multiple - Cycle Hardwired Control

Lecture’s Overview

55:035 Computer Architecture and Organization

Finite-State Control for Multicycle Datapath 11 October 2013

Chapter Five Part 4: Implementing Multicycle Control

CS 152 Computer Architecture and Engineering Lecture 12 Multicycle Controller Design Exceptions

William Stallings Computer Organization and Architecture

Lecture 5. MIPS Processor Design Single-cycle MIPS #1

CEG3420 Computer Design Lecture 9: Designing a Single Cycle Datapath

Lecture’s Overview

Computer Architecture: Intro Lecture 8- The Control Path

Chapter 4 Computer Design Basics

CS1104 – Computer Organization

The Big Picture: Where are We Now?

Computer Organization and Architecture Chapter 5: The Processor: Datapath and Control

Computer Architecture COSC 3430

CSE 502: Computer Architecture