
Midterm Exam Review


Presentation Transcript


  1. Midterm Exam Review

  2. Exam Format • The exam will have 5 questions. • One question: true/false, covering general topics. • The other 4 questions either require calculation or filling in pipelining tables.

  3. General Introduction: Technology trends, Cost trends, and Performance evaluation

  4. Computer Architecture [Figure: computer architecture at the intersection of technology, programming languages, applications, operating systems, history, and measurement & evaluation] • Definition: Computer architecture involves 3 inter-related components: • Instruction set architecture (ISA) • Organization • Hardware

  5. Three Computing Markets • Desktop • Optimize price and performance (focus of this class) • Servers • Focus on availability, scalability, and throughput • Embedded computers • In appliances, automobiles, network devices … • Wide performance range • Real-time performance constraints • Limited memory • Low power • Low cost

  6. Trends in Technology • Trends in computer technology have generally followed Moore’s Law closely: “Transistor density of chips doubles every 1.5-2.0 years.” • Processor performance • Memory density • Logic circuit density and speed • Memory access time and disk access time do not follow Moore’s Law, which creates a big gap between processor and memory performance.

  7. Moore’s Law: Processor-DRAM Memory Gap (latency) [Figure: performance vs. year, 1980-2000. Processor performance grows ~60%/yr (2x every 1.5 years, “Moore’s Law”); DRAM performance grows ~9%/yr (2x every 10 years); the processor-memory performance gap grows ~50% per year]

  8. Trends in Cost • High volume lowers manufacturing costs (doubling the volume decreases cost by around 10%). • The learning curve lowers manufacturing costs: when a product is first introduced it costs a lot, then the cost declines rapidly. • Integrated circuit (IC) costs: die cost, IC cost, dies per wafer. • Relationship between the cost and the price of whole computers.

  9. Metrics for Performance • Hardware performance is one major factor in the success of a computer system. • Response time (execution time): the time between the start and completion of an event. • Throughput: the total amount of work done in a period of time. • CPU time is a very good measure of performance (important to understand, e.g., how to compare 2 processors using CPU time and CPI, and how to quantify an improvement using CPU time): CPU time = Seconds/Program = (Instructions/Program) x (Cycles/Instruction) x (Seconds/Cycle), i.e., CPU time = I x CPI x C
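As a concrete example of comparing two processors with the CPU time equation, the sketch below uses made-up instruction counts, CPIs, and clock rates (none of these numbers come from the slides):

```python
# Comparing two hypothetical processors running the same program using
# CPU time = Instruction count x CPI x Clock cycle time.
# All numbers below are assumed for illustration only.
I = 2_000_000                     # instructions executed by the program
cpi_a, clock_a = 2.0, 1 / 1e9     # processor A: CPI 2.0 at 1 GHz
cpi_b, clock_b = 1.2, 1 / 800e6   # processor B: CPI 1.2 at 800 MHz

time_a = I * cpi_a * clock_a      # 0.004 s
time_b = I * cpi_b * clock_b      # 0.003 s
print(f"Speedup of B over A: {time_a / time_b:.2f}x")  # 1.33x
```

Despite the lower clock rate, processor B is faster here because its CPI advantage outweighs the clock difference, which is exactly the kind of comparison the CPU time equation supports.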

  10. Factors Affecting CPU Performance (CPU time = I x CPI x C)
      Component                            | Instruction Count I | CPI | Clock Cycle C
      Program                              | X                   | X   |
      Compiler                             | X                   | X   |
      Instruction Set Architecture (ISA)   | X                   | X   | X
      Organization                         |                     | X   | X
      Technology                           |                     |     | X

  11. Using Benchmarks to Evaluate and Compare the Performance of Different Processors • SPEC CPU2000: the most popular and industry-standard set of CPU benchmarks. • CINT2000 (11 integer programs) and CFP2000 (14 floating-point intensive programs). • Performance is measured relative to a Sun Ultra5_10 (300 MHz), which is given a score of SPECint2000 = SPECfp2000 = 100. • How to summarize performance: • Arithmetic mean • Weighted arithmetic mean • Geometric mean (this is what the industry uses)
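The geometric mean used to summarize SPEC ratios can be sketched as follows; the benchmark ratios here are made-up numbers, not real SPEC results:

```python
import math

def geometric_mean(xs):
    """Geometric mean: the n-th root of the product of n values."""
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

# Hypothetical per-benchmark ratios (performance relative to the
# reference machine, which scores 100) -- numbers are invented:
ratios = [120, 150, 90, 200]
print(f"Summary score (geometric mean): {geometric_mean(ratios):.1f}")
```

The geometric mean is preferred for ratios because the ratio of two machines' geometric means equals the geometric mean of their per-benchmark ratios, so the ranking does not depend on which machine is chosen as the reference.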

  12. Other Measures of Performance • MIPS • MFLOPS • Amdahl’s law: suppose that enhancement E accelerates a fraction F of the execution time (NOT frequency) by a factor S, and the remainder of the time is unaffected. Then (important to understand): Execution time with E = ((1 - F) + F/S) x Execution time without E, so Speedup(E) = 1 / ((1 - F) + F/S)
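Amdahl's law translates directly into code; the example fraction and factor below are chosen for illustration:

```python
def amdahl_speedup(f, s):
    """Overall speedup when a fraction f of execution TIME is sped up
    by a factor s and the remaining (1 - f) is unaffected."""
    return 1.0 / ((1.0 - f) + f / s)

# e.g., enhancing 40% of the execution time by a factor of 10
# (assumed numbers): 1 / (0.6 + 0.04) = 1.5625
print(amdahl_speedup(0.4, 10))
```

Note the limit behavior: even as s grows without bound, the speedup is capped at 1/(1 - f), which is why the unaffected fraction dominates.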

  13. Instruction Set Architectures

  14. Instruction Set Architecture (ISA) [Figure: the instruction set as the interface layer between software and hardware]

  15. The Big Picture [Figure: design stack from Problem and Algorithms (SPEC requirements, performance focus) through Programming Language/OS (C fragment: f2() { f3(s2, &j, &i); *s2->p = 10; i = *s2->q + i; }), down to the ISA (loads, a multiply, and an add, e.g., ld r1, b; ld r2, c; add r3, r1, r2), microarchitecture, circuits, and devices]

  16. Classifying ISA • Memory-memory architecture • Simple compilers • Reduced number of instructions for programs • Slower in performance (processor-memory bottleneck) • Memory-register architecture • In between the two. • Register-register architecture (load-store) • Complicated compilers • Higher memory requirements for programs • Better performance (e.g., more efficient pipelining)

  17. Memory Addressing & Instruction Operations • Addressing modes • Many addressing modes exist. • Only a few are frequently used (register direct, displacement, immediate, register indirect). • We should adopt only the frequently used ones. • Many opcodes (operations) have been proposed and used. • Measurements show only a few (around 10) are frequently used.

  18. RISC vs. CISC • Nowadays there is not much difference between CISC and RISC in terms of instructions. • The key difference is that RISC has fixed-length instructions while CISC has variable-length instructions. • In fact, the Pentium/AMD processors internally have RISC cores.

  19. 32-bit vs. 64-bit Processors • The main difference is that 64-bit processors have 64-bit registers and 64-bit wide memory addresses, so accessing memory may be faster. • Their instruction length is independent of whether they are 64-bit or 32-bit processors. • They can access 64 bits of memory in one clock cycle.

  20. Pipelining

  21. Computer Pipelining • Pipelining is an implementation technique in which multiple instructions are overlapped in execution. • An instruction execution pipeline involves a number of steps, where each step completes a part of an instruction. • Each step is called a pipe stage or a pipe segment. • Throughput of an instruction pipeline is determined by how often an instruction exits the pipeline. • The time to move an instruction one step down the pipeline equals the machine cycle and is determined by the stage with the longest processing delay.

  22. Pipelining: Design Goals • An important pipeline design consideration is to balance the length of each pipeline stage. • Pipelining doesn’t help the latency of a single instruction, but it helps the throughput of the entire program. • The pipeline rate is limited by the slowest pipeline stage. • Under ideal conditions: • Speedup from pipelining equals the number of pipeline stages. • One instruction is completed every cycle, CPI = 1.

  23. A 5-stage Pipelined MIPS Datapath

  24. Pipelined Example - Executing Multiple Instructions • Consider the following instruction sequence: • lw $r0, 10($r1) • sw $r3, 20($r4) • add $r5, $r6, $r7 • sub $r8, $r9, $r10

  25. Executing Multiple Instructions - Clock Cycle 1: LW in IF

  26. Executing Multiple Instructions - Clock Cycle 2: SW in IF, LW in ID

  27. Executing Multiple Instructions - Clock Cycle 3: ADD in IF, SW in ID, LW in EX

  28. Executing Multiple Instructions - Clock Cycle 4: SUB in IF, ADD in ID, SW in EX, LW in MEM

  29. Executing Multiple Instructions - Clock Cycle 5: SUB in ID, ADD in EX, SW in MEM, LW in WB

  30. Executing Multiple Instructions - Clock Cycle 6: SUB in EX, ADD in MEM, SW in WB

  31. Executing Multiple Instructions - Clock Cycle 7: SUB in MEM, ADD in WB

  32. Executing Multiple Instructions - Clock Cycle 8: SUB in WB
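The per-cycle slides above can be generated mechanically for an ideal 5-stage pipeline with no hazards; this sketch prints the full stage-occupancy table (the kind of table the exam asks you to fill in):

```python
# Stage-occupancy table for an ideal 5-stage pipeline, no hazards:
# instruction i enters stage s at cycle i + s + 1 (1-based cycles).
STAGES = ["IF", "ID", "EX", "MEM", "WB"]
instrs = ["lw", "sw", "add", "sub"]

n_cycles = len(instrs) + len(STAGES) - 1   # 4 + 5 - 1 = 8 cycles total
table = {}
for i in range(len(instrs)):
    for s, stage in enumerate(STAGES):
        table[(i, i + s + 1)] = stage

print("instr " + " ".join(f"C{c} " for c in range(1, n_cycles + 1)))
for i, ins in enumerate(instrs):
    row = [table.get((i, c), "-").ljust(3) for c in range(1, n_cycles + 1)]
    print(f"{ins:5s} " + " ".join(row))
```

Running this reproduces the slides: LW occupies IF in cycle 1 through WB in cycle 5, and SUB, the fourth instruction, finishes WB in cycle 8.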

  33. Processor Pipelining • There are two ways that pipelining can help: • Reduce the clock cycle time, and keep the same CPI • Reduce the CPI, and keep the same clock cycle time CPU time = Instruction count * CPI * Clock cycle time

  34. Reduce the clock cycle time, and keep the same CPI [Single-cycle datapath baseline: CPI = 1, Clock = X Hz]

  35. Reduce the clock cycle time, and keep the same CPI [Figure: 5-stage pipelined datapath with pipeline registers between stages: PC, instruction memory, register file, ALU, data memory, sign extension, and muxes] CPI = 1, Clock = X*5 Hz

  36. Reduce the CPI, and keep the same cycle time [Multi-cycle datapath baseline: CPI = 5, Clock = X*5 Hz]

  37. Reduce the CPI, and keep the same cycle time [Figure: 5-stage pipelined datapath with pipeline registers between stages] CPI = 1, Clock = X*5 Hz

  38. Pipelining: Performance [Figure: stage delays for a 5-stage pipeline (IF ID EX MEM WB) and a 6-stage variant (IF ID EX MEM1 MEM2 WB), with per-stage delays in the 4-10 ns range] • We looked at the performance (speedup, latency, CPI) of pipelines under many settings: • Unbalanced stages • Different numbers of stages • Additional pipelining overhead
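With unbalanced stages, the cycle time is set by the slowest stage plus pipeline-register overhead, so the speedup falls short of the stage count. This sketch uses the 5/5/4/4/5 ns stage delays suggested by the figure and an assumed 0.5 ns register (latch) overhead:

```python
# Speedup of an unbalanced pipeline over an unpipelined datapath.
# Stage delays follow the figure; the latch overhead is assumed.
stage_delays = [5, 5, 4, 4, 5]    # ns per stage
latch_overhead = 0.5              # ns added to every pipelined cycle (assumed)

unpipelined_time = sum(stage_delays)        # 23 ns per instruction
cycle = max(stage_delays) + latch_overhead  # 5.5 ns: slowest stage + overhead
speedup = unpipelined_time / cycle          # 23 / 5.5, about 4.18
print(f"cycle = {cycle} ns, speedup = {speedup:.2f}x (ideal would be 5x)")
```

The gap between 4.18x and the ideal 5x comes from both sources the slide lists: unbalanced stages (23 ns split into 5.5 ns cycles rather than 4.6 ns) and the added pipelining overhead.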

  39. Pipelining is Not That Easy for Computers • Limits to pipelining: hazards prevent the next instruction from executing during its designated clock cycle. • Structural hazards • Data hazards • Control hazards • A possible solution is to “stall” the pipeline until the hazard is resolved, inserting one or more “bubbles” into the pipeline. • We looked at the performance of pipelines with hazards.

  40. Techniques to Reduce Stalls • Structural hazards • Memory: separate instruction and data memories • Registers: write in the 1st half of the cycle and read in the 2nd half

  41. Data Hazard Classification • Different Types of Hazards (We need to know) • RAW (read after write) • WAW (write after write) • WAR (write after read) • RAR (read after read): Not a hazard. • RAW will always happen (true dependence) in any pipeline • WAW and WAR can happen in certain pipelines • Sometimes it can be avoided using register renaming

  42. Techniques to Reduce Data Hazards [Figure: pipelined datapath with forwarding paths from the A/M and M/W buffers back through muxes to the ALU inputs] • Hardware schemes to reduce data hazards: • Forwarding

  43. Pipeline with Forwarding: Could avoid stalls A set of instructions that depend on the DADD result uses forwarding paths to avoid the data hazard

  44. Techniques to Reduce Stalls • Software schemes to reduce data hazards • Compiler scheduling: reduce load stalls. • Original code with stalls: LD Rb,b / LD Rc,c / (stall) DADD Ra,Rb,Rc / SD Ra,a / LD Re,e / LD Rf,f / (stall) DSUB Rd,Re,Rf / SD Rd,d • Scheduled code with no stalls: LD Rb,b / LD Rc,c / LD Re,e / DADD Ra,Rb,Rc / LD Rf,f / SD Ra,a / DSUB Rd,Re,Rf / SD Rd,d

  45. Control Hazards • When a conditional branch executes, it may change the PC and, without any special measures, the pipeline stalls for a number of cycles until the branch condition is known.
      Branch instruction      IF ID EX MEM WB
      Branch successor           IF stall stall IF ID EX MEM WB
      Branch successor + 1                      IF ID EX MEM WB
      Branch successor + 2                         IF ID EX MEM
      Branch successor + 3                            IF ID EX
      Branch successor + 4                               IF ID
      Branch successor + 5                                  IF
      Three clock cycles are wasted for every branch in the current MIPS pipeline.
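The cost of that 3-cycle branch penalty shows up directly in the effective CPI; the branch frequency below is an assumed number for illustration:

```python
# Effective CPI when every branch wastes stall cycles.
base_cpi = 1.0           # ideal pipelined CPI
branch_freq = 0.2        # fraction of instructions that are branches (assumed)
branch_penalty = 3       # cycles wasted per branch (current MIPS pipeline)

effective_cpi = base_cpi + branch_freq * branch_penalty  # 1.0 + 0.6 = 1.6
print(f"effective CPI = {effective_cpi}")
```

With these numbers, branch stalls alone inflate CPI by 60%, which motivates the hardware and software branch-handling techniques on the next slides.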

  46. Techniques to Reduce Stalls • Hardware schemes to reduce control hazards: • Moving the calculation of the branch target earlier in the pipeline

  47. Techniques to Reduce Stalls • Software schemes to reduce control hazards: • Branch prediction • Example: predict backward branches (loops) as taken and forward branches (ifs) as not taken • Tracing program behaviour

  48. [Figure: three panels, (A), (B), and (C)]

  49. Dynamic Branch Prediction • Builds on the premise that history matters • Observe the behavior of branches in previous instances and try to predict future branch behavior • Try to predict the outcome of a branch early on in order to avoid stalls • Branch prediction is critical for multiple issue processors • In an n-issue processor, branches will come n times faster than a single issue processor
