
Midterm Exam Review


Presentation Transcript


  1. Midterm Exam Review

  2. Exam Format • The exam will have 5 questions. • One question: true/false, covering general topics. • The other 4 questions either require calculation or filling in pipelining tables.

  3. General Introduction: Technology trends, Cost trends, and Performance evaluation

  4. Computer Architecture [Figure: computer architecture at the intersection of technology, programming languages, applications, operating systems, history, and measurement & evaluation] • Definition: Computer architecture involves 3 inter-related components: • Instruction set architecture (ISA) • Organization • Hardware

  5. Three Computing Markets • Desktop • Optimize price and performance (focus of this class) • Servers • Focus on availability, scalability, and throughput • Embedded computers • In appliances, automobiles, network devices … • Wide performance range • Real-time performance constraints • Limited memory • Low power • Low cost

  6. Trends in Technology • Trends in computer technology have generally followed Moore’s Law closely: “Transistor density of chips doubles every 1.5-2.0 years.” • Processor performance • Memory density • Logic circuit density and speed • Memory access time and disk access time do not follow Moore’s Law, which creates a big gap between processor and memory performance.

  7. Moore’s Law: Processor-DRAM Memory Gap (latency) [Figure: performance vs. year, 1980-2000. Processor performance grows ~60%/yr (2x every 1.5 years, “Moore’s Law”); DRAM performance grows ~9%/yr (2x every 10 years); the processor-memory performance gap grows ~50% per year]

  8. Trends in Cost • High volume lowers manufacturing costs (doubling the volume decreases cost by around 10%). • The learning curve lowers manufacturing costs: when a product is first introduced it costs a lot, then the cost declines rapidly. • Integrated circuit (IC) costs: die cost, IC cost, dies per wafer. • Relationship between the cost and the price of whole computers.

  9. Metrics for Performance • Hardware performance is one major factor in the success of a computer system. • Response time (execution time): the time between the start and completion of an event. • Throughput: the total amount of work done in a period of time. • CPU time is a very good measure of performance (important to understand, e.g., how to compare 2 processors using CPU time and CPI, and how to quantify an improvement using CPU time): CPU time = Seconds/Program = (Instructions/Program) x (Cycles/Instruction) x (Seconds/Cycle), i.e., CPU time = I x CPI x C
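As a concrete example of comparing two processors with the CPU time equation, the sketch below uses made-up instruction counts, CPIs, and clock rates (none of these numbers come from the slides):

```python
# Comparing two hypothetical processors running the same program using
# CPU time = Instruction count x CPI x Clock cycle time.
# All numbers below are assumed for illustration only.
I = 2_000_000                     # instructions executed by the program
cpi_a, clock_a = 2.0, 1 / 1e9     # processor A: CPI 2.0 at 1 GHz
cpi_b, clock_b = 1.2, 1 / 800e6   # processor B: CPI 1.2 at 800 MHz

time_a = I * cpi_a * clock_a      # 0.004 s
time_b = I * cpi_b * clock_b      # 0.003 s
print(f"Speedup of B over A: {time_a / time_b:.2f}x")  # 1.33x
```

Despite the lower clock rate, processor B is faster here because its CPI advantage outweighs the clock difference, which is exactly the kind of comparison the CPU time equation supports.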

  10. Factors Affecting CPU Performance (CPU time = I x CPI x C)
      Component                            | Instruction Count I | CPI | Clock Cycle C
      Program                              | X                   | X   |
      Compiler                             | X                   | X   |
      Instruction Set Architecture (ISA)   | X                   | X   | X
      Organization                         |                     | X   | X
      Technology                           |                     |     | X

  11. Using Benchmarks to Evaluate and Compare the Performance of Different Processors • SPEC CPU2000: the most popular and industry-standard set of CPU benchmarks. • CINT2000 (11 integer programs) and CFP2000 (14 floating-point intensive programs). • Performance is measured relative to a Sun Ultra5_10 (300 MHz), which is given a score of SPECint2000 = SPECfp2000 = 100. • How to summarize performance: • Arithmetic mean • Weighted arithmetic mean • Geometric mean (this is what the industry uses)
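The geometric mean used to summarize SPEC ratios can be sketched as follows; the benchmark ratios here are made-up numbers, not real SPEC results:

```python
import math

def geometric_mean(xs):
    """Geometric mean: the n-th root of the product of n values."""
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

# Hypothetical per-benchmark ratios (performance relative to the
# reference machine, which scores 100) -- numbers are invented:
ratios = [120, 150, 90, 200]
print(f"Summary score (geometric mean): {geometric_mean(ratios):.1f}")
```

The geometric mean is preferred for ratios because the ratio of two machines' geometric means equals the geometric mean of their per-benchmark ratios, so the ranking does not depend on which machine is chosen as the reference.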

  12. Other Measures of Performance • MIPS • MFLOPS • Amdahl’s law: suppose that enhancement E accelerates a fraction F of the execution time (NOT frequency) by a factor S, and the remainder of the time is unaffected. Then (important to understand): Execution time with E = ((1 - F) + F/S) x Execution time without E, so Speedup(E) = 1 / ((1 - F) + F/S)
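Amdahl's law translates directly into code; the example fraction and factor below are chosen for illustration:

```python
def amdahl_speedup(f, s):
    """Overall speedup when a fraction f of execution TIME is sped up
    by a factor s and the remaining (1 - f) is unaffected."""
    return 1.0 / ((1.0 - f) + f / s)

# e.g., enhancing 40% of the execution time by a factor of 10
# (assumed numbers): 1 / (0.6 + 0.04) = 1.5625
print(amdahl_speedup(0.4, 10))
```

Note the limit behavior: even as s grows without bound, the speedup is capped at 1/(1 - f), which is why the unaffected fraction dominates.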

  13. Instruction Set Architectures

  14. Instruction Set Architecture (ISA) [Figure: the instruction set as the interface layer between software and hardware]

  15. The Big Picture [Figure: design stack from Problem and Algorithms (SPEC requirements, performance focus) through Programming Language/OS (C fragment: f2() { f3(s2, &j, &i); *s2->p = 10; i = *s2->q + i; }), down to the ISA (loads, a multiply, and an add, e.g., ld r1, b; ld r2, c; add r3, r1, r2), microarchitecture, circuits, and devices]

  16. Classifying ISA • Memory-memory architecture • Simple compilers • Reduced number of instructions for programs • Slower in performance (processor-memory bottleneck) • Memory-register architecture • In between the two. • Register-register architecture (load-store) • Complicated compilers • Higher memory requirements for programs • Better performance (e.g., more efficient pipelining)

  17. Memory Addressing & Instruction Operations • Addressing modes • Many addressing modes exist. • Only a few are frequently used (register direct, displacement, immediate, register indirect). • We should adopt only the frequently used ones. • Many opcodes (operations) have been proposed and used. • Measurements show only a few (around 10) are frequently used.

  18. RISC vs. CISC • Nowadays there is not much difference between CISC and RISC in terms of instructions. • The key difference is that RISC has fixed-length instructions while CISC has variable-length instructions. • In fact, the Pentium/AMD processors internally have RISC cores.

  19. 32-bit vs. 64-bit Processors • The main difference is that 64-bit processors have 64-bit registers and 64-bit wide memory addresses, so accessing memory may be faster. • Their instruction length is independent of whether they are 64-bit or 32-bit processors. • They can access 64 bits of memory in one clock cycle.

  20. Pipelining

  21. Computer Pipelining • Pipelining is an implementation technique in which multiple instructions are overlapped in execution. • An instruction execution pipeline involves a number of steps, where each step completes a part of an instruction. • Each step is called a pipe stage or a pipe segment. • Throughput of an instruction pipeline is determined by how often an instruction exits the pipeline. • The time to move an instruction one step down the pipeline equals the machine cycle and is determined by the stage with the longest processing delay.

  22. Pipelining: Design Goals • An important pipeline design consideration is to balance the length of each pipeline stage. • Pipelining doesn’t help the latency of a single instruction, but it helps the throughput of the entire program. • The pipeline rate is limited by the slowest pipeline stage. • Under ideal conditions: • Speedup from pipelining equals the number of pipeline stages. • One instruction is completed every cycle, CPI = 1.

  23. A 5-stage Pipelined MIPS Datapath

  24. Pipelined Example - Executing Multiple Instructions • Consider the following instruction sequence: • lw $r0, 10($r1) • sw $r3, 20($r4) • add $r5, $r6, $r7 • sub $r8, $r9, $r10

  25. Executing Multiple Instructions - Clock Cycle 1: LW in IF

  26. Executing Multiple Instructions - Clock Cycle 2: SW in IF, LW in ID

  27. Executing Multiple Instructions - Clock Cycle 3: ADD in IF, SW in ID, LW in EX

  28. Executing Multiple Instructions - Clock Cycle 4: SUB in IF, ADD in ID, SW in EX, LW in MEM

  29. Executing Multiple Instructions - Clock Cycle 5: SUB in ID, ADD in EX, SW in MEM, LW in WB

  30. Executing Multiple Instructions - Clock Cycle 6: SUB in EX, ADD in MEM, SW in WB

  31. Executing Multiple Instructions - Clock Cycle 7: SUB in MEM, ADD in WB

  32. Executing Multiple Instructions - Clock Cycle 8: SUB in WB
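The per-cycle slides above can be generated mechanically for an ideal 5-stage pipeline with no hazards; this sketch prints the full stage-occupancy table (the kind of table the exam asks you to fill in):

```python
# Stage-occupancy table for an ideal 5-stage pipeline, no hazards:
# instruction i enters stage s at cycle i + s + 1 (1-based cycles).
STAGES = ["IF", "ID", "EX", "MEM", "WB"]
instrs = ["lw", "sw", "add", "sub"]

n_cycles = len(instrs) + len(STAGES) - 1   # 4 + 5 - 1 = 8 cycles total
table = {}
for i in range(len(instrs)):
    for s, stage in enumerate(STAGES):
        table[(i, i + s + 1)] = stage

print("instr " + " ".join(f"C{c} " for c in range(1, n_cycles + 1)))
for i, ins in enumerate(instrs):
    row = [table.get((i, c), "-").ljust(3) for c in range(1, n_cycles + 1)]
    print(f"{ins:5s} " + " ".join(row))
```

Running this reproduces the slides: LW occupies IF in cycle 1 through WB in cycle 5, and SUB, the fourth instruction, finishes WB in cycle 8.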

  33. Processor Pipelining • There are two ways that pipelining can help: • Reduce the clock cycle time, and keep the same CPI • Reduce the CPI, and keep the same clock cycle time CPU time = Instruction count * CPI * Clock cycle time

  34. Reduce the clock cycle time, and keep the same CPI [Single-cycle datapath baseline: CPI = 1, Clock = X Hz]

  35. Reduce the clock cycle time, and keep the same CPI [Figure: 5-stage pipelined datapath with pipeline registers between stages: PC, instruction memory, register file, ALU, data memory, sign extension, and muxes] CPI = 1, Clock = X*5 Hz

  36. Reduce the CPI, and keep the same cycle time [Multi-cycle datapath baseline: CPI = 5, Clock = X*5 Hz]

  37. Reduce the CPI, and keep the same cycle time [Figure: 5-stage pipelined datapath with pipeline registers between stages] CPI = 1, Clock = X*5 Hz

  38. Pipelining: Performance [Figure: stage delays for a 5-stage pipeline (IF ID EX MEM WB) and a 6-stage variant (IF ID EX MEM1 MEM2 WB), with per-stage delays in the 4-10 ns range] • We looked at the performance (speedup, latency, CPI) of pipelines under many settings: • Unbalanced stages • Different numbers of stages • Additional pipelining overhead
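With unbalanced stages, the cycle time is set by the slowest stage plus pipeline-register overhead, so the speedup falls short of the stage count. This sketch uses the 5/5/4/4/5 ns stage delays suggested by the figure and an assumed 0.5 ns register (latch) overhead:

```python
# Speedup of an unbalanced pipeline over an unpipelined datapath.
# Stage delays follow the figure; the latch overhead is assumed.
stage_delays = [5, 5, 4, 4, 5]    # ns per stage
latch_overhead = 0.5              # ns added to every pipelined cycle (assumed)

unpipelined_time = sum(stage_delays)        # 23 ns per instruction
cycle = max(stage_delays) + latch_overhead  # 5.5 ns: slowest stage + overhead
speedup = unpipelined_time / cycle          # 23 / 5.5, about 4.18
print(f"cycle = {cycle} ns, speedup = {speedup:.2f}x (ideal would be 5x)")
```

The gap between 4.18x and the ideal 5x comes from both sources the slide lists: unbalanced stages (23 ns split into 5.5 ns cycles rather than 4.6 ns) and the added pipelining overhead.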

  39. Pipelining is Not That Easy for Computers • Limits to pipelining: hazards prevent the next instruction from executing during its designated clock cycle. • Structural hazards • Data hazards • Control hazards • A possible solution is to “stall” the pipeline until the hazard is resolved, inserting one or more “bubbles” into the pipeline. • We looked at the performance of pipelines with hazards.

  40. Techniques to Reduce Stalls • Structural hazards • Memory: separate instruction and data memories • Registers: write in the 1st half of the cycle and read in the 2nd half

  41. Data Hazard Classification • Different Types of Hazards (We need to know) • RAW (read after write) • WAW (write after write) • WAR (write after read) • RAR (read after read): Not a hazard. • RAW will always happen (true dependence) in any pipeline • WAW and WAR can happen in certain pipelines • Sometimes it can be avoided using register renaming

  42. Techniques to Reduce Data Hazards [Figure: pipelined datapath with forwarding paths from the A/M and M/W buffers back through muxes to the ALU inputs] • Hardware schemes to reduce data hazards: • Forwarding

  43. Pipeline with Forwarding: Could avoid stalls A set of instructions that depend on the DADD result uses forwarding paths to avoid the data hazard

  44. Techniques to Reduce Stalls • Software schemes to reduce data hazards • Compiler scheduling: reduce load stalls. • Original code with stalls: LD Rb,b / LD Rc,c / (stall) DADD Ra,Rb,Rc / SD Ra,a / LD Re,e / LD Rf,f / (stall) DSUB Rd,Re,Rf / SD Rd,d • Scheduled code with no stalls: LD Rb,b / LD Rc,c / LD Re,e / DADD Ra,Rb,Rc / LD Rf,f / SD Ra,a / DSUB Rd,Re,Rf / SD Rd,d

  45. Control Hazards • When a conditional branch executes, it may change the PC and, without any special measures, the pipeline stalls for a number of cycles until the branch condition is known.
      Branch instruction      IF ID EX MEM WB
      Branch successor           IF stall stall IF ID EX MEM WB
      Branch successor + 1                      IF ID EX MEM WB
      Branch successor + 2                         IF ID EX MEM
      Branch successor + 3                            IF ID EX
      Branch successor + 4                               IF ID
      Branch successor + 5                                  IF
      Three clock cycles are wasted for every branch in the current MIPS pipeline.
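The cost of that 3-cycle branch penalty shows up directly in the effective CPI; the branch frequency below is an assumed number for illustration:

```python
# Effective CPI when every branch wastes stall cycles.
base_cpi = 1.0           # ideal pipelined CPI
branch_freq = 0.2        # fraction of instructions that are branches (assumed)
branch_penalty = 3       # cycles wasted per branch (current MIPS pipeline)

effective_cpi = base_cpi + branch_freq * branch_penalty  # 1.0 + 0.6 = 1.6
print(f"effective CPI = {effective_cpi}")
```

With these numbers, branch stalls alone inflate CPI by 60%, which motivates the hardware and software branch-handling techniques on the next slides.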

  46. Techniques to Reduce Stalls • Hardware schemes to reduce control hazards: • Moving the calculation of the branch target earlier in the pipeline

  47. Techniques to Reduce Stalls • Software schemes to reduce control hazards: • Branch prediction • Example: predict backward branches (loops) as taken and forward branches (ifs) as not taken • Tracing program behaviour

  48. [Figure: three panels, (A), (B), and (C)]

  49. Dynamic Branch Prediction • Builds on the premise that history matters • Observe the behavior of branches in previous instances and try to predict future branch behavior • Try to predict the outcome of a branch early on in order to avoid stalls • Branch prediction is critical for multiple issue processors • In an n-issue processor, branches will come n times faster than a single issue processor
