
Advanced Computer Architecture 5MD00 / 5Z032 Introduction

Advanced Computer Architecture 5MD00 / 5Z032 Introduction. Henk Corporaal www.ics.ele.tue.nl/~heco/courses/aca h.corporaal@tue.nl TUEindhoven 2007. Lecture overview. Performance increase Technology factors Computing classes Cost Performance measurement Benchmarks Metrics.





Presentation Transcript


  1. Advanced Computer Architecture 5MD00 / 5Z032: Introduction. Henk Corporaal, www.ics.ele.tue.nl/~heco/courses/aca, h.corporaal@tue.nl, TU Eindhoven, 2007

  2. Lecture overview • Performance increase • Technology factors • Computing classes • Cost • Performance measurement • Benchmarks • Metrics ACA H.Corporaal

  3. ACA H.Corporaal

  4. Where Has This Performance Improvement Come From? • Technology • More transistors per chip • Faster logic • Machine Organization/Implementation • Deeper pipelines • More instructions executed in parallel • Instruction Set Architecture • Reduced Instruction Set Computers (RISC) • Multimedia extensions • Explicit parallelism • Compiler technology • Finding more parallelism in code • Greater levels of optimization

  5. VLSI Developments: Technology Improvement
     1946: ENIAC (Electronic Numerical Integrator and Computer)
     • Floor area: 140 m²
     • Performance: multiplication of two 10-digit numbers in 2 ms
     2007: High-performance microprocessor
     • Chip area: 100-300 mm²
     • Board area: 200 cm²; an improvement of 10^4
     • Performance: 64-bit multiply in O(1 ns); an improvement of 10^6
     • On top of this come architectural improvements, such as ILP exploitation

  6. Technology Trends: Processor Capacity
     [Figure: transistors per chip over time, following Moore's Law, with a "Graduation Window" marked]
     Transistor counts labelled in the figure:
     • Sparc Ultra: 5.2 million
     • Pentium Pro: 5.5 million
     • PowerPC 620: 6.9 million
     • Alpha 21164: 9.3 million
     • Alpha 21264: 15 million
     • P4: 42 million
     CMOS improvements:
     • Transistor density: 4x / 3 yrs
     • Die size: 10-25% / yr

  7. Memory Capacity (Single Chip DRAM)
     Year   Size (Mb)   Cycle time
     1980   0.0625      250 ns
     1983   0.25        220 ns
     1986   1           190 ns
     1989   4           165 ns
     1992   16          145 ns
     1996   64          120 ns
     2000   256         100 ns

  8. ACA H.Corporaal

  9. Performance Milestones: Latency Lags Bandwidth (last ~20 years)
     Improvement factors are given as (latency, bandwidth):
     • Processor: '286, '386, '486, Pentium, Pentium Pro, Pentium 4 (21x, 2250x)
     • Ethernet: 10 Mb/s, 100 Mb/s, 1000 Mb/s, 10000 Mb/s (16x, 1000x)
     • Memory module: 16-bit plain DRAM, Page Mode DRAM, 32b, 64b, SDRAM, DDR SDRAM (4x, 120x)
     • Disk: 3600, 5400, 7200, 10000, 15000 RPM (8x, 143x)
     CPU high, memory low ("Memory Wall")

  10. Technology Trends (Summary)
              Capacity         Speed (latency)
      Logic   2x in 3 years    2x in 3 years
      DRAM    4x in 3 years    2x in 10 years
      Disk    4x in 3 years    2x in 10 years
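
To make the rates in the summary table easier to compare, the sketch below converts an "N x in Y years" figure into a compound yearly factor and projects it over a fixed horizon. The function names and the 10-year horizon are illustrative assumptions, not part of the lecture.

```python
def annual_growth(factor, years):
    """Equivalent per-year growth factor for a `factor`x gain in `years` years."""
    return factor ** (1.0 / years)


def improvement_over(factor, years, horizon):
    """Total improvement after `horizon` years at the same compound rate."""
    return factor ** (horizon / years)


# Rates from the summary table; the 10-year horizon is an assumption.
logic_speed = improvement_over(2, 3, horizon=10)    # logic: 2x in 3 years
dram_latency = improvement_over(2, 10, horizon=10)  # DRAM: 2x in 10 years

print(f"logic speed, yearly factor:  {annual_growth(2, 3):.3f}")
print(f"DRAM latency, yearly factor: {annual_growth(2, 10):.3f}")
print(f"speed gap after 10 years:    {logic_speed / dram_latency:.1f}x")
```

Under these rates, logic speed pulls ahead of DRAM latency by about a factor of five per decade, which is one way to read the "Memory Wall" remark on the previous slide.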

  11. Computer classes • Desktop • PC • PDA ? • Game computers? • Server • Embedded See fig. 1.2 which lists • price of system • price of microprocessor module • volume (in 2005) • critical design issues ACA H.Corporaal

  12. Integrated Circuits Costs
      $$\text{IC cost} = \frac{\text{Die cost} + \text{Testing cost} + \text{Packaging cost}}{\text{Final test yield}}$$
      Final test yield: fraction of packaged dies that pass the final testing stage

  13. 8” MIPS64 R20K wafer (564 dies). Drawing a single-crystal Si ingot from the furnace... Then, slice it into wafers and pattern them...

  14. Integrated Circuits Costs
      $$\text{IC cost} = \frac{\text{Die cost} + \text{Testing cost} + \text{Packaging cost}}{\text{Final test yield}}$$
      $$\text{Die cost} = \frac{\text{Wafer cost}}{\text{Dies per wafer} \times \text{Die yield}}$$
      Die yield: fraction of good dies on a wafer
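
The two cost relations on this slide can be evaluated directly. The sketch below is a minimal illustration; apart from the 564 dies per wafer mentioned on slide 13, every number (wafer cost, yields, testing and packaging cost) is an assumed placeholder and not data from the lecture.

```python
def die_cost(wafer_cost, dies_per_wafer, die_yield):
    """Cost of one good die: wafer cost spread over the dies that work."""
    return wafer_cost / (dies_per_wafer * die_yield)


def ic_cost(die, testing, packaging, final_test_yield):
    """Cost of one packaged IC that passes final test."""
    return (die + testing + packaging) / final_test_yield


# Assumed values for illustration only (564 dies per wafer is from slide 13).
d = die_cost(wafer_cost=5000.0, dies_per_wafer=564, die_yield=0.5)
total = ic_cost(d, testing=2.0, packaging=3.0, final_test_yield=0.95)

print(f"die cost: {d:.2f}")
print(f"IC cost:  {total:.2f}")
```

Both yields appear in a denominator, so halving either one doubles the corresponding cost term.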

  15. Final Product Price
      [Figure: stacked bar showing how the list price is built up; shares of list price:
      average discount 25% to 40%, gross margin 34% to 39%, direct cost 6% to 8%, component cost 15% to 33%.
      Component cost + direct cost + gross margin = average selling price; adding the average discount gives the list price]
      • Component costs
      • Direct costs (add 25% to 40%): recurring costs such as labor, purchasing, warranty
      • Gross margin (add 82% to 186%): nonrecurring costs such as R&D, marketing, sales, equipment maintenance, rental, financing cost, pretax profits, taxes
      • Average discount to get list price (add 33% to 66%): volume discounts and/or retailer markup
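
The "add x%" markups on this slide compound on top of one another. The sketch below chains them for the low and high ends of each range, starting from an assumed component cost of 100 units; the function name and the starting value are illustrative.

```python
def list_price(component_cost, direct_pct, margin_pct, discount_pct):
    """Apply the three markups in sequence to reach the list price."""
    with_direct = component_cost * (1 + direct_pct)   # add direct costs
    selling_price = with_direct * (1 + margin_pct)    # add gross margin
    return selling_price * (1 + discount_pct)         # add average discount


component = 100.0  # assumed starting value
low = list_price(component, 0.25, 0.82, 0.33)
high = list_price(component, 0.40, 1.86, 0.66)

print(f"list price range: {low:.0f} to {high:.0f}")
print(f"component share of list price: "
      f"{100 * component / high:.0f}% to {100 * component / low:.0f}%")
```

The resulting component share of the list price comes out at roughly 15% to 33%, which matches the component-cost range quoted for the figure.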

  16. Quantitative Principles of Design • Take Advantage of Parallelism • Principle of Locality • Focus on the Common Case • Amdahl’s Law • The Performance Equation ACA H.Corporaal

  17. 1. Parallelism How to improve performance? • (Super)-pipelining • Powerful instructions • MD-technique • multiple data operands per operation • MO-technique • multiple operations per instruction • Multiple instruction issue ACA H.Corporaal

  18. Pipelined Instruction Execution
      [Figure: four overlapping instructions, each passing through the stages Ifetch, Reg, ALU, DMem, Reg; horizontal axis: time (clock cycles 1 to 7); vertical axis: instruction order]

  19. Limits to pipelining
      [Figure: pipeline diagram with stages Ifetch, Reg, ALU, DMem, Reg; horizontal axis: time (clock cycles); vertical axis: instruction order]
      • Hazards prevent the next instruction from executing during its designated clock cycle
        • Structural hazards: attempt to use the same hardware to do two different things at once
        • Data hazards: an instruction depends on the result of a prior instruction still in the pipeline
        • Control hazards: caused by the delay between the fetching of instructions and decisions about changes in control flow (branches and jumps)

  20. 2. The Principle of Locality
      • The Principle of Locality: programs access a relatively small portion of the address space at any instant of time.
      • Two different types of locality:
        • Temporal locality (locality in time): if an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse)
        • Spatial locality (locality in space): if an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array access)
      • For the last 30 years, hardware has relied on locality for memory performance.
      [Figure: processor (P), cache ($), memory (MEM)]
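
A toy cache model can show why locality matters. The sketch below simulates a small direct-mapped cache and compares three address streams; the cache geometry (64 lines of 16 bytes) and the access patterns are assumptions chosen for illustration, not parameters from the lecture.

```python
def hit_rate(addresses, num_lines=64, block_size=16):
    """Hit rate of a tiny direct-mapped cache (hypothetical parameters)."""
    lines = [None] * num_lines           # block number held by each line
    hits = 0
    for addr in addresses:
        block = addr // block_size       # which memory block is touched
        index = block % num_lines        # the single line it may occupy
        if lines[index] == block:
            hits += 1
        else:
            lines[index] = block         # miss: fetch the block
    return hits / len(addresses)


sequential = list(range(4096))                  # spatial locality
looped = list(range(256)) * 16                  # temporal locality
strided = [i * 1024 for i in range(4096)]       # neither: new block each time

for name, stream in [("sequential", sequential),
                     ("looped", looped),
                     ("strided", strided)]:
    print(f"{name:10s} hit rate: {hit_rate(stream):.3f}")
```

The sequential stream misses once per 16-byte block, the looped stream misses only on its first pass, and the strided stream never reuses a cached block in this model.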

  21. Memory Hierarchy Levels
      From the upper level (faster) to the lower level (larger):
      Level             Capacity           Access time               Cost
      CPU registers     100s bytes         300-500 ps (0.3-0.5 ns)
      L1 and L2 cache   10s-100s KBytes    ~1 ns to ~10 ns           $1000s / GByte
      Main memory       GBytes             80-200 ns                 ~$100 / GByte
      Disk              10s TBytes         10 ms (10,000,000 ns)     ~$1 / GByte
      Tape              infinite           sec-min                   ~$1 / GByte
      Transfers between adjacent levels (staging unit, managed by, transfer size):
      • Registers and L1 cache: instruction operands, program/compiler, 1-8 bytes
      • L1 cache and L2 cache: blocks, cache controller, 32-64 bytes
      • L2 cache and main memory: blocks, cache controller, 64-128 bytes
      • Main memory and disk: pages, OS, 4K-8K bytes
      • Disk and tape: files, user/operator, MBytes

  22. 3. Focus on the Common Case • In making a design trade-off, favor the frequent case over the infrequent case • E.g., Instruction fetch and decode unit used more frequently than multiplier, so optimize it 1st • E.g., If database server has 50 disks / processor, storage dependability dominates system dependability, so optimize it 1st • Frequent case is often simpler and can be done faster than the infrequent case • E.g., overflow is rare when adding 2 numbers, so improve performance by optimizing more common case of no overflow • May slow down overflow, but overall performance improved by optimizing for the normal case • What is frequent case and how much performance improved by making case faster => Amdahl’s Law ACA H.Corporaal

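  23. Amdahl’s Law
      $$\text{ExTime}_{new} = \text{ExTime}_{old} \times \left[ (1 - \text{Fraction}_{enhanced}) + \frac{\text{Fraction}_{enhanced}}{\text{Speedup}_{enhanced}} \right]$$
      $$\text{Speedup}_{overall} = \frac{\text{ExTime}_{old}}{\text{ExTime}_{new}} = \frac{1}{(1 - \text{Fraction}_{enhanced}) + \dfrac{\text{Fraction}_{enhanced}}{\text{Speedup}_{enhanced}}}$$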

  24. Amdahl’s Law
      • Floating point instructions are improved to run 2 times faster, but only 10% of actual instructions are FP
      $\text{ExTime}_{new} = \;?$
      $\text{Speedup}_{overall} = \;?$

  25. Amdahl’s Law
      • Floating point instructions are improved to run 2 times faster, but only 10% of actual instructions are FP
      $$\text{ExTime}_{new} = \text{ExTime}_{old} \times (0.9 + 0.1/2) = 0.95 \times \text{ExTime}_{old}$$
      $$\text{Speedup}_{overall} = \frac{1}{0.95} = 1.053$$
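
The floating point example above can be checked numerically. The sketch below is a direct transcription of Amdahl's Law; the function and parameter names are illustrative.

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup when only a fraction of the execution time is enhanced."""
    return 1.0 / ((1.0 - fraction_enhanced)
                  + fraction_enhanced / speedup_enhanced)


# The slide's case: 10% of the work runs 2x faster.
fp_case = amdahl_speedup(0.10, 2.0)

# Upper bound for the same fraction, approximated with a huge enhancement.
fp_limit = amdahl_speedup(0.10, 1e12)

print(f"FP 2x faster:       {fp_case:.3f}")
print(f"FP infinitely fast: {fp_limit:.3f}")
```

Even an arbitrarily fast floating point unit cannot push the overall speedup beyond 1 / 0.9, because the other 90% of the work is unchanged.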

  26. 4. The performance equation
      • Main performance metric: total execution time
      • $T_{exec} = N_{cycles} \times T_{cycle} = N_{instructions} \times CPI \times T_{cycle}$
      • CPI: Cycles Per Instruction

  27. Example: Calculating CPI
      Base machine (Reg / Reg), typical mix:
      Op       Freq   Cycles   CPI(i)   (% Time)
      ALU      50%    1        0.5      (33%)
      Load     20%    2        0.4      (27%)
      Store    10%    2        0.2      (13%)
      Branch   20%    2        0.4      (27%)
      Total                    1.5
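
The CPI table above is a weighted sum of per-class cycle counts. The sketch below recomputes it and then plugs the result into the performance equation; the instruction count and clock rate are assumed values for illustration.

```python
# (frequency, cycles) per instruction class, taken from the table above
mix = {
    "ALU":    (0.50, 1),
    "Load":   (0.20, 2),
    "Store":  (0.10, 2),
    "Branch": (0.20, 2),
}

# Overall CPI is the frequency-weighted sum of cycles per class.
cpi = sum(freq * cycles for freq, cycles in mix.values())

# Share of execution time spent in each class.
time_share = {op: freq * cycles / cpi for op, (freq, cycles) in mix.items()}

# Performance equation with assumed instruction count and clock rate.
n_instructions = 1_000_000   # assumption
clock_hz = 500e6             # assumption: 500 MHz
t_exec = n_instructions * cpi / clock_hz

print(f"CPI = {cpi:.2f}")
for op, share in time_share.items():
    print(f"{op:7s} {share:.0%} of time")
print(f"T_exec = {t_exec * 1e3:.2f} ms")
```

Note that ALU instructions are half of the mix but only a third of the time, since every other class takes two cycles.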

  28. Measurement Tools • Benchmarks, Traces, Mixes • Hardware: Cost, delay, area, power estimation • Simulation (many levels) • ISA, RT, Gate, Circuit level • Queuing Theory (analytic models) • Rules of Thumb • Fundamental “Laws”/Principles ACA H.Corporaal

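  29. Aspects of CPU Performance
      $$\text{CPU time} = \frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}$$
      • Terms of the equation: instruction count, CPI, clock rate
      • Design levels that influence them: program, compiler, instruction set, organization, technology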

  30. Aspects of CPU Performance
      $$\text{CPU time} = \frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}$$
                      Inst Count   CPI   Clock Rate
      Program         X            X
      Compiler        X            X
      Inst. Set       X            X
      Organization                 X     X
      Technology                         X

  31. Marketing Metrics
      • MIPS = Instruction Count / (Time × 10^6) = Clock Rate / (CPI × 10^6)
        • Not effective for machines with different instruction sets
        • Not effective for programs with different instruction mixes
        • Uncorrelated with performance
      • MFLOPS = FP Operations / (Time × 10^6)
        • Machine dependent
        • Often not where time is spent
      • Peak: maximum able to achieve
      • Native: average for a set of benchmarks
      • Relative: compared to another platform
      • Normalized MFLOPS (operation weights):
        • add, sub, compare, mult: 1
        • divide, sqrt: 4
        • exp, sin, ...: 8
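
A small numerical case shows why MIPS can mislead across instruction sets. In the sketch below, two hypothetical machines run the same program at the same clock rate; all instruction counts and CPI values are invented for illustration.

```python
def exec_time(instr_count, cpi, clock_hz):
    """Execution time from the performance equation."""
    return instr_count * cpi / clock_hz


def mips(clock_hz, cpi):
    """Native MIPS rating: clock rate / (CPI * 10^6)."""
    return clock_hz / (cpi * 1e6)


clock = 100e6  # both hypothetical machines run at 100 MHz

# Machine A: simple instructions, so more of them, each taking 1 cycle.
time_a, mips_a = exec_time(10e6, 1.0, clock), mips(clock, 1.0)

# Machine B: richer instructions, so half as many, each taking 1.5 cycles.
time_b, mips_b = exec_time(5e6, 1.5, clock), mips(clock, 1.5)

print(f"A: {mips_a:.1f} MIPS, {time_a * 1e3:.0f} ms")
print(f"B: {mips_b:.1f} MIPS, {time_b * 1e3:.0f} ms")
```

Machine B has the lower MIPS rating yet finishes the program sooner, because MIPS ignores how much work each instruction does.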

  32. Programs to Evaluate Processor Performance • (Toy) Benchmarks • 10-100 line program • e.g.: sieve, puzzle, quicksort • Synthetic Benchmarks • Attempt to match average frequencies of real workloads • e.g., Whetstone, dhrystone • Kernels • Time critical excerpts • Real Benchmarks ACA H.Corporaal

  33. Benchmarks • Benchmark mistakes • Only average behavior represented in test workload • Loading level controlled inappropriately • Caching effects ignored • Ignoring monitoring overhead • Not ensuring same initial conditions • Collecting too much data but doing too little analysis • Benchmark tricks • Compiler wired to optimize the workload • Very small benchmarks used • Benchmarks manually translated to optimize performance ACA H.Corporaal

  34. SPEC benchmarks
      • CPU: CPU2006
      • Graphics: SPECviewperf9 and others
      • HPC/OMP: HPC2002, OMP2001, MPI2006
      • Java Client/Server: jAppServer2004
      • Mail Servers: MAIL2001
      • Network File System: SFS97_R1
      • Power (under development)
      • Web Servers: WEB2005

  35. ACA H.Corporaal

  36. How to Summarize Performance
      • Arithmetic mean (or weighted arithmetic mean) tracks execution time: $\frac{1}{n}\sum_i T_i$ or $\sum_i W_i T_i$
      • Normalized execution time is handy for scaling performance (e.g., X times faster than SPARCstation 10)
      • Do not take the arithmetic mean of normalized execution times; use the geometric mean instead: $\left(\prod_i \text{ratio}_i\right)^{1/n}$
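
The warning about averaging normalized times can be seen with two machines and two programs. The execution times below are invented for illustration; `math.prod` requires Python 3.8 or later.

```python
import math


def arithmetic_mean(values):
    return sum(values) / len(values)


def geometric_mean(values):
    return math.prod(values) ** (1.0 / len(values))


# Hypothetical execution times (seconds) for two programs on two machines.
times_a = [1.0, 1000.0]
times_b = [10.0, 100.0]

# Normalize B to A, then A to B.
b_rel_a = [b / a for a, b in zip(times_a, times_b)]
a_rel_b = [a / b for a, b in zip(times_a, times_b)]

print("arithmetic mean, B relative to A:", arithmetic_mean(b_rel_a))
print("arithmetic mean, A relative to B:", arithmetic_mean(a_rel_b))
print("geometric mean,  B relative to A:", geometric_mean(b_rel_a))
print("geometric mean,  A relative to B:", geometric_mean(a_rel_b))
```

The arithmetic mean of the ratios makes each machine look slower than the other, depending on which one is the reference. The two geometric means are reciprocals of each other, so the comparison does not depend on the choice of reference machine.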

  37. What is Ahead? • Bigger caches, and more levels of cache • Software controllable memory hierarchy • Greater instruction level parallelism • Exploiting data level parallelism • Exploiting task level parallelism: Multiple processors per chip • Bus based • Networks-on-Chip (NoC) • Complete systems on a chip: platforms

  38. Computer Architecture Topics Input/Output and Storage Disks, WORM, Tape RAID Emerging Technologies Interleaving Bus protocols DRAM Coherence, Bandwidth, Latency Memory Hierarchy L2 Cache L1 Cache Addressing, Protection, Exception Handling VLSI Instruction Set Architecture Pipelining and Instruction Level Parallelism Pipelining, Hazard Resolution, Superscalar, Reordering, Prediction, Speculation, Vector, DSP ACA H.Corporaal

  39. Computer Architecture Topics Shared Memory, Message Passing, Data Parallelism P M P M P M P M ° ° ° Network Interfaces S Interconnection Network Processor-Memory-Switch Topologies, Routing, Bandwidth, Latency, Reliability Multiprocessors Networks and Interconnections ACA H.Corporaal

  40. AMD dual-core Opteron (64-bit) ACA H.Corporaal

  41. ACA H.Corporaal

  42. Intel 80 processor die ACA H.Corporaal
