Topics 8: Advances in Parallel Computer Architectures

Presentation Transcript


  1. Topics 8: Advances in Parallel Computer Architectures

  2. Reading List • Slides: Topic8x

  3. Why Study Parallel Architecture? • Role of a computer architect: • To design and engineer the various levels of a computer system to maximize performance and programmability within the limits of technology and cost. • Parallelism: • Provides an alternative to a faster clock for performance • Applies at all levels of system design • Is a fascinating perspective from which to view architecture • Is increasingly central in information processing

  4. Inevitability of Parallel Computing • Application demands • Technology trends • Architecture trends • Economics

  5. Application Trends • Demand for cycles fuels advances in hardware, and vice versa • Range of performance demands • Goal of applications in using parallel machines: Speedup • Productivity requirement
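The slide names speedup as the goal without defining it. As a reminder of the standard definitions (textbook material, not spelled out on the slide), for a fixed problem run on p processors:

\[ S(p) = \frac{T(1)}{T(p)}, \qquad E(p) = \frac{S(p)}{p}, \]

and if a fraction s of the work is inherently serial, Amdahl's law bounds the achievable speedup:

\[ S(p) \le \frac{1}{s + (1-s)/p} \le \frac{1}{s}. \]

For example, with s = 0.05 no machine, however large, can deliver more than a 20x speedup on that workload.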

  6. Summary of Application Trends • Transition to parallel computing has occurred for scientific and engineering computing • Rapid progress in commercial computing • Desktops also run multithreaded programs, which are a lot like parallel programs • Demand for improving throughput on sequential workloads • Demand for productivity
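To make the point that desktop multithreaded programs look a lot like parallel programs, here is a minimal POSIX-threads sketch in C (mine, not from the slides; the array size and thread count are arbitrary choices): each thread sums its own slice of an array, and the main thread joins the workers and combines the partial results, exactly the fork/join structure of an explicitly parallel program.

#include <pthread.h>
#include <stdio.h>

#define N 1000000
#define NTHREADS 4

static double data[N];
static double partial[NTHREADS];

static void *worker(void *arg) {
    long id = (long)arg;                      /* thread index 0..NTHREADS-1 */
    long chunk = N / NTHREADS;
    long lo = id * chunk;
    long hi = (id == NTHREADS - 1) ? N : lo + chunk;
    double s = 0.0;
    for (long i = lo; i < hi; i++)
        s += data[i];
    partial[id] = s;                          /* one slot per thread: no data race */
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    for (long i = 0; i < N; i++)
        data[i] = 1.0;
    for (long id = 0; id < NTHREADS; id++)    /* fork: same structure as a parallel program */
        pthread_create(&t[id], NULL, worker, (void *)id);
    double total = 0.0;
    for (long id = 0; id < NTHREADS; id++) {  /* join, then combine partial sums */
        pthread_join(t[id], NULL);
        total += partial[id];
    }
    printf("sum = %.0f (expected %d)\n", total, N);
    return 0;
}

Compile with cc -pthread; replacing the per-element summation with any independent work keeps the same structure.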

  7. Technology: A Closer Look • [Figure: processor, cache ($), and interconnect] • Basic advance is decreasing feature size (λ) • Clock rate improves roughly in proportion to the improvement in λ • Number of transistors improves like λ² (or faster) • Performance > 100x per decade; clock rate ~10x, the rest from transistor count • How to use more transistors? • Parallelism in processing • Locality in data access • Both need resources, so there is a tradeoff
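One way to read these scaling claims (my paraphrase; the λ² relation and the decade arithmetic are standard reasoning, not spelled out verbatim on the slide): with feature size λ,

\[ f_{\text{clock}} \propto \frac{1}{\lambda}, \qquad N_{\text{transistors}} \propto \frac{1}{\lambda^{2}}, \]

so halving λ roughly doubles the clock and quadruples the transistor budget. Compounding the slide's two contributions over a decade, about 10x from clock rate times another ~10x extracted from the extra transistors through parallelism and locality, gives the quoted >100x:

\[ 10 \times 10 = 100. \]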

  8. Clock Frequency Growth Rate • 30% per year
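A quick compounding check (my arithmetic, not on the slide): 30% per year sustained over a decade is

\[ 1.3^{10} \approx 13.8, \]

which is roughly the ~10x clock-rate contribution per decade cited on the previous slide.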

  9. Transistor Count Growth Rate • 1 billion transistors on a chip in the early 2000s • Transistor count grows much faster than clock rate: about 40% per year, an order of magnitude more contribution over two decades

  10. Similar Story for Storage • Divergence between memory capacity and speed more pronounced • Larger memories are slower • Need deeper cache hierarchies • Parallelism and locality within memory systems • Disks too: parallel disks plus caching
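A tiny C sketch of the locality point (mine, not from the slides; the matrix size is an arbitrary choice larger than typical caches): both functions compute the same sum, but the row-order walk streams through whole cache lines while the column-order walk strides across them, so on a typical cache hierarchy it runs markedly slower.

#include <stdio.h>

#define N 1024
static double a[N][N];          /* ~8 MB, larger than typical L1/L2 caches */

/* Row-order walk: consecutive accesses fall in the same cache line. */
static double sum_row_order(void) {
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

/* Column-order walk: stride of N doubles, so almost every access
   touches a different cache line. Same result, worse locality. */
static double sum_col_order(void) {
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i][j];
    return s;
}

int main(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] = 1.0;
    printf("%.0f %.0f\n", sum_row_order(), sum_col_order());
    return 0;
}

Timing the two calls makes the gap visible on most machines.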

  11. Moore’s Law and Headcount • Along with the number of transistors, the effort and headcount required to design a microprocessor have grown exponentially

  12. Architectural Trends • Architecture: performance and capability • Tradeoff between parallelism and locality • Current microprocessor: 1/3 compute, 1/3 cache, 1/3 off-chip connect • Understanding microprocessor architectural trends • Four generations of architectural history: tube, transistor, IC, VLSI

  13. Technology Progress Overview • Processor speed improvement: 2x per year (since ’85); 100x in the last decade • DRAM memory capacity: 2x in 2 years (since ’96); 64x in the last decade • Disk capacity: 2x per year (since ’97); 250x in the last decade

  14. Classes of Parallel Architecture for High Performance Computers (Courtesy of Thomas Sterling) • Parallel Vector Processors (PVP) • NEC Earth Simulator, SX-6 • Cray-1, 2, XMP, YMP, C90, T90, X1 • Fujitsu 5000 series • Massively Parallel Processors (MPP) • Intel Touchstone Delta & Paragon • TMC CM-5 • IBM SP-2 & 3, Blue Gene/Light • Cray T3D, T3E, Red Storm/Strider • Distributed Shared Memory (DSM) • SGI Origin • HP Superdome • Single Instruction stream, Multiple Data streams (SIMD) • Goodyear MPP, MasPar 1 & 2, TMC CM-2 • Commodity Clusters • Beowulf-class PC/Linux clusters • Constellations • HP Compaq SC, Linux NetworX MCR

  15. What have we learned in the last two decades? Building a “good” general-purpose parallel machine is very hard! Proof by contradiction: so many companies went bankrupt in the past decade!

  16. A Growth Factor of a Billion in Performance in a Single Lifetime (Courtesy of Thomas Sterling) • Performance scale: 1 (one OPS), 10^3 (KiloOPS), 10^6 (MegaOPS), 10^9 (GigaOPS), 10^12 (TeraOPS), 10^15 (PetaOPS) • Milestones: 1823 Babbage Difference Engine • 1943 Harvard Mark 1 • 1949 Edsac • 1951 Univac 1 • 1959 IBM 7094 • 1964 CDC 6600 • 1976 Cray 1 • 1982 Cray XMP • 1988 Cray YMP • 1991 Intel Delta • 1996 T3E • 1997 ASCI Red • 2001 Earth Simulator • 2003 Cray X1

  17. Application Demands vs. System Performance (Courtesy of Erik P. DeBenedictis 2004) • 1 Zettaflops: plasma fusion simulation [Jardin 03] (no schedule provided by source) • 100 Exaflops: full global climate [Malone 03] • 10 Exaflops: geodata Earth station range [NASA 02] • 1 Exaflops: compute as fast as the engineer can think [NASA 99] • 100 Petaflops: protein folding; simulation of large biomolecular structures (ms scale) • 1 Petaflops: simulation of medium biomolecular structures (µs scale) • 100 Teraflops: 100x-1000x [SCaLeS 03] • Simulation of more complex biomolecular structures [HEC04] • Chart axis: system performance over 2000-2020, with milestones at 50 TFLOPS, 250 TFLOPS, and 1 PFLOPS • References: [Jardin 03] S. C. Jardin, “Plasma Science Contribution to the SCaLeS Report,” Princeton Plasma Physics Laboratory, PPPL-3879 UC-70, available on the Internet. [Malone 03] Robert C. Malone, John B. Drake, Philip W. Jones, Douglas A. Rotman, “High-End Computing in Climate Modeling,” contribution to the SCaLeS report. [NASA 99] R. T. Biedron, P. Mehrotra, M. L. Nelson, F. S. Preston, J. J. Rehder, J. L. Rogers, D. H. Rudy, J. Sobieski, and O. O. Storaasli, “Compute as Fast as the Engineers Can Think!,” NASA/TM-1999-209715, available on the Internet. [NASA 02] NASA Goddard Space Flight Center, “Advanced Weather Prediction Technologies: NASA’s Contribution to the Operational Agencies,” available on the Internet. [SCaLeS 03] Workshop on the Science Case for Large-scale Simulation, June 24-25, proceedings at http://www.pnl.gov/scales/. [DeBenedictis 04] Erik P. DeBenedictis, “Matching Supercomputing to Progress in Science,” July 2004, presentation at Lawrence Berkeley National Laboratory, also published as Sandia National Laboratories SAND report SAND2004-3333P; Sandia technical reports are available via the technical library at http://www.sandia.gov. [HEC04] Federal Plan for High-End Computing, May 2004.

  18. Multi-core Technology Is Becoming Mainstream • IBM: Power, CELL; AMD: Opteron; Intel, RMI, ClearSpeed • Unprecedented peak performance • Significantly reduces hardware cost with much lower power consumption and heat • Greatly expands the spectrum of application domains • “It is likely that 2005 will be viewed as the year that parallelism came to the masses, with multiple vendors shipping dual/multi-core platforms into the mainstream consumer and enterprise markets.” - Intel Fellow Justin Rattner, IEEE PACT keynote speech (Sept 19, 2005)

  19. IBM Power5 Multicore Chip • Technology: 130nm lithography, Cu, SOI • Dual processor core • 8-way superscalar • Simultaneous multithreaded (SMT) core • Up to 2 virtual processors per real processor • 24% area growth per core for SMT • Natural extension to POWER4 design • Courtesy of “Simultaneous Multi-threading Implementation in POWER5 -- IBM's Next Generation POWER Microprocessor” by Ron Kalla, Balaram Sinharoy, and Joel Tendler of IBM Systems Group

  20. Quad AMD Opteron™ System • [Block diagram; labels include: 4x AMD Opteron™ 940 mPGA, each with 200-333 MHz 9-byte registered DDR (8 GB DRAM); AMD-8111™ I/O hub with VGA/PCI graphics, legacy PCI, FLASH, LPC SIO, management SPI 3.0 interface, 100BaseT management LAN, USB 1.0/2.0, AC97, UDMA133, 10/100 Ethernet; SSL encryption; TCP/IP offload engine; modular array ASIC with 10/100 PHY and GMII to OC-12 or 802.3 GigE NIC]

  21. ARM MPCore Architecture (Courtesy of linuxdevice.com)

  22. ClearSpeed CSX600 • 250 MHz clock • 96 high-performance processing elements • 576 Kbytes PE memory • 128 Kbytes on-chip scratchpad memory • 25,000 MIPS • 50 GFLOPS single or double precision • 3.2 Gbytes/s external memory bandwidth • 96 Gbytes/s internal memory bandwidth • 2 x 4 Gbytes/s chip-to-chip bandwidth • Courtesy of the CSX600 Overview at http://www.clearspeed.com/
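A back-of-the-envelope check of the quoted peak (the two-FLOPs-per-PE-per-cycle figure is my assumption; the slide does not state it):

\[ 96\ \text{PEs} \times 250\ \text{MHz} \times 2\ \tfrac{\text{FLOP}}{\text{PE}\cdot\text{cycle}} = 48\ \text{GFLOPS} \approx 50\ \text{GFLOPS}. \]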

  29. A Case Study: The IBM Cyclops-64 Architecture (Architect: Monty Denneau) • Chip building blocks: thread units, FPUs, and SRAM banks connected by an intra-chip network, plus I-cache, external memory, input/output, and communication ports for the 3D-mesh inter-chip network • “Processor”: 1 Gflop/s, 64 KB SRAM • “Chip”: 80 Gflop/s, 1 GB memory, bisection BW 4 TB/s • “Board”: 320 Gflop/s, 4 GB memory • “Rack”: 15.4 Tflop/s, 192 GB memory • “System”: 1.1 Pflop/s, 13.5 TB memory
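The packaging hierarchy multiplies out from the slide's own figures; the per-level counts used below (80 processors per chip, 4 chips per board, 48 boards per rack, roughly 72 racks) are the ratios those figures imply, not numbers stated on the slide:

\[ 1\ \text{Gflop/s} \times 80 = 80\ \text{Gflop/s (chip)}, \qquad 80 \times 4 = 320\ \text{Gflop/s (board)}, \]
\[ 320 \times 48 \approx 15.4\ \text{Tflop/s (rack)}, \qquad 15.4 \times 72 \approx 1.1\ \text{Pflop/s (system)}. \]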

  30. Data Points of a 1 Petaflop C64 Machine • Cyclops chip: 533 MHz, 5.1 MB SRAM, 1-2 GB DRAM • Disk space: 300 GB/node • Total system power: 2 MW (chilled-water cooling) • Size: 20’ x 48’ • Mean time to failure: 2 weeks • Cost: 20 million (?)

  31. A Cyclops-64 Rack

  32. C-64 Chip Architecture • On-chip bisection BW = 0.38 TB/s; total BW to the 6 neighbours = 48 GB/s
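Splitting the quoted aggregate evenly across the six mesh neighbours (an even split is my assumption) gives the per-link figure:

\[ 48\ \text{GB/s} \div 6 = 8\ \text{GB/s per neighbour}. \]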

  33. Mrs. Clops

  34. Summary
