
Multicore Computing - Evolution



Presentation Transcript


  1. Multicore Computing - Evolution

  2. Performance Scaling
  [Chart: single-thread performance scaling across Intel generations, from the 8086 through the 286, 386, 486, Pentium®, Pentium® Pro, and Pentium® 4 architectures. Source: Shekhar Borkar, Intel Corp.]

  3. Intel
  • Homogeneous cores
  • Bus-based on-chip interconnect
  • Shared memory
  • Traditional I/O
  • Classic OOO cores: reservation stations, issue ports, schedulers, etc.
  • Large, shared, set-associative caches with prefetching, etc.
  Source: Intel Corp.

  4. IBM Cell Processor
  • Heterogeneous multicore
  • Classic (stripped-down) core plus co-processor accelerators
  • High-speed I/O
  • High bandwidth, multiple buses
  Source: IBM

  5. AMD Au1200 System on Chip
  • Embedded processor
  • Custom cores
  • On-chip buses
  • On-chip I/O
  Source: AMD

  6. PlayStation 2 Die Photo (SoC)
  [Die photo highlighting the floating-point MAC units. Source: IEEE Micro, March/April 2000]

  7. Multi-* is Happening Source: Intel Corp.

  8. Intel’s Roadmap for Multicore
  [Roadmap chart covering mobile, desktop, and enterprise processors from 2006 to 2008: single-core parts with 512 KB–2 MB of cache give way to dual-core parts with 2–16 MB of shared cache, quad-core parts with 4–16 MB of shared cache, and 8-core parts with 12 MB of shared cache at 45 nm. Source: adapted from Tom’s Hardware]
  • Drivers: market segments, more cache, more cores

  9. Distillation Into Trends
  • Technology trends: what can we expect/project?
  • Architecture trends: what are the feasible outcomes?
  • Application trends: what are the driving deployment scenarios? Where are the volumes?

  10. Technology Scaling
  [Transistor cross-section diagrams labeling the gate, drain, source, body, oxide thickness (tox), and channel length (L)]
  • 30% scaling down in dimensions → doubles transistor density
  • Power per transistor: Vdd scaling → lower power
  • Transistor delay = Cgate · Vdd / Isat: Cgate and Vdd scaling → lower delay (a worked scaling example follows below)
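As a back-of-the-envelope companion to this slide, the sketch below works through one generation of idealized constant-field (Dennard) scaling with s = 0.7. The numbers are illustrative relative factors, not process data.

```cpp
#include <cstdio>

int main() {
    // Idealized constant-field (Dennard) scaling: each generation scales
    // linear dimensions and Vdd by s ~ 0.7.  All quantities are relative
    // to the previous generation (1.0 = unchanged).
    const double s = 0.7;

    double area_per_transistor = s * s;   // ~0.49x -> ~2x transistor density
    double c_gate   = s;                  // gate capacitance shrinks with dimensions
    double vdd      = s;                  // supply voltage scaled down
    double i_sat    = s;                  // drive current also scales ~s
    double delay    = c_gate * vdd / i_sat;        // = s -> ~30% faster switching
    double freq     = 1.0 / delay;                 // ~1.4x clock headroom
    double p_dynamic = c_gate * vdd * vdd * freq;  // C*Vdd^2*f = s^2 -> ~0.5x per transistor
    double p_density = p_dynamic / area_per_transistor;  // ~1.0 -> constant power density

    printf("density x%.2f  delay x%.2f  power/transistor x%.2f  power density x%.2f\n",
           1.0 / area_per_transistor, delay, p_dynamic, p_density);
    return 0;
}
```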

  11. Fundamental Trends Source: Shekhar Borkar, Intel Corp.

  12. Moore’s Law
  • How do we use the increasing number of transistors?
  • What are the challenges that must be addressed?
  Source: Intel Corp.

  13. Impact of Moore’s Law To Date
  • Increase frequency → deeper pipelines
  • Increase ILP → concurrent threads, branch prediction, and SMT
  • Push the memory wall → larger caches
  • Manage power → clock gating, activity minimization
  [Illustrated with the IBM Power5. Source: IBM]

  14. Shaping Future Multicore Architectures
  • The ILP wall: limited ILP in applications
  • The frequency wall: not much headroom
  • The power wall: dynamic and static power dissipation
  • The memory wall: gap between compute bandwidth and memory bandwidth
  • Manufacturing: non-recurring engineering costs, time to market

  15. The Frequency Wall
  • Not much headroom left in the stage-to-stage times (currently 8–12 FO4 delays)
  • Increasing frequency leads to the power wall
  Vikas Agarwal, M. S. Hrishikesh, Stephen W. Keckler, and Doug Burger, “Clock rate versus IPC: the end of the road for conventional microarchitectures,” ISCA 2000.

  16. Options
  • Increase performance via parallelism
  • On chip this has been largely at the instruction/data level
  • The 1990s through 2005 was the era of instruction-level parallelism
  • Single instruction, multiple data / vector parallelism: MMX, SSE, vector co-processors
  • Out-of-order (OOO) execution cores
  • Explicitly Parallel Instruction Computing (EPIC)
  • Have we exhausted the options within a thread?

  17. The ILP Wall - Past the Knee of the Curve?
  [Performance vs. design “effort” curve: moving from scalar in-order to moderately pipelined superscalar/OOO made sense (good ROI), but very-deep-pipeline, aggressive superscalar/OOO designs yield very little gain for substantial effort. Source: G. Loh]

  18. The ILP Wall
  • Limiting phenomena for ILP extraction:
  • Clock rate: at the wall, each increase in clock rate has a corresponding CPI increase (branches, other hazards)
  • Instruction fetch and decode: at the wall, more instructions cannot be fetched and decoded per clock cycle
  • Cache hit rate: poor locality can limit ILP and adversely affects memory bandwidth
  • ILP in applications: the serial fraction of applications
  • Reality:
  • Limit studies cap IPC at 100-400 (using an ideal processor)
  • Current processors achieve an IPC of only 1-2

  19. The ILP Wall: Options
  • Increase the granularity of parallelism
  • Simultaneous multithreading (SMT) to exploit TLP: the TLP has to exist → otherwise poor utilization results
  • Coarse-grain multithreading
  • Throughput computing
  • New languages/applications
  • Data-intensive computing in the enterprise
  • Media-rich applications

  20. The Memory Wall
  [Chart: processor performance (“Moore’s Law”) improving ~60%/year vs. DRAM performance improving ~7%/year, so the processor-memory performance gap grows ~50% per year]
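The “grows 50%/year” figure follows from the slide’s own growth rates; a tiny sketch of the arithmetic, assuming the 60%/year and 7%/year rates compound annually:

```cpp
#include <cstdio>

int main() {
    // Slide numbers: processor performance grows ~60%/year, DRAM ~7%/year,
    // so the gap grows roughly 1.60 / 1.07 - 1 ~ 50% per year.
    const double cpu_growth  = 1.60;
    const double dram_growth = 1.07;

    double cpu = 1.0, dram = 1.0;
    for (int year = 1; year <= 10; ++year) {
        cpu  *= cpu_growth;
        dram *= dram_growth;
        printf("year %2d: CPU x%7.1f  DRAM x%4.1f  gap x%6.1f\n",
               year, cpu, dram, cpu / dram);
    }
    return 0;
}
```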

  21. The Memory Wall
  • Increasing the number of cores increases the demanded memory bandwidth
  • What architectural techniques can meet this demand?
  [Chart: average memory access time projected over future years]

  22. The Memory Wall
  • On-die caches are both area-intensive and power-intensive
  • The StrongARM dissipates more than 43% of its power in caches
  • Caches incur huge area costs
  • Larger caches never deliver the near-universal performance boost offered by frequency ramping (Source: Intel)
  [Die photos: AMD dual-core Athlon FX and IBM Power5]

  23. The Power Wall
  • Power per transistor scales with frequency and also with Vdd (the dynamic-power model is sketched below)
  • Lower Vdd can be compensated for with increased pipelining to keep throughput constant
  • Power per transistor is not the same as power per unit area → power density is the problem!
  • Multiple units can be run at lower frequencies to keep throughput constant while saving power
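A minimal numeric sketch of the trade-off behind this slide, using the standard dynamic-power model P ≈ α·C·Vdd²·f and assuming frequency must track Vdd roughly linearly; the 0.85 scaling factor is an illustrative choice, not a figure from the slide.

```cpp
#include <cstdio>

// Standard CMOS dynamic power model: P = alpha * C * Vdd^2 * f
// (alpha = activity factor, C = switched capacitance).
static double dyn_power(double alpha, double c, double vdd, double f) {
    return alpha * c * vdd * vdd * f;
}

int main() {
    // Relative numbers only: baseline core at Vdd = 1.0, f = 1.0.
    double base = dyn_power(1.0, 1.0, 1.0, 1.0);

    // Scale voltage and frequency together by 0.85 (frequency tracks Vdd
    // roughly linearly in this simple model): power falls ~0.85^3 ~ 0.61x
    // while single-thread performance falls only ~0.85x.
    double scaled = dyn_power(1.0, 1.0, 0.85, 0.85);

    printf("power: x%.2f  performance: x%.2f\n", scaled / base, 0.85);
    return 0;
}
```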

  24. Leakage Power Basics
  • Sub-threshold leakage: increases with lower Vth, higher T, and larger W (an illustrative model is sketched below)
  • Gate-oxide leakage: increases with lower Tox and higher W; high-k dielectrics offer a potential solution
  • Reverse-biased pn-junction leakage: very sensitive to T and V (in addition to diffusion area)
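For the sub-threshold term, a hedged sketch of the standard exponential model I_off ∝ exp(−Vth / (n·kT/q)); the slope factor n and the Vth/temperature values below are assumed for illustration and are not device data.

```cpp
#include <cmath>
#include <cstdio>

int main() {
    // Standard sub-threshold model: I_off ~ I0 * exp(-Vth / (n * kT/q)).
    // Illustrative constants only (n ~ 1.5, room temperature), not device data.
    const double kT_q_300K = 0.0259;   // thermal voltage at 300 K, in volts
    const double n = 1.5;              // sub-threshold slope factor (assumed)

    auto leakage = [&](double vth, double kT_q) {
        return std::exp(-vth / (n * kT_q));
    };

    double base = leakage(0.40, kT_q_300K);

    // Lowering Vth by 100 mV raises leakage by roughly a decade...
    printf("Vth 0.40V -> 0.30V: leakage x%.1f\n", leakage(0.30, kT_q_300K) / base);

    // ...and higher temperature (kT/q grows with T) also increases it.
    double kT_q_360K = kT_q_300K * 360.0 / 300.0;
    printf("300K -> 360K at Vth = 0.40V: leakage x%.1f\n",
           leakage(0.40, kT_q_360K) / base);
    return 0;
}
```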

  25. The Current Power Trend
  [Chart: power density (W/cm²) vs. year, 1970-2010, rising from the 4004, 8008, 8080, 8085, 8086, 286, 386, and 486 through the Pentium® and P6 toward reference levels for a hot plate, a nuclear reactor, a rocket nozzle, and the Sun’s surface. Source: Intel Corp.]

  26. Improving Power/Performance
  • Consider a constant die size and decreasing core area each generation = more cores per chip
  • Effect of lowering voltage and frequency → power reduction
  • Increasing cores per chip → performance increase
  • Better power/performance! (a numeric comparison follows below)
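To make the core-count argument concrete, here is a hypothetical comparison (my numbers, not the slide’s) of one full-speed core against two cores run at 0.8× voltage and frequency, again using P ∝ Vdd²·f and assuming a perfectly parallel workload.

```cpp
#include <cstdio>

int main() {
    // Relative dynamic power of one core: P ~ Vdd^2 * f.
    auto power = [](double vdd, double f) { return vdd * vdd * f; };

    double p_single    = power(1.0, 1.0);   // one core at full voltage/frequency
    double perf_single = 1.0;

    // Two cores at 0.8x Vdd and 0.8x f (assumes a perfectly parallel workload).
    double p_dual    = 2.0 * power(0.8, 0.8);   // ~1.02x the single-core power
    double perf_dual = 2.0 * 0.8;               // ~1.6x the single-core throughput

    printf("dual-core: power x%.2f, throughput x%.2f, perf/W x%.2f\n",
           p_dual / p_single, perf_dual / perf_single,
           (perf_dual / p_dual) / (perf_single / p_single));
    return 0;
}
```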

  27. Accelerators
  • Example: TCP/IP offload engine, 2.23 mm x 3.54 mm, 260K transistors
  • Opportunities: network processing engines; MPEG encode/decode engines; speech engines
  Source: Shekhar Borkar, Intel Corp.

  28. Low-Power Design Techniques
  • Circuit- and gate-level methods: voltage scaling, transistor sizing, glitch suppression, pass-transistor logic, pseudo-nMOS logic, multi-threshold gates
  • Functional and architectural methods: clock gating, clock frequency reduction, supply voltage reduction, power down/off
  • Algorithmic and software techniques
  Two decades’ worth of research and development!

  29. The Economics of Manufacturing
  • Where are the costs of developing the next generation of processors?
  • Design costs
  • Manufacturing costs
  • What type of chip-level solutions do the economics imply?
  • Assessing the implications of Moore’s Law is an exercise in mass production

  30. The Cost of an ASIC
  • Cost and risk are rising to unacceptable levels
  • Top cost drivers:
  • Verification (40%)
  • Architecture design (23%)
  • Embedded software design
  • 1400 man-months (SW), 1150 man-months (HW)
  • HW/SW integration
  • Example: a design with 80M transistors in 100 nm technology, estimated cost $85M-$90M, 12-18 months through design, implementation, verification, prototype, and production
  *Handel H. Jones, “How to Slow the Design Cost Spiral,” Electronics Design Chain, September 2002, www.designchain.com

  31. The Spectrum of Architectures
  [Diagram: a spectrum from customization fully in software (software development, compilation) to customization fully in hardware (hardware development, synthesis); moving toward hardware, customization decreases while design NRE, effort, and time to market increase. Points along the spectrum include microprocessors, tiled architectures, polymorphic computing architectures, fixed + variable ISA processors, FPGAs, structured ASICs, and custom ASICs, with examples such as MONARCH, RAW, TRIPS, PACT, PICOChip, Tensilica, Stretch Inc., Xilinx, Altera, LSI Logic, and Leopard Logic.]

  32. Interlocking Trade-offs
  [Diagram: interlocking trade-offs among ILP, frequency, power, and the memory system, linked by memory bandwidth, miss penalties, speculation, dynamic penalties, dynamic power, and leakage power]

  33. Multi-core Architecture Drivers
  • Addressing ILP limits: multiple threads; coarse-grain parallelism → raise the level of abstraction
  • Addressing frequency and power limits: multiple slower cores across technology generations; scaling via increasing the number of cores rather than frequency; heterogeneous cores for improved power/performance
  • Addressing memory system limits: deep, distributed cache hierarchies; OS replication → shared memory remains dominant
  • Addressing manufacturing issues: design and verification costs → replication → the network becomes more important!

  34. Parallelism

  35. Beyond ILP
  • Performance is limited by the serial fraction (a speedup sketch follows below)
  • Coarse-grain parallelism in the post-ILP era: thread, process, and data parallelism
  • Learn from the lessons of the parallel processing community
  • Revisit the classifications and architectural techniques
  [Diagram: the parallelizable portion of a program split across 1, 2, 3, and 4 CPUs]
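The serial-fraction limit on this slide is Amdahl’s law; the sketch below evaluates speedup = 1 / ((1 − p) + p/N) for a few illustrative parallel fractions p (values assumed, not from the slide).

```cpp
#include <cstdio>

// Amdahl's law: with parallel fraction p spread over N CPUs,
// speedup = 1 / ((1 - p) + p / N).
static double amdahl(double p, int n) {
    return 1.0 / ((1.0 - p) + p / n);
}

int main() {
    const double fractions[] = {0.50, 0.90, 0.99};   // illustrative parallel fractions
    const int    cpus[]      = {1, 2, 4, 16, 256};

    for (double p : fractions) {
        printf("p = %.2f:", p);
        for (int n : cpus)
            printf("  N=%-3d -> %5.1fx", n, amdahl(p, n));
        printf("\n");
    }
    return 0;
}
```

Even with 99% of the work parallelizable, 256 CPUs give well under 100x speedup, which is why the remaining slides stress coarse-grain parallelism and the serial fraction.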

  36. Flynn’s Model
  • Flynn’s classification:
  • Single instruction stream, single data stream (SISD): the conventional, word-sequential architecture, including pipelined computers
  • Single instruction stream, multiple data stream (SIMD): the multiple-ALU-type architectures (e.g., array processors)
  • Multiple instruction stream, single data stream (MISD): not very common
  • Multiple instruction stream, multiple data stream (MIMD): the traditional multiprocessor system
  M. J. Flynn, “Very high speed computing systems,” Proc. IEEE, vol. 54, no. 12, pp. 1901-1909, 1966.

  37. SIMD/Vector Computation
  • SIMD and vector models are spatial and temporal analogs of each other (a minimal SIMD sketch follows below)
  • A rich architectural history dating back to 1953!
  [Figures: IBM Cell SPE organization and pipeline diagram (Source: IBM); vector architecture (Source: Cray)]
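As a minimal illustration of the SIMD idea (one instruction operating on several data elements at once), the sketch below contrasts a scalar loop with a hand-vectorized SSE version. It assumes an x86 target with SSE and is only a sketch of the concept, not the Cell SPE ISA shown on the slide.

```cpp
#include <immintrin.h>   // SSE intrinsics (x86)
#include <cstdio>

// Scalar version: one add per loop iteration.
void add_scalar(const float* a, const float* b, float* c, int n) {
    for (int i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}

// SIMD version: one SSE instruction adds four floats at a time
// (n is assumed to be a multiple of 4 to keep the sketch short).
void add_simd(const float* a, const float* b, float* c, int n) {
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(c + i, _mm_add_ps(va, vb));
    }
}

int main() {
    float a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    float b[8] = {8, 7, 6, 5, 4, 3, 2, 1};
    float c[8], d[8];
    add_scalar(a, b, c, 8);
    add_simd(a, b, d, 8);
    for (float x : d) printf("%.0f ", x);   // prints 9 eight times
    printf("\n");
    return 0;
}
```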

  38. SIMD/Vector Architectures
  • VIRAM - Vector IRAM
  • Logic is slow in a DRAM process → put a vector unit in the DRAM and provide a port between a traditional processor and the vector IRAM, instead of putting a whole processor in DRAM
  Source: Berkeley Vector IRAM

  39. MIMD Machines
  • Parallel processing has catalyzed the development of several generations of parallel processing machines
  • Unique features include the interconnection network, support for system-wide synchronization, and programming languages/compilers
  [Diagram: processor + cache nodes, each with a directory and local memory, connected by an interconnection network]

  40. Basic Models for Parallel Programs
  • Shared memory
  • Coherency/consistency are driving concerns
  • The programming model is simplified at the expense of system complexity
  • Message passing
  • Typically implemented on distributed-memory machines
  • System complexity is simplified at the expense of increased effort by the programmer

  41. Shared Memory Model
  • That’s basically it… threads need to be forked/joined and synchronized (typically with locks); a minimal sketch follows below
  [Diagram: CPU0 writes X and CPU1 reads X through a common main memory]
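A minimal shared-memory sketch using C++ std::thread; the variable x stands in for location X in the slide’s diagram, and a mutex provides the lock-based synchronization mentioned. Names and structure are illustrative only.

```cpp
#include <cstdio>
#include <mutex>
#include <thread>

int x = 0;                 // shared location "X": both threads see the same memory
std::mutex x_lock;         // lock protecting x (the "synchronize" step on the slide)

void writer() {            // plays the role of CPU0: Write X
    std::lock_guard<std::mutex> guard(x_lock);
    x = 42;
}

void reader(int* out) {    // plays the role of CPU1: Read X
    std::lock_guard<std::mutex> guard(x_lock);
    *out = x;
}

int main() {
    int seen = -1;
    std::thread t0(writer);          // "fork"
    t0.join();                       // "join" before reading, so the write is visible
    std::thread t1(reader, &seen);
    t1.join();
    printf("reader saw x = %d\n", seen);   // prints 42
    return 0;
}
```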

  42. Message Passing Protocols
  • Explicitly send data from one thread to another
  • Need to track the IDs of other CPUs
  • A broadcast may need multiple sends
  • Each CPU has its own memory space
  • Hardware: send/recv queues between CPUs (sketched below)
  [Diagram: CPU0 and CPU1 connected by send and receive queues]
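A sketch of the send/recv-queue idea in the same C++ setting; here the “queue between CPUs” is just a mutex-protected std::queue shared by two threads, and the Channel type is a made-up illustration (a real message-passing machine would carry the data over an interconnect).

```cpp
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <queue>
#include <thread>

// One-directional "channel" standing in for the send/recv queue between CPUs.
struct Channel {
    std::queue<int> q;
    std::mutex m;
    std::condition_variable cv;

    void send(int msg) {                      // explicit send from one thread...
        { std::lock_guard<std::mutex> g(m); q.push(msg); }
        cv.notify_one();
    }
    int recv() {                              // ...explicit receive in another
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [this] { return !q.empty(); });
        int msg = q.front();
        q.pop();
        return msg;
    }
};

int main() {
    Channel ch;
    std::thread cpu1([&] { printf("CPU1 received %d\n", ch.recv()); });
    std::thread cpu0([&] { ch.send(7); });    // CPU0 sends a message to CPU1
    cpu0.join();
    cpu1.join();
    return 0;
}
```

The blocking recv, which waits on a condition variable until a message arrives, mirrors the hardware receive queue stalling the consumer until data is present.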

  43. Shared Memory vs. Message Passing
  • Shared memory doesn’t scale as well to larger numbers of nodes
  • Communications are broadcast-based
  • The bus becomes a severe bottleneck
  • Message passing doesn’t need a centralized bus
  • Can arrange the multiprocessor like a graph: nodes = CPUs, edges = independent links/routes
  • Can have multiple communications/messages in transit at the same time

  44. Two Emerging Challenges
  • Programming models and compilers?
  • Interconnection networks
  Sources: Intel Corp., IBM
