Multi-Core Processor Technology: Maximizing CPU Performance in a Power-Constrained World

Multi-Core Processor Technology:Maximizing CPU Performance in aPower-Constrained World Paul Teich Business StrategyCPG Server/Workstation AMD paul.teich @ amd.com

The Issues • Silicon designers can choose a variety of methods to increase processor performance • Commercial end-customers are demanding • More capable systems with more capable processors • That new systems stay within their existing power/thermal infrastructure • Processor frequency and power consumption seem to be scaling in lockstep • How can the industry-standard PC and Server industries stay on our historic performance curve without burning a hole in our motherboards? • This session is not about process technology

Session Outline • Definition: What is a processor? • Core Design • System Architecture • Manufacturing, Power, and Thermals • Multi-Core Processor Architecture • Performance Impacts

What is a Processor? • A single chip package that fits in a socket • ≥1 core (not much point in <1 core…) • Cores can have functional units, cache, etc.associated with them, just as today • Cores can be fast or slow, just as today • Shared resources • More cache • Other integration: Northbridge, memory controllers, high-speed serial links, etc. • One system interface no matter how many cores • Number of signal pins doesn’t scale with number of cores

A Representative Multi-Core Processor • Dual-core AMD Opteron™ processor is 199mm2in 90nm • Single-core AMD Opteron processor is 193mm2in 130nm

Multi-Core Processor Architecture

Core Design L1 Icache 64KB Scan/Align/Decode Microcode Engine Fastpath L1 Dcache 64KB FP Decode & Rename 36-entry FP scheduler • Frequency • Is only as good as the rest of the core architecture Fetch Branch Prediction AMD Opteron processor core architecture µops Instruction Control Unit (72 entries) Int Decode & Rename 44-entry Load/Store Queue Res Res Res AGU AGU AGU FADD FMUL FMISC ALU ALU ALU MULT

Core Design • Functional units • Superscalar is known territory • Diminishing returns for adding more functional blocks • Alternatives like VLIW have been considered and rejected by the market • Single-threaded architectural performance is pegged • Data paths • Increasing bandwidth between functional units in a core makes a difference • Such as comprehensive 64-bit design, but then where to?

Core Design • Pipeline • Deeper pipeline buys frequency at expense of increased cache miss penalty and lower instructions per clock • Shallow pipeline gives better instructions per clock at the expense of frequency scaling • Max frequency per core requires deeper pipelines • Industry converging on middle ground…9 to 11 stages • Successful RISC CPUs are in the same range • Cache • Cache size buys performance at expense of die size, it’s a direct hit to manufacturing cost • Deep pipeline cache miss penalties are reduced by larger caches • Not always the best match for shallow pipeline cores, as cache misses penalties are not as steep

Manufacturing • Moore’s Law isn’t dead, more transistors for everyone! • But…it doesn’t really mention scaling transistor power • Chemistry and physics at nano-scale • Stretching materials science • Voltage doesn’t scale yet • Transistor leakage current is increasing • As manufacturing economies and frequency increase, power consumption is increasing disproportionately • There are no process or architectural quick-fixes

Transistors Are Not Free • The number of transistors in a core determines basic power consumption • Architectural efficiency matters a lot when designing new cores • More functional units means more transistors • Deeper pipelines mean more transistors • Larger caches mean more transistors

Static Current vs. Frequency Very High Leakage and Power Embedded Parts Non-linear as processors approach max frequency 15 Static Current Fast, High Power Fast, Low Power 0 Frequency 1.0 1.5

Power vs. Frequency In AMD’s process, for 200MHz frequency steps, two steps back on frequency cuts power consumption by ~40% from maximum frequency (Gross relative numbers summarized from a mountain of real data)

Thermal Density Decreases • Hot spots • Twice as many as in single-core • Farther apart than in single-core • With freq delta, cooler than in single-core • ΘCA same for single-core at n and dual-core at n-2 • Larger die spreads heat more evenly in package • Use identical heat sink, slightly better cooling with dual-core • Works for this processor generation and next, ΘCA changes over major generations • Thermal diode accuracy becomes an issue with dual-core

Total Effect on Dual-Core Frequencies • Substantially lower power with lower frequency • Thermals easier to handle at any frequency • Result is dual-core running at n-2 in same thermal envelope as single-core running at top speed

Multi-Core Processor Architecture • Why integrate? • Most functions are really small compared to the cores and cache • All integrated logic runs at core frequency regardless of I/O speeds • What to integrate? • Northbridge crossbar switch is key • Look for innovation and differentiation in how cores areconnected on-chip • Must integrate Northbridge to integrate anything else… • Memory controller to reduce memory latency and further reduce the need for cache • High-speed serial links for system I/O • What not to integrate? • Most Southbridge functions • Graphics

AMD Opteron Processor Integrated Northbridge CPU 0Data CPU 1Data CPU 0Probes CPU 1Probes CPU 0Requests CPU 1Requests CPU 0Int CPU 1Int SystemRequest Interface(SRI) AdvancedProgrammableInterruptController(APIC) 64-bit Data Crossbar(XBAR) MemoryController(MCT) DRAMController(DCT) 64-bit Command/Address 16-bit Data/Command/Address HyperTransport™ Link 0 HyperTransport Link 1 HyperTransport Link 2 RAS/CAS/Cntl DRAM Data

Multi-Core: Where Processor and System Collide • Scales performance • Dedicated resources for two simultaneous threads • Multiple cores will contend for memory and I/O bandwidth • Northbridge is the bottleneck • Integrating Northbridge eliminates much of bottleneck • Northbridge architecture has significant impact on performance • Cores, cache and Northbridge must be balanced for optimal performance • More aggregate performance for: • Multi-threaded apps • Transactions: many instances of same app • Multi-tasking • Thread scheduling handled by OS • BIOS notifies Windows of thread execution resources

Early Benchmark Estimates • Decoder • 2P/2C – 2 proc. single-core • 4P/4C – 4 proc. single-core • 2P/4C – 2 proc. dual-core • 4P/8C – 4 proc. dual-core • Frequencies • Single-core = 2.4GHz • Dual-core = 2.0GHz • Identical system configs • Memory, disks, network, etc. • Early dual-core validation system used, different motherboards SPEC and the benchmark name SPECint are registered trademarks of the Standard Performance Evaluation Corporation. SPEC scores for AMD Opteron Model 270 and 870 based systems are estimated

Call to Action • Most application software doesn’t need to do anything to benefit from dual-core • Be aware that, for a processor within a given power envelope • Fewer cores will clock faster than more cores • Single-threaded performance-sensitive applications • More cores will out-perform fewer cores for • Multi-threaded applications • Multi-tasking response times • Transaction processing • Processor architecture impacts multi-core performance • Process technology is only the ante • Integration enables a balanced high-performance architecture

Community Resources • Windows Hardware & Driver Central (WHDC) • www.microsoft.com/whdc/default.mspx • Technical Communities • www.microsoft.com/communities/products/default.mspx • Non-Microsoft Community Sites • www.microsoft.com/communities/related/default.mspx • Microsoft Public Newsgroups • www.microsoft.com/communities/newsgroups • Technical Chats and Webcasts • www.microsoft.com/communities/chats/default.mspx • www.microsoft.com/webcasts • Microsoft Blogs • www.microsoft.com/communities/blogs

Additional Resources • Email: paul.teich @ amd.com • WinHEC Presentations • “x86 Everywhere,” Chris Herring, AMD • “Maximizing Desktop Application Performance onDual-Core PC Platforms,” Rich Brunner, AMD • Web Resources • AMD http://www.amd.com/ • AMD Multi-Core http://www.amd.com/multicore/ • AMD Opteron™ Processor http://www.amd.com/opteron/ • AMD Multi-Core White Paper http://enterprise.amd.com/downloadables/33211A_Multi-Core_WP.pdf • HyperTransport™ Consortium http://www.hypertransport.org/

Multi-Core Processor Technology: Maximizing CPU Performance in a Power-Constrained World

Multi-Core Processor Technology: Maximizing CPU Performance in a Power-Constrained World

Presentation Transcript

ESI 4313 Operations Research 2

High Performance Power Plants

Multi-Year Tariffs

Smart Cards Operating Systems

Introduction to Microelectronic Circuits EE40 Lecture 1 Josh Hug

The Writer Fusion Portable Word Processor

Cache

Maximizing Network and Storage Performance for Big Data Analytics

U.S. History E.O.C.T Part IV U.S. Establishment as a World Power

Maximizing Network and Storage Performance for Big Data Analytics

Memory Systems in the Multi-Core Era Lecture 1: DRAM Basics and DRAM Scaling

World War III

Lecture 3 (Complexities of Parallelism)

Applications of Genetic Algorithms to Resource-Constrained Scheduling Tasks

Multi Cycle CPU

PDH/PE

OpenMP

U.S. History E.O.C.T Part IV U.S. Establishment as a World Power

(Mis) Understanding the NUMA Memory System Performance of Multithreaded Workloads

A Survey on Power Management Solutions for Individual Systems and Cloud

CS4100: 計算機結構 Computer Abstractions and Technology