Future Computer Advances are Between a Rock (Slow Memory) and a Hard Place (Multithreading) Mark D. Hill Computer Sciences Dept. and Electrical & Computer Engineering Dept. University of Wisconsin—Madison Multifacet Project (www.cs.wisc.edu/multifacet) October 2004 Full Disclosure: Consults for Sun & US NSF
Executive Summary: Problem • Expect computer performance to double every 2 years • Derives from Technology & Architecture • Technology will advance for ten or more years • But Architecture faces a Rock: Slow Memory • a.k.a. the Memory Wall [Wulf & McKee 1995] • Prediction: Popular Moore’s Law (doubling performance) will end soon, regardless of the real Moore’s Law (doubling transistors)
Executive Summary: Recommendation • Chip Multiprocessing (CMP) Can Help • Implement multiple processors per chip • >>10x cost-performance for multithreaded workloads • What about software with one apparent thread? • Go to Hard Place: Mainstream Multithreading • Make most workloads flourish with chip multiprocessing • Computer architects can help, but long run • Requires moving multithreading from CS fringe to center (algorithms, programming languages, …, hardware) • Necessary For Restoring Popular Moore’s Law
Outline • Executive Summary • Background • Moore’s Law • Architecture • Instruction Level Parallelism • Caches • Going Forward: Processor Architecture Hits the Rock • Chip Multiprocessing to the Rescue? • Go to the Hard Place of Mainstream Multithreading
Society Expects A Popular Moore’s Law • Computing is critical: commerce, education, engineering, entertainment, government, medicine, science, … • Servers (> PCs) • Clients (= PCs) • Embedded (< PCs) • We have come to expect a misnamed “Moore’s Law” • Computer performance doubles every two years (same cost) • Progress in the next two years = all past progress • Important Corollary • Computer cost halves every two years (same performance) • In ten years, same performance for 3% (the sales tax – Jim Gray) • Derives from Technology & Architecture
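The cost corollary above is simple compound halving: five halvings in ten years leaves about 3% of the original cost. A quick illustrative sketch (numbers are mine, not from the talk):

```python
# Popular Moore's Law corollary: computer cost halves every two years
# (at constant performance). Ten years = five halvings.
def cost_fraction(years, halving_period=2):
    """Fraction of the original cost remaining after `years`."""
    return 0.5 ** (years / halving_period)

print(f"Cost after 10 years: {cost_fraction(10):.1%} of original")  # ~3.1%
```

This is where Jim Gray's "sales tax" quip comes from: 0.5^5 = 3.125%, roughly a sales-tax rate.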
(Technologist’s) Moore’s Law Provides Transistors • Number of transistors per chip doubles every two years (18 months) • Merely a “Law” of Business Psychology
Performance from Technology & Architecture Reprinted from Hennessy and Patterson, “Computer Architecture: A Quantitative Approach,” 3rd Edition, 2003, Morgan Kaufmann Publishers.
Architects Use Transistors To Compute Faster • Bit Level Parallelism (BLP) within instructions • Instruction Level Parallelism (ILP) among instructions • Scores of speculative instructions look sequential! • [Figure: instructions-vs-time plots showing overlapped execution]
Architects Use Transistors to Tolerate Slow Memory • Cache: small, fast memory • Holds information expected to be used soon • Mostly successful • Apply recursively • Level-one cache(s) • Level-two cache • Most of a microprocessor’s die area is cache!
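The "apply recursively" point is captured by the standard average memory access time (AMAT) model: each cache level's miss penalty is the AMAT of the level below it. A sketch with illustrative latencies and miss rates of my own choosing (not from the talk):

```python
# AMAT = hit_time + miss_rate * miss_penalty, applied recursively:
# the L1 miss penalty is the effective access time of the L2 + DRAM.
def amat(hit_time, miss_rate, miss_penalty):
    return hit_time + miss_rate * miss_penalty

DRAM_LATENCY = 300  # cycles; roughly "100s of floating-point multiplies"
l2_time = amat(hit_time=12, miss_rate=0.10, miss_penalty=DRAM_LATENCY)  # 42.0
l1_time = amat(hit_time=1,  miss_rate=0.05, miss_penalty=l2_time)       # 3.1
print(f"Effective access: {l1_time:.1f} cycles vs. {DRAM_LATENCY} to DRAM")
```

Two cache levels turn a 300-cycle memory into an effective ~3-cycle one, which is why most of the die goes to cache.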
Outline • Executive Summary • Background • Going Forward: Processor Architecture Hits the Rock • Technology Continues • Slow Memory • Implications • Chip Multiprocessing to the Rescue? • Go to the Hard Place of Mainstream Multithreading
Future Technology Implications • For (at least) ten years, Moore’s Law continues • More repeated doublings of the number of transistors per chip • Faster transistors • But hard for processor architects to use • More transistors are hard to use due to global wire delays • Faster transistors are hard to use due to too much dynamic power • Moreover, hitting a Rock: Slow Memory • A memory access = 100s of floating-point multiplies! • a.k.a. the Memory Wall [Wulf & McKee 1995]
Rock: Memory Gets (Relatively) Slower Reprinted from Hennessy and Patterson, “Computer Architecture: A Quantitative Approach,” 3rd Edition, 2003, Morgan Kaufmann Publishers.
Impact of Slow Memory (Rock) • Off-chip misses are now hundreds of cycles • [Figure: instructions-vs-time plots contrasting a good case with the more realistic case, where short compute phases (instruction window = 4, or 64) alternate with long memory phases]
Implications of Slow Memory (Rock) • Increasing memory latency hides the compute phase • Near-term implications • Reduce memory latency • Fewer memory accesses • More Memory Level Parallelism (MLP) • Longer-term implications • What can single-threaded software do while waiting for 100 instruction opportunities, 200, 400, … 1000? • What can amazing speculative hardware do?
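Why MLP is on the near-term list can be seen from a back-of-the-envelope execution-time model: if misses overlap, the stall time divides by the number of concurrent misses. The numbers below are illustrative assumptions of mine, not from the talk:

```python
# Simple model: total time = compute cycles + (misses / MLP) * miss latency.
# MLP = number of off-chip misses outstanding at once.
def exec_time(compute_cycles, misses, miss_latency, mlp):
    return compute_cycles + (misses / mlp) * miss_latency

serial  = exec_time(compute_cycles=1000, misses=50, miss_latency=400, mlp=1)
overlap = exec_time(compute_cycles=1000, misses=50, miss_latency=400, mlp=4)
print(f"Speedup from MLP=4: {serial / overlap:.1f}x")  # 3.5x
```

With memory phases this dominant, overlapping misses buys far more than shaving compute cycles.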
Assessment So Far • It appears that • Popular Moore’s Law (doubling performance) will end soon, regardless of the real Moore’s Law (doubling transistors) • Processor performance is hitting the Rock (Slow Memory) • No known way to overcome this, unless we • Redefine performance in the Popular Moore’s Law • From processor performance • To chip performance
Outline • Executive Summary • Background • Going Forward: Processor Architecture Hits the Rock • Chip Multiprocessing to the Rescue? • Small & Large CMPs • CMP Systems • CMP Workload • Go to the Hard Place of Mainstream Multithreading
Performance for Chip, not Processor or Thread • Chip Multiprocessing (CMP) • Replicate processors • Private L1 caches • Low latency • High bandwidth • Shared L2 cache • Larger than if private
Piranha Processing Node (next few slides from Luiz Barroso’s ISCA 2000 presentation of Piranha: A Scalable Architecture Based on Single-Chip Multiprocessing) • Alpha core: 1-issue, in-order, 500MHz • L1 caches: I & D, 64KB, 2-way • Intra-chip switch (ICS): 32GB/sec, 1-cycle delay • L2 cache: shared, 1MB, 8-way • Memory Controller (MC): RDRAM, 12.8GB/sec (8 banks @ 1.6GB/sec) • Protocol Engines (HE & RE): programmable, 1K instr., even/odd interleaving • System Interconnect: 4-port Xbar router, topology independent, 32GB/sec total bandwidth (4 links @ 8GB/sec) • All on a single chip
Single-Chip Piranha Performance • Piranha’s performance margin: 3x for OLTP and 2.2x for DSS • Piranha sustains more outstanding misses, so it better utilizes the memory system
Simultaneous Multithreading (SMT) • Multiplex S logical processors on each processor • Replicate registers, share caches, & manage other parts • Implementation factors keep S small, e.g., 2-4 • Cost-effective gain if threads are available • E.g., S=2 gives ~1.4x performance • Modest cost • Limits waste if additional logical processor(s) are not used • A worthwhile CMP enhancement
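The cost-effectiveness claim above is about performance per unit of chip area. A minimal sketch, assuming (my assumption, not the talk's figure) that a second SMT context costs roughly 10% extra area:

```python
# SMT cost-effectiveness: throughput gain divided by area growth.
# The 10% area overhead for S=2 is an illustrative assumption.
def perf_per_area(speedup, extra_area_fraction):
    return speedup / (1.0 + extra_area_fraction)

smt2 = perf_per_area(speedup=1.4, extra_area_fraction=0.10)
base = perf_per_area(speedup=1.0, extra_area_fraction=0.0)
print(f"SMT (S=2) perf/area vs. base: {smt2 / base:.2f}x")
```

Whenever the speedup exceeds the area overhead, SMT improves cost-performance, which is why it pairs well with CMP.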
Small CMP Systems • Use one CMP (with C cores of S-way SMT) • C = [2,16] & S = [2,4] → C*S = [4,64] hardware threads • Size of a small PC! • Directly connect the CMP (C) to the Memory Controller (M) or DRAM
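The thread-count arithmetic on this slide can be checked directly, a trivial sketch:

```python
# Hardware threads per chip: C cores, each S-way SMT, gives C*S threads.
cores = range(2, 17)  # C in [2, 16]
smt = (2, 4)          # S in [2, 4]
threads = [c * s for c in cores for s in smt]
print(f"Hardware threads per chip: {min(threads)} to {max(threads)}")  # 4 to 64
```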
Medium CMP Systems • Use 2-16 CMPs (with C cores of S-way SMT) • Smaller: 2*4*4 = 32 threads • Larger: 16*16*4 = 1024 threads • In a single cabinet • Connecting CMPs & memory controllers/DRAM raises many issues • [Figure: processor-centric vs. dance-hall arrangements of CMPs (C) and memory controllers (M)]
Inflection Points • An inflection point occurs when • A smooth input change leads to • A disruptive output change • Enough transistors for … • 1970s: simple microprocessor • 1980s: pipelined RISC • 1990s: speculative out-of-order • 2000s: … • CMP will be the server inflection point • Expect >10x performance for less cost • Implying >>10x cost-performance • Early CMPs look like old SMPs, but expect dramatic advances!
So What’s Wrong with the CMP Picture? • Chip Multiprocessors • Allow profitable use of more transistors • Support modest to vast multithreading • Will be the inflection point for commercial servers • But • Many workloads have only a single thread (available to run) • Even if that single thread solves a problem formerly done by many people in parallel (e.g., clerks in payroll processing) • Go to a Hard Place • Make most workloads flourish with CMPs
Outline • Executive Summary • Background • Going Forward: Processor Architecture Hits the Rock • Chip Multiprocessing to the Rescue? • Go to the Hard Place of Mainstream Multithreading • Parallel from Fringe to Center • For All of Computer Science!
Thread Parallelism from Fringe to Center • History • Automatic (vs. Human) Computer • Digital (vs. Analog) Computer • Must Change • Parallel (vs. Sequential) Computer • Parallel (vs. Sequential) Algorithm • Parallel (vs. Sequential) Programming • Parallel (vs. Sequential) Library • Parallel (vs. Sequential) X • Otherwise, repeated performance doublings are unlikely
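One concrete sense of "Parallel (vs. Sequential) Algorithm": a pairwise tree reduction sums n numbers in O(log n) parallel steps, where the sequential algorithm needs an O(n) chain of additions. An illustrative sketch of mine, not from the talk:

```python
# Tree reduction: each round halves the list; every pair in a round
# could be added by a different thread, so depth is O(log n).
def tree_sum(xs):
    steps = 0
    while len(xs) > 1:
        xs = [xs[i] + xs[i + 1] for i in range(0, len(xs) - 1, 2)] + \
             ([xs[-1]] if len(xs) % 2 else [])
        steps += 1
    return xs[0], steps

total, depth = tree_sum(list(range(8)))
print(total, depth)  # 28 in 3 parallel steps (vs. 7 sequential additions)
```

The same answer, but a different shape of computation: that reshaping is what "parallel algorithm" means, and it has to happen across the whole software stack.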
Computer Architects Can Contribute • Chip Multiprocessor Design • Transcend pre-CMP multiprocessor design • Intra-CMP has lower latency & much higher bandwidth • Hide Multithreading (Helper Threads) • Assist Multithreading (Thread-Level Speculation) • Ease Multithreaded Programming (Transactions) • Provide a “Gentle Ramp to Parallelism” (Hennessy)
But All of Computer Science is Needed • Hide Multithreading (Libraries & Compilers) • Assist Multithreading (Development Environments) • Ease Multithreaded Programming (Languages) • Divide & Conquer Multithreaded Complexity (Theory & Abstractions) • Must Enable • 99% of programmers to think sequentially while • 99% of instructions execute in parallel • Enable a “Parallelism Superhighway”
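The "hide multithreading in libraries" point can be made concrete: the programmer writes what looks like an ordinary sequential map, and the library runs it on a pool of threads underneath. A minimal sketch, not from the talk:

```python
# A sequential-looking map whose work is dispatched to a thread pool;
# the caller never touches locks or thread management.
from concurrent.futures import ThreadPoolExecutor

def work(n):
    return sum(i * i for i in range(n))

def parallel_map(fn, items, workers=4):
    # Reads like map(fn, items); the library hides the multithreading.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fn, items))

print(parallel_map(work, [3, 4, 5]))  # [5, 14, 30]
```

This is the "gentle ramp": 99% of programmers keep the sequential mental model while the library, runtime, and hardware extract the parallelism.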
Summary • (Single-Threaded) Computing faces a Rock: Slow Memory • Popular Moore’s Law (doubling performance) will end soon • Chip Multiprocessing Can Help • >>10x cost-performance for multithreaded workloads • What about software with one apparent thread? • Go to Hard Place: Mainstream Multithreading • Make most workloads flourish with chip multiprocessing • Computer architects can help, but long run • Requires moving multithreading from CS fringe to center • Necessary For Restoring Popular Moore’s Law