
Future Computer Advances are Between a Rock (Slow Memory) and a Hard Place (Multithreading)

Mark D. Hill, Computer Sciences Dept. and Electrical & Computer Engineering Dept., University of Wisconsin–Madison. Multifacet Project (www.cs.wisc.edu/multifacet). October 2004.



Presentation Transcript


  1. Future Computer Advances are Between a Rock (Slow Memory) and a Hard Place (Multithreading) • Mark D. Hill • Computer Sciences Dept. and Electrical & Computer Engineering Dept., University of Wisconsin–Madison • Multifacet Project (www.cs.wisc.edu/multifacet) • October 2004 • Full Disclosure: Consult for Sun & US NSF

  2. Executive Summary: Problem • Expect computer performance doubling every 2 years • Derives from Technology & Architecture • Technology will advance for ten or more years • But Architecture faces a Rock: Slow Memory • a.k.a. the Memory Wall [Wulf & McKee 1995] • Prediction: Popular Moore’s Law (doubling performance) will end soon, regardless of the real Moore’s Law (doubling transistors)

  3. Executive Summary: Recommendation • Chip Multiprocessing (CMP) Can Help • Implement multiple processors per chip • >>10x cost-performance for multithreaded workloads • What about software with one apparent thread? • Go to Hard Place: Mainstream Multithreading • Make most workloads flourish with chip multiprocessing • Computer architects can help, but the long run requires moving multithreading from the CS fringe to the center (algorithms, programming languages, …, hardware) • Necessary For Restoring Popular Moore’s Law

  4. Outline • Executive Summary • Background • Moore’s Law • Architecture • Instruction Level Parallelism • Caches • Going Forward Processor Architecture Hits Rock • Chip Multiprocessing to the Rescue? • Go to the Hard Place of Mainstream Multithreading

  5. Society Expects A Popular Moore’s Law • Computing critical: commerce, education, engineering, entertainment, government, medicine, science, … • Servers (> PCs) • Clients (= PCs) • Embedded (< PCs) • Come to expect a misnamed “Moore’s Law” • Computer performance doubles every two years (same cost) • ⇒ Progress in next two years = All past progress • Important Corollary • Computer cost halves every two years (same performance) • ⇒ In ten years, same performance for 3% (sales tax – Jim Gray) • Derives from Technology & Architecture
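
The arithmetic on slide 5 can be checked with a short sketch. It assumes an idealized, exact doubling every two years (the slide's premise, not a measured trend), and the function names are illustrative only.

```python
def relative_performance(years, doubling_period=2.0):
    """Performance relative to today after `years`, at the same cost."""
    return 2.0 ** (years / doubling_period)


def relative_cost(years, halving_period=2.0):
    """Cost relative to today after `years`, at the same performance."""
    return 0.5 ** (years / halving_period)


# Ten years is five halvings: 1/32 of today's cost, roughly the slide's "3%".
print(f"{relative_cost(10):.1%}")

# One more doubling takes performance from p to 2p, a gain of p, which
# equals everything achieved up to now (taking the starting point as ~0).
gain_next_period = relative_performance(2) - relative_performance(0)
print(gain_next_period)
```

Under this assumption, each doubling period contributes as much absolute progress as all earlier periods combined, which is why an end to the doubling would be so visible.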

  6. (Technologist’s) Moore’s Law Provides Transistors • Number of transistors per chip doubles every two years (18 months) • Merely a “Law” of Business Psychology

  7. Performance from Technology & Architecture • [Figure] Reprinted from Hennessy and Patterson, “Computer Architecture: A Quantitative Approach,” 3rd Edition, 2003, Morgan Kaufmann Publishers.

  8. Architects Use Transistors To Compute Faster • Bit Level Parallelism (BLP) within Instructions • Instruction Level Parallelism (ILP) among Instructions • Scores of speculative instructions look sequential! • [Figures: instructions vs. time]

  9. Architects Use Transistors To Tolerate Slow Memory • Cache • Small, Fast Memory • Holds information (expected) to be used soon • Mostly Successful • Apply Recursively • Level-one cache(s) • Level-two cache • Most of microprocessor die area is cache!
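
The "Apply Recursively" bullet on slide 9 can be illustrated with the standard textbook average-access-time calculation, which is not taken from the talk itself. The hit times, miss rates, and memory latency below are hypothetical values chosen only to show the shape of the calculation.

```python
def average_access_time(levels, memory_latency):
    """Average access time, in cycles, for a cache hierarchy.

    `levels` is a list of (hit_time, miss_rate) pairs ordered from the
    level-one cache outward; `memory_latency` is the cost of going off chip.
    """
    time = memory_latency
    # Work inward from memory: each level adds its own hit time and pays
    # the cost of the next level out only on a miss.
    for hit_time, miss_rate in reversed(levels):
        time = hit_time + miss_rate * time
    return time


# Hypothetical numbers, for illustration only.
no_cache = average_access_time([], memory_latency=200)
one_level = average_access_time([(2, 0.05)], memory_latency=200)
two_level = average_access_time([(2, 0.05), (12, 0.20)], memory_latency=200)
print(no_cache, one_level, two_level)
```

With these made-up parameters, adding a second cache level lowers the average further because a level-one miss usually stops at level two instead of going off chip.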

  10. Outline • Executive Summary • Background • Going Forward Processor Architecture Hits Rock • Technology Continues • Slow Memory • Implications • Chip Multiprocessing to the Rescue? • Go to the Hard Place of Mainstream Multithreading

  11. Future Technology Implications • For (at least) ten years, Moore’s Law continues • More repeated doublings of number of transistors per chip • Faster transistors • But hard for processor architects to use • More transistors, due to global wire delays • Faster transistors, due to too much dynamic power • Moreover, hitting a Rock: Slow Memory • Memory access = 100s of floating-point multiplies! • a.k.a. the Memory Wall [Wulf & McKee 1995]

  12. Rock: Memory Gets (Relatively) Slower • [Figure] Reprinted from Hennessy and Patterson, “Computer Architecture: A Quantitative Approach,” 3rd Edition, 2003, Morgan Kaufmann Publishers.

  13. Impact of Slow Memory (Rock) • Off-Chip Misses are now hundreds of cycles • [Figures: instructions (I1–I4, window = 4 (64)) vs. time, alternating Compute Phases and Memory Phases; panels labeled “Good Case!” and “More Realistic Case”]

  14. Implications of Slow Memory (Rock) • Increasing Memory Latency hides Compute Phase • Near Term Implications • Reduce memory latency • Fewer memory accesses • More Memory Level Parallelism (MLP) • Longer Term Implications • What can single-threaded software do while waiting 100 instruction opportunities, 200, 400, … 1000? • What can amazing speculative hardware do?
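
One way to read slide 14 is as a simple phase model: execution alternates between a compute phase and a memory phase, and overlapping misses (more MLP) shortens the memory phase. The sketch below is an illustrative assumption, not the talk's own model, and all cycle counts are made up.

```python
def compute_fraction(compute_cycles, miss_latency, misses_per_phase=1, mlp=1):
    """Fraction of time spent computing when compute and memory phases alternate.

    Misses within one memory phase overlap in groups of `mlp`, so the phase
    lasts ceil(misses_per_phase / mlp) * miss_latency cycles.
    """
    rounds = -(-misses_per_phase // mlp)  # ceiling division
    memory_cycles = rounds * miss_latency
    return compute_cycles / (compute_cycles + memory_cycles)


# As miss latency grows, the compute phase becomes a sliver of total time.
for latency in (100, 200, 400, 1000):
    print(latency, round(compute_fraction(50, latency), 3))

# Four misses served one at a time versus all four overlapped (MLP = 4).
print(round(compute_fraction(50, 400, misses_per_phase=4, mlp=1), 3))
print(round(compute_fraction(50, 400, misses_per_phase=4, mlp=4), 3))
```

In this toy model, raising MLP helps only as long as independent misses are available to overlap, which is the question the slide's "Longer Term Implications" raise for single-threaded software.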

  15. Assessment So Far • Appears • Popular Moore’s Law (doubling performance) will end soon, regardless of the real Moore’s Law (doubling transistors) • Processor performance hitting Rock (Slow Memory) • No known way to overcome this, unless • Redefine performance in Popular Moore’s Law • From Processor Performance • To Chip Performance

  16. Outline • Executive Summary • Background • Going Forward Processor Architecture Hits Rock • Chip Multiprocessing to the Rescue? • Small & Large CMPs • CMP Systems • CMP Workload • Go to the Hard Place of Mainstream Multithreading

  17. Performance for Chip, not Processor or Thread • Chip Multiprocessing (CMP) • Replicate Processor • Private L1 Caches • Low latency • High bandwidth • Shared L2 Cache • Larger than if private

  18. Piranha Processing Node • Alpha core: 1-issue, in-order, 500MHz • [Diagram: one CPU] • Next few slides from Luiz Barroso’s ISCA 2000 presentation of “Piranha: A Scalable Architecture Based on Single-Chip Multiprocessing”

  19. Piranha Processing Node • Alpha core: 1-issue, in-order, 500MHz • L1 caches: I&D, 64KB, 2-way • [Diagram: CPU with I$ and D$]

  20. Piranha Processing Node • Alpha core: 1-issue, in-order, 500MHz • L1 caches: I&D, 64KB, 2-way • Intra-chip switch (ICS): 32GB/sec, 1-cycle delay • [Diagram: eight CPUs, each with I$ and D$, around the ICS]

  21. Piranha Processing Node • Alpha core: 1-issue, in-order, 500MHz • L1 caches: I&D, 64KB, 2-way • Intra-chip switch (ICS): 32GB/sec, 1-cycle delay • L2 cache: shared, 1MB, 8-way • [Diagram: adds eight L2$ banks]

  22. Piranha Processing Node • Alpha core: 1-issue, in-order, 500MHz • L1 caches: I&D, 64KB, 2-way • Intra-chip switch (ICS): 32GB/sec, 1-cycle delay • L2 cache: shared, 1MB, 8-way • Memory Controller (MC): RDRAM, 12.8GB/sec (8 banks @ 1.6GB/sec) • [Diagram: adds eight MEM-CTL blocks]

  23. Piranha Processing Node • Alpha core: 1-issue, in-order, 500MHz • L1 caches: I&D, 64KB, 2-way • Intra-chip switch (ICS): 32GB/sec, 1-cycle delay • L2 cache: shared, 1MB, 8-way • Memory Controller (MC): RDRAM, 12.8GB/sec • Protocol Engines (HE & RE): prog., 1K instr., even/odd interleaving • [Diagram: adds HE and RE]

  24. Piranha Processing Node • Alpha core: 1-issue, in-order, 500MHz • L1 caches: I&D, 64KB, 2-way • Intra-chip switch (ICS): 32GB/sec, 1-cycle delay • L2 cache: shared, 1MB, 8-way • Memory Controller (MC): RDRAM, 12.8GB/sec • Protocol Engines (HE & RE): prog., 1K instr., even/odd interleaving • System Interconnect: 4-port Xbar router, topology independent, 32GB/sec total bandwidth (4 links @ 8GB/s) • [Diagram: adds Router]

  25. Piranha Processing Node • Alpha core: 1-issue, in-order, 500MHz • L1 caches: I&D, 64KB, 2-way • Intra-chip switch (ICS): 32GB/sec, 1-cycle delay • L2 cache: shared, 1MB, 8-way • Memory Controller (MC): RDRAM, 12.8GB/sec • Protocol Engines (HE & RE): prog., 1K instr., even/odd interleaving • System Interconnect: 4-port Xbar router, topology independent, 32GB/sec total bandwidth • [Diagram: complete node on a Single Chip]
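
The aggregate bandwidth figures quoted on the Piranha slides follow directly from the per-bank and per-link numbers shown there. The sketch below only multiplies the slide values; the constant names are illustrative.

```python
# Figures as quoted on the Piranha slides.
MEMORY_BANKS = 8
GB_PER_SEC_PER_BANK = 1.6
ROUTER_LINKS = 4
GB_PER_SEC_PER_LINK = 8

memory_bandwidth = MEMORY_BANKS * GB_PER_SEC_PER_BANK        # RDRAM total
interconnect_bandwidth = ROUTER_LINKS * GB_PER_SEC_PER_LINK  # router total

print(f"memory: {memory_bandwidth:.1f} GB/sec")
print(f"interconnect: {interconnect_bandwidth} GB/sec")
```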

  26. Single-Chip Piranha Performance • Piranha’s performance margin 3x for OLTP and 2.2x for DSS • Piranha has more outstanding misses ⇒ better utilizes memory system

  27. Simultaneous Multithreading (SMT) • Multiplex S logical processors on each processor • Replicate registers, share caches, & manage other parts • Implementation factors keep S small, e.g., 2-4 • Cost-effective gain if threads available • E.g., S=2 ⇒ 1.4x performance • Modest cost • Limits waste if additional logical processor(s) not used • Worthwhile CMP enhancement

  28. Small CMP Systems • Use One CMP (with C cores of S-way SMT) • C=[2,16] & S=[2,4] ⇒ C*S = [4,64] • Size of a small PC! • Directly Connect CMP (C) to Memory Controller (M) or DRAM • [Diagram: CMP (C) and memory controller (M)]

  29. Medium CMP Systems • Use 2-16 CMPs (with C cores of S-way SMT) • Smaller: 2*4*4 = 32 • Larger: 16*16*4 = 1024 • In a single cabinet • Connecting CMPs & Memory Controllers/DRAM & many issues • [Diagrams: “Processor-Centric” and “Dance Hall” arrangements of CMPs (C) and memory controllers (M)]
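
The thread counts on slides 28 and 29 are products of the number of chips, cores per chip (C), and SMT ways per core (S). A minimal sketch, with a hypothetical helper name:

```python
def hardware_threads(cmps, cores_per_cmp, smt_ways):
    """Logical processors in a system of `cmps` chips, each with C cores of S-way SMT."""
    return cmps * cores_per_cmp * smt_ways


# Small systems: one CMP, C in [2, 16], S in [2, 4].
print(hardware_threads(1, 2, 2), hardware_threads(1, 16, 4))    # 4 64

# Medium systems: 2 to 16 CMPs in a single cabinet.
print(hardware_threads(2, 4, 4), hardware_threads(16, 16, 4))   # 32 1024
```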

  30. Inflection Points • Inflection point occurs when • Smooth input change leads to • Disruptive output change • Enough transistors for … • 1970s simple microprocessor • 1980s pipelined RISC • 1990s speculative out-of-order • 2000s … • CMP will be Server Inflection Point • Expect >10x performance for less cost • Implying >>10x cost-performance • Early CMPs like old SMPs but expect dramatic advances!

  31. So What’s Wrong with CMP Picture? • Chip Multiprocessors • Allow profitable use of more transistors • Support modest to vast multithreading • Will be inflection point for commercial servers • But • Many workloads have single thread (available to run) • Even if single thread solves a problem formerly done by many people in parallel (e.g., clerks in payroll processing) • Go to a Hard Place • Make most workloads flourish with CMPs
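
The concern on slide 31 can be quantified with Amdahl's law, a standard formula that the talk itself does not cite. The parallel fractions and thread counts below are hypothetical.

```python
def amdahl_speedup(parallel_fraction, processors):
    """Speedup over one processor when only `parallel_fraction` of the work scales."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)


# A workload with a single runnable thread gains nothing from 64 hardware threads.
print(amdahl_speedup(0.0, 64))

# Even a mostly parallel workload falls well short of 64x.
for fraction in (0.5, 0.9, 0.99):
    print(fraction, round(amdahl_speedup(fraction, 64), 1))
```

Under this model, chip-level performance keeps doubling only if the serial share of mainstream workloads shrinks, which is the motivation for the "Hard Place" that follows.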

  32. Outline • Executive Summary • Background • Going Forward Processor Architecture Hits Rock • Chip Multiprocessing to the Rescue? • Go to the Hard Place of Mainstream Multithreading • Parallel from Fringe to Center • For All of Computer Science!

  33. Thread Parallelism from Fringe to Center • History • Automatic Computer (vs. Human) ⇒ Computer • Digital Computer (vs. Analog) ⇒ Computer • Must Change • Parallel Computer (vs. Sequential) ⇒ Computer • Parallel Algorithm (vs. Sequential) ⇒ Algorithm • Parallel Programming (vs. Sequential) ⇒ Programming • Parallel Library (vs. Sequential) ⇒ Library • Parallel X (vs. Sequential) ⇒ X • Otherwise, repeated performance doublings unlikely

  34. Computer Architects Can Contribute • Chip Multiprocessor Design • Transcend pre-CMP multiprocessor design • Intra-CMP has lower latency & much higher bandwidth • Hide Multithreading (Helper Threads) • Assist Multithreading (Thread-Level Speculation) • Ease Multithreaded Programming (Transactions) • Provide a “Gentle Ramp to Parallelism” (Hennessy)

  35. But All of Computer Science is Needed • Hide Multithreading (Libraries & Compilers) • Assist Multithreading (Development Environments) • Ease Multithreaded Programming (Languages) • Divide & Conquer Multithreaded Complexity (Theory & Abstractions) • Must Enable • 99% of programmers think sequentially while • 99% of instructions execute in parallel • Enable a “Parallelism Superhighway”

  36. Summary • (Single-Threaded) Computing faces a Rock: Slow Memory • Popular Moore’s Law (doubling performance) will end soon • Chip Multiprocessing Can Help • >>10x cost-performance for multithreaded workloads • What about software with one apparent thread? • Go to Hard Place: Mainstream Multithreading • Make most workloads flourish with chip multiprocessing • Computer architects can help, but the long run requires moving multithreading from the CS fringe to the center • Necessary For Restoring Popular Moore’s Law
