Advanced Microarchitecture

Presentation Transcript


  1. Advanced Microarchitecture Multi-This, Multi-That, …

2. Limits on IPC
• Lam92
  • This paper focused on the impact of control flow on ILP
  • Speculative execution can expose 10-400 IPC
    • assumes no machine limitations except for control dependencies and actual dataflow dependencies
• Wall91
  • This paper looked at limits more broadly
    • No branch prediction, no register renaming, no memory disambiguation: 1-2 IPC
    • ∞-entry bpred, 256 physical registers, perfect memory disambiguation: 4-45 IPC
    • perfect bpred, register renaming and memory disambiguation: 7-60 IPC
  • This paper did not consider “control independent” instructions

3. Practical Limits
• Today, 1-2 IPC sustained
  • far from the 10’s-100’s reported by limit studies
• Limited by:
  • branch prediction accuracy
  • underlying DFG
    • influenced by algorithms, compiler
  • memory bottleneck
  • design complexity
    • implementation, test, validation, manufacturing, etc.
  • power
  • die area

4. Differences Between Real Hardware and Limit Studies?
• Real branch predictors aren’t 100% accurate
• Memory disambiguation is not perfect
• Physical resources are limited
  • can’t have infinite register renaming w/o an infinite PRF
  • need infinite-entry ROB, RS and LSQ
  • need 10’s-100’s of execution units for 10’s-100’s of IPC
• Bandwidth/latencies are limited
  • studies assumed single-cycle execution
  • infinite fetch/commit bandwidth
  • infinite memory bandwidth (perfect caching)

5. Bridging the Gap
• Diminishing returns w.r.t. larger instruction windows, higher issue-width
• Power has been growing exponentially as well
[Figure: Watts per IPC (log scale: 1, 10, 100) for Single-Issue Pipelined, Superscalar Out-of-Order (Today), Superscalar Out-of-Order (Hypothetical-Aggressive), and the Limits]

6. Past the Knee of the Curve?
• Made sense to go Superscalar/OOO: good ROI
• Very little gain for substantial effort
[Figure: Performance vs. “Effort” curve, rising from Scalar In-Order through Moderate-Pipe Superscalar/OOO and flattening at Very-Deep-Pipe Aggressive Superscalar/OOO]

7. So how do we get more Performance?
• Keep pushing IPC and/or frequency?
  • possible, but too costly
  • design complexity (time to market), cooling (cost), power delivery (cost), etc.
• Look for other parallelism
  • ILP/IPC: fine-grained parallelism
  • Multi-programming: coarse-grained parallelism
    • assumes multiple user-visible processing elements
    • all parallelism up to this point was user-invisible

8. User Visible/Invisible
• All microarchitecture performance gains up to this point were “free”
  • free in that no user intervention required beyond buying the new processor/system
  • recompilation/rewriting could provide even more benefit, but you get some even if you do nothing
• Multi-processing pushes the problem of finding the parallelism to above the ISA interface

9. Workload Benefits
[Figure: runtime of Task A and Task B run back-to-back on one 4-wide OOO CPU, vs. run in parallel on two 3-wide OOO CPUs or two 2-wide OOO CPUs; the parallel configurations finish the workload sooner]
• This assumes you have two tasks/programs to execute…

10. … If Only One Task
[Figure: runtime of Task A alone; a 4-wide OOO CPU gives some benefit over a 3-wide OOO CPU; two 3-wide OOO CPUs give no benefit over one CPU (the second sits idle); two 2-wide OOO CPUs give a performance degradation]

11. Sources of (Coarse) Parallelism
• Different applications
  • MP3 player in the background while you work on Office
  • Other background tasks: OS/kernel, virus check, etc.
• Piped applications
  • gunzip -c foo.gz | grep bar | perl some-script.pl
• Within the same application
  • Java (scheduling, GC, etc.)
  • Explicitly coded multi-threading
    • pthreads, MPI, etc.

12. (Execution) Latency vs. Bandwidth
• Desktop processing
  • typically want an application to execute as quickly as possible (minimize latency)
• Server/Enterprise processing
  • often throughput oriented (maximize bandwidth)
  • latency of an individual task is less important
  • ex. Amazon processing thousands of requests per minute: it’s ok if an individual request takes a few seconds more, so long as the total number of requests is processed in time

13. Benefit of MP Depends on Workload
• Limited number of parallel tasks to run on a PC
  • adding more CPUs than tasks provides zero performance benefit
• Even for parallel code, Amdahl’s law will likely result in sub-linear speedup
• In practice, the parallelizable portion may not be evenly divisible
[Figure: a program’s parallelizable portion divided across 1 CPU, 2 CPUs, 3 CPUs, and 4 CPUs]

14. Cache Coherency Protocols
• Not covered in this course
  • You should have seen a bunch of this in CS6290
• Many different protocols
  • different number of states
  • different bandwidth/performance/complexity tradeoffs
• Current protocols are usually referred to by their states
  • ex. MESI, MOESI, etc.

15. Shared Memory Focus
• Most small-medium multi-processors (these days) use some sort of shared memory
  • shared memory doesn’t scale as well to larger numbers of nodes
    • communications are broadcast-based
    • the bus becomes a severe bottleneck
    • or you have to deal with directory-based implementations
  • message passing doesn’t need a centralized bus
    • can arrange a multi-processor like a graph
      • nodes = CPUs, edges = independent links/routes
    • can have multiple communications/messages in transit at the same time

16. SMP Machines
• SMP = Symmetric Multi-Processing
• Symmetric = All CPUs are “equal”
  • Equal = any process can run on any CPU
  • contrast with older parallel systems with a master CPU and multiple worker CPUs
[Figure: four equal CPUs, CPU0-CPU3; pictures found from Google Images]

17. Hardware Modifications for SMP
• Processor
  • mainly support for cache coherence protocols
    • includes caches, write buffers, LSQ
  • control complexity increases, as memory latencies may be substantially more variable
• Motherboard
  • multiple sockets (one per CPU)
  • datapaths between CPUs and the memory controller
• Other
  • Case: larger for a bigger mobo, better airflow
  • Power: bigger power supply for N CPUs
  • Cooling: need to remove N CPUs’ worth of heat

18. Chip-Multiprocessing
• Simple SMP on the same chip
[Figures: Intel “Smithfield” block diagram and AMD dual-core Athlon FX; pictures found from Google Images]

19. Shared Caches
• Resources can be shared between CPUs
  • ex. IBM Power 5
    • L2 cache shared between both CPUs (no need to keep two copies coherent)
    • L3 cache is also shared (only tags are on-chip; data are off-chip)
[Figure: CPU0 and CPU1 sharing an L2 cache, with a shared L3 behind it]

20. Benefits?
• Cheaper than mobo-based SMP
  • all/most interface logic integrated onto the main chip (fewer total chips, single CPU socket, single interface to main memory)
  • less power than mobo-based SMP as well (communication on-die is more power-efficient than chip-to-chip communication)
• Performance
  • on-chip communication is faster
• Efficiency
  • potentially better use of hardware resources than trying to make a wider/more-OOO single-threaded CPU

21. Performance vs. Power
• 2x CPUs not necessarily equal to 2x performance
• 2x CPUs → ½ power for each
  • maybe a little better than ½ if resources can be shared
• Back-of-the-Envelope calculation:
  • 3.8 GHz CPU at 100W
  • Dual-core: 50W per CPU
  • P ∝ V³: V_orig³/V_CMP³ = 100W/50W → V_CMP = 0.8 V_orig
  • f ∝ V: f_CMP = 0.8 × 3.8 GHz ≈ 3.0 GHz
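As a quick sanity check on this slide’s arithmetic, here is a minimal Python sketch assuming the stated scaling models (P proportional to V³, f proportional to V):

```python
# Back-of-the-envelope: split a 100 W, 3.8 GHz core's power budget across
# two cores, assuming P scales with V^3 and f scales with V.

P_orig = 100.0        # watts, single core
f_orig = 3.8          # GHz, single core
P_cmp = P_orig / 2    # 50 W per core in a dual-core

# P ~ V^3, so the voltage ratio is the cube root of the power ratio
v_ratio = (P_cmp / P_orig) ** (1.0 / 3.0)    # ~0.79

# f ~ V, so frequency scales by the same ratio
f_cmp = v_ratio * f_orig                     # ~3.0 GHz
print(f"V_CMP = {v_ratio:.2f} x V_orig, f_CMP = {f_cmp:.1f} GHz")
```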

22. Simultaneous Multi-Threading
• Uni-Processor: 4-6 wide, lucky if you get 1-2 IPC
  • poor utilization
• SMP: 2-4 CPUs, but need independent tasks
  • else poor utilization as well
• SMT: Idea is to use a single large uni-processor as a multi-processor

23. [Figure: hardware cost comparison - a regular CPU, a 2-thread SMT, and a 4-thread SMT each cost approximately 1x the hardware, while a CMP costs 2x]

24. Overview of SMT Hardware Changes
• For an N-way (N threads) SMT, we need:
  • Ability to fetch from N threads
  • N sets of registers (including PCs)
  • N rename tables (RATs)
  • N virtual memory spaces
• But we don’t need to replicate the entire OOO execution engine (schedulers, execution units, bypass networks, ROBs, etc.)

25. SMT Fetch
• Cycle-multiplexed fetch logic
  [Figure: one fetch unit selects among PC0, PC1, PC2 by cycle % N, then I$, Decode, etc., into the RS]
• Duplicate fetch logic
  [Figure: PC0, PC1, PC2 each with their own fetch unit into a shared I$, then Decode, Rename, Dispatch into the RS]
• Alternatives
  • Other-multiplexed fetch logic
  • Duplicate the I$ as well
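A minimal sketch of the cycle-multiplexed policy described above; the per-thread PC list, fetch-group size, and function name are hypothetical illustration, not the actual mux hardware:

```python
# Cycle-multiplexed SMT fetch: each cycle, one thread's PC is selected
# round-robin (cycle % N) and its instructions are fetched.

N_THREADS = 2
pcs = [0x1000, 0x2000]           # one architectural PC per thread (hypothetical)

def fetch_cycle(cycle):
    tid = cycle % N_THREADS      # round-robin thread selection
    pc = pcs[tid]
    # an icache lookup at pc would return a fetch group here
    pcs[tid] += 16               # hypothetical: advance past a 4-instruction group
    return tid, pc

for c in range(4):
    print(fetch_cycle(c))        # alternates thread 0, thread 1, ...
```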

26. SMT Rename
• Thread #1’s R12 != Thread #2’s R12
  • separate name spaces
  • need to disambiguate
[Figure: two options - one RAT per thread (RAT0 indexed by Thread0’s register #, RAT1 by Thread1’s) in front of a shared PRF, or a single RAT indexed by the thread-ID concatenated with the register #]
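A sketch of the single shared-RAT alternative, where the lookup index is the thread-ID concatenated with the architectural register number; the table size and entry names are assumed for illustration:

```python
# Single shared RAT indexed by {thread-ID, register #}.
# With 2 threads and 32 architectural registers, the RAT has 64 entries.

N_THREADS, N_ARCH_REGS = 2, 32
rat = [None] * (N_THREADS * N_ARCH_REGS)   # maps to physical register tags

def rat_index(tid, reg):
    # concatenate thread-ID (high bits) with architectural register # (low bits)
    return (tid << 5) | reg                # 5 bits because 32 arch registers

rat[rat_index(0, 12)] = "T12"              # Thread 0's R12
rat[rat_index(1, 12)] = "T17"              # Thread 1's R12: a different entry
assert rat[rat_index(0, 12)] != rat[rat_index(1, 12)]
```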

27. SMT Issue, Exec, Bypass, …
• No change needed
• Thread 0 before renaming: Add R1 = R2 + R3; Sub R4 = R1 – R5; Xor R3 = R1 ^ R4; Load R2 = 0[R3]
• Thread 0 after renaming: Add T12 = T20 + T8; Sub T19 = T12 – T16; Xor T14 = T12 ^ T19; Load T23 = 0[T14]
• Thread 1 before renaming: Add R1 = R2 + R3; Sub R4 = R1 – R5; Xor R3 = R1 ^ R4; Load R2 = 0[R3]
• Thread 1 after renaming: Add T17 = T29 + T3; Sub T5 = T17 – T2; Xor T31 = T17 ^ T5; Load T25 = 0[T31]
• Shared RS entries: after renaming, both threads’ instructions can coexist in the same reservation stations

28. SMT Cache
• Each process has its own virtual address space
  • TLB must be thread-aware
  • translate (thread-id, virtual page) → physical page
• Virtual portion of caches must also be thread-aware
  • VIVT cache must now be (virtual addr, thread-id)-indexed, (virtual addr, thread-id)-tagged
  • Similar for VIPT cache
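A sketch of the thread-aware translation, modeling the TLB as a map keyed on (thread-id, virtual page); the page size and mappings are hypothetical, and a real TLB is a CAM, not a software dictionary:

```python
# Thread-aware TLB: the lookup key includes the thread-ID, so the same
# virtual page number in two threads can map to different physical pages.

PAGE_BITS = 12                              # hypothetical 4 KB pages
tlb = {}                                    # (tid, vpage) -> ppage

tlb[(0, 0x00400)] = 0x12345                 # thread 0's page
tlb[(1, 0x00400)] = 0x6789A                 # same vpage, different frame

def translate(tid, vaddr):
    vpage, offset = vaddr >> PAGE_BITS, vaddr & ((1 << PAGE_BITS) - 1)
    ppage = tlb[(tid, vpage)]               # a miss would trigger a page walk
    return (ppage << PAGE_BITS) | offset

assert translate(0, 0x00400123) != translate(1, 0x00400123)
```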

29. SMT Commit
• One “Commit PC” per thread
• Register File Management
  • ARF/PRF organization
    • need one ARF per thread
  • Unified PRF
    • need one “architected RAT” per thread
• Need to maintain interrupts, exceptions, faults on a per-thread basis
  • just as OOO needs to appear to the outside world to be in-order, SMT needs to appear as if it were actually N CPUs

30. SMT Design Space
• Number of threads
• Full-SMT vs. hard-partitioned SMT
  • full-SMT: ROB entries can be allocated arbitrarily between the threads
  • hard-partitioned: if only one thread, use all ROB entries; if two threads, each is limited to one half of the ROB (even if the other thread uses only a few entries); possibly similar for RS, LSQ, PRF, etc. (see the sketch below)
• Amount of duplication
  • Duplicate I$, D$, fetch engine, decoders, schedulers, etc.?
• There’s a continuum of possibilities between SMT and CMP
  • ex. could have a CMP where the FP unit is shared SMT-style
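A sketch contrasting the two ROB allocation policies just described; the ROB size and helper function are hypothetical:

```python
# ROB allocation: full-SMT lets any thread take any free entry;
# hard-partitioned SMT caps each active thread at ROB_SIZE / n_threads.

ROB_SIZE = 128

def can_allocate(entries_used_by, tid, n_active_threads, hard_partitioned):
    if hard_partitioned:
        # each active thread limited to its fixed share, even if others are idle
        return entries_used_by[tid] < ROB_SIZE // n_active_threads
    # full-SMT: only the total number of occupied entries matters
    return sum(entries_used_by.values()) < ROB_SIZE

# With two active threads, a hard-partitioned ROB caps thread 0 at 64 entries
# even though thread 1 uses only a few:
print(can_allocate({0: 64, 1: 3}, 0, 2, hard_partitioned=True))   # False
print(can_allocate({0: 64, 1: 3}, 0, 2, hard_partitioned=False))  # True
```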

31. SMT Performance
• When it works, it fills idle “issue slots” with work from other threads; throughput improves
• But sometimes it can cause performance degradation!
[Figure: Time(finish one task, then do the other) < Time(do both at the same time using SMT)]

32. How?
• Cache thrashing
[Figure: Thread0 just fits in the level-1 caches (I$, D$, with L2 behind) and executes reasonably quickly due to high cache hit rates; after a context switch, Thread1 also fits nicely in the caches; the caches were just big enough to hold one thread’s data, but not two threads’ worth, so now both threads have significantly higher cache miss rates]

33. Fairness
• Consider two programs
• By themselves:
  • Program A: runtime = 10 seconds
  • Program B: runtime = 10 seconds
• On SMT:
  • Program A: runtime = 14 seconds
  • Program B: runtime = 18 seconds
• Standard Deviation of Speedups (lower = better)
  • A’s speedup: 10/14 = 0.71
  • B’s speedup: 10/18 = 0.56
  • SDS = 0.11
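Checking the numbers above in Python (SDS comes out to 0.11 when computed as a sample standard deviation):

```python
import statistics

alone = {"A": 10.0, "B": 10.0}     # runtimes in seconds, each program alone
on_smt = {"A": 14.0, "B": 18.0}    # runtimes when co-scheduled on SMT

speedups = [alone[p] / on_smt[p] for p in alone]   # [0.71, 0.56]
sds = statistics.stdev(speedups)                   # sample std dev ~ 0.11
print(f"speedups = {[round(s, 2) for s in speedups]}, SDS = {sds:.2f}")
```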

34. Fairness (2)
• SDS encourages everyone to be punished similarly
  • does not account for actual performance, so if everyone is 1000x slower, it’s still “fair”
• Alternative: Harmonic Mean of Weighted IPCs (HMWIPC)
  • IPC_i = achieved IPC for thread i
  • SingleIPC_i = IPC when thread i runs alone
  • HMWIPC = N / (SingleIPC_1/IPC_1 + SingleIPC_2/IPC_2 + … + SingleIPC_N/IPC_N)
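A sketch of the metric as reconstructed above, with made-up IPC values for illustration:

```python
# Harmonic Mean of Weighted IPCs: each thread's achieved IPC is weighted
# by its standalone IPC, then combined with a harmonic mean.

def hmwipc(ipc, single_ipc):
    n = len(ipc)
    return n / sum(s / a for a, s in zip(ipc, single_ipc))

# Hypothetical 2-thread example: both threads slow down under SMT
achieved = [1.0, 0.8]       # IPC_i on the SMT machine
alone    = [1.4, 1.6]       # SingleIPC_i running alone
print(hmwipc(achieved, alone))   # ~0.59: well below 1, i.e. per-thread slowdown
```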

35. This is all combinable
• Can have a system that supports SMP, CMP and SMT at the same time
  • Take a dual-socket SMP motherboard…
  • Insert two chips, each with a dual-core CMP…
  • Where each core supports two-way SMT
• This example provides 8 threads worth of execution, shared on 4 actual “cores”, split across two physical packages

36. OS Confusion
• SMT/CMP is supposed to look like multiple CPUs to the software/OS
• Say the OS has two tasks to run… it schedules tasks to (virtual) CPUs
[Figure: 2 cores (either SMP/CMP), each 2-way SMT, exposing CPU0-CPU3; the OS schedules tasks A and B onto CPU0 and CPU1, the two SMT contexts of the same core, so that core runs A/B while the other core sits idle - performance is worse than if SMT was turned off and only 2-way SMP used]

37. OS Confusion (2)
• Asymmetries in the MP hierarchy can be very difficult for the OS to deal with
  • need to break the abstraction: the OS needs to know which CPUs are real physical processors (SMP), which are shared in the same package (CMP), and which are virtual (SMT)
• Distinct applications should be scheduled to physically different CPUs
  • no cache contention, no power contention
• Cooperative applications (different threads of the same program) should maybe be scheduled to the same physical chip (CMP)
  • reduce latency of inter-thread communication, possibly reduce duplication if a shared L2 is used
• Use SMT as the last choice

38. Multi-* is Happening
• Intel Pentium 4 already had “Hyperthreading” (SMT)
  • went away for a while, but is back in Core i7
• IBM Power 5 and later have SMT
• Dual, quad core already here
  • Octo-core soon
  • Intel Core i7: 4 cores, each with 2-thread SMT (8 threads total)
• So is single-thread performance dead?
  • Is single-thread microarchitecture performance dead?
• Following adapted from Mark Hill’s HPCA’08 keynote talk

39. Recall Amdahl’s Law
• Begins with a Simple Software Assumption (Limit Argument)
  • Fraction F of execution time perfectly parallelizable
  • No overhead for
    • Scheduling
    • Synchronization
    • Communication, etc.
  • Fraction 1 - F completely serial
• Time on 1 core = (1 - F) / 1 + F / 1 = 1
• Time on N cores = (1 - F) / 1 + F / N
The following slides derived from Mark Hill’s HPCA’08 Keynote

40. Recall Amdahl’s Law [1967]
• Amdahl’s Speedup = 1 / ((1 - F)/1 + F/N)
• For mainframes, Amdahl expected 1 - F = 35%
  • For a 4-processor speedup = 2
  • For infinite-processor speedup < 3
  • Therefore, stay with mainframes with one/few processors
• Do multicore chips repeal Amdahl’s Law?
  • Answer: No, But.
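The law as a one-line function, reproducing Amdahl’s mainframe numbers from this slide:

```python
def amdahl_speedup(f, n):
    """Speedup with parallel fraction f on n cores (no overheads)."""
    return 1.0 / ((1.0 - f) + f / n)

# Amdahl's 1967 expectation: 1 - F = 35%, i.e. F = 0.65
print(amdahl_speedup(0.65, 4))      # ~1.95: about 2 on 4 processors
print(amdahl_speedup(0.65, 1e9))    # ~2.86: under 3 even with "infinite" processors
```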

41. Designing Multicore Chips Hard
• Designers must confront single-core design options
  • Instruction fetch, wakeup, select
  • Execution unit configuration & operand bypass
  • Load/store queue(s) & data cache
  • Checkpoint, log, runahead, commit
• As well as additional design degrees of freedom
  • How many cores? How big each?
  • Shared caches: levels? How many banks?
  • Memory interface: How many banks?
  • On-chip interconnect: bus, switched, ordered?

42. Want Simple Multicore Hardware Model To Complement Amdahl’s Simple Software Model
(1) Chip hardware roughly partitioned into
  • Multiple cores (with L1 caches)
  • The Rest (L2/L3 cache banks, interconnect, pads, etc.)
  • Changing core size/number does NOT change The Rest
(2) Resources for multiple cores bounded
  • Bound of N resources per chip for cores
  • Due to area, power, cost ($$$), or multiple factors
  • Bound = Power? (but our pictures use Area)

43. Want Simple Multicore Hardware Model, cont.
(3) Micro-architects can improve single-core performance using more of the bounded resource
  • A Simple Base Core
    • Consumes 1 Base Core Equivalent (BCE) of resources
    • Provides performance normalized to 1
  • An Enhanced Core (in the same process generation)
    • Consumes R BCEs
    • Performance as a function Perf(R)
  • What does the function Perf(R) look like?

44. More on Enhanced Cores
• (Performance Perf(R), consuming R BCEs of resources)
• If Perf(R) > R → always enhance the core
  • Cost-effectively speeds up both sequential & parallel
  • Therefore, equations assume Perf(R) < R
• Graphs assume Perf(R) = square root of R
  • 2x performance for 4 BCEs, 3x for 9 BCEs, etc.
  • Why? Models diminishing returns with “no coefficients”
• How to speed up the enhanced core?
  • <Insert favorite or TBD micro-architectural ideas here>

45. How Many (Symmetric) Cores per Chip?
• Each chip bounded to N BCEs (for all cores)
• Each core consumes R BCEs
• Assume Symmetric Multicore = all cores identical
• Therefore, N/R cores per chip (so (N/R)*R = N)
• For an N = 16 BCE chip: sixteen 1-BCE cores, four 4-BCE cores, or one 16-BCE core

46. Performance of Symmetric Multicore Chips
• Serial fraction 1-F uses 1 core at rate Perf(R)
  • Serial time = (1 - F) / Perf(R)
• Parallel fraction F uses N/R cores at rate Perf(R) each
  • Parallel time = F / (Perf(R) * (N/R)) = F*R / (Perf(R)*N)
• Therefore, w.r.t. one base core:
  • Symmetric Speedup = 1 / ((1 - F)/Perf(R) + F*R/(Perf(R)*N))
• Enhanced cores speed up both serial & parallel phases
• Implications?
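A sketch combining this speedup formula with slide 44’s Perf(R) = sqrt(R) assumption:

```python
import math

def perf(r):
    # slide 44's modeling assumption: diminishing returns, Perf(R) = sqrt(R)
    return math.sqrt(r)

def symmetric_speedup(f, n, r):
    """Speedup of an n-BCE chip split into n/r cores of r BCEs each."""
    serial_time = (1.0 - f) / perf(r)
    parallel_time = f * r / (perf(r) * n)
    return 1.0 / (serial_time + parallel_time)

print(symmetric_speedup(0.5, 16, 16))   # 4.0, matching the next slide
```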

47. Symmetric Multicore Chip, N = 16 BCEs
• F=0.5: multicore is not optimal; best is R=16, Cores=1, Speedup=4
  • S = 4 = 1/(0.5/4 + 0.5*16/(4*16))
• Need to increase parallelism to make multicore optimal!
[Figure: speedup vs. R for F=0.5, with points at 16, 8, 4, 2, and 1 cores]
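Continuing the symmetric_speedup sketch above, sweeping R reproduces the data points on this slide and the next (a sanity check, not the original plotting script):

```python
# For F=0.5 the single big core wins; for F=0.9 the optimum
# shifts to eight 2-BCE cores.
for f in (0.5, 0.9):
    best = max((1, 2, 4, 8, 16), key=lambda r: symmetric_speedup(f, 16, r))
    print(f"F={f}: best R={best}, cores={16 // best}, "
          f"speedup={symmetric_speedup(f, 16, best):.1f}")
# F=0.5: best R=16, cores=1, speedup=4.0
# F=0.9: best R=2, cores=8, speedup=6.7
```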

48. Symmetric Multicore Chip, N = 16 BCEs
• At F=0.9, multicore is optimal (R=2, Cores=8, Speedup=6.7), but speedup is limited
  • compare F=0.5: R=16, Cores=1, Speedup=4
• Need to obtain even more parallelism!

49. Symmetric Multicore Chip, N = 16 BCEs
• As F → 1: R=1, Cores=16, Speedup → 16
• F matters: Amdahl’s Law applies to multicore chips
• Researchers should target parallelism F first

50. Symmetric Multicore Chip, N = 16 BCEs
• Recall F=0.9: R=2, Cores=8, Speedup=6.7
• As Moore’s Law enables N to go from 16 to 256 BCEs:
  • More core enhancements? More cores? Or both?
