
Adaptive Single-Chip Multiprocessing


Presentation Transcript


  1. Adaptive Single-Chip Multiprocessing
  Dan Gibson (degibson@wisc.edu)
  University of Wisconsin-Madison, Department of Electrical and Computer Engineering

  2. Introduction
  • Moore's Law continues to provide more transistors
    • Devices are getting smaller
    • Devices are getting faster
      • Leads to increases in clock frequency
  • Memories are getting bigger
    • Large memories often require more time to access
    • RC circuits continue to charge exponentially
  • Long-wire signal propagation time is not improving as rapidly as switching speed
    • On-chip communication time is getting slower relative to processor clock speeds

  3. The Memory Wall
  • Processor speed grows faster than memory speed
  • Off-chip cache misses can stall even aggressive out-of-order processors
  • On-chip cache accesses are becoming long-latency events
  • Latency can sometimes be tolerated:
    • Caching
    • Prefetching
    • Speculation
    • Out-of-order execution
    • Multithreading (see the sketch below)
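To see why multithreading in particular tolerates latency, a back-of-the-envelope utilization model helps. The miss latency and run-length figures below are illustrative assumptions, not numbers from the talk:

```python
# Back-of-the-envelope model of latency tolerance via multithreading.
# All parameters are illustrative assumptions, not measurements.

MISS_LATENCY = 400      # cycles to service an off-chip miss
RUN_CYCLES = 100        # cycles a thread runs before its next miss

def utilization(num_threads: int) -> float:
    """Fraction of cycles the pipeline does useful work when each
    thread alternates RUN_CYCLES of execution with a MISS_LATENCY
    stall, and the core switches to a ready thread on every miss."""
    # Each thread's cycle is RUN_CYCLES + MISS_LATENCY long; threads
    # overlap their stalls, but the pipeline caps at 100% busy.
    demand = num_threads * RUN_CYCLES / (RUN_CYCLES + MISS_LATENCY)
    return min(1.0, demand)

for n in (1, 2, 4, 8):
    print(f"{n} thread(s): {utilization(n):.0%} pipeline utilization")
# 1 thread: 20%, 2 threads: 40%, 4 threads: 80%, 8 threads: 100%
```

The point of the model: each additional thread gives the core more ready work to overlap with outstanding misses, until the pipeline saturates.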

  4. The "Power" Wall
  • More devices and faster clocks mean more power
  • Power supply accounts for a large share of chip packaging pins (3,057 of 5,370 pins on the POWER5)
  • Heat dissipation increases total cost of ownership (~34 W of cooling power required to remove 100 W of heat)
  • Dynamic power in CMOS: P_dyn = α · C_L · V_DD² · f
    • Devices get smaller, faster, and more numerous
    • More capacitance, higher frequency
  • Architects can constrain α, C_L, and f (a worked example follows below)
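The α, C_L, and f above are the knobs in the standard CMOS dynamic power relation P_dyn = α · C_L · V_DD² · f. A minimal numeric sketch, with entirely made-up device parameters:

```python
# Standard CMOS dynamic power: P = alpha * C_L * Vdd^2 * f.
# The parameter values below are illustrative, not from the talk.

def dynamic_power(alpha: float, c_load: float, vdd: float, freq: float) -> float:
    """Switching power in watts: activity factor * switched capacitance
    * supply voltage squared * clock frequency."""
    return alpha * c_load * vdd ** 2 * freq

# A hypothetical chip: 20% activity factor, 50 nF aggregate switched
# capacitance, 1.2 V supply, 3 GHz clock.
print(f"{dynamic_power(alpha=0.2, c_load=50e-9, vdd=1.2, freq=3e9):.1f} W")
# ~43.2 W

# Halving frequency is one lever an architect can pull:
print(f"{dynamic_power(0.2, 50e-9, 1.2, 1.5e9):.1f} W")  # ~21.6 W
```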

  5. Enter Chip Multiprocessors (CMPs)
  • One chip, many processors
    • Multiple cores per chip
    • Often multiple threads per core
  Dual-core AMD Opteron die photo, from Microprocessor Report: Best Servers of 2004

  6. CMPs
  • CMPs can have good performance
    • Explicit thread-level parallelism
    • Related threads experience constructive prefetching
  • CMPs can tolerate long-latency events well
    • With many concurrent threads, long-latency memory accesses can be overlapped
  • CMPs can be power-efficient
    • Enables use of simpler cores
    • Distributes "hot spots"

  7. CMPs
  • CMPs are very specialized
    • They assume a (highly) threaded workload
  • Parallel machines are difficult to use
    • Parallel programming is not (yet) commonplace
  • Many problems are similar to traditional multiprocessors
    • Cache coherence
    • Memory consistency
  • Many new opportunities
    • Cache sharing
    • More integration

  8. Adaptive CMPs
  • To combat specialization, adapt a CMP dynamically to its current workload and system:
    • Adapt caching policy (Beckmann et al., Chang et al., and more)
    • Adapt cache structure (Alameldeen et al., and more)
    • Adapt thread scheduling (Kihm et al., in the SMT space)
  • Current idea:
    • Adaptive thread scheduling over the space of un-stalled and stalled threads
    • A union of single-core multithreading and runahead execution in the context of CMPs

  9. Single-Core Multithreading
  • Allow multiple (HW) threads within the same execution pipeline
  • Shares processor resources: FUs, decode logic, ROB, etc.
  • Shares local memory resources: L1 caches, LSQ, etc.
  • Can increase processor and memory utilization
  Sun's Niagara pipeline block diagram (Kongetira et al.); a thread-selection sketch follows below
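As a companion to the Niagara diagram, here is a minimal sketch of fine-grained thread selection: each cycle, issue from the next ready hardware thread in round-robin order, skipping stalled ones. The classes and fields are illustrative assumptions, not Niagara's actual logic:

```python
# Minimal sketch of fine-grained multithreaded issue: round-robin
# among ready hardware threads. All structures are illustrative.

from dataclasses import dataclass

@dataclass
class HWThread:
    tid: int
    stalled: bool = False   # e.g., waiting on a cache miss

class ThreadSelect:
    def __init__(self, threads):
        self.threads = threads
        self.last = 0       # index of the last thread that issued

    def next_thread(self):
        """Round-robin over ready threads; None if all are stalled."""
        n = len(self.threads)
        for i in range(1, n + 1):
            t = self.threads[(self.last + i) % n]
            if not t.stalled:
                self.last = t.tid
                return t
        return None

core = ThreadSelect([HWThread(0), HWThread(1, stalled=True),
                     HWThread(2), HWThread(3)])
for _ in range(4):
    t = core.next_thread()
    print("issue from thread", t.tid if t else "none (all stalled)")
# Issues from threads 2, 3, 0, 2; thread 1 is skipped while stalled.
```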

  10. Runahead Execution
  • Continue execution in the face of a cache miss:
    • "Checkpoint" architectural state
    • Continue execution speculatively
    • Convert memory accesses to prefetches
  • "Runahead" prefetches can be highly accurate, and can greatly improve cache performance (Mutlu et al.)
  • It is possible to issue useless prefetches
    • Can be power-inefficient (Mutlu et al.)
  A toy model of this loop appears below.
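A toy model of the loop described above: on a load miss, checkpoint, keep walking the instruction stream to convert later loads into prefetches, then restore and pay the original miss once. The trace, latency, and one-level cache model are all illustrative assumptions:

```python
# Toy model of runahead prefetching. Trace, latency, and the
# set-based "cache" are illustrative assumptions.

MISS_LATENCY = 100
trace = [("load", 0xA0), ("add", None), ("load", 0xB0),
         ("load", 0xC0), ("add", None)]

cache = set()          # addresses currently cached
cycles = 0
pc = 0
while pc < len(trace):
    op, addr = trace[pc]
    if op == "load" and addr not in cache:
        # Miss: "checkpoint" state, then run ahead speculatively,
        # converting later memory accesses into prefetches.
        for ra_pc in range(pc + 1, len(trace)):
            ra_op, ra_addr = trace[ra_pc]
            if ra_op == "load":
                cache.add(ra_addr)   # prefetch overlaps with the miss
        # Restore the checkpoint: pc is unchanged and speculative
        # results are simply discarded; pay the original miss once.
        cycles += MISS_LATENCY
        cache.add(addr)
    cycles += 1
    pc += 1

print(f"finished in {cycles} cycles")  # 105 cycles
```

Without the runahead pass, all three loads in this trace would miss, costing roughly 305 cycles instead of 105; the later loads hit because runahead prefetched them under the first miss.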

  11. Runahead/Multithreaded Core Interaction
  • Similar hardware requirements:
    • Additional register files
    • Additional LSQ entries
  • Competition for similar resources:
    • Execution time (processor pipeline, functional units, etc.)
    • Memory bandwidth
    • TLB entries, cache space, etc.

  12. Runahead/Multithreaded Core Interaction
  • A multithreaded core in a CMP, with runahead, must make difficult scheduling decisions
  • Thread scheduling considerations:
    • Which thread should run?
    • Should the thread use runahead?
    • How long should the thread run/runahead?
  • Scheduling implications:
    • Is an idle thread making forward progress at the expense of a useful thread?
    • Is a thread spinning on a lock held by another thread?
    • Is runahead effective for a given thread?
    • Is a given thread causing performance problems elsewhere in the CMP?

  13. Proposed Mechanism
  • Track per-thread state on:
    • Runahead prefetching accuracy
      • High accuracy favors allowing the thread to runahead
    • HW-assigned thread priority
      • Highly "useful" threads are preferred
  • Selection criteria (sketched in code below):
    • Heuristic-guided: select the best priority/accuracy pair
    • Probabilistically-guided: select a thread with likelihood proportional to its priority/accuracy
    • Useful-first: select non-runahead threads first, then select runahead threads
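A sketch of how the three selection criteria might be expressed. The per-thread fields mirror the tracked state above (priority, runahead accuracy), but the exact scoring and tie-breaking are assumptions, not the mechanism's specified design:

```python
# Sketch of the three thread-selection criteria from the slide.
# The priority*accuracy score and tie-breaking are assumptions.

import random
from dataclasses import dataclass

@dataclass
class ThreadState:
    tid: int
    priority: float      # HW-assigned usefulness, higher is better
    ra_accuracy: float   # runahead prefetch accuracy in [0, 1]
    in_runahead: bool    # currently executing past a miss

def heuristic(threads):
    """Heuristic-guided: pick the best priority/accuracy pair."""
    return max(threads, key=lambda t: t.priority * t.ra_accuracy)

def probabilistic(threads):
    """Probabilistically-guided: likelihood proportional to score."""
    weights = [t.priority * t.ra_accuracy for t in threads]
    return random.choices(threads, weights=weights, k=1)[0]

def useful_first(threads):
    """Useful-first: non-runahead threads before runahead threads."""
    normal = [t for t in threads if not t.in_runahead]
    pool = normal if normal else threads
    return max(pool, key=lambda t: t.priority)

threads = [ThreadState(0, priority=0.9, ra_accuracy=0.3, in_runahead=False),
           ThreadState(1, priority=0.5, ra_accuracy=0.8, in_runahead=True),
           ThreadState(2, priority=0.7, ra_accuracy=0.6, in_runahead=False)]
print(heuristic(threads).tid,      # 2 (best combined score)
      useful_first(threads).tid,   # 0 (highest-priority non-runahead)
      probabilistic(threads).tid)  # random, weighted by score
```

A hardware selector would run this decision every cycle or scheduling quantum, so the arithmetic would have to be far cheaper than this Python; the sketch only shows the decision structure.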

  14. Future Directions
  • Dynamically adaptable CMPs offer several future areas of research:
    • Adapt for power savings / heat dissipation
      • Computation relocation, load balancing, automatic low-power modes, etc.
    • Adapt to error conditions
      • Dynamically allocate backup threads
    • Automatically relocate threads to improve resource sharing
    • Combined HW/SW/VM approach

  15. Summary
  • Latency now dominates off-chip communication
    • On-chip communication isn't far behind
  • Many techniques exist to tolerate latency, including multithreading
  • CMPs provide new challenges and opportunities to computer architects
    • Latency tolerance
    • Potential for power savings
  • A CMP's behavior can be adapted to its workload
    • Dynamic management of shared resources
