
Designing On-chip Memory Systems for Throughput Architectures


Presentation Transcript


  1. Designing On-chip Memory Systems for Throughput Architectures Ph.D. Proposal Jeff Diamond Advisor: Stephen Keckler

  2. Turning to Heterogeneous Chips “We’ll be seeing a lot more than 2-4 cores per chip really quickly” – Bill Mark, 2005 AMD - TRINITY NVIDIA Tegra 3 Intel – Ivy Bridge

  3. Talk Outline • Introduction • The Problem • Throughput Architectures • Dissertation Goals • The Solution • Modeling Throughput Performance • Architectural Enhancements • Thread Scheduling • Cache Policies • Methodology • Proposed Work

  4. Throughput Architectures (TA) • Key Features: • Break a single application into threads • Use explicit parallelism • Optimize hardware for performance density, not single-thread performance • Benefits: • Drop voltage and peak frequency for a quadratic improvement in power efficiency • Cores are smaller and more energy efficient • Further amortized through multithreading and SIMD • Less need for out-of-order execution, register renaming, branch prediction, fast synchronization, or low-latency ALUs

  5. Scope – Highly Threaded TA • Architecture Continuum: • Multithreading: large number of threads to mask long latency; small amount of cache, primarily for bandwidth • Caching: large amounts of cache to reduce latency; small number of threads • Can we get the benefits of both? • POWER7: 4 threads/core, ~1MB cache/thread • SPARC T4: 8 threads/core, ~80KB/thread • GTX 580: 48 threads/core, ~2KB/thread

  6. Problem - Technology Mismatch • Computation is cheap, data movement is expensive • Hit in L1 cache, 2.5x power of 64-bit FMADD • Move across chip, 50x power • Fetch from DRAM, 320x power • Limited off-chip bandwidth • Exponential growth in cores saturates BW • Performance capped • DRAM latency currently hundreds of cycles • Need hundreds of threads/core in flight to cover DRAM latency

  7. The Downward Spiral • Little’s Law • Threads needed is proportional to average latency • Opportunity cost in on-chip resources • Thread contexts • In flight memory accesses • Too many threads – negative feedback • Adding threads to cover latency increases latency • Slower register access, thread scheduling • Reduced Locality • Reduces bandwidth and DRAM efficiency • Reduces effectiveness of caching • Parallel starvation
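
To make the Little's Law point above concrete, here is a minimal back-of-the-envelope sketch in C++; the DRAM latency, memory fraction, and target issue rate are illustrative assumptions, not figures from the proposal:

    // Little's Law sketch: in-flight requests = request rate x latency.
    // All numeric inputs are assumed values for illustration only.
    #include <cstdio>

    int main() {
        const double dram_latency_cycles = 400.0;  // assumed average DRAM latency
        const double memory_fraction     = 0.25;   // assumed: 1 in 4 instructions touches memory
        const double target_ipc          = 1.0;    // instructions per cycle to sustain per core

        // Memory requests issued per cycle if the core never stalls:
        const double mem_requests_per_cycle = target_ipc * memory_fraction;

        // Little's Law: requests that must be in flight = rate x latency.
        const double outstanding = mem_requests_per_cycle * dram_latency_cycles;

        // If each thread keeps roughly one miss outstanding, this is also the
        // number of threads per core needed to hide DRAM latency.
        std::printf("Threads needed per core: %.0f\n", outstanding);
        return 0;
    }

With these assumed numbers the answer is 100 threads per core, the same order of magnitude as the "hundreds of threads/core" on slide 6.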

  8. Talk Outline • Introduction • The Problem • Throughput Architectures • Dissertation Goals • The Solution • Modeling Throughput Performance • Architectural Enhancements • Thread Scheduling • Cache Policies • Methodology • Proposed Work

  9. Goal: Increase Parallel Efficiency • Problem: Too Many Threads! • Increase parallel efficiency, i.e. reduce the number of threads needed for a given level of performance • Improves throughput performance • Apply low-latency caches • Leverage the upward spiral • Difficult to mix multithreading and caching • Caches typically used just for bandwidth amplification • Important ancillary factors • Thread scheduling • Instruction scheduling (per-thread parallelism)

  10. Contributions • Quantifying the impact of single thread performance on throughput performance • Developing a mathematical analysis of throughput performance • Building a novel hybrid-trace based simulation infrastructure • Demonstrating unique architectural enhancements in thread scheduling and cache policies

  11. Talk Outline • Introduction • The Problem • Throughput Architectures • Dissertation Goals • The Solution • Modeling Throughput Performance • Cache Performance • The Valley • Architectural Enhancements • Thread Throttling • Cache Policies • Methodology • Proposed Work

  12. Mathematical Analysis • Why take a mathematical approach? • Be very precise about what we want to optimize • Understand the relationships and sensitivities to throughput performance • Single thread performance • Cache improvements • Application characteristics • Rapid evaluation of design space • Suggest most fruitful architectural improvements

  13. Modeling Throughput Performance • P_CHIP = total throughput performance • P_ST = single-thread performance • N_T = total active threads • L_AVG = average instruction latency • Power_CHIP = E_AVG (Joules/instruction) × P_CHIP • How can caches help throughput performance?
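
Reading slide 13's definitions in the usual way, chip throughput is the product of the active thread count and per-thread performance, and per-thread performance is the reciprocal of average instruction latency. A minimal C++ sketch of that assumed form (the slide's exact equations may differ):

    // Throughput-model sketch (assumed form):
    //   P_ST       = 1 / L_AVG       per-thread performance
    //   P_CHIP     = N_T * P_ST      chip throughput
    //   Power_CHIP = E_AVG * P_CHIP  Joules/instruction times instruction rate
    struct ThroughputModel {
        double n_threads;    // N_T: total active threads
        double avg_latency;  // L_AVG: average cycles per instruction for one thread
        double avg_energy;   // E_AVG: average Joules per instruction

        double single_thread_perf() const { return 1.0 / avg_latency; }
        double chip_perf() const { return n_threads * single_thread_perf(); }
        double chip_power() const { return avg_energy * chip_perf(); }
    };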

  14. Cache As A Performance Unit • Area comparison: one FPU = 2-11KB of SRAM or 8-40KB of eDRAM • FMADD: active power 20pJ/op, leakage power 1 watt/mm² • SRAM: active power 50pJ per L1 access, 1.1nJ per L2 access, leakage power 70 milliwatts/mm² • Leakage power comparison: one FPU = ~64KB SRAM / 256KB eDRAM • Caches make loads 150x faster and 300x more energy efficient, and use 10-15x less power/mm² than FPUs • Key: how much cache does a thread need?

  15. Performance From Caching • Ignore changes to DRAM latency & off-chip BW (we will simulate these) • Assume ideal caches (frequency) • What is the maximum performance benefit? • A = arithmetic intensity of the application (fraction of non-memory instructions) • M = 1 − A = memory intensity • N_T = total active threads on chip • L = latency • For power, replace L with E, the average energy per instruction: qualitatively identical, but the differences are more dramatic

  16. Ideal Cache = Frequency Cache • Hit rate depends on amount of cache, application working set • Store items used the most times • This is the concept of “frequency” • Once we know an application’s memory access characteristics, we can model throughput performance
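
One executable reading of the "frequency" idea above: given a per-line access-count profile for an application, an ideal frequency cache of a given capacity keeps the most-used lines, and the hit rate H(c) is the fraction of accesses those lines capture. A sketch under that assumption (the function name and profile source are hypothetical):

    // Ideal frequency-cache hit rate H(c): keep the most frequently used lines.
    #include <algorithm>
    #include <cstddef>
    #include <functional>
    #include <numeric>
    #include <vector>

    double ideal_hit_rate(std::vector<long> line_access_counts, std::size_t capacity_lines) {
        // Sort lines by access count, most frequently used first.
        std::sort(line_access_counts.begin(), line_access_counts.end(), std::greater<long>());
        const std::size_t kept = std::min(capacity_lines, line_access_counts.size());
        const long total = std::accumulate(line_access_counts.begin(), line_access_counts.end(), 0L);
        const long hits  = std::accumulate(line_access_counts.begin(),
                                           line_access_counts.begin() + kept, 0L);
        return total > 0 ? static_cast<double>(hits) / static_cast<double>(total) : 0.0;
    }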

  17. Modeling Cache Performance [Plots: F(c), H(c), and P_ST(c) as functions of cache capacity c]

  18. Cache Performance Per Thread [Plot: P_ST(t) is a steep reciprocal]

  19. Talk Outline • Introduction • The Problem • Throughput Architectures • Dissertation Goals • The Solution • Modeling Throughput Performance • Cache Performance • “The Valley” • Architectural Enhancements • Thread Throttling • Cache Policies • Methodology • Proposed Work

  20. “The Valley” in Cache Space [Plot: throughput performance across cache space; high-performance (flat access) points marked X]

  21. “The Valley” In Thread Space [Plot: performance vs. active thread count, with and without cache, showing the cache regime, the valley and its width, and the MT regime]

  22. Prior Work • Hong et al, 2009, 2010 • Simple, cacheless GPU models • Used to predict “MT peak” • Guz et al, 2008, 2010 • Graphed throughput performance with assumed cache profile • Identified “valley” structure • Validated against PARSEC benchmarks • No mathematical analysis • Minimal focus on bandwidth limited regime • CMP benchmarks • Galal et al, 2011 • Excellent mathematical analysis • Focused on FPU+Register design

  23. “The Valley” In Thread Space [Plot repeated from slide 21: cache regime, valley and width, MT regime, with and without cache]

  24. Energy vs Latency

  25. Valley – Energy Efficiency

  26. Talk Outline • Introduction • The Problem • Throughput Architectures • Dissertation Goals • The Solution • Modeling Throughput Performance • Cache Performance • The Valley • Architectural Enhancements • Thread Throttling • Cache Policies • Methodology • Proposed Work

  27. Contribution Thread Throttling • Have real-time information: • Arithmetic intensity • Bandwidth utilization • Current hit rate • Conservatively approximate locality • Approximate the optimum operating points • Shut down / activate threads to increase performance • Concentrate power and overclock • Clock off unused cache if there is no benefit
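
A hedged sketch of the kind of runtime decision described above: the counters are the ones named on the slide, but the thresholds and the specific policy below are assumptions for illustration, not the proposal's actual algorithm:

    // Thread-throttling sketch: choose an active-thread count from runtime counters.
    // Thresholds and policy are hypothetical.
    #include <algorithm>

    struct RuntimeCounters {
        double arithmetic_intensity;   // fraction of non-memory instructions
        double bandwidth_utilization;  // fraction of peak off-chip bandwidth in use
        double cache_hit_rate;         // current hit rate
    };

    int choose_active_threads(const RuntimeCounters& c, int current, int max_threads) {
        if (c.bandwidth_utilization > 0.95)          // off-chip BW saturated: more threads cannot help
            return std::max(current / 2, 1);
        if (c.cache_hit_rate > 0.80)                 // cache regime: fewer threads keep working sets resident
            return std::max(current - 1, 1);
        return std::min(current * 2, max_threads);   // MT regime: add threads to cover latency
    }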

  28. Prior Work • Many studies in CMP and GPU area scale back threads • CMP – miss rates get too high • GPU – off-chip bandwidth is saturated • Simple to hit, unidirectional • Valley is much more complex • Two points to hit • Three different operating regimes • Mathematical analysis lets us approximate both points with as little as two samples • Both off-chip bandwidth and reciprocal of hit rate are nearly linear for a wide range of applications
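
Slide 28's two-sample claim rests on off-chip bandwidth and the reciprocal of hit rate being roughly linear in thread count. A hypothetical sketch of that extrapolation (the linearity assumption and the names below are mine, not the proposal's formulation):

    // Two-sample linear extrapolation sketch for locating the throttling targets.
    struct Sample { double threads, bandwidth, inv_hit_rate; };

    // Fit bandwidth(threads) as a line through two samples and solve for the
    // thread count at which it reaches peak off-chip bandwidth; the same
    // two-point fit can be applied to inv_hit_rate for the cache-regime point.
    double threads_at_bw_saturation(Sample a, Sample b, double peak_bw) {
        const double slope = (b.bandwidth - a.bandwidth) / (b.threads - a.threads);
        return a.threads + (peak_bw - a.bandwidth) / slope;
    }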

  29. Finding Optimal Points [Plot: the valley curve from slide 21 – cache regime, valley and width, MT regime, with and without cache – used to locate the optimal operating points]

  30. Talk Outline • Introduction • The Problem • Throughput Architectures • Dissertation Goals • The Solution • Modeling Throughput Performance • Cache Performance • The Valley • Architectural Enhancements • Thread Throttling • Cache Policies (Indexing, replacement) • Methodology • Proposed Work

  31. From Mathematical Analysis: • Need to work like LFU cache • Hard to implement in practice • Still very little cache per thread • Policies make big differences for small caches • Associativity a big issue for small caches • Cannot cache every line referenced • Beyond “dead line” prediction • Stream lines with lower reuse

  32. Contribution – Odd Set Indexing • Conflict misses are a pathological issue • Most often seen with power-of-2 strides • Idea: map to 2^N − 1 sets/banks instead • A true “silver bullet”: virtually eliminates conflict misses in every setting we’ve tried • Reduced scratchpad banks from 32 to 7 at the same level of bank conflicts • Fastest, most efficient implementation • Adds just a few gate delays • Logic area < 4% of a 32-bit integer multiply • Can still access the last bank
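
A minimal sketch of how an address can be mapped onto 2^N − 1 sets, using the standard digit-folding identity for modulo 2^N − 1; this illustrates the idea, not the exact circuit behind the slide (in hardware the fold becomes a small tree of N-bit adders with end-around carry, which is where the "few gate delays" come from):

    // Odd-set index: set = line_address mod (2^N - 1).
    #include <cstdint>

    uint32_t odd_set_index(uint64_t line_addr, unsigned n_bits) {
        const uint64_t mod = (1ull << n_bits) - 1;  // 2^N - 1 sets
        uint64_t x = line_addr;
        while (x > mod) {
            uint64_t folded = 0;
            while (x != 0) {          // summing the N-bit digits preserves the residue mod 2^N - 1
                folded += x & mod;
                x >>= n_bits;
            }
            x = folded;
        }
        return static_cast<uint32_t>(x == mod ? 0 : x);  // a residue of 2^N - 1 is congruent to 0
    }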

  33. More Preliminary Results [Chart: PARSEC L2 with 64 threads – fully associative vs. odd-set (1 bank) vs. direct mapped (1 bank)]

  34. Prior Work • A prime number of banks/sets was thought ideal, but had no efficient implementation • Mersenne primes are not so convenient: 2^N − 1 gives 3, 7, 15, 31, 63, 127, 255, most of which are not prime • We demonstrated primality wasn’t an issue • Yang, ISCA ’92 – prime strides for vector computers; showed 3x speedup • We get the correct offset for free • Kharbutli, HPCA ’04 – showed prime sets as a hash function for caches worked well • Our implementation is faster, with more features • They couldn’t appreciate the benefits using SPEC

  35. (Re)placement Policies • Not all data should be cached • Recent papers for LLC caches • Hard drive cache algorithms • Frequency over Recency • Frequency hard to implement • ARC good compromise • Direct Mapping Replacement dominates • Look for explicit approaches • Priority Classes • Epochs
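
Slide 35 names priority classes and epochs without detail; as one hypothetical reading, an explicit placement check could look like the following (the classes, threshold, and bypass decision are assumptions for illustration):

    // Explicit placement sketch: lines below the current priority threshold
    // bypass the cache (are streamed) instead of being inserted.
    enum class PriorityClass { Streaming = 0, Low = 1, Normal = 2, High = 3 };

    bool should_insert(PriorityClass line_class, PriorityClass threshold) {
        return static_cast<int>(line_class) >= static_cast<int>(threshold);
    }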

  36. Prior Work • Belady – solved it all • Three hierarchies of methods • Best utilized information of prior line usage • Light on implementation details • Approximations: • Hallnor & Reinhardt, ISCA 2000 – generational replacement • Megiddo & Modha, USENIX FAST 2003 – ARC cache: ghost entries, recency and frequency groups • Qureshi, 2006, 2007 – adaptive insertion policies • Multi-queue, LRU-K, D-NUCA, etc.

  37. Talk Outline • Introduction • The Problem • Throughput Architectures • Dissertation Goals • The Solution • Modeling Throughput Performance • Cache Performance • The Valley • Architectural Enhancements • Thread Throttling • Cache Policies (Indexing, replacement) • Methodology (Applications, Simulation) • Proposed Work

  38. Benchmarks • Initially studied regular HPC kernels/applications in CMP environment • Dense Matrix Multiply • Fast Fourier Transform • Homme weather simulation • Added CUDA throughput benchmarks • Parboil – old school MPI, coarse grained • Rodinia – fine grained, varied • Benchmarks typical of historical GPGPU applications • Will add irregular benchmarks • SparseMM, Adaptive Finite Elements, Photon mapping

  39. Subset of Benchmarks

  40. Preliminary Results • Most of the benchmarks should benefit: • Small working sets • Concentrated working sets • Hit rate curves easy to predict

  41. Typical Concentration of Locality

  42. Scratchpad Task Locality

  43. Hybrid Simulator Design • Flow: C++/CUDA source → NVCC → PTX intermediate assembly listing → modified Ocelot functional simulator with a custom trace module → compressed trace data (dynamic trace blocks with attachment points) → custom simulator • Can simulate a different architecture than the one traced • Goals: fast simulation; overcome compiler issues for a reasonable base case
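
To make the data flow above more concrete, here is a hypothetical layout for the compressed trace records the custom trace module might emit; all field names and widths are assumptions for illustration:

    // Hypothetical compressed trace record for the hybrid simulator.
    #include <cstdint>
    #include <vector>

    struct TraceBlock {
        uint32_t block_id;                    // static PTX basic block this execution came from
        uint32_t warp_id;                     // warp that executed the block
        std::vector<uint64_t> mem_addresses;  // memory addresses touched, in issue order
    };

    struct AttachmentPoint {
        uint32_t block_id;      // where the custom simulator re-attaches timing state to the trace
        uint64_t trace_offset;  // offset of that block in the compressed trace data
    };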

  44. Talk Outline • Introduction • The Problem • Throughput Architectures • Dissertation Goals • The Solution • Modeling Throughput Performance • Cache Performance • The Valley • Architectural Enhancements • Thread Throttling • Cache Policies (Indexing, replacement) • Methodology (Applications, Simulation) • Proposed Work

  45. Phase 1 – HPC Applications • Looked at GEMM, FFT & Homme in CMP setting • Learned implementation algorithms, alternative algorithms • Expertise allows for credible throughput analysis • Valuable Lessons in multithreading and caching • Dense Matrix Multiply • Blocking to maximize arithmetic intensity • Enough contexts to cover latency • Fast Fourier Transform • Pathologically hard on memory system • Communication & synchronization • HOMME – weather modeling • Intra-chip scaling incredibly difficult • Memory system performance variation • Replacing data movement with computation • First author publications: • PPoPP 2008, ISPASS 2011 (Best Paper)

  46. Phase 2 – Benchmark Characterization • Memory Access Characteristics of Rodinia and Parboil benchmarks • Apply Mathematical Analysis • Validate model • Find optimum operating points for benchmarks • Find optimum TA topology for benchmarks • NEARLY COMPLETE

  47. Phase 3 – Evaluate Enhancements • Automatic Thread Throttling • Low latency hierarchical cache • Benefits of odd-sets/odd-banking • Benefits of explicit placement (Priority/Epoch) • NEED FINAL EVALUATION and explicit placement study

  48. Final Phase – Extend Domain • Study regular HPC applications in throughput setting • Add at least two irregular benchmarks • Less likely to benefit from caching • New opportunities for enhancement • Explore impact of future TA topologies • Memory Cubes, TSV DRAM, etc.

  49. Conclusion • Dissertation Goals: • Quantify the degree to which single-thread performance affects throughput performance for an important class of applications • Improve parallel efficiency through thread scheduling, cache topology, and cache policies • Feasibility • Regular benchmarks show promising memory behavior • Cycle-accurate simulator nearly complete

  50. Proposed Timeline • Phase 1 – HPC applications – completed • Phase 2 – Mathematical model & benchmark characterization • May–June • Phase 3 – Architectural enhancements • July–August • Phase 4 – Domain enhancement / new features • September–November
