Designing On-chip Memory Systems for Throughput Architectures Ph.D. Proposal Jeff Diamond Advisor: Stephen Keckler
Turning to Heterogeneous Chips “We’ll be seeing a lot more than 2-4 cores per chip really quickly” – Bill Mark, 2005 AMD Trinity • NVIDIA Tegra 3 • Intel Ivy Bridge
Talk Outline • Introduction • The Problem • Throughput Architectures • Dissertation Goals • The Solution • Modeling Throughput Performance • Architectural Enhancements • Thread Scheduling • Cache Policies • Methodology • Proposed Work
Throughput Architectures (TA) • Key Features: • Break a single application into many threads • Use explicit parallelism • Optimize hardware for performance density, not single-thread performance • Benefits: • Drop voltage and peak frequency • quadratic improvement in power efficiency (derivation below) • Cores smaller, more energy efficient • Further amortized through multithreading and SIMD • Less need for out-of-order execution, register renaming, branch prediction, fast synchronization, low-latency ALUs
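As a reminder of where the quadratic claim comes from, the standard dynamic-power scaling argument (a generic sketch, not specific figures from the proposal):

```latex
% Dynamic power: P = \alpha C V^2 f, and attainable frequency f scales
% roughly with voltage V. Replace one fast core with k slower cores,
% each at voltage V/k and frequency f/k (total throughput unchanged):
P_{total} = k \cdot \alpha C \left(\frac{V}{k}\right)^{2} \frac{f}{k}
          = \frac{\alpha C V^{2} f}{k^{2}}
% Same throughput at 1/k^2 the power: a quadratic gain in efficiency.
```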
Scope – Highly Threaded TA • Architecture Continuum: • Multithreading • Large number of threads masks long latency • Small amount of cache, primarily for bandwidth • Caching • Large amounts of cache to reduce latency • Small number of threads • Can we get the benefits of both? POWER7: 4 threads/core, ~1MB/thread • SPARC T4: 8 threads/core, ~80KB/thread • GTX 580: 48 threads/core, ~2KB/thread
Problem - Technology Mismatch • Computation is cheap, data movement is expensive • A hit in the L1 cache costs 2.5x the power of a 64-bit FMADD • Moving across the chip, 50x • Fetching from DRAM, 320x • Limited off-chip bandwidth • Exponential growth in cores saturates BW, capping performance • DRAM latency is currently hundreds of cycles • Hundreds of threads per core must be in flight to cover DRAM latency
The Downward Spiral • Little’s Law: threads needed are proportional to average latency (formula below) • Opportunity cost in on-chip resources • Thread contexts • In-flight memory accesses • Too many threads – negative feedback • Adding threads to cover latency increases latency • Slower register access, thread scheduling • Reduced locality • Reduces bandwidth and DRAM efficiency • Reduces effectiveness of caching • Parallel starvation
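Little’s Law as used here, with symbols matching the deck’s later notation (the numeric example is illustrative):

```latex
N_T = \lambda \cdot L_{AVG}
% N_T: requests (threads) that must be in flight, \lambda: issue rate,
% L_{AVG}: average latency. E.g., sustaining 1 memory request/cycle
% against a 400-cycle DRAM latency requires ~400 threads in flight;
% anything that raises L_{AVG} raises the thread count needed, which
% is the feedback loop driving the downward spiral.
```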
Talk Outline • Introduction • The Problem • Throughput Architectures • Dissertation Goals • The Solution • Modeling Throughput Performance • Architectural Enhancements • Thread Scheduling • Cache Policies • Methodology • Proposed Work
Goal: Increase Parallel Efficiency • Problem: too many threads! • Increase parallel efficiency, i.e. reduce the number of threads needed for a given level of performance • Improves throughput performance • Apply low-latency caches • Leverage the upward spiral • Difficult to mix multithreading and caching • Caches typically used just for bandwidth amplification • Important ancillary factors • Thread scheduling • Instruction scheduling (per-thread parallelism)
Contributions • Quantifying the impact of single-thread performance on throughput performance • Developing a mathematical analysis of throughput performance • Building a novel hybrid trace-based simulation infrastructure • Demonstrating unique architectural enhancements in thread scheduling and cache policies
Talk Outline • Introduction • The Problem • Throughput Architectures • Dissertation Goals • The Solution • Modeling Throughput Performance • Cache Performance • The Valley • Architectural Enhancements • Thread Throttling • Cache Policies • Methodology • Proposed Work
Mathematical Analysis • Why take a mathematical approach? • Be very precise about what we want to optimize • Understand the relationships and sensitivities to throughput performance • Single thread performance • Cache improvements • Application characteristics • Rapid evaluation of design space • Suggest most fruitful architectural improvements
Modeling Throughput Performance • P_CHIP = N_T × P_ST • P_CHIP = total chip throughput • P_ST = single-thread performance, proportional to 1/L_AVG • N_T = total active threads • L_AVG = average instruction latency • Power_CHIP = E_AVG (Joules/instruction) × P_CHIP • How can caches help throughput performance?
Cache As A Performance Unit • Area comparison: one FPU = 2-11KB SRAM or 8-40KB eDRAM • FMADD: active power 20pJ/op; leakage power ~1 W/mm² • SRAM: active power 50pJ/L1 access, 1.1nJ/L2 access; leakage power ~70 mW/mm² • Caches make loads 150x faster and 300x more energy efficient • Caches use 10-15x less power/mm² than FPUs • Leakage power comparison: one FPU = ~64KB SRAM / 256KB eDRAM • Key: how much cache does a thread need?
Performance From Caching • Ignore changes to DRAM latency & off-chip BW (we will simulate these) • Assume ideal caches (frequency) • What is the maximum performance benefit? (model below) • A = arithmetic intensity of the application (fraction of non-memory instructions) • M = 1 − A = memory intensity • N_T = total active threads on chip • L = memory latency • For power, replace L with E, the average energy per instruction: qualitatively identical, but the differences are more dramatic
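A reconstruction of the model these definitions suggest, in the style of the Guz et al. analysis cited later (the exact form on the original slide is assumed):

```latex
% Average latency per instruction with hit rate H, hit latency L_{hit},
% and memory latency L (ALU ops taken as unit latency):
L_{AVG} = A \cdot 1 + M \left( H \cdot L_{hit} + (1 - H) \cdot L \right)
% Chip throughput until bandwidth saturates:
P_{CHIP} = N_T \cdot P_{ST} = \frac{N_T}{L_{AVG}}
% Maximum caching benefit: as H \to 1 the memory term collapses from
% M \cdot L to M \cdot L_{hit}. For power, substitute energies E for
% latencies L, as noted above.
```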
Ideal Cache = Frequency Cache • Hit rate depends on amount of cache, application working set • Store items used the most times • This is the concept of “frequency” • Once we know an application’s memory access characteristics, we can model throughput performance
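A minimal sketch of the “frequency” ideal: replay a trace, keep the c most-referenced lines, and count the share of accesses they cover. The function name, trace format, and 64-byte line size are illustrative assumptions:

```cpp
#include <algorithm>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Ideal "frequency" hit rate H(c): a cache of capacity c lines that
// magically retains exactly the c most-referenced lines in the trace.
double frequencyHitRate(const std::vector<uint64_t>& trace, size_t c) {
    if (trace.empty()) return 0.0;
    std::unordered_map<uint64_t, uint64_t> count;
    for (uint64_t addr : trace) ++count[addr >> 6];   // 64B line granularity
    std::vector<uint64_t> freq;
    freq.reserve(count.size());
    for (const auto& kv : count) freq.push_back(kv.second);
    std::sort(freq.rbegin(), freq.rend());            // most-referenced first
    uint64_t hits = 0;
    for (size_t i = 0; i < c && i < freq.size(); ++i) hits += freq[i];
    return double(hits) / double(trace.size());       // upper bound on H(c)
}
```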
Modeling Cache Performance • [Figure: frequency curve F(c), hit rate H(c), and single-thread performance P_ST(c) as functions of cache size c]
Cache Performance Per Thread • [Figure: single-thread performance P_ST(t) as a function of cache per thread is a steep reciprocal curve]
Talk Outline • Introduction • The Problem • Throughput Architectures • Dissertation Goals • The Solution • Modeling Throughput Performance • Cache Performance • “The Valley” • Architectural Enhancements • Thread Throttling • Cache Policies • Methodology • Proposed Work
“The Valley” in Cache Space • [Figure: performance across the cache-allocation space; high-performance (flat access) points marked X]
“The Valley” In Thread Space • [Figure: throughput vs. thread count, showing the cache regime, the valley, and the MT regime; the valley width spans the “cache” and “no cache” operating points]
Prior Work • Hong et al, 2009, 2010 • Simple, cacheless GPU models • Used to predict “MT peak” • Guz et al, 2008, 2010 • Graphed throughput performance with assumed cache profile • Identified “valley” structure • Validated against PARSEC benchmarks • No mathematical analysis • Minimal focus on bandwidth limited regime • CMP benchmarks • Galal et al, 2011 • Excellent mathematical analysis • Focused on FPU+Register design
“The Valley” In Thread Space • [Figure repeated: cache regime, valley, and MT regime; valley width between the “cache” and “no cache” operating points]
Talk Outline • Introduction • The Problem • Throughput Architectures • Dissertation Goals • The Solution • Modeling Throughput Performance • Cache Performance • The Valley • Architectural Enhancements • Thread Throttling • Cache Policies • Methodology • Proposed Work
Contribution: Thread Throttling • Have real-time information: • Arithmetic intensity • Bandwidth utilization • Current hit rate • Conservatively approximate locality • Approximate the optimum operating points • Shut down / activate threads to increase performance • Concentrate power and overclock • Clock off unused cache if no benefit
Prior Work • Many studies in the CMP and GPU areas scale back threads • CMP: miss rates get too high • GPU: off-chip bandwidth is saturated • Those targets are simple to hit and unidirectional • The valley is much more complex • Two points to hit • Three different operating regimes • Mathematical analysis lets us approximate both points with as little as two samples • Both off-chip bandwidth and the reciprocal of hit rate are nearly linear for a wide range of applications (sketch below)
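One way the two-sample idea could look in a runtime controller; a hedged sketch assuming, as above, that bandwidth demand and the reciprocal of hit rate are roughly linear in thread count. Names and thresholds are illustrative, not the proposal’s actual mechanism:

```cpp
#include <utility>

struct Sample { double t, hit, bw; };  // thread count, hit rate, BW demand

// Fit y = a + b*t through two samples.
static std::pair<double, double> fitLine(double t1, double y1,
                                         double t2, double y2) {
    double b = (y2 - y1) / (t2 - t1);
    return { y1 - b * t1, b };
}

// MT-regime edge: smallest t whose extrapolated bandwidth demand
// saturates the off-chip limit BWmax.
double mtEdge(const Sample& s1, const Sample& s2, double BWmax) {
    auto [a, b] = fitLine(s1.t, s1.bw, s2.t, s2.bw);
    return (BWmax - a) / b;
}

// Cache-regime edge: t where the extrapolated hit rate drops below a
// target, using the near-linearity of 1/H(t) in t.
double cacheEdge(const Sample& s1, const Sample& s2, double Htarget) {
    auto [a, b] = fitLine(s1.t, 1.0 / s1.hit, s2.t, 1.0 / s2.hit);
    return (1.0 / Htarget - a) / b;
}
```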
Finding Optimal Points • [Figure: the valley curve annotated with the two target operating points, at the cache-regime and MT-regime edges]
Talk Outline • Introduction • The Problem • Throughput Architectures • Dissertation Goals • The Solution • Modeling Throughput Performance • Cache Performance • The Valley • Architectural Enhancements • Thread Throttling • Cache Policies (Indexing, replacement) • Methodology • Proposed Work
From Mathematical Analysis: • Need to work like an LFU cache • Hard to implement in practice • Still very little cache per thread • Policies make big differences for small caches • Associativity is a big issue for small caches • Cannot cache every line referenced • Beyond “dead line” prediction • Stream lines with lower reuse
Contribution – Odd-Set Indexing • Conflict misses are a pathological issue • Most often seen with power-of-2 strides • Idea: map to 2^N − 1 sets/banks instead • A true “silver bullet”: virtually eliminates conflict misses in every setting we’ve tried • Reduced scratchpad banks from 32 to 7 at the same level of bank conflicts • Fastest, most efficient implementation • Adds just a few gate delays (sketch below) • Logic area < 4% of a 32-bit integer multiplier • Can still access the last bank
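The few-gate-delay claim is consistent with the standard folding identity 2^k ≡ 1 (mod 2^k − 1), which lets hardware sum the k-bit digit groups of an address instead of dividing; a software sketch of that indexing (function name is illustrative):

```cpp
#include <cstdint>

// Set index = addr mod (2^k - 1). Because 2^k ≡ 1 (mod 2^k - 1), the
// k-bit digit groups of the address can simply be summed and the one
// wrap-around case corrected. In hardware this is a small adder tree:
// a few gate delays, no divider, and all 2^k - 1 banks, including the
// last, remain addressable.
uint32_t oddSetIndex(uint32_t addr, unsigned k) {
    const uint32_t m = (1u << k) - 1;   // 2^k - 1 sets/banks
    uint32_t x = addr;
    while (x > m)
        x = (x >> k) + (x & m);         // fold high digits into low ones
    return (x == m) ? 0 : x;            // m itself is ≡ 0 (mod m)
}
```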
More Preliminary Results • [Figure: PARSEC L2 hit rates with 64 threads; odd-set indexing with 1 bank vs. direct mapped with 1 bank, compared against fully associative]
Prior Work • A prime number of banks/sets was long thought ideal, but had no efficient implementation • Mersenne primes are not so convenient, and 2^N − 1 is not always prime: 3, 7, 15, 31, 63, 127, 255 • We demonstrated primality wasn’t the issue • Yang, ISCA ’92: prime strides for vector computers • Showed 3x speedup • We get the correct offset for free • Kharbutli, HPCA ’04: showed prime-set hash functions for caches work well • Our implementation is faster, with more features • Benefits were hard to see on SPEC
(Re)placement Policies • Not all data should be cached • Recent papers on LLC caches • Hard-drive cache algorithms • Frequency over recency • Frequency is hard to implement • ARC is a good compromise • With direct mapping, replacement dominates • Look for explicit approaches (sketch below) • Priority classes • Epochs
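A minimal sketch of what explicit placement could look like; the two-bit priority encoding and the insertion rule are illustrative assumptions, not the proposal’s final design:

```cpp
#include <cstdint>

// Software tags each memory instruction with a coarse reuse class;
// the cache inserts a fill only if its class beats the victim's,
// otherwise the line streams around the cache uncached.
enum Priority : uint8_t { STREAM = 0, LOW = 1, HIGH = 2, PINNED = 3 };

struct Line {
    uint64_t tag;
    Priority prio;
    bool     valid;
};

bool shouldInsert(const Line& victim, Priority incoming) {
    if (incoming == STREAM) return false;   // never cache streaming data
    if (!victim.valid)      return true;    // free way: always insert
    return incoming > victim.prio;          // evict only lower classes
}
```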
Prior Work • Belady: solved it all • Three hierarchies of methods • Best utilized information about prior line usage • Light on implementation details • Approximations • Hallnor & Reinhardt, ISCA 2000: generational replacement • Megiddo & Modha, USENIX FAST 2003: ARC cache • Ghost entries • Recency and frequency groups • Qureshi, 2006, 2007: adaptive insertion policies • Multi-queue, LRU-K, D-NUCA, etc.
Talk Outline • Introduction • The Problem • Throughput Architectures • Dissertation Goals • The Solution • Modeling Throughput Performance • Cache Performance • The Valley • Architectural Enhancements • Thread Throttling • Cache Policies (Indexing, replacement) • Methodology (Applications, Simulation) • Proposed Work
Benchmarks • Initially studied regular HPC kernels/applications in a CMP environment • Dense matrix multiply • Fast Fourier transform • HOMME weather simulation • Added CUDA throughput benchmarks • Parboil: old-school MPI style, coarse grained • Rodinia: fine grained, varied • Benchmarks typical of historical GPGPU applications • Will add irregular benchmarks • Sparse matrix multiply, adaptive finite elements, photon mapping
Preliminary Results • Most of the benchmarks should benefit: • Small working sets • Concentrated working sets • Hit rate curves easy to predict
Hybrid Simulator Design • Goals: fast simulation; overcome compiler issues for a reasonable base case • Can simulate a different architecture than the one traced • Pipeline: C++/CUDA source → NVCC → PTX intermediate assembly listing → modified Ocelot functional simulator with a custom trace module → dynamic trace blocks with attachment points → compressed trace data → custom simulator
Talk Outline • Introduction • The Problem • Throughput Architectures • Dissertation Goals • The Solution • Modeling Throughput Performance • Cache Performance • The Valley • Architectural Enhancements • Thread Throttling • Cache Policies (Indexing, replacement) • Methodology (Applications, Simulation) • Proposed Work
Phase 1 – HPC Applications • Looked at GEMM, FFT & HOMME in a CMP setting • Learned implementation algorithms and alternative algorithms • Expertise allows for credible throughput analysis • Valuable lessons in multithreading and caching • Dense matrix multiply • Blocking to maximize arithmetic intensity • Enough contexts to cover latency • Fast Fourier transform • Pathologically hard on the memory system • Communication & synchronization • HOMME – weather modeling • Intra-chip scaling incredibly difficult • Memory system performance variation • Replacing data movement with computation • First-author publications: PPoPP 2008, ISPASS 2011 (Best Paper)
Phase 2 – Benchmark Characterization • Memory Access Characteristics of Rodinia and Parboil benchmarks • Apply Mathematical Analysis • Validate model • Find optimum operating points for benchmarks • Find optimum TA topology for benchmarks • NEARLY COMPLETE
Phase 3 – Evaluate Enhancements • Automatic Thread Throttling • Low latency hierarchical cache • Benefits of odd-sets/odd-banking • Benefits of explicit placement (Priority/Epoch) • NEED FINAL EVALUATION and explicit placement study
Final Phase – Extend Domain • Study regular HPC applications in throughput setting • Add at least two irregular benchmarks • Less likely to benefit from caching • New opportunities for enhancement • Explore impact of future TA topologies • Memory Cubes, TSV DRAM, etc.
Conclusion • Dissertation Goals: • Quantify the degree to which single-thread performance affects throughput performance for an important class of applications • Improve parallel efficiency through thread scheduling, cache topology, and cache policies • Feasibility • Regular benchmarks show promising memory behavior • Cycle-accurate simulator nearly completed
Proposed Timeline • Phase 1 – HPC applications – COMPLETED • Phase 2 – Mathematical model & benchmark characterization – MAY-JUNE • Phase 3 – Architectural enhancements – JULY-AUGUST • Phase 4 – Domain enhancement / new features – SEPTEMBER-NOVEMBER