Designing On-chip Memory Systems for Throughput Architectures Ph.D. Proposal Jeff Diamond Advisor: Stephen Keckler
Turning to Heterogeneous Chips “We’ll be seeing a lot more than 2-4 cores per chip really quickly” – Bill Mark, 2005 AMD Trinity • NVIDIA Tegra 3 • Intel Ivy Bridge
Talk Outline • Introduction • The Problem • Throughput Architectures • Dissertation Goals • The Solution • Modeling Throughput Performance • Architectural Enhancements • Thread Scheduling • Cache Policies • Methodology • Proposed Work
Throughput Architectures (TA) • Key Features: • Break a single application into many threads • Use explicit parallelism • Optimize hardware for performance density, not single-thread performance • Benefits: • Drop voltage and peak frequency • quadratic improvement in power efficiency (derivation below) • Cores smaller, more energy efficient • Further amortized through multithreading and SIMD • Less need for out-of-order execution, register renaming, branch prediction, fast synchronization, low-latency ALUs
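As a reminder of where the quadratic claim comes from, the standard dynamic-power scaling argument (a generic sketch, not specific figures from the proposal):

```latex
% Dynamic power: P = \alpha C V^2 f, and attainable frequency f scales
% roughly with voltage V. Replace one fast core with k slower cores,
% each at voltage V/k and frequency f/k (total throughput unchanged):
P_{total} = k \cdot \alpha C \left(\frac{V}{k}\right)^{2} \frac{f}{k}
          = \frac{\alpha C V^{2} f}{k^{2}}
% Same throughput at 1/k^2 the power: a quadratic gain in efficiency.
```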
Scope – Highly Threaded TA • Architecture Continuum: • Multithreading • Large number of threads masks long latency • Small amount of cache, primarily for bandwidth • Caching • Large amounts of cache to reduce latency • Small number of threads • Can we get the benefits of both? POWER7: 4 threads/core, ~1MB/thread • SPARC T4: 8 threads/core, ~80KB/thread • GTX 580: 48 threads/core, ~2KB/thread
Problem - Technology Mismatch • Computation is cheap, data movement is expensive • A hit in the L1 cache costs 2.5x the power of a 64-bit FMADD • Moving across the chip, 50x • Fetching from DRAM, 320x • Limited off-chip bandwidth • Exponential growth in cores saturates BW, capping performance • DRAM latency is currently hundreds of cycles • Hundreds of threads per core must be in flight to cover DRAM latency
The Downward Spiral • Little’s Law: threads needed are proportional to average latency (formula below) • Opportunity cost in on-chip resources • Thread contexts • In-flight memory accesses • Too many threads – negative feedback • Adding threads to cover latency increases latency • Slower register access, thread scheduling • Reduced locality • Reduces bandwidth and DRAM efficiency • Reduces effectiveness of caching • Parallel starvation
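Little’s Law as used here, with symbols matching the deck’s later notation (the numeric example is illustrative):

```latex
N_T = \lambda \cdot L_{AVG}
% N_T: requests (threads) that must be in flight, \lambda: issue rate,
% L_{AVG}: average latency. E.g., sustaining 1 memory request/cycle
% against a 400-cycle DRAM latency requires ~400 threads in flight;
% anything that raises L_{AVG} raises the thread count needed, which
% is the feedback loop driving the downward spiral.
```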
Talk Outline • Introduction • The Problem • Throughput Architectures • Dissertation Goals • The Solution • Modeling Throughput Performance • Architectural Enhancements • Thread Scheduling • Cache Policies • Methodology • Proposed Work
Goal: Increase Parallel Efficiency • Problem: too many threads! • Increase parallel efficiency, i.e. reduce the number of threads needed for a given level of performance • Improves throughput performance • Apply low-latency caches • Leverage the upward spiral • Difficult to mix multithreading and caching • Caches typically used just for bandwidth amplification • Important ancillary factors • Thread scheduling • Instruction scheduling (per-thread parallelism)
Contributions • Quantifying the impact of single-thread performance on throughput performance • Developing a mathematical analysis of throughput performance • Building a novel hybrid trace-based simulation infrastructure • Demonstrating unique architectural enhancements in thread scheduling and cache policies
Talk Outline • Introduction • The Problem • Throughput Architectures • Dissertation Goals • The Solution • Modeling Throughput Performance • Cache Performance • The Valley • Architectural Enhancements • Thread Throttling • Cache Policies • Methodology • Proposed Work
Mathematical Analysis • Why take a mathematical approach? • Be very precise about what we want to optimize • Understand the relationships and sensitivities to throughput performance • Single thread performance • Cache improvements • Application characteristics • Rapid evaluation of design space • Suggest most fruitful architectural improvements
Modeling Throughput Performance • P_CHIP = N_T × P_ST • P_CHIP = total chip throughput • P_ST = single-thread performance, proportional to 1/L_AVG • N_T = total active threads • L_AVG = average instruction latency • Power_CHIP = E_AVG (Joules/instruction) × P_CHIP • How can caches help throughput performance?
Cache As A Performance Unit • Area comparison: one FPU = 2-11KB SRAM or 8-40KB eDRAM • FMADD: active power 20pJ/op; leakage power ~1 W/mm² • SRAM: active power 50pJ/L1 access, 1.1nJ/L2 access; leakage power ~70 mW/mm² • Caches make loads 150x faster and 300x more energy efficient • Caches use 10-15x less power/mm² than FPUs • Leakage power comparison: one FPU = ~64KB SRAM / 256KB eDRAM • Key: how much cache does a thread need?
Performance From Caching • Ignore changes to DRAM latency & off-chip BW (we will simulate these) • Assume ideal caches (frequency) • What is the maximum performance benefit? (model below) • A = arithmetic intensity of the application (fraction of non-memory instructions) • M = 1 − A = memory intensity • N_T = total active threads on chip • L = memory latency • For power, replace L with E, the average energy per instruction: qualitatively identical, but the differences are more dramatic
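A reconstruction of the model these definitions suggest, in the style of the Guz et al. analysis cited later (the exact form on the original slide is assumed):

```latex
% Average latency per instruction with hit rate H, hit latency L_{hit},
% and memory latency L (ALU ops taken as unit latency):
L_{AVG} = A \cdot 1 + M \left( H \cdot L_{hit} + (1 - H) \cdot L \right)
% Chip throughput until bandwidth saturates:
P_{CHIP} = N_T \cdot P_{ST} = \frac{N_T}{L_{AVG}}
% Maximum caching benefit: as H \to 1 the memory term collapses from
% M \cdot L to M \cdot L_{hit}. For power, substitute energies E for
% latencies L, as noted above.
```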
Ideal Cache = Frequency Cache • Hit rate depends on amount of cache, application working set • Store items used the most times • This is the concept of “frequency” • Once we know an application’s memory access characteristics, we can model throughput performance
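A minimal sketch of the “frequency” ideal: replay a trace, keep the c most-referenced lines, and count the share of accesses they cover. The function name, trace format, and 64-byte line size are illustrative assumptions:

```cpp
#include <algorithm>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Ideal "frequency" hit rate H(c): a cache of capacity c lines that
// magically retains exactly the c most-referenced lines in the trace.
double frequencyHitRate(const std::vector<uint64_t>& trace, size_t c) {
    if (trace.empty()) return 0.0;
    std::unordered_map<uint64_t, uint64_t> count;
    for (uint64_t addr : trace) ++count[addr >> 6];   // 64B line granularity
    std::vector<uint64_t> freq;
    freq.reserve(count.size());
    for (const auto& kv : count) freq.push_back(kv.second);
    std::sort(freq.rbegin(), freq.rend());            // most-referenced first
    uint64_t hits = 0;
    for (size_t i = 0; i < c && i < freq.size(); ++i) hits += freq[i];
    return double(hits) / double(trace.size());       // upper bound on H(c)
}
```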
Modeling Cache Performance • [Figure: frequency curve F(c), hit rate H(c), and single-thread performance P_ST(c) as functions of cache size c]
Cache Performance Per Thread • [Figure: single-thread performance P_ST(t) as a function of cache per thread is a steep reciprocal curve]
Talk Outline • Introduction • The Problem • Throughput Architectures • Dissertation Goals • The Solution • Modeling Throughput Performance • Cache Performance • “The Valley” • Architectural Enhancements • Thread Throttling • Cache Policies • Methodology • Proposed Work
“The Valley” in Cache Space • [Figure: performance across the cache-allocation space; high-performance (flat access) points marked X]
“The Valley” In Thread Space • [Figure: throughput vs. thread count, showing the cache regime, the valley, and the MT regime; the valley width spans the “cache” and “no cache” operating points]
Prior Work • Hong et al, 2009, 2010 • Simple, cacheless GPU models • Used to predict “MT peak” • Guz et al, 2008, 2010 • Graphed throughput performance with assumed cache profile • Identified “valley” structure • Validated against PARSEC benchmarks • No mathematical analysis • Minimal focus on bandwidth limited regime • CMP benchmarks • Galal et al, 2011 • Excellent mathematical analysis • Focused on FPU+Register design
“The Valley” In Thread Space • [Figure repeated: cache regime, valley, and MT regime; valley width between the “cache” and “no cache” operating points]
Talk Outline • Introduction • The Problem • Throughput Architectures • Dissertation Goals • The Solution • Modeling Throughput Performance • Cache Performance • The Valley • Architectural Enhancements • Thread Throttling • Cache Policies • Methodology • Proposed Work
Contribution: Thread Throttling • Have real-time information: • Arithmetic intensity • Bandwidth utilization • Current hit rate • Conservatively approximate locality • Approximate the optimum operating points • Shut down / activate threads to increase performance • Concentrate power and overclock • Clock off unused cache if no benefit
Prior Work • Many studies in the CMP and GPU areas scale back threads • CMP: miss rates get too high • GPU: off-chip bandwidth is saturated • Those targets are simple to hit and unidirectional • The valley is much more complex • Two points to hit • Three different operating regimes • Mathematical analysis lets us approximate both points with as little as two samples • Both off-chip bandwidth and the reciprocal of hit rate are nearly linear for a wide range of applications (sketch below)
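One way the two-sample idea could look in a runtime controller; a hedged sketch assuming, as above, that bandwidth demand and the reciprocal of hit rate are roughly linear in thread count. Names and thresholds are illustrative, not the proposal’s actual mechanism:

```cpp
#include <utility>

struct Sample { double t, hit, bw; };  // thread count, hit rate, BW demand

// Fit y = a + b*t through two samples.
static std::pair<double, double> fitLine(double t1, double y1,
                                         double t2, double y2) {
    double b = (y2 - y1) / (t2 - t1);
    return { y1 - b * t1, b };
}

// MT-regime edge: smallest t whose extrapolated bandwidth demand
// saturates the off-chip limit BWmax.
double mtEdge(const Sample& s1, const Sample& s2, double BWmax) {
    auto [a, b] = fitLine(s1.t, s1.bw, s2.t, s2.bw);
    return (BWmax - a) / b;
}

// Cache-regime edge: t where the extrapolated hit rate drops below a
// target, using the near-linearity of 1/H(t) in t.
double cacheEdge(const Sample& s1, const Sample& s2, double Htarget) {
    auto [a, b] = fitLine(s1.t, 1.0 / s1.hit, s2.t, 1.0 / s2.hit);
    return (1.0 / Htarget - a) / b;
}
```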
Finding Optimal Points • [Figure: the valley curve annotated with the two target operating points, at the cache-regime and MT-regime edges]
Talk Outline • Introduction • The Problem • Throughput Architectures • Dissertation Goals • The Solution • Modeling Throughput Performance • Cache Performance • The Valley • Architectural Enhancements • Thread Throttling • Cache Policies (Indexing, replacement) • Methodology • Proposed Work
From Mathematical Analysis: • Need to work like an LFU cache • Hard to implement in practice • Still very little cache per thread • Policies make big differences for small caches • Associativity is a big issue for small caches • Cannot cache every line referenced • Beyond “dead line” prediction • Stream lines with lower reuse
Contribution – Odd-Set Indexing • Conflict misses are a pathological issue • Most often seen with power-of-2 strides • Idea: map to 2^N − 1 sets/banks instead • A true “silver bullet”: virtually eliminates conflict misses in every setting we’ve tried • Reduced scratchpad banks from 32 to 7 at the same level of bank conflicts • Fastest, most efficient implementation • Adds just a few gate delays (sketch below) • Logic area < 4% of a 32-bit integer multiplier • Can still access the last bank
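The few-gate-delay claim is consistent with the standard folding identity 2^k ≡ 1 (mod 2^k − 1), which lets hardware sum the k-bit digit groups of an address instead of dividing; a software sketch of that indexing (function name is illustrative):

```cpp
#include <cstdint>

// Set index = addr mod (2^k - 1). Because 2^k ≡ 1 (mod 2^k - 1), the
// k-bit digit groups of the address can simply be summed and the one
// wrap-around case corrected. In hardware this is a small adder tree:
// a few gate delays, no divider, and all 2^k - 1 banks, including the
// last, remain addressable.
uint32_t oddSetIndex(uint32_t addr, unsigned k) {
    const uint32_t m = (1u << k) - 1;   // 2^k - 1 sets/banks
    uint32_t x = addr;
    while (x > m)
        x = (x >> k) + (x & m);         // fold high digits into low ones
    return (x == m) ? 0 : x;            // m itself is ≡ 0 (mod m)
}
```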
More Preliminary Results • [Figure: PARSEC L2 hit rates with 64 threads; odd-set indexing with 1 bank vs. direct mapped with 1 bank, compared against fully associative]
Prior Work • A prime number of banks/sets was long thought ideal, but had no efficient implementation • Mersenne primes are not so convenient, and 2^N − 1 is not always prime: 3, 7, 15, 31, 63, 127, 255 • We demonstrated primality wasn’t the issue • Yang, ISCA ’92: prime strides for vector computers • Showed 3x speedup • We get the correct offset for free • Kharbutli, HPCA ’04: showed prime-set hash functions for caches work well • Our implementation is faster, with more features • Benefits were hard to see on SPEC
(Re)placement Policies • Not all data should be cached • Recent papers on LLC caches • Hard-drive cache algorithms • Frequency over recency • Frequency is hard to implement • ARC is a good compromise • With direct mapping, replacement dominates • Look for explicit approaches (sketch below) • Priority classes • Epochs
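A minimal sketch of what explicit placement could look like; the two-bit priority encoding and the insertion rule are illustrative assumptions, not the proposal’s final design:

```cpp
#include <cstdint>

// Software tags each memory instruction with a coarse reuse class;
// the cache inserts a fill only if its class beats the victim's,
// otherwise the line streams around the cache uncached.
enum Priority : uint8_t { STREAM = 0, LOW = 1, HIGH = 2, PINNED = 3 };

struct Line {
    uint64_t tag;
    Priority prio;
    bool     valid;
};

bool shouldInsert(const Line& victim, Priority incoming) {
    if (incoming == STREAM) return false;   // never cache streaming data
    if (!victim.valid)      return true;    // free way: always insert
    return incoming > victim.prio;          // evict only lower classes
}
```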
Prior Work • Belady: solved it all • Three hierarchies of methods • Best utilized information about prior line usage • Light on implementation details • Approximations • Hallnor & Reinhardt, ISCA 2000: generational replacement • Megiddo & Modha, USENIX FAST 2003: ARC cache • Ghost entries • Recency and frequency groups • Qureshi, 2006, 2007: adaptive insertion policies • Multi-queue, LRU-K, D-NUCA, etc.
Talk Outline • Introduction • The Problem • Throughput Architectures • Dissertation Goals • The Solution • Modeling Throughput Performance • Cache Performance • The Valley • Architectural Enhancements • Thread Throttling • Cache Policies (Indexing, replacement) • Methodology (Applications, Simulation) • Proposed Work
Benchmarks • Initially studied regular HPC kernels/applications in a CMP environment • Dense matrix multiply • Fast Fourier transform • HOMME weather simulation • Added CUDA throughput benchmarks • Parboil: old-school MPI style, coarse grained • Rodinia: fine grained, varied • Benchmarks typical of historical GPGPU applications • Will add irregular benchmarks • Sparse matrix multiply, adaptive finite elements, photon mapping
Preliminary Results • Most of the benchmarks should benefit: • Small working sets • Concentrated working sets • Hit rate curves easy to predict
Hybrid Simulator Design • Goals: fast simulation; overcome compiler issues for a reasonable base case • Can simulate a different architecture than the one traced • Pipeline: C++/CUDA source → NVCC → PTX intermediate assembly listing → modified Ocelot functional simulator with a custom trace module → dynamic trace blocks with attachment points → compressed trace data → custom simulator
Talk Outline • Introduction • The Problem • Throughput Architectures • Dissertation Goals • The Solution • Modeling Throughput Performance • Cache Performance • The Valley • Architectural Enhancements • Thread Throttling • Cache Policies (Indexing, replacement) • Methodology (Applications, Simulation) • Proposed Work
Phase 1 – HPC Applications • Looked at GEMM, FFT & HOMME in a CMP setting • Learned implementation algorithms and alternative algorithms • Expertise allows for credible throughput analysis • Valuable lessons in multithreading and caching • Dense matrix multiply • Blocking to maximize arithmetic intensity • Enough contexts to cover latency • Fast Fourier transform • Pathologically hard on the memory system • Communication & synchronization • HOMME – weather modeling • Intra-chip scaling incredibly difficult • Memory system performance variation • Replacing data movement with computation • First-author publications: PPoPP 2008, ISPASS 2011 (Best Paper)
Phase 2 – Benchmark Characterization • Memory Access Characteristics of Rodinia and Parboil benchmarks • Apply Mathematical Analysis • Validate model • Find optimum operating points for benchmarks • Find optimum TA topology for benchmarks • NEARLY COMPLETE
Phase 3 – Evaluate Enhancements • Automatic Thread Throttling • Low latency hierarchical cache • Benefits of odd-sets/odd-banking • Benefits of explicit placement (Priority/Epoch) • NEED FINAL EVALUATION and explicit placement study
Final Phase – Extend Domain • Study regular HPC applications in throughput setting • Add at least two irregular benchmarks • Less likely to benefit from caching • New opportunities for enhancement • Explore impact of future TA topologies • Memory Cubes, TSV DRAM, etc.
Conclusion • Dissertation Goals: • Quantify the degree to which single-thread performance affects throughput performance for an important class of applications • Improve parallel efficiency through thread scheduling, cache topology, and cache policies • Feasibility • Regular benchmarks show promising memory behavior • Cycle-accurate simulator nearly completed
Proposed Timeline • Phase 1 – HPC applications – COMPLETED • Phase 2 – Mathematical model & benchmark characterization – MAY-JUNE • Phase 3 – Architectural enhancements – JULY-AUGUST • Phase 4 – Domain enhancement / new features – SEPTEMBER-NOVEMBER