

Energy-Efficient GPGPU Architectures via Collaborative Compilation and Memristive Memory-Based Computing. Abbas Rahimi†, A. Ghofrani‡, M. A. Montano‡, K.-T. Cheng‡, L. Benini*, R. K. Gupta†. †UCSD, ‡UCSB, *ETHZ, *UNIBO. Micrel.deis.unibo.it/MultiTherman. Variability.org.





Presentation Transcript


1. Energy-Efficient GPGPU Architectures via Collaborative Compilation and Memristive Memory-Based Computing. Abbas Rahimi†, A. Ghofrani‡, M. A. Montano‡, K.-T. Cheng‡, L. Benini*, R. K. Gupta†. †UCSD, ‡UCSB, *ETHZ, *UNIBO. Micrel.deis.unibo.it/MultiTherman. Variability.org.

2. Energy-Efficient GPGPU
• ✓ SIMD, but ✗ conservative guardbands cause a loss of operational efficiency: total delay = corner delay + 3σ stochastic delay guardband [Kakoee et al., TCAS-II'12] (see the worked form below).
• Thousands of deep and wide pipelines make GPGPUs high-power parts.
• Near-threshold (NT) operation and voltage overscaling (VOS) achieve energy efficiency at the cost of performance loss and increased timing sensitivity in the presence of variations.
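For reference, the guardband relation on this slide in symbols (my notation, not the authors'): the clock period must cover the worst-case corner delay plus a 3-sigma margin for stochastic delay variation, and eliminating that margin is exactly what exposes the timing errors discussed on the next slide.

\[
  T_{\mathrm{clk}} \;\ge\; T_{\mathrm{corner}} + 3\,\sigma_{\mathrm{stochastic}}
\]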

3. Variability is about Cost and Scale
• Eliminating the guardband → timing errors [Bowman et al., JSSC'09].
• Wide lanes: the error rate scales with the wider SIMD width, making error recovery costly.
• Deep pipes: recovery cycles increase linearly with pipeline length, so recovery becomes quadratically expensive.

4. Taxonomy of SIMD Variability-Tolerance (adaptive guardband elimination)
• Predict & prevent (no timing error): hierarchically focused guardbanding and uniform instruction assignment [Rahimi et al., DATE'13; Rahimi et al., DAC'13].
• Detect-then-correct (timing error):
  • Error recovery (exact computing), with independent recovery per lane: lane decoupling through private queues [Pawlowski et al., ISSCC'12; Krimer et al., ISCA'12].
  • Memoization (exact / approximate computing): recalling a recent context of error-free execution, exactly or approximately [Rahimi et al., TCAS'13; Rahimi et al., DATE'14].

5. Contributions: efficient spatiotemporal reuse of computation in GPGPUs through collaborative
• Micro-architectural design: an associative memristive memory (AMM) module is integrated with each FPU, representing part of its functionality.
• Compiler profiling: fine-grained partitioning of values (searching the space of possible inputs) and pre-storing the most frequent value sets in the AMM modules (a profiling sketch follows this slide).
• Together these ensure resiliency under voltage overscaling for Evergreen GPGPUs.
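A minimal sketch of the profiling idea above, written here as plain C (the function and structure names are mine, not from the paper): replay the kernel's floating-point operations on training inputs, count how often each operand set recurs per operation, and keep the most frequent sets as AMM contents.

    #include <stdio.h>

    #define TABLE_SIZE 4096   /* candidate entries tracked during profiling   */
    #define AMM_ENTRIES 32    /* entries that fit in one 32-entry AMM module  */

    /* One candidate entry: operand pair, pre-computed result, frequency. */
    typedef struct {
        float a, b, result;
        unsigned long count;
    } amm_candidate_t;

    static amm_candidate_t table[TABLE_SIZE];
    static int table_used = 0;

    /* Record one executed FP operation (here: multiplication) on training data.
       Exact bit-equality of operands is intentional: the AMM reuses exact values. */
    static void profile_fmul(float a, float b)
    {
        for (int i = 0; i < table_used; i++) {
            if (table[i].a == a && table[i].b == b) {
                table[i].count++;
                return;
            }
        }
        if (table_used < TABLE_SIZE) {
            table[table_used].a = a;
            table[table_used].b = b;
            table[table_used].result = a * b;   /* result to pre-store in the AMM */
            table[table_used].count = 1;
            table_used++;
        }
    }

    /* Select the AMM_ENTRIES most frequent operand sets (simple repeated selection). */
    static int select_amm_contents(amm_candidate_t out[AMM_ENTRIES])
    {
        int n = table_used < AMM_ENTRIES ? table_used : AMM_ENTRIES;
        for (int k = 0; k < n; k++) {
            int best = -1;
            for (int i = 0; i < table_used; i++)
                if (table[i].count > 0 && (best < 0 || table[i].count > table[best].count))
                    best = i;
            out[k] = table[best];
            table[best].count = 0;   /* mark as already taken */
        }
        return n;
    }

The profiler would call profile_fmul() for every multiplication observed on the training datasets, then call select_amm_contents() once, offline, to produce the table that is later programmed into the AMM module.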

6. Collaborative compilation framework and memristive memory-based computing
• 1) Profiling: the profiler runs the OpenCL kernel on training datasets to identify highly frequent computations (a one-off activity).
• 2) Code generation: a customized clCreateBuffer inserts the AMM contents, so the AMM modules are programmed before launching the kernel (a host-side sketch follows this slide).
• 3) Runtime: each FPU executes alongside its AMM, and the associative match (=?) decides whether the pre-stored result is returned.
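A hedged host-side sketch of step 2 using standard OpenCL C API calls. The slide mentions a customized clCreateBuffer whose details are not given here, so this sketch simply stands in with an ordinary buffer that carries the profiled AMM contents, bound at a hypothetical kernel argument index (AMM_ARG_INDEX); it is an illustration under those assumptions, not the authors' implementation.

    #include <CL/cl.h>

    #define AMM_ENTRIES   32   /* one 32-entry AMM module, as on the conclusion slide      */
    #define AMM_ARG_INDEX 3    /* hypothetical: position of the AMM table among kernel args */

    /* One pre-profiled AMM entry: operands and the pre-computed result. */
    typedef struct {
        cl_float a, b, result;
    } amm_entry_t;

    /* Upload the profiled AMM contents and bind them to the kernel before launch. */
    cl_int program_amm_before_launch(cl_context ctx,
                                     cl_kernel kernel,
                                     const amm_entry_t contents[AMM_ENTRIES])
    {
        cl_int err;

        /* Stand-in for the customized clCreateBuffer on the slide: a read-only
           buffer initialized from the host-side AMM table. */
        cl_mem amm_buf = clCreateBuffer(ctx,
                                        CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                        sizeof(amm_entry_t) * AMM_ENTRIES,
                                        (void *)contents,
                                        &err);
        if (err != CL_SUCCESS)
            return err;

        /* Make the AMM contents visible to the kernel; the usual
           clEnqueueNDRangeKernel launch would follow elsewhere in the host code. */
        err = clSetKernelArg(kernel, AMM_ARG_INDEX, sizeof(cl_mem), &amm_buf);
        if (err != CL_SUCCESS)
            clReleaseMemObject(amm_buf);
        return err;
    }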

7. AMM with FPU
• Operation: the operands are searched in the AMM; on a match the pre-stored result is returned, so an FPU timing error needs no recovery (error → no recovery → return pre-stored result). A behavioral sketch follows this slide.
• Ternary content-addressable memory (TCAM): a self-referenced sensing scheme†, 2-bit encoding, 15% positive slack at 45nm.
• Memory block: crossbar-based 1T-1R memristive memory, avoids read disturbance.
• AMM: software programmable, mimics partial functionality of the FPU, two pipelined stages.
†Li et al., JSSC'14
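A behavioral C sketch of the lookup path described above (my own simplified model; the real module is a two-stage pipelined TCAM plus a 1T-1R memristive block, not software): search the operands, return the pre-stored result on a hit, and fall back to the FPU on a miss.

    #include <stdbool.h>

    #define AMM_ENTRIES 32

    typedef struct {
        float a, b;      /* search key held in the TCAM                  */
        float result;    /* pre-computed result held in the memory block */
        bool  valid;
    } amm_entry_t;

    typedef struct {
        amm_entry_t entries[AMM_ENTRIES];
    } amm_module_t;

    /* Stage 1, behaviorally: TCAM search over the operand key.
       Returns the matching index or -1 on a miss. */
    static int amm_search(const amm_module_t *amm, float a, float b)
    {
        for (int i = 0; i < AMM_ENTRIES; i++)
            if (amm->entries[i].valid && amm->entries[i].a == a && amm->entries[i].b == b)
                return i;
        return -1;
    }

    /* FPU multiply coupled with the AMM: on a hit the pre-stored result is
       returned, so an FPU timing error under voltage overscaling is irrelevant;
       on a miss the regular (guardbanded or recovered) FPU result is used. */
    float fmul_with_amm(const amm_module_t *amm, float a, float b)
    {
        int idx = amm_search(amm, a, b);
        if (idx >= 0)
            return amm->entries[idx].result;   /* hit: no recovery needed */
        return a * b;                          /* miss: regular FPU path  */
    }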

8. OpenCL Sobel: AMM Hit Rates
• Offline (training): the profiler builds one table per FP operation, e.g. +: {a, b} → {q}, *: {a, b} → {q}, √: {a} → {q}, and so on.
• Programming: the tables are written into the per-operation AMMs (FPU+/AMM+, FPU*/AMM*, FPU√/AMM√, ...) before launching the kernel.
• Runtime: hit rates are measured on separate test inputs (test1 through test4); a small measurement sketch follows this slide.
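A small sketch of how the runtime hit rate on a test input could be tallied, reusing the amm_module_t type and amm_search() helper from the sketch after slide 7 (again my own illustration, not the authors' measurement code).

    /* Counts AMM hits over a test input stream for one FP operation.
       Assumes amm_module_t and amm_search() from the sketch after slide 7. */
    typedef struct {
        unsigned long hits;
        unsigned long total;
    } hit_stats_t;

    static void record_lookup(hit_stats_t *s, const amm_module_t *amm, float a, float b)
    {
        s->total++;
        if (amm_search(amm, a, b) >= 0)
            s->hits++;
    }

    /* Hit rate as reported per test set (test1..test4 on the slide). */
    static double hit_rate(const hit_stats_t *s)
    {
        return s->total ? (double)s->hits / (double)s->total : 0.0;
    }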

9. Efficiency under Voltage Overscaling
[Per-kernel energy-saving chart omitted in this transcript; only unlabeled percentage values survive.]
• The AMMs reduce timing errors from 38% to 24%.
• At 1.0V, without any timing error, energy saving averages 36% across 7 kernels.
• At 0.88V, energy saving averages 39%.

10. Conclusion
• Static compiler analysis and coordinated micro-architectural design enable efficient reuse of computations in GPGPUs.
• Emerging associative memristive modules are coupled with the FPUs for fast spatial and temporal reuse.
• GPGPU kernels exhibit low entropy, yielding an average energy saving of 36% with 32-entry AMMs.
