Liquid SIMD: Dynamic Mapping for Efficient SIMD Hardware Utilization
Liquid SIMD is an approach for abstracting SIMD accelerators: accelerated computation is expressed in the scalar ISA and dynamically mapped onto whatever SIMD hardware is present, rather than being statically bound to one accelerator in the binary. This removes the forward/backward-compatibility problem of static SIMD control, keeps overhead low, and preserves computational efficiency within low-power envelopes. By virtualizing SIMD operations through a scalar-ISA representation, Liquid SIMD lets a single binary migrate seamlessly across diverse SIMD implementations.
Presentation Transcript
Liquid SIMD: Abstracting SIMD Hardware Using Lightweight Dynamic Mapping • Nathan Clark, Amir Hormati, Scott Mahlke (University of Michigan) • Sami Yehia, Krisztián Flautner (ARM Ltd.)
Computational Efficiency • Low power envelope • More useful work per transistor • Hardware accelerators, e.g. the Niagara II encryption engine (Source: AMD Analyst Day 12/14/06)
How Are Accelerators Used? • Control statically placed in the binary (diagram: program routing work between CPU and accelerator)
Problem With Static Control • Not forward/backward compatible (diagram: the same binary targeting CPUs with different accelerators, or none at all)
Solution: Virtualization • Statically identify accelerated computation • Abstract accelerator features • Dynamically retarget the binary (diagram: engineer/compiler produces one program; a per-processor translator retargets it to each processor's accelerator)
Liquid SIMD • Virtualize SIMD accelerators • Why virtualize SIMD? • Intel MMX to SSE2 • ARM v6 to Neon • Wide vectors useful [Lin 06]
SIMD Accelerator Assumptions • Same instruction stream • Separate pipeline with its own memory interface (diagram: Fetch → Decode → Scalar Exec / SIMD Exec → Retire)
How to Virtualize • Use the scalar ISA to represent SIMD operations • Compatibility, low overhead • Key: easy to translate
Virtualization Architecture (diagram: Fetch → Decode → Execute → Retire pipeline, with a translator feeding a uCode cache and the accelerator)
1. Data Parallel Operations: for(i = 0; i < 8; i++) { r1 = A[i]; r2 = B[i]; r3 = r1 + r2; r4 = r3 & constant; C[i] = r4; } (diagram: elementwise A + B, then AND with a constant, producing C)
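The scalar loop on this slide can be written out as concrete C; it runs correctly on any CPU without SIMD hardware, while its shape is what the translator recognizes and maps onto vector lanes. A minimal sketch (the array length 8 and the 0xFF mask follow the slide; the function name and types are illustrative):

```c
#include <assert.h>
#include <stdint.h>

enum { N = 8 };

/* Scalar form of the data-parallel loop: each iteration is independent,
 * so a translator can map the loads, add, and mask onto SIMD lanes. */
void add_and_mask(const int32_t *A, const int32_t *B, int32_t *C)
{
    for (int i = 0; i < N; i++) {
        int32_t r1 = A[i];
        int32_t r2 = B[i];
        int32_t r3 = r1 + r2;
        int32_t r4 = r3 & 0xFF;   /* "constant" from the slide */
        C[i] = r4;
    }
}
```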
1a. What If There's No Scalar Equivalent? for(i = 0; i < 8; i++) { r1 = A[i]; r2 = B[i]; r3 = r1 + r2; cmp r3, #FF; r3 = movgt #FF; ... } • Idioms can always be constructed (diagram: saturating add, SADD, of A and B)
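The compare-and-conditional-move idiom on this slide corresponds to an ordinary saturating add in C. A hedged sketch of what the scalar idiom computes (saturation at 0xFF follows the slide; the function name is illustrative):

```c
#include <assert.h>
#include <stdint.h>

/* Scalar idiom for a saturating add (SADD) that has no single scalar
 * instruction: add, then compare and conditionally clamp. A translator
 * can recognize this three-instruction pattern as one SIMD SADD. */
int32_t sadd_u8(int32_t r1, int32_t r2)
{
    int32_t r3 = r1 + r2;
    if (r3 > 0xFF)      /* cmp r3, #FF  */
        r3 = 0xFF;      /* movgt #FF    */
    return r3;
}
```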
2. Scalarizing Permutations: for(i = 0; i < 8; i++) { … r1 = r2 + r3; tmp[i] = r1; } followed by for(i = 0; i < 8; i++) { r1 = offset[i]; r2 = tmp[r1 + i]; r3 = r2 & const; … } with offset = {4, 4, 4, 4, -4, -4, -4, -4}
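The offset table on this slide encodes the permutation: element i is read from tmp[i + offset[i]], and {4, 4, 4, 4, -4, -4, -4, -4} swaps the two halves of an 8-element vector. A sketch of the scalarized permutation pattern (function and variable names are illustrative):

```c
#include <assert.h>
#include <stdint.h>

enum { VLEN = 8 };

/* Scalarized permutation: results are staged through a tmp array, then
 * read back through a constant offset table. The translator can
 * recognize this producer/consumer pair as a single SIMD shuffle. */
void swap_halves(const int32_t *in, int32_t *out)
{
    static const int offset[VLEN] = {4, 4, 4, 4, -4, -4, -4, -4};
    int32_t tmp[VLEN];

    for (int i = 0; i < VLEN; i++)
        tmp[i] = in[i];              /* producer loop writes tmp */

    for (int i = 0; i < VLEN; i++)
        out[i] = tmp[i + offset[i]]; /* consumer reads via offsets */
}
```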
3. Scalarizing Reductions: for(i = 0; i < 8; i++) { … r1 = A[i]; r2 = r2 + r1; … }
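The loop-carried accumulator on this slide (r2 = r2 + r1) is the signature the translator looks for in a reduction. A minimal C sketch (names and the sum operation are illustrative; the same pattern applies to other reductions):

```c
#include <assert.h>
#include <stdint.h>

enum { RN = 8 };

/* Scalarized reduction: the loop-carried dependence on r2 marks this
 * as a sum reduction, which a translator can map onto SIMD partial
 * sums plus a horizontal add. */
int32_t sum8(const int32_t *A)
{
    int32_t r2 = 0;
    for (int i = 0; i < RN; i++) {
        int32_t r1 = A[i];
        r2 = r2 + r1;   /* accumulator carried across iterations */
    }
    return r2;
}
```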
Applied to ARM Neon • All instructions supported except: • VTBL, i.e. indirect indexing (v1 = vtbl v2, v3) • Interleaved memory accesses • Neither needed in the evaluated benchmarks
Translation to SIMD • Update the induction variable (i++ becomes i += 4) • Apply the inverse of the defined translation rules. Scalar binary: for(i = 0; i < 8; i++) { r1 = A[i]; r2 = B[i]; r3 = r1 + r2; r4 = offset[i]; C[i + r4] = r3; } translates to SIMD: for(i = 0; i < 8; i += 4) { v1 = A[i]; v2 = B[i]; v3 = v1 + v2; v3 = shuffle v3; C[i] = v3; }
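The induction-variable rewrite can be sketched in portable C by modeling each vector operation as a 4-lane inner loop. This is a lane-by-lane illustration only, not real SIMD code generation, and it omits the shuffle step; the vector width 4 follows the slide, and all names are illustrative:

```c
#include <assert.h>
#include <stdint.h>

enum { W = 4, LEN = 8 };

/* Translated loop shape: the induction variable now advances by the
 * vector width, and each former scalar op works on W lanes at once.
 * Real Liquid SIMD translation emits actual SIMD instructions here. */
void translated(const int32_t *A, const int32_t *B, int32_t *C)
{
    for (int i = 0; i < LEN; i += W) {      /* i += 4, not i++ */
        int32_t v1[W], v2[W], v3[W];
        for (int l = 0; l < W; l++) v1[l] = A[i + l];      /* v1 = A[i] */
        for (int l = 0; l < W; l++) v2[l] = B[i + l];      /* v2 = B[i] */
        for (int l = 0; l < W; l++) v3[l] = v1[l] + v2[l]; /* v3 = v1+v2 */
        for (int l = 0; l < W; l++) C[i + l] = v3[l];      /* C[i] = v3 */
    }
}
```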
Translator Design • Goals: efficiency, speed, flexibility (diagram: one program; per-processor translators retarget it to each processor's accelerator)
Evaluation • Trimaran ARM compiler • Hand-SIMDized loops • SimpleScalar model of an ARM926 with Neon SIMD • Translator implemented in VHDL, 130nm standard cells
Liquid SIMD Issues • Code bloat: <1% overhead beyond baseline • Register pressure: not a problem • Translator cost: 0.2 mm² + 2 KB cache • Translation overhead
Translation Overhead (chart: overhead across MediaBench, kernels, and SPECfp benchmarks)
Summary • Accelerators are increasingly common and evolving • Binary migration is costly • SIMD virtualization using the scalar ISA • One binary: forward/backward compatibility • Negligible overhead
Questions?