Liquid SIMD: Dynamic Mapping for Efficient SIMD Hardware Utilization
Liquid SIMD is an approach for abstracting SIMD accelerators: accelerated computation is expressed in the scalar ISA and dynamically mapped onto whatever SIMD hardware is present, rather than being statically bound to one accelerator in the binary. This removes the forward/backward-compatibility problem of static SIMD control, keeps overhead low, and preserves computational efficiency within low-power envelopes. By virtualizing SIMD operations through a scalar-ISA representation, Liquid SIMD lets a single binary migrate seamlessly across diverse SIMD implementations.
Presentation Transcript
Liquid SIMD: Abstracting SIMD Hardware Using Lightweight Dynamic Mapping • Nathan Clark, Amir Hormati, Scott Mahlke (University of Michigan) • Sami Yehia, Krisztián Flautner (ARM Ltd.)
Computational Efficiency • Low power envelope • More useful work per transistor • Hardware accelerators, e.g. the Niagara II encryption engine (Source: AMD Analyst Day 12/14/06)
How Are Accelerators Used? • Control statically placed in the binary (diagram: program routing work between CPU and accelerator)
Problem With Static Control • Not forward/backward compatible (diagram: the same binary targeting CPUs with different accelerators, or none at all)
Solution: Virtualization • Statically identify accelerated computation • Abstract accelerator features • Dynamically retarget the binary (diagram: engineer/compiler produces one program; a per-processor translator retargets it to each processor's accelerator)
Liquid SIMD • Virtualize SIMD accelerators • Why virtualize SIMD? • Intel MMX to SSE2 • ARM v6 to Neon • Wide vectors useful [Lin 06]
SIMD Accelerator Assumptions • Same instruction stream • Separate pipeline with its own memory interface (diagram: Fetch → Decode → Scalar Exec / SIMD Exec → Retire)
How to Virtualize • Use the scalar ISA to represent SIMD operations • Compatibility, low overhead • Key: easy to translate
Virtualization Architecture (diagram: Fetch → Decode → Execute → Retire pipeline, with a translator feeding a uCode cache and the accelerator)
1. Data Parallel Operations: for(i = 0; i < 8; i++) { r1 = A[i]; r2 = B[i]; r3 = r1 + r2; r4 = r3 & constant; C[i] = r4; } (diagram: elementwise A + B, then AND with a constant, producing C)
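The scalar loop on this slide can be written out as concrete C; it runs correctly on any CPU without SIMD hardware, while its shape is what the translator recognizes and maps onto vector lanes. A minimal sketch (the array length 8 and the 0xFF mask follow the slide; the function name and types are illustrative):

```c
#include <assert.h>
#include <stdint.h>

enum { N = 8 };

/* Scalar form of the data-parallel loop: each iteration is independent,
 * so a translator can map the loads, add, and mask onto SIMD lanes. */
void add_and_mask(const int32_t *A, const int32_t *B, int32_t *C)
{
    for (int i = 0; i < N; i++) {
        int32_t r1 = A[i];
        int32_t r2 = B[i];
        int32_t r3 = r1 + r2;
        int32_t r4 = r3 & 0xFF;   /* "constant" from the slide */
        C[i] = r4;
    }
}
```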
1a. What If There's No Scalar Equivalent? for(i = 0; i < 8; i++) { r1 = A[i]; r2 = B[i]; r3 = r1 + r2; cmp r3, #FF; r3 = movgt #FF; ... } • Idioms can always be constructed (diagram: saturating add, SADD, of A and B)
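The compare-and-conditional-move idiom on this slide corresponds to an ordinary saturating add in C. A hedged sketch of what the scalar idiom computes (saturation at 0xFF follows the slide; the function name is illustrative):

```c
#include <assert.h>
#include <stdint.h>

/* Scalar idiom for a saturating add (SADD) that has no single scalar
 * instruction: add, then compare and conditionally clamp. A translator
 * can recognize this three-instruction pattern as one SIMD SADD. */
int32_t sadd_u8(int32_t r1, int32_t r2)
{
    int32_t r3 = r1 + r2;
    if (r3 > 0xFF)      /* cmp r3, #FF  */
        r3 = 0xFF;      /* movgt #FF    */
    return r3;
}
```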
2. Scalarizing Permutations: for(i = 0; i < 8; i++) { … r1 = r2 + r3; tmp[i] = r1; } followed by for(i = 0; i < 8; i++) { r1 = offset[i]; r2 = tmp[r1 + i]; r3 = r2 & const; … } with offset = {4, 4, 4, 4, -4, -4, -4, -4}
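The offset table on this slide encodes the permutation: element i is read from tmp[i + offset[i]], and {4, 4, 4, 4, -4, -4, -4, -4} swaps the two halves of an 8-element vector. A sketch of the scalarized permutation pattern (function and variable names are illustrative):

```c
#include <assert.h>
#include <stdint.h>

enum { VLEN = 8 };

/* Scalarized permutation: results are staged through a tmp array, then
 * read back through a constant offset table. The translator can
 * recognize this producer/consumer pair as a single SIMD shuffle. */
void swap_halves(const int32_t *in, int32_t *out)
{
    static const int offset[VLEN] = {4, 4, 4, 4, -4, -4, -4, -4};
    int32_t tmp[VLEN];

    for (int i = 0; i < VLEN; i++)
        tmp[i] = in[i];              /* producer loop writes tmp */

    for (int i = 0; i < VLEN; i++)
        out[i] = tmp[i + offset[i]]; /* consumer reads via offsets */
}
```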
3. Scalarizing Reductions: for(i = 0; i < 8; i++) { … r1 = A[i]; r2 = r2 + r1; … }
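The loop-carried accumulator on this slide (r2 = r2 + r1) is the signature the translator looks for in a reduction. A minimal C sketch (names and the sum operation are illustrative; the same pattern applies to other reductions):

```c
#include <assert.h>
#include <stdint.h>

enum { RN = 8 };

/* Scalarized reduction: the loop-carried dependence on r2 marks this
 * as a sum reduction, which a translator can map onto SIMD partial
 * sums plus a horizontal add. */
int32_t sum8(const int32_t *A)
{
    int32_t r2 = 0;
    for (int i = 0; i < RN; i++) {
        int32_t r1 = A[i];
        r2 = r2 + r1;   /* accumulator carried across iterations */
    }
    return r2;
}
```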
Applied to ARM Neon • All instructions supported except: • VTBL, i.e. indirect indexing (v1 = vtbl v2, v3) • Interleaved memory accesses • Neither needed in the evaluated benchmarks
Translation to SIMD • Update the induction variable (i++ becomes i += 4) • Apply the inverse of the defined translation rules. Scalar binary: for(i = 0; i < 8; i++) { r1 = A[i]; r2 = B[i]; r3 = r1 + r2; r4 = offset[i]; C[i + r4] = r3; } translates to SIMD: for(i = 0; i < 8; i += 4) { v1 = A[i]; v2 = B[i]; v3 = v1 + v2; v3 = shuffle v3; C[i] = v3; }
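The induction-variable rewrite can be sketched in portable C by modeling each vector operation as a 4-lane inner loop. This is a lane-by-lane illustration only, not real SIMD code generation, and it omits the shuffle step; the vector width 4 follows the slide, and all names are illustrative:

```c
#include <assert.h>
#include <stdint.h>

enum { W = 4, LEN = 8 };

/* Translated loop shape: the induction variable now advances by the
 * vector width, and each former scalar op works on W lanes at once.
 * Real Liquid SIMD translation emits actual SIMD instructions here. */
void translated(const int32_t *A, const int32_t *B, int32_t *C)
{
    for (int i = 0; i < LEN; i += W) {      /* i += 4, not i++ */
        int32_t v1[W], v2[W], v3[W];
        for (int l = 0; l < W; l++) v1[l] = A[i + l];      /* v1 = A[i] */
        for (int l = 0; l < W; l++) v2[l] = B[i + l];      /* v2 = B[i] */
        for (int l = 0; l < W; l++) v3[l] = v1[l] + v2[l]; /* v3 = v1+v2 */
        for (int l = 0; l < W; l++) C[i + l] = v3[l];      /* C[i] = v3 */
    }
}
```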
Translator Design • Goals: efficiency, speed, flexibility (diagram: one program; per-processor translators retarget it to each processor's accelerator)
Evaluation • Trimaran ARM compiler • Hand-SIMDized loops • SimpleScalar model of an ARM926 with Neon SIMD • Translator implemented in VHDL, 130nm standard cells
Liquid SIMD Issues • Code bloat: <1% overhead beyond baseline • Register pressure: not a problem • Translator cost: 0.2 mm² + 2 KB cache • Translation overhead
Translation Overhead (chart: overhead across MediaBench, kernels, and SPECfp benchmarks)
Summary • Accelerators are increasingly common and evolving • Binary migration is costly • SIMD virtualization using the scalar ISA • One binary: forward/backward compatibility • Negligible overhead
Questions?