
Programming High Performance Embedded Systems: Tackling the Performance Portability Problem






Presentation Transcript


  1. Programming High Performance Embedded Systems: Tackling the Performance Portability Problem. Alastair Reid, Principal Engineer, R&D, ARM Ltd

  2. Programming HP Embedded Systems High-Performance Energy-Efficient Hardware • Example: Ardbeg processor cluster (ARM R&D) Portable System-level programming • Example: SoC-C language extensions (ARM R&D) Portable Kernel-level programming • Example: C+Builtins • Example: Data Parallel Language • Merging System/Kernel-level programming

  3. Mobile Consumer Electronics Trends Mobile Application Requirements Still Growing Rapidly • Still cameras: 2 Mpixel → 10 Mpixel • Video cameras: VGA → HD 1080p → … • Video players: MPEG-2 → H.264 • 2D Graphics: QVGA → HVGA → VGA → FWVGA → … • 3D Gaming: > 30 Mtriangle/s, antialiasing, … • Bandwidth: HSDPA (14.4 Mbps) → WiMax (70 Mbps) → LTE (326 Mbps) Feature Convergence • Phone • + graphics + UI + games • + still camera + video camera • + music • + WiFi + Bluetooth + 3.5G + 3.9G + WiMax + GPS • + …

  4. Mobile SDR Design Challenges SDR Design Objectives for 3G and WiFi • Throughput requirements: 40+ Gops peak throughput • Power budget: 100 mW~500 mW peak power Slide adapted from M. Woh’s ‘From SODA to Scotch’, MICRO-41, 2008

  5. Energy Efficient Systems are “Lumpy” Drop Frequency 10x • Desktop: 2-4GHz • Mobile: 200-400MHz Increase Parallelism 100x • Desktop: 1-2 cores • Mobile: 32-way SIMD Instruction Set, 4-8 cores Match Processor Type to Task • Desktop: homogeneous, general purpose • Mobile: heterogeneous, specialised Keep Memory Local • Desktop: coherent, shared memory • Mobile: processor-memory clusters linked by DMA

  6. Ardbeg SDR Processor [Block diagram: Ardbeg processor cluster with per-PE 512-bit SIMD execution units and shuffle network, SIMD and predicate register files, scalar ALU/multiplier with AGUs, L1 data and program memories, a shared L2 memory, an FEC accelerator, a DMA controller, an ARM control processor with peripherals, and a bus interconnect] • Sparse connected VLIW • Application-specific hardware • 8/16/32-bit fixed-point support • 512-bit SIMD • 2-level memory hierarchy Slide adapted from M. Woh’s ‘From SODA to Scotch’, MICRO-41, 2008

  7. Summary of Ardbeg SDR Processor [Scatter plot: achieved throughput (Mbps) versus power (Watts) for 802.11a, DVB-T/H, W-CDMA 2 Mbps, W-CDMA data, and W-CDMA voice workloads, comparing ASICs, Ardbeg, SODA (180nm), Sandblaster, TigerSHARC, and Pentium M] • Ardbeg is lower power at the same throughput • We are getting closer to ASICs Slide adapted from M. Woh’s ‘From SODA to Scotch’, MICRO-41, 2008

  8. How do we program AMP (asymmetric multiprocessor) systems? C doesn’t provide language features to support • Multiple processors (or multi-ISA systems) • Distributed memory • Multiple threads

  9. Use Indirection (Strawman #1) Add a layer of indirection • Operating System • Layer of middleware • Device drivers • Hardware support All impose a cost in Power/Performance/Area

  10. Raise Pain Threshold (Strawman #2) Write efficient code at very low level of abstraction Problems • Hard, slow and expensive to write, test, debug and maintain • Design intent drowns in sea of low level detail • Not portable across different architectures • Expensive to try different points in design space

  11. Our Response Extend C • Support Asymmetric Multiprocessors • SoC-C language raises level of abstraction • … but take care not to hide expensive operations

  12. SoC-C Overview Pocket-Sized Supercomputers • Energy efficient hardware is “lumpy” • … and unsupported by C • … but supported by SoC-C SoC-C Extensions by Example • Pipeline Parallelism • Code Placement • Data Placement SoC-C Conclusion

  13. 3 steps in mapping an application • Decide how to parallelize • Choose processors for each pipeline stage • Resolve distributed memory issues

  14. A Simple Program
int x[100]; int y[100]; int z[100];
while (1) {
  get(x);
  foo(y,x);
  bar(z,y);
  baz(z);
  put(z);
}

  15. Simplified System Architecture (artist’s impression) [Diagram: data engines with a SIMD instruction set, a control processor, accelerators, and distributed memories]

  16. Step 1: Decide how to parallelize (each half of the loop body accounts for roughly 50% of the work)
int x[100]; int y[100]; int z[100];
while (1) {
  get(x);
  foo(y,x);
  bar(z,y);
  baz(z);
  put(z);
}

  17. Step 1: Decide how to parallelize
int x[100]; int y[100]; int z[100];
PIPELINE {
  while (1) {
    get(x);
    foo(y,x);
    FIFO(y);
    bar(z,y);
    baz(z);
    put(z);
  }
}
PIPELINE indicates the region to parallelize; FIFO indicates boundaries between pipeline stages

  18. SoC-C Feature #1: Pipeline Parallelism Annotations express coarse-grained pipeline parallelism • PIPELINE indicates scope of parallelism • FIFO indicates boundaries between pipeline stages Compiler splits into threads communicating through FIFOs
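As a rough illustration of that transformation (an assumption for this transcript, not the SoC-C compiler's actual output), the example loop can be split by hand into two POSIX threads that hand y across a single-slot FIFO; the fifo_t helpers and the stub kernels below are invented for illustration.

#include <pthread.h>
#include <string.h>

typedef struct {                 /* single-slot blocking FIFO carrying y[] */
  pthread_mutex_t m;
  pthread_cond_t  cv;
  int full;
  int data[100];
} fifo_t;

static fifo_t fy = { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0 };

static void fifo_push(fifo_t *f, const int *src) {
  pthread_mutex_lock(&f->m);
  while (f->full) pthread_cond_wait(&f->cv, &f->m);
  memcpy(f->data, src, sizeof f->data);
  f->full = 1;
  pthread_cond_signal(&f->cv);
  pthread_mutex_unlock(&f->m);
}

static void fifo_pop(fifo_t *f, int *dst) {
  pthread_mutex_lock(&f->m);
  while (!f->full) pthread_cond_wait(&f->cv, &f->m);
  memcpy(dst, f->data, sizeof f->data);
  f->full = 0;
  pthread_cond_signal(&f->cv);
  pthread_mutex_unlock(&f->m);
}

/* stand-ins for the kernels of the example program */
static void get(int *x)               { for (int i = 0; i < 100; ++i) x[i] = i; }
static void foo(int *y, const int *x) { for (int i = 0; i < 100; ++i) y[i] = 2 * x[i]; }
static void bar(int *z, const int *y) { for (int i = 0; i < 100; ++i) z[i] = y[i] + 1; }
static void baz(int *z)               { (void)z; }
static void put(const int *z)         { (void)z; }

static void *stage0(void *arg) {      /* everything before FIFO(y) */
  (void)arg;
  int x[100], y[100];
  while (1) { get(x); foo(y, x); fifo_push(&fy, y); }
  return 0;
}

static void *stage1(void *arg) {      /* everything after FIFO(y) */
  (void)arg;
  int y[100], z[100];
  while (1) { fifo_pop(&fy, y); bar(z, y); baz(z); put(z); }
  return 0;
}

int main(void) {
  pthread_t t0, t1;
  pthread_create(&t0, 0, stage0, 0);
  pthread_create(&t1, 0, stage1, 0);
  pthread_join(t0, 0);
  pthread_join(t1, 0);
  return 0;
}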

  19. Step 2: Choose Processors
int x[100]; int y[100]; int z[100];
PIPELINE {
  while (1) {
    get(x);
    foo(y,x);
    FIFO(y);
    bar(z,y);
    baz(z);
    put(z);
  }
}

  20. Step 2: Choose Processors
int x[100]; int y[100]; int z[100];
PIPELINE {
  while (1) {
    get(x);
    foo(y,x) @ P0;
    FIFO(y);
    bar(z,y) @ P1;
    baz(z) @ P1;
    put(z);
  }
}
@ P indicates the processor that executes the function

  21. SoC-C Feature #2: RPC Annotations Annotations express where code is to execute • Behaves like Synchronous Remote Procedure Call • Does not change meaning of program • Bulk data is not implicitly copied to processor’s local memory
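Because the annotation behaves like a synchronous RPC and does not change the program's meaning, one way to picture the lowering of "foo(y,x) @ P0" is a blocking call into a runtime stub; rpc_call, proc_id, and the stub kernel below are hypothetical names invented for this sketch, not SoC-C runtime APIs.

#include <stddef.h>

typedef enum { P0, P1 } proc_id;
typedef void (*kernel_fn)(int *dst, const int *src);

/* Stub runtime: a real implementation would dispatch the call to the
 * target processor and block until it completes. Bulk data (the arrays)
 * is NOT copied implicitly; only pointers travel with the call. */
static void rpc_call(proc_id p, kernel_fn fn, int *dst, const int *src) {
  (void)p;                         /* placeholder: always execute locally */
  fn(dst, src);
}

/* stand-in for the foo kernel of the example */
static void foo(int *y, const int *x) { for (int i = 0; i < 100; ++i) y[i] = 2 * x[i]; }

static void run_foo_on_p0(int y[100], const int x[100]) {
  rpc_call(P0, foo, y, x);         /* what "foo(y,x) @ P0;" conceptually becomes */
}

int main(void) {
  int x[100] = {0}, y[100];
  run_foo_on_p0(y, x);
  return 0;
}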

  22. Step 3: Resolve Memory Issues
int x[100]; int y[100]; int z[100];
PIPELINE {
  while (1) {
    get(x);
    foo(y,x) @ P0;
    FIFO(y);
    bar(z,y) @ P1;
    baz(z) @ P1;
    put(z);
  }
}
P0 uses x → x must be in M0
P1 uses z → z must be in M1
P0 uses y → y must be in M0
P1 uses y → y must be in M1
Conflict?!

  23. Hardware Cache Coherency [Diagram: processors P0 and P1 with caches $0 and $1; a write to x by one processor invalidates the other cache's copy, a read copies x across, and a subsequent write invalidates again]

  24. Step 3: Resolve Memory Issues
int x[100]; int y[100]; int z[100];
PIPELINE {
  while (1) {
    get(x);
    foo(y,x) @ P0;
    FIFO(y);
    bar(z,y) @ P1;
    baz(z) @ P1;
    put(z);
  }
}
Two versions of y: y@M0, y@M1
foo writes y@M0 → y@M1 is invalid
bar reads y@M1 → coherence error

  25. Step 3: Resolve Memory Issues
int x[100]; int y[100]; int z[100];
PIPELINE {
  while (1) {
    get(x);
    foo(y,x) @ P0;
    SYNC(y) @ DMA;
    FIFO(y);
    bar(z,y) @ P1;
    baz(z) @ P1;
    put(z);
  }
}
SYNC(v) @ P copies data from one version of v to another using processor P
→ y@M0 and y@M1 are both valid
→ bar can read y@M1

  26. SoC-C Feature #3: Compile Time Coherency Variables can have multiple coherent versions • Compiler uses memory topology to determine which version is being accessed Compiler applies cache coherency protocol • Writing to a version makes it valid and other versions invalid • Dataflow analysis propagates validity • Reading from an invalid version is an error • SYNC(x) copies from valid version to invalid version
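A minimal sketch of the coherence rule applied to the example's y, assuming two per-memory versions and a DMA transfer; y_M0, y_M1, dma_copy, and the validity flags are invented here to mirror what the SoC-C compiler tracks statically, not runtime state in a real SoC-C program.

#include <string.h>

static int y_M0[100];                     /* version of y in P0's local memory */
static int y_M1[100];                     /* version of y in P1's local memory */
static int y_M0_valid, y_M1_valid;

/* stand-in for a DMA engine transfer */
static void dma_copy(int *dst, const int *src, size_t n) {
  memcpy(dst, src, n * sizeof *dst);
}

static void write_y_on_P0(void) {         /* e.g. foo(y,x) @ P0                 */
  for (int i = 0; i < 100; ++i) y_M0[i] = i;
  y_M0_valid = 1; y_M1_valid = 0;         /* writing invalidates other versions */
}

static void sync_y_via_DMA(void) {        /* SYNC(y) @ DMA                      */
  dma_copy(y_M1, y_M0, 100);              /* copy valid version to invalid one  */
  y_M1_valid = 1;                         /* now both versions are valid        */
}

static void read_y_on_P1(void) {          /* e.g. bar(z,y) @ P1                 */
  if (!y_M1_valid) { /* in SoC-C this is a compile-time error, not a runtime check */ }
  /* ... P1 reads y_M1 ... */
}

int main(void) {
  write_y_on_P0();
  sync_y_via_DMA();
  read_y_on_P1();
  return 0;
}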

  27. Compiling SoC-C See the paper: “SoC-C: Efficient Programming Abstractions for Heterogeneous Multicore Systems on Chip,” Proceedings of the 2008 International Conference on Compilers, Architecture, and Synthesis for Embedded Systems (CASES 2008). (Or view the ‘bonus slides’ after the talk.)

  28. More realistic SoC-C code: DVB-T Inner Receiver • OFDM receiver • 20 tasks • 500-7000 cycles each • 29000 cycles total
adc_t adc;
ADC_Init(&adc,ADC_BUFSIZE_SAMPLES,adc_Re,adc_Im,13);
SOCC_PIPELINE {
  ChannelEstimateInit_DVB_simd(TPS_INFO, CrRe, CrIm) @ DEd;
  for (int sym = 0; sym<LOOPS; ++sym) {
    cbuffer_t src_r, src_i;
    unsigned len = Nguard+asC_MODE[Mode];
    ADC_AcquireData(&adc,(sym*len)%ADC_BUFSIZE_SAMPLES,len,&src_r,&src_i);
    align(sym_Re,&src_r,len*sizeof(int16_t)) @ DMA_512;
    align(sym_Im,&src_i,len*sizeof(int16_t)) @ DMA_512;
    ADC_ReleaseRoom(&adc,&src_r,&src_i,len);
    RxGuard_DVB_simd(sym_Re,sym_Im,TPS_INFO,Nguard,guarded_Re,guarded_Im) @ DEa;
    cscale_DVB_simd(guarded_Re,guarded_Im,23170,avC_MODE[Mode],fft_Re,fft_Im) @ DEa;
    fft_DVB_simd(fft_Re,fft_Im,TPS_INFO,ReFFTTwid,ImFFTTwid) @ DEa;
    SymUnWrap_DVB_simd(fft_Re,fft_Im,TPS_INFO,unwrapped_Re,unwrapped_Im) @ DEb;
    DeMuxSymbol_DVB_simd(unwrapped_Re,unwrapped_Im,TPS_INFO,ISymNum,demux_Re,demux_Im,PilotsRe,PilotsIm,TPSRe,TPSIm) @ DEb;
    DeMuxSymbol_DVB_simd(CrRe,CrIm,TPS_INFO,ISymNum,demux_CrRe,demux_CrIm,CrPilotsRe,CrPilotsIm,CrTPSRe,CrTPSIm) @ DEb;
    cfir1_DVB_simd(demux_Re,demux_Im,demux_CrRe,demux_CrIm,avN_DCPS[Mode],equalized_Re,equalized_Im) @ DEc;
    cfir1_DVB_simd(TPSRe,TPSIm,CrTPSRe,CrTPSIm,avN_TPSSCPS[Mode],equalized_TPSRe,equalized_TPSIm) @ DEb;
    DemodTPS_DVB_simd(equalized_TPSRe,equalized_TPSIm,TPS_INFO,Pilot,TPSRe) @ DEb;
    DemodPilots_DVB_simd(PilotsRe,PilotsIm,TPS_INFO,ISymNum,demod_PilotsRe,demodPilotsIm) @ DEb;
    cmagsq_DVB_simd(demux_CrRe,demux_CrIm,12612,avN_DCPS[Mode],MagCr) @ DEc;
    int Direction = (ISymNum & 1);
    Direction ^= 1;
    if (Direction) {
      Error=SymInterleave3_DVB_simd2(equalized_Re,equalized_Im,MagCr, DE_vinterleave_symbol_addr_DVB_T_N, DE_vinterleave_symbol_addr_DVB_T_OFFSET, TPS_INFO,Direction,sRe,sIm,sCrMag) @ DEc;
      pack3_DVB_simd(sRe,sIm,sCrMag,avN_DCPS[Mode],interleaved_Re,interleaved_Im,Range) @ DEc;
    } else {
      unpack3_DVB_simd(equalized_Re,equalized_Im,MagCr,avN_DCPS[Mode],sRe,sIm,sCrMag) @ DEc;
      Error=SymInterleave3_DVB_simd2(sRe,sIm,sCrMag, DE_vinterleave_symbol_addr_DVB_T_N, DE_vinterleave_symbol_addr_DVB_T_OFFSET, TPS_INFO,Direction,interleaved_Re,interleaved_Im,Range) @ DEc;
    }
    ChannelEstimate_DVB_simd(interleaved_Re,interleaved_Im,Range,TPS_INFO,CrRe2,CrIm2) @ DEd;
    Demod_DVB_simd(interleaved_Re,interleaved_Im,TPS_INFO,Range,demod_softBits) @ DEd;
    BitDeInterleave_DVB_simd(demod_softBits,TPS_INFO,deint_softBits) @ DEd;
    uint_t err=HardDecoder_DVB_simd(deint_softBits,uvMaxCnt,hardbits) @ DEd;
    Bytecpy(&output[p],hardbits,uMaxCnt/8) @ ARM;
    p += uMaxCnt/8;
    ISymNum = (ISymNum+1) % 4;
  }
}
ADC_Fini(&adc);

  29. Efficient • Same performance as hand-written code • Near-linear speedup: very efficient use of parallel hardware [Chart: parallel speedup]

  30. What SoC-C Provides SoC-C language features • Pipeline to support parallelism • Coherence to support distributed memory • RPC to support multiple processors/ISAs Non-features • Does not choose boundary between pipeline stages • Does not resolve coherence problems • Does not allocate processors SoC-C is concise notation to express mapping decisions (not a tool for making them on your behalf)

  31. Related Work Language • OpenMP: SMP data parallelism using ‘C plus annotations’ • StreamIt: Pipeline parallelism using dataflow language Pipeline parallelism • J.E. Smith, “Decoupled access/execute computer architectures,” Trans. Computer Systems, 2(4), 1984 • Multiple independent reinventions Hardware • Woh et al., “From SODA to Scotch: The Evolution of a Wireless Baseband Processor,” Proc. MICRO-41, Nov. 2008

  32. More Recent Related Work Mapping applications onto Embedded SoCs • Exposing Non-Standard Architectures to Embedded Software using Compile-Time Virtualization, CASES 2009 Pipeline parallelism • The Paralax Infrastructure: Automatic Parallelization with a Helping Hand, PACT 2010

  33. The SoC-C Model Program as if using SMP system • Single multithreaded processor: RPCs provide a “Migrating thread Model” • Single memory: Compiler Managed Coherence handles “bookkeeping” • Annotations change execution, not semantics Avoid need to restructure code • Pipeline parallelism • Compiler managed coherence Efficiency • Avoid abstracting expensive operations → programmer can optimize and reason about them

  34. Kernel Programming

  35. Overview Example: FIR filter Hand-vectorized code • Optimal performance • Issues An Alternative Approach

  36. Example Vectorized Code Very fast, efficient code • Uses 32-wide SIMD • Each SIMD multiply performs 32 (useful) multiplies • VLIW compiler overlaps operations: 3 vector operations per cycle • VLIW compiler performs software pipelining: multiplier active on every cycle
void FIR(vint16_t x[], vint16_t y[], int16_t h[]) {
  vint16_t v = x[0];
  for (int i=0; i<N/SIMD_WIDTH; ++i) {
    vint16_t w = x[i+1];
    vint32L_t acc = vqdmull(v,h[0]);
    int16_t s = vget_lane(w,0);
    v = vdown(v,s);
    for (int j=1; j<T-1; ++j) {
      acc = vqdmlal(acc,v,h[j]);
      s = vget_lane(w,j);
      v = vdown(v,s);
    }
    y[i] = vqrdmlah(acc,v,h[T-1]);
    v = w;
  }
}
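For comparison, a plain scalar C sketch of the same T-tap FIR computation is given below; it ignores the saturating, doubling, rounding behaviour of the vqdmull/vqdmlal/vqrdmlah intrinsics, and the N, T values and Q15 scaling are assumptions, not part of the original kernel.

#include <stdint.h>

enum { N = 1024, T = 16 };   /* example sizes; assumptions for this sketch */

/* Scalar reference FIR: accumulates in 32 bits and truncates, rather than
 * reproducing the fixed-point saturation of the vectorized version. */
void FIR_scalar(const int16_t x[], int16_t y[], const int16_t h[]) {
  for (int i = 0; i < N; ++i) {
    int32_t acc = 0;
    for (int j = 0; j < T; ++j)
      acc += (int32_t)x[i + j] * h[j];   /* needs N+T-1 input samples */
    y[i] = (int16_t)(acc >> 15);         /* assumed Q15 scaling       */
  }
}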

  37. Portability Issues Vendor specific SIMD operations • vqdmull, vdown, vget_lane SIMD-width specific • Assumes SIMD_WIDTH >= T Doesn’t work/performs badly on • Many SIMD architectures • GPGPU • SMP

  38. Flexibility issues Improve arithmetic intensity • Merge with adjacent kernel • E.g., if filtering input to FFT, combine with bit reversal Parallelize task across two Ardbeg engines • Requires modification to system-level code

  39. Summary Programming directly to the processor • Produces very high performance code • Kernel is not portable to other processor types • Kernels cannot be remapped to other devices • Kernels cannot be split/merged to improve scheduling or reduce inter-kernel overheads Often produces a local optimum, but misses the global optimum

  40. (Towards) Performance-Portable Kernel Programming

  41. Outline • The goal • Quick-and-dirty demonstration • References to (more complete) versions • What still needs to be done

  42. An alternative approach [Diagram: kernel source → compiler → target-specific code]

  43. A simple data parallel language
loop(N) {
  V1 = load(a);
  V2 = load(b);
  V3 = add(V1,V2);
  store(c,V3);
}
[Diagram: V1 holds a0, a1, a2, …; V2 holds b0, b1, b2, …; V3 and c hold a0+b0, a1+b1, a2+b2, …]
* Currently implemented as a Haskell EDSL; adapted to C-like notation for presentation.
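As a point of reference, the bulk operations above describe the same computation as this scalar C loop; the array names and the int element type follow the example, while the length parameter n is an assumption.

/* Plain C equivalent of the data-parallel program: the bulk
 * load/add/store over N elements is just an element-wise addition. */
void vector_add(const int *a, const int *b, int *c, int n) {
  for (int i = 0; i < n; ++i)
    c[i] = a[i] + b[i];
}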

  44. Compiling Vector Expressions
p1=a; p2=b; p3=c;
for (i=0; i<N; i+=32) {
  V1=vld1(p1);
  V2=vld1(p2);
  V3=vadd(V1,V2);
  vst1(p3,V3);
  p1+=32; p2+=32; p3+=32;
}
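As one concrete backend, the same loop can be emitted as ARM NEON intrinsics; the sketch below is an assumption for illustration of the "multiple backends" idea, using 8-lane int16x8_t vectors (128-bit NEON, not Ardbeg's 32-wide SIMD) and standard arm_neon.h calls.

#include <arm_neon.h>
#include <stdint.h>

/* One possible SIMD backend for the vector-add program. NEON vectors are
 * 128-bit, so the loop strides by 8 int16 lanes; n is assumed to be a
 * multiple of 8. */
void vector_add_neon(const int16_t *a, const int16_t *b, int16_t *c, int n) {
  for (int i = 0; i < n; i += 8) {
    int16x8_t v1 = vld1q_s16(a + i);    /* V1 = load(a)     */
    int16x8_t v2 = vld1q_s16(b + i);    /* V2 = load(b)     */
    int16x8_t v3 = vaddq_s16(v1, v2);   /* V3 = add(V1,V2)  */
    vst1q_s16(c + i, v3);               /* store(c,V3)      */
  }
}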

  45. Generating datapath [Circuit sketch: address counters (+1) feeding memories A and B, an adder, and the result written to memory C] * Warning: this circuit does not adhere to any ARM quality standards.

  46. Adding control [Circuit sketch: the same datapath with enable signals on each stage, a loop counter (-1), and a “!=0” test driving an nDone signal] * Warning: this circuit does not adhere to any ARM quality standards.

  47. Fixing timing [Circuit sketch: the same controlled datapath with the control and data paths retimed so enables line up with the data]

  48. Related Work NESL – Nested Data Parallelism (CMU) • Cray Vector machines, Connection Machine DpH – Generalization of NESL as a Haskell library (SLPJ++) • GPGPU Accelerator – Data parallel library in C#/F# (MSR) • SMP, DirectX9, FPGA Array Building Blocks – C++ template library (Intel) • SMP, SSE Thrust – C++ template library (NVidia) • GPGPU (Also: Parallel Skeletons, Map-reduce, etc. etc.)

  49. Summary of approach (Only) Use highly structured bulk operations • Bulk operations → reason about vectors, not individual elements • Simple mathematical properties → easy to optimize Single frontend, multiple backends • SIMD, SMP, GPGPU, FPGA, ... (Scope for significant platform-dependent optimization)

  50. Breaking down boundaries Hard boundary between system and kernel layers • Separate languages • Separate tools • Separate people writing/optimizing Need to soften boundary • Allow kernels to be split across processors • Allow kernels to be merged across processors • Allow kernels A and B to agree to use a non-standard memory layout (to run more efficiently) (This is an open problem)
