High Performance Mobile Computing Using Flexible Wide SIMD Processors

Scott Mahlke, in collaboration with Mark Woh, Sangwon Seo, Amir Hormati, Yoonseo Choi, Trevor Mudge, Chaitali Chakrabarti (ASU), and Krisztian Flautner (ARM Ltd.)
Advanced Computer Architecture Laboratory, University of Michigan

The Modern Mobile Phone
• Future phones are becoming more complex
• Richer applications require both more performance and more flexibility
• Modern phones look like Franken-chips
[Figure labels: The Old Mobile Phone; Video Recording; Video Editing; Higher Data Rates; 3D Rendering; Advanced Image Processing]

Power/Performance Requirements for Multiple Systems
[Figure: log-log plot of Performance (Gops) versus Power (Watts), with diagonal power-efficiency lines at 1, 10, 100, and 1000 Mops/mW ("Better Power Efficiency"). Plotted points: 4G Wireless, 3G Wireless, Mobile HD Video, IBM Cell, SODA (90 nm), SODA (65 nm), Imagine, VIRAM, Pentium M, TI C6X.]
Different applications have different power/performance characteristics! We need to design keeping each application in mind! (Not GPP but Domain Specific Processor)

4G Wireless Basics
[Figure: NTT DoCoMo 4G test setup]
• Three kernels make up the majority of the work
  • FFT – Extract Data from Signals
  • STBC – Combine Data into More Reliable Stream
  • LDPC – Error Correction on Data Stream

High Definition Video (H.264) Basics
4CIF@30fps

Mobile Signal Processing Algorithm Characteristics
• Problems with traditional SIMD
  • High register file power
  • Large data movement/alignment cost
  • Inconsistent lane utilization
  • SIMD implies single thread
• Algorithms have different SIMD widths
  • From very large to very small
• Though SIMD width varies, all algorithms can exploit it
  • A large percentage of the work can be SIMDized
  • Larger SIMD widths tend to have less TLP

So, What’s the Right Solution?
• Alternatives
  • More processors, fewer lanes?
  • Configurable: Hardware can be SIMD or MIMD?
  • Franken chip?
• SIMD is the answer! It provides high performance and power efficiency
  • Low control cost
  • More area-efficient scaling
  • Single thread context
  • Simpler memory system design – cache coherence more manageable

A Closer Look at SIMD: Power Breakdown
[Figure: pie chart of SODA power for WCDMA. Slices: REG, ALU+MULT, MEM, CONTROL, INTERCONNECT, SCALAR PIPE. Slice values shown on the slide: 38%, 37%, 11%, 9%, 3%, 2%.]
Register file power is disproportionately high in a traditional SIMD architecture

Register File Accesses
• Lots of power is wasted on unneeded register file accesses!
• Many register file accesses do not have to go back to the main register file
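The point about unneeded register file traffic can be illustrated with a small counting model. This is a minimal sketch under stated assumptions, not the accounting used for the slide: the `Op` tuple, the `count_rf_accesses` function, and the `bypass_window` parameter are hypothetical names, and the model simply assumes that a result consumed within `bypass_window` instructions of its producer can be forwarded without touching the main register file.

```python
from collections import namedtuple

# Hypothetical instruction record: one destination, a tuple of source names.
Op = namedtuple("Op", ["dest", "srcs"])

def count_rf_accesses(ops, bypass_window=0):
    """Return (reads, writes) to the main register file.

    Assumption: a value read within `bypass_window` instructions of its
    producer is forwarded and needs no main register file read. A result
    is written back only if some consumer falls outside the window, or if
    it has no consumer in the sequence (treated as live-out).
    """
    last_def = {}                       # name -> index of latest producer
    needs_write = [False] * len(ops)
    has_consumer = [False] * len(ops)
    reads = 0
    for i, op in enumerate(ops):
        for src in op.srcs:
            j = last_def.get(src)
            if j is None:
                reads += 1              # program input, held in the register file
                continue
            has_consumer[j] = True
            if i - j > bypass_window:
                reads += 1              # too far from the producer to forward
                needs_write[j] = True
        last_def[op.dest] = i
    writes = sum(1 for j in range(len(ops))
                 if needs_write[j] or not has_consumer[j])
    return reads, writes

# A short dependent chain with mostly short-lived temporaries.
program = [
    Op("t0", ("a", "b")),
    Op("t1", ("t0", "c")),
    Op("t2", ("t1", "a")),
    Op("out", ("t2", "t0")),
]

print("no bypass:  ", count_rf_accesses(program, bypass_window=0))
print("with bypass:", count_rf_accesses(program, bypass_window=1))
```

In this toy chain, only `t0` (reused three instructions later) and the live-out result still need the main register file once forwarding is allowed.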

LDPC – Scaling Performance with SIMD Width
[Figure annotations: "Extra hardware power consumption outweighs reduced operations"; "Increasing SIMD width reduces Memory and Shuffle Operations, thus reduces power"]
• SIMD loses effectiveness when lanes cannot be put to productive use
• SIMD on distributed data (SIMdD)
• Efficient data rearrangement critical to success of SIMD

Data Alignment Issues
[Figure: Intra-Prediction]
Traditional SIMD machines take too long or cost too much to do this
Good news – a small, fixed number of patterns per kernel
• H.264 Intra-prediction has 9 different prediction modes
• Each prediction mode requires a specific permutation
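To make "each prediction mode requires a specific permutation" concrete, here is a minimal sketch for two of the 4x4 intra-prediction modes: vertical, which repeats the row of pixels above the block, and horizontal, which repeats the column to its left. The input layout (four top neighbors followed by four left neighbors) and the names `make_patterns` and `swizzle` are assumptions made for illustration, not something taken from the slides. Strictly speaking, these patterns replicate elements, so they are gathers rather than one-to-one permutations.

```python
def make_patterns():
    """Build index patterns for a 4x4 block stored row by row.

    Assumed input vector layout: [top0..top3, left0..left3].
    """
    vertical = [c for r in range(4) for c in range(4)]        # out[r][c] = top[c]
    horizontal = [4 + r for r in range(4) for c in range(4)]  # out[r][c] = left[r]
    return {"vertical": vertical, "horizontal": horizontal}

def swizzle(src, pattern):
    """Apply a fixed pattern: output element i takes src[pattern[i]]."""
    return [src[i] for i in pattern]

patterns = make_patterns()
neighbors = [10, 20, 30, 40, 1, 2, 3, 4]   # top row, then left column

for mode in ("vertical", "horizontal"):
    block = swizzle(neighbors, patterns[mode])
    rows = [block[r * 4:(r + 1) * 4] for r in range(4)]
    print(mode, rows)
```

Because the patterns are fixed per mode, they can be computed once and selected by mode number at run time instead of being rebuilt with a sequence of generic shuffle instructions.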

4G/H.264 Summary
• Parallelism comes in many different sizes
  • From 4-wide to 96-wide to 1024-wide SIMD
  • Which means many different SIMD widths need to be supported
• TLP (disjoint SIMD) often available
• Very short-lived values
• Lots of potential for instruction fusing (beyond pairwise)
• Limited set of shuffle patterns required for each kernel
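The effect of mismatched widths on lane utilization can be worked out directly. The sketch below is a simplification under stated assumptions: a kernel with a natural vector length is issued on fixed-width hardware in whole SIMD operations, and unused lanes in the last operation are counted as idle. The function name `lane_utilization` is hypothetical, and the 64-wide and 8-wide hardware widths are chosen only for illustration.

```python
import math

def lane_utilization(vector_len, simd_width):
    """Fraction of lanes doing useful work when a vector of `vector_len`
    elements is processed in whole operations of `simd_width` lanes."""
    issues = math.ceil(vector_len / simd_width)
    return vector_len / (issues * simd_width)

# Natural widths quoted in the summary: 4, 96, and 1024.
for vector_len in (4, 96, 1024):
    wide = lane_utilization(vector_len, 64)
    narrow = lane_utilization(vector_len, 8)
    print(f"{vector_len:5d} elements: 64-wide {wide:.4f}, 8-wide {narrow:.4f}")
```

A 4-element kernel keeps only a small fraction of a 64-wide datapath busy, while a 1024-element kernel fills it completely, which is why a single fixed width fits some kernels and wastes hardware on others.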

AnySP: Push SIMD, But Increase the Inherent Flexibility and Efficiency

AnySP Architecture – High Level
• 16 Banked Memory with SRAM-based Crossbar
• 8 Groups of 8-Wide Flexible Function Units
• Multiple Output Adder Tree
• 128x128 16-bit Swizzle Network
• Temporary Buffer and Bypass Network
• Datapath
• AGU and Scalar Pipeline

Multi-Width SIMD Support
• Normal 64-Wide SIMD mode – all lanes share one AGU
• Each 8-wide SIMD Group works on different memory locations of the same 8-wide code – AGU Offsets
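The two addressing modes can be sketched as address generation for 64 lanes. This is an illustrative model only: the function names, the flat element addressing, and the example offsets are assumptions, and the real AGU behavior is not specified on the slide beyond "one shared AGU" versus "AGU offsets per 8-wide group".

```python
GROUPS = 8        # 8 groups, per the architecture slide
GROUP_WIDTH = 8   # 8 lanes per group

def lane_addresses_wide(base):
    """Normal 64-wide mode: one shared address, lanes read consecutive elements."""
    return [base + lane for lane in range(GROUPS * GROUP_WIDTH)]

def lane_addresses_grouped(base, offsets):
    """Grouped mode: every group runs the same 8-wide code, but group g
    starts at base + offsets[g] (the per-group offsets are hypothetical)."""
    if len(offsets) != GROUPS:
        raise ValueError("expected one offset per group")
    return [base + offsets[g] + lane
            for g in range(GROUPS)
            for lane in range(GROUP_WIDTH)]

# Eight independent 8-element vectors laid out 100 elements apart.
offsets = [100 * g for g in range(GROUPS)]
addresses = lane_addresses_grouped(0, offsets)
print(addresses[:8])    # lanes of group 0
print(addresses[8:16])  # lanes of group 1
```

With per-group offsets, eight short vectors that are not contiguous in memory can be processed in one pass, which is how narrow kernels can still occupy all 64 lanes.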

Using SIMD Lanes for Deeper Subgraphs
The Flexible Functional Unit allows us to:
• Exploit pipeline parallelism by joining two lanes together
• Handle register bypass and the temporary buffer
• Join multiple pipelines to process deeper subgraphs
• Fuse instruction pairs

SRAM-based Crossbar
• Multiple SRAM cells replace the MUX of a traditional crossbar
• Each cell stores configuration information
• The controller selects the specific configuration based on the instruction parameter
• Each cell can store up to 6 different configurations
• Power reduced by 50% for a 128x128 crossbar
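A behavioral sketch of a crossbar with stored configurations follows. It models only the programming interface implied by the bullets (a handful of preloaded routing patterns, one of which is selected per instruction) and says nothing about the circuit itself. The class name, method names, and the small 8-wide demo are assumptions; only the limit of 6 configurations comes from the slide.

```python
class ConfigurableCrossbar:
    """Behavioral model: routing patterns are stored ahead of time and
    selected by slot number when data is routed."""

    MAX_CONFIGS = 6   # from the slide: up to 6 configurations per cell

    def __init__(self, width):
        self.width = width
        self.configs = {}

    def store(self, slot, pattern):
        """Save a pattern; output i will take input pattern[i]."""
        if not 0 <= slot < self.MAX_CONFIGS:
            raise ValueError("slot out of range")
        if len(pattern) != self.width:
            raise ValueError("pattern length must equal crossbar width")
        if any(not 0 <= s < self.width for s in pattern):
            raise ValueError("pattern index out of range")
        self.configs[slot] = list(pattern)

    def route(self, slot, data):
        """Route `data` using the configuration stored in `slot`."""
        if len(data) != self.width:
            raise ValueError("data length must equal crossbar width")
        pattern = self.configs[slot]
        return [data[s] for s in pattern]

# Small 8-wide demo; the slide's network is 128x128.
xbar = ConfigurableCrossbar(8)
xbar.store(0, [7, 6, 5, 4, 3, 2, 1, 0])          # reverse
xbar.store(1, [(i + 1) % 8 for i in range(8)])   # rotate left by one
print(xbar.route(0, list(range(8))))
print(xbar.route(1, list(range(8))))
```

Selecting a stored slot replaces sending a full set of select signals with every shuffle, which fits kernels that need only a limited set of patterns.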

AnySP vs SIMD-based Architecture
[Figure: bar chart of Normalized Speedup (axis 0.0 to 2.5) per kernel. Legend: Baseline 64-Wide Multi-SIMD, Swizzle Network, Flexible Functional Unit, Buffer + Bypass. Kernels: FFT 1024pt Radix-2, FFT 1024pt Radix-4, STBC, LDPC, H.264 Intra Prediction, H.264 Deblocking Filter, H.264 Inverse Transform, H.264 Motion Compensation.]
SIMD width doubled, but that only provides half the performance gain; the other half is due to flexibility features

AnySP Energy-Delay vs SIMD-based Architecture
[Figure: bar chart of Normalized Energy-Delay (axis 0.0 to 1.0) per kernel. Series: SIMD-based Architecture, AnySP. Kernels: FFT 1024pt Radix-2, FFT 1024pt Radix-4, STBC, LDPC, H.264 Intra Prediction, H.264 Deblocking Filter, H.264 Inverse Transform, H.264 Motion Compensation.]
Comparison based on 90nm synthesis results
Flexibility increases utilization of the datapath and hence its efficiency

AnySP Power Breakdown
4G + H.264 Decoder

Group   Component                               Units  Area (mm²)  Area (%)  Power (mW)  Power (%)
PE      SIMD Data Mem (32 KB)                   4      9.76        38.78     102.88      7.24
PE      SIMD Register File (16 x 1024 bit)      4      3.17        12.59     299.00      21.05
PE      SIMD ALUs, Multipliers, and SSN         4      4.50        17.88     448.51      31.58
PE      SIMD Pipeline + Clock + Routing         4      1.18        4.69      233.60      16.45
PE      SIMD Buffer (128 B)                     4      0.82        3.25      84.09       5.92
PE      SIMD Adder Tree                         4      0.18        <1        10.43       <1
PE      Intra-processor Interconnect            4      0.94        3.73      93.44       6.58
PE      Scalar/AGU Pipeline & Misc.             4      1.22        4.85      134.32      9.46
System  ARM (Cortex-M3)                         1      0.6         2.38      2.5         <1
System  Global Scratchpad Memory (128 KB)       1      1.8         7.15      10          <1
System  Inter-processor Bus with DMA            1      1.0         3.97      1.5         <1
Total   90 nm (1 V @ 300 MHz)                          25.17       100       1347.03     100
Total   65 nm (0.9 V @ 300 MHz)                        13.14                 1091.09
Total   Est. 45 nm (0.8 V @ 300 MHz)                   6.86                  862.09

We estimate that both H.264 and 4G wireless can be done in under 1 Watt at 45nm

Conclusions
• Scaling traditional SIMD for mobile applications
  • Wide-SIMD hardware under-utilized
  • Large fraction of power on non-computation
• AnySP design
  • Can possibly meet the requirements of 100Mbps 4G and HD video on the same platform @45nm
• Flexibility/Efficiency improvements
  • Increase SIMD utilization (FFUs, multiple short vectors)
  • Reduce register file power (bypass buffer)
  • More efficient data shuffling (SRAM-based crossbar)

Questions
• For more information: http://cccp.eecs.umich.edu
