
High Performance Mobile Computing Using Flexible Wide SIMD Processors

Scott Mahlke, in collaboration with Mark Woh, Sangwon Seo, Amir Hormati, Yoonseo Choi, Trevor Mudge, Chaitali Chakrabarti (ASU), and Krisztian Flautner (ARM Ltd.). Advanced Computer Architecture Laboratory, University of Michigan.


Presentation Transcript


  1. High Performance Mobile Computing Using Flexible Wide SIMD Processors • Scott Mahlke, in collaboration with Mark Woh, Sangwon Seo, Amir Hormati, Yoonseo Choi, Trevor Mudge, Chaitali Chakrabarti (ASU), and Krisztian Flautner (ARM Ltd.) • Advanced Computer Architecture Laboratory, University of Michigan

  2. The Modern Mobile Phone / The Old Mobile Phone • Future phones are becoming more complex • Richer applications require both more performance and more flexibility • Modern phones look like Franken-chips • Video Recording, Video Editing, Higher Data Rates, 3D Rendering, Advanced Image Processing

  3. Power/Performance Requirements for Multiple Systems [Figure: performance (Gops) versus power (Watts) on logarithmic axes, with power-efficiency iso-lines labeled in Mops/mW; plotted points: 4G Wireless, 3G Wireless, Mobile HD Video, IBM Cell, SODA (90 nm), SODA (65 nm), Imagine, VIRAM, Pentium M, TI C6X] • Different applications have different power/performance characteristics! • We need to design keeping each application in mind! (Not GPP but Domain Specific Processor)

  4. 4G Wireless Basics [Image: NTT DoCoMo 4G test setup] • Three kernels make up the majority of the work • FFT: Extract Data from Signals • STBC: Combine Data into More Reliable Stream • LDPC: Error Correction on Data Stream

  5. High Definition Video (H.264) Basics 4CIF@30fps

  6. Mobile Signal Processing Algorithm Characteristics • Problems with traditional SIMD • High register file power • Large data movement/alignment cost • Inconsistent lane utilization • SIMD implies single thread • Algorithms have different SIMD widths • From very large to very small • Though SIMD width varies, all algorithms can exploit it • Large percentage of work can be SIMDized • Larger SIMD widths tend to have less TLP

  7. So, What’s the Right Solution? • Alternatives • More processors, fewer lanes? • Configurable: Hardware can be SIMD or MIMD? • Franken chip? • SIMD is the answer! It provides high performance and power efficiency • Low control cost • More area-efficient scaling • Single thread context • Simpler memory system design: cache coherence is more manageable

  8. A Closer Look at SIMD: Power Breakdown [Pie chart: SODA power breakdown for WCDMA across MEM, REG, ALU+MULT, CONTROL, INTERCONNECT, and SCALAR PIPE; slice values 38%, 37%, 11%, 9%, 3%, and 2%] • Register file power disproportionately high in a traditional SIMD architecture

  9. Register File Accesses • Lots of power wasted on unneeded register file accesses! • Many of the register file accesses do not have to go back to the main register file

  10. LDPC: Scaling Performance with SIMD Width [Chart annotations] • Extra hardware power consumption outweighs reduced operations • Increasing SIMD width reduces memory and shuffle operations and thus reduces power • SIMD loses effectiveness when lanes cannot be put to productive use • SIMD on distributed data (SIMdD) • Efficient data rearrangement is critical to the success of SIMD

  11. Data Alignment Issues: Intra-Prediction • Traditional SIMD machines take too long or cost too much to do this • Good news: a small, fixed number of patterns per kernel • H.264 Intra-prediction has 9 different prediction modes • Each prediction mode requires a specific permutation

  12. 4G/H.264 Summary • Lots of different sized parallelism • From 4 wide to 96 wide to 1024 wide SIMD • Which means many different SIMD widths need to be supported • TLP (disjoint SIMD) often available • Very short-lived values • Lots of potential for instruction fusing (beyond pairwise) • Limited set of shuffle patterns required for each kernel
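The cost of the width mismatch summarized above can be illustrated with simple arithmetic. The following is a minimal Python sketch, not taken from the slides: it assumes a fixed-width datapath that issues a vector in whole passes and leaves the unused lanes idle. The function name `lane_utilization` is hypothetical; the widths 4, 96, and 1024 come from the summary, and the 64-wide machine matches the baseline named later in the talk.

```python
import math

def lane_utilization(vector_width, machine_width):
    """Fraction of lanes doing useful work when a vector_width-wide
    operation is issued on a fixed machine_width-wide SIMD datapath.

    Assumption for illustration: the vector is processed in whole passes
    and any lanes left over in the last pass sit idle.
    """
    passes = math.ceil(vector_width / machine_width)
    return vector_width / (passes * machine_width)

MACHINE_WIDTH = 64  # assumed fixed-width baseline

# Utilization for the vector widths quoted in the summary.
utilization = {w: lane_utilization(w, MACHINE_WIDTH) for w in (4, 96, 1024)}
```

Under these assumptions a 4-wide kernel keeps 6.25% of a 64-wide datapath busy, a 96-wide kernel 75%, and a 1024-wide kernel 100%, which is why a single fixed width serves some kernels poorly.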

  13. AnySP: Push SIMD, But Increase the Inherent Flexibility and Efficiency

  14. AnySP Architecture: High Level • 16 Banked Memory with SRAM-based Crossbar • 8 Groups of 8-Wide Flexible Function Units • Multiple Output Adder Tree • 128x128 16bit Swizzle Network • Temporary Buffer and Bypass Network • Datapath • AGU and Scalar Pipeline

  15. Multi-Width SIMD Support • Normal 64-Wide SIMD mode: all lanes share one AGU • Each 8-wide SIMD Group works on different memory locations of the same 8-wide code: AGU Offsets
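The two addressing modes on this slide can be sketched in software. This is a hedged Python model, not the AnySP hardware: it assumes a flat memory, and the names `load_wide` and `load_grouped` and the example offsets are hypothetical. Only the 8 groups of 8 lanes and the idea of per-group AGU offsets come from the slides.

```python
GROUPS = 8        # 8 groups, per the architecture slide
GROUP_WIDTH = 8   # each group is 8 lanes wide

def load_wide(memory, base):
    """64-wide mode: one address generator, 64 consecutive elements."""
    return [memory[base + lane] for lane in range(GROUPS * GROUP_WIDTH)]

def load_grouped(memory, base, group_offsets):
    """Multi-width mode: every group runs the same 8-wide load, but its
    address is the shared base plus a per-group offset."""
    if len(group_offsets) != GROUPS:
        raise ValueError("need one offset per group")
    return [
        [memory[base + offset + lane] for lane in range(GROUP_WIDTH)]
        for offset in group_offsets
    ]

memory = list(range(1024))
wide = load_wide(memory, base=0)

# Hypothetical offsets: eight independent 8-element vectors, 100 apart.
grouped = load_grouped(memory, base=0,
                       group_offsets=[100 * g for g in range(GROUPS)])
```

In the grouped call, all eight groups execute the same instruction, yet each fetches its own short vector, which is how several narrow, independent vectors can share one wide datapath.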

  16. Using SIMD Lanes for Deeper Subgraphs • Flexible Functional Unit allows us to: • Exploit pipeline-parallelism by joining two lanes together • Handle register bypass and the temporary buffer • Join multiple pipelines to process deeper subgraphs • Fuse instruction pairs
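Why fusing an instruction pair saves register file traffic can be shown with a toy access counter. The sketch below is an illustrative Python model under stated assumptions, not the Flexible Functional Unit itself: `RegisterFile`, `unfused_mul_add`, `fused_mul_add`, and the register names are all hypothetical, and a multiply followed by an add is just one example of a dependent pair.

```python
class RegisterFile:
    """Toy register file that counts its own reads and writes."""

    def __init__(self):
        self.values = {}
        self.reads = 0
        self.writes = 0

    def read(self, name):
        self.reads += 1
        return self.values[name]

    def write(self, name, value):
        self.writes += 1
        self.values[name] = value

def unfused_mul_add(rf, a, b, c, dst):
    # Two separate instructions: the product round-trips through "tmp".
    rf.write("tmp", rf.read(a) * rf.read(b))
    rf.write(dst, rf.read("tmp") + rf.read(c))

def fused_mul_add(rf, a, b, c, dst):
    # Fused pair: the short-lived product is forwarded directly and
    # never touches the register file.
    product = rf.read(a) * rf.read(b)
    rf.write(dst, product + rf.read(c))

def run(kernel):
    rf = RegisterFile()
    rf.values.update({"r0": 3, "r1": 4, "r2": 5})  # preload, not counted
    kernel(rf, "r0", "r1", "r2", "r3")
    return rf.values["r3"], rf.reads, rf.writes

unfused = run(unfused_mul_add)
fused = run(fused_mul_add)
```

Both versions compute the same result, but the fused one performs one fewer read and one fewer write because the intermediate value stays inside the datapath.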

  17. SRAM-based Crossbar • Multiple SRAM cells replace the MUX of a traditional crossbar • Each cell stores configuration information • The controller selects the specific configuration based on the instruction parameter • Each cell can store up to 6 different configurations • Power reduced by 50% for 128x128 crossbar
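The behavior described on this slide, a small set of stored routings selected per instruction, can be modeled in a few lines. This Python sketch is a functional analogy only and says nothing about the circuit: the class `ConfigurableCrossbar` and its methods are hypothetical, and the only figure taken from the slide is the limit of 6 stored configurations.

```python
class ConfigurableCrossbar:
    """Toy crossbar that routes inputs using one of a few stored patterns."""

    MAX_CONFIGS = 6  # per the slide: up to 6 configurations per cell

    def __init__(self, width):
        self.width = width
        self.configs = []

    def store(self, routing):
        """Store a routing (output i takes input routing[i]); return its slot."""
        if len(self.configs) >= self.MAX_CONFIGS:
            raise RuntimeError("all configuration slots are in use")
        if len(routing) != self.width or any(
            not 0 <= src < self.width for src in routing
        ):
            raise ValueError("routing must name one valid input per output")
        self.configs.append(list(routing))
        return len(self.configs) - 1

    def route(self, inputs, slot):
        """Apply the stored routing chosen by the instruction's slot number."""
        routing = self.configs[slot]
        return [inputs[src] for src in routing]

xbar = ConfigurableCrossbar(width=4)
swap_pairs = xbar.store([1, 0, 3, 2])
reverse = xbar.store([3, 2, 1, 0])
out = xbar.route(["a", "b", "c", "d"], swap_pairs)
```

Because each kernel needs only a limited set of shuffle patterns, the patterns can be loaded once and then selected by slot number, rather than being recomputed or passed in with every instruction.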

  18. AnySP vs SIMD-based Architecture [Bar chart: normalized speedup (axis 0.0 to 2.5) for FFT 1024pt Radix-2, FFT 1024pt Radix-4, STBC, LDPC, H.264 Intra Prediction, H.264 Deblocking Filter, H.264 Inverse Transform, and H.264 Motion Compensation; legend: Baseline 64-Wide, Multi-SIMD, Swizzle Network, Flexible Functional Unit, Buffer + Bypass] • SIMD width doubled, but that only provides half the performance gain; the other half is due to flexibility features

  19. AnySP Energy-Delay vs SIMD-based Architecture [Bar chart: normalized energy-delay (axis 0.0 to 1.0) of the SIMD-based architecture and AnySP for FFT 1024pt Radix-2, FFT 1024pt Radix-4, STBC, LDPC, H.264 Intra Prediction, H.264 Deblocking Filter, H.264 Inverse Transform, and H.264 Motion Compensation] • Comparison based on 90nm synthesis results • Flexibility increases utilization of datapath and hence its efficiency

  20. AnySP Power Breakdown: Area and Power, 4G + H.264 Decoder

      | Group | Component | Units | Area (mm²) | Area (%) | Power (mW) | Power (%) |
      |---|---|---|---|---|---|---|
      | PE | SIMD Data Mem (32KB) | 4 | 9.76 | 38.78% | 102.88 | 7.24% |
      | PE | SIMD Register File (16x1024bit) | 4 | 3.17 | 12.59% | 299.00 | 21.05% |
      | PE | SIMD ALUs, Multipliers, and SSN | 4 | 4.50 | 17.88% | 448.51 | 31.58% |
      | PE | SIMD Pipeline + Clock + Routing | 4 | 1.18 | 4.69% | 233.60 | 16.45% |
      | PE | SIMD Buffer (128B) | 4 | 0.82 | 3.25% | 84.09 | 5.92% |
      | PE | SIMD Adder Tree | 4 | 0.18 | <1% | 10.43 | <1% |
      | PE | Intra-processor Interconnect | 4 | 0.94 | 3.73% | 93.44 | 6.58% |
      | PE | Scalar/AGU Pipeline & Misc. | 4 | 1.22 | 4.85% | 134.32 | 9.46% |
      | System | ARM (Cortex-M3) | 1 | 0.6 | 2.38% | 2.5 | <1% |
      | System | Global Scratchpad Memory (128KB) | 1 | 1.8 | 7.15% | 10 | <1% |
      | System | Inter-processor Bus with DMA | 1 | 1.0 | 3.97% | 1.5 | <1% |
      | Total | 90nm (1V @ 300MHz) | | 25.17 | 100% | 1347.03 | 100% |
      | Total | 65nm (0.9V @ 300MHz) | | 13.14 | | 1091.09 | |
      | Total | Est. 45nm (0.8V @ 300MHz) | | 6.86 | | 862.09 | |

      We estimate that both H.264 and 4G wireless can be done in under 1 Watt at 45nm

  21. Conclusions • Scaling traditional SIMD for mobile applications • Wide-SIMD hardware under-utilized • Large fraction of power on non-computation • AnySP design • Can possibly meet the requirements of 100Mbps 4G and HD video on the same platform @45nm • Flexibility/Efficiency improvements • Increase SIMD utilization (FFUs, multiple short vectors) • Reduce register file power (bypass buffer) • More efficient data shuffling (SRAM-based crossbar)

  22. Questions • For more information • http://cccp.eecs.umich.edu
