From SODA to Scotch: The Evolution of a Wireless Baseband Processor

Mark Woh, Yuan Lin, Sangwon Seo, Scott Mahlke, Trevor Mudge (University of Michigan - Ann Arbor)
Chaitali Chakrabarti (Arizona State University)
Richard Bruce, Danny Kershaw, Alastair Reid, Mladen Wilder, Krisztian Flautner (ARM Ltd.)

From SODA to Scotch: What is this talk about?
• Is a fully programmable 3G baseband processor commercially viable?
• The SODA processor was the first full research design [ISCA06]
• ARM R&D developed the Ardbeg SDR commercial prototype
• What we will present
  • A comparison study between SODA and Ardbeg
  • Lessons learned in the evolution

Mobile Computing
Cell phones are getting more complex; PCs are getting more mobile
• In 2007, world-wide mobile telephone subscriptions: 3.3 billion [1]
  • ~Half of the world’s population
  • Some countries have mobile penetration over 100%
• Largest consumer electronic device in terms of volume
• Wireless multimedia anywhere at any time
[1] “Global cellphone penetration reaches 50 pct”, Reuters, Nov. 29th, 2007

Wireless Communication
• Global Network: GPS, DVB
• Wide Area Network: GSM, W-CDMA
• Local Area Network: 802.11g, 802.11n
• Personal Area Network: Bluetooth, UWB

Software Defined Radio
[Figure: handset block diagram. Blocks: Camera, GPS, Bluetooth, Keypad, Display, Speaker, Microphone, Application Processors, WCDMA Analog Frontend, Baseband Processor]

Software Defined Radio
[Figure: the handset block diagram annotated with the protocol layers Transport, Network, Link, MAC, and PHY, and with the implementation labels GPP and DSP + ASICs]

Software Defined Radio
[Figure: the handset block diagram with the baseband block labeled SDR Baseband Processor]

Advantages of Soft Radio
[Figure: a single SDR block surrounded by GPS, 802.11n, DVB, GSM, W-CDMA, 802.11g, UWB, and Bluetooth]
• Design factor
  • Protocol complexity
  • Multi-mode operation
  • Prototyping and bug fixes
• Cost factor
  • Time-to-market
  • Silicon area
  • Higher volume
  • Longevity of platform

Mobile SDR Design Challenges
SDR design objectives for 3G and WiFi
• Throughput requirements
  • 40+ Gops peak throughput
• Power budget
  • 100mW~500mW peak power
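The two objectives above together imply a power-efficiency target. The sketch below simply divides the slide's 40 Gops figure by each end of the 100mW~500mW budget; the function name and the choice of Mops/mW as the unit are illustrative assumptions, not something the slides define.

```python
# Figures taken from the slide: 40 Gops peak throughput, 100-500 mW peak power.
PEAK_THROUGHPUT_GOPS = 40.0
POWER_BUDGET_MW = (100.0, 500.0)


def required_mops_per_mw(gops: float, power_mw: float) -> float:
    """Efficiency (Mops/mW) needed to deliver `gops` within `power_mw`."""
    return gops * 1000.0 / power_mw  # 1 Gops = 1000 Mops


if __name__ == "__main__":
    for power in POWER_BUDGET_MW:
        eff = required_mops_per_mw(PEAK_THROUGHPUT_GOPS, power)
        print(f"{power:.0f} mW budget -> {eff:.0f} Mops/mW")
```

Under these numbers the target lands between 80 and 400 Mops/mW, depending on which end of the power budget applies.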

First Generation SDR Processor: SODA
• Our first attempt was the SODA processor
• Designed in 180nm technology
• Built with W-CDMA and 802.11a in mind
• Sub-500mW operation estimated at 90nm

SODA
[Figure: SODA system and PE block diagram. Numbered PE blocks: 1. wide SIMD (512-bit SIMD ALU + Mult, SIMD Reg. File, SIMD Shuffle Network (SSN), SIMD to Scalar (VtoS)), 2. Scalar, 3. Local memory, 4. AGU, 5. DMA]
SODA System:
• Heterogeneous multi-core architecture
• Multi-level scratchpad memories
PE:
• SIMD/Scalar/AGU LIW
• 32-lane 16-bit SIMD
• 16-bit scalar datapath
• Scalar-to-SIMD and SIMD-to-scalar transfers
• Iterative Perfect Shuffle Network
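As a rough illustration of what an iterative perfect shuffle network does, the sketch below models one shuffle pass as interleaving the lower and upper halves of a vector, and builds other permutations by repeating that pass. This is the textbook perfect shuffle, not SODA's actual SSN wiring or control scheme, which the slides do not detail; the function names are hypothetical.

```python
def perfect_shuffle(lanes):
    """Interleave the lower and upper halves of `lanes` (even length expected)."""
    half = len(lanes) // 2
    out = []
    for low, high in zip(lanes[:half], lanes[half:]):
        out.extend((low, high))
    return out


def iterate_shuffle(lanes, passes):
    """Apply the perfect shuffle `passes` times, one pass per iteration."""
    for _ in range(passes):
        lanes = perfect_shuffle(lanes)
    return lanes


if __name__ == "__main__":
    print(perfect_shuffle(list(range(8))))  # [0, 4, 1, 5, 2, 6, 3, 7]
    vec = list(range(32))
    # For 32 = 2**5 lanes, five passes rotate every lane index back to itself.
    print(iterate_shuffle(vec, 5) == vec)   # True
```

The trade-off this illustrates is that a single simple stage is cheap in hardware, but a complex permutation may need several passes through it.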

SODA Summary
[Figure: peak performance (Gops) vs. power (Watts) on log axes, with power-efficiency lines at 1, 10, and 100 Mops/mW and an arrow marked "Better Power Efficiency". Points: Picochip 130nm, SODA 90nm, SODA 180nm, Sandbridge 90nm, TI C6x 90nm, NXP EVP 90nm, ASICs, and the mobile SDR requirements]

Ardbeg SDR Processor
[Figure: Ardbeg system and PE block diagram. Numbered PE blocks: 1. wide SIMD, 2. Scalar & AGU, 3. Memory. System-level blocks: L2 memory, FEC accelerator, control processor, DMAC, peripherals, interconnects]
• Sparse Connected VLIW
• Application Specific Hardware
• Block Floating Point
• 3 Read/2 Write RF for VLIW
• 8, 16, 32 bit fixed point support
• Fused Permute ALU operations
• Combined Scalar/Vector Memory
• 128-lane 8-bit Banyan Network
• Multiple Data Address Accesses

Evolution to Ardbeg: Lessons Learned
• Ardbeg achieved ~3x speedup overall at 30% lower power than SODA
• These improvements came out of a series of design studies
• We will present a few of these studies
  • 1) Benefit of Wide SIMD
  • 2) VLIW on SIMD support
  • 3) Support for Complex Shuffle Network
  • 4) Application Specific Hardware

1) Benefiting from Wide SIMD
[Figure: normalized energy-delay product (left axis, 0 to 1.2) and normalized area (right axis, 0 to 12) vs. SIMD width (8, 16, 32, 64)]
• Increasing SIMD width is still a good idea for SDR
• But area becomes a big concern
• 32-wide 16-bit SIMD at 90nm seems a good fit

2) VLIW Support for Wide SIMD
[Figure: PE pipeline diagram showing the AGUs, Data MEM, SIMD RF, 32-lane SIMD ALU, 128-lane SSN, SIMD scalar trans. unit, scalar RF, and 16-bit scalar ALU, each with EX and WB stages and joined by interconnects]
• VLIW execution on top of the SIMD datapath
• 3 read ports, 2 write ports
  • Shared between SIMD units
• 2-issue SIMD LIW
  • Only supports the most frequently used SIMD op pairs

2) VLIW on SIMD Support
[Table: pairs of SIMD instruction types, rated High, Mid, or Low]

          Mem.   Arith.  Mult.  Shuffle  Trans.  Move  Comp.
Mem.      NA     --      --     --       --      --    --
Arith.    High   NA      --     --       --      --    --
Mult.     High   Mid     NA     --       --      --    --
Shuffle   Low    High    Mid    NA       --      --    --
Trans.    High   Mid     High   Mid      NA      --    --
Move      Low    Low     High   Low      Low     NA    --
Comp.     Low    Low     Low    Low      Low     Low   NA

• There is a distinct set of instructions that execute frequently at the same time
• We want to take advantage of this in order to reduce complexity of VLIW
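To make the pairing idea concrete, the sketch below encodes the lower triangle of the table on this slide and lists the pairs with a given rating. It assumes the ratings describe how often two instruction types execute at the same time, and that the row order matches the column order; which pairs Ardbeg actually supports is not stated in the slides. The names `OPS`, `LOWER_TRIANGLE`, and `pairs_rated` are hypothetical.

```python
# Column order of the table on the slide.
OPS = ["Mem.", "Arith.", "Mult.", "Shuffle", "Trans.", "Move", "Comp."]

# Ratings below the diagonal, row by row (assumed reading of the slide's table).
LOWER_TRIANGLE = {
    "Arith.":  ["High"],
    "Mult.":   ["High", "Mid"],
    "Shuffle": ["Low", "High", "Mid"],
    "Trans.":  ["High", "Mid", "High", "Mid"],
    "Move":    ["Low", "Low", "High", "Low", "Low"],
    "Comp.":   ["Low"] * 6,
}


def pairs_rated(rating):
    """Return (row_op, column_op) pairs whose table entry equals `rating`."""
    pairs = []
    for row_op, ratings in LOWER_TRIANGLE.items():
        for col_op, value in zip(OPS, ratings):
            if value == rating:
                pairs.append((row_op, col_op))
    return pairs


if __name__ == "__main__":
    for row_op, col_op in pairs_rated("High"):
        print(f"{row_op} + {col_op}")
```

Only 6 of the 21 possible pairs are rated High in this reading, which is the kind of skew that makes a restricted 2-issue design attractive.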

2) VLIW on SIMD Support
[Figure: energy-delay product (axis 0 to 1.2) for FIR, CFIR, FFT Radix-2, FFT Radix-4, Viterbi K7, Viterbi K9, and their average, across four register file configurations: 2 Read/2 Write (single issue), 3 Read/2 Write (Ardbeg), 4 Read/4 Write (any two SIMD ops), 6 Read/5 Write (any three SIMD ops)]
• For most kernels, 3 Read/2 Write is the best overall design point

3) Support for Shuffle Network
[Figure: PE pipeline diagram showing the AGU, SIMD Data MEM, Scalar Data MEM, SIMD RF, 32-lane SIMD ALU, 128-lane SSN, SIMD scalar trans. unit, scalar RF, and 16-bit scalar ALU, with a label reading "2 stage 16-lane Banyan network"]
• 7-stage single-cycle SSN
• Banyan network
• 128-lane 8-bit (64-lane 16-bit)
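The stage count on this slide matches the usual multistage-network arithmetic: 128 lanes need log2(128) = 7 stages of 2x2 switches. The sketch below models a butterfly topology, one common member of the banyan family, purely as an illustration; whether Ardbeg's SSN uses this exact wiring is not stated in the slides, and the `swap` control callback is a hypothetical stand-in for the real switch settings.

```python
def butterfly_network(lanes, swap):
    """Route `lanes` through log2(N) stages of 2x2 switches.

    `swap(stage, lane)` returns True when the switch whose lower input is
    `lane` should exchange its two inputs at `stage` (hypothetical control).
    """
    n = len(lanes)
    if n == 0 or n & (n - 1):
        raise ValueError("lane count must be a power of two")
    data = list(lanes)
    for stage in range(n.bit_length() - 1):
        stride = 1 << stage
        for lane in range(n):
            # Each switch pairs two lanes whose indices differ only in bit `stage`.
            if lane & stride == 0 and swap(stage, lane):
                partner = lane | stride
                data[lane], data[partner] = data[partner], data[lane]
    return data


def switch_count(n):
    """Number of 2x2 switches in an n-lane, log2(n)-stage network."""
    return (n // 2) * (n.bit_length() - 1)


def crossbar_crosspoints(n):
    """Number of crosspoints in a full n x n crossbar."""
    return n * n


if __name__ == "__main__":
    # Swapping at every switch flips every index bit, reversing the vector.
    print(butterfly_network(list(range(8)), lambda stage, lane: True))
    print(switch_count(128), crossbar_crosspoints(128))  # 448 16384
```

The switch count grows as N log N while the crossbar grows as N squared, at the price that a banyan-style network cannot realize every permutation in a single pass.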

3) Support for Shuffle Network
[Figure: energy-delay product (axis 0 to 1.2) for 64pt FFT Radix-2, 2048pt FFT Radix-2, 64pt FFT Radix-4, 2048pt FFT Radix-4, and Viterbi K9, across four shuffle networks: 32 Wide Perfect, 64 Wide Perfect, 64 Wide Crossbar, 64 Wide Banyan]
• The 64-wide Banyan network gives crossbar-like performance at close to the energy of a simple iterative interconnect

4) Application Specific Optimizations
• Application specific hardware
  • Turbo coprocessor
  • Block-floating point support
  • Fused Permute-ALU operations
  • Interleaving support
• Trade-off programmability for performance
  • Less “soft” than SODA
  • But more energy efficient for common operations
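For readers unfamiliar with block floating point, the sketch below shows the general idea: a block of fixed-point values shares one exponent, chosen so that the largest value fits in the mantissa width. This is a generic software illustration and assumes nothing about how Ardbeg implements its support in hardware; the function names and the 16-bit default are hypothetical.

```python
def bfp_normalize(block, mantissa_bits=16):
    """Scale a non-empty block of integers so all fit in `mantissa_bits` signed bits.

    Returns (mantissas, shift), where `shift` is the shared exponent.
    """
    limit = (1 << (mantissa_bits - 1)) - 1  # largest positive mantissa
    peak = max(abs(x) for x in block)
    shift = 0
    while (peak >> shift) > limit:
        shift += 1
    # Arithmetic right shift: low-order bits are dropped (rounds toward -inf).
    return [x >> shift for x in block], shift


def bfp_restore(mantissas, shift):
    """Undo the shared scaling; precision lost in normalization is not recovered."""
    return [m << shift for m in mantissas]


if __name__ == "__main__":
    mantissas, shift = bfp_normalize([100000, -40000, 12, 7])
    print(mantissas, shift)               # [25000, -10000, 3, 1] 2
    print(bfp_restore(mantissas, shift))  # [100000, -40000, 12, 4]
```

The appeal for a fixed-point SIMD datapath is that one exponent per block extends dynamic range without per-element floating-point hardware, at the cost of precision for the small values in the block.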

4) Application Specific Optimizations
• Some kernels are common among many different protocols
  • Many protocols use the same error correction algorithms
  • Turbo coding, handled by the Turbo coprocessor, is one of them
• Tradeoff between programmable and ASIC implementations
  • An ASIC implementation is around 5x more efficient than a programmable implementation
  • SODA PE: 2Mbps with 111mW in 90nm
  • ASIC: 2Mbps with 21mW in 90nm
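The "around 5x" figure follows directly from the two data points on this slide. The sketch below works it out as energy per decoded bit; the function name is hypothetical, and the only inputs are the slide's own numbers (111mW and 21mW, both at 2Mbps).

```python
def energy_per_bit_nj(power_mw: float, throughput_mbps: float) -> float:
    """Energy per bit in nJ: mW / Mbps = (1e-3 J/s) / (1e6 bit/s) = 1e-9 J/bit."""
    return power_mw / throughput_mbps


if __name__ == "__main__":
    soda = energy_per_bit_nj(111.0, 2.0)  # SODA PE at 90nm
    asic = energy_per_bit_nj(21.0, 2.0)   # ASIC at 90nm
    print(f"SODA PE: {soda} nJ/bit, ASIC: {asic} nJ/bit")
    print(f"ratio: {soda / asic:.2f}x")
```

At equal throughput the ratio is simply 111/21, or about 5.3x.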

Overall Improvements
[Figure: Ardbeg speedup over SODA per kernel (axis 0 to 4.5, one bar marked 7x), broken down into Baseline SODA, SIMD ALU, SIMD Shuffle, VLIW, and Compiler Optimization. Kernel groups: Filtering, Modulation, Synchronization, Error Correction, each with an average. Kernels: FIR and CFIR with 16, 33, and 65 taps; FFT Rx2 and Rx4 at 64pt and 2048pt; QAM4, QAM16, QAM64; DVB-T Equalizer; DVB-T Chan. Est.; W-CDMA Searcher; 802.11a Interpolator; Bit Intlv 3; Bit Intlv 6; Combiner; Interleaver; Despreader; Descrambler; Viterbi K7; Viterbi K9]
• Ardbeg achieves between ~1.5x and ~7x speedup over SODA on wireless algorithms

Summary of Ardbeg
[Figure: achieved throughput (Mbps) vs. power (Watts) on log axes. Processor labels: SODA, 180nm, ASIC, Sandblaster, TigerSHARC, Pentium M. Protocol points: 802.11a, W-CDMA 2Mbps, W-CDMA data, W-CDMA voice]
• Power vs. throughput for protocols on different processors

Summary of Ardbeg
[Figure: the same achieved throughput (Mbps) vs. power (Watts) plot with Ardbeg added. Processor labels: Ardbeg, SODA, 180nm, ASIC, Sandblaster, TigerSHARC, Pentium M. Protocol points: 802.11a, DVB-H, DVB-T, W-CDMA 2Mbps, W-CDMA data, W-CDMA voice]
• Ardbeg is lower power at the same throughput
• We are getting closer to ASICs

Conclusion
• SODA → Ardbeg
  • Overall ~1.5-7x improvement across multiple wireless algorithms
  • 30% less power than SODA (with turbo also in software)
• A fully programmable research design evolved into a commercial design that is “less soft”
• It is feasible to design programmable solutions that start to approach ASIC efficiency
  • ASICs are locally optimal for single kernels, but combined they create an inefficient system
  • Programmability allows time multiplexing of hardware: less hardware, same amount of work

Questions? Thanks!
