
SWAPs: Re-thinking mobile and base-station architectures

SWAPs: Re-thinking mobile and base-station architectures. Sridhar Rajagopal, VLSI Signal Processing Group, Center for Multimedia Communication, Department of Electrical and Computer Engineering, Rice University, Houston TX 77005. March 23, 2003.


Presentation Transcript


  1. SWAPs: Re-thinking mobile and base-station architectures. Sridhar Rajagopal, VLSI Signal Processing Group, Center for Multimedia Communication, Department of Electrical and Computer Engineering, Rice University, Houston TX 77005. March 23, 2003. This work has been supported in part by Nokia, TI, TATP and NSF.

  2. Future wireless devices (spanning wireless cellular, wireless LAN and Bluetooth/home networks): • High data rate mobile devices with multimedia • Seamless connection across environments and standards • Use the fastest and cheapest available service

  3. Aim of the talk • How do I build such a device? • Challenges • Constraints • Solutions

  4. Trend comparisons

  5. Change in flexibility requirements. [Figure: protocol stack of Application Layer, Network Layer, MAC Layer and Physical Layer, ranging from no change (already flexible) to maximum change (needs to support multiple environments, algorithms and standards).]

  6. Summary of Challenges for • Sophisticated algorithms (GOPs of computation) • 10’s of Mbps, < 500 mW • Flexibility required at physical layer • Multiple algorithms, multiple standards, multiple environments • What we would also like: • Time to market • Rapid evaluation and implementation • Scalable architecture design methodologies

  7. Physical layer of a receiver. [Block diagram: antenna, RF front-end, baseband processing (channel estimation, detection, decoding), higher (MAC/Network/OS) layers.] The receiver is more complex than the transmitter.

  8. Physical layer architecture. [Block diagram: analog RF, analog baseband, audio, A/D, D/A, and a digital baseband containing DSP, ASICs and a controller.] Source: "Evolving Cellular Handset Architectures but a Continuing, Insatiable Desire for DSP MIPs", M. L. McMahan, TI Report SPRA650, March 2000

  9. Architecture trade-offs • Past: more DSP + less ASIC • Current “proposed” solutions: less DSP + more ASICs • Reason: DSPs not powerful enough. Can’t we build better DSPs? [Figure: area-time-power performance versus flexibility, with regions labeled ASIC solutions, intermediate solutions and programmable solutions.]

  10. Can this methodology scale for Baseband increasingly important for real-time and power • Need much more flexibility • Environment-specific sophisticated algorithms • Cannot keep adding co-processors • lose flexibility of a programmable solution • 1 Mbps with 100 MHz processor • 100 cycles per bit to do all your work (GOPs/bit) • Power consumption with bigger color displays, video and more complex algorithms • May have only ~100 mW for baseband
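The "100 cycles per bit" figure on this slide is simply the clock rate divided by the data rate. A minimal sketch of that budget calculation follows; the function name `cycles_per_bit` is an illustrative assumption, not something from the talk.

```c
#include <assert.h>

/* Cycles available per received bit when the processor must keep up
 * with the incoming data in real time.
 * clock_hz is in Hz, bit_rate_bps in bits per second.
 * (Hypothetical helper, named here for illustration only.) */
double cycles_per_bit(double clock_hz, double bit_rate_bps)
{
    assert(clock_hz > 0.0 && bit_rate_bps > 0.0);
    return clock_hz / bit_rate_bps;
}
```

With the slide's numbers, `cycles_per_bit(100e6, 1e6)` gives 100: every operation of estimation, detection and decoding for one bit has to fit in that budget, summed over all functional units that can work in parallel.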

  11. Design me Motivation Now that we know the challenges and constraints,

  12. Design: how do we choose the right algorithms? The right amount of flexibility? Do we build DSPs, ASICs, heterogeneous, reconfigurable? If ASICs, how to build better ASICs? If programmable, how to build better DSPs? If both, how do we mix them better? Answers dependent on • level of flexibility needed • area-time-power architecture tradeoffs

  13. My contributions • “Low-complexity” algorithms for wireless: parallel, fixed-point algorithms for multiuser estimation and detection • ASIC design for wireless using computer arithmetic techniques: dynamic truncation using on-line arithmetic • Programmable architecture design for wireless: Scalable Wireless Application-specific Processors (SWAPs)

  14. Programmable architectures • Current DSPs • Not enough functional units (FUs) for GOPs of computation • Cannot extend to more FUs • Limited Instruction Level Parallelism (ILP) • Limited Subword Parallelism (SP) • Cannot support more registers (register area increases quadratically with FUs) • Compilers: difficult to find ILP as FUs increase

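  15. Solution • Exploit data parallelism (DP) • Lots available in wireless algorithms • Example:

      int i, a[1024], b[1024], c[1024];      // 32 bits
      short int d[1024], e[1024], f[1024];   // 16 bits, packed
      for (i = 0; i < 1024; i++) {           // DP: iterations are independent
          a[i] = b[i] + c[i];                // ILP: the two additions are independent
          d[i] = e[i] + f[i];                // SP: 16-bit operands packed into one word
      }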
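The three kinds of parallelism in the loop on this slide can be made concrete with a short C sketch. Everything here is an assumption for illustration: `add_packed16` emulates in plain C what a subword-parallel add instruction would do in hardware (two 16-bit lanes in one 32-bit word), and `vector_add` is a hypothetical name for the slide's loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* SP: add two words that each hold two 16-bit lanes. Each lane wraps
 * around on overflow and no carry crosses from the low lane into the
 * high lane. */
uint32_t add_packed16(uint32_t x, uint32_t y)
{
    uint32_t lo = (x + y) & 0x0000FFFFu;          /* low lane, carry dropped     */
    uint32_t hi = ((x >> 16) + (y >> 16)) << 16;  /* high lane, overflow shifted out */
    return hi | lo;
}

/* The slide's loop. a, b, c hold 32-bit values (assumed small enough
 * not to overflow); d, e, f hold pairs of packed 16-bit values. */
void vector_add(const int32_t *b, const int32_t *c, int32_t *a,
                const uint32_t *e, const uint32_t *f, uint32_t *d,
                size_t n)
{
    assert(a && b && c && d && e && f);
    for (size_t i = 0; i < n; i++) {      /* DP: no iteration depends on another */
        a[i] = b[i] + c[i];               /* ILP: this add and the next one      */
        d[i] = add_packed16(e[i], f[i]);  /* are independent; SP: two lanes/word */
    }
}
```

Because no iteration reads a result written by another, the loop body can be replicated across as many clusters as there are iterations, which is the property the talk relies on.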

  16. DSP vs. SWAPs. [Figure: a DSP (1 cluster) with internal memory and ILP across its adders (+) and multipliers (*), next to SWAPs (max. clusters) with a stream register file, ILP within each cluster and DP across clusters.]

  17. Imagine Stream Processor. SWAPs builds on the Imagine media processor. [Block diagram: host processor, stream controller, network interface, streaming memory system with four SDRAM banks, stream register file, microcontroller, and ALU clusters 0 to 7.]

  18. SWAPs trade-offs • Same internal memory size as DSPs • Dependent on application, not architecture • Needs more area to support more functional units • Area is less of a constraint than power • Varying levels of DP in applications • Needs reconfiguration!! • Need to turn off unused clusters • More parallelism → lower clock frequency → lower voltage → low power (CV²f + leakage) in spite of larger area
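The last bullet rests on the standard dynamic power relation P = C·V²·f. A small sketch, ignoring leakage, shows why doubling the hardware can still lower power. The capacitance and frequency values in the usage note are made-up assumptions; only the 1.5 V and 1.2 V supply voltages appear elsewhere in the talk.

```c
#include <assert.h>

/* Dynamic (switching) power in watts: P = C * V^2 * f.
 * Leakage is deliberately left out of this sketch. */
double dynamic_power(double cap_farads, double vdd_volts, double freq_hz)
{
    assert(cap_farads >= 0.0 && vdd_volts >= 0.0 && freq_hz >= 0.0);
    return cap_farads * vdd_volts * vdd_volts * freq_hz;
}
```

For example, compare `dynamic_power(1e-9, 1.5, 100e6)` with `dynamic_power(2e-9, 1.2, 50e6)`: twice the switched capacitance at half the clock leaves C·f unchanged, so the drop from 1.5 V to 1.2 V alone scales power by (1.2/1.5)² = 0.64.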

  19. Design methodology. [Flowchart with numbered steps 1 to 8. Boxes: chain of receiver algorithms; low “complexity”, parallel, fixed point; high level language implementation; programmable implementation (example: Pentium, DSP, SWAPs); modular programmable architecture design; Area-Time-Power specs; FPGA, customized, reconfigurable, heterogeneous implementations (example: H-SWAPs); ASIC implementation. Edges labeled “specs: no” and “learn”.]

  20. Choosing the right algorithms: theory. Algorithm research: • Spectral efficiency • Low power (RF). Metrics: • Bit error rate • Frame error rate. [Figure: bit error rate (10^0 down to 10^-8) versus signal to noise ratio, with curves labeled Past, Current, Future and Theory.]

  21. Choosing right algorithms: practice • Refine candidates from theory (using linear algebra / opt.) • lower “complexity”, parallel, fixed-point. Optimization: Area: A, Time: B, Power: A, Energy: A/B. Multi-parameter optimization? “Complexity”: #operations of equivalent type. [Figure: execution time (0 to 80) versus complexity and complexity/parallelism for Original, Candidate A and Candidate B.]

  22. Example : Parallel Viterbi Decoding • Add-Compare-Select (ACS) : trellis interconnect • Re-order for exploiting DP • Parallelism depends on constraint length (#states) • Conventional Traceback – sequential • Use Register Exchange (RE) for parallel solution Exploiting DP in a programmable architecture implies: • Re-order ACS • Re-order RE
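To make the add-compare-select and register-exchange steps concrete, here is a minimal sketch of one trellis stage. It is not the author's kernel and does not include the re-ordering (shuffling) the slide refers to. The assumptions are: states are numbered so that the newest input bit is the least significant bit of the state, the caller supplies branch metrics `bm0[j]` and `bm1[j]` for the two edges entering state `j`, smaller metrics are better, and each survivor register holds at most 64 decoded bits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One ACS stage with register exchange.
 * num_states : power of two, 2^(K-1) for constraint length K
 * pm_old/new : path metrics before/after this stage
 * reg_old/new: survivor registers (decoded bits, newest in the LSB)
 * bm0[j]     : branch metric from predecessor (j >> 1) into state j
 * bm1[j]     : branch metric from predecessor (j >> 1) | (num_states / 2) */
void acs_stage(size_t num_states,
               const uint32_t *pm_old, const uint64_t *reg_old,
               const uint32_t *bm0, const uint32_t *bm1,
               uint32_t *pm_new, uint64_t *reg_new)
{
    assert(num_states >= 2 && (num_states & (num_states - 1)) == 0);
    for (size_t j = 0; j < num_states; j++) {   /* every j is independent: DP */
        size_t p0 = j >> 1;
        size_t p1 = p0 | (num_states >> 1);
        uint32_t m0 = pm_old[p0] + bm0[j];      /* add     */
        uint32_t m1 = pm_old[p1] + bm1[j];
        int take1 = (m1 < m0);                  /* compare */
        pm_new[j] = take1 ? m1 : m0;            /* select  */
        /* Register exchange: copy the winner's decoded history and
         * append the input bit that leads into state j. */
        reg_new[j] = (reg_old[take1 ? p1 : p0] << 1) | (uint64_t)(j & 1u);
    }
}
```

Because the survivor history is carried forward with each state, no sequential traceback pass is needed: after the last stage the register of the best state already holds the decoded bits. The cost is that every stage copies a full register per state, which is part of why the data layout matters on a clustered machine.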

  23. SWAP design • Decide how many clusters • Exploit DP • Look at the for loop () count • Decide what to put within each cluster • Maximize ILP with high functional unit efficiency • Search design space • See how it meets time-area-power constraints

  24. What goes inside a cluster? [Figure: auto-exploration of adders and multipliers for kernel "acskc": instruction count (40 to 160) plotted against #adders (1 to 5) and #multipliers (1 to 5), each point labeled with its functional unit utilization (+,*) in percent.]

  25. Re-ordering for parallel Viterbi. [Figure: (a) trellis and (b) shuffled trellis over states X(0) to X(15).]

  26. Viterbi reconfiguration: DP • Packet 1: constraint length 7 (16 clusters) • Packet 2: constraint length 9 (64 clusters) • Packet 3: constraint length 5 (4 clusters) • Unused clusters can be turned OFF
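The cluster counts on this slide are consistent with a trellis of 2^(K-1) states spread at four states per cluster. That ratio is inferred here from the three numbers on the slide, not stated by the author. Under that assumption, a sketch of the sizing rule looks like this (both function names are hypothetical):

```c
#include <assert.h>

/* Number of trellis states for constraint length K: 2^(K-1). */
unsigned trellis_states(unsigned constraint_length)
{
    assert(constraint_length >= 1 && constraint_length <= 16);
    return 1u << (constraint_length - 1);
}

/* Clusters to keep powered for one packet, assuming a fixed number of
 * states per cluster and a cap at the clusters physically present.
 * The remaining clusters are candidates for being turned off. */
unsigned active_clusters(unsigned constraint_length,
                         unsigned states_per_cluster,
                         unsigned total_clusters)
{
    assert(states_per_cluster > 0 && total_clusters > 0);
    unsigned needed = trellis_states(constraint_length) / states_per_cluster;
    if (needed == 0)
        needed = 1;                       /* always keep at least one cluster */
    return needed < total_clusters ? needed : total_clusters;
}
```

With four states per cluster on a 64-cluster machine, constraint lengths 7, 9 and 5 map to 16, 64 and 4 active clusters, matching the three packets on the slide.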

  27. How to reconfigure? • Move data to appropriate clusters and turn off unused clusters and SRF: significant loss in performance, maximum power savings • Use conditional streams: cannot turn off SRF, comm, scratchpad in clusters; minimal loss in performance • Use mux-demux buffers: can turn off clusters entirely (more power savings); minimal loss in performance

  28. [Figure: timeline of kernels (computation) and memory accesses for three 64-bit packets: packet 1, rate ½, constraint length 7; packet 2, rate ½, constraint length 9; packet 3, rate ½, constraint length 5.]

  29. Viterbi decoding: rate 1/2 at 128 Kbps = 10 MHz. [Figure: frequency needed to attain real-time (in MHz, 10^0 to 10^3) versus number of clusters (10^0 to 10^2), with curves for actual K = 9, K = 7 and K = 5, for regular code and reconfigurable code.]

  30. Viterbi decoding: execution time. [Figure: curves for actual K = 9, K = 7 and K = 5 over 10^0 to 10^2 clusters, with annotations DSP (RE), Ideal DSP C64x (w/o co-proc), DP, SWAP, Virtex II FPGA* (task pipelining, dedicated interconnect), and ASIC/FPGA real-time performance at 128 KHz (1 bit/cycle).] *VITURBO: A reconfigurable architecture for Viterbi and Turbo decoding, M. Vaya, J. R. Cavallaro, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), April 2003, Hong Kong

  31. Salient features of this solution • Any constraint length → 10 MHz at 128 Kbps (handset) • Same code for all constraint lengths • no need to re-compile or load another code • as long as parallelism/cluster ratio is constant • Exploiting parallelism for real-time: • Instruction Level Parallelism (DSP) • Subword Parallelism (DSP) • Data Parallelism (Imagine) • Dynamic Cluster Scaling (SWAP) • Power savings due to dynamic cluster scaling

  32. Expected SWAP power numbers: Viterbi decoding • 64 clusters and 1 multiplier per cluster: • Process: 0.13 micron • Voltage: 1.5 V (to min. leakage when not active) • R-T Frequency: f~10 MHz • Peak Active Power: ~16 mW/MHz (11 mW/MHz if 1.2V) • Area: ~53.7 mm² • 10 MHz, 128 Kbps • ~160 (110) mW for K = 9 • ~53.33 (36.7) mW for K = 7 • ~26.67 (12.5) mW for K = 5 • ASICs : ~10-100 W. *Exploring the VLSI Scalability of Stream Processors, Brucek Khailany et al., Proceedings of the Ninth Symposium on High Performance Computer Architecture, February 8-12, 2003, Anaheim, California, USA, pp. 153-164

  33. Problems • Suitable for handsets? - Not yet! • Still too general • Not low power enough!!! • No special customization for the application • Except for a fixed-point architecture • Generic instruction set • Generic ALUs (though, can be powered down) • Generic inter-cluster communication network • Suitable for base-stations? • Why not – power is not a primary constraint?

  34. Multiuser Estimation-Detection+Decoding. Real-time target: 128 Kbps per user. [Figure: frequency needed to attain real-time (in MHz, 10^1 to 10^5) versus number of clusters (10^0 to 10^2), with curves labeled FAST, MEDIUM and SLOW, and the 32-user 3G base-station and handset operating regions marked.]

  35. Expected power numbers • 32 user base-station with 3 multipliers per cluster and 64 clusters: • Process: 0.13 micron • Voltage: 1.2 V (always active, leakage less important) • R-T Frequency: f~1 GHz • Peak Active Power: ~19.88 mW/MHz (increased *) • Area: ~93.4 mm² • Total Base-station power consumption: • ~19.88 W at 1 GHz for 32 users at 128 Kbps/user
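The totals quoted on the power slides follow from multiplying the peak active power per MHz by the real-time clock frequency. A minimal sketch of that arithmetic is below; the helper name `active_power_mw` is an assumption made for illustration.

```c
#include <assert.h>

/* Peak active power in mW, given a figure in mW per MHz and the clock
 * frequency in MHz needed to attain real-time. */
double active_power_mw(double mw_per_mhz, double freq_mhz)
{
    assert(mw_per_mhz >= 0.0 && freq_mhz >= 0.0);
    return mw_per_mhz * freq_mhz;
}
```

For the base-station case, 19.88 mW/MHz at 1000 MHz gives about 19880 mW, which is the ~19.88 W on this slide. The same rule applied to the 64-cluster Viterbi numbers (16 mW/MHz at 10 MHz) gives the ~160 mW quoted for K = 9.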

  36. H-SWAPs • Trade Data Parallelism for Task Pipelining • Customize each SWAPlet. [Figure: SWAPs (max. clusters and reconfigure) with full DP; a SWAPlet (limit clusters) with internal memory, adders (+), multipliers (*) and limited DP; H-SWAPs (collection of customized SWAPlets), each with limited DP.]

  37. H-SWAPs for Viterbi decoding. Viterbi decoding: • Survivor management: serial • Finding parallel solution for SWAPs: expensive • > 50% of execution time: overhead • Serial solution now possible with H-SWAPs. [Figure: an ACS unit (add, compare, select, with limited DP) feeding a traceback unit (TBU).]

  38. Potential advantages. [Figure: performance of DSP (RE), SWAPs and H-SWAPs relative to ASIC/FPGA real-time performance, with annotations: DP, partial DP + task pipelining, application-specific units, task pipelining, dedicated interconnect.]

  39. Current research • How to trade-off task vs. data parallelism? • Evaluation of specialized inter-cluster communication • Integrating specialized arithmetic units (ACS, on-line) • Area-Time-Power efficiency of Handset SWAPs • Learning to migrate from H-SWAPs to SWAPs • Scale to future systems!!

  40. Future research: efficient algorithms

  41. Future research: architectures Generalized framework and tools for evaluating algorithm-architecture and area-time-power-flexibility trade-offs Some other potential applications • Image processing: • Cameras : variety of compression algorithms • Biomedical applications: • Hearing aids: DSP running on body heat* • Sensor networks *Quote: Gene Frantz, TI Fellow

  42. Conclusions • Exciting times for wireless algorithm and architecture research • More complex algorithms • Higher data rates – meet real-time requirements • Lower power • Low area • Seek to design flexible architectures • learn from ASIC solutions • Inter-disciplinary research needed: • Computer architecture, VLSI, wireless communications, computer arithmetic, compilers
