A programmable communications processor for future wireless systems

Sridhar Rajagopal, Scott Rixner, Joseph R. Cavallaro, Behnaam Aazhang. This work has been supported by Nokia, TI, TATP and NSF.


Presentation Transcript


  1. A programmable communications processor for future wireless systems • Sridhar Rajagopal, Scott Rixner, Joseph R. Cavallaro, Behnaam Aazhang • This work has been supported by Nokia, TI, TATP and NSF

  2. Overview of research at Rice • Center for Multimedia Communications • Behnaam Aazhang (wireless communications) • Joseph R. Cavallaro (VLSI signal processing) http://cmc.rice.edu • Computer Architecture • Scott Rixner (Microprocessor architecture) • Vijay Pai (Simulators, Network Processors) http://www.cs.rice.edu/CS/Architecture

  3. Motivation • [Diagram: wireless mobile device with RF unit, A/D and D/A converters, and a baseband programmable communications processor] • Mobile: switch between standards and between parameters • Base-station: varying number of users with different parameters • Programmability: flexibility is good

  4. Motivation • [Figure: GPP, DSP, FPGA and VLSI placed on performance and flexibility axes]

  5. Lower bounds on + and * for a 500 MHz system • Estimation, detection and decoding in a W-CDMA multiuser system • [Plot: adders/multipliers required to meet real-time (log scale, 10^0 to 10^3) versus number of users (0 to 300); Add and Mul curves for fast fading (estimation every 10 bits), medium fading (estimation every 100 bits), slow fading (estimation every 1000 bits) and data rates]

  6. The Problem • Algorithms well understood at data-flow level • Can design real-time systems in VLSI. • Pushing implementation higher in the chain • Current DSPs not powerful enough for our application • Use an architecture simulator to design our own

  7. Proposed solution • Current solutions to meet real-time: racks of DSPs • Future wireless architectures: programmable processor for 4G wireless systems • [Figure: processor outline with both sides labelled < x cm] • x = 2.5 (W-CDMA BS) • x = 2.0 (W-LAN BS) • x = 1.5 (Mobile handset)

  8. Ph.D. Thesis Outline • [Flowchart boxes: Algorithm (in Matlab); New algorithm; Characteristics? Complexity? (operation count, parallelize?, fixed point?); Real-time (area/power) requirements; Parameter-free architecture design; Processor architecture parameters (# functional units, # registers, # memory ...); Architecture synthesizer; Compiler; Architecture; Code] • [Region labels: New architecture design; Future work]

  9. Advantages of this solution • Fast and smooth transition to future standards that simultaneously meets real-time and other constraints • Avoids re-designing the system from scratch • Joint algorithm–architecture hardware-software co-design • Matlab code can be re-used when new standards are being designed. • Tries to account for data rate increases and future algorithm changes

  10. Past research contributions • Distant past: algorithms (multiuser channel estimation, multiuser detection) • Recent past: VLSI, FPGA and DSP system design (task-partitioning, parallelism, pipelining; conventional arithmetic, on-line arithmetic) • Recent and near future: architecture innovations (functional unit design and usage, IMAGINE)

  11. Contents • Motivation • Parallel algorithms for estimation/detection/decoding • The “Imagine” simulator • Performance comparisons and results

  12. Typical workload representation (Base-station) • Equalization? • FFT • Viterbi decoding • Multiuser channel estimation • Multiuser detection • Viterbi decoding • Turbo decoding • Multiple antenna systems (MIMO) • [Group labels on the slide: Wireless LAN; W-CDMA; Advanced receiver schemes]

  13. Parallel estimation/detection/decoding • Multiuser estimation • replaced matrix inversion by gradient descent • Multiuser detection • Parallel Interference Cancellation (PIC) • Pipelined algorithm that avoids block-based detection • Viterbi decoding • Trellis structures suited for decoding • Register exchange for survivor memory • No traceback latency
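The slide states that matrix inversion was replaced by gradient descent in multiuser estimation. As a minimal sketch of that general idea, and not the authors' implementation, the following solves a small linear system R a = p iteratively. The matrix `R`, vector `p`, step size and iteration count are illustrative assumptions; the sketch also assumes `R` is symmetric positive definite and the step is small enough to converge.

```python
def gradient_descent_solve(R, p, step, iterations):
    """Approximate the solution of R a = p without forming the inverse of R."""
    n = len(p)
    a = [0.0] * n
    for _ in range(iterations):
        # The gradient of 0.5 * a'Ra - p'a is (R a - p).
        residual = [sum(R[i][j] * a[j] for j in range(n)) - p[i] for i in range(n)]
        # Each element update is independent of the others within an iteration.
        a = [a[i] - step * residual[i] for i in range(n)]
    return a


# Hypothetical 2x2 system; its exact solution is a = [0, 2].
R = [[2.0, 0.5], [0.5, 1.0]]
p = [1.0, 2.0]
a = gradient_descent_solve(R, p, step=0.5, iterations=200)
```

Each iteration uses only multiply-accumulate operations, which is one reason an iterative update can map more naturally onto parallel adders and multipliers than an explicit inversion.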

  14. Estimation/Detection (64,32 sizes) • Multiuser estimation: kernels 1, 2, 3 • Massaging matrices for detection: kernels 4, 5 • Multiuser detection: kernels 6, 7
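The detection kernels are described in the deck as Parallel Interference Cancellation (PIC). A minimal textbook-style sketch of hard-decision PIC follows; it is not taken from the authors' kernels. The cross-correlation matrix `R`, the matched-filter outputs `y` and the number of stages are hypothetical values, and equal user amplitudes are assumed.

```python
def sign(x):
    return 1 if x >= 0 else -1


def pic_detect(y, R, stages):
    """Hard-decision parallel interference cancellation.

    y: matched-filter outputs, one per user.
    R: cross-correlation matrix with unit diagonal (assumed known).
    """
    K = len(y)
    bits = [sign(v) for v in y]  # stage 0: plain matched-filter decisions
    for _ in range(stages):
        # Every user subtracts the interference estimated from the previous
        # stage; the K updates do not depend on each other.
        bits = [
            sign(y[k] - sum(R[k][j] * bits[j] for j in range(K) if j != k))
            for k in range(K)
        ]
    return bits


# Hypothetical 3-user example: transmitted bits [1, 1, -1] plus a little noise.
R = [[1.0, 0.1, 0.5], [0.1, 1.0, 0.5], [0.5, 0.5, 1.0]]
y = [0.7, 0.7, 0.1]
matched_filter_only = pic_detect(y, R, stages=0)
after_pic = pic_detect(y, R, stages=2)
```

In this example the matched filter alone decides user 3 incorrectly, and the cancellation stages correct it. Because all users are updated from the previous stage's decisions, the per-user work can be spread across parallel units.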

  15. Trellis for rate ½ code with K = 5 • [Figure: three 16-state trellis diagrams over X(0) to X(15): a. Unsuitable trellis, b. Suitable trellis, c. Shuffled suitable trellis] • Upper bound on parallel clusters for good FU utilization: N/2k • Maximum 8 parallel units for rate ½ with 16 states
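The limit of 8 parallel units for 16 states can be illustrated by grouping the trellis into independent two-state butterflies. The sketch below assumes a common shift-register state numbering, in which the next state is `((s << 1) | u) mod 16`; the numbering used in the slide's figure may differ.

```python
def next_state(state, bit, num_states):
    # Assumed convention: shift the input bit into the low end of the state.
    return ((state << 1) | bit) & (num_states - 1)


def butterflies(num_states):
    """Group states into butterflies: two source states feeding two targets."""
    half = num_states // 2
    groups = []
    for j in range(half):
        sources = (j, j + half)
        targets = (2 * j, 2 * j + 1)
        # Check that both sources reach exactly these two targets.
        for s in sources:
            for u in (0, 1):
                assert next_state(s, u, num_states) in targets
        groups.append((sources, targets))
    return groups


groups = butterflies(16)
all_targets = sorted(t for _, targets in groups for t in targets)
```

Each butterfly reads two path metrics and writes two, with no dependence on the other butterflies in the same trellis stage, so a 16-state trellis offers 8 such units of work per stage.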

  16. Survivor Management in Viterbi • Two techniques • Traceback : Commonly used • Register Exchange • Traceback is good for VLSI architectures • Drawback: Sequential and additional latency • Register exchange is good for programmable solutions • Parallel updates • Packing decoded bits in the register needs to access the entire register
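To make the register-exchange idea concrete, here is a minimal hard-decision Viterbi sketch in which every state carries its own register of decoded bits, so no traceback pass is needed. It is an illustrative sketch, not the Imagine kernel: the small K = 3 code with generators 7 and 5 (octal), the message and the injected bit error are all assumptions chosen to keep the example short.

```python
def parity(x):
    return bin(x).count("1") & 1


def encode(bits, gens, K):
    state, out = 0, []
    for u in bits:
        reg = (u << (K - 1)) | state          # newest bit in the top position
        out.append(tuple(parity(reg & g) for g in gens))
        state = reg >> 1
    return out


def viterbi_register_exchange(received, gens, K):
    n_states = 1 << (K - 1)
    INF = float("inf")
    metric = [0] + [INF] * (n_states - 1)     # encoder starts in state 0
    regs = [[] for _ in range(n_states)]      # one survivor register per state
    for symbol in received:
        new_metric = [INF] * n_states
        new_regs = [None] * n_states
        for s in range(n_states):
            if metric[s] == INF:
                continue
            for u in (0, 1):
                reg = (u << (K - 1)) | s
                expected = tuple(parity(reg & g) for g in gens)
                branch = sum(e != r for e, r in zip(expected, symbol))
                ns = reg >> 1
                if metric[s] + branch < new_metric[ns]:
                    new_metric[ns] = metric[s] + branch
                    # Register exchange: copy the survivor's register and
                    # append the newly decided bit.
                    new_regs[ns] = regs[s] + [u]
        metric, regs = new_metric, new_regs
    best = min(range(n_states), key=lambda s: metric[s])
    return regs[best]                          # decoded bits, no traceback


gens, K = (0b111, 0b101), 3
message = [1, 0, 1, 1, 0, 0, 1, 0] + [0, 0]    # two tail zeros
received = encode(message, gens, K)
received[2] = (1 - received[2][0], received[2][1])   # inject one bit error
decoded = viterbi_register_exchange(received, gens, K)
```

The copy of a whole register at every state update is the cost the slide alludes to: the updates are parallel, but each one touches the entire register.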

  17. Contents • Motivation • Parallel algorithms for estimation/detection/decoding • The “Imagine” simulator • Performance comparisons and results

  18. The IMAGINE architecture • [Block diagram of the Imagine Stream Processor: four SDRAMs, streaming memory system, stream controller, stream register file, microcontroller, ALU clusters 0 to 7, network interface, host processor, network]

  19. Why IMAGINE simulator? • RSIM, SimpleScalar: GPP simulators • Great for media processing algorithms • Has a VLIW-based cluster -- DSP comparisons • A good base architecture : 1024-pt FFT

  20. Simulator knobs that we can turn • Cycle-accurate simulator • Varying number of Functional units and their design • Varying memory, register sizes • Graphical tools to investigate FU utilization, bottlenecks, memory stalls, communication overhead … • Almost anything can be changed, some changes easier than others!

  21. Programming Imagine • 2 level C++ programming • StreamC: • transfers streams of data between main memory and stream register file (SRF) • KernelC: • transfers streams from the SRF to the ALU clusters • Code optimized to the number of ALU clusters and the size of the data
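The two-level split can be pictured with a plain Python analogy. This is not StreamC or KernelC syntax; the function names, the block size and the element-to-cluster interleaving are assumptions made only for illustration. The outer function stands in for the stream level, which moves blocks between memory and the SRF, and the inner function stands in for a kernel, which applies the same operation on every ALU cluster.

```python
NUM_CLUSTERS = 8  # the Imagine diagram shows ALU clusters 0 to 7


def kernel_scale(srf_block, gain):
    """Kernel level: every cluster applies the same operation to its elements."""
    out = [0] * len(srf_block)
    for cluster in range(NUM_CLUSTERS):
        # Assumed interleaving: cluster c handles elements c, c+8, c+16, ...
        for i in range(cluster, len(srf_block), NUM_CLUSTERS):
            out[i] = srf_block[i] * gain
    return out


def stream_program(memory, block_size, gain):
    """Stream level: stage blocks of memory, run the kernel, collect results."""
    result = []
    for start in range(0, len(memory), block_size):
        srf_block = memory[start:start + block_size]   # memory to SRF stand-in
        result.extend(kernel_scale(srf_block, gain))   # SRF to clusters and back
    return result


scaled = stream_program(list(range(20)), block_size=16, gain=2)
```

The analogy also hints at why code is tuned to the number of clusters and the data size: a block that is not a multiple of 8 elements leaves some clusters idle on the last pass.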

  22. Contents • Motivation • Parallel algorithms for estimation/detection/decoding • The “Imagine” simulator • Performance comparisons and results

  23. Kernel 2 (mmult) for 3 +, 2 * • Adders have limited FU utilization • O(N^3) *, O(N^3) + • Multipliers 100% in loop • Divider not being utilized • Replace / with * • [Figure: functional-unit schedule over time with the loop marked; legend: communication (waiting for input), FU unavailable (input ready but FU busy)]
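A rough operation count suggests why the adders cannot be fully used with 3 adders and 2 multipliers. The sketch below is an idealized bound, not simulator output: it assumes a square N x N matrix multiply, one operation per functional unit per cycle, and no communication stalls or register limits. The size N = 32 is illustrative.

```python
def mmult_op_counts(n):
    """Multiplies and adds for an n x n by n x n matrix product."""
    mults = n ** 3
    adds = n * n * (n - 1)
    return mults, adds


def ideal_utilization(n, num_adders, num_multipliers):
    """Return (cycles, adder utilization, multiplier utilization)."""
    mults, adds = mmult_op_counts(n)
    # The busier unit type sets the cycle count (ceiling division).
    cycles = max(-(-mults // num_multipliers), -(-adds // num_adders))
    return cycles, adds / (cycles * num_adders), mults / (cycles * num_multipliers)


three_add_two_mul = ideal_utilization(32, num_adders=3, num_multipliers=2)
three_add_three_mul = ideal_utilization(32, num_adders=3, num_multipliers=3)
```

With roughly equal numbers of multiplies and adds, two multipliers are saturated while three adders wait on them; adding a third multiplier raises the ideal adder utilization, which matches the direction of the slide's observations.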

  24. Kernel 2 (mmult) for 3 +, 3 * • Better adder utilization • Needs sufficient registers for scaling [register allocation may fail] • Code may also need slight tuning of variables for optimization • [Figure: functional-unit schedule over time]

  25. Kernel computational time • Time available at 128 Kbps for each of 32 users at 500 MHz: 4000 cycles
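The 4000-cycle figure appears to be the rounded number of clock cycles in one bit period. The short check below assumes that reading and assumes 128 Kbps means 128,000 bits per second.

```python
def cycles_per_bit(clock_hz, bit_rate_bps):
    """Clock cycles that elapse during one bit period."""
    return clock_hz / bit_rate_bps


budget = cycles_per_bit(500e6, 128e3)   # 3906.25 cycles, roughly 4000
```

Under this reading, the kernels for all 32 users must complete within about 4000 cycles for every bit each user receives.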

  26. [Chart legend] • Memory operations • Kernels (micro-controller executing) • Initialization • Idle time between kernels • Communication overhead

  27. Comparisons with TI C6701 DSPs • [Plot: execution time in seconds (log scale, 10^-6 to 10^-2) versus number of users (0 to 35); curves: single DSP implementation, 2 DSP implementation, our architecture based on Imagine with increasing functional units] • Target data rate: 128 Kbps/user • Efficiency = ?

  28. Future work • Real-time design possible with larger number of functional units but efficiency is the key • Eliminating communication stalls between kernels • Support for matrix transposes and bit-level operations • Power and area constraints • Scalability with data rates – Boundaries of architecture • Handset algorithms

  29. Conclusions • Various programmable architectures can be investigated and implemented QUICKLY for future systems, depending on algorithms and on time, area and power constraints • The insights gained from the design can be applied to DSPs and other processors with constraints on time, area and power. http://www.ece.rice.edu/~sridhar/ sridhar@rice.edu
