
DSP in FPGA


selene

Presentation Transcript


  1. DSP in FPGA

  2. Topics • Signal Processing • FPGA Applications with DSP • DSP Milestones • PDSP Architecture • PDSP vs. FPGA • Example: FIR Filter • DSP on FPGA • State of the Art • Flexibility • Multi-Channel Friendly • Resources • DSP Slice • Multiplication Modes • IP Blocks • IP Block Example: FIR Filter • Considerations • When Not to Use Floating Point • Example FP: Adder Hardware Circuit • Constant Cache • Data-path with Constant Cache • FFT Example • Other Examples: Simulink Equalizer • Routing Challenge • Routing Resources: Altera vs. Xilinx • Example: Matrix Multiplication • Hypothesis and Rules of Thumb • Results • Paper Analysis

  3. Signal Processing • Transform or manipulate analog or digital signals. • Most frequent application: filtering. • DSP has replaced traditional analog signal processing systems in many applications.

  4. FPGA Applications

  5. Milestones • 1965, Cooley and Tukey: efficient algorithm to compute the discrete Fourier transform (DFT). • 1970, PDSP: compute a (fixed-point) “multiply-and-accumulate” in only one clock cycle. • Today's PDSPs: floating-point multipliers, barrel shifters, memory banks, zero-overhead interfaces to A/D and D/A converters.

  6. PDSP Architecture • Single-DSP implementations have insufficient processing power for the complexity of today's systems. • Multiple-chip systems: more costly and more complex, with higher power requirements. • Solution: FPGAs

  7. Managing Resources & Design Reliability

  8. FPGA vs. PDSPs PDSPs: • RISC paradigm with MAC. • Advantage: multistage pipeline architectures can achieve MAC rates limited only by the speed of the array multiplier. • Dominate applications that require complicated algorithms (e.g. several if-then-else constructs). FPGA: • Implement MAC at higher cost. • High-bandwidth signal processing applications through multiple MAC cells on one chip. • Algorithms: CORDIC, NTT or error-correction algorithms. • Dominate more front-end (sensor) applications: FIR filters, CORDIC algorithms, FFTs.

  9. FPGA Advantages • Ability to tailor the implementation to match system requirements. • Multiple-channel or high-speed systems: take advantage of the parallelism within the device to maximize performance. • Control logic implemented in hardware.

  10. FIR Filter Example

  11. FPGA

  12. State of the Art (Xilinx)

  13. Flexibility • How many MACs do you need? • For example, in an FIR filter, FPGAs can meet various throughput requirements.
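
The question "how many MACs do you need?" can be made concrete with a rough estimate. The sketch below is an illustration, not taken from the slides: `fir_direct` shows that a direct-form FIR filter costs one MAC per tap per output sample, and `macs_needed` (a hypothetical helper) estimates how many parallel MAC units are required when each unit completes one MAC per clock cycle. The example rates are assumed values.

```python
import math

def fir_direct(samples, coeffs):
    """Direct-form FIR filter: every output sample costs one MAC per tap."""
    history = [0] * len(coeffs)          # delay line, newest sample first
    out = []
    for x in samples:
        history = [x] + history[:-1]     # shift the new sample in
        acc = 0
        for h, s in zip(coeffs, history):
            acc += h * s                 # one multiply-accumulate (MAC)
        out.append(acc)
    return out

def macs_needed(taps, sample_rate_hz, clock_hz):
    """Parallel MAC units needed, assuming one MAC per unit per clock cycle."""
    return math.ceil(taps * sample_rate_hz / clock_hz)

# Assumed example: a 64-tap filter at 100 MS/s with a 400 MHz clock
# needs 64 * 100e6 / 400e6 = 16 MAC units working in parallel.
units = macs_needed(64, 100_000_000, 400_000_000)
```

At a low sample rate the same filter fits in a single time-shared MAC unit, which is the trade-off between area and throughput the slide alludes to.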

  14. Multi-Channel Friendly • Parallelism enables efficient implementation of multiple channels in a single FPGA. • Many low-sample-rate channels can be multiplexed and processed at a higher rate.

  15. Resources • Challenge: how to use the available resources most efficiently?
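
A minimal sketch of the time-multiplexing idea, assuming the channels arrive sample-interleaved (ch0, ch1, ..., ch0, ch1, ...): one shared FIR engine runs at the combined rate and keeps a separate delay line per channel. The function name `tdm_fir` and the interleaved layout are assumptions made for illustration.

```python
def tdm_fir(interleaved, coeffs, channels):
    """Filter a channel-interleaved stream with one shared FIR engine."""
    taps = len(coeffs)
    # One delay line per channel, so the channels never mix.
    state = [[0] * taps for _ in range(channels)]
    out = []
    for i, x in enumerate(interleaved):
        ch = i % channels                      # channel owning this time slot
        state[ch] = [x] + state[ch][:-1]       # update only that channel's history
        out.append(sum(h * s for h, s in zip(coeffs, state[ch])))
    return out

# Two channels sharing one 2-tap moving-sum filter.
result = tdm_fir([1, 10, 2, 20, 3, 30], [1, 1], channels=2)
```

The output stays interleaved in the same order as the input, so de-multiplexing afterwards is a matter of taking every `channels`-th sample.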

  16. DSP48E1 Slice Flexibility • 2 DSP48E1 slices per tile • Column Structure to avoid routing delay • Pre-adder, 25x18 bit multiplier, accumulator • Pattern detect, logic operation, convergent/symmetric rounding • 638 MHz Fmax
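
One common use of a pre-adder placed in front of the multiplier is a symmetric FIR filter, where taps with equal coefficients are added first so that each pair needs only one multiplication. The sketch below illustrates that arithmetic only; it is not a model of the DSP48E1 itself, and the function name and even-length restriction are assumptions.

```python
def symmetric_fir_output(window, half_coeffs):
    """One output of an even-length symmetric FIR filter.

    window      -- the most recent input samples, len(window) == 2 * len(half_coeffs)
    half_coeffs -- first half of the coefficients; the second half mirrors it
    """
    n = len(window)
    assert n == 2 * len(half_coeffs)
    acc = 0
    for k, h in enumerate(half_coeffs):
        pre = window[k] + window[n - 1 - k]   # pre-adder: combine mirrored taps
        acc += pre * h                        # multiplier followed by accumulator
    return acc

# Coefficients [5, 6, 6, 5] applied with two multiplications instead of four.
y = symmetric_fir_output([1, 2, 3, 4], [5, 6])
```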

  17. Multiplication Modes • Each DSP block in a Stratix device can implement: • Four 18x18-bit multiplications, • Eight 9x9-bit multiplications, or • One 36x36-bit multiplication. • When configured in 36x36 mode, the DSP block can also perform floating-point arithmetic.
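
The relation between the 18x18 and 36x36 modes can be illustrated with the textbook decomposition of a wide product into four narrower partial products. This is a sketch of the arithmetic for unsigned operands; the slide does not describe the internal wiring of the block, so the function below should be read as an assumption-laden illustration.

```python
MASK18 = (1 << 18) - 1

def mul36_from_18(a, b):
    """Unsigned 36x36-bit product built from four 18x18-bit partial products."""
    a_hi, a_lo = a >> 18, a & MASK18      # split each operand into 18-bit halves
    b_hi, b_lo = b >> 18, b & MASK18
    return ((a_hi * b_hi) << 36) \
        + ((a_hi * b_lo + a_lo * b_hi) << 18) \
        + (a_lo * b_lo)

a = (1 << 36) - 1        # largest 36-bit operand
b = 12345678901          # arbitrary 36-bit operand
product = mul36_from_18(a, b)
```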

  18. DSP IP Portfolio • Comprehensive • Constraint Driven

  19. IP Block Example • Overclocking is used automatically to reduce the DSP slice count. • Quick estimates are provided by the IP compiler GUI. • Ensures the best results for your design requirements.

  20. Altera: DFPAU • D-Floating Point Arithmetic Coprocessor. • Replaces C software functions with fast hardware operations, which accelerates system performance. • Uses specialized algorithms to compute arithmetic functions.

  21. Altera: DFPAU

  22. Hardware Circuit for FP Adder • Breaking up a number into exponent and mantissa requires pre- and post-processing. • The circuit comprises: • Alignment (100 ALMs) • Operation (21 ALMs) • Normalization (81 ALMs) • Rounding (50 ALMs) • Normalization and rounding together occupy about half of the circuit area (131 of 252 ALMs). • How to improve this?
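
The four stages listed above can be traced in a small software model. The sketch below is a simplified illustration, not the circuit from the slide: it handles only positive, normalized operands, uses a 24-bit significand as in IEEE 754 single precision, keeps three extra guard/round/sticky bits, and rounds to nearest even. The names `decode`, `encode` and `fp_add` are hypothetical.

```python
import math

P = 24       # significand bits, including the hidden bit
GUARD = 3    # extra guard, round and sticky bits kept during the addition

def decode(x):
    """Split a positive float (assumed exact in P bits) into (exponent, significand)."""
    frac, exp = math.frexp(x)            # x = frac * 2**exp with 0.5 <= frac < 1
    return exp - P, int(frac * (1 << P))

def encode(exp, mant):
    """Rebuild the value mant * 2**exp."""
    return math.ldexp(mant, exp)

def fp_add(a, b):
    (ea, ma), (eb, mb) = a, b
    # 1. Alignment: shift the operand with the smaller exponent to the right.
    if ea < eb:
        (ea, ma), (eb, mb) = (eb, mb), (ea, ma)
    shift = ea - eb
    ma <<= GUARD
    mb <<= GUARD
    sticky = 1 if mb & ((1 << shift) - 1) else 0   # remember bits shifted out
    mb = (mb >> shift) | sticky
    # 2. Operation: plain integer addition of the aligned significands.
    total = ma + mb
    exp = ea - GUARD
    # 3. Normalization: a carry-out needs one right shift.
    if total >= 1 << (P + GUARD):
        total = (total >> 1) | (total & 1)
        exp += 1
    # 4. Rounding: drop the extra bits, round to nearest even.
    rem = total & ((1 << GUARD) - 1)
    total >>= GUARD
    exp += GUARD
    half = 1 << (GUARD - 1)
    if rem > half or (rem == half and total & 1):
        total += 1
        if total == 1 << P:              # rounding overflowed: renormalize
            total >>= 1
            exp += 1
    return exp, total
```

Even in this reduced form, the normalization and rounding steps account for a large share of the code, which mirrors the area breakdown on the slide.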

  23. When Not to Use Floating Point? • Algorithms designed for fixed point. • Greater precision and dynamic range are not helpful because the algorithms are bit exact. • E.g., the frequency-domain transform in video codecs is some form of DCT (Discrete Cosine Transform), designed to be performed on a fixed-point processor and bit exact. • Also, when precision is not as important as speed.

  24. Constant Cache • Some applications load data from memory once and reuse it frequently. • This could become a performance bottleneck. • What can we do? • Copying data to local memory may not be enough, as each work group would have to perform the copy operation. • Solution: create a constant cache that loads data only when it is not already present, regardless of which work group requires the data (e.g., FFT).
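
The load-on-miss behaviour described above can be sketched in a few lines. This is a software analogy under stated assumptions, not the hardware data path: `ConstantCache` and its `memory_loads` counter are hypothetical names, and the backing memory is modelled as a plain list of read-only constants.

```python
class ConstantCache:
    """Read-only cache shared by all work groups; loads a value only on a miss."""

    def __init__(self, backing_memory):
        self._memory = backing_memory    # slow memory holding the constants
        self._lines = {}                 # address -> cached value
        self.memory_loads = 0            # number of accesses that reached memory

    def read(self, address):
        if address not in self._lines:   # miss: fetch once, whoever asks first
            self._lines[address] = self._memory[address]
            self.memory_loads += 1
        return self._lines[address]

# Four work groups read the same eight constants.
constants = list(range(100, 108))
cache = ConstantCache(constants)
for work_group in range(4):
    for address in range(len(constants)):
        cache.read(address)
```

With a per-work-group copy the same access pattern would load each constant four times; with the shared cache each constant is fetched from memory once.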

  25. Datapath with a Constant Cache

  26. Example: FFT • Large computation that can be pre-computed

  27. Equalizer Example

  28. Routing Challenge

  29. Routing Challenge • The designed performance is achieved only when the datasets can be accessed readily from fast on-chip SRAMs. • For large datasets, the main performance bottleneck is the off-chip memory bandwidth. • With DRAM, data can be processed in stages, with only the portion of the dataset that fits on chip operated on at a time. • Available memory bandwidth determines performance.

  30. Routing Resources • Xilinx: more local routing resources. Synergistic with DSP because most DSP algorithms process data locally. • Altera: wide buses. Also has value, because wide data vectors of typically 16 to 32 bits must be moved to the next DSP block.

  31. Example: Matrix Multiplication • Double-precision FP cores (64 bits). • Matrix operations require all matrix element calculations to complete at the same time. • These parallelized or “vector” operations run at the clock speed of the slowest FP function in the FPGA.

  32. Routing Challenge • Hypothesis (constrained performance prediction): • Estimated 15% of logic unusable (due to data path routing, routing constraints, etc.) • Estimated 33% decrease in FP function clock speed • Extra 24,000 ALUs for local SRAM memory controller and processor interface • 39 adders (+), 39 multipliers (×) • Clock speed: 200 MHz • Performance: 15.7 GFLOPS • Peak: 300 MHz, 25.5 GFLOPS

  33. Routing Challenge • Considerations: • The latency of transferring the A and B matrices from the microprocessor to local FPGA SRAM is not included in the benchmark time. • Challenge when using all double-precision FP cores: feeding them with data on every clock cycle. With double-precision 64-bit data and many parallel FP arithmetic cores, wide internal memory interfaces are needed.

  34. Routing Challenge: Results • Average sustained throughput: 88 percent. • 40 multiplier cores and 40 adder-tree cores produce a result every clock cycle. • Five additional adder cores are used for the blocking implementation: one value per clock cycle. • GFLOPS calculation: 200 MHz * 81 operators * 88 percent duty cycle = 14.25 GFLOPS. • Lower than predicted because of the time needed to read and write values to the external SRAM. • With multiple SRAM banks providing higher memory bandwidth, the result would be closer to the predicted 15.7 GFLOPS. • Power: the expected 15 GFLOPS of the Stratix II EP2S180 FPGA running at 30 W is close to the sustained performance of a 60-W, 3-GHz Intel Woodcrest CPU.
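
The sustained-throughput figure follows directly from the three numbers quoted on the slide. The short calculation below only reproduces that arithmetic; the function name `sustained_gflops` is a hypothetical helper.

```python
def sustained_gflops(clock_hz, operators, duty_cycle):
    """Sustained throughput: one operation per operator per active clock cycle."""
    return clock_hz * operators * duty_cycle / 1e9

# Values from the slide: 200 MHz, 81 operators, 88 percent duty cycle.
measured = sustained_gflops(200e6, 81, 0.88)
# Ratio to the constrained prediction of 15.7 GFLOPS.
ratio_to_prediction = measured / 15.7
```

The product is 14.256 GFLOPS, which the slide quotes as 14.25, or roughly 91 percent of the 15.7 GFLOPS prediction.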

  35. FPGA implementations of fast Fourier transforms for real-time signal & image processing • I.S. Uzun, A. Amira and A. Bouridane

  36. Functional block diagram of 1-D FFT processor architecture

  37. AGU: Radix-2 DIF FFT (pseudocode; source: IEE Proc.-Vis. Image Signal Process., Vol. 152, No. 3, June 2005, p. 295)
      w_s := 1
      for stage := log2(N) to 1 step -1 {  \\ stage loop
        m := 2^stage
        is := m/2
        w_index0 := 0
        for group := 0 to n-m step m {  \\ group loop
          for bfi := 0 to is-1 {  \\ butterfly loop
            Index0 := r + j
            Index1 := Index0 + is;
          }
          w_index0 := w_index0 + w_s;
        }
        w_s := w_s << 1
      }
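
A runnable sketch of radix-2 decimation-in-frequency addressing, written from the standard textbook formulation rather than copied from the slide: here the first butterfly index is taken as `group + bfi` and the twiddle index as `bfi * w_step`, both of which are assumptions where the transcript is ambiguous. The twiddle factors are pre-computed once into a table, and `dif_butterfly_addresses` and `fft_dif` are hypothetical names.

```python
import cmath

def dif_butterfly_addresses(n):
    """Yield (index0, index1, twiddle_index) for every radix-2 DIF butterfly."""
    w_step = 1                                        # twiddle stride, doubles per stage
    for stage in range(n.bit_length() - 1, 0, -1):    # log2(n) down to 1
        m = 1 << stage                                # butterfly group size
        half = m // 2                                 # distance between the two inputs
        for group in range(0, n, m):
            for bfi in range(half):
                index0 = group + bfi
                yield index0, index0 + half, bfi * w_step
        w_step <<= 1

def fft_dif(x):
    """In-place radix-2 DIF FFT; len(x) must be a power of two (at least 2)."""
    n = len(x)
    data = list(x)
    # Twiddle factors are pre-computed once and then only looked up.
    twiddle = [cmath.exp(-2j * cmath.pi * k / n) for k in range(n // 2)]
    for i0, i1, w in dif_butterfly_addresses(n):
        a, b = data[i0], data[i1]
        data[i0] = a + b
        data[i1] = (a - b) * twiddle[w]
    # DIF leaves the results in bit-reversed order; undo that here.
    bits = n.bit_length() - 1
    return [data[int(format(k, f"0{bits}b")[::-1], 2)] for k in range(n)]

spectrum = fft_dif([1, 2, 3, 4, 0, 0, 0, 0])
```

In hardware the address generator and the butterfly unit would run as separate pipelined blocks; the generator here plays the role of the AGU and `fft_dif` the role of the data path.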