Explore the background, core algorithms, and implementations of DFT and FFT in hardware and software. Understand the computation differences, advantages, and the butterfly operations in FFT. Discover the speed and efficiency boost using hardware, FPGA applications, and mitigation strategies for complex hardware issues. Learn about the challenges of software implementations and the results of testing FFT algorithms in C++ and VHDL. Delve into product offerings from DSP, FPGA, and microcontroller vendors, as well as libraries like FFTW, AMD Math Core, and Intel Library. Published results showcase the performance of Radix 4 and Radix 2 FFT versions in terms of processing time and resource utilization.
Background
• Core algorithm
  • Original algorithm: the DFT (Discrete Fourier Transform), O(n²) complexity
  • Newer algorithm: the FFT (Fast Fourier Transform), O(n log₂ n) for the radix-2 form; the exact cost depends on the implementation
DFT Computation
• A summation over the whole input array for every single element of the output array: N outputs × N terms = N² complex multiply-adds
• A very computationally inefficient algorithm to implement
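To make the N² cost concrete, here is a minimal sketch of the direct DFT in C++. The name `naive_dft` and the `cd` alias are chosen for this illustration; the nested loops are the "summation over the whole input array for every output element" described above.

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cd = std::complex<double>;

// Direct DFT: X[k] = sum_{n=0}^{N-1} x[n] * exp(-2*pi*i*k*n/N).
// Each of the N outputs sums over all N inputs, so the cost is O(N^2).
std::vector<cd> naive_dft(const std::vector<cd>& x) {
    const std::size_t N = x.size();
    const double PI = std::acos(-1.0);
    std::vector<cd> X(N);
    for (std::size_t k = 0; k < N; ++k) {          // one pass per output element
        cd sum = 0.0;
        for (std::size_t n = 0; n < N; ++n) {      // ...over the whole input array
            // Reduce k*n modulo N so the angle stays small and accurate.
            const double angle = -2.0 * PI * double(k * n % N) / double(N);
            sum += x[n] * std::polar(1.0, angle);
        }
        X[k] = sum;
    }
    return X;
}
```

For example, `naive_dft({1, 2, 3, 4})` yields 10, -2+2i, -2 and -2-2i (up to rounding), after 16 complex multiply-adds.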
FFT Computation
• A much more computationally efficient algorithm
• Works on the divide-and-conquer principle: an N-point DFT is split into two N/2-point DFTs (even- and odd-indexed samples), recursively
• Published by Cooley and Tukey in 1965 [2]
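A minimal sketch of the divide-and-conquer idea, assuming a radix-2 decimation-in-time split and a power-of-two length. The function name `fft_recursive` is hypothetical; this is the textbook recursive form, not necessarily the implementation profiled in this presentation.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cd = std::complex<double>;

// Recursive radix-2 decimation-in-time FFT.
// Assumes x.size() is a power of two.
std::vector<cd> fft_recursive(const std::vector<cd>& x) {
    const std::size_t N = x.size();
    assert(N > 0 && (N & (N - 1)) == 0 && "length must be a power of two");
    if (N == 1) return x;                        // a 1-point DFT is the sample itself

    // Divide: separate even- and odd-indexed samples.
    std::vector<cd> even(N / 2), odd(N / 2);
    for (std::size_t i = 0; i < N / 2; ++i) {
        even[i] = x[2 * i];
        odd[i] = x[2 * i + 1];
    }
    // Conquer: two half-size transforms.
    const std::vector<cd> E = fft_recursive(even);
    const std::vector<cd> O = fft_recursive(odd);

    // Combine: N/2 butterflies, each using the twiddle factor W_N^k.
    const double PI = std::acos(-1.0);
    std::vector<cd> X(N);
    for (std::size_t k = 0; k < N / 2; ++k) {
        const cd w = std::polar(1.0, -2.0 * PI * double(k) / double(N));
        X[k] = E[k] + w * O[k];
        X[k + N / 2] = E[k] - w * O[k];
    }
    return X;
}
```

Each level of recursion does O(N) work and there are log₂ N levels, which is where the O(N log₂ N) cost comes from.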
FFT Butterfly Operations
• Computations are arranged as "butterflies"
• First applied to successive (adjacent) pairs of input data
• Then to pairs two elements apart; every stage still has N/2 butterflies, but they fall into half as many groups
• Then to pairs four elements apart, in half as many groups again
• …
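The Butterfly
• Simple operations repeated many times

$$X[n] = x_e[n] + W_N^n\,x_o[n]$$
$$X[n+N/2] = x_e[n] - W_N^n\,x_o[n]$$

where $W_N^n = e^{-j2\pi n/N}$ is the twiddle ("W") factor and $x_e[n]$, $x_o[n]$ are the butterfly's two inputs, taken from the even- and odd-indexed halves.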
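A single butterfly can be sketched as below (the name `butterfly` is hypothetical). It shows why the operation is cheap: one complex multiplication by the twiddle factor, whose product is shared by one addition and one subtraction.

```cpp
#include <complex>
#include <utility>

using cd = std::complex<double>;

// One radix-2 butterfly. xe and xo are the two inputs, w is the twiddle
// factor W_N^n. Returns {X[n], X[n + N/2]}.
std::pair<cd, cd> butterfly(cd xe, cd xo, cd w) {
    const cd t = w * xo;          // the single complex multiplication
    return {xe + t, xe - t};      // one addition, one subtraction
}
```

For instance, `butterfly(1, 2, -i)` gives 1-2i and 1+2i.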
8-point FFT Demonstration: The Entire Calculation
[Figure: 8-point FFT signal-flow graph from the input array to the output; legend: multiplication by W factor, addition]
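The 8-point signal-flow graph corresponds to the iterative, in-place form of the radix-2 FFT. In the minimal sketch below (function name `fft_inplace` is hypothetical), the inputs are first put in bit-reversed order, then log₂ N stages of N/2 butterflies run with the pair spacing doubling each stage (1, 2, 4 for N = 8), matching the slide's "successive pairs, then alternating pairs, then every fourth element".

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <utility>
#include <vector>

using cd = std::complex<double>;

// Iterative in-place radix-2 decimation-in-time FFT.
// Assumes a.size() is a power of two.
void fft_inplace(std::vector<cd>& a) {
    const std::size_t N = a.size();
    assert(N > 0 && (N & (N - 1)) == 0 && "length must be a power of two");

    // Bit-reversal permutation (for N = 8, index 1 = 001b swaps with 4 = 100b).
    for (std::size_t i = 1, j = 0; i < N; ++i) {
        std::size_t bit = N >> 1;
        for (; j & bit; bit >>= 1) j ^= bit;
        j ^= bit;
        if (i < j) std::swap(a[i], a[j]);
    }

    const double PI = std::acos(-1.0);
    for (std::size_t half = 1; half < N; half *= 2) {       // one stage per pass
        const std::size_t span = 2 * half;                   // butterfly group size
        for (std::size_t start = 0; start < N; start += span) {
            for (std::size_t k = 0; k < half; ++k) {
                const cd w = std::polar(1.0, -2.0 * PI * double(k) / double(span));
                const cd t = w * a[start + k + half];       // multiplication by W factor
                a[start + k + half] = a[start + k] - t;     // additions
                a[start + k] = a[start + k] + t;
            }
        }
    }
}
```

For N = 8 this performs 3 stages × 4 butterflies = 12 butterflies, compared with 64 multiply-adds for the direct DFT.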
Why Hardware?
• Even more speed for the FFT
• Extremely parallelizable
• A whole layer (stage) can be done in two FPGA clock cycles:
  • 1 multiply cycle
  • 1 add cycle
  • (Assuming sufficient multipliers)
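A back-of-the-envelope model of the fully parallel case, under the slide's own assumption. Here "sufficient multipliers" is taken to mean one complex multiplier per butterfly (N/2 per stage), ignoring that some twiddle factors are trivial and ignoring pipelining and clock-rate limits. The names `ParallelEstimate` and `parallel_fft_estimate` are hypothetical.

```cpp
#include <cstddef>

// Estimate for an N-point radix-2 FFT where every butterfly of a stage runs
// at once: each stage costs 1 multiply cycle + 1 add cycle.
struct ParallelEstimate {
    std::size_t stages;        // log2(N)
    std::size_t cycles;        // 2 cycles per stage
    std::size_t multipliers;   // complex multipliers to do a stage at once
};

// Assumes n is a power of two.
ParallelEstimate parallel_fft_estimate(std::size_t n) {
    std::size_t stages = 0;
    for (std::size_t m = n; m > 1; m /= 2) ++stages;   // log2(n)
    return {stages, 2 * stages, n / 2};
}
```

Under these assumptions, an 8-point FFT needs 3 stages, 6 cycles and 4 multipliers; a 1024-point FFT needs 10 stages and 20 cycles, but 512 complex multipliers, which is why "sufficient multipliers" is a strong condition.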
Hardware Problems
• Complexity
• Input speed
• Output speed
• If the FPGA computes in 24.4 ns but takes 20 s to transfer the input data (and another 20 s for the output), what is gained?
  • 24.4 ns + 20 s + 20 s ≈ 40 s!
Mitigation of Hardware Problems
• Use a faster bus
  • AMD Opteron's HyperTransport
  • 20.8 GB/s (166.4 Gb/s) per link (version 3)
  • Modules that fit into an AMD 64-bit Opteron socket:
    • http://www.drccomputer.com/pages/modules.html (Xilinx-based module) [3]
    • http://www.xtremedatainc.com/xd1000_brief.html (Altera-based module) [4]
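To see how much a faster bus helps, the sketch below adds transfer time to compute time. It assumes, optimistically, that the 20.8 GB/s figure is sustained in each direction, and that each point is single-precision complex (two 32-bit floats, 8 bytes). The function names are hypothetical.

```cpp
#include <cstddef>

// Time to move `bytes` over a bus sustaining `bytes_per_second`
// (assumed identical in both directions).
double transfer_seconds(std::size_t bytes, double bytes_per_second) {
    return static_cast<double>(bytes) / bytes_per_second;
}

// Input transfer + FPGA compute + output transfer, in seconds.
double end_to_end_seconds(std::size_t n_points, std::size_t bytes_per_point,
                          double bytes_per_second, double compute_seconds) {
    const double one_way =
        transfer_seconds(n_points * bytes_per_point, bytes_per_second);
    return one_way + compute_seconds + one_way;
}
```

With 1024 points × 8 bytes at 20.8 × 10⁹ B/s, each transfer takes about 394 ns, so `end_to_end_seconds(1024, 8, 20.8e9, 24.4e-9)` is roughly 812 ns. The bus still dominates the 24.4 ns computation, but by a factor of about 30 rather than the ~10⁹ of the 20 s example.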
Mitigation of Hardware Problems
• Put the FPGA on the same die as the DSP
  • Needs silicon vendor support
  • The FPGA can access memory over a very wide bus (e.g. 128 bits per cycle)
• Implement the entire project in the FPGA
  • Time-consuming to program
  • Possibly insufficient room on the FPGA
8-point FFT Demonstration: In Hardware
[Figure: 8-point FFT signal-flow graph as computed in hardware; legend: input array, output, multiplication by W factor, addition]
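What makes the hardware version fast is that butterflies within a stage are independent: each reads only the previous stage's values. The sketch below models one stage as a register bank (`in` → `out`) and deliberately evaluates the butterflies in reverse order to show the order does not matter. The name `hardware_stage` is hypothetical, and the first stage's input is assumed to already be in bit-reversed order.

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cd = std::complex<double>;

// One FFT stage computed only from the previous stage's outputs.
// `half` is the pair spacing (1, 2, 4, ...). Assumes in.size() is a power of two.
std::vector<cd> hardware_stage(const std::vector<cd>& in, std::size_t half) {
    const std::size_t N = in.size();
    const std::size_t span = 2 * half;
    const double PI = std::acos(-1.0);
    std::vector<cd> out(N);
    // Reverse order on purpose: in hardware, all N/2 butterflies fire together.
    for (std::size_t b = N / 2; b-- > 0;) {
        const std::size_t start = (b / half) * span;   // first index of the group
        const std::size_t k = b % half;                // position within the group
        const cd w = std::polar(1.0, -2.0 * PI * double(k) / double(span));
        const cd t = w * in[start + k + half];
        out[start + k] = in[start + k] + t;
        out[start + k + half] = in[start + k] - t;
    }
    return out;
}
```

Feeding {1, 3, 2, 4} (the bit-reversed order of {1, 2, 3, 4}) through `hardware_stage(…, 1)` and then `hardware_stage(…, 2)` gives 10, -2+2i, -2, -2-2i, the same result as a sequential FFT.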
Why Not Software?
• Butterflies must be executed largely one after another
• Only limited parallelism is available, even on a DSP such as the TigerSHARC
• Each butterfly can be done in 2 cycles (after optimization)
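Using the slide's figure of 2 cycles per optimized butterfly, a purely sequential radix-2 FFT can be estimated as below. The name `sequential_fft_cycles` is hypothetical, and the model ignores loads, stores, bit reversal and twiddle generation.

```cpp
#include <cstddef>

// Sequential cycle estimate: log2(N) stages of N/2 butterflies each,
// `cycles_per_butterfly` cycles apiece. Assumes n is a power of two.
std::size_t sequential_fft_cycles(std::size_t n,
                                  std::size_t cycles_per_butterfly = 2) {
    std::size_t stages = 0;
    for (std::size_t m = n; m > 1; m /= 2) ++stages;   // log2(n)
    const std::size_t butterflies = (n / 2) * stages;
    return butterflies * cycles_per_butterfly;
}
```

For N = 8 this gives 12 butterflies and 24 cycles; for N = 1024, 5120 butterflies and 10 240 cycles, versus 20 cycles for a fully parallel hardware layout under its "sufficient multipliers" assumption.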
Results of Testing
• Linear profiling of the FFT algorithm in C++
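A minimal wall-clock timing harness of the kind one might use to profile an FFT in C++ is sketched below. It is an assumption for illustration, not the harness behind these results; the name `average_microseconds` is hypothetical, and a careful profile would also warm caches, fix the CPU clock and report variance.

```cpp
#include <chrono>
#include <cstddef>

// Runs f() `reps` times (reps > 0) and returns the average time per call
// in microseconds, measured with a monotonic clock.
template <class F>
double average_microseconds(F&& f, std::size_t reps) {
    using Clock = std::chrono::steady_clock;
    const auto start = Clock::now();
    for (std::size_t i = 0; i < reps; ++i) f();
    const std::chrono::duration<double, std::micro> elapsed = Clock::now() - start;
    return elapsed.count() / static_cast<double>(reps);
}
```

A typical call wraps the transform under test in a lambda, e.g. `average_microseconds([&] { auto X = my_fft(input); }, 1000)`, where `my_fft` and `input` stand for whatever FFT routine and data are being profiled.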
Results of Testing
• Profiling of VHDL on the FPGA
  • A butterfly takes 24.377 ns to execute
  • 62% is computation, 38% is routing on the FPGA
Product Offerings
• Most DSP vendors
• Many FPGA vendors (IP – Intellectual Property cores)
• Microcontroller vendors (e.g. Blackfin)
• FFTW – The Fastest Fourier Transform in the West
• AMD Math Core Library
• Intel Library
• Highly optimized for the expected hardware
Published Results
• The Radix-4 version delivers a 1K-point complex FFT processing time of 25 microseconds at a 200 MHz system speed and uses only about 10 percent of the resources of a mid-range Stratix device. The Radix-2 version is half the size of the Radix-4 and offers a 1K-point complex processing time of 50 microseconds at 200 MHz. Additional versions of the new cores are under development. [6]
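For context, the published timings can be converted into clock cycles and set beside the number of butterflies each radix needs for 1K points. This is plain arithmetic on the quoted figures, not a claim about how the cores are built internally; the function names are hypothetical.

```cpp
#include <cmath>
#include <cstddef>

// Radix-r butterflies in an N-point FFT: log_r(N) stages of N/r each.
// Assumes n is an exact power of `radix`.
std::size_t butterfly_count(std::size_t n, std::size_t radix) {
    std::size_t stages = 0;
    for (std::size_t m = n; m > 1; m /= radix) ++stages;
    return (n / radix) * stages;
}

// Clock cycles implied by a processing time at a given clock frequency.
long long cycles_at(double seconds, double clock_hz) {
    return std::llround(seconds * clock_hz);
}
```

At 200 MHz, 25 µs is 5000 cycles and 50 µs is 10 000 cycles; a 1024-point transform needs 1280 radix-4 butterflies (5 stages of 256) or 5120 radix-2 butterflies (10 stages of 512).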
References
[1] Signals, Systems and Transforms.
[2] James W. Cooley and John W. Tukey, "An algorithm for the machine calculation of complex Fourier series," Math. Comput. 19, 297–301 (1965).
[3] http://www.drccomputer.com/pages/modules.html (Xilinx-based module)
[4] http://www.xtremedatainc.com/xd1000_brief.html (Altera-based module)
[5] http://www.amd.com/us-en/Processors/DevelopWithAMD/0,,30_2252_2353,00.html
[6] http://www.us.design-reuse.com/news/news5650.html
[7] http://www.4dsp.com/fft.htm