This project aims to validate and improve the efficiency of complex multiplications in FFT algorithms, comparing results obtained using the Intel Compiler. We analyze different loop structures and the impact of bit reversal on performance. The inner loop optimizations reduce the number of multiplications significantly, demonstrating a reduction from 11 to just 3 multiplications. We also tabulate results from various iterations for different data sizes, providing comparative insights. Further improvements are suggested including a study of the FFTW implementation and considerations for faster digit and twiddle computations.
FFT Accelerator Project Date: February 23, 2007 Rohit Prakash (2003CS10186) Anand Silodia (2003CS50210)
Current Objectives • Validate the number of complex multiplications • Run the code with the Intel compiler and compare the results: • For a single run • For multiple runs • Tabulate all the results • Analyse these using VTune
Number of Complex Multiplications • Our results • (11/4)*n*log4(n) = 8960 • Result reported online • (3/4)*n*log4(n) = 3840 • The inner loop is trivial and does not require any “complex multiplications”
Inner loop of our Algorithm • T ← A[k+j] • U ← w*A[k+j+m/4] • V ← w*w*A[k+j+m/2] • X ← w*w*w*A[k+j+3*m/4] • A[k+j] ← T+U+V+X • A[k+j+m/4] ← T+i*U-V-i*X • A[k+j+2m/4] ← T-U+V-X • A[k+j+3m/4] ← T-i*U-V+i*X • w ← w*w_m Total number of multiplications in this loop: 11
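The loop above can be sketched in Python as follows. This is an illustration and not the authors' code: the function names, the base-4 digit-reversed input ordering, and the choice w_m = exp(2πi/m) (which matches the signs of the i terms on the slide) are all assumptions. The two counters are also an assumed reading of how the total of 11 is composed: 6 products for U, V and X, 1 twiddle update, and 4 multiplications by i.

```python
import cmath

def digit_reverse4(i, digits):
    """Reverse the base-4 digits of i (assumed input ordering for the in-place passes)."""
    r = 0
    for _ in range(digits):
        r = (r << 2) | (i & 3)
        i >>= 2
    return r

def times_i(z):
    """Multiply by i, written as a swap so it can be tallied separately."""
    return complex(-z.imag, z.real)

def fft4_running_twiddle(a):
    """Radix-4 transform with the twiddle w updated inside the inner loop.

    Returns (A, general_mults, i_mults)."""
    n = len(a)
    digits = 0
    while 4 ** digits < n:
        digits += 1
    assert 4 ** digits == n, "length must be a power of 4"
    A = [a[digit_reverse4(i, digits)] for i in range(n)]
    general = by_i = 0
    m = 4
    while m <= n:
        q = m // 4
        w_m = cmath.exp(2j * cmath.pi / m)  # assumed sign convention
        for k in range(0, n, m):
            w = 1 + 0j
            for j in range(q):
                T = A[k + j]
                U = w * A[k + j + q]              # 1 multiplication
                V = w * w * A[k + j + 2 * q]      # 2 multiplications
                X = w * w * w * A[k + j + 3 * q]  # 3 multiplications
                iU, iX = times_i(U), times_i(X)
                A[k + j] = T + U + V + X
                A[k + j + q] = T + iU - V - iX
                A[k + j + 2 * q] = T - U + V - X
                A[k + j + 3 * q] = T - iU - V + iX
                w = w * w_m                       # 1 multiplication
                general += 7
                # The slide writes i*U and i*X twice each, so 4 are tallied
                # even though this sketch reuses the two products.
                by_i += 4
        m *= 4
    return A, general, by_i
```

For n = 1024 there are 5 stages of 256 butterflies each, so this tally gives 7 × 1280 = 8960 general multiplications, or 11 × 1280 = 14080 once the four multiplications by i are included.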
New Inner loop of our Algorithm • T ← A[k+j] • U ← twiddle[k]*A[k+j+m/4] • V ← twiddle[2*k]*A[k+j+m/2] • X ← twiddle[3*k]*A[k+j+3*m/4] • A[k+j] ← T+U+V+X • A[k+j+m/4] ← T+i*U-V-i*X • A[k+j+2m/4] ← T-U+V-X • A[k+j+3m/4] ← T-i*U-V+i*X Total number of multiplications in this loop: 3 (3/4)*n*log4(n) = 3840
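A minimal sketch of the table-driven loop, under stated assumptions: the table holds exp(2πi·t/n) for t = 0..n-1, the input is placed in base-4 digit-reversed order, and the table index t = j·(n/m) is this sketch's own choice (the slide writes the index as k, whose exact definition is not given). Multiplications by i are done as a real/imaginary swap and are not counted, so each butterfly costs three complex multiplications.

```python
import cmath

def digit_reverse4(i, digits):
    """Reverse the base-4 digits of i (assumed input ordering)."""
    r = 0
    for _ in range(digits):
        r = (r << 2) | (i & 3)
        i >>= 2
    return r

def times_i(z):
    """Multiply by i without a complex multiplication."""
    return complex(-z.imag, z.real)

def fft4_table(a):
    """Radix-4 transform using a precomputed twiddle table.

    Returns (A, mults), where mults counts twiddle multiplications only."""
    n = len(a)
    digits = 0
    while 4 ** digits < n:
        digits += 1
    assert 4 ** digits == n, "length must be a power of 4"
    # Hypothetical table layout; only indices below 3n/4 are actually read.
    twiddle = [cmath.exp(2j * cmath.pi * t / n) for t in range(n)]
    A = [a[digit_reverse4(i, digits)] for i in range(n)]
    mults = 0
    m = 4
    while m <= n:
        q = m // 4
        step = n // m  # maps the stage-local index j onto the length-n table
        for k in range(0, n, m):
            for j in range(q):
                t = j * step
                T = A[k + j]
                U = twiddle[t] * A[k + j + q]
                V = twiddle[2 * t] * A[k + j + 2 * q]
                X = twiddle[3 * t] * A[k + j + 3 * q]
                mults += 3
                iU, iX = times_i(U), times_i(X)
                A[k + j] = T + U + V + X
                A[k + j + q] = T + iU - V - iX
                A[k + j + 2 * q] = T - U + V - X
                A[k + j + 3 * q] = T - iU - V + iX
        m *= 4
    return A, mults
```

For n = 1024 the counter returns 3 × 1280 = 3840, which agrees with (3/4)*n*log4(n). The count includes the trivial twiddles equal to 1 at j = 0, as the formula does.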
Stuff we tried • Improved the “bit reversal” • Better than last time • Though still inefficient (O(n log n)), it runs faster than the previous implementation • Many faster algorithms exist
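For reference, a straightforward O(n log n) bit-reversal permutation looks roughly like the sketch below. It is not the authors' implementation, whose details are not given here; the function names are hypothetical. Each index is reversed bit by bit in O(log n) steps, and each pair is swapped once.

```python
def bit_reverse(i, bits):
    """Return i with its lowest `bits` bits in reversed order."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r

def bit_reverse_permute(a):
    """Permute list a in place into bit-reversed order; len(a) must be a power of 2."""
    n = len(a)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of 2"
    bits = n.bit_length() - 1
    for i in range(n):
        j = bit_reverse(i, bits)
        if i < j:  # swap each pair only once
            a[i], a[j] = a[j], a[i]
    return a

print(bit_reverse_permute(list(range(8))))  # [0, 4, 2, 6, 1, 5, 3, 7]
```

A radix-4 transform needs the base-4 analogue (digit reversal), which follows the same pattern with two bits moved per step.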
System Specifications • Processor: Intel Pentium 4 CPU, 3.00 GHz • Cache size: 1 MB • RAM: 1 GB • Flags supported: sse, sse2
Results User time (ms) for 1024 points (single iteration)
Results User time (ms) for 1024 points (10 iterations)
Results User time for 4096 points (single iteration)
Results User time (ms) for 4096 points (10 iterations)
Results User time (ms) for 262144 points (single iteration)
Results User time (ms) for 262144 points (10 iterations)
Analysis • Results are comparable for the following reasons • Change in bit reversal • Number of computations • FFTW: compiled with gcc • The code must be rewritten to handle an arbitrary number of points
VTune Analysis • TODO • VTune (not available)
Further Improvements • Fast digit reversal • Fast “twiddle compute” • TODO: • Comparison with the Intel Math Kernel Library • Study the FFTW implementation • VTune analysis • Try the Winograd algorithm • Code more efficiently
References • Alan H. Karp, “Bit Reversal on Uniprocessors” • Angelo A. Yong, “A better FFT Bit-reversal Algorithm”