FFT Accelerator Project

Rohit Prakash (2003CS10186), Anand Silodia (2003CS50210). September 27th, 2007.

Presentation Transcript


  1. FFT Accelerator Project • Rohit Prakash (2003CS10186) • Anand Silodia (2003CS50210) • September 27th, 2007

  2. Overview • Multiprocessor Implementation: problems faced, solutions, results • FPGA IO: work done, problems faced, possible solutions

  3. Multiprocessor FFT: Problems • The previous code worked for some inputs but not all • The program seemed to communicate correctly but was still error-prone • Lots of segmentation faults (even after getting the results) • The serial debugger does not work • Commercial debuggers are available, but evaluation is restricted to a single IP address for 30 days

  4. Suggested solutions (lam-mpi/google groups) • “Execution environment does not match the compile environment” • The same code worked with MPICH version 2 and GCC • The complex datatype is NOT supported in the C version (but MPI_2COMPLEX seemed to work for me) • Finally rewrote the code in C++ using complex<float> and MPI::COMPLEX (this worked)
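A common workaround when a complex MPI datatype is unavailable from C is to ship each complex sample as two consecutive floats, so a buffer of N samples can be sent as 2N elements of a plain float type such as MPI_FLOAT. The sketch below shows only the packing and unpacking; the helper names `toInterleaved` and `fromInterleaved` are hypothetical, no MPI calls are made, and this is not the route the slides report (they switched to complex<float> with MPI::COMPLEX).

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

// Flatten complex samples into interleaved (re, im) floats.
// Hypothetical helper for illustration; not taken from the project code.
std::vector<float> toInterleaved(const std::vector<std::complex<float>>& in) {
    std::vector<float> out;
    out.reserve(2 * in.size());
    for (const auto& z : in) {
        out.push_back(z.real());
        out.push_back(z.imag());
    }
    return out;
}

// Rebuild complex samples from an interleaved (re, im) float buffer.
std::vector<std::complex<float>> fromInterleaved(const std::vector<float>& in) {
    assert(in.size() % 2 == 0);  // every sample needs both parts
    std::vector<std::complex<float>> out;
    out.reserve(in.size() / 2);
    for (std::size_t i = 0; i + 1 < in.size(); i += 2) {
        out.emplace_back(in[i], in[i + 1]);
    }
    return out;
}
```

The receiver would call `fromInterleaved` on the float buffer it gets, at the cost of one extra copy on each side.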

  5. System Info (identical for all) • Machine 1: Saveri • Machine 2: Abhogi • Machine 3: Sahana • Machine 4: Jaunpuri • Sysinfo: • Intel Pentium 4, 3.4 GHz • Cache size: 2048 KB • RAM: 1 GB • Operating system: Fedora Core 6 • Compiler: mpic++ • Flags: -O3 -march=pentium4 • FFT: radix 2
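For readers unfamiliar with the radix-2 scheme listed above, here is a minimal recursive sketch using complex<float>. It is a textbook decimation-in-time formulation, not the project's code; the names `Complex`, `combine`, and `fft` are chosen here for illustration, and the input length is assumed to be a power of two.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using Complex = std::complex<float>;

// Merge the FFTs of the even- and odd-indexed samples (each of length n/2)
// into the length-n FFT with radix-2 butterflies.
std::vector<Complex> combine(const std::vector<Complex>& even,
                             const std::vector<Complex>& odd) {
    assert(even.size() == odd.size());
    const std::size_t half = even.size();
    const std::size_t n = 2 * half;
    const double pi = std::acos(-1.0);
    std::vector<Complex> out(n);
    for (std::size_t k = 0; k < half; ++k) {
        // Twiddle factor exp(-2*pi*i*k/n).
        const double angle =
            -2.0 * pi * static_cast<double>(k) / static_cast<double>(n);
        const Complex twiddle(static_cast<float>(std::cos(angle)),
                              static_cast<float>(std::sin(angle)));
        const Complex t = twiddle * odd[k];
        out[k] = even[k] + t;
        out[k + half] = even[k] - t;
    }
    return out;
}

// Recursive radix-2 FFT; x.size() must be a power of two.
std::vector<Complex> fft(const std::vector<Complex>& x) {
    const std::size_t n = x.size();
    assert(n > 0 && (n & (n - 1)) == 0);
    if (n == 1) return x;
    std::vector<Complex> even(n / 2), odd(n / 2);
    for (std::size_t i = 0; i < n / 2; ++i) {
        even[i] = x[2 * i];
        odd[i] = x[2 * i + 1];
    }
    return combine(fft(even), fft(odd));
}
```

In a multiprocessor setting the two half-size transforms could be computed on different machines, with `combine` run after the partial results are exchanged.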

  6. Theoretical Execution Time • For p processors, the total execution time is: $T_N/p + (1 - 1/p)(2N/B + K_N)$ • p is a power of 2 • $T_N$ is the time taken to compute the FFT of input size N • $K_N$ is the time taken to combine two N-point FFTs • B is the network bandwidth (bytes/sec)
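The model above can be evaluated directly for given parameters. The sketch below is a plain transcription of the formula; the function name `predictedTime` and its parameter names are assumptions made here, and the caller is expected to supply measured values for the FFT time, the combine time, and the bandwidth.

```cpp
#include <cassert>

// Predicted run time on p processors under the model
//   T(p) = T_N / p + (1 - 1/p) * (2N/B + K_N).
// tN: time for the size-N FFT, kN: combine time, n: input size,
// bandwidth: network bandwidth B, p: processor count (power of two).
double predictedTime(double tN, double kN, double n, double bandwidth,
                     unsigned p) {
    assert(p > 0 && (p & (p - 1)) == 0);  // p must be a power of two
    assert(bandwidth > 0.0);
    const double invP = 1.0 / static_cast<double>(p);
    const double overhead = 2.0 * n / bandwidth + kN;  // 2N/B + K_N
    return tN * invP + (1.0 - invP) * overhead;
}
```

With p = 1 the second term vanishes and the function returns the serial time unchanged, which is a quick sanity check on any set of inputs.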

  7. Nature of this function • Sum of two functions: $T_N/p$ and $(1 - 1/p)(2N/B + K_N)$ • Case 1: $T_N/p$ dominates • Case 2: $(1 - 1/p)(2N/B + K_N)$ dominates
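The two cases can be made concrete by rewriting the model. The short derivation below introduces the shorthand $C$ for the per-step communication and combine cost, which is notation added here rather than taken from the slides; it holds only within this model.

```latex
\[
C = \frac{2N}{B} + K_N, \qquad
T(p) = \frac{T_N}{p} + \left(1 - \frac{1}{p}\right) C
     = C + \frac{T_N - C}{p}.
\]
\[
T(1) = T_N, \qquad
T(p) < T(1) \ \text{for } p > 1
\iff (T_N - C)\left(\frac{1}{p} - 1\right) < 0
\iff T_N > C.
\]
\[
\lim_{p \to \infty} T(p) = C
\quad\Longrightarrow\quad
\text{speedup} = \frac{T(1)}{T(p)} < \frac{T_N}{C} \ \text{ when } T_N > C.
\]
```

So when $T_N/p$ dominates ($T_N \gg C$), adding processors reduces the run time; when the second term dominates ($T_N < C$), the run time grows with $p$ and approaches $C$ from below.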

  8-10. Input: 8388608 (plot slides; figures not available in this transcript)

  11-13. Input: 16777216 (plot slides; figures not available in this transcript)

  14-16. Input: 33554432 (plot slides; figures not available in this transcript)

  17-19. Input: 67108864 (plot slides; figures not available in this transcript)

  20. Inference • An input size of 33554432 is roughly the break-even point (beyond it we start getting speedup) • Below this point • the execution time increases as the number of processors increases • the percentage of time spent on communication decreases as the number of processors increases • Above this point • the execution time decreases as the number of processors increases • the percentage of time spent on communication increases as the number of processors decreases

  21. Possible errors • We measure real (wall-clock) time, which is affected by the load on each processor • Network communication latency affects the time taken to establish a synchronous handshake • The pipeline is not actually “perfect”
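One common way to reduce the effect of background load on wall-clock measurements is to repeat the run and keep the smallest time. The sketch below does this with std::chrono::steady_clock; the helper name `minWallSeconds` is hypothetical, and this is a suggestion added here, not the measurement procedure the slides used.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <functional>
#include <limits>

// Run `work` `reps` times and return the smallest wall-clock
// duration in seconds. Hypothetical helper for illustration.
double minWallSeconds(const std::function<void()>& work, int reps) {
    assert(reps > 0);
    double best = std::numeric_limits<double>::infinity();
    for (int i = 0; i < reps; ++i) {
        const auto start = std::chrono::steady_clock::now();
        work();
        const auto stop = std::chrono::steady_clock::now();
        const std::chrono::duration<double> elapsed = stop - start;
        best = std::min(best, elapsed.count());
    }
    return best;
}
```

The minimum filters out runs slowed by other processes, but it does not remove network latency effects, which would still need to be measured separately.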

  22. 4-processor pipelined layout (timing diagram not reproduced in this transcript) • Rows: P1, P2, P3, P4 • Stages shown: FFT(N/4) on each processor, followed by Send, Recv, and Combine steps between processors • Stage durations as labelled on the diagram: (KN/2B) (N/2B) (N/2B) (N/4B) (TN/4) (N/4B) (KN/4B) • The time taken by these stages can surpass the boundaries

  23. Further Work • Rewrite the code with the new datatype in C • Optimize the code • Try with more processors? • Analyze using profilers?

  24. FPGA: PCI IO • Built and ran the admxrc2 demos • Studied the wrapper and the VHDL code • Struct ADMXRC2_SPACE_INFO • The VirtualBase member is the address, in the application's address space, by which the region may be accessed using pointers.
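Accessing a mapped region through a pointer, as the VirtualBase description above suggests, usually comes down to volatile loads and stores at byte offsets from the base address. The sketch below shows that pattern only; `writeReg32` and `readReg32` are hypothetical helpers, no ADMXRC2 API call is made, and on real hardware the base pointer would come from the VirtualBase member rather than from an ordinary array.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Store a 32-bit word at byteOffset from the start of a mapped region.
// `base` stands in for the address reported as VirtualBase (assumption).
inline void writeReg32(volatile void* base, std::size_t byteOffset,
                       std::uint32_t value) {
    assert(byteOffset % sizeof(std::uint32_t) == 0);  // aligned access only
    auto* bytes = static_cast<volatile std::uint8_t*>(base);
    *reinterpret_cast<volatile std::uint32_t*>(bytes + byteOffset) = value;
}

// Load a 32-bit word from byteOffset within the mapped region.
inline std::uint32_t readReg32(volatile void* base, std::size_t byteOffset) {
    assert(byteOffset % sizeof(std::uint32_t) == 0);
    auto* bytes = static_cast<volatile std::uint8_t*>(base);
    return *reinterpret_cast<volatile std::uint32_t*>(bytes + byteOffset);
}
```

The volatile qualifier keeps the compiler from caching or reordering away accesses whose side effects happen on the card; the meaning of each offset depends on the FPGA design and is not specified in the slides.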

  25. Mapping to logical space • All the demo VHDL code has been written using the names of the standard card signals as inputs and outputs • This approach makes the VHDL code card-dependent

  26. FPGA: Next step • There is another approach that uses the ADMXRC2_Read and ADMXRC2_Write API calls • Determine which of the two approaches is more useful and work with it • DMA code of Parikshit Patidar (work on Hardware Accelerator for Ray Tracing)

  27. References • ADM-XRC-II user manual • www.forums.xilinx.com • www.fpga-faq.org

  28. Thank you
