Automatic Tuning for Parallel FFTs

Presentation Transcript


  1. Automatic Tuning for Parallel FFTs • Daisuke Takahashi, University of Tsukuba, Japan • Second French-Japanese PAAP Workshop

  2. Outline • Background • Objectives • Approach • Block Six-Step/Nine-Step FFT Algorithm • Automatic Tuning for Parallel FFTs • Performance Results • Conclusion

  3. Background • The fast Fourier transform (FFT) is an algorithm that is widely used today in science and engineering. • Parallel FFT algorithms on distributed-memory parallel computers have been well studied. • Many numerical libraries with automatic performance tuning have been developed, e.g., ATLAS, FFTW, and I-LIB.

  4. Background (cont’d) • One goal for large FFTs is to minimize the number of cache misses. • Many FFT algorithms work well when data sets fit into a cache. • When a problem exceeds the cache size, however, the performance of these FFT algorithms decreases dramatically. • We modified the conventional six-step FFT algorithm to reuse data in the cache memory. → We call it a “block six-step FFT”.

  5. Related Work • FFTW [Frigo and Johnson (MIT)] • Recursive calls are employed to access main memory hierarchically. • This technique is very effective when the total amount of data is not much larger than the cache size. • For 1-D parallel MPI FFT, the six-step FFT is used. • http://www.fftw.org • SPIRAL [Pueschel et al. (CMU)] • The goal of SPIRAL is to push the limits of automation in software and hardware development and optimization for digital signal processing (DSP) algorithms. • http://www.spiral.net

  6. FFTE: A High-Performance FFT Library • FFTE is a Fortran subroutine library for computing the fast Fourier transform (FFT) in one or more dimensions. • It includes complex, mixed-radix, and parallel transforms. • It supports shared- and distributed-memory parallel computers (OpenMP, MPI, and OpenMP + MPI). • It also supports Intel’s SSE2/SSE3 instructions. • HPC Challenge Benchmark: FFTE’s parallel 1-D FFT routine has been incorporated into the HPC Challenge (HPCC) benchmark. • http://www.ffte.jp

  7. Objectives • To improve performance, we need to select the optimal parameters according to the computational environment and the problem size. • We implement an automatic tuning facility for the parallel 1-D FFT routine in the FFTE library.

  8. Discrete Fourier Transform (DFT) • The DFT of a sequence $x(j)$, $0 \le j \le N-1$, is given by $y(k) = \sum_{j=0}^{N-1} x(j)\,\omega_N^{jk}$, $0 \le k \le N-1$, where $\omega_N = e^{-2\pi i/N}$.
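To make the definition concrete, here is a direct $O(N^2)$ evaluation of the DFT in NumPy, checked against `numpy.fft`. This is only an illustrative sketch; FFTE itself is a Fortran library.

```python
import numpy as np

def dft(x):
    """Direct O(N^2) evaluation of the DFT definition above."""
    n = len(x)
    j = np.arange(n)                         # input index
    k = j.reshape(n, 1)                      # output index
    omega = np.exp(-2j * np.pi * j * k / n)  # omega_N^{jk}
    return omega @ x

# The FFT computes the same transform in O(N log N) operations.
x = np.random.rand(8) + 1j * np.random.rand(8)
assert np.allclose(dft(x), np.fft.fft(x))
```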

  9. 2-D Formulation • If $N$ has factors $N_1$ and $N_2$ ($N = N_1 N_2$), then, writing $x(j_1, j_2) = x(j_1 + j_2 N_1)$ and $y(k_2, k_1) = y(k_2 + k_1 N_2)$, $$y(k_2, k_1) = \sum_{j_1=0}^{N_1-1} \left[\, \omega_N^{j_1 k_2} \left( \sum_{j_2=0}^{N_2-1} x(j_1, j_2)\, \omega_{N_2}^{j_2 k_2} \right) \right] \omega_{N_1}^{j_1 k_1}.$$

  10. Six-Step FFT Algorithm • The 2-D formulation leads to the six-step FFT: (1) transpose; (2) $N_1$ individual $N_2$-point FFTs; (3) twiddle-factor multiplication by $\omega_N^{j_1 k_2}$; (4) transpose; (5) $N_2$ individual $N_1$-point FFTs; (6) transpose.
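For reference, a minimal NumPy sketch of the six steps, assuming $N = N_1 N_2$ and using `numpy.fft` for the individual FFTs. FFTE implements these steps in Fortran, so this illustrates the algorithm rather than the library's code.

```python
import numpy as np

def six_step_fft(x, n1, n2):
    """Compute the length n1*n2 DFT of x via the six-step algorithm."""
    n = n1 * n2
    X = x.reshape(n2, n1)                   # X[j2, j1] = x(j1 + j2*n1)
    X = X.T.copy()                          # step 1: transpose -> X[j1, j2]
    X = np.fft.fft(X, axis=1)               # step 2: n1 individual n2-point FFTs
    j1 = np.arange(n1).reshape(n1, 1)
    k2 = np.arange(n2).reshape(1, n2)
    X *= np.exp(-2j * np.pi * j1 * k2 / n)  # step 3: twiddle factors w_N^{j1*k2}
    X = X.T.copy()                          # step 4: transpose -> X[k2, j1]
    X = np.fft.fft(X, axis=1)               # step 5: n2 individual n1-point FFTs
    return X.T.reshape(n)                   # step 6: transpose; y(k2 + k1*n2)

x = np.random.rand(64) + 1j * np.random.rand(64)
assert np.allclose(six_step_fft(x, 8, 8), np.fft.fft(x))
```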

  11. Block Six-Step FFT Algorithm • [Figure: dataflow of the block six-step FFT: partial transpose, individual $N_2$-point FFTs and twiddle-factor multiplication performed block by block, transpose, individual $N_1$-point FFTs, partial transpose, so that the working set of each pass stays in cache.]
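A sketch of the blocking idea in the same NumPy style: the first transpose, the $N_2$-point FFTs, and the twiddle-factor multiplication are performed `nb` columns at a time so that the working set of each pass can stay in cache. The structure and the parameter `nb` here are illustrative assumptions, not FFTE's actual code.

```python
import numpy as np

def block_six_step_fft(x, n1, n2, nb):
    """Six-step FFT with the first transpose, the n2-point FFTs and the
    twiddle-factor multiplication done nb columns at a time."""
    n = n1 * n2
    X = x.reshape(n2, n1)                         # X[j2, j1]
    work = np.empty((n1, n2), dtype=complex)
    k2 = np.arange(n2)
    for jb in range(0, n1, nb):                   # one cache-sized block per pass
        je = min(jb + nb, n1)
        blk = X[:, jb:je].T.copy()                # partial transpose: blk[j1, j2]
        blk = np.fft.fft(blk, axis=1)             # n2-point FFTs for this block
        j1 = np.arange(jb, je).reshape(-1, 1)
        blk *= np.exp(-2j * np.pi * j1 * k2 / n)  # twiddle factors w_N^{j1*k2}
        work[jb:je, :] = blk
    Y = np.fft.fft(work.T, axis=1)                # remaining n1-point FFTs
    return Y.T.reshape(n)                         # y(k2 + k1*n2)

x = np.random.rand(1 << 12) + 1j * np.random.rand(1 << 12)
assert np.allclose(block_six_step_fft(x, 64, 64, 8), np.fft.fft(x))
```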

  12. 3-D Formulation • For very large FFTs, we should switch to a 3-D formulation. • If $N$ has factors $N_1$, $N_2$, and $N_3$ ($N = N_1 N_2 N_3$), then, writing $x(j_1, j_2, j_3) = x(j_1 + j_2 N_1 + j_3 N_1 N_2)$ and $y(k_3, k_2, k_1) = y(k_3 + k_2 N_3 + k_1 N_2 N_3)$, $$y(k_3, k_2, k_1) = \sum_{j_1=0}^{N_1-1} \omega_N^{j_1 (k_3 + k_2 N_3)}\, \omega_{N_1}^{j_1 k_1} \sum_{j_2=0}^{N_2-1} \omega_{N_2 N_3}^{j_2 k_3}\, \omega_{N_2}^{j_2 k_2} \sum_{j_3=0}^{N_3-1} x(j_1, j_2, j_3)\, \omega_{N_3}^{j_3 k_3}.$$
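The same 3-D decomposition written as three NumPy FFT passes with twiddle-factor multiplications in between. This is an unblocked, single-node sketch of the formula above, not the parallel implementation.

```python
import numpy as np

def fft_3d_decomposition(x, n1, n2, n3):
    """Length n1*n2*n3 DFT computed as three passes of short FFTs
    with twiddle-factor multiplications in between."""
    n = n1 * n2 * n3
    X = x.reshape(n3, n2, n1)                           # X[j3, j2, j1]
    X = np.fft.fft(X, axis=0)                           # n3-point FFTs -> [k3, j2, j1]
    k3 = np.arange(n3).reshape(n3, 1, 1)
    j2 = np.arange(n2).reshape(1, n2, 1)
    X *= np.exp(-2j * np.pi * j2 * k3 / (n2 * n3))      # twiddle w_{N2*N3}^{j2*k3}
    X = np.fft.fft(X, axis=1)                           # n2-point FFTs -> [k3, k2, j1]
    k2 = np.arange(n2).reshape(1, n2, 1)
    j1 = np.arange(n1).reshape(1, 1, n1)
    X *= np.exp(-2j * np.pi * j1 * (k3 + k2 * n3) / n)  # twiddle w_N^{j1*(k3+k2*N3)}
    X = np.fft.fft(X, axis=2)                           # n1-point FFTs -> [k3, k2, k1]
    return X.transpose(2, 1, 0).reshape(n)              # y(k3 + k2*n3 + k1*n2*n3)

x = np.random.rand(512) + 1j * np.random.rand(512)
assert np.allclose(fft_3d_decomposition(x, 8, 8, 8), np.fft.fft(x))
```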

  13. Parallel Block Nine-Step FFT • [Figure: dataflow of the parallel block nine-step FFT: the 3-D decomposition is computed with node-local partial transposes and individual FFTs, and the global transpose across nodes is performed by all-to-all communication.]
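The step that needs inter-node communication is the global transpose, which is typically carried out with an all-to-all operation. The following mpi4py sketch shows one way an all-to-all based transpose of an $N_1 \times N_2$ matrix distributed by row blocks can be written; it illustrates only the communication pattern and is not FFTE's implementation.

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, P = comm.Get_rank(), comm.Get_size()

def parallel_transpose(local, n1, n2):
    """Transpose an n1 x n2 matrix distributed by row blocks: each rank holds
    n1//P consecutive rows and receives n2//P rows of the transposed matrix.
    The inter-node data exchange is a single all-to-all operation."""
    r1, r2 = n1 // P, n2 // P
    # pack the column block destined for each rank contiguously: shape (P, r1, r2)
    sendbuf = np.ascontiguousarray(local.reshape(r1, P, r2).transpose(1, 0, 2))
    recvbuf = np.empty((P, r1, r2), dtype=local.dtype)
    comm.Alltoall(sendbuf, recvbuf)
    # chunk q now holds rows q*r1:(q+1)*r1 of this rank's slice of the transpose
    return recvbuf.transpose(2, 0, 1).reshape(r2, n1)

# Example: each rank builds its row block of an (8P x 4P) test matrix.
n1, n2 = 8 * P, 4 * P
full = np.arange(n1 * n2, dtype=np.complex128).reshape(n1, n2)
local = full[rank * (n1 // P):(rank + 1) * (n1 // P)].copy()
t_local = parallel_transpose(local, n1, n2)   # rank's (n2//P) x n1 block of full.T
```

Each rank exchanges roughly $N/P$ elements per global transpose, which is consistent with the observation on the discussion slide that the all-to-all communication overhead contributes significantly to the execution time for large transforms.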

  14. Automatic Tuning for Parallel FFTs • If the required condition on the transform size is satisfied, then we can choose arbitrary factors $N_1$, $N_2$, and $N_3$, where $N = N_1 N_2 N_3$. • In the original FFTE library, we chose fixed values of $N_1$, $N_2$, and $N_3$. • The blocking parameter $NB$ can also be varied. • For a given $N$, the best block size $NB$ is determined by the L2 cache size. • In the original FFTE, $NB$ was fixed for the Xeon processor. • We implemented an automatic tuning facility that varies $N_1$, $N_2$, $N_3$, and $NB$ (a sketch of such a search appears after this slide).
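The search itself can be as simple as timing every candidate combination on the target machine and keeping the fastest. In the sketch below, `fft_impl`, the power-of-two factorizations, and the candidate block sizes are hypothetical placeholders; FFTE's actual search space and timing harness may differ.

```python
import itertools
import time
import numpy as np

def power_of_two_factorizations(n):
    """Yield (n1, n2, n3) with n1*n2*n3 == n, all powers of two (assumption)."""
    p = n.bit_length() - 1                 # n is assumed to be 2**p
    for a in range(1, p - 1):
        for b in range(1, p - a):
            yield 2 ** a, 2 ** b, 2 ** (p - a - b)

def autotune(fft_impl, n, block_sizes=(4, 8, 16, 32, 64)):
    """Time every candidate (n1, n2, n3, nb) and keep the fastest one.
    fft_impl(x, n1, n2, n3, nb) is a hypothetical tunable parallel FFT routine."""
    x = np.random.rand(n) + 1j * np.random.rand(n)
    best_time, best_params = float("inf"), None
    for (n1, n2, n3), nb in itertools.product(power_of_two_factorizations(n),
                                              block_sizes):
        t0 = time.perf_counter()
        fft_impl(x, n1, n2, n3, nb)
        elapsed = time.perf_counter() - t0
        if elapsed < best_time:
            best_time, best_params = elapsed, (n1, n2, n3, nb)
    return best_params
```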

  16. Performance Results • To evaluate parallel 1-D FFTs, we compared: • FFTE (ver. 4.0) • FFTE (ver. 4.0) with automatic tuning • FFTW (ver. 3.2alpha3); “mpi-bench” with the “PATIENT” planner was used. • Target parallel machine: a 16-node dual-core Xeon PC cluster (Woodcrest 2.4 GHz, 2 GB SDRAM/node, Linux 2.6.18), interconnected through a Gigabit Ethernet switch. • Open MPI 1.2.5 was used as the communication library. • The compilers used were Intel C compiler 10.1 and Intel Fortran compiler 10.1.

  19. Results of Automatic Tuning on the dual-core Xeon 2.4 GHz PC cluster

  20. Discussion • For $N = 2^{28}$ and $P = 32$, FFTE with automatic tuning runs about 1.25 times faster than FFTW. • Since FFTW uses the six-step FFT, each column FFT does not fit into the L1 data cache. • Moreover, FFTE exploits the SSE3 instructions. • These are the two reasons why FFTE is more advantageous than FFTW. • We can clearly see that the all-to-all communication overhead contributes significantly to the execution time.

  21. Conclusions • We proposed an automatic tuning method for parallel 1-D FFTs on distributed-memory parallel computers. • The blocking algorithm for parallel 1-D FFTs utilizes cache memory effectively. • We found that the default parameters of FFTE are not always optimal, according to the results of the automatic tuning. • The performance of FFTE with automatic tuning is better than that of FFTW.
