CLFFT: An FFT code generator for heterogeneous systems

Krishna G Pai, Rejith George Joseph, Girish Ravunnikutty

Agenda • FFT • Intro to CLFFT • Brief intro to OpenCL • Comparisons of FFT Algorithms • Comparison with CUFFT • Future work

Discrete Fourier Transform • Takes O(n^2) time with a naive implementation. • Fast Fourier Transforms (FFTs) are O(n log n) implementations of the DFT. Image from Intel.com
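
To make the O(n^2) cost concrete, here is a minimal sketch of a naive DFT in Python. The function name `slow_dft` is hypothetical (it only echoes the "SlowFFT" label used later in the deck), and the forward-transform sign convention, exp(-2πi·jk/n), is an assumption, since the slides do not state one.

```python
import cmath

def slow_dft(x):
    """Naive DFT: X[k] = sum over j of x[j] * exp(-2*pi*i*j*k/n).

    Each of the n outputs needs a sum over n inputs, hence O(n^2) work.
    """
    n = len(x)
    return [
        sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]

# A unit impulse transforms to a flat spectrum (every bin equals 1).
spectrum = slow_dft([1, 0, 0, 0])
```

The two nested loops (one explicit, one inside `sum`) are exactly what an FFT reorganizes to reach O(n log n).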

Why FFTs? • Lots of ongoing research • FFTW (http://fftw.org/) • Spiral (http://spiral.net/) • Spectral methods are one of the 13 Dwarfs of parallel computing • Rich set of algorithms, each optimal for certain N • And, of course, wide applicability.

Our Approach • FFTW generates code that adapts to a particular architecture (CPUs) • Spiral does the same but optimizes at compile time (also CPUs) • Other research is optimized for GPUs, most notably Govindaraju et al. • Use all the available computing resources to make FFTs really fast!

Heterogeneous Computing • Intel Core 2 Duo • Nvidia Tesla • Use both these resources simultaneously

Intro to CLFFT • Future systems are going to be heterogeneous in nature (multi-core CPUs with GPGPUs as co-processors). • Study various FFT algorithms and implement them on a GPGPU and on multi-core CPUs. • Explore how FFTs can be scheduled across both these computing resources and the performance thus obtained. • Use OpenCL to program the GPGPU and OpenMP to parallelize on CPUs.

FFTs Studied • SlowFFT (naive implementation) • Cooley-Tukey (radix 2, for N = 2^k) • Stockham (radix 2, for N = 2^k) • Sande-Tukey (radix 2, decimation in frequency, for N = 2^k) • Bluestein's (radix 2, for any N) • Cooley-Tukey and SlowFFT also parallelized with OpenMP.
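
Two of the algorithms listed above can be sketched briefly in Python. This is an illustrative reconstruction, not the CLFFT code: the names `fft_radix2` and `fft_bluestein` are hypothetical, the recursive decimation-in-time form is one of several ways to write Cooley-Tukey, and reading "Bluestein's (radix 2, for any N)" as "Bluestein's chirp-z trick built on a radix-2 FFT" is an assumption.

```python
import cmath

def fft_radix2(x):
    """Recursive radix-2 decimation-in-time Cooley-Tukey FFT (len(x) == 2**k)."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return list(x)
    even = fft_radix2(x[0::2])   # transform of even-indexed samples
    odd = fft_radix2(x[1::2])    # transform of odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]  # twiddle factor
        out[k] = even[k] + t                             # butterfly
        out[k + n // 2] = even[k] - t
    return out

def fft_bluestein(x):
    """DFT of any length n, rewritten as a convolution of padded length 2**k."""
    n = len(x)
    if n == 0:
        return []
    m = 1
    while m < 2 * n - 1:          # padded length for a linear convolution
        m *= 2
    # chirp[j] = exp(-i*pi*j^2/n); j^2 is reduced mod 2n for accuracy
    chirp = [cmath.exp(-1j * cmath.pi * ((j * j) % (2 * n)) / n) for j in range(n)]
    a = [x[j] * chirp[j] for j in range(n)] + [0j] * (m - n)
    b = [0j] * m
    for j in range(n):
        b[j] = chirp[j].conjugate()
        if j:
            b[m - j] = chirp[j].conjugate()   # wrap negative indices
    prod = [u * v for u, v in zip(fft_radix2(a), fft_radix2(b))]
    # inverse FFT via conjugation: ifft(y) = conj(fft(conj(y))) / m
    conv = [v.conjugate() / m for v in fft_radix2([p.conjugate() for p in prod])]
    return [chirp[k] * conv[k] for k in range(n)]
```

For example, `fft_radix2` accepts an 8-point input directly, while a 7-point input has to go through `fft_bluestein`, which internally runs three 16-point radix-2 transforms.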

Computational Parity • Intel Xeon has about 70 GFLOPS at peak performance • Nvidia Tesla has about 933 GFLOPS • So there is not much computational parity on the HPC Tesla machines • Better parity on laptops with GPGPUs • Thus more work can be shared between CPU and GPU there. Source: Intel and Nvidia
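
As a rough illustration of what these peak figures imply, the sketch below computes the share of work the GPU would receive if work were divided in proportion to peak throughput. The proportional rule and the name `gpu_share` are assumptions made for illustration; the slides do not say how the split was actually chosen, and peak GFLOPS is only a crude proxy for achieved FFT performance.

```python
def gpu_share(cpu_gflops, gpu_gflops):
    """Fraction of the work given to the GPU under a throughput-proportional split."""
    return gpu_gflops / (cpu_gflops + gpu_gflops)

# Peak figures quoted on the slide: about 70 GFLOPS (Xeon), about 933 GFLOPS (Tesla).
share = gpu_share(70.0, 933.0)   # roughly 0.93
```

Under this assumption the CPU would get only about 7% of the work on the Tesla machines, which is consistent with the remark that there is little parity there.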

OpenCL • Standard for parallel programming of heterogeneous systems involving CPU, GPU(s), CPU + GPU, IBM Cell blade, etc. • So we can have portability across various architectures without a very great performance penalty* More on this when we compare matrix multiplication…

Differences w.r.t. CUDA • No standalone compiler to produce binaries. • We compile at run time. • Command queues for launching kernels and memory operations. • Device memory is managed via buffer objects, which provide richer functionality than in CUDA • Allows a host memory region to be used by the device directly • OpenCL requires memcpy between device and host to be explicitly synchronized
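
The points above (run-time compilation, command queues, buffer objects, explicit copies) can be seen together in a small host program. The sketch below uses the PyOpenCL bindings rather than the C API the project would have used; the kernel `scale` and the function `scale_on_device` are hypothetical examples, not part of CLFFT. The imports are placed inside the function so the file can still be loaded on a machine without an OpenCL runtime.

```python
KERNEL_SRC = """
__kernel void scale(__global float *data, const float factor) {
    int i = get_global_id(0);
    data[i] = data[i] * factor;
}
"""

def scale_on_device(values, factor):
    """Multiply every element by `factor` on whichever OpenCL device is found."""
    import numpy as np
    import pyopencl as cl

    host = np.asarray(values, dtype=np.float32)
    ctx = cl.create_some_context(interactive=False)
    queue = cl.CommandQueue(ctx)                    # command queue for this device
    program = cl.Program(ctx, KERNEL_SRC).build()   # kernel source compiled at run time

    # Buffer object initialised from host memory. (mem_flags.USE_HOST_PTR would
    # instead let the device use the host region directly.)
    mf = cl.mem_flags
    buf = cl.Buffer(ctx, mf.READ_WRITE | mf.COPY_HOST_PTR, hostbuf=host)

    # Kernel launch and read-back are both enqueued on the command queue.
    program.scale(queue, host.shape, None, buf, np.float32(factor))
    result = np.empty_like(host)
    cl.enqueue_copy(queue, result, buf)             # explicit device-to-host copy
    queue.finish()                                  # wait for all queued commands
    return result
```

On a machine with an OpenCL device and PyOpenCL installed, `scale_on_device([1.0, 2.0, 3.0], 2.0)` would return the doubled values as a float32 array.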

OpenCL Implementations • On August 5, 2009, AMD unveiled the first development tools for its OpenCL platform as part of its ATI Stream SDK v2.0 Beta Program • On August 28, 2009, Apple released Mac OS X Snow Leopard, which contains a full implementation of OpenCL • On September 28, 2009, NVIDIA released its OpenCL drivers and SDK implementation.

Limitations • Nvidia's OpenCL supports only the GPU as an OpenCL device. • The driver doesn't consider the CPU an OpenCL device. • Hence we cannot invoke an OpenCL kernel on the CPU. • Had to use OpenMP for the CPU • AMD Stream OpenCL has support for OpenCL on the CPU

Work Flow • Currently, we split work between CPU and GPUs only for Cooley-Tukey. • Cooley-Tukey here is radix-2 • Results are merged • On Tesla, the bias is highly in favor of GPU computation • On the host, one thread invokes the OpenMP kernel and other threads, equal in number to the GPUs, invoke OpenCL kernels
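
One way to picture this work flow is a single Cooley-Tukey decimation step: the input is split into interleaved sub-sequences, each sub-transform is handed to one of two workers, and the results are merged with twiddle factors. The Python sketch below is an assumption about the general shape only. The slides do not describe the actual partitioning, the names `split_fft`, `parts`, and `gpu_parts` are hypothetical, and plain Python threads merely stand in for the OpenMP and OpenCL host threads.

```python
import cmath
from concurrent.futures import ThreadPoolExecutor

def dft(x):
    """Stand-in transform for one sub-problem (a real system would run an FFT kernel)."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def split_fft(x, parts=4, gpu_parts=3):
    """Split into `parts` interleaved sub-transforms, share them, then merge."""
    n = len(x)
    if n % parts:
        raise ValueError("len(x) must be divisible by parts")
    m = n // parts
    subs = [x[r::parts] for r in range(parts)]   # sub-sequence r holds x[parts*j + r]

    def run(batch):
        return [dft(s) for s in batch]

    with ThreadPoolExecutor(max_workers=2) as pool:
        gpu_job = pool.submit(run, subs[:gpu_parts])   # stands in for an OpenCL thread
        cpu_job = pool.submit(run, subs[gpu_parts:])   # stands in for the OpenMP thread
        results = gpu_job.result() + cpu_job.result()

    # Merge: X[k] = sum over r of W_n^(r*k) * S_r[k mod m]
    return [sum(cmath.exp(-2j * cmath.pi * r * k / n) * results[r][k % m]
                for r in range(parts))
            for k in range(n)]
```

With `parts=4, gpu_parts=3`, three of the four sub-transforms go to the "GPU" worker, which mimics a split biased toward the GPU; with `parts` a power of two, the step corresponds to several radix-2 stages collapsed into one.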

Comparison of all Radix 2

Comparison of Cooley-Tukey

CLFFT vs CUFFT for Power of 2

Performance Comparison

CUFFT vs CLFFT for any n

Future Work • Split-radix and mixed-radix algorithms • 2^p 3^q 5^r point FFTs • Winograd's prime-number FFT • Optimize CPU implementations • Create a plan for a given n and implement it across multiple compute devices.

Thank You
