This report explores the implementation of three versions of the Fast Fourier Transform (FFT) on the Cell Broadband Engine architecture using a simulator. It analyzes the performance of the FFT on the Power Processing Element (PPE) and the Synergistic Processor Units (SPUs), focusing on data/task parallelism and pipelined implementations. The comparative study examines cycle counts, compute time, and the efficiency of different buffering techniques, and highlights challenges faced in programming the Cell architecture. Results indicate significant performance differences and suggest opportunities for future work on FFT implementations for the Cell architecture.
Fast Fourier Transform Implementation on Cell Broadband Engine Architecture Aarul Jain CSE520, Advanced Computer Architecture Fall 2007
OBJECTIVES • Three versions of the Fast Fourier Transform to be implemented on the Cell BE simulator and their performance analyzed as the order of the FFT is increased. • Fast Fourier Transform on the PPE/a single SPU. • Data/task parallel on multiple SPUs (single-buffer vs. double-buffer performance comparison). • Pipelined implementation on multiple SPUs. • Performance: • FFT kernel • DMA data transfer
CELL BE • PPE • 64-bit Power architecture with VMX. • In-order, 2-way SMT. • 32 KB L1, 512 KB L2 cache. • SPE • 256 KB local store. • In-order, no speculation. • 128 registers for all data types. • EIB • Four 16 B data rings. • Over 100 outstanding requests.
FFT on single PPU/SPU • FFT compute intensity O(n log n). • Implementation on PPU • Cache-based memory architecture – no software-controlled memory. • Implementation on SPU • Software-controlled memory. • The limited local-store memory decides the maximum size of the FFT that can be implemented (data structure size = 16 bytes × FFT size => 8K-point FFT maximum).
RESULTS (single PPU/SPU) N vs. cycles
Conclusions (single PPU/SPU) • The number of cycles on the PPU and SPU scales with order N log N. • Compute time on the PPU is greater than on a single SPU due to cache misses on the PPU; the SPU has no cache and accesses its local store directly. • Very efficient DMA. • Thread creation on an SPE is very expensive, so SPUs need to be dedicated to a particular task for a period long enough to recoup the setup time. • DIFFERENCE (col 8) TOO LARGE?? Exact reason unknown. Possible reasons: • Cycles for exiting the thread. (Upon exit, are entries of the local store invalidated?) • Profiling-tool problem. (IBM says the simulator is used for profiling SPEs, not PPEs. Does this mean the intrinsics provided for measuring cycles on the PPE (__mftb) are not accurate?)
Data/Task parallel on multiple SPUs • Multiple FFTs running, one per SPU, with each SPU working on different data. • Limitation of local-store memory: • Single-buffer approach => 8K points • Double-buffer approach => 4K points • Single buffer vs. double buffer. • Performance as the number of active SPUs is increased.
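The 8K/4K limits above follow directly from the local-store budget. A back-of-envelope sketch (my own arithmetic, not SDK code), assuming 16 bytes per point — four single-precision arrays (RealIn, ImagIn, RealOut, ImagOut) — and half of the 256 KB store reserved for code, stack, and twiddle factors:

```c
#include <assert.h>

#define LOCAL_STORE_BYTES (256u * 1024u)  /* SPE local store */
#define BYTES_PER_POINT   16u             /* 4 floats per sample */

/* Largest power-of-two FFT size that fits, given how many
 * buffers (1 = single buffering, 2 = double buffering) share
 * the data half of the local store. Assumption: half the store
 * is available for FFT data. */
unsigned max_fft_points(unsigned num_buffers)
{
    unsigned data_budget = LOCAL_STORE_BYTES / 2;
    unsigned points = data_budget / (BYTES_PER_POINT * num_buffers);
    unsigned p = 1;
    while (p * 2 <= points)   /* FFT sizes are powers of two */
        p *= 2;
    return p;
}
```

Under these assumptions the function reproduces the slide's numbers: 8192 points with a single buffer, 4096 with double buffering.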
Conclusion (Data/Task parallel) • More compute power with multiple processors: • for the FFT, almost 8× if thread creation is not counted. • Using double buffering may not always give a speed advantage (Amdahl's law). • Careful analysis of the algorithm is needed to determine whether it is compute-intensive or memory-intensive with respect to the Cell architecture. • Matrix multiplication is memory-intensive, but the FFT becomes memory-intensive only for very large orders, where all FFT samples cannot fit into the Cell local store.
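The Amdahl's-law point above can be made concrete with a one-line helper. The serial fractions in the usage note are illustrative only, not measured values from this study:

```c
#include <assert.h>
#include <math.h>

/* Classic Amdahl's law: speedup on n_procs processors when a
 * fraction serial_frac of the work (e.g. thread creation, or DMA
 * that double buffering cannot hide) cannot be parallelized. */
double amdahl_speedup(double serial_frac, int n_procs)
{
    return 1.0 / (serial_frac + (1.0 - serial_frac) / n_procs);
}
```

With zero serial work, 8 SPUs give the ideal 8× of the slide; a hypothetical 5% serial fraction already caps the speedup near 5.9×.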
Comparison with published results • Reference: http://www-01.ibm.com/chips/techlib/techlib.nsf/techdocs/0AA2394A505EF0FB872570AB005BF0F1 • No. of cycles for a single 4K-point FFT = 24688 • No. of floating-point operations = 4096 × log2(4096) = 49152 • System frequency = 3.2 GHz • No. of SPUs = 8 • GFLOPS = (49152 / 24688) × 8 × 3.2G = 50.96 GFLOPS • [Table: MY RESULTS vs. IBM RESULTS]
Problems faced • The Cell architecture and its programming environment are completely new, so unknown problems come up. • Runtime error "bus error": normally caused by unaligned access; in my case, by DMA transfers larger than 16 KB. • Profiling is tricky with a simulator supporting multiple modes. Assembly intrinsics are required to measure actual cycles, and running in "CYCLE" mode is very slow: • it takes 2 days to run an 8K-point FFT. • The simulator crashes when the mode is changed multiple times. • Debug support is very complex.
SUGGESTIONS • Use the alphaWorks forum: an excellent forum with quick response times. • To profile accurately, run the simulation in cycle mode. • Commands for profiling: • __mftb() -> for the PPE • spu_writech(), spu_readch() -> for the SPE
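The profiling intrinsics listed above can be combined into a small measurement sketch for the SPU side. This is a hedged example: it assumes the SDK's decrementer channel mnemonics (SPU_WrDec, SPU_RdDec) and the fft_float kernel from the code slides, and it compiles only with the SPU toolchain, not a host compiler.

```c
#include <spu_intrinsics.h>   /* spu_writech / spu_readch */

/* The SPU decrementer counts DOWN, so elapsed ticks = start - end. */
unsigned int start, end, ticks;
spu_writech(SPU_WrDec, 0x7FFFFFFF);          /* load the decrementer */
start = spu_readch(SPU_RdDec);
fft_float(FFT_SIZE, cb1.RealIn, cb1.ImagIn, cb1.RealOut, cb1.ImagOut);
end = spu_readch(SPU_RdDec);
ticks = start - end;   /* raw ticks; scale by timebase to get seconds */
```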
Future work • Pipelined implementation of FFT. • Standalone mode. • Higher order FFTs. • Compiler performance.
References • http://www.ibm.com/developerworks/forums/thread.jspa?threadID=160216 • Cell Broadband Engine Architecture Reference Manual, Ver 1.02, October 11, 2007. • IBM Cell Broadband Engine Software Development Kit, http://alphaworks.ibm.com/tech/cellsw?open&S_TACT=105AGX16&S_CMP=DWPA • Kahle J. A. et al., Introduction to the Cell multiprocessor, IBM Journal of Research and Development, September 2005. • Perrone M., Introduction to the Cell Processor (lecture), http://cag.csail.mit.edu/ps3/lectures/6.189-lecture2-cell.pdf • Krewell K., Cell Moves Into the Limelight, Microprocessor Report, February 2005. • Krewell K., Chips, Software, and Systems, Microprocessor Report, January 2005. • http://www-128.ibm.com/developerworks/forums/thread.jspa?threadID=182042
Double buffer Code. While the FFT runs on one control block, the MFC streams data for the other; each block is moved in FFT_SIZE/1024 chunks so that no single DMA exceeds 16 KB. Tags x serve cb1, tags y+10 serve cb2. (The original flattened listing is restored here with the (char *) casts needed for byte-offset pointer arithmetic.)

#define CHUNK (sizeof(cb1) / (FFT_SIZE/1024))   /* keep each DMA <= 16 KB */

for (x = 0, y = 0; x < FFT_SIZE/1024; x++, y++) {
    /* start fetching the next chunk into cb1 (tag x) */
    mfc_get((char *)&cb1 + x*CHUNK, argp + x*CHUNK, CHUNK, x, 0, 0);
    /* wait until cb2's previous transfer has finished (tag y+10) */
    mfc_write_tag_mask(1 << (y+10));
    mfc_read_tag_status_all();
    /* start fetching the following chunk into cb2 (tag y+10) */
    mfc_get((char *)&cb2 + y*CHUNK, argp + y*CHUNK, CHUNK, y+10, 0, 0);
    /* wait for cb1's data, then run the FFT on it */
    mfc_write_tag_mask(1 << x);
    mfc_read_tag_status_all();
    fft_float(FFT_SIZE, cb1.RealIn, cb1.ImagIn, cb1.RealOut, cb1.ImagOut);
    /* wait for cb2's data, write cb1's results back, then FFT cb2 */
    mfc_write_tag_mask(1 << (y+10));
    mfc_read_tag_status_all();
    mfc_put((char *)&cb1 + x*CHUNK, argp + x*CHUNK, CHUNK, x, 0, 0);
    fft_float(FFT_SIZE, cb2.RealIn, cb2.ImagIn, cb2.RealOut, cb2.ImagOut);
    /* wait for cb1's write-back, then write cb2's results back */
    mfc_write_tag_mask(1 << x);
    mfc_read_tag_status_all();
    mfc_put((char *)&cb2 + y*CHUNK, argp + y*CHUNK, CHUNK, y+10, 0, 0);
}
/* drain the last outstanding transfer */
mfc_write_tag_mask(1 << (y+10));
mfc_read_tag_status_all();
BUS ERROR

mfc_get(&cb1, argp, sizeof(cb1), x, 0, 0);   /* WON'T WORK for sizeof(cb1) > 16 KB */

A single MFC DMA is limited to 16 KB, so the transfer should be recoded in chunks:

for (x = 0; x < FFT_SIZE/1024; x++) {
    mfc_get((char *)&cb1 + x*sizeof(cb1)/(FFT_SIZE/1024),
            argp + x*sizeof(cb1)/(FFT_SIZE/1024),
            sizeof(cb1)/(FFT_SIZE/1024), x, 0, 0);
}