This report explores the implementation of three versions of the Fast Fourier Transform (FFT) on the Cell Broadband Engine architecture using a simulator. It analyzes the performance of the FFT on the Power Processing Element (PPE) and the Synergistic Processor Units (SPUs), focusing on data/task parallelism and pipelined implementations. The comparative study examines cycle counts, compute time, and the efficiency of different buffering techniques, and highlights challenges faced in programming the Cell architecture. Results indicate significant performance differences and suggest opportunities for future work on FFT implementations for the Cell architecture.
Fast Fourier Transform Implementation on Cell Broadband Engine Architecture Aarul Jain CSE520, Advanced Computer Architecture Fall 2007
OBJECTIVES • Three versions of the Fast Fourier Transform to be implemented on the Cell BE simulator and their performance analyzed as the order of the FFT is increased. • Fast Fourier Transform on the PPE/a single SPU. • Data/task parallel on multiple SPUs (single-buffer vs. double-buffer performance comparison). • Pipelined implementation on multiple SPUs. • Performance: • FFT kernel • DMA data transfer
CELL BE • PPE • 64-bit Power architecture with VMX. • In-order, 2-way SMT. • 32 KB L1, 512 KB L2 cache. • SPE • 256 KB local store. • In-order, no speculation. • 128 registers for all data types. • EIB • Four 16 B data rings. • Over 100 outstanding requests.
FFT on single PPU/SPU • FFT compute intensity O(n log n). • Implementation on PPU • Cache-based memory architecture – no software-controlled memory. • Implementation on SPU • Software-controlled memory. • The limited local-store memory decides the maximum size of the FFT that can be implemented (data structure size = 16 bytes × FFT size => 8K-point FFT maximum).
RESULTS (single PPU/SPU) N vs. cycles
Conclusions (single PPU/SPU) • The number of cycles on the PPU and SPU scales with order N log N. • Compute time on the PPU is greater than on a single SPU due to cache misses on the PPU; the SPU has no cache and accesses its local store directly. • Very efficient DMA. • Thread creation on an SPE is very expensive, so SPUs need to be dedicated to a particular task for a period long enough to recoup the setup time. • DIFFERENCE (col 8) TOO LARGE?? Exact reason unknown. Possible reasons: • Cycles for exiting the thread. (Upon exit, are entries of the local store invalidated?) • Profiling-tool problem. (IBM says the simulator is used for profiling SPEs, not PPEs. Does this mean the intrinsics provided for measuring cycles on the PPE (__mftb) are not accurate?)
Data/Task parallel on multiple SPUs • Multiple FFTs running, one per SPU, with each SPU working on different data. • Limitation of local-store memory: • Single-buffer approach => 8K points • Double-buffer approach => 4K points • Single buffer vs. double buffer. • Performance as the number of active SPUs is increased.
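The 8K/4K limits above follow directly from the local-store budget. A back-of-envelope sketch (my own arithmetic, not SDK code), assuming 16 bytes per point — four single-precision arrays (RealIn, ImagIn, RealOut, ImagOut) — and half of the 256 KB store reserved for code, stack, and twiddle factors:

```c
#include <assert.h>

#define LOCAL_STORE_BYTES (256u * 1024u)  /* SPE local store */
#define BYTES_PER_POINT   16u             /* 4 floats per sample */

/* Largest power-of-two FFT size that fits, given how many
 * buffers (1 = single buffering, 2 = double buffering) share
 * the data half of the local store. Assumption: half the store
 * is available for FFT data. */
unsigned max_fft_points(unsigned num_buffers)
{
    unsigned data_budget = LOCAL_STORE_BYTES / 2;
    unsigned points = data_budget / (BYTES_PER_POINT * num_buffers);
    unsigned p = 1;
    while (p * 2 <= points)   /* FFT sizes are powers of two */
        p *= 2;
    return p;
}
```

Under these assumptions the function reproduces the slide's numbers: 8192 points with a single buffer, 4096 with double buffering.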
Conclusion (Data/Task parallel) • More compute power with multiple processors: • for the FFT, almost 8× if thread creation is not counted. • Using double buffering may not always give a speed advantage (Amdahl's law). • Careful analysis of the algorithm is needed to determine whether it is compute-intensive or memory-intensive with respect to the Cell architecture. • Matrix multiplication is memory-intensive, but the FFT becomes memory-intensive only for very large orders, where all FFT samples cannot fit into the Cell local store.
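The Amdahl's-law point above can be made concrete with a one-line helper. The serial fractions in the usage note are illustrative only, not measured values from this study:

```c
#include <assert.h>
#include <math.h>

/* Classic Amdahl's law: speedup on n_procs processors when a
 * fraction serial_frac of the work (e.g. thread creation, or DMA
 * that double buffering cannot hide) cannot be parallelized. */
double amdahl_speedup(double serial_frac, int n_procs)
{
    return 1.0 / (serial_frac + (1.0 - serial_frac) / n_procs);
}
```

With zero serial work, 8 SPUs give the ideal 8× of the slide; a hypothetical 5% serial fraction already caps the speedup near 5.9×.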
Comparison with published results • Reference: http://www-01.ibm.com/chips/techlib/techlib.nsf/techdocs/0AA2394A505EF0FB872570AB005BF0F1 • No. of cycles for a single 4K-point FFT = 24688 • No. of floating-point operations = 4096 × log2(4096) = 49152 • System frequency = 3.2 GHz • No. of SPUs = 8 • GFLOPS = (49152 / 24688) × 8 × 3.2G = 50.96 GFLOPS • [Table: MY RESULTS vs. IBM RESULTS]
Problems faced • The Cell architecture and its programming environment are completely new, so unknown problems come up. • Runtime error "bus error": normally caused by unaligned access; in my case, by DMA transfers larger than 16 KB. • Profiling is tricky with a simulator supporting multiple modes. Assembly intrinsics are required to measure actual cycles, and running in "CYCLE" mode is very slow: • it takes 2 days to run an 8K-point FFT. • The simulator crashes when the mode is changed multiple times. • Debug support is very complex.
SUGGESTIONS • Use the alphaWorks forum: an excellent forum with quick response times. • To profile accurately, run the simulation in cycle mode. • Commands for profiling: • __mftb() -> for the PPE • spu_writech(), spu_readch() -> for the SPE
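The profiling intrinsics listed above can be combined into a small measurement sketch for the SPU side. This is a hedged example: it assumes the SDK's decrementer channel mnemonics (SPU_WrDec, SPU_RdDec) and the fft_float kernel from the code slides, and it compiles only with the SPU toolchain, not a host compiler.

```c
#include <spu_intrinsics.h>   /* spu_writech / spu_readch */

/* The SPU decrementer counts DOWN, so elapsed ticks = start - end. */
unsigned int start, end, ticks;
spu_writech(SPU_WrDec, 0x7FFFFFFF);          /* load the decrementer */
start = spu_readch(SPU_RdDec);
fft_float(FFT_SIZE, cb1.RealIn, cb1.ImagIn, cb1.RealOut, cb1.ImagOut);
end = spu_readch(SPU_RdDec);
ticks = start - end;   /* raw ticks; scale by timebase to get seconds */
```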
Future work • Pipelined implementation of FFT. • Standalone mode. • Higher order FFTs. • Compiler performance.
References • http://www.ibm.com/developerworks/forums/thread.jspa?threadID=160216 • Cell Broadband Engine Architecture Reference Manual, Ver 1.02, October 11, 2007. • IBM Cell Broadband Engine Software Development Kit, http://alphaworks.ibm.com/tech/cellsw?open&S_TACT=105AGX16&S_CMP=DWPA • Kahle J. A. et al., Introduction to the Cell multiprocessor, IBM Journal of Research and Development, September 2005. • Perrone M., Introduction to the Cell Processor (lecture), http://cag.csail.mit.edu/ps3/lectures/6.189-lecture2-cell.pdf • Krewell K., Cell Moves Into the Limelight, Microprocessor Report, February 2005. • Krewell K., Chips, Software, and Systems, Microprocessor Report, January 2005. • http://www-128.ibm.com/developerworks/forums/thread.jspa?threadID=182042
Double buffer Code. While the FFT runs on one control block, the MFC streams data for the other; each block is moved in FFT_SIZE/1024 chunks so that no single DMA exceeds 16 KB. Tags x serve cb1, tags y+10 serve cb2. (The original flattened listing is restored here with the (char *) casts needed for byte-offset pointer arithmetic.)

#define CHUNK (sizeof(cb1) / (FFT_SIZE/1024))   /* keep each DMA <= 16 KB */

for (x = 0, y = 0; x < FFT_SIZE/1024; x++, y++) {
    /* start fetching the next chunk into cb1 (tag x) */
    mfc_get((char *)&cb1 + x*CHUNK, argp + x*CHUNK, CHUNK, x, 0, 0);
    /* wait until cb2's previous transfer has finished (tag y+10) */
    mfc_write_tag_mask(1 << (y+10));
    mfc_read_tag_status_all();
    /* start fetching the following chunk into cb2 (tag y+10) */
    mfc_get((char *)&cb2 + y*CHUNK, argp + y*CHUNK, CHUNK, y+10, 0, 0);
    /* wait for cb1's data, then run the FFT on it */
    mfc_write_tag_mask(1 << x);
    mfc_read_tag_status_all();
    fft_float(FFT_SIZE, cb1.RealIn, cb1.ImagIn, cb1.RealOut, cb1.ImagOut);
    /* wait for cb2's data, write cb1's results back, then FFT cb2 */
    mfc_write_tag_mask(1 << (y+10));
    mfc_read_tag_status_all();
    mfc_put((char *)&cb1 + x*CHUNK, argp + x*CHUNK, CHUNK, x, 0, 0);
    fft_float(FFT_SIZE, cb2.RealIn, cb2.ImagIn, cb2.RealOut, cb2.ImagOut);
    /* wait for cb1's write-back, then write cb2's results back */
    mfc_write_tag_mask(1 << x);
    mfc_read_tag_status_all();
    mfc_put((char *)&cb2 + y*CHUNK, argp + y*CHUNK, CHUNK, y+10, 0, 0);
}
/* drain the last outstanding transfer */
mfc_write_tag_mask(1 << (y+10));
mfc_read_tag_status_all();
BUS ERROR

mfc_get(&cb1, argp, sizeof(cb1), x, 0, 0);   /* WON'T WORK for sizeof(cb1) > 16 KB */

A single MFC DMA is limited to 16 KB, so the transfer should be recoded in chunks:

for (x = 0; x < FFT_SIZE/1024; x++) {
    mfc_get((char *)&cb1 + x*sizeof(cb1)/(FFT_SIZE/1024),
            argp + x*sizeof(cb1)/(FFT_SIZE/1024),
            sizeof(cb1)/(FFT_SIZE/1024), x, 0, 0);
}