Single Source Compiler for Cell
Presentation Transcript
Single Source Compiler for Cell
Presented by Tao Zhang, IBM Research
http://w3.ibm.com/ibm/presentations
© 2002 IBM Corporation

Cell Broadband Engine

Outline
• Compiler Overview
• Compiler in detail
  • Code Generation
  • Thread and Synchronization
  • Data Management
• Conclusion

Tasks to Exploit Heterogeneous Multi-Cores in Cell
• Partition application into PPE and SPE portions
  • SPE has very limited local memory
  • Both a code size problem and a data size problem
• Code and compile PPE and SPE portions separately
  • Coding and compiling for two distinct ISAs
• Parallelize across multiple SPEs/PPE
• Simdization inside a single SPE
• Orchestration of DMA data transfers
• Synchronization between SPEs/PPE
• Quite complicated!

Single Source Compiler for Cell
• Single set of source files
• OpenMP directives are a natural choice – wide acceptance, simplicity
• Compiler, guided by OpenMP pragmas, generates optimized code for Cell
  • Code partitioning between PPE and SPEs
  • Parallelization between SPEs and PPE
  • Auto-simdization
  • Data transfers
  • Synchronization
  • Code size
• User interactions supported
  • Performance tuning compiler options
  • Hand-optimized functions provided by user

Example

1. Code (t1.c)

    #include <stdio.h>
    #define N 512

    int a[N];
    int b[N];

    int blether() {
      int i, j;
    #pragma omp parallel for
      for (i=0; i<N; i++) {
        b[i] = a[i] + i*1000;
      }
      return 0;
    }

    int main(void) {
      int i;
      int step;
      int sum = 0;
      for (i=0; i<N; i++) {
        a[i] = i;
      }
      blether();
      for (i=0; i<N; i++)
        sum += b[i];
      printf("omptest output t1: %d\n", sum);
      if (sum == 130946816) {
        printf("Correct!\n");
        return 0;
      } else {
        printf("Error! Incorrect checksum\n");
        return 1;
      }
    }

2. Compile (simple!)

    cbexlc -O5 t1.c -o t1

3. Run on Cell blade

    $ ./t1
    omptest output t1: 130946816
    Correct!

Project road map
• Started in 2003 at Watson
  • Watson team: Kevin O'Brien, Kathryn O'Brien, Alexandre Eichenberger, Tong Chen, Zehra Sura, Tao Zhang, Peng Wu
• alphaWorks release Nov. 2006
  • With help from Toronto compiler group (Guansong Zhang is our contact)
• developerWorks release Oct. 2007
  • Technique transfer to Toronto (Guansong Zhang, Amy Wang, …) and CRL (Haibo Lin and Tao Liu)
• GA release expected Dec. 2008

IBM XL Compiler Framework
• IBM XL compiler already has a robust infrastructure for building an OpenMP compiler for Cell
  • OpenMP and auto-parallelization currently supported in product compilers for other targets
  • Dependence analysis, interprocedural analysis, profile-directed feedback
  • Aggressive loop optimizations and loop restructuring
  • Function-level partitioning to reduce compilation unit size
  • Auto-simdization
• And of course, it supports multiple input languages and already targets a variety of different architectures
  • PPE and SPU compilers

Components of Cell Single Source Compiler
• Compiler transformations + runtime library support
• Compiler
  • Translates OpenMP pragmas in the source code
  • Implements corresponding OpenMP constructs
  • Outlines code segments enclosed in parallel constructs
  • Inserts proper OpenMP runtime library calls
• Runtime library
  • Provides basic utilities for OpenMP on Cell
  • Thread management, work distribution, synchronization, etc.

Performance of OpenMP vs. hand-optimization
[Bar chart: speedup with 8 SPEs (axis 0 to 40) for dotproduct, fft, stream-add, stream-copy, stream-scale, stream-triad, xor; series: OpenMP+simd, Hand-opt]
• Simple streaming benchmarks, speedups of 8 SPEs vs. 1 PPE
• Except for FFT, the OpenMP compiler performs comparably to hand-optimized code

Performance on Large Benchmarks
[Bar chart: normalized performance (8 SPEs vs. 1 PPE), axis 0 to 9, for BT, CG, EP, FT, IS, LU, MG, SP, swim, LBM, PDE]
• QS20 machine; NAS benchmarks, one SPEC OMP2000, one SPEC2006, a financial application; speedups against 1 PPE
• Acceptable performance on average
• Some results are impressive

Scalability on Large Benchmarks
[Bar chart: normalized performance against 1 SPE, axis 0 to 9, with 2, 4, and 8 SPEs, for BT, CG, EP, FT, IS, LU, MG, SP, LBM, PDE]
• QS20 machine, speedups against 1 SPE
• Scalability generally very good

Code Generation – Outlining and Cloning
• Outlining for parallel constructs
• Outlined function may be executed on both PPE and SPEs
  • Due to the heterogeneity of Cell
• Cloning the outlined function: one copy for PPE and one copy for SPE
  • Enables architecture-specific optimizations later
• Functions for PPE and SPE are separated into different compilation units, and different backends are invoked
  • PPE and SPE object files are generated respectively
• After the SPE binary is generated, it is embedded and linked with other PPE objects into the final PPE binary

Code Generation – Code Overlay Support
• When the functions for SPEs are too big, the code cannot fit into local storage
  • Lots of parallel loops
• Call-graph partitioning and code overlay support (in SDK)
  • Partition the sub-graph of SPE functions into several partitions
  • Each call-graph partition generates one SPE object file
  • Create a code overlay for each SPE object file
  • Code overlays share code address space
  • Loaded into local memory on demand
• Code size is normally not an issue
  • Except for a huge single function

Code Generation – Example

Single source:

    foo1();
    #pragma omp parallel for
    for (i=0; i<N; i++)
        A[i] = x * B[i];
    foo2();

After outlining (main code):

    foo1();
    _xlsmp_runtime_call(foo3, …)
    foo2();

Outlined function (PPE copy):

    foo3(LB, UB)
        for (i=LB; i<UB; i++)
            A[i] = x * B[i];
        Runtime barrier

Cloned function (SPE copy):

    foo3_SPU(LB, UB)
        for (i=LB; i<UB; i++)
            A[i] = x * B[i];
        Runtime barrier

Code Generation – Summary
[Flow diagram]
• Source: foo1(); #pragma omp parallel for over A[i] = x * B[i]; foo2();
• Outline: the parallel loop becomes foo3(LB, UB), ending in a runtime barrier; the call site becomes foo1(); runtime distribution of work (invoke foo3, for i=[0,N)); runtime barrier; foo2();
• Clone: foo3(LB, UB) is copied to foo3_SPU(LB, UB), also ending in a runtime barrier
• PU Backend → PU Objects
• SPU Backend → SPU Objects
  • Overlay enabled: Call Graph Partitioning first
  • Overlay not enabled: single set of SPU Objects
• SPU Linker: SPU Objects + SPU OMP Runtime Lib → SPU Binary
• PPU-embedspu: SPU Binary embedded as PU Object
• PU Linker: PU Objects + PU OMP Runtime Lib + embedded SPU Binary → Final PU Binary

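The outlining step can be illustrated with a small, portable sketch in plain C. It is not the XL compiler's output: runtime_parallel_for is a hypothetical stand-in for the runtime call, NUM_WORKERS and the chunking policy are assumptions, and the chunks run sequentially on the host instead of on SPEs. It only shows how a parallel loop turns into a function of (LB, UB) that a runtime invokes once per chunk.

```c
#include <assert.h>

#define N 512
#define NUM_WORKERS 8          /* assumption: one chunk per SPE-like worker */

static float A[N], B[N];
static float x = 2.0f;

/* Outlined loop body. A single-source compiler would also emit a clone
   of this function (foo3_SPU in the slides) for the SPE backend. */
static void foo3(int lb, int ub) {
    for (int i = lb; i < ub; i++)
        A[i] = x * B[i];
}

/* Hypothetical stand-in for the runtime entry point: split [0, n) into
   contiguous chunks and hand each one to the outlined function.
   Chunks run one after another here; a real runtime runs them in parallel. */
static void runtime_parallel_for(void (*fn)(int, int), int n, int workers) {
    assert(workers > 0);
    int chunk = (n + workers - 1) / workers;   /* ceiling division */
    for (int w = 0; w < workers; w++) {
        int lb = w * chunk;
        int ub = (lb + chunk < n) ? lb + chunk : n;
        if (lb < ub)
            fn(lb, ub);
    }
    /* a real runtime would wait at a barrier here */
}

/* Usage: fill B, run the "parallel" loop, return the last element of A. */
static float outlining_demo(void) {
    for (int i = 0; i < N; i++)
        B[i] = (float)i;
    runtime_parallel_for(foo3, N, NUM_WORKERS);
    return A[N - 1];
}
```

Because the outlined function only sees its bounds, the same body can be compiled by two backends and scheduled on any mix of cores.
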
Threads and Synchronization – Master Thread
• OpenMP threads execute on both PPE and SPEs
  • PPE can choose not to participate in work sharing
• PPE thread is the master thread
  • PPE is designed to manage SPEs
  • Smaller and simpler code for SPE runtime (in limited SPE local store)
• Responsibilities
  • Creating SPE threads
  • Distributing and scheduling tasks
  • Initiating synchronization operations
  • Handling all OS service requests

Threads and Synchronization – Tasks
• Task: a unit of work to be done by OpenMP threads
  • Execution of an outlined parallel region, loop, or section
  • Cache flush
  • Barrier synchronization
• Task queue
  • A task queue for each SPE thread
  • Resides in system memory and contains all the tasks to be executed by the thread
  • PPE thread creates tasks and puts them into the task queue
    • Task type, lower bound and upper bound for a parallel loop, function pointer, etc.
  • SPE threads fetch and execute tasks, and update the task queue (using DMA)

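A per-thread task queue of this kind might look like the following sketch. All names (struct task, queue_put, queue_fetch, QUEUE_CAP) and the ring-buffer layout are assumptions made for illustration, not the actual runtime's data structures. The sketch is single-threaded: the master fills the queue, then a worker drains it, whereas on Cell the SPE would fetch entries by DMA while the PPE keeps producing.

```c
#include <assert.h>
#include <stddef.h>

/* Task kinds named in the slides: parallel work, cache flush, barrier. */
enum task_type { TASK_LOOP, TASK_FLUSH, TASK_BARRIER };

struct task {
    enum task_type type;
    void (*fn)(int, int);   /* outlined function, used by TASK_LOOP */
    int lb, ub;             /* loop bounds for this chunk */
};

#define QUEUE_CAP 16        /* assumed capacity */

struct task_queue {
    struct task slots[QUEUE_CAP];
    unsigned head;          /* next task to fetch (advanced by the worker) */
    unsigned tail;          /* next free slot (advanced by the master) */
};

/* Master side: append a task; returns 0 if the queue is full. */
static int queue_put(struct task_queue *q, struct task t) {
    if (q->tail - q->head == QUEUE_CAP)
        return 0;
    q->slots[q->tail % QUEUE_CAP] = t;
    q->tail++;
    return 1;
}

/* Worker side: fetch the oldest task; returns 0 if the queue is empty. */
static int queue_fetch(struct task_queue *q, struct task *out) {
    if (q->head == q->tail)
        return 0;
    *out = q->slots[q->head % QUEUE_CAP];
    q->head++;
    return 1;
}

static int total;
static void sum_range(int lb, int ub) {
    for (int i = lb; i < ub; i++)
        total += i;
}

/* Usage: the master queues two loop chunks and a barrier, the worker drains. */
static int task_queue_demo(void) {
    struct task_queue q = { .head = 0, .tail = 0 };
    struct task t;
    total = 0;
    assert(queue_put(&q, (struct task){ TASK_LOOP, sum_range, 0, 5 }));
    assert(queue_put(&q, (struct task){ TASK_LOOP, sum_range, 5, 10 }));
    assert(queue_put(&q, (struct task){ TASK_BARRIER, NULL, 0, 0 }));
    while (queue_fetch(&q, &t)) {
        if (t.type == TASK_LOOP)
            t.fn(t.lb, t.ub);
        /* TASK_FLUSH and TASK_BARRIER would call into the runtime here */
    }
    return total;
}
```

Keeping one queue per worker means only one producer and one consumer touch each queue, which keeps the worker-side code small.
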
Threads and Synchronization – Synchronization
• Follows the OpenMP specification
  • Barriers, locks, cache flushes
• Implementation
  • Mailboxes in the Memory Flow Controller (MFC) for each SPE
    • For efficient communication of 32-bit values between cores
    • Used by the PPE thread to tell an SPE thread that tasks are available and exactly how many
    • Used in barrier implementation (thanks to Dan Brokenshire)
  • Atomic unit in MFC for each SPE
    • Implements atomic DMA commands
    • Used for efficient implementation of OpenMP locks and cache flush operations

Execution Flow
[Flow diagram with a PPE column and an SPE column]
• PPE: initialize runtime → create SPE thread → set up SPE work queue → execute PPE work share → synchronization
• SPE: loop and wait for work items → work on received item → synchronization
• Work item queue entries: pointer to function, Lb and Ub, work type
• Communication: mailbox between PPE and SPE; DMA to update the work item queue

Data Management
• Memory model: relaxed-consistency, shared memory model
  • Each thread can have its own temporary view of memory
  • Until forced to share memory by an OpenMP flush operation
• Private data allocated in SPE local store
  • Easy access
• Shared data resides in system memory
  • Not directly accessible by SPEs
  • Global variables addressed through CESOF support
  • Accessed through compiler-managed DMA operations
    • All data accesses can be handled by the software data cache
    • Regular data accesses can be optimized with direct buffering

Software Data Cache
• Works just like a hardware cache, but implemented in software
  • Loads/stores replaced with software cache lookup instructions
  • Miss handler invoked for a cache miss
    • Brings in the missing cache line
    • Evicts an existing cache line if necessary
  • Currently fixed configuration (64 KB, 128 B cache line, 4-way) for efficient implementation; will be made configurable later
• Coherence among threads
  • One cache line may be shared by multiple SPE threads, so a whole cache line cannot be naively evicted
  • Dirty bits record modified data (at byte granularity)
  • Atomic updates based on dirty bits to evict a cache line
• Pros/Cons
  • Uniform solution for all kinds of references
  • Exploits data reuse dynamically
  • Overhead

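The lookup, miss handling, and byte-granularity write-back can be sketched on an ordinary host as follows. This is an illustration, not the runtime's code: the names, the round-robin replacement policy, and the use of memcpy in place of DMA are assumptions, and the write-back here is not atomic as it would need to be on Cell. The geometry follows the configuration on the slide (64 KB, 128-byte lines, 4-way, hence 128 sets).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LINE_SIZE 128
#define WAYS      4
#define NUM_SETS  128                 /* 64 KB / (128 B * 4 ways) */
#define MEM_SIZE  (256 * 1024)        /* assumed size of simulated memory */

static uint8_t system_memory[MEM_SIZE];   /* stand-in for main memory */

struct cache_line {
    int      valid;
    uint32_t tag;                     /* address / LINE_SIZE */
    uint8_t  data[LINE_SIZE];
    uint8_t  dirty[LINE_SIZE];        /* one flag per modified byte */
};

static struct cache_line cache[NUM_SETS][WAYS];
static unsigned next_victim[NUM_SETS];    /* round-robin, an assumption */
static unsigned miss_count;

/* Write back only the bytes this thread modified, so updates made by
   other threads to the rest of the line are not overwritten. */
static void evict(struct cache_line *l) {
    uint8_t *home = &system_memory[l->tag * LINE_SIZE];
    for (int i = 0; i < LINE_SIZE; i++)
        if (l->dirty[i])
            home[i] = l->data[i];
    l->valid = 0;
}

static struct cache_line *lookup(uint32_t addr) {
    assert(addr < MEM_SIZE);
    uint32_t tag = addr / LINE_SIZE;
    unsigned set = tag % NUM_SETS;
    for (int w = 0; w < WAYS; w++)
        if (cache[set][w].valid && cache[set][w].tag == tag)
            return &cache[set][w];                 /* hit */
    /* Miss handler: pick a victim, write it back, fetch the new line. */
    miss_count++;
    struct cache_line *l = &cache[set][next_victim[set]];
    next_victim[set] = (next_victim[set] + 1) % WAYS;
    if (l->valid)
        evict(l);
    memcpy(l->data, &system_memory[tag * LINE_SIZE], LINE_SIZE); /* "DMA get" */
    memset(l->dirty, 0, LINE_SIZE);
    l->tag = tag;
    l->valid = 1;
    return l;
}

static uint8_t cached_load(uint32_t addr) {
    return lookup(addr)->data[addr % LINE_SIZE];
}

static void cached_store(uint32_t addr, uint8_t value) {
    struct cache_line *l = lookup(addr);
    l->data[addr % LINE_SIZE] = value;
    l->dirty[addr % LINE_SIZE] = 1;
}

static void cache_flush(void) {
    for (int s = 0; s < NUM_SETS; s++)
        for (int w = 0; w < WAYS; w++)
            if (cache[s][w].valid)
                evict(&cache[s][w]);
}

/* Usage: two writers touch different bytes of the same line. */
static int softcache_demo(void) {
    cached_store(5, 7);            /* this thread modifies byte 5 (one miss) */
    system_memory[6] = 9;          /* pretend another thread updated byte 6 */
    assert(cached_load(5) == 7);   /* hit: served from the cached line */
    cache_flush();
    return system_memory[5] == 7 && system_memory[6] == 9;
}
```

Writing back the whole 128-byte line would have replaced byte 6 with the stale value held in the cache; tracking dirty bytes avoids that.
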
Direct buffering
• Handles regular data accesses to shared memory by compiler
  • Buffers in SPE local memory are controlled by the compiler
  • Calls to allocate and free buffers are inserted
  • DMA operations are inserted
  • References to global variables are replaced by direct references to the local buffer
• Pros/Cons
  • Much less overhead: no lookup, more control over DMA
  • Decisions have to be made at compile time
    • Sometimes not possible
    • Sometimes not optimal

Original loop:

    for (i=0; i<N; i++) {
        A[i] = B[i]*C[i];
    }

After direct buffering (bf is the buffer size):

    for (ii=0; ii<N; ii+=bf) {
        read part B into B';
        read part C into C';
        for (i=ii; i < min(ii+bf, N); i++) {
            A'[i] = B'[i]*C'[i];
        }
        write A' back to A;
    }

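The blocked loop can be written out as compilable C to make the buffer indexing explicit. This is a host-side sketch under stated assumptions: dma_get and dma_put are hypothetical helpers that use memcpy in place of real DMA transfers, BF is an arbitrary buffer size, and the buffers are ordinary stack arrays standing in for SPE local store.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define N  1000
#define BF 64                      /* assumed buffer size, in elements */

static float A[N], B[N], C[N];     /* "shared" arrays in main memory */

/* Hypothetical DMA stand-ins: copy between main memory and local buffers. */
static void dma_get(float *local, const float *global, int count) {
    memcpy(local, global, (size_t)count * sizeof(float));
}
static void dma_put(float *global, const float *local, int count) {
    memcpy(global, local, (size_t)count * sizeof(float));
}

/* A[i] = B[i] * C[i], processed one buffer-sized block at a time. */
static void multiply_direct_buffered(void) {
    float a_buf[BF], b_buf[BF], c_buf[BF];   /* local-store buffers */
    for (int ii = 0; ii < N; ii += BF) {
        int len = (ii + BF < N) ? BF : N - ii;   /* last block may be short */
        dma_get(b_buf, &B[ii], len);
        dma_get(c_buf, &C[ii], len);
        for (int i = 0; i < len; i++)            /* buffers are indexed from 0 */
            a_buf[i] = b_buf[i] * c_buf[i];
        dma_put(&A[ii], a_buf, len);
    }
}

/* Usage: returns 1 if every element matches the unbuffered result. */
static int direct_buffer_demo(void) {
    for (int i = 0; i < N; i++) {
        B[i] = (float)i;
        C[i] = 2.0f;
    }
    multiply_direct_buffered();
    for (int i = 0; i < N; i++)
        if (A[i] != 2.0f * (float)i)
            return 0;
    return 1;
}
```

The inner loop touches only local buffers and performs no lookup, which is where the lower overhead relative to a software cache comes from; the cost is that block size and access pattern must be known at compile time.
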
Software Cache vs. Direct Buffering
[Bar chart: speedup by direct buffer, axis 0 to 7, for BT, CG, EP, FT, IS, LU, MG, SP, LBM]

Integrating Data Management Techniques
[Diagram: SPE local memory holds the software cache and direct buffers; the data set in main memory is reached via DMA]
• What if there are multiple copies of data?
  • Software-controlled cache and direct buffer
  • Multiple direct buffers
• Incorrect results may be produced!

Solution Exploration
[Diagram: spectrum from compile-time check to run-time check, with approaches #2 and #3 marked]
• Explore both compile-time and run-time checks
  • Compile-time method is efficient
  • Run-time method is powerful
• Goal: better performance
• Refer to the ICS08 paper, with more coming

Performance of Different Approaches
[Bar chart: 0% to 100% for CG, FT, IS, LU, MG, SP, LBM; series: #1, #2, #3, #2+#3]
• Performance normalized to no coherence maintenance overhead
• #1: pure compiler approach
• #2: compiler analysis and runtime boundary coherence maintenance
• #3: full runtime coherence maintenance

Prefetching for Software Cache
• A cache miss is expensive (more than 1000 cycles)
  • Blocked DMA
  • Context switching for jumping to miss handler
  • Cache maintenance
• Prefetching?
  • No hardware-supported prefetch instruction
  • Rely on DMA only
  • How to synchronize the use of cache and prefetching on the fly
• Our solution
  • Inspector-consumer model
  • Feasible and efficient for DMA
  • Special optimizations
• Refer to the CGO'08 paper

Speedup for Loops by Prefetching
[Bar chart: speedup, axis 1 to 4.5, for loops CG/L-A, CG/L-B, FT/L-1, FT/L-2, FT/L-3, FT/L-4, FT/L-5, FT/L-6, IS/L-1 and for benchmarks CG-A, CG-B, FT, IS]

Conclusions
• Supporting OpenMP on Cell facilitates code reuse and new application development
• Our accomplishments
  • For simple test cases, our OpenMP compiler achieves performance similar to hand-optimized implementations
  • For larger benchmarks, some show significant performance gains
  • It is feasible to extract high performance on Cell using the easy-to-use OpenMP programming model
• Future work
  • Optimize implementation
  • Improve performance on a wider set of applications

For More Information
Visit our website: www.research.ibm.com/cellcompiler/compiler.htm
Papers:
• Tong Chen, Haibo Lin, Tao Zhang, Kathryn O'Brien, Kevin O'Brien, "Orchestrating Data Transfer for the Cell/B.E.", accepted by International Conference on Supercomputing (ICS) 2008
• Tong Chen, Tao Zhang, Zehra Sura, Marc Gonzalez Tallada, Kathryn O'Brien, Kevin O'Brien, "Prefetch Irregular References for Software Cache on Cell", International Symposium on Code Generation and Optimization (CGO) 2008
• Kevin O'Brien, Kathryn O'Brien, Zehra Sura, Tong Chen and Tao Zhang, "Supporting OpenMP on Cell", International Workshop on OpenMP (IWOMP) 2007
• Tong Chen, Zehra Sura, Kathryn M. O'Brien, John K. O'Brien, "Optimizing the Use of Static Buffers for DMA on a CELL Chip", Workshop on Languages and Compilers for Parallel Computing (LCPC) 2006: 314-329
• Alexandre E. Eichenberger, Kathryn M. O'Brien, Kevin O'Brien, Peng Wu, Tong Chen, Peter H. Oden, Daniel A. Prener, Janice C. Shepherd, Byoungro So, Zehra Sura, Amy Wang, Tao Zhang, Peng Zhao, Michael Gschwind, "Optimizing Compiler for the CELL Processor", IEEE PACT 2005: 161-172
• Alexandre E. Eichenberger, Peng Wu, Kevin O'Brien, "Vectorization for SIMD architectures with alignment constraints", PLDI 2004: 82-93

Backup