1 / 36

Developing Code for Cell - SIMD

Developing Code for Cell - SIMD. Cell Programming Workshop. Class Agenda. Vector Programming (SIMD) Data types for vector programming Application partitioning SPU Intrinsics. Trademarks - Cell Broadband Engine ™ is a trademark of Sony Computer Entertainment, Inc. PPE vs. SPE Comparison.

Télécharger la présentation

Developing Code for Cell - SIMD

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Developing Code for Cell - SIMD Cell Programming Workshop Cell Programming Workshop

  2. Class Agenda • Vector Programming (SIMD) • Data types for vector programming • Application partitioning • SPU Intrinsics Trademarks - Cell Broadband Engine ™ is a trademark of Sony Computer Entertainment, Inc. Cell Programming Workshop

  3. PPE vs. SPE Comparison Cell Programming Workshop

  4. PPE vs SPE • Both PPE and SPE execute SIMD instructions • PPE processes SIMD operations in the VXU within its PPU • SPEs process SIMD operations in their SPU • Both processors execute different instruction sets • Programs written for the PPE and SPEs must be compiled by different compilers Cell Programming Workshop

  5. PPE and SPE Language-Extension Differences • Both operate on 128-bit SIMD vectors • Only the Vector/SIMD Multimedia Extension instruction set supports pixel vectors • Only the SPU instruction set supports doubleword vectors Cell Programming Workshop

  6. Register Layout of Data Types and Preferred Slot When instructions use or produce scalar operands or addresses, the values are in the preferred scalar slot The left-most word (bytes 0, 1, 2, and 3) of a register is called the preferred slot Cell Programming Workshop

  7. SIMD Architecture • SIMD = “single-instruction multiple-data” • SIMD exploits data-level parallelism • a single instruction can apply the same operation to multiple data elements in parallel • SIMD units employ “vector registers” • each register holds multiple data elements • SIMD is pervasive in the BE • PPE includes VMX (SIMD extensions to PPC architecture) • SPE is a native SIMD architecture (VMX-like) • SIMD in VMX and SPE • 128bit-wide datapath • 128bit-wide registers • 4-wide fullwords, 8-wide halfwords, 16-wide bytes • SPE includes support for 2-wide doublewords Cell Programming Workshop

  8. Vector code read instruction and decode it get these 4 numbers get those 4 numbers add them put the results here Instr. Decode Single Instruction Number 1a Number 1b Number 1c Number 1d + + + + Number 2a Number 2b Number 2c Number 2d Multiple Data = = = = Result 1 Result 2 Result 3 Result 4 SIMD (Single Instruction Multiple Data) processing • Data-level parallelism / operate on multiple data elements simultaneously Scalar code read the next instruction and decode it get this number get that number add them put the result here read the next instruction and decode it get this number get that number add them put the result here read the next instruction and decode it get this number get that number add them put the result here. read the next instruction and decode it get this number get that number add them put the result there Instr. Decode Number 1 + Repeat 4 times Number 2 = Result 1 Do Once • (Significantly) smaller amount of code => improved execution efficiency • Number of elements processed in parallel = (size of vector / size of element)‏ Cell Programming Workshop

  9. I - Example of a SIMD operation In both the PPE and SPEs, vector registers hold multiple data elements as a single vector. The data paths and registers supporting SIMD operations are 128 bits wide, corresponding to four full 32-bit words. This means that four 32-bit words can be loaded into a single register, and, for example, added to four other words in a different register in a single operation. Similar operations can be performed on vector operands containing 16 bytes, 8 halfwords, or 2 doublewords. Cell Programming Workshop

  10. SIMD “Cross-Element” Instructions • VMX and SPE architectures include “cross-element” instructions • shifts and rotates • permutes / shuffles • Permute / Shuffle • selects bytes from two source registers and places selected bytes in a target register • byte selection and placement controlled by a “control vector” in a third source register • extremely useful for reorganizing data in the vector register file Cell Programming Workshop

  11. PowerPC Processor Element II - Example of a SIMD operation The bytes selected for the shuffle from the source registers, VA and VB, are based on byte entries in the control vector, VC, in which a 0 specifies VA and a 1 specifies VB. The result of the shuffle is placed in register VT. The shuffle operation is extremely powerful and finds its way into many applications in which data reordering, selection, or merging is required . byte-shuffle (permute) operation Cell Programming Workshop

  12. PowerPC Processor Element SIMD Vectorization • A vectoris an instruction operand containing a set of data elements packed into a one-dimensional array • The elements can be fixed-point or floating-point values • SIMD processing exploits data-level parallelism: • Data-level parallelism means that the operations required to transform a set of vector elements can be performed on all elements of the vector at the same time. That is, a single instruction can be applied to multiple data elements in parallel (SIMD) By the vector/SIMD multimedia extension instructions In the PPE Support for SIMD operations: In the SPEs By the SPU instruction set Cell Programming Workshop

  13. Basic SIMD Operations – A PPU program Use altivec header file /* * declares vector variables which points to scalar arrays */ __vector signed int *va = (__vector signed int *) a; __vector signed int *vb = (__vector signed int *) b; __vector signed int *vc = (__vector signed int *) c; /* * adds four signed integers at once */ *vc = vec_add(*va, *vb); // 1 + 2, 3 + 4, 5 + 6, 7 + 8 /* * output results */ printf("c[0]=%d, c[1]=%d, c[2]=%d, c[3]=%d\n", c[0], c[1], c[2], c[3]); return 0; } #include <stdio.h> #include <altivec.h> --- not required in IBM SDK for XLC /* * declares input/output scalar variables */ int a[4] __attribute__((aligned(16))) = { 1, 3, 5, 7 }; int b[4] __attribute__((aligned(16))) = { 2, 4, 6, 8 }; int c[4] __attribute__((aligned(16))); int main(int argc, char **argv) { Use altivec vector intrinsic instruction Source: PS3 Programming Cell Programming Workshop

  14. Basic SIMD Operations – An SPU program /* * declares vector variables which points to scalar arrays */ __vector signed int *va = (__vector signed int *) a; __vector signed int *vb = (__vector signed int *) b; __vector signed int *vc = (__vector signed int *) c; /* * adds four signed intergers at once */ *vc = spu_add(*va, *vb); // 1 + 2, 3 + 4, 5 + 6, 7 + 8 // use of spu_add instead of vec_add /* * output results */ printf("c[0]=%d, c[1]=%d, c[2]=%d, c[3]=%d\n", c[0], c[1], c[2], c[3]); return 0; } Use spu_intrinsics header file #include <stdio.h> #include <spu_intrinsics.h> //the SPU intrinsics functions header file /* * declares input/output scalar variables */ int a[4] __attribute__((aligned(16))) = { 1, 3, 5, 7 }; int b[4] __attribute__((aligned(16))) = { 2, 4, 6, 8 }; int c[4] __attribute__((aligned(16))); int main(int argc, char **argv) { Use spu intrinsics function Source: PS3 Programming Cell Programming Workshop

  15. Generic SPU Intrinsics • Generic • Constant formation (spu_splats) • Conversion (spu_convtf, spu_convts, ..) • Arithmetic (spu_add, spu_madd, spu_nmadd, ...) • Byte operations (spu_absd, spu_avg,...) • Compare and branch (spu_cmpeq, spu_cmpgt,...) • Bits and masks (spu_shuffle, spu_sel,...) • Logical (spu_and, spu_or, ...) • Shift and rotate (spu_rlqwbyte, spu_rlqw,...) • Control (spu_stop, spu_ienable, spu_idisable, ...) • Channel Control (spu_readch, spu_writech,...) • Scalar (spu_insert, spu_extract, spu_promote) Cell Programming Workshop

  16. Vectorizing a Loop – A Simple Example • Loop does term-by-term multiply of two arrays • arrays assumed here to remain scalar outside of subroutine • If array size is not a multiple of 4, extra work is necessary Cell Programming Workshop

  17. void load_data(int *dest) { int i; for (i=0; i<4096; ++i) { ++dest[i]; } } void load_data_SIMD(int *dest) { int i; vector unsigned int *vdest; vector unsigned int v1 = (vector unsigned int) {1, 1, 1, 1}; vdest = (vector unsigned int *) dest; for (i=0; i<1024; ++i) { vdest[i] = spu_add(vdest[i], v1); } } Another SIMD Example – data incrementing Cell Programming Workshop

  18. A1 A1 A2 B2 B1 Q1 Q2 A2 I1 A2 I2 A1 c3 c1 a3 c1 a1 a1 b1 a3 a1 d1 a1 a3 b3 d3 d1 b3 d2 a2 b1 b2 b1 b3 b1 c2 d3 a4 c4 c2 a4 a2 b3 a2 a3 a4 c3 a2 b2 d4 a4 b2 d4 d2 c4 b4 b4 b2 b4 b4 Complex Multiplication SPE - Shuffle Vectors Z3 Z4 Z1 Z2 I P 0-3 8-11 16-19 24-27 Input Shuffle patterns Q P 4-7 12-15 20-23 28-31 W3 W4 W1 W2 I1 = spu_shuffle(A1, A2, I_Perm_Vector); I P 0-3 8-11 16-19 24-27 By analogy I2 = spu_shuffle(B1, B2, I_Perm_Vector); Q1 = spu_shuffle(A1, A2, Q_Perm_Vector); Q P 4-7 12-15 20-23 28-31 Q2 = spu_shuffle(B1, B2, Q_Perm_Vector); By analogy Cell Programming Workshop

  19. I2 Q1 Z I1 Q2 Q2 Q1 Z I1 I2 a1 0 c1 d1 b1 c1 b1 0 d1 a1 b2 0 b2 a2 c2 c2 a2 d2 d2 0 b3 0 d3 b3 c3 a3 d3 a3 0 c3 0 c4 b4 a4 0 d4 b4 a4 c4 d4 Complex Multiplication * - A1 = spu_nmsub(Q1, Q2, v_zero); A2 = spu_madd(Q1, I2, v_zero); Q1 = spu_madd(I1, Q2, A2); I1 = spu_madd(I1, I2, A1); A1 -(b1*d1) -(b2*d2) -(b3*d3) -(b4*d4) * + A2 b1*c1 b2*c2 b3*c3 b4*c4 * A2 b1*c1 b2*c2 b3*c3 b4*c4 + a1*d1+ b1*c1 a2*d2+ b2*c2 a3*d3+ b3*c3 a4*d4+ b4*c4 Q1 * + A1 -(b1*d1) -(b2*d2) -(b3*d3) -(b4*d4) a1*c1- b1*d1 a2*c2- b2*d2 a3*c3- b3*d3 a4*c4- b4*d4 I1 Cell Programming Workshop

  20. Complex multiplication – SPE - Summary vector float A1, A2, B1, B2, I1, I2, Q1, Q2, D1, D2; /* in-phase (real), quadrature (imag), temp, and output vectors*/ vector float v_zero = (vector float)(0,0,0,0); vector unsigned char I_Perm_Vector = (vector unsigned char)(0,1,2,3,8,9,10,11,16,17,18,19,24,25,26,27); vector unsigned char Q_Perm_Vector = (vector unsigned char)(4,5,6,7,12,13,14,15,20,21,22,23,28,29,30,31); vector unsigned char vcvmrgh = (vector unsigned char) (0,1,2,3,16,17,18,19,4,5,6,7,20,21,22,23); vector unsigned char vcvmrgl = (vector unsigned char) (8,9,10,11,24,25,26,27,12,13,14,15,28,29,30,31); /* input vectors are in interleaved form in A1,A2 and B1,B2 with each input vector representing 2 complex numbers and thus this loop would repeat for N/4 iterations */ I1 = spu_shuffle(A1, A2, I_Perm_Vector); /* pulls out 1st and 3rd 4-byte element from vectors A1 and A2 */ I2 = spu_shuffle(B1, B2, I_Perm_Vector); /* pulls out 1st and 3rd 4-byte element from vectors B1 and B2 */ Q1 = spu_shuffle(A1, A2, Q_Perm_Vector); /* pulls out 2nd and 4th 4-byte element from vectors A1 and A2 */ Q2 = spu_shuffle(B1, B2, Q_Perm_Vector); /* pulls out 3rd and 4th 4-byte element from vectors B1 and B2 */ A1 = spu_nmsub(Q1, Q2, v_zero); /* calculates –(bd – 0) for all four elements */ A2 = spu_madd(Q1, I2, v_zero); /* calculates (bc + 0) for all four elements */ Q1 = spu_madd(I1, Q2, A2); /* calculates ad + bc for all four elements */ I1 = spu_madd(I1, I2, A1); /* calculates ac – bd for all four elements */ D1 = spu_shuffle(I1, Q1, vcvmrgh); /* spreads the results back into interleaved format */ D2 = spu_shuffle(I1, Q1, vcvmrgl); /* spreads the results back into interleaved format */ Cell Programming Workshop

  21. Hands-on SIMD Cell Programming Workshop Cell/Quasar Ecosystem & Solutions Enablement Cell Programming Workshop

  22. Class Agenda • Example #1 - Develop and run a simple PPU scalar program • Example #2 – Converting to an SPU program with data alignment • Example #3 – Static profiling with spu_timing • Learn how to SIMDize scalar code on CellBE Coding examples/opt/cell_class/Hands-on-30/SIMD Cell Programming Workshop

  23. Coding Strategy • Develop 2 functions • mult1.c to multiply two arrays of SP floats together. This program will be used to characterize the SIMD features • main.c is a non-vector driver program Cell Programming Workshop

  24. #include <stdio.h> #define N 16 int mult1(float *in1, float *in2, float *out, int num); float a[N] = { 1.1, 2.2, 4.4, 5.5, 6.6, 7.7, 8.8, 9.9, 2.2, 3.3, 3.3, 2.2, 5.5, 6.6, 6.6, 5.5}; float b[N] = { 1.1, 2.2, 4.4, 5.5, 5.5, 6.6, 6.6, 5.5, 2.2, 3.3, 3.3, 2.2, 6.6, 7.7, 8.8, 9.9}; float c[N]; int main() { int num = N; int i; mult1(a, b, c, num); for (i=0;i<N;i+=4) printf("%.2f %.2f %.2f %.2f\n", c[i], c[i+1], c[i+2], c[i+3]); return 0; } #include <stdio.h> int mult1(float *in1, float *in2, float *out, int N) { int i; for (i=0;i<N;i++) { out[i] = in1[i] * in2[i]; } return 0; } example1 - SIMD main.c mult1.c Cell Programming Workshop

  25. #include <stdio.h> #define N 16 int mult1(float *in1, float *in2, float *out, int num); float a[N] __attribute__ ((aligned(16))) = { 1.1, 2.2, 4.4, 5.5, 6.6, 7.7, 8.8, 9.9, 2.2, 3.3, 3.3, 2.2, 5.5, 6.6, 6.6, 5.5}; float b[N] __attribute__ ((aligned(16))) = { 1.1, 2.2, 4.4, 5.5, 5.5, 6.6, 6.6, 5.5, 2.2, 3.3, 3.3, 2.2, 6.6, 7.7, 8.8, 9.9}; float c[N] __attribute__ ((aligned(16))); int main() { int num = N; int i; mult1(a, b, c, num); for (i=0;i<N;i+=4) printf("%.2f %.2f %.2f %.2f\n", c[i], c[i+1], c[i+2], c[i+3]); return 0; } #include <stdio.h> int mult1(float *in1, float *in2, float *out, int N) { int i; for (i=0;i<N;i++) { out[i] = in1[i] * in2[i]; } return 0; } Example1.b - SIMD main.c mult1.c Cell Programming Workshop

  26. #include <stdio.h> #define N 16 int mult1(float *in1, float *in2, float *out, int num); float a[N] __attribute__ ((aligned(16))) = { 1.1, 2.2, 4.4, 5.5, 6.6, 7.7, 8.8, 9.9, 2.2, 3.3, 3.3, 2.2, 5.5, 6.6, 6.6, 5.5}; float b[N] __attribute__ ((aligned(16))) = { 1.1, 2.2, 4.4, 5.5, 5.5, 6.6, 6.6, 5.5, 2.2, 3.3, 3.3, 2.2, 6.6, 7.7, 8.8, 9.9}; float c[N] __attribute__ ((aligned(16))); int main() { int num = N; int i; mult1(a, b, c, num); for (i=0;i<N;i+=4) printf("%.2f %.2f %.2f %.2f\n", c[i], c[i+1], c[i+2], c[i+3]); return 0; } #include <stdio.h> #include <spu_intrinsics.h> int mult1(float *in1, float *in2, float *out, int N) { int i; vector float *a = (vector float *) in1; vector float *b = (vector float *) in2; vector float *c = (vector float *) out; int Nv = N/4; // NV=4 for (i=0;i<Nv;i++) { c[i] = spu_mul(a[i], b[i]); } return 0; } ! Example #2 – Converting to an SPU main.c mult1.c Cell Programming Workshop

  27. Example #3 – Static profiling with spu_timing Cell Programming Workshop

  28. What is the SPU timing tool • Annotates an assembly source file with static analysis of instruction timing assuming linear (branchless) execution. C source code (example.c) spu-gcc –S example.corspuxlc –S example.c Assembly source (example.s) spu_timing example.s Annotated assembly source (example.s.timing) Cell Programming Workshop

  29. Sample - assembly source (example.s) .file "example.c“.text .align 3 .global saxpy .type saxpy, @functionsaxpy: ila $2,66051 shlqbyi $7,$3,0 cgti $3,$3,0 shufb $8,$4,$4,$2 nop $127 biz $3,$lr ori $4,$7,0 hbra .L8,.L4 il $7,0 lnop.L4: ai $4,$4,-1 lqx $2,$7,$5 lqx $3,$7,$6 fma $2,$8,$2,$3 stqx $2,$7,$6 ai $7,$7,16.L8: brnz $4,.L4 bi $lr Cell Programming Workshop

  30. Example .timing example.s.timing .file "mult1.c" .text .align 3 .global mult1 .type mult1, @function mult1: 0D 01 cgti $2,$6,-1 1D 0123 shlqbyi $10,$3,0 0D 12 ai $3,$6,3 1D 1234 shlqbyi $9,$4,0 0 -34 selb $3,$3,$6,$2 0 -5678 rotmai $8,$3,-2 0d ---90 cgti $2,$8,0 1d --1234 brnz $2,.L8 .L2: 0D 23 il $3,0 1D 2345 bi $lr .L8: 0D 34 il $7,0 1D 345678 hbra .L9,.L4 0D 45 il $6,0 1D 4 lnop .L4: 0d 56 ai $7,$7,1 1d -678901 lqx $2,$10,$6 1 789012 lqx $3,$9,$6 0 89 ceq $4,$8,$7 0d ----345678 fm $2,$2,$3 1d ------901234 stqx $2,$5,$6 0D 01 ai $6,$6,16 .L9: 1D 0123 brz $4,.L4 0D 1 nop $127 1D 1234 br .L2 .size mult1, .-mult1 .ident "GCC: (GNU) 4.0.2 (CELL 3-2, Apr 11 2006)" Multiplication loop spu_mul stalling on load Cell Programming Workshop

  31. Sample – annotated source (example.s.timing) Inner Loop running-count – cycle count for which each instruction starts. Useful for determining the cycles in a loop. For example, our loop is 17 cycles (23-6). .file "example.c“ .text .align 3 .global saxpy .type saxpy, @function saxpy:000000 0D 01 ila $2,66051000000 1D 0123 shlqbyi $7,$3,0000001 0d 12 cgti $3,$3,0000002 1d -2345 shufb $8,$4,$4,$2000003 0D 3 nop $127000003 1D 3456 biz $3,$lr000004 0D 45 ori $4,$7,0000004 1D 456789 hbra .L8,.L4000005 0D 56 il $7,0000005 1D 5 lnop .L4:000006 0d 67 ai $4,$4,-1000007 1d -789012 lqx $2,$7,$5000008 1 890123 lqx $3,$7,$6000014 0 -----456789 fma $2,$8,$2,$3000020 1 -----012345 stqx $2,$7,$6 .L8:000022 1 2345 brnz $4,.L4000023 1 3456 bi $lr Cell Programming Workshop

  32. Sample – annotated source (example.s.timing) Execution pipeline – the pipeline in which the instruction is issued. Either 0 (even pipeline) or 1 (odd pipeline). .file "example.c“ .text .align 3 .global saxpy .type saxpy, @function saxpy:000000 0D 01 ila $2,66051000000 1D 0123 shlqbyi $7,$3,0000001 0d 12 cgti $3,$3,0000002 1d -2345 shufb $8,$4,$4,$2000003 0D 3 nop $127000003 1D 3456 biz $3,$lr000004 0D 45 ori $4,$7,0000004 1D 456789 hbra .L8,.L4000005 0D 56 il $7,0000005 1D 5 lnop .L4:000006 0d 67 ai $4,$4,-1000007 1d -789012 lqx $2,$7,$5000008 1 890123 lqx $3,$7,$6000014 0 -----456789 fma $2,$8,$2,$3000020 1 -----012345 stqx $2,$7,$6 .L8:000022 1 2345 brnz $4,.L4000023 1 3456 bi $lr Cell Programming Workshop

  33. Sample – annotated source (example.s.timing) dual-issue status – blank indicates single issue. .file "example.c“ .text .align 3 .global saxpy .type saxpy, @function saxpy:000000 0D 01 ila $2,66051000000 1D 0123 shlqbyi $7,$3,0000001 0d 12 cgti $3,$3,0000002 1d -2345 shufb $8,$4,$4,$2000003 0D 3 nop $127000003 1D 3456 biz $3,$lr000004 0D 45 ori $4,$7,0000004 1D 456789 hbra .L8,.L4000005 0D 56 il $7,0000005 1D 5 lnop .L4:000006 0d 67 ai $4,$4,-1000007 1d -789012 lqx $2,$7,$5000008 1 890123 lqx $3,$7,$6000014 0 -----456789 fma $2,$8,$2,$3000020 1 -----012345 stqx $2,$7,$6 .L8:000022 1 2345 brnz $4,.L4000023 1 3456 bi $lr d – indicates that for the pair of instructions, dual-issue is possible D – indicates the pair of instructions will be dual-issued. Cell Programming Workshop

  34. Sample – annotated source (example.s.timing) Instruction clock cycle occupancy – A digit (0-9) is displayed for every clock cycle the instruction executes. Operand dependency stalls are flagged by a dash (“-”) for every clock cycle the instruction is expect to stall. .file "example.c“ .text .align 3 .global saxpy .type saxpy, @function saxpy:000000 0D 01 ila $2,66051000000 1D 0123 shlqbyi $7,$3,0000001 0d 12 cgti $3,$3,0000002 1d -2345 shufb $8,$4,$4,$2000003 0D 3 nop $127000003 1D 3456 biz $3,$lr000004 0D 45 ori $4,$7,0000004 1D 456789 hbra .L8,.L4000005 0D 56 il $7,0000005 1D 5 lnop .L4:000006 0d 67 ai $4,$4,-1000007 1d -789012 lqx $2,$7,$5000008 1 890123 lqx $3,$7,$6000014 0 -----456789 fma $2,$8,$2,$3000020 1 -----012345 stqx $2,$7,$6 .L8:000022 1 2345 brnz $4,.L4000023 1 3456 bi $lr Cell Programming Workshop

  35. SIMD - Example 5 in /opt/cell_class/Hands-on-30/SIMD • By setting SPU_TIMING = 1 as either an environment variable or in the SPU Makefile and executing: make <spu_source>.s a static timing report is generated with the extension .s.timing • Another way is to issue the command SPU_TIMING=1 make <spu_source>.s Cell Programming Workshop

  36. #include <stdio.h> #define N 16 int mult1(float *in1, float *in2, float *out, int num); float a[N] __attribute__ ((aligned(16))) = { 1.1, 2.2, 4.4, 5.5, 6.6, 7.7, 8.8, 9.9, 2.2, 3.3, 3.3, 2.2, 5.5, 6.6, 6.6, 5.5}; float b[N] __attribute__ ((aligned(16))) = { 1.1, 2.2, 4.4, 5.5, 5.5, 6.6, 6.6, 5.5, 2.2, 3.3, 3.3, 2.2, 6.6, 7.7, 8.8, 9.9}; float c[N] __attribute__ ((aligned(16))); int main() { int num = N; int i; mult1(a, b, c, num); for (i=0;i<N;i+=4) printf("%.2f %.2f %.2f %.2f\n", c[i], c[i+1], c[i+2], c[i+3]); return 0; } #include <stdio.h> int mult1(float *in1, float *in2, float *out, int N) { int i; vector float *a = (vector float *) in1; vector float *b = (vector float *) in2; vector float *c = (vector float *) out; int Nv = N/4; for (i=0;i<Nv;i++) { c[i] = spu_mul(a[i], b[i]); } return 0; } example3 – SIMD spu_timing main.c PROGRAM_spu = example5 SPU_TIMING = 1 include $(CELL_TOP)/buildutils/make.footer Makefile ! mult1.c Cell Programming Workshop

More Related