Developing Code for Cell - SIMD

Developing Code for Cell - SIMD Cell Programming Workshop Cell Programming Workshop

Class Agenda • Vector Programming (SIMD) • Data types for vector programming • Application partitioning • SPU Intrinsics Trademarks - Cell Broadband Engine ™ is a trademark of Sony Computer Entertainment, Inc. Cell Programming Workshop

PPE vs. SPE Comparison Cell Programming Workshop

PPE vs SPE • Both PPE and SPE execute SIMD instructions • PPE processes SIMD operations in the VXU within its PPU • SPEs process SIMD operations in their SPU • Both processors execute different instruction sets • Programs written for the PPE and SPEs must be compiled by different compilers Cell Programming Workshop

PPE and SPE Language-Extension Differences • Both operate on 128-bit SIMD vectors • Only the Vector/SIMD Multimedia Extension instruction set supports pixel vectors • Only the SPU instruction set supports doubleword vectors Cell Programming Workshop

Register Layout of Data Types and Preferred Slot When instructions use or produce scalar operands or addresses, the values are in the preferred scalar slot The left-most word (bytes 0, 1, 2, and 3) of a register is called the preferred slot Cell Programming Workshop

SIMD Architecture • SIMD = “single-instruction multiple-data” • SIMD exploits data-level parallelism • a single instruction can apply the same operation to multiple data elements in parallel • SIMD units employ “vector registers” • each register holds multiple data elements • SIMD is pervasive in the BE • PPE includes VMX (SIMD extensions to PPC architecture) • SPE is a native SIMD architecture (VMX-like) • SIMD in VMX and SPE • 128bit-wide datapath • 128bit-wide registers • 4-wide fullwords, 8-wide halfwords, 16-wide bytes • SPE includes support for 2-wide doublewords Cell Programming Workshop

Vector code read instruction and decode it get these 4 numbers get those 4 numbers add them put the results here Instr. Decode Single Instruction Number 1a Number 1b Number 1c Number 1d + + + + Number 2a Number 2b Number 2c Number 2d Multiple Data = = = = Result 1 Result 2 Result 3 Result 4 SIMD (Single Instruction Multiple Data) processing • Data-level parallelism / operate on multiple data elements simultaneously Scalar code read the next instruction and decode it get this number get that number add them put the result here read the next instruction and decode it get this number get that number add them put the result here read the next instruction and decode it get this number get that number add them put the result here. read the next instruction and decode it get this number get that number add them put the result there Instr. Decode Number 1 + Repeat 4 times Number 2 = Result 1 Do Once • (Significantly) smaller amount of code => improved execution efficiency • Number of elements processed in parallel = (size of vector / size of element)‏ Cell Programming Workshop

I - Example of a SIMD operation In both the PPE and SPEs, vector registers hold multiple data elements as a single vector. The data paths and registers supporting SIMD operations are 128 bits wide, corresponding to four full 32-bit words. This means that four 32-bit words can be loaded into a single register, and, for example, added to four other words in a different register in a single operation. Similar operations can be performed on vector operands containing 16 bytes, 8 halfwords, or 2 doublewords. Cell Programming Workshop

SIMD “Cross-Element” Instructions • VMX and SPE architectures include “cross-element” instructions • shifts and rotates • permutes / shuffles • Permute / Shuffle • selects bytes from two source registers and places selected bytes in a target register • byte selection and placement controlled by a “control vector” in a third source register • extremely useful for reorganizing data in the vector register file Cell Programming Workshop

PowerPC Processor Element II - Example of a SIMD operation The bytes selected for the shuffle from the source registers, VA and VB, are based on byte entries in the control vector, VC, in which a 0 specifies VA and a 1 specifies VB. The result of the shuffle is placed in register VT. The shuffle operation is extremely powerful and finds its way into many applications in which data reordering, selection, or merging is required . byte-shuffle (permute) operation Cell Programming Workshop

PowerPC Processor Element SIMD Vectorization • A vectoris an instruction operand containing a set of data elements packed into a one-dimensional array • The elements can be fixed-point or floating-point values • SIMD processing exploits data-level parallelism: • Data-level parallelism means that the operations required to transform a set of vector elements can be performed on all elements of the vector at the same time. That is, a single instruction can be applied to multiple data elements in parallel (SIMD) By the vector/SIMD multimedia extension instructions In the PPE Support for SIMD operations: In the SPEs By the SPU instruction set Cell Programming Workshop

Basic SIMD Operations – A PPU program Use altivec header file /* * declares vector variables which points to scalar arrays */ __vector signed int *va = (__vector signed int *) a; __vector signed int *vb = (__vector signed int *) b; __vector signed int *vc = (__vector signed int *) c; /* * adds four signed integers at once */ *vc = vec_add(*va, *vb); // 1 + 2, 3 + 4, 5 + 6, 7 + 8 /* * output results */ printf("c[0]=%d, c[1]=%d, c[2]=%d, c[3]=%d\n", c[0], c[1], c[2], c[3]); return 0; } #include <stdio.h> #include <altivec.h> --- not required in IBM SDK for XLC /* * declares input/output scalar variables */ int a[4] __attribute__((aligned(16))) = { 1, 3, 5, 7 }; int b[4] __attribute__((aligned(16))) = { 2, 4, 6, 8 }; int c[4] __attribute__((aligned(16))); int main(int argc, char **argv) { Use altivec vector intrinsic instruction Source: PS3 Programming Cell Programming Workshop

Basic SIMD Operations – An SPU program /* * declares vector variables which points to scalar arrays */ __vector signed int *va = (__vector signed int *) a; __vector signed int *vb = (__vector signed int *) b; __vector signed int *vc = (__vector signed int *) c; /* * adds four signed intergers at once */ *vc = spu_add(*va, *vb); // 1 + 2, 3 + 4, 5 + 6, 7 + 8 // use of spu_add instead of vec_add /* * output results */ printf("c[0]=%d, c[1]=%d, c[2]=%d, c[3]=%d\n", c[0], c[1], c[2], c[3]); return 0; } Use spu_intrinsics header file #include <stdio.h> #include <spu_intrinsics.h> //the SPU intrinsics functions header file /* * declares input/output scalar variables */ int a[4] __attribute__((aligned(16))) = { 1, 3, 5, 7 }; int b[4] __attribute__((aligned(16))) = { 2, 4, 6, 8 }; int c[4] __attribute__((aligned(16))); int main(int argc, char **argv) { Use spu intrinsics function Source: PS3 Programming Cell Programming Workshop

Generic SPU Intrinsics • Generic • Constant formation (spu_splats) • Conversion (spu_convtf, spu_convts, ..) • Arithmetic (spu_add, spu_madd, spu_nmadd, ...) • Byte operations (spu_absd, spu_avg,...) • Compare and branch (spu_cmpeq, spu_cmpgt,...) • Bits and masks (spu_shuffle, spu_sel,...) • Logical (spu_and, spu_or, ...) • Shift and rotate (spu_rlqwbyte, spu_rlqw,...) • Control (spu_stop, spu_ienable, spu_idisable, ...) • Channel Control (spu_readch, spu_writech,...) • Scalar (spu_insert, spu_extract, spu_promote) Cell Programming Workshop

Vectorizing a Loop – A Simple Example • Loop does term-by-term multiply of two arrays • arrays assumed here to remain scalar outside of subroutine • If array size is not a multiple of 4, extra work is necessary Cell Programming Workshop

void load_data(int *dest) { int i; for (i=0; i<4096; ++i) { ++dest[i]; } } void load_data_SIMD(int *dest) { int i; vector unsigned int *vdest; vector unsigned int v1 = (vector unsigned int) {1, 1, 1, 1}; vdest = (vector unsigned int *) dest; for (i=0; i<1024; ++i) { vdest[i] = spu_add(vdest[i], v1); } } Another SIMD Example – data incrementing Cell Programming Workshop

A1 A1 A2 B2 B1 Q1 Q2 A2 I1 A2 I2 A1 c3 c1 a3 c1 a1 a1 b1 a3 a1 d1 a1 a3 b3 d3 d1 b3 d2 a2 b1 b2 b1 b3 b1 c2 d3 a4 c4 c2 a4 a2 b3 a2 a3 a4 c3 a2 b2 d4 a4 b2 d4 d2 c4 b4 b4 b2 b4 b4 Complex Multiplication SPE - Shuffle Vectors Z3 Z4 Z1 Z2 I P 0-3 8-11 16-19 24-27 Input Shuffle patterns Q P 4-7 12-15 20-23 28-31 W3 W4 W1 W2 I1 = spu_shuffle(A1, A2, I_Perm_Vector); I P 0-3 8-11 16-19 24-27 By analogy I2 = spu_shuffle(B1, B2, I_Perm_Vector); Q1 = spu_shuffle(A1, A2, Q_Perm_Vector); Q P 4-7 12-15 20-23 28-31 Q2 = spu_shuffle(B1, B2, Q_Perm_Vector); By analogy Cell Programming Workshop

I2 Q1 Z I1 Q2 Q2 Q1 Z I1 I2 a1 0 c1 d1 b1 c1 b1 0 d1 a1 b2 0 b2 a2 c2 c2 a2 d2 d2 0 b3 0 d3 b3 c3 a3 d3 a3 0 c3 0 c4 b4 a4 0 d4 b4 a4 c4 d4 Complex Multiplication * - A1 = spu_nmsub(Q1, Q2, v_zero); A2 = spu_madd(Q1, I2, v_zero); Q1 = spu_madd(I1, Q2, A2); I1 = spu_madd(I1, I2, A1); A1 -(b1*d1) -(b2*d2) -(b3*d3) -(b4*d4) * + A2 b1*c1 b2*c2 b3*c3 b4*c4 * A2 b1*c1 b2*c2 b3*c3 b4*c4 + a1*d1+ b1*c1 a2*d2+ b2*c2 a3*d3+ b3*c3 a4*d4+ b4*c4 Q1 * + A1 -(b1*d1) -(b2*d2) -(b3*d3) -(b4*d4) a1*c1- b1*d1 a2*c2- b2*d2 a3*c3- b3*d3 a4*c4- b4*d4 I1 Cell Programming Workshop

Complex multiplication – SPE - Summary vector float A1, A2, B1, B2, I1, I2, Q1, Q2, D1, D2; /* in-phase (real), quadrature (imag), temp, and output vectors*/ vector float v_zero = (vector float)(0,0,0,0); vector unsigned char I_Perm_Vector = (vector unsigned char)(0,1,2,3,8,9,10,11,16,17,18,19,24,25,26,27); vector unsigned char Q_Perm_Vector = (vector unsigned char)(4,5,6,7,12,13,14,15,20,21,22,23,28,29,30,31); vector unsigned char vcvmrgh = (vector unsigned char) (0,1,2,3,16,17,18,19,4,5,6,7,20,21,22,23); vector unsigned char vcvmrgl = (vector unsigned char) (8,9,10,11,24,25,26,27,12,13,14,15,28,29,30,31); /* input vectors are in interleaved form in A1,A2 and B1,B2 with each input vector representing 2 complex numbers and thus this loop would repeat for N/4 iterations */ I1 = spu_shuffle(A1, A2, I_Perm_Vector); /* pulls out 1st and 3rd 4-byte element from vectors A1 and A2 */ I2 = spu_shuffle(B1, B2, I_Perm_Vector); /* pulls out 1st and 3rd 4-byte element from vectors B1 and B2 */ Q1 = spu_shuffle(A1, A2, Q_Perm_Vector); /* pulls out 2nd and 4th 4-byte element from vectors A1 and A2 */ Q2 = spu_shuffle(B1, B2, Q_Perm_Vector); /* pulls out 3rd and 4th 4-byte element from vectors B1 and B2 */ A1 = spu_nmsub(Q1, Q2, v_zero); /* calculates –(bd – 0) for all four elements */ A2 = spu_madd(Q1, I2, v_zero); /* calculates (bc + 0) for all four elements */ Q1 = spu_madd(I1, Q2, A2); /* calculates ad + bc for all four elements */ I1 = spu_madd(I1, I2, A1); /* calculates ac – bd for all four elements */ D1 = spu_shuffle(I1, Q1, vcvmrgh); /* spreads the results back into interleaved format */ D2 = spu_shuffle(I1, Q1, vcvmrgl); /* spreads the results back into interleaved format */ Cell Programming Workshop

Hands-on SIMD Cell Programming Workshop Cell/Quasar Ecosystem & Solutions Enablement Cell Programming Workshop

Class Agenda • Example #1 - Develop and run a simple PPU scalar program • Example #2 – Converting to an SPU program with data alignment • Example #3 – Static profiling with spu_timing • Learn how to SIMDize scalar code on CellBE Coding examples/opt/cell_class/Hands-on-30/SIMD Cell Programming Workshop

Coding Strategy • Develop 2 functions • mult1.c to multiply two arrays of SP floats together. This program will be used to characterize the SIMD features • main.c is a non-vector driver program Cell Programming Workshop

#include <stdio.h> #define N 16 int mult1(float *in1, float *in2, float *out, int num); float a[N] = { 1.1, 2.2, 4.4, 5.5, 6.6, 7.7, 8.8, 9.9, 2.2, 3.3, 3.3, 2.2, 5.5, 6.6, 6.6, 5.5}; float b[N] = { 1.1, 2.2, 4.4, 5.5, 5.5, 6.6, 6.6, 5.5, 2.2, 3.3, 3.3, 2.2, 6.6, 7.7, 8.8, 9.9}; float c[N]; int main() { int num = N; int i; mult1(a, b, c, num); for (i=0;i<N;i+=4) printf("%.2f %.2f %.2f %.2f\n", c[i], c[i+1], c[i+2], c[i+3]); return 0; } #include <stdio.h> int mult1(float *in1, float *in2, float *out, int N) { int i; for (i=0;i<N;i++) { out[i] = in1[i] * in2[i]; } return 0; } example1 - SIMD main.c mult1.c Cell Programming Workshop

#include <stdio.h> #define N 16 int mult1(float *in1, float *in2, float *out, int num); float a[N] __attribute__ ((aligned(16))) = { 1.1, 2.2, 4.4, 5.5, 6.6, 7.7, 8.8, 9.9, 2.2, 3.3, 3.3, 2.2, 5.5, 6.6, 6.6, 5.5}; float b[N] __attribute__ ((aligned(16))) = { 1.1, 2.2, 4.4, 5.5, 5.5, 6.6, 6.6, 5.5, 2.2, 3.3, 3.3, 2.2, 6.6, 7.7, 8.8, 9.9}; float c[N] __attribute__ ((aligned(16))); int main() { int num = N; int i; mult1(a, b, c, num); for (i=0;i<N;i+=4) printf("%.2f %.2f %.2f %.2f\n", c[i], c[i+1], c[i+2], c[i+3]); return 0; } #include <stdio.h> int mult1(float *in1, float *in2, float *out, int N) { int i; for (i=0;i<N;i++) { out[i] = in1[i] * in2[i]; } return 0; } Example1.b - SIMD main.c mult1.c Cell Programming Workshop

#include <stdio.h> #define N 16 int mult1(float *in1, float *in2, float *out, int num); float a[N] __attribute__ ((aligned(16))) = { 1.1, 2.2, 4.4, 5.5, 6.6, 7.7, 8.8, 9.9, 2.2, 3.3, 3.3, 2.2, 5.5, 6.6, 6.6, 5.5}; float b[N] __attribute__ ((aligned(16))) = { 1.1, 2.2, 4.4, 5.5, 5.5, 6.6, 6.6, 5.5, 2.2, 3.3, 3.3, 2.2, 6.6, 7.7, 8.8, 9.9}; float c[N] __attribute__ ((aligned(16))); int main() { int num = N; int i; mult1(a, b, c, num); for (i=0;i<N;i+=4) printf("%.2f %.2f %.2f %.2f\n", c[i], c[i+1], c[i+2], c[i+3]); return 0; } #include <stdio.h> #include <spu_intrinsics.h> int mult1(float *in1, float *in2, float *out, int N) { int i; vector float *a = (vector float *) in1; vector float *b = (vector float *) in2; vector float *c = (vector float *) out; int Nv = N/4; // NV=4 for (i=0;i<Nv;i++) { c[i] = spu_mul(a[i], b[i]); } return 0; } ! Example #2 – Converting to an SPU main.c mult1.c Cell Programming Workshop

Example #3 – Static profiling with spu_timing Cell Programming Workshop

What is the SPU timing tool • Annotates an assembly source file with static analysis of instruction timing assuming linear (branchless) execution. C source code (example.c) spu-gcc –S example.corspuxlc –S example.c Assembly source (example.s) spu_timing example.s Annotated assembly source (example.s.timing) Cell Programming Workshop

Sample - assembly source (example.s) .file "example.c“.text .align 3 .global saxpy .type saxpy, @functionsaxpy: ila $2,66051 shlqbyi $7,$3,0 cgti $3,$3,0 shufb $8,$4,$4,$2 nop $127 biz $3,$lr ori $4,$7,0 hbra .L8,.L4 il $7,0 lnop.L4: ai $4,$4,-1 lqx $2,$7,$5 lqx $3,$7,$6 fma $2,$8,$2,$3 stqx $2,$7,$6 ai $7,$7,16.L8: brnz $4,.L4 bi $lr Cell Programming Workshop

Example .timing example.s.timing .file "mult1.c" .text .align 3 .global mult1 .type mult1, @function mult1: 0D 01 cgti $2,$6,-1 1D 0123 shlqbyi $10,$3,0 0D 12 ai $3,$6,3 1D 1234 shlqbyi $9,$4,0 0 -34 selb $3,$3,$6,$2 0 -5678 rotmai $8,$3,-2 0d ---90 cgti $2,$8,0 1d --1234 brnz $2,.L8 .L2: 0D 23 il $3,0 1D 2345 bi $lr .L8: 0D 34 il $7,0 1D 345678 hbra .L9,.L4 0D 45 il $6,0 1D 4 lnop .L4: 0d 56 ai $7,$7,1 1d -678901 lqx $2,$10,$6 1 789012 lqx $3,$9,$6 0 89 ceq $4,$8,$7 0d ----345678 fm $2,$2,$3 1d ------901234 stqx $2,$5,$6 0D 01 ai $6,$6,16 .L9: 1D 0123 brz $4,.L4 0D 1 nop $127 1D 1234 br .L2 .size mult1, .-mult1 .ident "GCC: (GNU) 4.0.2 (CELL 3-2, Apr 11 2006)" Multiplication loop spu_mul stalling on load Cell Programming Workshop

Sample – annotated source (example.s.timing) Inner Loop running-count – cycle count for which each instruction starts. Useful for determining the cycles in a loop. For example, our loop is 17 cycles (23-6). .file "example.c“ .text .align 3 .global saxpy .type saxpy, @function saxpy:000000 0D 01 ila $2,66051000000 1D 0123 shlqbyi $7,$3,0000001 0d 12 cgti $3,$3,0000002 1d -2345 shufb $8,$4,$4,$2000003 0D 3 nop $127000003 1D 3456 biz $3,$lr000004 0D 45 ori $4,$7,0000004 1D 456789 hbra .L8,.L4000005 0D 56 il $7,0000005 1D 5 lnop .L4:000006 0d 67 ai $4,$4,-1000007 1d -789012 lqx $2,$7,$5000008 1 890123 lqx $3,$7,$6000014 0 -----456789 fma $2,$8,$2,$3000020 1 -----012345 stqx $2,$7,$6 .L8:000022 1 2345 brnz $4,.L4000023 1 3456 bi $lr Cell Programming Workshop

Sample – annotated source (example.s.timing) Execution pipeline – the pipeline in which the instruction is issued. Either 0 (even pipeline) or 1 (odd pipeline). .file "example.c“ .text .align 3 .global saxpy .type saxpy, @function saxpy:000000 0D 01 ila $2,66051000000 1D 0123 shlqbyi $7,$3,0000001 0d 12 cgti $3,$3,0000002 1d -2345 shufb $8,$4,$4,$2000003 0D 3 nop $127000003 1D 3456 biz $3,$lr000004 0D 45 ori $4,$7,0000004 1D 456789 hbra .L8,.L4000005 0D 56 il $7,0000005 1D 5 lnop .L4:000006 0d 67 ai $4,$4,-1000007 1d -789012 lqx $2,$7,$5000008 1 890123 lqx $3,$7,$6000014 0 -----456789 fma $2,$8,$2,$3000020 1 -----012345 stqx $2,$7,$6 .L8:000022 1 2345 brnz $4,.L4000023 1 3456 bi $lr Cell Programming Workshop

Sample – annotated source (example.s.timing) dual-issue status – blank indicates single issue. .file "example.c“ .text .align 3 .global saxpy .type saxpy, @function saxpy:000000 0D 01 ila $2,66051000000 1D 0123 shlqbyi $7,$3,0000001 0d 12 cgti $3,$3,0000002 1d -2345 shufb $8,$4,$4,$2000003 0D 3 nop $127000003 1D 3456 biz $3,$lr000004 0D 45 ori $4,$7,0000004 1D 456789 hbra .L8,.L4000005 0D 56 il $7,0000005 1D 5 lnop .L4:000006 0d 67 ai $4,$4,-1000007 1d -789012 lqx $2,$7,$5000008 1 890123 lqx $3,$7,$6000014 0 -----456789 fma $2,$8,$2,$3000020 1 -----012345 stqx $2,$7,$6 .L8:000022 1 2345 brnz $4,.L4000023 1 3456 bi $lr d – indicates that for the pair of instructions, dual-issue is possible D – indicates the pair of instructions will be dual-issued. Cell Programming Workshop

Sample – annotated source (example.s.timing) Instruction clock cycle occupancy – A digit (0-9) is displayed for every clock cycle the instruction executes. Operand dependency stalls are flagged by a dash (“-”) for every clock cycle the instruction is expect to stall. .file "example.c“ .text .align 3 .global saxpy .type saxpy, @function saxpy:000000 0D 01 ila $2,66051000000 1D 0123 shlqbyi $7,$3,0000001 0d 12 cgti $3,$3,0000002 1d -2345 shufb $8,$4,$4,$2000003 0D 3 nop $127000003 1D 3456 biz $3,$lr000004 0D 45 ori $4,$7,0000004 1D 456789 hbra .L8,.L4000005 0D 56 il $7,0000005 1D 5 lnop .L4:000006 0d 67 ai $4,$4,-1000007 1d -789012 lqx $2,$7,$5000008 1 890123 lqx $3,$7,$6000014 0 -----456789 fma $2,$8,$2,$3000020 1 -----012345 stqx $2,$7,$6 .L8:000022 1 2345 brnz $4,.L4000023 1 3456 bi $lr Cell Programming Workshop

SIMD - Example 5 in /opt/cell_class/Hands-on-30/SIMD • By setting SPU_TIMING = 1 as either an environment variable or in the SPU Makefile and executing: make <spu_source>.s a static timing report is generated with the extension .s.timing • Another way is to issue the command SPU_TIMING=1 make <spu_source>.s Cell Programming Workshop

#include <stdio.h> #define N 16 int mult1(float *in1, float *in2, float *out, int num); float a[N] __attribute__ ((aligned(16))) = { 1.1, 2.2, 4.4, 5.5, 6.6, 7.7, 8.8, 9.9, 2.2, 3.3, 3.3, 2.2, 5.5, 6.6, 6.6, 5.5}; float b[N] __attribute__ ((aligned(16))) = { 1.1, 2.2, 4.4, 5.5, 5.5, 6.6, 6.6, 5.5, 2.2, 3.3, 3.3, 2.2, 6.6, 7.7, 8.8, 9.9}; float c[N] __attribute__ ((aligned(16))); int main() { int num = N; int i; mult1(a, b, c, num); for (i=0;i<N;i+=4) printf("%.2f %.2f %.2f %.2f\n", c[i], c[i+1], c[i+2], c[i+3]); return 0; } #include <stdio.h> int mult1(float *in1, float *in2, float *out, int N) { int i; vector float *a = (vector float *) in1; vector float *b = (vector float *) in2; vector float *c = (vector float *) out; int Nv = N/4; for (i=0;i<Nv;i++) { c[i] = spu_mul(a[i], b[i]); } return 0; } example3 – SIMD spu_timing main.c PROGRAM_spu = example5 SPU_TIMING = 1 include $(CELL_TOP)/buildutils/make.footer Makefile ! mult1.c Cell Programming Workshop

Developing Code for Cell - SIMD

Developing Code for Cell - SIMD

Presentation Transcript

Optimizing Data Permutations for SIMD Devices

Liquid SIMD: Abstracting SIMD Hardware Using Lightweight Dynamic Mapping

New Algorithms for SIMD Alignment

SIMD Processor Extensions

SIMD

Intel SIMD architecture

Intel SIMD architecture

Developing Software for the PS3/Cell Platform

SIMD Architectures

Bottlenecks of SIMD

Intel SIMD architecture

A Cell-by-Cell AMR Method for the PPM Hydrodynamics Code

Streaming SIMD Extensions

Intel SIMD architecture

Intel SIMD architecture

Developing Models in Virtual Cell

Intel SIMD architecture