C Optimization Techniques for StarCore SC140


  1. C Optimization Techniques for StarCore SC140 Created by Bogdan Costinescu

  2. Agenda • C and DSP Software Development • Compiler Extensions for DSP Programming • Coding Style and DSP Programming Techniques • DSP Coding Recommendations • Conclusions

  3. What do optimizations mean? • reduction in cycle count • reduction in code size • reduction in data structures • reduction in stack memory size • reduction of power consumption • how do we measure if we did it right?

  4. Why C in DSP Programming? • C is a “structured programming language designed to produce a compact and efficient translation of a program into machine language” • high-level, but close to assembly • large set of tools available • flexible, maintainable, portable => reusability

  5. Compiler-friendly Architectures • large set of registers • orthogonal register file • flexible addressing modes • few restrictions • native support for several data types

  6. Example - StarCore SC140 • Variable Length Execution Set architecture • one execution set can hold up to: • four DALU instructions • two AGU instructions • scalable architecture with rich instruction set • large orthogonal register file • 16 DALU data registers • 16 address + 4 offset + 4 modifier AGU registers • short pipeline => few restrictions • direct support for integers and fractionals

  7. Is C a Perfect Match for DSPs? • fixed-point not supported • no support for dual-memory architectures • no port I/O or special instructions • no support for SIMD instructions • vectors represented by pointers

  8. C Support for Parallelism • basically, NONE • C = a sequential programming language • cannot express actions in parallel • no equivalent of SIMD instructions • the compiler has to analyze the scheduling of independent operations • the analysis depends on the coding style

  9. Can C be made efficient for DSPs? YES, IF the programmer understands • compiler extensions • programming techniques • compiler behavior

  10. Compiler Extensions • intrinsics • efficient access to the target instruction set • new data types • use processor-specific data formats in C • pragmas • provide extra information to the compiler • inline assembly • the solution to the “impossible in C” problem

  11. Intrinsic functions (Intrinsics) • special compiler-translated function calls • map to one or more target instructions • extend the set of operators in C • define new types • single and double precision fractionals • extended precision fractionals • emulation possible on other platforms

  12. Data Types • DSPs may handle several data formats • identical with C basic types • integers • common length, but different semantics • fractionals on 16 and 32 bits • DSP-specific (i.e. including the extension bits) • fractionals on 40 bits • all DSP-specific data types come with a set of operators

  13. C Types for Fractionals • conventions: • Word16 for 16-bit fractionals • Word32 for 32-bit fractionals • Word40 for 40-bit fractionals (32 bits plus 8 guard bits) • Word64 for 64-bit fractionals (no direct representation on SC140) • intrinsics, not conventions, impose the fractional property on values
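On a host platform these naming conventions are commonly mapped onto fixed-width standard types. The sketch below is one plausible mapping, not the vendor header: in particular, using int64_t as the container for Word40 (32 value bits plus 8 guard bits) is an emulation assumption, since no C basic type is exactly 40 bits wide.

```c
#include <stdint.h>

/* Host-side stand-ins for the fractional type conventions above.
   Word40 has no exact C container, so a 64-bit integer holds the
   32 value bits plus 8 guard bits (an emulation assumption). */
typedef int16_t Word16;  /* 16-bit fractional, Q15                     */
typedef int32_t Word32;  /* 32-bit fractional, Q31                     */
typedef int64_t Word40;  /* 40-bit accumulator: 32 bits + 8 guard bits */
typedef int64_t Word64;  /* 64-bit fractional (emulated on SC140)      */
```

As the last bullet says, these typedefs alone carry no fractional semantics; only the intrinsic operators applied to the values do.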

  14. Fractional Emulation • needed on non-StarCore platforms • maintains the meaning of fractional operations • in some corner cases, the emulated results are not correct • e.g. mpysu uses 32 bits for the result plus sign • special Metrowerks intrinsics solve possible inconsistencies • e.g. mpysu_shr16
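As an illustration of such emulation, a minimal host version of the saturating 16-bit add operator might look like this. Widening to 32 bits, clamping, then narrowing is one common strategy; this is a sketch of the ETSI-style semantics, not the Metrowerks library source.

```c
#include <stdint.h>

typedef int16_t Word16;
typedef int32_t Word32;

/* Emulated saturating fractional add: compute in a wider type, then
   clamp to the 16-bit range so overflow saturates instead of wrapping,
   matching the DSP DALU behavior. */
static Word16 add(Word16 a, Word16 b)
{
    Word32 sum = (Word32)a + (Word32)b;
    if (sum > 32767)  sum = 32767;
    if (sum < -32768) sum = -32768;
    return (Word16)sum;
}
```

On SC140 the compiler recognizes the intrinsic call and maps it to a single saturating add instruction; the C body above only runs on other platforms.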

  15. Fractional Values and Operators • introduced by fractional intrinsics • operators standardized (ITU-T, ETSI) • addition, multiplication • saturation, rounding • scaling, normalization • operate on specific data types Word16 add(Word16, Word16); Word32 L_add(Word32, Word32); Word40 X_add(Word40, Word40); Word64 D_add(Word64, Word64);
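A host-side sketch of the multiply-accumulate family used throughout the loop examples later in the deck. This is an illustrative emulation of the ETSI-style semantics, not the vendor library; note the single special case where (-1)·(-1) must saturate to the largest positive Q31 value.

```c
#include <stdint.h>

typedef int16_t Word16;
typedef int32_t Word32;

/* Emulated fractional multiply: Q15 x Q15 -> Q31 doubles the integer
   product; the one overflow case is (-1)*(-1), which must saturate. */
static Word32 L_mult(Word16 a, Word16 b)
{
    if (a == -32768 && b == -32768)
        return INT32_MAX;                /* saturate (-1)*(-1) */
    return 2 * ((Word32)a * (Word32)b);  /* Q30 -> Q31 */
}

/* Emulated saturating 32-bit add, widened to 64 bits then clamped. */
static Word32 L_add(Word32 a, Word32 b)
{
    int64_t s = (int64_t)a + (int64_t)b;
    if (s > INT32_MAX) return INT32_MAX;
    if (s < INT32_MIN) return INT32_MIN;
    return (Word32)s;
}

/* L_mac: the multiply-accumulate operator used in the loop examples. */
static Word32 L_mac(Word32 acc, Word16 a, Word16 b)
{
    return L_add(acc, L_mult(a, b));
}
```

On the target each of these calls collapses into one DALU instruction; the emulation exists only so the same C source stays bit-exact on a workstation.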

  16. Pragmas • a standard way to communicate with the compiler • describe code or data characteristics • pragmas do not influence correctness • but can significantly influence performance #pragma align signal 8 #pragma loop_unroll 4 #pragma inline #pragma loop_count (20, 40, 4)

  17. Alignment pragma • #pragma align *ptr 8 • for parameters • #pragma align ptr 8 • for vector declaration and definition • derived pointers may not inherit the alignment Word16 *p = aligned_ptr + 24; /* not recommended */ #define p (aligned_ptr + 24) /* recommended */ • create a new function if you want to specify alignment for internal pointers

  18. Inline Assembly • two forms • a whole function • a single instruction, e.g. asm(“ di”); • difficult access to local function symbols • globals can be accessed by their name • asm statements block the C optimization process

  19. Inline Assembly • access to special target features not available through intrinsics • special calling convention • the programmer has more freedom • fast assembly integration with C • easy integration of legacy code • two types • one instruction, e.g. asm(“ di”); • a whole function

  20. Inline Assembly - Example asm Flag GetOverflow(void) { asm_header return in $d0; .reg $d0,$d1; asm_body clr d0 ; extra cycle needed to allow DOVF to ; be written even by the instructions ; in the delay slot of JSRD bmtsts #$0004,EMR.L ; test the Overflow bit from EMR move.w #1,d1 tfrt d1,d0 ; if Overflow return 1, else return 0 asm_end }

  21. Compiler Optimizations • standard C optimizations • low level (target) optimizations • several optimization levels available • fastest code • smallest code • global optimizations • all files are analyzed as a whole • the best performance

  22. DSP Application Success Factors • high-performance, compiler-friendly target • good optimizing compiler • good C coding style in source code => the only factor tunable by the programmer is the C coding style • coding style includes programming techniques • no compiler can overcome GIGO (garbage in, garbage out)

  23. Programming Techniques • described in the compiler’s manual • styles to be used • pitfalls to be avoided • explicitly eliminate data dependencies • complement compiler’s transformations that restructure the code

  24. SC140 Programming Techniques • general DSP programming techniques • loop merging • loop unrolling • loop splitting • specific to StarCore and its generation • split computation • multisample

  25. Loop Merging • combines two loops in a single one • precondition: the same number of iterations • may eliminate inter-loop storage space • increases cache locality • decreases loop overhead • increases DALU usage per loop iteration • typically, reduces code size

  26. Loop Merging Example /* scaling loop */ for ( i = 0; i < SIG_LEN; i++) { y[i] = shr(y[i], 2); } /* energy computation */ e = 0; for ( i = 0; i < SIG_LEN; i++) { e = L_mac(e, y[i], y[i]); } /* Compute at the same time the */ /* energy of the scaled signal */ e = 0; for (i = 0; i < SIG_LEN; i++) { Word16 temp; temp = shr(y[i], 2); e = L_mac(e, temp, temp); y[i] = temp; }

  27. Loop Unrolling • repeats the body of a loop • increases the DALU usage per loop step • increases the code size • depends on • data dependencies between iterations • data alignment of implied vectors • register pressure per iteration • number of iterations to be performed • keeps the bit-exactness

  28. Loop Unrolling Example int i; Word16 signal[SIGNAL_SIZE]; Word16 scaled_signal[SIGNAL_SIZE]; /* ... */ for(i=0; i<SIGNAL_SIZE; i++) { scaled_signal[i] = shr(signal[i], 2); } /* ... */ int i; Word16 signal[SIGNAL_SIZE]; #pragma align signal 8 Word16 scaled_signal[SIGNAL_SIZE]; #pragma align scaled_signal 8 /* ... */ for(i=0; i<SIGNAL_SIZE; i+=4) { scaled_signal[i+0] = shr(signal[i+0], 2); scaled_signal[i+1] = shr(signal[i+1], 2); scaled_signal[i+2] = shr(signal[i+2], 2); scaled_signal[i+3] = shr(signal[i+3], 2); } /* ... */

  29. Compiler generated code ; With loop unrolling : 2 cycles inside ; the loop. ; The compiler uses software pipelining [ move.4f (r0)+,d0:d1:d2:d3 move.l #_scaled_signal,r1 ] [ asrr #<2,d0 asrr #<2,d3 asrr #<2,d1 asrr #<2,d2 ] loopstart3 [ moves.4f d0:d1:d2:d3,(r1)+ move.4f (r0)+,d0:d1:d2:d3 ] [ asrr #<2,d0 asrr #<2,d1 asrr #<2,d2 asrr #<2,d3 ] loopend3 moves.4f d0:d1:d2:d3,(r1)+ ; Without loop unrolling : 6 cycles ; inside the loop loopstart3 L5 [ add #<2,d3 ;[20,1] move.l d3,r2;[20,1] move.l <_signal,r1 ;[20,1] ] move.l <_scaled_signal,r4 adda r2,r1 [ move.w (r1),d4 adda r2,r4 ] asrr #<2,d4 move.w d4,(r4) loopend3

  30. Why use loop unrolling? • increases DALU usage per loop step • explicit specification of value reuse • elimination of “uncompressible” cycles • destination the same as the source

  31. Loop unroll factor • difficult to provide a formula • depends on various parameters • data alignment • number of memory operations • number of computations • possible value reuse • number of iterations in the loop • unroll the loop until you get no gain • two and four are typical unroll factors • a factor of three also appears in the max_ratio benchmark

  32. Loop Unroll Factor – example 1 #include <prototype.h> #define VECTOR_SIZE 40 Word16 example1(Word16 a[], Word16 incr) { short i; Word16 b[VECTOR_SIZE]; for(i = 0; i < VECTOR_SIZE; i++) { b[i] = add(a[i], incr); } return b[0]; }

  33. Loop Unroll Factor - example 2 #include <prototype.h> #define VECTOR_SIZE 40 Word16 example2(Word16 a[], Word16 incr) { #pragma align *a 8 int i; Word16 b[VECTOR_SIZE]; #pragma align b 8 for(i = 0; i < VECTOR_SIZE; i++) { b[i] = add(a[i], incr); } return b[0]; } ; compiler generated code move.4f (r0)+,d4:d5:d6:d7 ... loopstart3 [ add d4,d8,d12 add d5,d8,d13 add d6,d8,d14 add d7,d8,d15 moves.4f d12:d13:d14:d15,(r1)+ move.4f (r0)+,d4:d5:d6:d7 ] loopend3

  34. Loop Unroll Factor - example 3 #include <prototype.h> #define VECTOR_SIZE 40 Word16 example3(Word16 a[], Word16 incr) { #pragma align *a 8 int i; Word16 b[VECTOR_SIZE]; #pragma align b 8 for(i = 0; i < VECTOR_SIZE; i++) { b[i] = a[i] >> incr; } return b[0]; } ; compiler generated code loopstart3 [ move.4w d4:d5:d6:d7,(r1)+ move.4w (r0)+,d4:d5:d6:d7 ] [ asrr d1,d4 asrr d1,d5 asrr d1,d6 asrr d1,d7 ] loopend3

  35. Split Computation • typically used in associative reductions • energy • mean square error • maximum • used to increase DALU usage • increases the code size • may require alignment properties • may influence the bit-exactness

  36. Split Computation Example int i; Word32 e; Word16 signal[SIGNAL_SIZE]; /* ... */ e = 0; for(i = 0; i < SIGNAL_SIZE; i++) { e = L_mac(e, signal[i], signal[i]); } /* the energy is now in e */ int i; Word32 e0, e1, e2, e3; Word16 signal[SIGNAL_SIZE]; #pragma align signal 8 /* ... */ e0 = e1 = e2 = e3 = 0; for(i = 0; i < SIGNAL_SIZE; i+=4) { e0 = L_mac(e0, signal[i+0], signal[i+0]); e1 = L_mac(e1, signal[i+1], signal[i+1]); e2 = L_mac(e2, signal[i+2], signal[i+2]); e3 = L_mac(e3, signal[i+3], signal[i+3]); } e0 = L_add(e0, e1); e1 = L_add(e2, e3); e0 = L_add(e0, e1); /* the energy is now in e0 */

  37. How many splits? • depends on memory alignment • the loop count should be a multiple of the number of splits • the splits have to be recombined to get the final result • the decision depends on the nesting level

  38. Loop Splitting • splits a loop into two or more separate loops • increases cache locality in big loops • decreases register pressure • minimizes dependencies • may create auxiliary storage
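The deck describes loop splitting without showing code, so here is a hypothetical C illustration. The function names and the scale-plus-state-update scenario are invented for this sketch; the point is only the shape of the transformation.

```c
#define LEN 64

/* Before: one loop updates two unrelated outputs, so all four arrays
   are live at once, raising register and pointer pressure. */
void combined(short y[LEN], short s[LEN],
              const short g[LEN], const short d[LEN])
{
    int i;
    for (i = 0; i < LEN; i++) {
        y[i] = (short)(g[i] >> 2);    /* scale one signal      */
        s[i] = (short)(s[i] + d[i]);  /* update a filter state */
    }
}

/* After: two independent loops, each with a smaller working set,
   fewer live values, and no interleaved dependencies. */
void split_loops(short y[LEN], short s[LEN],
                 const short g[LEN], const short d[LEN])
{
    int i;
    for (i = 0; i < LEN; i++)
        y[i] = (short)(g[i] >> 2);
    for (i = 0; i < LEN; i++)
        s[i] = (short)(s[i] + d[i]);
}
```

Because the two computations share no data, the split version computes exactly the same results; the trade-off is the extra loop overhead against the lower register pressure per loop.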

  39. Multisample • complex transformation for nested loops • loop unrolling on the outer loop • loop merging of the inner loops • loop unrolling of the merged inner loop • keeps bit-exactness • does not impose alignment restrictions • reduces the data move operations

  40. Multisample Example - Step 1 for (j = 0; j < N; j += 4) { acc0 = 0; for (i = 0; i < T; i++) acc0 = L_mac(acc0, x[i], c[j+0+i]); res[j+0] = acc0; acc1 = 0; for (i = 0; i < T; i++) acc1 = L_mac(acc1, x[i], c[j+1+i]); res[j+1] = acc1; acc2 = 0; for (i = 0; i < T; i++) acc2 = L_mac(acc2, x[i], c[j+2+i]); res[j+2] = acc2; acc3 = 0; for (i = 0; i < T; i++) acc3 = L_mac(acc3, x[i], c[j+3+i]); res[j+3] = acc3; } Word32 acc; Word16 x[], c[]; int i,j,N,T; assert((N % 4) == 0); assert((T % 4) == 0); for (j = 0; j < N; j++) { acc = 0; for (i = 0; i < T; i++) { acc = L_mac(acc, x[i], c[j+i]); } res[j] = acc; } 1 2 Outer loop unrolling

  41. Multisample Example - Step 2 for (j = 0; j < N; j += 4) { acc0 = 0; for (i = 0; i < T; i++) acc0 = L_mac(acc0, x[i], c[j+0+i]); res[j+0] = acc0; acc1 = 0; for (i = 0; i < T; i++) acc1 = L_mac(acc1, x[i], c[j+1+i]); res[j+1] = acc1; acc2 = 0; for (i = 0; i < T; i++) acc2 = L_mac(acc2, x[i], c[j+2+i]); res[j+2] = acc2; acc3 = 0; for (i = 0; i < T; i++) acc3 = L_mac(acc3, x[i], c[j+3+i]); res[j+3] = acc3; } 2 for (j = 0; j < N; j += 4) { acc0 = 0; acc1 = 0; acc2 = 0; acc3 = 0; for (i = 0; i < T; i++) acc0 = L_mac(acc0, x[i], c[j+0+i]); for (i = 0; i < T; i++) acc1 = L_mac(acc1, x[i], c[j+1+i]); for (i = 0; i < T; i++) acc2 = L_mac(acc2, x[i], c[j+2+i]); for (i = 0; i < T; i++) acc3 = L_mac(acc3, x[i], c[j+3+i]); res[j+0] = acc0; res[j+1] = acc1; res[j+2] = acc2; res[j+3] = acc3; } 3 Rearrangement

  42. Multisample Example - Step 3 for (j = 0; j < N; j += 4) { acc0 = 0; acc1 = 0; acc2 = 0; acc3 = 0; for (i = 0; i < T; i++) acc0 = L_mac(acc0, x[i], c[j+0+i]); for (i = 0; i < T; i++) acc1 = L_mac(acc1, x[i], c[j+1+i]); for (i = 0; i < T; i++) acc2 = L_mac(acc2, x[i], c[j+2+i]); for (i = 0; i < T; i++) acc3 = L_mac(acc3, x[i], c[j+3+i]); res[j+0] = acc0; res[j+1] = acc1; res[j+2] = acc2; res[j+3] = acc3; } 3 Inner loop merging for (j = 0; j < N; j += 4) { acc0 = 0; acc1 = 0; acc2 = 0; acc3 = 0; for (i = 0; i < T; i++) { acc0 = L_mac(acc0, x[i], c[j+0+i]); acc1 = L_mac(acc1, x[i], c[j+1+i]); acc2 = L_mac(acc2, x[i], c[j+2+i]); acc3 = L_mac(acc3, x[i], c[j+3+i]); } res[j+0] = acc0; res[j+1] = acc1; res[j+2] = acc2; res[j+3] = acc3; } 4

  43. Multisample Example - Step 4 for (j = 0; j < N; j += 4) { acc0 = 0; acc1 = 0; acc2 = 0; acc3 = 0; for (i = 0; i < T; i++) { acc0 = L_mac(acc0, x[i], c[j+0+i]); acc1 = L_mac(acc1, x[i], c[j+1+i]); acc2 = L_mac(acc2, x[i], c[j+2+i]); acc3 = L_mac(acc3, x[i], c[j+3+i]); } res[j+0] = acc0; res[j+1] = acc1; res[j+2] = acc2; res[j+3] = acc3; } for (j = 0; j < N; j += 4) { acc0 = 0;acc1 = 0;acc2 = 0;acc3 = 0; for (i = 0; i < T; i += 4) { acc0 = L_mac(acc0, x[i+0], c[j+0+i]); acc1 = L_mac(acc1, x[i+0], c[j+1+i]); acc2 = L_mac(acc2, x[i+0], c[j+2+i]); acc3 = L_mac(acc3, x[i+0], c[j+3+i]); acc0 = L_mac(acc0, x[i+1], c[j+1+i]); acc1 = L_mac(acc1, x[i+1], c[j+2+i]); acc2 = L_mac(acc2, x[i+1], c[j+3+i]); acc3 = L_mac(acc3, x[i+1], c[j+4+i]); /* third loop body for x[i+2] */ /* fourth loop body for x[i+3] */ } res[j+0] = acc0; res[j+1] = acc1; res[j+2] = acc2; res[j+3] = acc3; } 4 5 Inner loop unrolling

  44. Multisample Example - Step 5 Explicit scalarization for (j = 0; j < N; j += 4) { acc0 = 0;acc1 = 0;acc2 = 0;acc3 = 0; for (i = 0; i < T; i += 4) { acc0 = L_mac(acc0, x[i+0], c[j+0+i]); acc1 = L_mac(acc1, x[i+0], c[j+1+i]); acc2 = L_mac(acc2, x[i+0], c[j+2+i]); acc3 = L_mac(acc3, x[i+0], c[j+3+i]); acc0 = L_mac(acc0, x[i+1], c[j+1+i]); acc1 = L_mac(acc1, x[i+1], c[j+2+i]); acc2 = L_mac(acc2, x[i+1], c[j+3+i]); acc3 = L_mac(acc3, x[i+1], c[j+4+i]); /* third loop body for x[i+2] */ /* fourth loop body for x[i+3] */ } res[j+0] = acc0; res[j+1] = acc1; res[j+2] = acc2; res[j+3] = acc3; } 5 for (j = 0; j < N; j += 4) { acc0=acc1=acc2=acc3=0; xx = x[0]; c0 = c[j+0]; c1 = c[j+1]; c2 = c[j+2]; c3 = c[j+3]; for (i = 0; i < T; i += 4) { acc0 = L_mac(acc0, xx, c0); acc1 = L_mac(acc1, xx, c1); acc2 = L_mac(acc2, xx, c2); acc3 = L_mac(acc3, xx, c3); xx = x[i+1]; c0 = c[j+4+i]; acc0 = L_mac(acc0, xx, c1); acc1 = L_mac(acc1, xx, c2); acc2 = L_mac(acc2, xx, c3); acc3 = L_mac(acc3, xx, c0); xx = x[i+2]; c0 = c[j+5+i]; /* similar third and fourth loop bodies */ } res[j+0] = acc0; res[j+1] = acc1; res[j+2] = acc2; res[j+3] = acc3; } 6

  45. Multisample Example - Final Result for (j = 0; j < N; j += 4) { acc0=acc1=acc2=acc3=0; xx = x[0]; c0 = c[j+0]; c1 = c[j+1]; c2 = c[j+2]; c3 = c[j+3]; for (i = 0; i < T; i += 4) { acc0 = L_mac(acc0, xx, c0); acc1 = L_mac(acc1, xx, c1); acc2 = L_mac(acc2, xx, c2); acc3 = L_mac(acc3, xx, c3); xx = x[i+1]; c0 = c[j+4+i]; acc0 = L_mac(acc0, xx, c1); acc1 = L_mac(acc1, xx, c2); acc2 = L_mac(acc2, xx, c3); acc3 = L_mac(acc3, xx, c0); xx = x[i+2]; c0 = c[j+5+i]; /* similar third and fourth loop bodies */ } res[j+0] = acc0; res[j+1] = acc1; res[j+2] = acc2; res[j+3] = acc3; } each unrolled inner-loop step executes in one cycle (4 MACs + 2 moves)

  46. Search Techniques Adapted for StarCore SC140 • finding the maximum • finding the maximum and its position • finding the maximum ratio • finding the position of the maximum ratio

  47. Maximum search #include <prototype.h> Word16 max_search(Word16 vector[], unsigned short int length ) { #pragma align *vector 8 signed short int i = 0; Word16 max0, max1, max2, max3; max0 = vector[i+0]; max1 = vector[i+1]; max2 = vector[i+2]; max3 = vector[i+3]; for(i=4; i<length; i+=4) { max0 = max(max0, vector[i+0]); max1 = max(max1, vector[i+1]); max2 = max(max2, vector[i+2]); max3 = max(max3, vector[i+3]); } max0 = max(max0, max1); max1 = max(max2, max3); return max(max0, max1); } • reduction operation => split maximum

  48. Maximum search - Comments • presented solution works for both 16-bit and 32-bit values • better solution for 16-bit (asm only) • based on max2 instruction (SIMD style) • fetches two 16-bit values as a 32-bit one • eight maxima per cycle

  49. Maximum Search – ASM with max2 move.2l (r0)+n0,d4:d5 move.2l (r1)+n0,d6:d7 move.2l (r0)+n0,d0:d1 move.2l (r1)+n0,d2:d3 FALIGN LOOPSTART3 [ max2 d0,d4 max2 d1,d5 max2 d2,d6 max2 d3,d7 move.2l (r0)+n0,d0:d1 move.2l (r1)+n0,d2:d3 ] LOOPEND3 [ max2 d0,d4 max2 d1,d5 max2 d2,d6 max2 d3,d7 ]

  50. Maximum position • based on comparisons • around N cycles • in C, two maxima required • based on maximum search • around N/2 cycles • based on max2-based maximum search • around N/4 cycles • care must be taken to preserve the original semantics
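The split maximum search of slide 47 can be extended to track positions as well. The sketch below is an illustrative C version, not from the deck: strict `>` comparisons keep the first occurrence within each stride-4 lane, and a tie-break on the smaller index at recombination preserves the sequential first-occurrence semantics the last bullet warns about.

```c
typedef short Word16;

/* Find the maximum and the index of its first occurrence.
   length is assumed to be a multiple of 4, as in the split
   maximum search; names are illustrative. */
int max_pos(const Word16 v[], int length, Word16 *max_out)
{
    Word16 m[4];
    int    p[4];
    int    i, k, best;

    for (k = 0; k < 4; k++) { m[k] = v[k]; p[k] = k; }

    /* Four independent search lanes, one per stride-4 sub-sequence;
       strict '>' keeps the first occurrence within each lane. */
    for (i = 4; i < length; i += 4)
        for (k = 0; k < 4; k++)
            if (v[i + k] > m[k]) { m[k] = v[i + k]; p[k] = i + k; }

    /* Recombine: larger value wins; on a tie the smaller index wins,
       matching what a plain sequential scan would return. */
    best = 0;
    for (k = 1; k < 4; k++)
        if (m[k] > m[best] || (m[k] == m[best] && p[k] < p[best]))
            best = k;

    *max_out = m[best];
    return p[best];
}
```

The four lane updates are independent, so the compiler can schedule them as parallel DALU max operations, which is where the roughly N/2-cycle figure above comes from.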
