1 / 51

Machine-Independent Optimization

Machine-Independent Optimization. Outline. Machine-Independent Optimization Code motion Memory optimization Optimizing Blockers Memory alias Side effect in function call Suggested reading 5.3 , 5.2 , 5.4 ~ 5.6, 5.1. Motivation. Constant factors matter too!

morrie
Télécharger la présentation

Machine-Independent Optimization

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Machine-Independent Optimization

  2. Outline • Machine-Independent Optimization • Code motion • Memory optimization • Optimizing Blockers • Memory alias • Side effect in function call • Suggested reading • 5.3,5.2,5.4 ~ 5.6, 5.1

  3. Motivation • Constant factors matter too! • easily see 10:1 performance range depending on how code is written • must optimize at multiple levels • algorithm, data representations, procedures, and loops

  4. Motivation • Must understand system to optimize performance • how programs are compiled and executed • how to measure program performance and identify bottlenecks • how to improve performance without destroying code modularity and generality

  5. length 0 1 2 length–1 data    Vector ADT typedef struct { int len ; data_t *data ; } vec_rec, *vec_ptr ; typedef int data_t ;

  6. Procedures • vec_ptr new_vec(int len) • Create vector of specified length • data_t *get_vec_start(vec_ptr v) • Return pointer to start of vector data

  7. Procedures • int get_vec_element(vec_ptr v, int index, int *dest) • Retrieve vector element, store at *dest • Return 0 if out of bounds, 1 if successful • Similar to array implementations in Pascal, Java • E.g., always do bounds checking

  8. Vector ADT vec_ptr new_vec(int len) { /* allocate header structure */ vec_ptr result = (vec_ptr) malloc(sizeof(vec_rec)) ; if ( !result ) return NULL ; result->len = len ;

  9. Vector ADT /* allocate array */ if ( len > 0 ) { data_t *data = (data_t *)calloc(len, sizeof(data_t)) ; if ( !data ) { free( (void *)result ) ; return NULL ; /* couldn’t allocte stroage */ } result->data = data } else result->data = NULL return result ; }

  10. Vector ADT /* * Retrieve vector element and store at dest. * Return 0 (out of bounds) or 1 (successful) */ int get_vec_element(vec_ptr v, int index, data_t *dest) { if ( index < 0 || index >= v->len) return 0 ; *dest = v->data[index] ; return 1; }

  11. Vector ADT /* Return length of vector */ int vec_length(vec_ptr) { return v->len ; } /* Return pointer to start of vector data */ data_t *get_vec_start(vec_ptr v) { return v->data ; }

  12. Optimization Example #ifdef ADD #define IDENT 0 #define OP + #else #define IDENT 1 #define OP * #endif

  13. Optimization Example void combine1(vec_ptr v, data_t *dest) { long int i; *dest = IDENT; for (i = 0; i < vec_length(v); i++) { data_t val; get_vec_element(v, i, &val); *dest = *dest OP val; } }

  14. Optimization Example • Procedure • Compute sum (product) of all elements of vector • Store result at destination location

  15. Time Scales • Absolute Time • Typically use nanoseconds • 10–9 seconds • Time scale of computer instructions

  16. Time Scales • Clock Cycles • Most computers controlled by high frequency clock signal • Typical Range • 100 MHz • 108 cycles per second • Clock period = 10ns • 2 GHz • 2 X 109 cycles per second • Clock period = 0.5ns

  17. CPE 1 void psum1(float a[], float p[], long int n) 2 { 3 long int i; 4 5 p[0] = a[0] ; 6 for (i = 0; i < n; i++) 7 p[i] = p[i-1] + a[i]; 8 } 9

  18. CPE 10 void psum2(float a[], float p[]; long int n) 11 { 12 long inti; 13 p[0] = a[0] ; 14 for (i = 1; i < n-1; i+=2) { 15 float mid_val = p[i-1] + a[i] ; 16 p[i] = mid_val ; 17 p[i+1] = mid_val + a[i+1]; 18 } • /* For odd n, finish remaining element */ • if ( i < n ) • p[i] = p[i-1] + a[i] ; • }

  19. Cycles Per Element • Convenient way to express performance of program that operators on vectors or lists • Length = n • T = CPE*n + Overhead

  20. Cycles Per Element

  21. Time Scales

  22. Understanding Loop void combine1(vec_ptr v, data_t *dest) { long int i; *dest = IDENT; for (i = 0; i < vec_length(v); i++) { data_t val; get_vec_element(v, i, &val); *dest = *dest OP val; } }

  23. 1 iteration Understanding Loop void combine1-goto(vec_ptr v, data_t *dest) { long int i = 0; data_t val; *dest = 0; if (i >= vec_length(v)) goto done; loop: get_vec_element(v, i, &val); *dest += val; i++; if (i < vec_length(v)) goto loop done: }

  24. Inefficiency • Procedure vec_length called every iteration • Even though result always the same

  25. Code Motion void combine2(vec_ptr v, data_t *dest) { long int i; long int length = vec_length(v); *dest = IDENT; for (i = 0; i < length; i++) { data_t val; get_vec_element(v, i, &val); *dest = *dest OP val; } }

  26. Code Motion • Optimization • Move call to vec_length out of inner loop • Value does not change from one iteration to next • Code motion

  27. Code Motion 1 /* Convert string to lowercase: slow */ 2 void lower1(char *s) 3 { 4 int i; 5 6 for (i = 0; i < strlen(s); i++) 7 if (s[i] >= ’A’ && s[i] <= ’Z’) 8 s[i] -= (’A’ - ’a’); 9 } 10

  28. Code Motion 11 /* Convert string to lowercase: faster */ 12 void lower2(char *s) 13 { 14 int i; 15 int len = strlen(s); 16 17 for (i = 0; i < len; i++) 18 if (s[i] >= ’A’ && s[i] <= ’Z’) 19 s[i] -= (’A’ - ’a’); 20 } 21

  29. Code Motion 22 /* Sample implementation of library function strlen */ 23 /* Compute length of string */ 24 size_t strlen(const char *s) 25 { 26 int length = 0; 27 while (*s != ’\0’) { 28 s++; 29 length++; 30 } 31 return length; 32 }

  30. Code Motion

  31. Reduction in Strength void combine3(vec_ptr v, data_t *dest) { long int i; long int length = vec_length(v); data_t *data = get_vec_start(v); *dest = IDENT; for ( i = 0 ; i < length ; i++ ) { *dest = *dest OP data[i] ; }

  32. Reduction in Strength • Optimization • Avoid procedure call to retrieve each vector element • Get pointer to start of array before loop • Within loop just do pointer reference • Not as clean in terms of data abstraction

  33. Eliminate Unneeded Memory References combine3: data_t = float, OP = * i in %rdx, data in %rax, dest in %rbp 1 .L498: loop: 2 movss (%rbp), %xmm0 Read product from dest 3 mulss (%rax,%rdx,4), %xmm0 Multiply product by data[i] 4 movss %xmm0, (%rbp) Store product at dest 5 addq $1, %rdx Increment i 6 cmpq %rdx, %r12 Compare i:limit 7 jg .L498 If >, goto loop

  34. Eliminate Unneeded Memory References void combine4(vec_ptr v, data_t *dest) { long int i; long int length = vec_length(v); data_t *data = get_vec_start(v); data_t acc = IDENT; for (i = 0; i < length; i++) acc = acc OP data[i]; *dest = acc; }

  35. Eliminate Unneeded Memory References combine4: data_t = float, OP = * i in %rdx, data in %rax, limit in %rbp, acc in %xmm0 1 .L488: loop: 2 mulss (%rax,%rdx,4), %xmm0 Multiply acc by data[i] 3 addq $1, %rdx Increment i 4 cmpq %rdx, %rbp Compare limit:i 5 jg .L488 If >, goto loop

  36. Eliminate Unneeded Memory References • Optimization • Don’t need to store in destination until end • Local variable sum held in register • Avoids 1 memory read, 1 memory write per cycle

  37. Machine Independent Opt. Results • Optimizations • Reduce function calls and memory references within loop

  38. Optimizing Compilers • Provide efficient mapping of program to machine • register allocation • code selection and ordering • eliminating minor inefficiencies

  39. Optimizing Compilers • Don’t (usually) improve asymptotic efficiency • up to programmer to select best overall algorithm • big-O savings are (often) more important than constant factors • but constant factors also matter • Have difficulty overcoming “optimization blockers” • potential memory aliasing • potential procedure side-effects

  40. Optimization Blockers  Memory aliasing void twiddle1(int *xp, int *yp) { *xp += *yp ; *xp += *yp ; } void twiddle2(int *xp, int *yp) { *xp += 2* *yp ; }

  41. Optimization Blockers  Function call and side effect int f(int) ; int func1(x) { return f(x)+f(x)+f(x)+f(x) ; } int func2(x) { return 4*f(x) ; }

  42. Optimization Blockers  Function call and side effect int counter = 0 ; int f(int x) { return counter++ ; }

  43. Optimization Blocker: Memory Aliasing • Aliasing • Two different memory references specify single location • Example • v: [2, 3, 5] • combine3(v, get_vec_start(v)+2) --> ? • combine4(v, get_vec_start(v)+2) --> ?

  44. Optimization Blocker: Memory Aliasing • Observations • Easy to have happen in C • Since allowed to do address arithmetic • Direct access to storage structures • Get in habit of introducing local variables • Accumulating within loops • Your way of telling compiler not to check for aliasing

  45. Limitations of Optimizing Compilers • Operate Under Fundamental Constraint • Must not cause any change in program behavior under any possible condition • Often prevents it from making optimizations when would only affect behavior under pathological conditions.

  46. Limitations of Optimizing Compilers • Behavior that may be obvious to the programmer can be obfuscated by languages and coding styles • e.g., data ranges may be more limited than variable types suggest • e.g., using an “int” in C for what could be an enumerated type

  47. Limitations of Optimizing Compilers • Most analysis is performed only within procedures • whole-program analysis is too expensive in most cases • Most analysis is based only on static information • compiler has difficulty anticipating run-time inputs • When in doubt, the compiler must be conservative

  48. Example void combine1(vec_ptr v, data_t *dest) { long int i; *dest = IDENT; for (i = 0; i < vec_length(v); i++) { data_t val; get_vec_element(v, i, &val); *dest = *dest OP val; } }

  49. Example void combine2(vec_ptr v, data_t *dest) { long int i; long int length = vec_length(v); *dest = IDENT; for (i = 0; i < length; i++) { data_t val; get_vec_element(v, i, &val); *dest = *dest OP val; } }

  50. Example void combine3(vec_ptr v, data_t *dest) { long int i; long int length = vec_length(v); data_t *data = get_vec_start(v); *dest = IDENT; for (i = 0; i < length; i++) { *dest = *dest OP data[i]; }

More Related