More Code Optimization

Outline
• Memory Performance
• Tuning Performance
• Suggested reading: 5.12 ~ 5.14

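Load Performance
• The load unit can initiate only one load operation every clock cycle (Issue = 1.0)

    typedef struct ELE {
        struct ELE *next;
        int data;
    } list_ele, *list_ptr;

    /* Count the elements of a linked list */
    int list_len(list_ptr ls)
    {
        int len = 0;
        while (ls) {
            len++;
            ls = ls->next;
        }
        return len;
    }

• Inner loop, with len in %eax and ls in %rdi:

    .L11:
        addl  $1, %eax        # len++
        movq  (%rdi), %rdi    # ls = ls->next
        testq %rdi, %rdi      # test ls
        jne   .L11            # if ls != NULL, goto .L11

    Function    CPE
    list_len    4.0    (load latency 4.0)

• Each load of ls->next needs the address produced by the previous load, so the loads cannot overlap and the CPE equals the load latency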

Store Performance
• The store unit can initiate only one store operation every clock cycle (Issue = 1.0)

    /* Set the elements of an array to 0 */
    void array_clear(int *dest, int n)
    {
        int i;
        for (i = 0; i < n; i++)
            dest[i] = 0;
    }

    Function       CPE
    array_clear    2.0

Store Performance
• The store unit can initiate only one store operation every clock cycle (Issue = 1.0)

    /* Set the elements of an array to 0, unrolled by 4 */
    void array_clear_4(int *dest, int n)
    {
        int i;
        int limit = n - 3;
        for (i = 0; i < limit; i += 4) {
            dest[i] = 0;
            dest[i+1] = 0;
            dest[i+2] = 0;
            dest[i+3] = 0;
        }
        /* Finish any remaining elements */
        for ( ; i < n; i++)
            dest[i] = 0;
    }

    Function         CPE
    array_clear_4    1.0

Store Performance

    /* Write to dest, read from src */
    void write_read(int *src, int *dest, int n)
    {
        int cnt = n;
        int val = 0;

        while (cnt--) {
            *dest = val;
            val = (*src) + 1;
        }
    }

Example A: write_read(&a[0], &a[1], 3)

           initial      iter1       iter2        iter3
    cnt    3            2           1            0
    a      [-10, 17]    [-10, 0]    [-10, -9]    [-10, -9]
    val    0            -9          -9           -9

Example B: write_read(&a[0], &a[0], 3)

           initial      iter1       iter2        iter3
    cnt    3            2           1            0
    a      [-10, 17]    [0, 17]     [1, 17]      [2, 17]
    val    0            1           2            3

    Function     CPE
    Example A    2.0
    Example B    6.0
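The final array contents in the two examples can be checked with a small sketch. The function write_read is the one from the slide; the helper run_example and its dest_index parameter are hypothetical names introduced only for this illustration.

```c
/* Write to dest, read from src (same function as on the slide). */
void write_read(int *src, int *dest, int n)
{
    int cnt = n;
    int val = 0;

    while (cnt--) {
        *dest = val;
        val = (*src) + 1;
    }
}

/* Hypothetical helper: run three iterations on a fresh array {-10, 17},
   always reading from a[0] and writing to a[dest_index], then copy the
   final array contents into out[0..1]. */
void run_example(int dest_index, int out[2])
{
    int a[2] = { -10, 17 };

    write_read(&a[0], &a[dest_index], 3);
    out[0] = a[0];
    out[1] = a[1];
}
```

Calling run_example(1, out) corresponds to Example A and run_example(0, out) to Example B. Only in Example B does each load read the location that the preceding store just wrote.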

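Load and Store Units
[Figure: the store unit and the load unit. The store unit contains a store buffer of address/data entries. The load unit compares its address against the addresses in the store buffer (matching addresses). Both units exchange addresses and data with the data cache.]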

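Graphical Representation

    //inner-loop
    while (cnt--) {
        *dest = val;
        val = (*src) + 1;
    }

    Instruction           Operations
    movl %eax,(%ecx)      s_addr, s_data
    movl (%ebx), %eax     load
    addl $1,%eax          add
    subl $1,%edx          sub
    jne loop              jne

[Figure: data-flow graph of these operations. Registers %eax, %ebx, %ecx, %edx enter at the top and leave at the bottom; the temporary t carries the loaded value from load to add.]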

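Graphical Representation
[Figure: the data-flow graph of the inner loop over %eax, %ebx, %ecx, %edx with three dependencies labeled 1, 2, 3, next to a reduced graph that keeps only s_data, load, add, and sub, which update %eax and %edx.]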

Graphical Representation
[Figure: the reduced data-flow graphs of Example A and Example B repeated over successive iterations, with the critical path marked.]

    Function     CPE
    Example A    2.0
    Example B    6.0

Getting High Performance
• High-level design
    • Choose appropriate algorithms and data structures for the problem at hand
    • Be especially vigilant to avoid algorithms or coding techniques that yield asymptotically poor performance

Getting High Performance
• Basic coding principles
    • Avoid optimization blockers so that a compiler can generate efficient code
        • Eliminate excessive function calls
        • Move computations out of loops when possible
        • Consider selective compromises of program modularity to gain greater efficiency
    • Eliminate unnecessary memory references
        • Introduce temporary variables to hold intermediate results
        • Store a result in an array or global variable only when the final value has been computed
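As a minimal sketch of the last two bullets, the pair of functions below sums an array either directly into memory or into a temporary variable. The names sum_to_memory and sum_to_local are hypothetical and chosen only for this illustration.

```c
/* Updates *dest on every iteration: each step reads and writes memory. */
void sum_to_memory(const int *v, int n, int *dest)
{
    int i;

    *dest = 0;
    for (i = 0; i < n; i++)
        *dest += v[i];
}

/* Accumulates in a local variable and stores the result once at the end. */
void sum_to_local(const int *v, int n, int *dest)
{
    int i;
    int acc = 0;

    for (i = 0; i < n; i++)
        acc += v[i];
    *dest = acc;
}
```

The two versions give the same result as long as dest does not point into v. That possible aliasing is why a compiler generally cannot make this transformation on its own.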

Getting High Performance
• Low-level optimizations
    • Unroll loops to reduce overhead and to enable further optimizations
    • Find ways to increase instruction-level parallelism by techniques such as multiple accumulators and reassociation
    • Rewrite conditional operations in a functional style to enable compilation via conditional data transfers
    • Write cache-friendly code
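A possible sketch of unrolling combined with multiple accumulators is shown below: the loop is unrolled by 2 and the two partial sums are independent, so their additions can overlap. The function name sum_two_acc is hypothetical.

```c
/* Sum n doubles using 2x unrolling and two independent accumulators. */
double sum_two_acc(const double *v, long n)
{
    double acc0 = 0.0;   /* sum of elements at even indices */
    double acc1 = 0.0;   /* sum of elements at odd indices */
    long i;

    /* Process two elements per iteration */
    for (i = 0; i + 1 < n; i += 2) {
        acc0 += v[i];
        acc1 += v[i + 1];
    }
    /* Finish any remaining element */
    for ( ; i < n; i++)
        acc0 += v[i];

    return acc0 + acc1;
}
```

Because floating-point addition is not associative, this version can round differently from a plain left-to-right sum, so the transformation is only appropriate when that difference is acceptable.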

Performance Tuning
• Identify the hottest part of the program
• A very useful method is profiling
    • Instrument the program
    • Run it with typical input data
    • Collect information from the run
    • Analyze the results

Examples

    unix> gcc -O1 -pg prog.c -o prog
    unix> ./prog file.txt
    unix> gprof prog

      %    cumulative    self                  self     total
     time    seconds    seconds      calls    s/call   s/call   name
    97.58     173.05     173.05          1    173.05   173.05   sort_words
     2.36     177.24       4.19     965027      0.00     0.00   find_ele_rec
     0.12     177.46       0.22   12511031      0.00     0.00   Strlen

Principle
• Interval counting
    • Maintain a counter for each function, recording the time spent executing that function
    • The program is interrupted at regular intervals (1 ms)
    • At each interrupt, check which function is executing
    • Increment the counter for that function
• The calling information is quite reliable
• By default, the timings for library functions are not shown
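The interval-counting scheme can be illustrated with a small simulation. This is only a sketch under stated assumptions: a real profiler uses timer interrupts, whereas here a hypothetical array running[] records which function (numbered 0 to NFUNCS-1) is executing during each time unit, and interval_count samples it at a fixed interval.

```c
#define NFUNCS 3   /* hypothetical number of functions in the program */

/* Simulated interval counting.
   running[t] is the index of the function executing during time unit t.
   A "timer interrupt" fires at the end of every `interval` time units and
   charges one tick to whichever function is executing at that moment. */
void interval_count(const int *running, long total, long interval,
                    long counts[NFUNCS])
{
    int f;
    long t;

    for (f = 0; f < NFUNCS; f++)
        counts[f] = 0;

    for (t = interval; t <= total; t += interval)
        counts[running[t - 1]]++;
}
```

Multiplying each counter by the interval gives the estimated time per function. For the timeline {0,0,0,1,1,1,1,2,0,0} sampled every 2 units, function 2 runs for 1 unit but is charged a full interval of 2, which shows why the estimates are coarse for short runs.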

Program Example
• Task
    • Analyze the n-gram statistics of a text document
    • An n-gram is a sequence of n words occurring in a document
• The program
    • Reads a text file
    • Creates a table of unique n-grams, specifying how many times each one occurs
    • Sorts the n-grams in descending order of occurrence

Program Example
• Steps
    • Convert strings to lowercase
    • Apply hash function
    • Read n-grams and insert into hash table
        • Mostly list operations
        • Maintain counter for each unique n-gram
    • Sort results
• Data set
    • Collected works of Shakespeare
    • 965,028 total words, 23,706 unique
    • N = 2, called bigrams
    • 363,039 unique bigrams
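The steps above might be sketched as follows. This is not the program that was profiled: the names ngram, lower, hash, and insert_ngram, the table size NBUCKETS, and the sum-of-characters hash are all assumptions made for illustration.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

#define NBUCKETS 1021   /* hypothetical number of hash buckets */

typedef struct ngram {
    char *text;           /* the n-gram, e.g. "to be" */
    int count;            /* occurrences seen so far */
    struct ngram *next;   /* next entry in the same bucket */
} ngram;

static ngram *table[NBUCKETS];

/* Convert a string to lowercase in place. */
void lower(char *s)
{
    size_t i, len = strlen(s);

    for (i = 0; i < len; i++)
        s[i] = (char)tolower((unsigned char)s[i]);
}

/* Illustrative hash: sum of character codes modulo the table size. */
unsigned hash(const char *s)
{
    unsigned h = 0;

    while (*s != '\0')
        h += (unsigned char)*s++;
    return h % NBUCKETS;
}

/* Look up s in its bucket. If present, increment its counter; otherwise
   add a new entry with count 1 at the front of the bucket's list.
   Returns the updated count, or 0 if memory allocation fails. */
int insert_ngram(const char *s)
{
    unsigned b = hash(s);
    ngram *e;

    for (e = table[b]; e != NULL; e = e->next)
        if (strcmp(e->text, s) == 0)
            return ++e->count;

    e = malloc(sizeof *e);
    if (e == NULL)
        return 0;
    e->text = malloc(strlen(s) + 1);
    if (e->text == NULL) {
        free(e);
        return 0;
    }
    strcpy(e->text, s);
    e->count = 1;
    e->next = table[b];
    table[b] = e;
    return 1;
}
```

With a small table and a weak hash, many n-grams land in the same bucket and each insertion has to walk a long list, which is the kind of cost a profile can expose.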

Example

                                  158655725             find_ele_rec [5]
                  4.19    0.02       965027/965027      insert_string [4]
    [5]     2.4   4.19    0.02       965027+158655725   find_ele_rec [5]
                  0.01    0.01       363039/363039      new_ele [10]
                  0.00    0.01       363039/363039      save_string [13]
                                  158655725             find_ele_rec [5]

• find_ele_rec is called 965027 times by insert_string and calls itself recursively 158655725 times
• Ratio: 158655725/965027 = 164.4
• On average each lookup scans about 164 list elements, so the list in one hash bucket is about 164 elements long

Code Optimizations
• First step: use a more efficient sorting function
    • Library function qsort
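A minimal sketch of sorting with the standard library function qsort, in descending order of occurrence, could look like this. The record type entry, the comparator by_count_desc, and the wrapper sort_entries are hypothetical names, not taken from the profiled program.

```c
#include <stdlib.h>

/* Hypothetical record: one unique n-gram and its occurrence count. */
typedef struct {
    const char *text;
    int count;
} entry;

/* Comparator for qsort: entries with larger counts sort first. */
static int by_count_desc(const void *pa, const void *pb)
{
    const entry *a = pa;
    const entry *b = pb;

    return (a->count < b->count) - (a->count > b->count);
}

/* Sort n entries in descending order of count. */
void sort_entries(entry *v, size_t n)
{
    qsort(v, n, sizeof v[0], by_count_desc);
}
```

The comparator uses two comparisons instead of subtracting the counts, which avoids integer overflow for extreme values.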

Further Optimizations

Optimizations
• Iter first: use an iterative function to insert elements into the linked list
    • Causes the code to slow down
• Iter last: iterative function that places each new entry at the end of the list
    • Tends to keep the most common words at the front of the list
• Big table: increase the number of hash buckets
• Better hash: use a more sophisticated hash function
• Linear lower: move strlen out of the loop
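The "iter last" variant might be sketched as below: one iterative pass either finds the word and increments its counter, or reaches the end of the list and appends a new element there. The type ele and the function insert_last are hypothetical names used only for this illustration.

```c
#include <stdlib.h>
#include <string.h>

typedef struct ele {
    char *word;
    int count;
    struct ele *next;
} ele;

/* Iteratively search the list at *head for word.
   If found, increment its counter; otherwise append a new element at the
   end of the list. Returns the updated count, or 0 if allocation fails. */
int insert_last(ele **head, const char *word)
{
    ele **link = head;   /* points at the pointer to the current element */
    ele *e;

    while (*link != NULL) {
        if (strcmp((*link)->word, word) == 0)
            return ++(*link)->count;
        link = &(*link)->next;
    }

    e = malloc(sizeof *e);
    if (e == NULL)
        return 0;
    e->word = malloc(strlen(word) + 1);
    if (e->word == NULL) {
        free(e);
        return 0;
    }
    strcpy(e->word, word);
    e->count = 1;
    e->next = NULL;
    *link = e;           /* attach at the tail */
    return 1;
}
```

Since words that occur often also tend to occur early in the text, appending at the tail leaves them near the head of the list, where lookups find them quickly.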

Code Motion

    /* Convert string to lowercase: slow */
    void lower1(char *s)
    {
        int i;

        for (i = 0; i < strlen(s); i++)
            if (s[i] >= 'A' && s[i] <= 'Z')
                s[i] -= ('A' - 'a');
    }

Code Motion

    /* Convert string to lowercase: faster */
    void lower2(char *s)
    {
        int i;
        int len = strlen(s);

        for (i = 0; i < len; i++)
            if (s[i] >= 'A' && s[i] <= 'Z')
                s[i] -= ('A' - 'a');
    }

Code Motion

    /* Sample implementation of library function strlen */
    /* Compute length of string */
    size_t strlen(const char *s)
    {
        size_t length = 0;   /* same type as the return value */
        while (*s != '\0') {
            s++;
            length++;
        }
        return length;
    }
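To see why moving strlen out of the loop matters, the following sketch counts how many characters the length computation examines in each version. The names counted_strlen, scans_lower1, scans_lower2, and chars_scanned are hypothetical; the loops otherwise mirror lower1 and lower2.

```c
#include <stddef.h>

static long chars_scanned = 0;   /* characters examined by counted_strlen */

/* Same loop as the sample strlen, instrumented to count its work. */
size_t counted_strlen(const char *s)
{
    size_t length = 0;

    while (*s != '\0') {
        s++;
        length++;
        chars_scanned++;
    }
    return length;
}

/* lower1 style: the length is recomputed on every loop test. */
long scans_lower1(char *s)
{
    size_t i;

    chars_scanned = 0;
    for (i = 0; i < counted_strlen(s); i++)
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] -= ('A' - 'a');
    return chars_scanned;
}

/* lower2 style: the length is computed once before the loop. */
long scans_lower2(char *s)
{
    size_t i, len;

    chars_scanned = 0;
    len = counted_strlen(s);
    for (i = 0; i < len; i++)
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] -= ('A' - 'a');
    return chars_scanned;
}
```

For a string of length n, the first version performs n + 1 length computations of n characters each, n(n + 1) in total, while the second performs a single one of n characters. This is quadratic versus linear growth.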

Performance Tuning
• Benefits
    • Helps identify performance bottlenecks
    • Especially useful for a complex system with many components
• Limitations
    • Only shows performance for the data tested
        • E.g., linear lower did not show a big gain, since words are short
        • Quadratic inefficiency could remain lurking in the code
    • Timing mechanism is fairly crude
        • Only works for programs that run for > 3 seconds

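Amdahl’s Law
• $\alpha$: fraction of the original execution time taken by the part being improved
• $k$: factor by which that part is sped up

    $T_{new} = (1-\alpha)T_{old} + (\alpha T_{old})/k = T_{old}[(1-\alpha) + \alpha/k]$

    $S = T_{old}/T_{new} = \dfrac{1}{(1-\alpha) + \alpha/k}$

• Limit as $k \to \infty$:

    $S_{\infty} = \dfrac{1}{1-\alpha}$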
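As a worked instance of the formula, the sketch below evaluates the speedup for given values of alpha and k. The function names amdahl_speedup and amdahl_limit are hypothetical.

```c
/* Overall speedup S = 1 / ((1 - alpha) + alpha / k), where alpha is the
   fraction of the original time that is improved and k is the factor by
   which that part is sped up (k > 0). */
double amdahl_speedup(double alpha, double k)
{
    return 1.0 / ((1.0 - alpha) + alpha / k);
}

/* Upper bound on the speedup as k grows without limit (alpha < 1). */
double amdahl_limit(double alpha)
{
    return 1.0 / (1.0 - alpha);
}
```

For example, speeding up a part that takes 60% of the time (alpha = 0.6) by a factor of k = 3 gives S = 1/(0.4 + 0.2), about 1.67, and no value of k can push the overall speedup beyond 1/0.4 = 2.5.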
