This guide outlines crucial strategies for optimizing code performance and memory usage in software development. It discusses the importance of selecting suitable algorithms and data structures while avoiding optimization pitfalls. We cover low-level optimization techniques such as loop unrolling and increasing instruction-level parallelism, as well as performance tuning through profiling and identifying bottlenecks. Practical examples illustrate methods for enhancing load and store operations, particularly in relation to n-gram analysis in textual data. Explore efficient coding principles designed to minimize overhead and maximize execution speed.
Outline
• Memory Performance
• Tuning Performance
• Suggested reading
  • 5.12 ~ 5.14
Load Performance
• The load unit can initiate only one load operation every clock cycle (Issue = 1.0)
• In list_len each load of ls->next needs the result of the previous load, so the CPE equals the load latency

    typedef struct ELE {
        struct ELE *next;
        int data;
    } list_ele, *list_ptr;

    int list_len(list_ptr ls)
    {
        int len = 0;
        while (ls) {
            len++;
            ls = ls->next;
        }
        return len;
    }

    # len in %eax, ls in %rdi
    .L11:
        addl  $1, %eax
        movq  (%rdi), %rdi
        testq %rdi, %rdi
        jne   .L11

    Function      CPE
    list_len      4.0
    load latency  4.0
Store Performance
• The store unit can initiate only one store operation every clock cycle (Issue = 1.0)

    void array_clear(int *dest, int n)
    {
        int i;
        for (i = 0; i < n; i++)
            dest[i] = 0;
    }

    Function     CPE
    array_clear  2.0
Store Performance
• The store unit can initiate only one store operation every clock cycle (Issue = 1.0)

    void array_clear_4(int *dest, int n)
    {
        int i;
        int limit = n - 3;
        /* Clear four elements per iteration */
        for (i = 0; i < limit; i += 4) {
            dest[i] = 0;
            dest[i+1] = 0;
            dest[i+2] = 0;
            dest[i+3] = 0;
        }
        /* Clear any remaining elements */
        for ( ; i < n; i++)
            dest[i] = 0;
    }

    Function       CPE
    array_clear_4  1.0
Store Performance

    void write_read(int *src, int *dest, int n)
    {
        int cnt = n;
        int val = 0;
        while (cnt--) {
            *dest = val;
            val = (*src) + 1;
        }
    }

Example A: write_read(&a[0], &a[1], 3)

            initial  iter1  iter2  iter3
    cnt     3        2      1      0
    a[0]    -10      -10    -10    -10
    a[1]    17       0      -9     -9
    val     0        -9     -9     -9

Example B: write_read(&a[0], &a[0], 3)

            initial  iter1  iter2  iter3
    cnt     3        2      1      0
    a[0]    -10      0      1      2
    a[1]    17       17     17     17
    val     0        1      2      3

    Function   CPE
    Example A  2.0
    Example B  6.0
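The two traces can be checked by running the loop. The sketch below repeats the write_read loop unchanged but returns the final val; the return value and the name write_read_val are additions for illustration, not part of the slides.

```c
/* Same loop as write_read, but returns the final val so the
   traces can be checked (the return value is an addition). */
int write_read_val(int *src, int *dest, int n)
{
    int cnt = n;
    int val = 0;

    while (cnt--) {
        *dest = val;        /* store to *dest */
        val = (*src) + 1;   /* load from *src, then add */
    }
    return val;
}
```

With a = {-10, 17}, calling write_read_val(&a[0], &a[1], 3) leaves a[1] at -9 and returns -9, while write_read_val(&a[0], &a[0], 3) leaves a[0] at 2 and returns 3. In the second call every load reads the value written by the store just before it, which is the dependency behind the higher CPE of Example B.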
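Load and Store Units
[Figure: the load unit and the store unit. The store unit contains a store buffer of address/data entries; the address of each load is matched against the addresses in the store buffer. Both units exchange addresses and data with the data cache.]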
Graphical Representation

    // inner loop
    while (cnt--) {
        *dest = val;
        val = (*src) + 1;
    }

    Instruction         Operations
    movl %eax,(%ecx)    s_addr, s_data
    movl (%ebx),%eax    load
    addl $1,%eax        add
    subl $1,%edx        sub
    jne loop            jne

[Figure: data-flow graph of these operations, with registers %eax, %ebx, %ecx, %edx at the top and bottom.]
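Graphical Representation
[Figure: the data-flow graph of the inner loop (operations s_addr, s_data, load, add, sub, jg over registers %eax, %ebx, %ecx, %edx), shown next to a reduced graph that keeps only the operations updating %eax and %edx. Arcs labeled 1, 2, and 3 mark the dependencies.]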
Graphical Representation
[Figure: data-flow graphs for Example A and Example B with the critical path marked. Operations shown: s_data, load, mul, add, sub.]

    Function   CPE
    Example A  2.0
    Example B  6.0
Getting High Performance
• High-level design
  • Choose appropriate algorithms and data structures for the problem at hand
  • Be especially vigilant to avoid algorithms or coding techniques that yield asymptotically poor performance
Getting High Performance
• Basic coding principles
  • Avoid optimization blockers so that a compiler can generate efficient code
    • Eliminate excessive function calls
    • Move computations out of loops when possible
    • Consider selective compromises of program modularity to gain greater efficiency
  • Eliminate unnecessary memory references
    • Introduce temporary variables to hold intermediate results
    • Store a result in an array or global variable only when the final value has been computed
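As a rough illustration of the last principle, the sketch below sums an array in two ways. The function names sum_direct and sum_local are hypothetical and chosen for this example only.

```c
#include <stddef.h>

/* Accumulates directly in *dest: each iteration reads and writes memory. */
void sum_direct(const int *a, size_t n, int *dest)
{
    *dest = 0;
    for (size_t i = 0; i < n; i++)
        *dest += a[i];
}

/* Accumulates in a local temporary and stores the result once at the end. */
void sum_local(const int *a, size_t n, int *dest)
{
    int acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += a[i];
    *dest = acc;
}
```

The two versions give the same result when dest does not point into a. If dest could alias an element of a they may differ, which is why a compiler generally cannot make this transformation on its own.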
Getting High Performance
• Low-level optimizations
  • Unroll loops to reduce overhead and to enable further optimizations
  • Find ways to increase instruction-level parallelism by techniques such as multiple accumulators and reassociation
  • Rewrite conditional operations in a functional style to enable compilation via conditional data transfers
  • Write cache-friendly code
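A minimal sketch of unrolling combined with multiple accumulators is shown below. The function name sum_2x2 and the choice of summing long values are assumptions made for illustration.

```c
#include <stddef.h>

/* Sum an array with 2x unrolling and two independent accumulators,
   so that the two additions in the loop body do not depend on each other. */
long sum_2x2(const long *a, size_t n)
{
    long acc0 = 0;
    long acc1 = 0;
    size_t i;

    /* Main loop: two elements per iteration */
    for (i = 0; i + 1 < n; i += 2) {
        acc0 += a[i];
        acc1 += a[i + 1];
    }
    /* Finish any remaining element */
    for (; i < n; i++)
        acc0 += a[i];

    return acc0 + acc1;
}
```

Whether this runs faster than a simple loop depends on the processor and the compiler, so it is worth measuring before adopting it.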
Performance Tuning
• Identify the hottest part of the program
• Use profiling, a very useful method
  • Instrument the program
  • Run it with typical input data
  • Collect information from the run
  • Analyze the result
Examples

    unix> gcc -O1 -pg prog.c -o prog
    unix> ./prog file.txt
    unix> gprof prog

      %    cumulative    self                self     total
     time    seconds    seconds     calls   s/call   s/call  name
    97.58     173.05     173.05         1   173.05   173.05  sort_words
     2.36     177.24       4.19    965027     0.00     0.00  find_ele_rec
     0.12     177.46       0.22  12511031     0.00     0.00  Strlen
Principle
• Interval counting
  • Maintain a counter for each function, recording the time spent executing it
  • The program is interrupted at regular intervals (1 ms)
  • At each interrupt, check which function is executing
  • Increment the counter for that function
• The calling information is quite reliable
• By default, the timings for library functions are not shown
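Interval counting can be imitated with a small simulation. In the sketch below the sample array stands in for the timer interrupts; NFUNCS, tally_samples, and estimated_ms are hypothetical names, and a real profiler gathers its samples from the running program rather than from an array.

```c
#include <stddef.h>

#define NFUNCS 3   /* hypothetical number of functions being tracked */

/* samples[i] is the index of the function that was executing at the
   i-th timer interrupt. After the call, counters[f] holds the number
   of interrupts that landed in function f. */
void tally_samples(const int *samples, size_t nsamples,
                   unsigned counters[NFUNCS])
{
    for (int f = 0; f < NFUNCS; f++)
        counters[f] = 0;

    for (size_t i = 0; i < nsamples; i++) {
        int f = samples[i];
        if (f >= 0 && f < NFUNCS)   /* ignore samples outside the table */
            counters[f]++;
    }
}

/* Estimated time in milliseconds: hits times the sampling interval. */
double estimated_ms(unsigned hits, double interval_ms)
{
    return hits * interval_ms;
}
```

Because a function is charged a full interval whenever a sample lands in it, the estimate is coarse for functions that run only briefly.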
Program Example
• Task: analyze the n-gram statistics of a text document
  • An n-gram is a sequence of n words occurring in a document
• The program
  • Reads a text file
  • Creates a table of unique n-grams, specifying how many times each one occurs
  • Sorts the n-grams in descending order of occurrence
Program Example
• Steps
  • Convert strings to lowercase
  • Apply hash function
  • Read n-grams and insert into hash table
    • Mostly list operations
    • Maintain counter for each unique n-gram
  • Sort results
• Data Set
  • Collected works of Shakespeare
  • 965,028 total words, 23,706 unique
  • N=2, called bigrams
  • 363,039 unique bigrams
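The hashing and counting steps above can be sketched as follows. This is a minimal illustration, not the profiled program: the names entry_t, NBUCKETS, hash_string, and count_ngram, the table size, and the character-sum hash are all assumptions.

```c
#include <stdlib.h>
#include <string.h>

#define NBUCKETS 1021   /* hypothetical table size */

typedef struct entry {
    char *key;            /* the n-gram text, e.g. "to be" */
    int count;            /* occurrences seen so far */
    struct entry *next;   /* next entry in the same bucket */
} entry_t;

/* Simple hash: sum of the character codes modulo the table size. */
static unsigned hash_string(const char *s)
{
    unsigned h = 0;
    while (*s)
        h += (unsigned char)*s++;
    return h % NBUCKETS;
}

/* Look up key in its bucket; bump the count if found, otherwise insert
   a new entry with count 1 at the front of the list.
   Returns the updated count, or -1 if allocation fails. */
int count_ngram(entry_t *table[NBUCKETS], const char *key)
{
    unsigned b = hash_string(key);
    entry_t *e;

    for (e = table[b]; e != NULL; e = e->next)
        if (strcmp(e->key, key) == 0)
            return ++e->count;

    e = malloc(sizeof *e);
    if (e == NULL)
        return -1;
    e->key = malloc(strlen(key) + 1);
    if (e->key == NULL) {
        free(e);
        return -1;
    }
    strcpy(e->key, key);
    e->count = 1;
    e->next = table[b];
    table[b] = e;
    return 1;
}
```

The table must start out with every bucket set to NULL. Each call walks one bucket's list, so the cost of a lookup grows with the length of that list.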
Example

                                   158655725            find_ele_rec [5]
                  4.19   0.02  965027/965027        insert_string [4]
    [5]    2.4    4.19   0.02  965027+158655725     find_ele_rec [5]
                  0.01   0.01  363039/363039        new_ele [10]
                  0.00   0.01  363039/363039        save_string [13]
                                   158655725            find_ele_rec [5]

• Ratio of recursive to top-level calls: 158655725/965027 = 164.4
• Each lookup therefore scans about 164 list elements on average, so the lists in the hash buckets are long
Code Optimizations • First step: Use more efficient sorting function • Library function qsort
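A minimal sketch of sorting with the library function qsort, in descending order of occurrence, is shown below. The ngram_t record and the names by_count_desc and sort_ngrams are hypothetical; the profiled program's own data layout is not shown in the slides.

```c
#include <stdlib.h>

/* Hypothetical record for one n-gram and its count */
typedef struct {
    const char *text;
    int count;
} ngram_t;

/* Comparison function for qsort: larger counts sort first. */
static int by_count_desc(const void *pa, const void *pb)
{
    const ngram_t *a = pa;
    const ngram_t *b = pb;
    return (a->count < b->count) - (a->count > b->count);
}

/* Sort n records in descending order of count. */
void sort_ngrams(ngram_t *v, size_t n)
{
    qsort(v, n, sizeof v[0], by_count_desc);
}
```

The comparison uses two relational tests rather than a subtraction, which avoids overflow for counts of large magnitude.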
Optimizations
• Iter first: use an iterative function to insert elements in the linked list
  • Causes the code to slow down
• Iter last: iterative function that places a new entry at the end of the list
  • Tends to place the most common words at the front of the list
• Big table: increase the number of hash buckets
• Better hash: use a more sophisticated hash function
• Linear lower: move strlen out of the loop
Code Motion

    /* Convert string to lowercase: slow */
    void lower1(char *s)
    {
        int i;

        /* strlen(s) is evaluated again on every iteration */
        for (i = 0; i < strlen(s); i++)
            if (s[i] >= 'A' && s[i] <= 'Z')
                s[i] -= ('A' - 'a');
    }
Code Motion

    /* Convert string to lowercase: faster */
    void lower2(char *s)
    {
        int i;
        int len = strlen(s);   /* length computed once, outside the loop */

        for (i = 0; i < len; i++)
            if (s[i] >= 'A' && s[i] <= 'Z')
                s[i] -= ('A' - 'a');
    }
Code Motion

    /* Sample implementation of library function strlen */
    /* Compute length of string */
    size_t strlen(const char *s)
    {
        int length = 0;
        while (*s != '\0') {
            s++;
            length++;
        }
        return length;
    }
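The cost difference between lower1 and lower2 can be made visible by counting how many characters the length computation examines. The sketch below uses an instrumented copy of the sample strlen; scan_steps, counted_strlen, steps_lower1, and steps_lower2 are hypothetical names introduced for this illustration.

```c
#include <stddef.h>

static unsigned long scan_steps;   /* characters examined by counted_strlen */

/* Same loop as the sample strlen, with a step counter added. */
static size_t counted_strlen(const char *s)
{
    size_t length = 0;
    while (*s != '\0') {
        s++;
        length++;
        scan_steps++;
    }
    return length;
}

/* lower1 pattern: the length is recomputed in every loop test. */
unsigned long steps_lower1(char *s)
{
    scan_steps = 0;
    for (size_t i = 0; i < counted_strlen(s); i++)
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] -= ('A' - 'a');
    return scan_steps;
}

/* lower2 pattern: the length is computed once before the loop. */
unsigned long steps_lower2(char *s)
{
    scan_steps = 0;
    size_t len = counted_strlen(s);
    for (size_t i = 0; i < len; i++)
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] -= ('A' - 'a');
    return scan_steps;
}
```

For a string of n characters the lower1 pattern examines n(n+1) characters in total, while the lower2 pattern examines n, so the gap widens quadratically as strings get longer.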
Performance Tuning
• Benefits
  • Helps identify performance bottlenecks
  • Especially useful for a complex system with many components
• Limitations
  • Only shows performance for the data tested
    • E.g., linear lower did not show a big gain, since words are short
    • Quadratic inefficiency could remain lurking in the code
  • Timing mechanism is fairly crude
    • Only works for programs that run for > 3 seconds
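Amdahl's Law
• $\alpha$: fraction of the original run time taken by the part being improved
• $k$: factor by which that part is sped up

$T_{new} = (1-\alpha)\,T_{old} + (\alpha\,T_{old})/k = T_{old}\,[(1-\alpha) + \alpha/k]$

$S = T_{old}/T_{new} = \dfrac{1}{(1-\alpha) + \alpha/k}$

$S_{\infty} = \dfrac{1}{1-\alpha}$ (the limit as $k \to \infty$)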
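The speedup formula can be evaluated with a small helper. The function name amdahl_speedup is hypothetical; alpha is the fraction of the original time taken by the improved part and k is the factor by which that part is sped up.

```c
/* Overall speedup when a fraction alpha of the run time
   is made k times faster: S = 1 / ((1 - alpha) + alpha / k). */
double amdahl_speedup(double alpha, double k)
{
    return 1.0 / ((1.0 - alpha) + alpha / k);
}
```

For example, with alpha = 0.6 and k = 3 the result is 1/(0.4 + 0.2), about 1.67. Even as k grows without bound, the speedup for alpha = 0.6 cannot exceed 1/0.4 = 2.5.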