Optimizing VORPAL

Viktor Przebinda, J. R. Cary and Chet Nieter
Department of Physics, University of Colorado, Boulder 80309-0390

Goal: To produce an optimal, target-independent code while preserving modularity.

Significant performance increase can be achieved through optimization

VORPAL is optimized through the use of three techniques

• Template Metaprogramming
• Inspection of compiler-generated assembly code
• Dependency minimization

Template Metaprogramming provides dimension independence for free

Template metaprogramming allows source code to be compiled for an arbitrary number of dimensions. Once compiled, the executable runs as fast as code written by hand for a fixed number of dimensions.

Strategy: Use recursion to process a given number of dimensions and specify the dimension at compile time, allowing the compiler to generate all code inline.

    Walk<NDIM>(rgn, updater) {
      for (int i=0; i<max; ++i) {
        Walk<NDIM-1>(rgn, updater);   // continuously expands at compile time until NDIM = 1
        updater.bump(NDIM);           // modifies (updates) position information
      }
      updater.bump(NDIM, -ub);
    }

The field iterator (updater) stores the current position in the field.

Template Metaprogramming - continued

Templates are compiled for every possible dimension through explicit instantiations.

Example: NDIM=2

    Walk<2>(rgn, updater) {
      for (int i=0; i<max; ++i) {
        Walk<1>(rgn, updater);        // code block inserted inline at compile time
        updater.bump(2);
      }
      updater.bump(2, -ub);
    }

    Walk<1>(rgn, updater) {
      for (int i=0; i<max; ++i) {
        updater.updateNode();         // instructs the iterator to have its field(s) update themselves
        updater.bump(1);
      }
      updater.bump(1, -ub);
    }
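The recursion on these two slides can be sketched as compilable C++. This is a minimal illustration under stated assumptions, not VORPAL's implementation: the `Counter` updater, the `extent` array, the `walkBox` helper, and the zero-based dimension numbering are all hypothetical names chosen for the example. Function templates cannot be partially specialized, so the sketch wraps the recursion in a class template and ends it with a specialization for zero dimensions.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical updater: tracks the current position and counts node updates.
template <std::size_t NDIM>
struct Counter {
  std::array<int, NDIM> pos{};  // current position, one index per dimension
  long visited = 0;             // number of updateNode() calls so far

  void bump(std::size_t dim, int amount = 1) {
    assert(dim < NDIM);
    pos[dim] += amount;
  }
  void updateNode() { ++visited; }
};

// General case: loop over dimension DIM-1, recursing into the lower ones.
template <std::size_t DIM>
struct Walk {
  template <class Extent, class Updater>
  static void run(const Extent& extent, Updater& u) {
    const int n = extent[DIM - 1];
    for (int i = 0; i < n; ++i) {
      Walk<DIM - 1>::run(extent, u);  // resolved at compile time
      u.bump(DIM - 1);                // step forward along this dimension
    }
    u.bump(DIM - 1, -n);              // rewind to the start of this dimension
  }
};

// Base case: no dimensions left to loop over, so update the current node.
template <>
struct Walk<0> {
  template <class Extent, class Updater>
  static void run(const Extent&, Updater& u) { u.updateNode(); }
};

// Walks a box with the given extents and returns the updater for inspection.
template <std::size_t NDIM>
Counter<NDIM> walkBox(const std::array<int, NDIM>& extent) {
  Counter<NDIM> u;
  Walk<NDIM>::run(extent, u);
  return u;
}
```

Calling `walkBox` with a two-element extent array instantiates `Walk<2>`, `Walk<1>` and `Walk<0>`; the same source handles three dimensions simply by passing a three-element array.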

Source code can be optimized through careful inspection of generated assembly

For a variety of reasons, a compiler will occasionally produce drastically inefficient target code from easily optimizable source. Careful inspection of compiler-generated assembly code can reveal such bottlenecks. Because compilers are generally implemented in similar ways, source modifications that improve the generated assembly for one architecture almost always have the same effect on other architectures.

Many compilers fail to detect when the size() member of a C++ vector is constant

Source code:

    void size_evil(vector<int>& a) {
      for (int i=0; i<a.size(); ++i)
        a[i]=4;
    }

Generated assembly for POWER3*. If we compile this for the POWER3, we get:

    __L80:
        cmpl   5,r5,r0
        stu    r10,4(r4)
        bc     BO_IF_NOT,CR0_LT,__Ld8
        stu    r10,4(r4)
        cmpl   4,r3,r0
        bc     BO_IF_NOT,CR1_FX,__Ld8
        cal    r5,5(r3)
        cmpl   0,r6,r0
        stu    r10,4(r4)
        cal    r3,6(r3)
        bc     BO_IF_NOT,CR6_LT,__Ld8
        cmpl   1,r7,r0
        stu    r10,4(r4)
        cal    r6,1(r3)
        cal    r7,2(r3)
        bc     BO_IF_NOT,CR7_LT,__Ld8
        stu    r10,4(r4)
        cmpl   6,r8,r0
        cal    r8,3(r3)
        bc     BO_IF,CR4_LT,__L70
        bc     BO_IF_NOT,CR5_LT,__Ld8
        stu    r10,4(r4)
    __Ld8:
        l      r12,4(SP)
        mtcrf  8,r12
        bcr    BO_ALWAYS,CR0_LT

The compiler unrolls this loop, but must check the exit condition and possibly branch at each iteration. Why? It does not realize that the size of vector 'a' will not change from one iteration to the next!

*gcc 2.95 generates almost identical code; gcc 3.1, however, unrolls the loop correctly.

This can be corrected by taking the call to the size() member out of the loop

Optimized source code:

    void size_evil(vector<int>& a) {
      int s=a.size();
      for (int i=0; i<s; ++i) {
        a[i]=4;
      }
    }

Generated assembly for POWER3* (unrolled loop):

    __L4c:
        st   r5,4(r4)
        st   r5,8(r4)
        st   r5,12(r4)
        st   r5,16(r4)
        st   r5,20(r4)
        st   r5,24(r4)
        st   r5,28(r4)
        stu  r5,32(r4)
        bc   BO_dCTR_NZERO,CR0_LT,__L4c
        bcr  BO_ALWAYS,CR0_LT

Note: The compiler does not need to insert a corrector loop since it can over-allocate storage.

*gcc 2.95 generates almost identical code.
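The hoisting fix can be written as a small compilable sketch. The function names here are hypothetical, and the `std::fill` variant is an alternative that does not come from the slides: it hands both ends of the range to the library once, so no bound is re-read inside the loop. Whether a particular compiler generates better code for either form is something to confirm in its assembly output.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Loop bound read once into a local, as on the slide. Using std::size_t
// avoids the signed/unsigned comparison of the original.
void fill_hoisted(std::vector<int>& a, int value) {
  const std::size_t n = a.size();
  for (std::size_t i = 0; i < n; ++i) a[i] = value;
}

// Alternative (not from the slides): the algorithm receives the range once.
void fill_algorithm(std::vector<int>& a, int value) {
  std::fill(a.begin(), a.end(), value);
}

// Self-check: both variants must produce the same contents for any length.
void check_variants(std::size_t n) {
  std::vector<int> u(n, 0), v(n, 0);
  fill_hoisted(u, 4);
  fill_algorithm(v, 4);
  assert(u == v);
}
```

Lengths that are not a multiple of the unroll factor, such as 11, are the useful ones to check, since they exercise whatever remainder handling the compiler emits.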

Calls to standard library routines often cannot be inlined

Computations of square root and sine from the standard math library on the IBM POWER3:

    float f(float x) { return sqrt(x); }

    .s__Ff:
        mfspr  r0,LR
        frsp   fp1,fp1
        stu    SP,-64(SP)
        st     r0,72(SP)
        bl     ._sqrt{PR}
        oril   r0,r0,0x0000
        frsp   fp1,fp1
        l      r12,72(SP)
        cal    SP,64(SP)
        mtspr  LR,r12
        bcr    BO_ALWAYS,CR0_LT

    float retsin(float x) { return sin(x); }

    .retsin__Ff:
        mfspr  r0,LR
        frsp   fp1,fp1
        stu    SP,-64(SP)
        st     r0,72(SP)
        bl     .sin{PR}
        oril   r0,r0,0x0000
        frsp   fp1,fp1
        l      r12,72(SP)
        cal    SP,64(SP)
        mtspr  LR,r12
        bcr    BO_ALWAYS,CR0_LT

The sqrt call cannot be inlined. The compiler must insert a branch to a library-provided implementation. The x86 floating point unit provides a square root instruction that is encoded as a macro in the standard math library. Computation of sine from the standard C library results in a branch on both x86 and the POWER3: the POWER3 provides no sine instruction, and on x86 the compiler calls the library routine rather than emitting the x87 fsin instruction.

Typecast from float to int requires significant overhead

A typecast from an IEEE float to a fixed point representation seems like a trivial operation, but on some architectures it can be extraordinarily costly.

Typecast from float to int on x86 using gcc 3.1:

    int ftoi(float x) {return (int)x;}

        flds    8(%ebp)
        fnstcw  -2(%ebp)        # load current configuration
        movw    -2(%ebp),%dx
        orw     $3072,%dx
        movw    %dx,-4(%ebp)
        fldcw   -4(%ebp)        # write new configuration
        fistpl  -8(%ebp)        # perform cast
        movl    -8(%ebp),%eax
        fldcw   -2(%ebp)        # restore original configuration
        movl    %ebp,%esp
        popl    %ebp
        ret

Before performing a cast, the compiler must reset the floating point unit's configuration settings (here, forcing the rounding mode to truncation), a very expensive operation, to ensure consistent results.

There is no good solution to this problem. One possible hack is to write an inline assembly macro that performs the cast without resetting the configuration; however, this does not ensure consistent results.

Thanks to Kevin Bowers for pointing this out.
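The consistency caveat can be seen without inline assembly. The sketch below, which is not taken from the slides, contrasts a C++ cast with `std::lrint`, a standard library function that converts using whatever rounding mode is currently set instead of forcing truncation. The function names are hypothetical, and whether a given compiler inlines `std::lrint` into a single conversion instruction is an assumption to verify in the generated assembly.

```cpp
#include <cassert>
#include <cmath>

// C/C++ cast: always truncates toward zero, whatever the rounding mode is.
int cast_truncate(float x) { return static_cast<int>(x); }

// std::lrint: rounds according to the current rounding mode, which is
// round-to-nearest (ties to even) unless the program has changed it.
long cast_current_mode(float x) { return std::lrint(x); }

// Self-check under the default rounding mode: the two conversions disagree
// whenever the fractional part is larger than one half.
void check_rounding() {
  assert(cast_truncate(2.7f) == 2);
  assert(cast_current_mode(2.7f) == 3);
  assert(cast_truncate(-2.7f) == -2);
  assert(cast_current_mode(-2.7f) == -3);
}
```

Code that relies on truncation, such as computing a cell index from a position, would need an explicit adjustment before switching to a conversion of this kind.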

Dependency minimization is essential to achieve optimum efficiency

Modern processors are capable of executing numerous instructions in parallel. For example, the POWER3 processor has

• 2 floating point units
• 3 fixed point units
• 2 load-store units

Most programming languages, however, are designed to express a consecutive sequence of computations. To utilize the full potential of the CPU, source code must be written to minimize dependencies between neighboring instructions in situations where the compiler is unable to do so.

Source can be modified to minimize dependencies

An array element update: each iteration depends on the completion of the previous one, since *r is updated every time.

    void update(float* r, float* a, float* b, int s) {
      int i;
      for (i=0; i<s; ++i)
        *r+=a[i]*b[i];
    }

This can be optimized to take advantage of additional execution units. Each of the two computations in the loop body is dispatched to a separate execution unit.

    void update(float* r, float* a, float* b, int s) {
      int i;
      float x,y;
      x=0; y=0;
      for (i=0; i<s-1; i+=2) {
        x+=a[i]*b[i];
        y+=a[i+1]*b[i+1];
      }
      if (i<s) x+=a[i]*b[i];   // last element when s is odd
      *r+=x+y;
    }

Caution! Numerical instabilities can occur when using this method since the summation is done in a different order, and floating point addition is not associative. This is why the compiler is not able to perform this optimization.
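The two versions can be compared side by side in a compilable sketch. The function names and the use of `std::vector` are assumptions made for the example; the technique is the one on the slide, with a final step that picks up the last element when the length is odd.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Single accumulator: every addition has to wait for the previous one.
float dot_serial(const std::vector<float>& a, const std::vector<float>& b) {
  assert(a.size() == b.size());
  const std::size_t n = a.size();
  float sum = 0.0f;
  for (std::size_t i = 0; i < n; ++i) sum += a[i] * b[i];
  return sum;
}

// Two independent accumulators: the x and y chains do not depend on each
// other, so a processor with two floating point units can overlap them.
float dot_two_chains(const std::vector<float>& a, const std::vector<float>& b) {
  assert(a.size() == b.size());
  const std::size_t n = a.size();
  float x = 0.0f, y = 0.0f;
  std::size_t i = 0;
  for (; i + 1 < n; i += 2) {
    x += a[i] * b[i];
    y += a[i + 1] * b[i + 1];
  }
  if (i < n) x += a[i] * b[i];  // leftover element when n is odd
  return x + y;
}
```

For small integer inputs both functions return the same value. For a = {1e8, 1, -1e8, 1} with b all ones, strict single-precision evaluation gives 1 from the serial version and 2 from the two-chain version, which is the reordering effect the caution refers to; hardware that keeps intermediates in higher precision may behave differently.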

Future Plans - Dynamic Load Balancing

VORPAL will soon be capable of dynamically adjusting its decomposition at runtime, reducing the size of units that take more time to process and increasing the size of those that take less.

Example with two processors, CPU0 and CPU1: CPU0 has an abnormally large number of particles to process compared with CPU1 and is falling behind. At the end of each time step, the domain decomposition is adjusted to balance the load.
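One way such an adjustment could work is sketched below. This is purely illustrative and is not VORPAL's planned algorithm: it assumes a one-dimensional decomposition into contiguous columns, a hypothetical per-column cost estimate (for example, the particle count), and a hypothetical `balance` function that hands each processor columns until its cumulative share of the total cost is reached.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns, for each processor, the index one past its last column.
// cost[c] is a hypothetical work estimate for column c.
std::vector<std::size_t> balance(const std::vector<double>& cost,
                                 std::size_t nproc) {
  assert(nproc > 0);
  double total = 0.0;
  for (double c : cost) total += c;

  // The last processor always ends at the final column.
  std::vector<std::size_t> end(nproc, cost.size());
  double running = 0.0;  // cost assigned so far
  std::size_t col = 0;   // next unassigned column
  for (std::size_t p = 0; p + 1 < nproc; ++p) {
    const double target =
        total * static_cast<double>(p + 1) / static_cast<double>(nproc);
    // Take columns while the cumulative cost stays within the target.
    while (col < cost.size() && running + cost[col] <= target) {
      running += cost[col];
      ++col;
    }
    end[p] = col;
  }
  return end;
}
```

With costs {6, 1, 1, 1, 1, 1, 1} and two processors, the first processor keeps only the expensive first column and the second takes the remaining six, so each ends up with a cost of 6. A single column that exceeds a processor's target leaves that processor with no columns in this sketch; a real scheme would need to handle that case and the cost of moving data.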

Summary

Significant optimization can be achieved through:

• Template Metaprogramming to retain a high degree of flexibility with no performance cost.
• Assembly inspection and source modification to locate and correct contexts where the compiler generates unusually inefficient code.
• Dependency minimization to maximize CPU utilization.
