
Compiler Optimizations for Transaction Processing Workloads on Itanium® Linux Systems


Presentation Transcript


  1. Compiler Optimizations for Transaction Processing Workloads on Itanium® Linux Systems 37th International Symposium on Microarchitecture 2004 Gerolf Hoflehner, Knud Kirkegaard, Rod Skinner, Daniel Lavery, Yong-fong Lee, Wei Li Intel® Compiler Lab. May 29, 2006, SNU IDB Lab., Kisung Kim

  2. Introduction • Describes compiler optimizations that produce a 40% speed-up in OLTP performance • Implemented in the Intel C/C++ Itanium compiler • OLTP workload • A large number of clients • Each updates a small portion of the database through short-running transactions • e.g., banking, airline reservation •  a large instruction and data footprint and high I/O traffic

  3. A Repertoire of Compiler Optimizations for Server Applications • RSE traffic reduction • setjmp()/longjmp() optimization • Linux symbol preemption model • Data layout optimizations • Instruction prefetching

  4. RSE Traffic Reduction • Itanium architecture • 128 integer registers; r32-r127 are stacked • Each procedure has its own variable-size register stack frame • Register Stack • Allocates a single variable-size register stack frame to each procedure

  5. RSE Traffic Reduction • RSE (Register Stack Engine) • Maps register stack frames onto the physical register file and copies register values to and from memory in response to overflow and underflow conditions • alloc instruction • Determines the size of the register stack frame • A second alloc instruction in the same procedure • Does not cause RSE spills • Shrinks the register stack frame when it allocates a smaller frame than the previous alloc instruction

  6. RSE Traffic Reduction • Unoptimized RSE (needs an RSE spill when bar() allocates its frame on top of foo()'s full 90-register frame)
       foo():
         1:  alloc rx = 0, 90, 0
         2:  call bar()
         3:  alloc rz = 0, 90, 0
       bar():
             alloc ry = 0, 50, 0
             ...
             return

  7. RSE Traffic Reduction • Optimized RSE (the alloc inserted at 2 shrinks foo()'s frame before the call, so bar()'s frame fits without spills)
       foo():
         1:  alloc rx = 0, 90, 0
         2:  alloc rz = 0, 30, 0
         3:  call bar()
         4:  alloc rz = 0, 90, 0
       bar():
             alloc ry = 0, 50, 0
             ...
             return

  8. RSE Traffic Reduction • Shrink the register stack • Liveness analysis determines the registers that are unused at the point of a call • If the number of dead registers on top of the register stack exceeds a given threshold, the register stack is reduced by the number of dead registers

  9. Overhead • The compiler does not know whether shrinking the register stack will actually decrease RSE traffic at run time • Extra alloc instructions  scheduling constraints, increase in code size • For an OLTP workload, the empirically found sweet spot was a threshold of 10 registers • alloc instructions are inserted only when at least 10 registers at the top of the register stack are found dead (sketched in C below)
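As a rough illustration of the rule in the last bullet, here is a minimal C sketch of the threshold test; the function and parameter names are hypothetical and only restate the heuristic the slide describes, not the Intel compiler's actual code.

    /* Hypothetical sketch of the shrink heuristic: decide what frame size a
     * second, shrinking alloc should request before a call site. */
    #define SHRINK_THRESHOLD 10            /* empirically found sweet spot for OLTP */

    /* frame_size:  current register stack frame size of the procedure
     * dead_on_top: registers at the top of the frame found dead by liveness analysis */
    static int frame_size_for_call(int frame_size, int dead_on_top)
    {
        if (dead_on_top >= SHRINK_THRESHOLD)
            return frame_size - dead_on_top;   /* worth emitting a shrinking alloc */
        return frame_size;                     /* keep the frame; avoid the extra alloc */
    }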

  10. setjmp()/longjmp() Optimization • Sequences of setjmp()/longjmp() code are a common pattern in database applications (a C sketch of the pattern follows below) • setjmp() • saves the system state in a jmp_buf structure • returns 0 • longjmp() • reinstates the function state from the jmp_buf • makes setjmp() return 1
        [Figure: control-flow graph in which V1 is defined before r = setjmp(), the code branches on r == 0, V1 and V2 are used below the branch, and foo() is called (and may call longjmp())]
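For readers less familiar with the idiom, here is a minimal, self-contained C sketch of the setjmp()/longjmp() error-handling pattern the slide refers to; the names (env, do_transaction) are illustrative only, not taken from any database code.

    #include <setjmp.h>
    #include <stdio.h>

    static jmp_buf env;                  /* system state saved by setjmp() */

    static void do_transaction(void)
    {
        /* ... on an unrecoverable error, unwind straight back to the setjmp() point */
        longjmp(env, 1);                 /* makes the setjmp() call below return 1 */
    }

    int main(void)
    {
        if (setjmp(env) == 0) {          /* direct call: returns 0 */
            do_transaction();            /* normal path */
        } else {                         /* reached via longjmp() */
            puts("transaction aborted, state restored");
        }
        return 0;
    }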

  11. setjmp()/longjmp() Optimization • Limit the floating-point registers available • Server applications don’t need many floating-point operations • Use only the eight scratch FP argument registers • Avoids saving/restoring the preserved floating-point registers in the jmp_buf buffer: 320 bytes

  12. setjmp()/longjmp() Optimization • Cross lifetime • A value that is live at the setjmp() call • Needs special care (see the C sketch below)
        [Figure: the CFG from slide 10 next to its register-allocated form; V1 now lives in r37, which is defined before r = setjmp() and used after the r == 0 branch, so its lifetime crosses the setjmp() call; r37 is then reused for V2 before the call to foo()]
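A source-level view of the same situation, as a hedged sketch (the variable and function names are made up): a value set before setjmp() and still needed afterwards cannot simply sit in a scratch register across the call, because control may re-enter at the setjmp() return point via longjmp().

    #include <setjmp.h>
    #include <stdio.h>

    static jmp_buf env;

    static void may_longjmp(void)
    {
        longjmp(env, 1);        /* unwinds back to the setjmp() below */
    }

    static int example(void)
    {
        /* v1 is live across the setjmp() call below: a "cross lifetime"
           in the slide's sense. It is not modified after setjmp(), so it
           must still hold 42 even when control re-enters via longjmp(),
           which means it cannot stay in a scratch register across the call. */
        int v1 = 42;
        if (setjmp(env) == 0)
            may_longjmp();
        return v1;
    }

    int main(void)
    {
        printf("%d\n", example());   /* prints 42 */
        return 0;
    }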

  13. setjmp()/longjmp() Optimization • Solutions • Dedicate the register for the rest of the procedure • Copy it to a real preserved register (r4-r7) • Spill it to a dedicated memory stack location • Explicitly model the control flow from any function that might call longjmp() to the associated setjmp() call

  14. setjmp()/longjmp() Optimization
        [Figure: the same pair of CFGs with the fix applied; the cross-lifetime value keeps r37 dedicated across the setjmp() call, while V2, which does not cross the call, is allocated to a different register, r38]

  15. setjmp()/longjmp() Optimization • Reduces spill/fill • Reduces memory stack size • Reduces code size • Eliminates spill/fill of callee-preserved integer registers at function entries and exits • Cost: increase in RSE traffic

  16. Linux Preemption Model • Symbol preemption • A symbol is preemptible if, at some time after linkage, the object it refers to may change
        main:
            int g = 10;
            void foo() { printf("main %d\n", g); }
            int main() {
                int i;
                i = bar();
                printf("bar = %d\n", i);
                return 0;
            }
        so ("shareable object"):
            int g = 5;
            void foo() { printf("so %d \n", g); }
            int bar() {
                foo();
                return g;
            }
        Result (symbol preemption):
            main 10
            bar = 10
        Result (no symbol preemption):
            so 5
            bar = 5

  17. Cost of Symbol Preemption • Requires position-independent code • Position-independent code: doesn’t contain any absolute addresses  important for shared libraries • Indirect addressing through the linkage table • Global data • Addressed through the linkage table via gp (global pointer) • Extra level of indirection
        Preemptible data, through the linkage table:
            add r3 = @ltoff(data), gp
            ld8 r2 = [r3]
            ld4 r8 = [r2]
        Non-preemptible data, gp-relative:
            add r2 = @gprel(data), gp
            ld4 r8 = [r2]
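A common source-level way to avoid this extra indirection is to tell the compiler that a global cannot be preempted. The sketch below uses the GCC-style visibility attribute and is my own illustration under that assumption, not an example from the paper; whether a given toolchain honors the attribute must be checked.

    /* A hidden symbol is not exported from the shared object, so references from
     * inside the object cannot be preempted and the compiler may use the direct,
     * gp-relative form shown above instead of the linkage-table indirection.
     * (GCC-style attribute; toolchain support is an assumption here.) */
    static int request_count;                                   /* internal linkage: never preemptible */
    __attribute__((visibility("hidden"))) int module_requests;  /* hidden: binds locally within the object */

    int count_request(void)
    {
        return ++request_count + ++module_requests;
    }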

  18. Other Optimizations • Data layout optimizations • Move strings and constants to a read-only section (illustrated below) • Sort the local data on the memory stack by access frequency and size • Better D-cache utilization • Instruction prefetching • .few/.many completers  control instruction prefetching • Specify how many bundles get prefetched at the branch target
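As a small illustration of the first bullet (my own example, not from the paper): data that is never written can be declared const, which lets the compiler place it in a read-only section.

    #include <stdio.h>

    /* const data with static storage duration is typically placed in a read-only
     * section (e.g. .rodata), so its pages stay clean and can be shared. */
    static const char kStatusMsg[] = "transaction committed";
    static const int  kRetryLimit  = 3;

    void report_status(void)
    {
        printf("%s (retry limit %d)\n", kStatusMsg, kRetryLimit);
    }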

  19. Evaluation • Scaled Setup • 4P Itanium 2, 1.5 GHz • 3 MB / 6 MB L3 cache • 32 GB memory • Large workload • Cached Setup • 4P Itanium 2, 1.5 GHz • 3 MB / 6 MB L3 cache • 8 GB memory • Small workload • Negligible disk I/O • Runs CPU-bound • A high speed-up on a cached setup does not necessarily translate into a high speed-up on a scaled setup • Red Hat Linux 2.1, Oracle V9, and the Intel Itanium Compiler V7.1

  20. Speed-Ups per Optimization

  21. Conclusion • Compiler optimizations are essential for OLTP performance on both cached and scaled setups • Memory traffic continues to be the major bottleneck for OLTP workloads • Interaction among the compiler optimizations may well deserve further study
