Using Platform-Specific Performance Counters for Dynamic Compilation • Florian Schneider and Thomas Gross • ETH Zurich
Introduction & Motivation • Dynamic compilers common execution platform for OO languages (Java, C#) • Properties of OO programs difficult to analyze at compile-time • JIT compiler can immediately use information obtained at run-time
Introduction & Motivation Types of information: • Profiles: e.g. execution frequency of methods / basic blocks • Hardware-specific properties: cache misses, TLB misses, branch prediction failures
Outline • Introduction • Requirements • Related work • Implementation • Results • Conclusions
Requirements • Infrastructure must be flexible enough to measure different execution metrics • Hide machine-specific details from the VM • Keep changes to the VM/compiler minimal • Keep the runtime overhead of collecting information from the CPU low • Information must be precise enough to be useful for online optimization
Related work • Profile-guided optimization • Code positioning [Pettis, PLDI '90] • Hardware performance monitors • Relating HPM data to basic blocks [Ammons, PLDI '97] • "Vertical profiling" [Hauswirth, OOPSLA '04] • Dynamic optimization • Mississippi Delta [Adl-Tabatabai, PLDI '04] • Object reordering [Huang, OOPSLA '04] • Our work: • No instrumentation • Use profile data + hardware info • Targets fully automatic dynamic optimization
Hardware performance monitors • Sampling-based counting • CPU reports its state every n events • Precision is platform-dependent (pipelines, out-of-order execution) • Sampling provides method-, basic-block-, or instruction-level information • Newer CPUs support precise sampling (e.g. P4, Itanium)
Hardware performance monitors • A way to localize performance bottlenecks • The sampling interval determines how fine-grained the information is • Smaller sampling interval → more data • Trade-off: precision vs. runtime overhead • Need enough samples for a representative picture of the program behavior
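As a rough illustration of this trade-off (using the ~3000-cycle per-sample cost reported later): sampling every 10,000th L2 miss in a run with 10^8 misses yields 10^4 samples, i.e. about 3 × 10^7 cycles of monitoring work; halving the interval doubles both the number of samples and the cost.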
Implementation • Main parts • Kernel module: low-level access to the hardware, per-process counting • User-space library: hides kernel & device-driver details from the VM • Java VM thread: collects samples periodically, maps samples to Java code • Implemented on top of Jikes RVM
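From the VM side, the collector thread could look roughly like the sketch below. All names (Sample, PerfCounterLib, SampleCollector) are illustrative stand-ins, not the actual Jikes RVM code:

```java
// Hypothetical sketch of the VM-side collector thread.
class Sample {
    long pc;          // sampled program counter
    int[] registers;  // register contents at the sample point
}

class PerfCounterLib {
    // Native entry into the user-space library that hides the kernel
    // module / device-driver details; fills 'buffer', returns sample count.
    static native int readSamples(Sample[] buffer);
}

class SampleCollector extends Thread {
    private static final int BUFFER_SIZE = 4096;  // fixed buffer size
    private final long pollingIntervalMs;         // fixed polling interval

    SampleCollector(long pollingIntervalMs) {
        this.pollingIntervalMs = pollingIntervalMs;
        setDaemon(true);                          // do not keep the VM alive
    }

    @Override
    public void run() {
        Sample[] buffer = new Sample[BUFFER_SIZE];
        while (true) {
            int n = PerfCounterLib.readSamples(buffer);
            for (int i = 0; i < n; i++) {
                attributeToJavaCode(buffer[i]);   // see "From raw data to Java"
            }
            try {
                Thread.sleep(pollingIntervalMs);
            } catch (InterruptedException e) {
                return;
            }
        }
    }

    private void attributeToJavaCode(Sample s) { /* map PC -> method/bytecode */ }
}
```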
Implementation • Supported events: • L1 and L2 cache misses • DTLB misses • Branch prediction • Parameters of the monitoring module: • Buffer size (fixed) • Polling interval (fixed) • Sampling interval (adaptive) • Keep the runtime overhead constant by automatically adjusting the sampling interval at run time
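One simple way to realize the adaptive interval is sketched below. This is an illustration of the idea, not necessarily the paper's exact policy; the doubling/halving rule and the start value are assumptions:

```java
// Sketch: widen/narrow the sampling interval so the estimated
// monitoring overhead stays near a target.
class AdaptiveInterval {
    static final double TARGET_OVERHEAD = 0.02;   // aim for ~2% overhead
    static final long CYCLES_PER_SAMPLE = 3000;   // measured cost per sample
    long samplingInterval = 10_000;               // events between samples (assumed start value)

    // Called once per polling period with the samples seen and cycles elapsed.
    long adapt(long samplesSeen, long cyclesElapsed) {
        double overhead = (double) (samplesSeen * CYCLES_PER_SAMPLE) / cyclesElapsed;
        if (overhead > TARGET_OVERHEAD) {
            samplingInterval *= 2;                               // too costly: sample less often
        } else if (overhead < TARGET_OVERHEAD / 2) {
            samplingInterval = Math.max(1, samplingInterval / 2); // cheap: sample more often
        }
        return samplingInterval;
    }
}
```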
From raw data to Java • Determine method + bytecode instruction • Build sorted method table • Map machine-code offset to bytecode

Example: compiled machine code, annotated with the bytecodes it implements (GETFIELD, ARRAYLOAD, INVOKEVIRTUAL):

```
0x080485e1: mov 0x4(%esi),%esi
0x080485e4: mov $0x4,%edi
0x080485e9: mov (%esi,%edi,4),%esi
0x080485ec: mov %ebx,0x4(%esi)
0x080485ef: mov $0x4,%ebx
0x080485f4: push %ebx
0x080485f5: mov $0x0,%ebx
0x080485fa: push %ebx
0x080485fb: mov 0x8(%ebp),%ebx
0x080485fe: push %ebx
0x080485ff: mov (%ebx),%ebx
0x08048601: call *0x4(%ebx)
0x08048604: add $0xc,%esp
0x08048607: mov 0x8(%ebp),%ebx
0x0804860a: mov 0x4(%ebx),%ebx
```
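The lookup itself can be sketched as follows (a minimal sketch with illustrative names, CompiledMethodEntry and MethodTable, not the actual Jikes RVM classes): compiled methods are kept sorted by code start address, a sampled PC is resolved by binary search, and the machine-code offset is translated to a bytecode index through a per-method offset map:

```java
class CompiledMethodEntry {
    long codeStart, codeEnd;   // address range of the compiled code
    int[] machineOffsets;      // sorted machine-code offsets...
    int[] bytecodeIndices;     // ...and the bytecode index each one starts
}

class MethodTable {
    CompiledMethodEntry[] table;   // sorted by codeStart

    // Returns the bytecode index for a sampled PC, or -1 for non-Java code.
    int bytecodeIndexFor(long pc) {
        int lo = 0, hi = table.length - 1;
        while (lo <= hi) {                      // binary search over code ranges
            int mid = (lo + hi) >>> 1;
            CompiledMethodEntry m = table[mid];
            if (pc < m.codeStart)      hi = mid - 1;
            else if (pc >= m.codeEnd)  lo = mid + 1;
            else {
                int off = (int) (pc - m.codeStart);
                int j = java.util.Arrays.binarySearch(m.machineOffsets, off);
                if (j < 0) j = -j - 2;          // take the preceding map entry
                return j >= 0 ? m.bytecodeIndices[j] : -1;
            }
        }
        return -1;                              // PC not in any compiled Java method
    }
}
```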
From raw data to Java • Sample gives PC + register contents • PC → machine code → compiled Java code → bytecode instruction • For data addresses: use registers + machine code to compute the target address: • GETFIELD → indirect load: mov 12(%eax),%eax // 12 = offset of field
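A sketch of that reconstruction step (the register numbering and the decoding interface are assumptions made for illustration):

```java
// Sketch: recover the data address of a sampled GETFIELD that was
// compiled to an indirect load such as "mov 12(%eax), %eax".
// The sample delivers the register file; decoding the instruction at
// the PC yields the base register and the displacement.
class AddressReconstruction {
    static final int EAX = 0;  // assumed register numbering

    static long effectiveAddress(int[] registers, int baseReg, int displacement) {
        // zero-extend the 32-bit register value, then add the displacement
        return (registers[baseReg] & 0xFFFFFFFFL) + displacement;
    }
}
// e.g. for "mov 12(%eax), %eax": effectiveAddress(regs, EAX, 12)
// gives the address of the loaded field -> usable for locality analysis.
```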
Engineering issues • Lookup of a PC to get the method / bytecode instruction must be efficient • Runs in parallel with the user program • Use binary search / hash table • Update at recompilation and GC (see the sketch below) • Identify 100% of instructions (PCs): • Include samples from application, VM, and library code • Deal with native parts
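One way to keep lookups safe while the table is updated at recompilation/GC is to publish a fresh sorted copy atomically (copy-on-write), so the collector thread never sees a half-updated table. This reuses the CompiledMethodEntry type from the earlier sketch and is illustrative, not the actual Jikes RVM mechanism:

```java
import java.util.concurrent.atomic.AtomicReference;

class ConcurrentMethodTable {
    private final AtomicReference<CompiledMethodEntry[]> current =
        new AtomicReference<>(new CompiledMethodEntry[0]);

    // Lock-free read for the sampler thread: always a consistent snapshot.
    CompiledMethodEntry[] snapshot() { return current.get(); }

    // Called by the compiler/GC thread when code is (re)compiled or moved.
    void replace(CompiledMethodEntry[] newSortedTable) {
        current.set(newSortedTable);   // single atomic publish
    }
}
```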
Infrastructure • Jikes RVM 2.3.5 on a Linux 2.4 kernel as the runtime platform • Pentium 4, 3 GHz, 1 GB RAM, 1 MB L2 cache • Measured data show: • Runtime overhead • Extraction of meaningful information
Runtime overhead • Experiment setup: monitor L2 cache misses
Runtime overhead: SPECjbb • Total cost per sample: ~3000 cycles
Measurements • Measure which instructions produce the most events (cache misses, branch mispredictions) • Potential for data-locality and control-flow optimizations • Compare different SPEC benchmarks • Find "hot spots": the instructions that produce 80% of all measured events (sketch below)
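The "80% quantile" figures on the following slides can be computed as below (a minimal sketch; hotSpotCount is a hypothetical helper, not from the paper): sort the per-instruction event counts and count how many of the hottest instructions account for 80% of all events.

```java
class HotSpots {
    // Returns the number of hottest instructions that together account
    // for 80% of all sampled events.
    static int hotSpotCount(long[] eventsPerInstruction) {
        long[] c = eventsPerInstruction.clone();
        java.util.Arrays.sort(c);                 // ascending order
        long total = 0;
        for (long x : c) total += x;
        long acc = 0;
        int n = 0;
        for (int i = c.length - 1; i >= 0 && acc < 0.8 * total; i--) {
            acc += c[i];                          // walk from hottest downwards
            n++;
        }
        return n;   // e.g. 21 of N=571 instructions for one benchmark
    }
}
```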
L1/L2 cache misses • 80% quantile = 21 instructions (N = 571) • 80% quantile = 13 instructions (N = 295)
L1/L2 cache misses • 80% quantile = 477 instructions (N = 8526) • 80% quantile = 76 instructions (N = 2361)
L1/L2 cache misses • 80% quantile = 1296 instructions (N = 3172) • 80% quantile = 153 instructions (N = 672)
Branch prediction • 80% quantile = 307 instructions (N = 4193) • 80% quantile = 1575 instructions (N = 7478)
Summary • The distribution of events over the program differs significantly between benchmarks • Challenge: are the data precise enough to guide optimizations in a dynamic compiler?
Further work • Apply the information in the optimizer • Data: access-path expressions p.x.y • Control flow: inlining, I-cache locality • Investigate flexible sampling intervals • Further optimizations of the monitoring system • Replace expensive JNI calls • Avoid copying of samples
Concluding remarks • Precise performance-event monitoring is possible with low overhead (~2%) • The monitoring infrastructure is tied into the Jikes RVM compiler • Instruction-level information allows optimizations to focus on "hot spots" • A good platform for studying how compiler decisions can be coupled to hardware-specific platform properties