Using Platform-Specific Performance Counters for Dynamic Compilation • Florian Schneider and Thomas Gross • ETH Zurich
Introduction & Motivation • Dynamic compilers common execution platform for OO languages (Java, C#) • Properties of OO programs difficult to analyze at compile-time • JIT compiler can immediately use information obtained at run-time
Introduction & Motivation Types of information: • Profiles: e.g. execution frequency of methods / basic blocks • Hardware-specific properties: cache misses, TLB misses, branch prediction failures
Outline • Introduction • Requirements • Related work • Implementation • Results • Conclusions
Requirements • Infrastructure must be flexible enough to measure different execution metrics • Hide machine-specific details from the VM • Keep changes to the VM/compiler minimal • Keep the runtime overhead of collecting information from the CPU low • Information must be precise enough to be useful for online optimization
Related work • Profile-guided optimization • Code positioning [Pettis, PLDI '90] • Hardware performance monitors • Relating HPM data to basic blocks [Ammons, PLDI '97] • "Vertical profiling" [Hauswirth, OOPSLA '04] • Dynamic optimization • Mississippi Delta [Adl-Tabatabai, PLDI '04] • Object reordering [Huang, OOPSLA '04] • Our work: • No instrumentation • Use profile data + hardware info • Targets fully automatic dynamic optimization
Hardware performance monitors • Sampling-based counting • CPU reports its state every n events • Precision is platform-dependent (pipelines, out-of-order execution) • Sampling provides method-, basic-block-, or instruction-level information • Newer CPUs support precise sampling (e.g. P4, Itanium)
Hardware performance monitors • A way to localize performance bottlenecks • The sampling interval determines how fine-grained the information is • Smaller sampling interval → more data • Trade-off: precision vs. runtime overhead • Need enough samples for a representative picture of the program behavior
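As a rough illustration of this trade-off (using the ~3000-cycle per-sample cost reported later): sampling every 10,000th L2 miss in a run with 10^8 misses yields 10^4 samples, i.e. about 3 × 10^7 cycles of monitoring work; halving the interval doubles both the number of samples and the cost.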
Implementation • Main parts • Kernel module: low-level access to the hardware, per-process counting • User-space library: hides kernel & device-driver details from the VM • Java VM thread: collects samples periodically, maps samples to Java code • Implemented on top of Jikes RVM
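From the VM side, the collector thread could look roughly like the sketch below. All names (Sample, PerfCounterLib, SampleCollector) are illustrative stand-ins, not the actual Jikes RVM code:

```java
// Hypothetical sketch of the VM-side collector thread.
class Sample {
    long pc;          // sampled program counter
    int[] registers;  // register contents at the sample point
}

class PerfCounterLib {
    // Native entry into the user-space library that hides the kernel
    // module / device-driver details; fills 'buffer', returns sample count.
    static native int readSamples(Sample[] buffer);
}

class SampleCollector extends Thread {
    private static final int BUFFER_SIZE = 4096;  // fixed buffer size
    private final long pollingIntervalMs;         // fixed polling interval

    SampleCollector(long pollingIntervalMs) {
        this.pollingIntervalMs = pollingIntervalMs;
        setDaemon(true);                          // do not keep the VM alive
    }

    @Override
    public void run() {
        Sample[] buffer = new Sample[BUFFER_SIZE];
        while (true) {
            int n = PerfCounterLib.readSamples(buffer);
            for (int i = 0; i < n; i++) {
                attributeToJavaCode(buffer[i]);   // see "From raw data to Java"
            }
            try {
                Thread.sleep(pollingIntervalMs);
            } catch (InterruptedException e) {
                return;
            }
        }
    }

    private void attributeToJavaCode(Sample s) { /* map PC -> method/bytecode */ }
}
```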
Implementation • Supported events: • L1 and L2 cache misses • DTLB misses • Branch prediction • Parameters of the monitoring module: • Buffer size (fixed) • Polling interval (fixed) • Sampling interval (adaptive) • Keep the runtime overhead constant by automatically adjusting the sampling interval at run time
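One simple way to realize the adaptive interval is sketched below. This is an illustration of the idea, not necessarily the paper's exact policy; the doubling/halving rule and the start value are assumptions:

```java
// Sketch: widen/narrow the sampling interval so the estimated
// monitoring overhead stays near a target.
class AdaptiveInterval {
    static final double TARGET_OVERHEAD = 0.02;   // aim for ~2% overhead
    static final long CYCLES_PER_SAMPLE = 3000;   // measured cost per sample
    long samplingInterval = 10_000;               // events between samples (assumed start value)

    // Called once per polling period with the samples seen and cycles elapsed.
    long adapt(long samplesSeen, long cyclesElapsed) {
        double overhead = (double) (samplesSeen * CYCLES_PER_SAMPLE) / cyclesElapsed;
        if (overhead > TARGET_OVERHEAD) {
            samplingInterval *= 2;                               // too costly: sample less often
        } else if (overhead < TARGET_OVERHEAD / 2) {
            samplingInterval = Math.max(1, samplingInterval / 2); // cheap: sample more often
        }
        return samplingInterval;
    }
}
```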
From raw data to Java • Determine method + bytecode instruction • Build sorted method table • Map machine-code offset to bytecode

Example: compiled machine code, annotated with the bytecodes it implements (GETFIELD, ARRAYLOAD, INVOKEVIRTUAL):

```
0x080485e1: mov 0x4(%esi),%esi
0x080485e4: mov $0x4,%edi
0x080485e9: mov (%esi,%edi,4),%esi
0x080485ec: mov %ebx,0x4(%esi)
0x080485ef: mov $0x4,%ebx
0x080485f4: push %ebx
0x080485f5: mov $0x0,%ebx
0x080485fa: push %ebx
0x080485fb: mov 0x8(%ebp),%ebx
0x080485fe: push %ebx
0x080485ff: mov (%ebx),%ebx
0x08048601: call *0x4(%ebx)
0x08048604: add $0xc,%esp
0x08048607: mov 0x8(%ebp),%ebx
0x0804860a: mov 0x4(%ebx),%ebx
```
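The lookup itself can be sketched as follows (a minimal sketch with illustrative names, CompiledMethodEntry and MethodTable, not the actual Jikes RVM classes): compiled methods are kept sorted by code start address, a sampled PC is resolved by binary search, and the machine-code offset is translated to a bytecode index through a per-method offset map:

```java
class CompiledMethodEntry {
    long codeStart, codeEnd;   // address range of the compiled code
    int[] machineOffsets;      // sorted machine-code offsets...
    int[] bytecodeIndices;     // ...and the bytecode index each one starts
}

class MethodTable {
    CompiledMethodEntry[] table;   // sorted by codeStart

    // Returns the bytecode index for a sampled PC, or -1 for non-Java code.
    int bytecodeIndexFor(long pc) {
        int lo = 0, hi = table.length - 1;
        while (lo <= hi) {                      // binary search over code ranges
            int mid = (lo + hi) >>> 1;
            CompiledMethodEntry m = table[mid];
            if (pc < m.codeStart)      hi = mid - 1;
            else if (pc >= m.codeEnd)  lo = mid + 1;
            else {
                int off = (int) (pc - m.codeStart);
                int j = java.util.Arrays.binarySearch(m.machineOffsets, off);
                if (j < 0) j = -j - 2;          // take the preceding map entry
                return j >= 0 ? m.bytecodeIndices[j] : -1;
            }
        }
        return -1;                              // PC not in any compiled Java method
    }
}
```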
From raw data to Java • Sample gives PC + register contents • PC → machine code → compiled Java code → bytecode instruction • For data addresses: use registers + machine code to compute the target address: • GETFIELD → indirect load: mov 12(%eax),%eax // 12 = offset of field
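A sketch of that reconstruction step (the register numbering and the decoding interface are assumptions made for illustration):

```java
// Sketch: recover the data address of a sampled GETFIELD that was
// compiled to an indirect load such as "mov 12(%eax), %eax".
// The sample delivers the register file; decoding the instruction at
// the PC yields the base register and the displacement.
class AddressReconstruction {
    static final int EAX = 0;  // assumed register numbering

    static long effectiveAddress(int[] registers, int baseReg, int displacement) {
        // zero-extend the 32-bit register value, then add the displacement
        return (registers[baseReg] & 0xFFFFFFFFL) + displacement;
    }
}
// e.g. for "mov 12(%eax), %eax": effectiveAddress(regs, EAX, 12)
// gives the address of the loaded field -> usable for locality analysis.
```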
Engineering issues • Lookup of a PC to get the method / bytecode instruction must be efficient • Runs in parallel with the user program • Use binary search / hash table • Update at recompilation and GC (see the sketch below) • Identify 100% of instructions (PCs): • Include samples from application, VM, and library code • Deal with native parts
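One way to keep lookups safe while the table is updated at recompilation/GC is to publish a fresh sorted copy atomically (copy-on-write), so the collector thread never sees a half-updated table. This reuses the CompiledMethodEntry type from the earlier sketch and is illustrative, not the actual Jikes RVM mechanism:

```java
import java.util.concurrent.atomic.AtomicReference;

class ConcurrentMethodTable {
    private final AtomicReference<CompiledMethodEntry[]> current =
        new AtomicReference<>(new CompiledMethodEntry[0]);

    // Lock-free read for the sampler thread: always a consistent snapshot.
    CompiledMethodEntry[] snapshot() { return current.get(); }

    // Called by the compiler/GC thread when code is (re)compiled or moved.
    void replace(CompiledMethodEntry[] newSortedTable) {
        current.set(newSortedTable);   // single atomic publish
    }
}
```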
Infrastructure • Jikes RVM 2.3.5 on a Linux 2.4 kernel as the runtime platform • Pentium 4, 3 GHz, 1 GB RAM, 1 MB L2 cache • Measured data show: • Runtime overhead • Extraction of meaningful information
Runtime overhead • Experiment setup: monitor L2 cache misses
Runtime overhead: SPECjbb • Total cost per sample: ~3000 cycles
Measurements • Measure which instructions produce the most events (cache misses, branch mispredictions) • Potential for data-locality and control-flow optimizations • Compare different SPEC benchmarks • Find "hot spots": the instructions that produce 80% of all measured events (sketch below)
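The "80% quantile" figures on the following slides can be computed as below (a minimal sketch; hotSpotCount is a hypothetical helper, not from the paper): sort the per-instruction event counts and count how many of the hottest instructions account for 80% of all events.

```java
class HotSpots {
    // Returns the number of hottest instructions that together account
    // for 80% of all sampled events.
    static int hotSpotCount(long[] eventsPerInstruction) {
        long[] c = eventsPerInstruction.clone();
        java.util.Arrays.sort(c);                 // ascending order
        long total = 0;
        for (long x : c) total += x;
        long acc = 0;
        int n = 0;
        for (int i = c.length - 1; i >= 0 && acc < 0.8 * total; i--) {
            acc += c[i];                          // walk from hottest downwards
            n++;
        }
        return n;   // e.g. 21 of N=571 instructions for one benchmark
    }
}
```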
L1/L2 cache misses • 80% quantile = 21 instructions (N = 571) • 80% quantile = 13 instructions (N = 295)
L1/L2 cache misses • 80% quantile = 477 instructions (N = 8526) • 80% quantile = 76 instructions (N = 2361)
L1/L2 cache misses • 80% quantile = 1296 instructions (N = 3172) • 80% quantile = 153 instructions (N = 672)
Branch prediction • 80% quantile = 307 instructions (N = 4193) • 80% quantile = 1575 instructions (N = 7478)
Summary • The distribution of events over the program differs significantly between benchmarks • Challenge: are the data precise enough to guide optimizations in a dynamic compiler?
Further work • Apply the information in the optimizer • Data: access-path expressions p.x.y • Control flow: inlining, I-cache locality • Investigate flexible sampling intervals • Further optimizations of the monitoring system • Replace expensive JNI calls • Avoid copying of samples
Concluding remarks • Precise performance-event monitoring is possible with low overhead (~2%) • The monitoring infrastructure is tied into the Jikes RVM compiler • Instruction-level information allows optimizations to focus on "hot spots" • A good platform for studying how compiler decisions can be coupled to hardware-specific platform properties