Guiding Ispike with Instrumentation and Hardware (PMU) Profiles CGO’04 Tutorial 3/21/04

CK. Luk
chi-keung.luk@intel.com
Massachusetts Microprocessor Design Center
Intel Corporation

What is Ispike?
• A post-link optimizer for Itanium/Linux
• No source code required
• Memory-centric optimizations:
  • Code layout + prefetching, data layout + prefetching
• Significant speedups over compiler-optimized programs:
  • 10% average speedup over gcc -O3 on SPEC CINT2000
• Profile uses:
  • Understanding program characteristics
  • Driving optimizations automatically
  • Evaluating the effectiveness of optimizations

Profiles used by Ispike

Profile Example: D-EAR (cache)
[Figure: top 10 loads in the D-EAR profile of the MCF benchmark, showing the total sampled miss latency of each load broken down into latency buckets]
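A minimal sketch of how a "top loads" view like this could be built from raw samples. It assumes each D-EAR sample has been reduced to a (load PC, miss latency in cycles) pair; the bucket boundaries, the function names, and the sample values are hypothetical and are not taken from Ispike.

```python
from collections import defaultdict

# Hypothetical latency-bucket upper bounds in cycles (an assumption for
# illustration; the slides do not give the real bucket boundaries).
BUCKET_EDGES = (8, 16, 32, 64, 128)


def bucket_index(latency):
    """Return the index of the first bucket whose upper bound covers latency."""
    for i, edge in enumerate(BUCKET_EDGES):
        if latency <= edge:
            return i
    return len(BUCKET_EDGES)  # overflow bucket for latencies above the last edge


def top_loads(samples, n=10):
    """Rank loads by total sampled miss latency.

    samples: iterable of (load_pc, miss_latency_cycles) pairs.
    Returns a list of (load_pc, total_latency, per_bucket_sample_counts).
    """
    total = defaultdict(int)
    buckets = defaultdict(lambda: [0] * (len(BUCKET_EDGES) + 1))
    for pc, latency in samples:
        total[pc] += latency
        buckets[pc][bucket_index(latency)] += 1
    ranked = sorted(total, key=total.get, reverse=True)[:n]
    return [(pc, total[pc], buckets[pc]) for pc in ranked]


if __name__ == "__main__":
    # Made-up samples, purely to exercise the function.
    demo = [(0x4000, 10), (0x4000, 200), (0x4010, 5)]
    for pc, latency, counts in top_loads(demo):
        print(hex(pc), latency, counts)
```

Ranking by total sampled latency rather than by sample count favors loads whose misses are expensive, which is the quantity the slide reports.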

Profile Analysis Tools
• A set of tools written for visualizing and analyzing profiles, e.g.:
  • Control flow graph (CFG) viewer
  • Code-layout viewer
  • Load-latency comparator

CFG Viewer
• For evaluating the accuracy of profiles

Code-layout Viewer
• For evaluating code-layout optimization

Load-latency Comparator
• For evaluating data-layout optimization and data prefetching

Deriving New Profiles from PMUs
• New profile types can be derived from PMUs
• Two examples:
  • Consumer stall cycles
  • D-cache miss strides

Consumer Stall Cycles
Basic block A, with the PC-sample count of each instruction:
    I1: ld8 r2 = [r3];;          N1
    /* other instructions */
    I2: add r2 = r2, 1;;         N2
    I3: st8 [r3] = r2            N3
Question:
• How many stall cycles does I2 experience? (Note: not necessarily the load latency of I1)
Method:
• PC-sample count is proportional to (stall cycles * frequency)
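The proportionality above can be turned into a rough per-execution estimate. The sketch below is an illustration under stated assumptions, not Ispike's actual computation: it assumes the PC is sampled once every fixed number of cycles, that a sample taken during a stall is attributed to the stalled consumer (I2 here), and that the block's execution frequency comes from a separate profile. The function name, parameters, and numbers are hypothetical.

```python
def stall_cycles_per_execution(pc_samples, sampling_period_cycles, exec_count):
    """Estimate the average cycles an instruction occupies per execution.

    pc_samples:             PC-sample count of the instruction (e.g., N2 for I2)
    sampling_period_cycles: assumed number of cycles between two PC samples
    exec_count:             how often the enclosing basic block executed
    """
    if exec_count <= 0:
        raise ValueError("exec_count must be positive")
    # Each sample stands for roughly one sampling period of time spent at this
    # PC, so samples * period approximates (stall cycles * frequency).
    return pc_samples * sampling_period_cycles / exec_count


if __name__ == "__main__":
    # Made-up numbers: 50 samples, one sample per 10,000 cycles,
    # block executed 100,000 times.
    print(stall_cycles_per_execution(50, 10_000, 100_000))
```

Dividing by the execution count is what separates a rarely executed instruction with long stalls from a hot instruction with short ones, which the raw sample count alone cannot do.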

D-cache Miss Strides
Problem:
• Detect strides that are statically unknown
Example: Two strided loads in MCF
    arc* arcin;
    node* tail;
    …
    while (arcin) {
        tail = arcin->tail;
        …
        arcin = tail->mark;
    }
[Figure: successive values of arcin differ by -192B; successive values of tail differ by -120B]

D-EAR based Stride Profiling
• Sample load misses with 2 phases:
  • Skipping phases (1 sample per 1000 misses)
  • Inspection phases (1 sample per miss)
[Figure: timeline alternating between skipping and inspection phases; A1, A2, A3, A4 are sampled in an inspection phase]
• Use GCD to figure out strides from miss addresses:
  • A1, A2, A3, A4 are four consecutive miss addresses of a load. The load has a stride of 48 bytes.
  • A2-A1 = 5*48 = 240, A3-A2 = 7*48 = 336, A4-A3 = 3*48 = 144
  • GCD(A2-A1, A3-A2) = GCD(240, 336) = 48
  • GCD(A3-A2, A4-A3) = GCD(336, 144) = 48
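The GCD step can be sketched in a few lines. This is an illustrative reading of the slide, not Ispike's code: the function name and the majority vote over candidate strides are assumptions, and the base address in the demo is made up so that the differences match the 240, 336, 144 example.

```python
from collections import Counter
from math import gcd


def infer_stride(miss_addresses):
    """Guess a load's stride from consecutive sampled miss addresses.

    Returns the stride magnitude in bytes, or None if fewer than three
    addresses are given or no non-zero candidate exists.
    """
    # Differences between consecutive miss addresses (A2-A1, A3-A2, ...).
    deltas = [b - a for a, b in zip(miss_addresses, miss_addresses[1:])]
    # GCD of each pair of adjacent differences; math.gcd ignores signs,
    # so the sign of the stride has to be taken from the deltas separately.
    candidates = [gcd(d1, d2) for d1, d2 in zip(deltas, deltas[1:])]
    candidates = [c for c in candidates if c > 0]
    if not candidates:
        return None
    # Assumption: take the most frequent candidate as the stride.
    stride, _count = Counter(candidates).most_common(1)[0]
    return stride


if __name__ == "__main__":
    # A1..A4 with differences 240, 336, 144 (base address is arbitrary).
    print(infer_stride([4096, 4336, 4672, 4816]))
```

The GCD is needed because not every access misses: consecutive miss addresses are separated by varying multiples of the stride, and the common divisor recovers the stride itself.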

Performance Evaluation
• Instrumentation vs. PMU profiles:
  • Profiling overhead
  • Performance impact
• Ispike optimizations:
  • Code layout, instruction prefetching, data layout, data prefetching, inlining, global-data optimization, scalar optimizations
• Baseline compilers:
  • Intel Electron compiler (ecc), version 8.0 Beta, -O3
  • GNU C compiler (gcc), version 3.2, -O3
• Benchmarks:
  • SPEC CINT2000 (profiled with “training”, measured with “reference”)
• System:
  • 1GHz Itanium 2, 16KB L1I/16KB L1D, 256KB L2, 3MB L3, 16GB memory
  • Red Hat Enterprise Linux AS with 2.4.18 kernel

Performance Gains with PMU Profiles
• Sampling rates: BTB (1 sample/10K branches), D-EAR cache (1 sample/100 load misses), D-EAR stride (1 sample/100 misses in skipping, 1 sample/miss in inspection)
• Up to 40% gain
• Geo. means: 8.5% over Ecc and 9.9% over Gcc
[Figure: performance gains per benchmark over the Gcc 3.2 -O3 and Ecc 8.0 -O3 baselines]

Cycle Breakdown (Ecc Baseline)
• Helps show whether individual optimizations are doing a good job

PMU Profiling Overhead
• Overhead reduced from 58% to 23% when lowering the BTB sampling rate by 10x.
• Overhead reduced to 3% when lowering the D-EAR sampling rate by 10x.

Instrumentation Profiling Overhead
Why is the overhead so large?
• Training runs are too short to amortize the dynamic compilation cost
• Techniques like ephemeral instrumentation have yet to be applied

PMU vs. Instrumentation (Perf. Gains)
• PMU profiles can be as good as instrumentation profiles
  • Could be even better in some cases (e.g., mcf)
• However, performance can drop when samples are too sparse
  • E.g., gap and parser when Stride = <1/1000, 1/1>
[Figure: performance gains with PMU vs. instrumentation profiles; chart annotations give profiling overheads of >60x, 59%, 24%, and 3%]

Reference
“Ispike: A Post-link Optimizer for the Intel Itanium Architecture”, by Luk et al. In Proceedings of CGO’04. http://www.cgo.org/papers/01_82_luk_ck.pdf
