
Static Identification of Delinquent Loads

Static Identification of Delinquent Loads. V.M. Panait Sasturkar W.-F. Fong. Agenda. Introduction Related Work Delinquent Loads Framework Address Patterns, Decision Criteria The heuristic: types of classes, computing the weights, final classes Results. Introduction.



Presentation Transcript


  1. Static Identification of Delinquent Loads V.M. Panait Sasturkar W.-F. Fong

  2. Agenda • Introduction • Related Work • Delinquent Loads • Framework • Address Patterns, Decision Criteria • The heuristic: types of classes, computing the weights, final classes • Results

  3. Introduction • Cache – one of the major current bottlenecks in performance • One approach: prefetch; but prefetch what? Can’t prefetch everything… • Few loads are really “bad”: the “delinquent loads” • This paper: classification of address patterns in the load instructions

  4. Introduction • Done after code generation, but before runtime • Singled out 10% of all loads causing over 90% of the misses in 18 SPEC benchmarks • Gets even better when combined with basic block profiling: 1.3% of loads covering over 80% of the misses

  5. Related Work • BDH method: classify loads based on the following criteria: • Region of memory accessed by the load: S (stack), H (heap) or G (global). • Kind of reference: loading a scalar (S), element of array (A) or field of a structure (F) • Type of reference: (P)ointer or (N)ot.

  6. Related Work • Some classes account for most misses: GAN, HSN, HFN, HAN, HFP, HAP. • The OKN method: 3 simple heuristics • Use of a pointer dereference • Use of a strided reference • None of the above • This paper is much more precise than both of the above methods
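
The three-letter BDH class names are one letter per criterion: region, kind of reference, type of reference. A minimal Python sketch of that labeling, using F for a structure field as in the class names HFN and HFP. The function names and the string keys ("stack", "field", and so on) are hypothetical and chosen only for illustration; they are not taken from the BDH work.

```python
# Hypothetical helper names; only the letters and the six "hot" classes
# come from the slides.
REGIONS = {"stack": "S", "heap": "H", "global": "G"}
KINDS = {"scalar": "S", "array": "A", "field": "F"}

# Classes the slides single out as accounting for most misses.
HOT_CLASSES = {"GAN", "HSN", "HFN", "HAN", "HFP", "HAP"}


def bdh_label(region: str, kind: str, is_pointer: bool) -> str:
    """Concatenate region, kind and pointer/not letters into a class name."""
    return REGIONS[region] + KINDS[kind] + ("P" if is_pointer else "N")


def bdh_flags_load(region: str, kind: str, is_pointer: bool) -> bool:
    """True if the load falls in one of the high-miss classes."""
    return bdh_label(region, kind, is_pointer) in HOT_CLASSES
```

For example, a pointer-typed structure field on the heap gets the label HFP and is flagged, while a non-pointer stack scalar (SSN) is not.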

  7. Delinquent Loads • Why not stores too? Write buffers are apparently good enough • Why not do it in hardware? They do, but: • Need additional specialized hardware • Complex decisions (fast) <-> complex hardware • Memory profiling: not always practical

  8. Delinquent Loads & Profiling

  9. Framework • Assembly code -> address patterns for each load instruction -> placement of the load instruction in a class • Classes + weights -> heuristic function • If the value of the heuristic is greater than a delinquency threshold, the instruction is classified as possibly delinquent

  10. Address Patterns • Address Pattern = summary of how the source address of the load instruction is computed • Uses CFG and DF analysis (reaching definitions) (one address pattern for each control path reaching the load) • Only uses basic registers (BR): gp, sp, regparam, regret

  11. The Decision Criteria • Classes are derived from these criteria • H1: Register usage in an address pattern (usage of BRs) • H2: Type of operations used in address computation (arithmetic, logic) • H3: Maximum level of dereferencing

  12. The Decision Criteria • H4: Recurrence (iterative walk through memory) • H5: Execution frequency – based on BB profiling; classifies loads as: • Rarely executed (used here as negative) • Seldom executed (idem) • Fairly often executed (not used here) • In a program hotspot

  13. Decision Criteria and Classes • Each criterion results in a set of classes • Class = set of address patterns with a certain property • There are too many classes that can result; only some are considered, and some of those are also aggregated into one class

  14. Decision Criteria and Classes • H1 – based classes: enumerations of the number of occurrences of each of the 4 BRs in an address pattern • H2 – based classes: address patterns with multiplications and shift operations • H3 – based classes: as many as there are levels of dereferencing in the address patterns

  15. Decision Criteria and Classes • H4 – based classes: two classes (address pattern involves recurrence or not) • H5 – based classes: three classes: rarely, seldom and program hotspot

  16. Experimental Setup • SimpleScalar toolkit: cache simulator (for cache hits & misses), compiler, objdump • Procedure: Fortran -> C code (via f2c) -> MIPS executable (via C2MIPS compiler) -> disassembled code (via objdump) • Reconstruction of CFG and DF analysis

  17. Experimental Setup • 2 stages: learning/training and experimental (actual) • Stage 1: get full memory profiling data on a subset of SPEC benchmarks, use it to compute weights for each class • Use the heuristic thus obtained on a new subset of benchmarks

  18. The Heuristic: Types of Classes • Three types of classes: • Positive (loads in it are likely delinquent) • Negative (… not …) • Neutral • Positive classes have positive weights, negative ones have negative weights, neutral classes have a weight of zero

  19. The Heuristic: Terminology • The miss probability of class F in benchmark j: mj(F,C) • The share of misses accounted for by members of class F in benchmark j: nj(F,C)

  20. The Heuristic: Terminology • mj(F,C) = likelihood of an instruction of class F in benchmark j to be a cache miss • However, if that instruction is only executed once, it won’t be a delinquent load • nj(F,C) = proportion out of total number of misses that members of F account for
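
The two quantities can be sketched from per-load memory profiling data. The slide formulas are not preserved in the transcript, so this Python sketch rests on an assumption: mj is read as misses per dynamic execution over the loads in class F, and nj as the share of all misses in the benchmark caused by loads in F. The `LoadProfile` type and the function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoadProfile:
    """Profiling data for one static load (hypothetical record layout)."""
    executions: int      # dynamic executions of the load
    misses: int          # cache misses it caused
    classes: frozenset   # names of the classes the load belongs to


def miss_probability(loads, cls):
    """Assumed reading of mj(F,C): misses per execution for loads in cls."""
    members = [l for l in loads if cls in l.classes]
    execs = sum(l.executions for l in members)
    return sum(l.misses for l in members) / execs if execs else 0.0


def miss_share(loads, cls):
    """Assumed reading of nj(F,C): fraction of all misses caused by cls."""
    total = sum(l.misses for l in loads)
    in_class = sum(l.misses for l in loads if cls in l.classes)
    return in_class / total if total else 0.0
```

A class can have a high miss probability and still a small miss share when its loads execute rarely, which is why the slides track both.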

  21. The Heuristic: Terminology • Strength index: r = mj / nj • A benchmark j is irrelevant to a class F if both indices mj and nj are below certain thresholds. Otherwise it is relevant. • Positive class: r > 5% for all benchmarks • Negative class: nj < 0.5% for all benchmarks • Neutral class: r < 5% for at least one benchmark

  22. Computing the Weights • Form classes according to the five decision criteria • Compute mj, nj for each class • Weight of class Fk

  23. Computing the Weights • This is the formula for positive classes only • Only relevant benchmarks are included in the formula • |.| is the cardinality of that set, i.e. the number of benchmarks relevant to that class

  24. Aggregate Classes • AG1: both gp and sp are used 1+ each (comes from H1) • AG2: only sp used 2+ (H1) • AG3: either * or shifts are used (H2) • AG4: one level dereferencing (H3) • AG5: two level dereferencing (H3) • AG6: three level dereferencing (H3)

  25. Aggregate Classes • AG7: address patterns containing a recurrence (H4) • AG8: loads with low frequency of execution (100 < f < 1000) (H5) • AG9: loads with fairly low frequency of execution (f < 100 times) (H5) • Weight formula for negative classes: negated mean of positive weights
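
The nine aggregate classes can be read as predicates over a summary of the address pattern plus the execution count. A minimal Python sketch under stated assumptions: the `AddressPattern` fields are a hypothetical summary, not the paper's representation, and "only sp used" is interpreted as no other basic register appearing in the pattern.

```python
from dataclasses import dataclass


@dataclass
class AddressPattern:
    """Hypothetical summary of one address pattern."""
    gp_uses: int
    sp_uses: int
    param_uses: int
    ret_uses: int
    has_mul_or_shift: bool
    deref_levels: int
    has_recurrence: bool


def aggregate_classes(p: AddressPattern, exec_count: int) -> set:
    """Return the names of the aggregate classes AG1..AG9 that apply."""
    classes = set()
    if p.gp_uses >= 1 and p.sp_uses >= 1:
        classes.add("AG1")
    # Assumption: "only sp" means gp, regparam and regret do not appear.
    if p.sp_uses >= 2 and p.gp_uses == p.param_uses == p.ret_uses == 0:
        classes.add("AG2")
    if p.has_mul_or_shift:
        classes.add("AG3")
    if 1 <= p.deref_levels <= 3:
        classes.add("AG%d" % (3 + p.deref_levels))  # AG4, AG5, AG6
    if p.has_recurrence:
        classes.add("AG7")
    if 100 < exec_count < 1000:
        classes.add("AG8")
    elif exec_count < 100:
        classes.add("AG9")
    return classes
```

With the strict bounds quoted on the slide, a load executed exactly 100 times falls in neither AG8 nor AG9; the sketch keeps that behavior rather than guessing which side was intended.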

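  26. The Heuristic Function • The score of a load is the sum of the class weights, each multiplied by an indicator: 1 if the load belongs to the class, 0 otherwise • If the score is greater than the delinquency threshold, the load is delinquent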
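
A minimal Python sketch of this decision, assuming the heuristic is the sum of the weights of the classes a load belongs to, compared against the delinquency threshold. The function names are hypothetical, and any weight or threshold values passed in are placeholders, not results from the paper.

```python
def heuristic_score(load_classes, weights):
    """Sum the weights of the classes the load belongs to.

    Classes without an entry in `weights` count as neutral (weight 0).
    """
    return sum(weights.get(c, 0.0) for c in load_classes)


def is_delinquent(load_classes, weights, threshold):
    """Flag the load as possibly delinquent if its score exceeds the threshold."""
    return heuristic_score(load_classes, weights) > threshold
```

With illustrative weights such as `{"AG4": 0.5, "AG7": 0.25, "AG9": -0.5}` and a threshold of 0.5, a load in AG4 and AG7 scores 0.75 and is flagged, while the same load with AG9 added scores 0.25 and is not: negative classes pull rarely executed loads back under the threshold.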

  27. Precision and Coverage • Precision of a heuristic scheme H: the percentage of all loads that scheme H identifies as delinquent (the lower, i.e., the closer to the real number of delinquent loads, the better) • Coverage of a heuristic scheme H: the percentage of all cache misses caused by loads that scheme H identifies as delinquent (the closer to 100%, the better)
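
Both metrics can be computed from a per-load miss count and the set of loads a scheme flags. A minimal Python sketch, assuming precision is the fraction of all static loads that are flagged (consistent with the "10% of all loads" figure quoted elsewhere in the deck) and coverage is the fraction of all misses caused by flagged loads. The names `misses_by_load` and `flagged` are hypothetical.

```python
def precision(flagged, misses_by_load):
    """Fraction of all static loads that the scheme flags (lower is better)."""
    return len(flagged) / len(misses_by_load)


def coverage(flagged, misses_by_load):
    """Fraction of all cache misses caused by flagged loads (higher is better)."""
    total = sum(misses_by_load.values())
    covered = sum(m for load, m in misses_by_load.items() if load in flagged)
    return covered / total if total else 0.0
```

For ten loads where one load causes 90 of 100 misses, flagging only that load gives a precision of 0.1 and a coverage of 0.9.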

  28. Results on different inputs

  29. Results when varying cache associativity

  30. Results when varying cache size

  31. Performance on new benchmarks

  32. Performance summary

  33. Performance of OKN & BDH

  34. Performance with various 

  35. Combination with BB profiling • Use the heuristic to sharpen the set returned by BB profiling • Also add loads that are not in the hotspots • A tunable parameter sets the percentage of the highest-scoring loads detected by our method but not by profiling that we consider to be delinquent
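
One way to read this combination, sketched in Python under stated assumptions: "sharpening" is taken to mean keeping only hotspot loads that the static heuristic also flags, and the percentage parameter selects the top-scoring flagged loads outside the hotspots, rounded up. The function and parameter names (`combine_with_profiling`, `extra_pct`) are hypothetical, and the actual procedure in the paper may differ.

```python
import math


def combine_with_profiling(scores, hotspot_loads, threshold, extra_pct):
    """Merge the static heuristic with basic-block profiling.

    scores: load id -> heuristic score
    hotspot_loads: set of load ids that BB profiling places in hotspots
    extra_pct: percentage (0-100) of flagged non-hotspot loads to keep
    """
    flagged = {load for load, s in scores.items() if s > threshold}
    # Assumed meaning of "sharpen": hotspot loads the heuristic also flags.
    sharpened = flagged & hotspot_loads
    # Highest-scoring flagged loads that profiling did not report.
    outside = sorted(flagged - hotspot_loads,
                     key=lambda load: scores[load], reverse=True)
    keep = math.ceil(len(outside) * extra_pct / 100)
    return sharpened | set(outside[:keep])
```

Setting `extra_pct` to 0 reduces the result to the sharpened hotspot set; raising it trades precision for coverage by admitting more loads from outside the hotspots.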

  36. Combination with BB profiling

  37. Conclusions • The static scheme for identifying delinquent loads has a precision of 10% and coverage of over 90% over 18 benchmarks • More precise than related work, with similar coverage • Immune to variation of framework parameters (e.g. cache size, associativity, input)
