Cross-Architecture Performance Modeling for Scientific Applications
Explore performance predictions for scientific applications using parameterized models. Learn about edge frequencies, node weights, static and dynamic analyses, memory access modeling, and predicting performance.
Presentation Transcript
Performance Modeling: Cross-Architecture Performance Predictions for Scientific Applications Using Parameterized Models (Marin and Mellor-Crummey)
The Gist
• Performance modeling is important: predict performance for hypothetical problem sizes on hypothetical machines
• Build a model that will:
  • Create an execution graph
  • Model edge frequencies
  • Model node weights
• Use the model to predict performance by summing frequency × weight over the graph, e.g. 2*5 + 2*1 + 5*4 + 5*1 + 2*7 = 51
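The slide's arithmetic can be sketched as a tiny cost model: given (frequency, weight) pairs from the execution graph, the predicted time is their weighted sum. The pairs below are the slide's example numbers; the function name is my own.

```python
def predict_time(profile):
    """Predict execution time as the sum of execution frequency
    times per-node cost over the execution graph."""
    return sum(freq * weight for freq, weight in profile)

# The slide's example: frequencies 2, 2, 5, 5, 2 with weights 5, 1, 4, 1, 7
print(predict_time([(2, 5), (2, 1), (5, 4), (5, 1), (2, 7)]))  # -> 51
```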
Why Model Performance? • Procurement: Which new machine should I buy? • Machine Design: Which machine should I produce? • Application Tuning: Why does my application scale like it does on this machine and how can I help?
Static Analysis
• Create a Control Flow Graph (CFG)
• Identify the instruction mix
• Identify dependencies
  • Some are easy, e.g. add R1 R2 R3 followed by add R4 R1 R2 (the second reads R1, written by the first)
  • Some are more difficult, e.g. ld R1 1000(R2) followed by sw R3 1000(R2): is there a dependence through memory?
• Proposed solution: "access formulas" that describe each reference's address symbolically. A good idea?
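The easy register case can be sketched in a few lines (the instruction encoding and function name are my own, not the paper's): track the last writer of each register; an instruction reading that register depends on it. The hard load/store case needs address analysis instead, which is what the access formulas are for.

```python
def raw_dependences(instrs):
    """instrs: list of (dest, srcs) register tuples for a basic block.
    Return (writer_index, reader_index) pairs for read-after-write
    dependences through registers."""
    last_write = {}   # register -> index of the most recent writing instr
    deps = []
    for i, (dest, srcs) in enumerate(instrs):
        for r in srcs:
            if r in last_write:
                deps.append((last_write[r], i))
        last_write[dest] = i
    return deps

# add R1, R2, R3  then  add R4, R1, R2: the second reads R1 written by the first
print(raw_dependences([("R1", ("R2", "R3")), ("R4", ("R1", "R2"))]))  # -> [(0, 1)]
```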
Dynamic Analysis: Execution Frequency
• Get edge counts with minimum overhead
• Place one counter on each loop
• Construct a spanning tree that includes the uninstrumentable edges
• Place counters on the remaining (non-tree) edges; the tree-edge counts follow from flow conservation
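One way to see why counters on non-tree edges suffice: at every CFG node, the counts flowing in must equal the counts flowing out, so unknown edge counts can be solved for. A sketch (function name and the virtual exit-to-entry edge, which makes conservation hold at the entry and exit nodes too, are my own devices):

```python
def solve_edge_counts(edges, known):
    """edges: list of (src, dst) CFG edges, including a virtual
    exit->entry edge so flow conservation (sum in == sum out)
    holds at every node.  known: counts measured on non-tree edges.
    Iteratively solve for the remaining (spanning-tree) edge counts."""
    counts = dict(known)
    nodes = {n for e in edges for n in e}
    changed = True
    while changed and len(counts) < len(edges):
        changed = False
        for n in nodes:
            ins = [e for e in edges if e[1] == n]
            outs = [e for e in edges if e[0] == n]
            unknown = [e for e in ins + outs if e not in counts]
            if len(unknown) == 1:   # one unknown at this node: solvable
                bal = (sum(counts[e] for e in ins if e in counts)
                       - sum(counts[e] for e in outs if e in counts))
                e = unknown[0]
                counts[e] = bal if e in outs else -bal
                changed = True
    return counts

# Diamond CFG A->{B,C}->D executed 5 times; only A->B and the virtual
# D->A edge carry counters, the other three counts are derived.
counts = solve_edge_counts(
    [("D", "A"), ("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")],
    {("D", "A"): 5, ("A", "B"): 2})
print(counts[("A", "C")], counts[("B", "D")], counts[("C", "D")])  # -> 3 2 3
```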
Modeling Execution Frequencies
• How do the edge weights change with larger input parameters?
• Make multiple runs, varying only one parameter at a time
• Interpolate a function of each parameter for each edge (edge weight with respect to N, with respect to Z, ...), then combine them, e.g. as a linear combination
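For one edge and one parameter, the per-parameter runs yield a handful of (parameter value, edge count) points to fit. A minimal least-squares sketch, assuming a linear model (real edge counts may need polynomial or other bases):

```python
def fit_linear(xs, ys):
    """Least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return my - b * mx, b

# Edge counts measured at problem sizes N = 10, 20, 40
a, b = fit_linear([10, 20, 40], [21, 41, 81])
print(round(a + b * 100))  # extrapolated count at N = 100 -> 201
```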
Dynamic Analysis: Memory Access
Calculate reuse-distance histograms
• Reuse distance: how many distinct memory addresses have been accessed since this address was last accessed?
• Reuse-distance distribution, e.g. 25% of memory references have reuse distance 2, 12% have reuse distance 3, etc.
• A tree structure holds each memory reference, with the time step of its last access as the sort key
• Slowdown? No sampling; total time is O(M log N), where M = number of memory references and N = number of distinct addresses
• Memory requirement?
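A sketch of measuring reuse distances from an address trace. The paper's tree keyed by last-access time gives O(M log N); the sorted list used here is simpler to read but its deletions are linear, so it is illustrative only.

```python
import bisect

def reuse_distances(trace):
    """Return one entry per reference: the number of distinct
    addresses touched since the previous access to the same
    address, or None for a first (cold) access."""
    last_time = {}   # address -> timestamp of its most recent access
    times = []       # sorted last-access timestamps, one per distinct address
    dists = []
    for t, addr in enumerate(trace):
        if addr in last_time:
            t0 = last_time[addr]
            i = bisect.bisect_left(times, t0)
            dists.append(len(times) - i - 1)  # addresses touched after t0
            times.pop(i)
        else:
            dists.append(None)  # cold access: infinite reuse distance
        bisect.insort(times, t)
        last_time[addr] = t
    return dists

print(reuse_distances(["a", "b", "c", "a"]))  # -> [None, None, None, 2]
```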
Modeling Memory Access
• Need to model the full histograms, not just the average distance
• Use the same one-parameter-at-a-time approach as before
Modeling Memory Access (cont.)
• How many bins to use?
• Different problem sizes produce different bins, so we need to normalize
• Accuracy and complexity both increase with bin count
• For each bin, model:
  • number of accesses vs. problem size
  • average reuse distance vs. problem size
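One way to normalize across problem sizes (my own sketch, not necessarily the paper's scheme): rebin each run's histogram into a fixed number of bins holding roughly equal fractions of the accesses, then fit each bin's access count and average distance against problem size as above.

```python
def normalize_histogram(hist, nbins):
    """hist: list of (reuse_distance, access_count) pairs.
    Collapse into at most `nbins` bins of roughly equal access counts,
    returning each bin's (access_count, average_reuse_distance)."""
    total = sum(c for _, c in hist)
    target = total / nbins
    bins, acc, wsum = [], 0, 0.0
    for d, c in sorted(hist):
        acc += c
        wsum += d * c
        if acc >= target and len(bins) < nbins - 1:
            bins.append((acc, wsum / acc))
            acc, wsum = 0, 0.0
    if acc:
        bins.append((acc, wsum / acc))
    return bins

# 100 accesses collapsed to 2 bins of 50 accesses each
print(normalize_histogram([(1, 50), (2, 25), (3, 25)], 2))
# -> [(50, 1.0), (50, 2.5)]
```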
Map Model to Target Architecture
• Combine the CFGs and path-frequency information gathered so far
• Translate the code to generic RISC instructions
• A generic scheduler, initialized with a machine description, predicts the runtime of this generic RISC code (difficult)
• For now, assume all memory references hit in cache
Adding the Memory Model
• Use reuse distances to estimate cache misses (a reference misses when its distance is at least the number of cache blocks) and add the miss penalties
• The scheme assumes a fully associative cache, so it does not account for conflict misses
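The cache model then reduces to thresholding the histogram: in a fully associative LRU cache with B blocks, a reference hits iff its reuse distance (in blocks) is below B. A sketch, with an illustrative miss penalty:

```python
def predict_miss_cycles(hist, cache_blocks, miss_penalty):
    """hist: (reuse_distance, count) pairs; None marks cold accesses.
    A reference misses iff its distance >= the number of cache blocks
    (fully associative LRU, so no conflict misses are modeled)."""
    misses = sum(c for d, c in hist
                 if d is None or d >= cache_blocks)
    return misses, misses * miss_penalty

# 10 cold misses, 50 short reuses that hit, 40 long reuses that miss,
# in a 64-block cache with a (hypothetical) 100-cycle miss penalty
print(predict_miss_cycles([(None, 10), (2, 50), (100, 40)], 64, 100))
# -> (50, 5000)
```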
The Gist, Again
• Performance modeling is important (hypothetical problem sizes on hypothetical machines)
• Build a model that will:
  • Create an execution graph
  • Model edge frequencies
  • Model node weights
• Remaining difficulties:
  • How to understand the effect of parameters
  • Difficult to predict dependencies and instruction scheduling
  • Different compilers can cause performance variations
  • The cost of gathering reuse distances is very high; is it feasible for big applications?
  • Conflict misses are not modeled
  • The penalty of cache misses is not clear