Cross-Architecture Performance Modeling for Scientific Applications
Explore performance predictions for scientific applications using parameterized models. Learn about edge frequencies, node weights, static and dynamic analyses, memory access modeling, and predicting performance.
Presentation Transcript
Performance Modeling: Cross-Architecture Performance Predictions for Scientific Applications Using Parameterized Models (Marin and Mellor-Crummey)
The Gist
• Performance modeling is important: predict performance for hypothetical problem sizes on hypothetical machines
• Build a model that will:
  • Create an execution graph
  • Model edge frequencies
  • Model node weights
• Use the model to predict performance by summing frequency × weight over the graph, e.g. 2*5 + 2*1 + 5*4 + 5*1 + 2*7 = 51
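The slide's arithmetic can be sketched as a tiny cost model: given (frequency, weight) pairs from the execution graph, the predicted time is their weighted sum. The pairs below are the slide's example numbers; the function name is my own.

```python
def predict_time(profile):
    """Predict execution time as the sum of execution frequency
    times per-node cost over the execution graph."""
    return sum(freq * weight for freq, weight in profile)

# The slide's example: frequencies 2, 2, 5, 5, 2 with weights 5, 1, 4, 1, 7
print(predict_time([(2, 5), (2, 1), (5, 4), (5, 1), (2, 7)]))  # -> 51
```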
Why Model Performance? • Procurement: Which new machine should I buy? • Machine Design: Which machine should I produce? • Application Tuning: Why does my application scale like it does on this machine and how can I help?
Static Analysis
• Create a Control Flow Graph (CFG)
• Identify the instruction mix
• Identify dependencies
  • Some are easy, e.g. add R1 R2 R3 followed by add R4 R1 R2 (the second reads R1, written by the first)
  • Some are more difficult, e.g. ld R1 1000(R2) followed by sw R3 1000(R2): is there a dependence through memory?
• Proposed solution: "access formulas" that describe each reference's address symbolically. A good idea?
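The easy register case can be sketched in a few lines (the instruction encoding and function name are my own, not the paper's): track the last writer of each register; an instruction reading that register depends on it. The hard load/store case needs address analysis instead, which is what the access formulas are for.

```python
def raw_dependences(instrs):
    """instrs: list of (dest, srcs) register tuples for a basic block.
    Return (writer_index, reader_index) pairs for read-after-write
    dependences through registers."""
    last_write = {}   # register -> index of the most recent writing instr
    deps = []
    for i, (dest, srcs) in enumerate(instrs):
        for r in srcs:
            if r in last_write:
                deps.append((last_write[r], i))
        last_write[dest] = i
    return deps

# add R1, R2, R3  then  add R4, R1, R2: the second reads R1 written by the first
print(raw_dependences([("R1", ("R2", "R3")), ("R4", ("R1", "R2"))]))  # -> [(0, 1)]
```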
Dynamic Analysis: Execution Frequency
• Get edge counts with minimum overhead
• Place one counter on each loop
• Construct a spanning tree that includes the uninstrumentable edges
• Place counters on the remaining (non-tree) edges; the tree-edge counts follow from flow conservation
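One way to see why counters on non-tree edges suffice: at every CFG node, the counts flowing in must equal the counts flowing out, so unknown edge counts can be solved for. A sketch (function name and the virtual exit-to-entry edge, which makes conservation hold at the entry and exit nodes too, are my own devices):

```python
def solve_edge_counts(edges, known):
    """edges: list of (src, dst) CFG edges, including a virtual
    exit->entry edge so flow conservation (sum in == sum out)
    holds at every node.  known: counts measured on non-tree edges.
    Iteratively solve for the remaining (spanning-tree) edge counts."""
    counts = dict(known)
    nodes = {n for e in edges for n in e}
    changed = True
    while changed and len(counts) < len(edges):
        changed = False
        for n in nodes:
            ins = [e for e in edges if e[1] == n]
            outs = [e for e in edges if e[0] == n]
            unknown = [e for e in ins + outs if e not in counts]
            if len(unknown) == 1:   # one unknown at this node: solvable
                bal = (sum(counts[e] for e in ins if e in counts)
                       - sum(counts[e] for e in outs if e in counts))
                e = unknown[0]
                counts[e] = bal if e in outs else -bal
                changed = True
    return counts

# Diamond CFG A->{B,C}->D executed 5 times; only A->B and the virtual
# D->A edge carry counters, the other three counts are derived.
counts = solve_edge_counts(
    [("D", "A"), ("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")],
    {("D", "A"): 5, ("A", "B"): 2})
print(counts[("A", "C")], counts[("B", "D")], counts[("C", "D")])  # -> 3 2 3
```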
Modeling Execution Frequencies
• How do the edge weights change with larger input parameters?
• Make multiple runs, varying only one parameter at a time
• Interpolate a function of each parameter for each edge (edge weight with respect to N, with respect to Z, ...), then combine them, e.g. as a linear combination
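For one edge and one parameter, the per-parameter runs yield a handful of (parameter value, edge count) points to fit. A minimal least-squares sketch, assuming a linear model (real edge counts may need polynomial or other bases):

```python
def fit_linear(xs, ys):
    """Least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return my - b * mx, b

# Edge counts measured at problem sizes N = 10, 20, 40
a, b = fit_linear([10, 20, 40], [21, 41, 81])
print(round(a + b * 100))  # extrapolated count at N = 100 -> 201
```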
Dynamic Analysis: Memory Access
Calculate reuse-distance histograms
• Reuse distance: how many distinct memory addresses have been accessed since this address was last accessed?
• Reuse-distance distribution, e.g. 25% of memory references have reuse distance 2, 12% have reuse distance 3, etc.
• A tree structure holds each memory reference, with the time step of its last access as the sort key
• Slowdown? No sampling; total time is O(M log N), where M = number of memory references and N = number of distinct addresses
• Memory requirement?
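A sketch of measuring reuse distances from an address trace. The paper's tree keyed by last-access time gives O(M log N); the sorted list used here is simpler to read but its deletions are linear, so it is illustrative only.

```python
import bisect

def reuse_distances(trace):
    """Return one entry per reference: the number of distinct
    addresses touched since the previous access to the same
    address, or None for a first (cold) access."""
    last_time = {}   # address -> timestamp of its most recent access
    times = []       # sorted last-access timestamps, one per distinct address
    dists = []
    for t, addr in enumerate(trace):
        if addr in last_time:
            t0 = last_time[addr]
            i = bisect.bisect_left(times, t0)
            dists.append(len(times) - i - 1)  # addresses touched after t0
            times.pop(i)
        else:
            dists.append(None)  # cold access: infinite reuse distance
        bisect.insort(times, t)
        last_time[addr] = t
    return dists

print(reuse_distances(["a", "b", "c", "a"]))  # -> [None, None, None, 2]
```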
Modeling Memory Access
• Need to model the full histograms, not just the average distance
• Use the same one-parameter-at-a-time approach as before
Modeling Memory Access (cont.)
• How many bins to use?
• Different problem sizes produce different bins, so we need to normalize
• Accuracy and complexity both increase with bin count
• For each bin, model:
  • number of accesses vs. problem size
  • average reuse distance vs. problem size
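One way to normalize across problem sizes (my own sketch, not necessarily the paper's scheme): rebin each run's histogram into a fixed number of bins holding roughly equal fractions of the accesses, then fit each bin's access count and average distance against problem size as above.

```python
def normalize_histogram(hist, nbins):
    """hist: list of (reuse_distance, access_count) pairs.
    Collapse into at most `nbins` bins of roughly equal access counts,
    returning each bin's (access_count, average_reuse_distance)."""
    total = sum(c for _, c in hist)
    target = total / nbins
    bins, acc, wsum = [], 0, 0.0
    for d, c in sorted(hist):
        acc += c
        wsum += d * c
        if acc >= target and len(bins) < nbins - 1:
            bins.append((acc, wsum / acc))
            acc, wsum = 0, 0.0
    if acc:
        bins.append((acc, wsum / acc))
    return bins

# 100 accesses collapsed to 2 bins of 50 accesses each
print(normalize_histogram([(1, 50), (2, 25), (3, 25)], 2))
# -> [(50, 1.0), (50, 2.5)]
```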
Map Model to Target Architecture
• Combine the CFGs and path-frequency information gathered so far
• Translate the code to generic RISC instructions
• A generic scheduler, initialized with a machine description, predicts the runtime of this generic RISC code (difficult)
• For now, assume all memory references hit in cache
Adding the Memory Model
• Use reuse distances to estimate cache misses (a reference misses when its distance is at least the number of cache blocks) and add the miss penalties
• The scheme assumes a fully associative cache, so it does not account for conflict misses
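The cache model then reduces to thresholding the histogram: in a fully associative LRU cache with B blocks, a reference hits iff its reuse distance (in blocks) is below B. A sketch, with an illustrative miss penalty:

```python
def predict_miss_cycles(hist, cache_blocks, miss_penalty):
    """hist: (reuse_distance, count) pairs; None marks cold accesses.
    A reference misses iff its distance >= the number of cache blocks
    (fully associative LRU, so no conflict misses are modeled)."""
    misses = sum(c for d, c in hist
                 if d is None or d >= cache_blocks)
    return misses, misses * miss_penalty

# 10 cold misses, 50 short reuses that hit, 40 long reuses that miss,
# in a 64-block cache with a (hypothetical) 100-cycle miss penalty
print(predict_miss_cycles([(None, 10), (2, 50), (100, 40)], 64, 100))
# -> (50, 5000)
```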
The Gist, Again
• Performance modeling is important (hypothetical problem sizes on hypothetical machines)
• Build a model that will:
  • Create an execution graph
  • Model edge frequencies
  • Model node weights
• Remaining difficulties:
  • How to understand the effect of parameters
  • Difficult to predict dependencies and instruction scheduling
  • Different compilers can cause performance variations
  • The cost of gathering reuse distances is very high; is it feasible for big applications?
  • Conflict misses are not modeled
  • The penalty of cache misses is not clear