
Knowledge Support for Mining Parallel Performance Data


Presentation Transcript


  1. Knowledge Support for Mining Parallel Performance Data Allen D. Malony, Kevin Huck {malony,khuck}@cs.uoregon.edu http://www.cs.uoregon.edu/research/tau Department of Computer and Information Science Performance Research Laboratory University of Oregon

  2. Outline • Why mine parallel performance data? • Our first attempt • PerfDMF • PerfExplorer • How did we do? Why knowledge-driven data mining? • PerfExplorer v2 • Analysis process automation • Metadata encoding and incorporation • Inference engine • Object persistence and provenance • Analysis examples

  3. Motivation for Performance Data Mining • High-end parallel applications and systems evolution • More sophisticated, integrated, heterogeneous operation • Higher levels of abstraction • Larger scales of execution • Evolution trends change performance landscape • Parallel performance data becomes more complex • Multivariate, higher dimensionality, heterogeneous • Greater scale and larger data size • Standard analysis techniques overwhelmed • Need data management and analysis automation • Provide foundation for performance analytics

  4. Performance Data Mining Objectives • Conduct parallel performance analysis in a systematic, collaborative and reusable manner • Manage performance data and complexity • Discover performance relationships and properties • Automate performance investigation process • Multi-experiment performance analysis • Large-scale performance data reduction • Summarize characteristics of large processor runs • Implement extensible analysis framework • Abstraction / automation of data mining operations • Interface to existing analysis and data mining tools

  5. Performance Data Management (PerfDMF) • Figure: PerfDMF architecture, importing profiles from multiple formats (gprof, Cube, mpiP, O|SS, psrun, HPMToolkit, …) • K. Huck, A. Malony, R. Bell, A. Morris, "Design and Implementation of a Parallel Performance Data Management Framework," ICPP 2005.

  6. Analysis Framework (PerfExplorer) • Leverage existing TAU infrastructure • Focus on parallel profiles • Build on PerfDMF • Support large-scale performance analysis • Multiple experiments • Parametric studies • Apply data mining operations • Comparative, clustering, correlation, dimension reduction, … • Interface to existing tools (Weka, R) • Abstraction/automation

  7. Performance Data Mining (PerfExplorer) K. Huck and A. Malony, “PerfExplorer: A Performance Data Mining Framework For Large-Scale Parallel Computing,” SC 2005.

  8. Relative Comparisons • Total execution time • Timesteps per second • Relative efficiency • Relative efficiency per event • Relative speedup • Relative speedup per event • Group fraction of total • Runtime breakdown • Correlate events with total runtime • Relative efficiency per phase • Relative speedup per phase • Distribution visualizations Data: GYRO on various architectures
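The slide lists relative speedup and efficiency without giving formulas. For reference, the standard definitions, assuming the smallest run (p_0 processors, runtime T(p_0)) serves as the baseline (an assumption; the deck does not state which run is the baseline), are:

S_{rel}(p) = \frac{T(p_0)}{T(p)}, \qquad E_{rel}(p) = \frac{p_0 \, T(p_0)}{p \, T(p)}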

  9. Cluster Analysis • Figure: clustering results drawn from PerfDMF databases, showing cluster counts, a PCA scatterplot, min/avg/max breakdowns per cluster, and cluster topology • Data: sPPM on Frost (LLNL), 256 threads

  10. Correlation Analysis • Strong negative linear correlation between CALC_CUT_BLOCK_CONTRIBUTIONS and MPI_Barrier • Data: FLASH on BG/L (LLNL), 64 nodes

  11. 4-D Visualization • 4 "significant" events are selected • Clusters and correlations are visible • Data: FLASH on BG/L (LLNL), 1024 nodes

  12. PerfExplorer Critique (Describe vs. Explain) • Specific parametric study support (not general) • No way to capture the analysis processes • No analysis history - how were these results generated? • PerfExplorer just redescribed the performance results • PerfExplorer should explain performance phenomena • What are the causes for performance observed? • What are the factors and how do they interrelate? • Performance analytics, forensics, and decision support • Automated analysis needs good informed feedback • Iterative tuning, performance regression testing • Performance model generation requires interpretation

  13. How to explain behavior? Add Knowledge! • Offline parallel performance tools should not have to treat the application and system as a “black box” • Need to add knowledge to do more intelligent things • Where does it come from? • Experiment context • Application-specific information • System-specific performance • General performance expertise • We need better methods and tools for • Integrating meta-information • Knowledge-based performance problem solving

  14. Metadata and Knowledge Role in Analysis • Figure: context knowledge (source code, build environment, run environment, execution, application, machine) and performance knowledge (known performance problems) must be captured to understand the performance result

  15. Example: Sweep3D Domain Decomposition • Wavefront evaluation with a recursion dependence in all 3 grid directions • Edge cells: 3 neighbors • Corner cells: 2 neighbors • Center cells: 4 neighbors • Communication is affected • Figure: 4x4 example, a grid of MPI ranks 0-15 with corner, edge, and center cells labeled by neighbor count • Data: Sweep3D on Linux Cluster, 16 processes
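The neighbor counts in the figure follow directly from the 2-D process grid; a minimal sketch (not part of the original deck) that reproduces them for the 4x4 example:

def neighbor_count(rank, px=4, py=4):
    # count how many of the N/S/E/W neighbors actually exist on a px-by-py grid
    x, y = rank % px, rank // px
    return sum([x > 0, x < px - 1, y > 0, y < py - 1])

for rank in range(16):
    print(rank, neighbor_count(rank))
# corners -> 2 neighbors, edges -> 3, center cells -> 4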

  16. PerfExplorer v2 – Requirements and Features • Component-based analysis process • Analysis operations implemented as modules • Linked together in analysis process and workflow • Scripting • Provides process/workflow development and automation • Metadata input, management, and access • Inference engine • Reasoning about causes of performance phenomena • Analysis knowledge captured in expert rules • Persistence of intermediate results • Provenance • Provides historical record of analysis results

  17. PerfExplorer v2 – Design • Figure: redesigned component architecture, with the new components highlighted

  18. Component Interaction

  19. Analysis Components for Scripting • Analysis components implement data mining operations • Support easy-to-use interfaces (Java)

  20. Embedded Scripting • Jython (a.k.a. JPython) scripting provides API access to Java analysis components • Makes new analyses and processes easier to program • Allows for repeatable analysis processing • Provides for automation • Multiple datasets • Tuning iteration • Supports workflow creation • Could use other scripting languages (JRuby, Jacl, …)

print "--------------- JPython test script start ------------"
# create a rulebase for processing
ruleHarness = RuleHarness.useGlobalRules("rules/GeneralRules.drl")
ruleHarness.addRules("rules/ApplicationRules.drl")
ruleHarness.addRules("rules/MachineRules.drl")
# load the trial and get the metadata
Utilities.setSession("apart")
trial = Utilities.getTrial("sweep3d", "jaguar", "16")
trialResult = TrialResult(trial)
trialMetadata = TrialThreadMetadata(trial)
# extract the top 5 events
getTop5 = TopXEvents(trial, trial.getTimeMetric(), AbstractResult.EXCLUSIVE, 5)
top5 = getTop5.processData().get(0)
# correlate the event data with metadata
correlator = CorrelateEventsWithMetadata(top5, trialMetadata)
output = correlator.processData().get(0)
RuleHarness.getInstance().assertObject(output);
# process rules and output result
RuleHarness.getInstance().processRules()
print "---------------- JPython test script end -------------"

  21. Loading Metadata into PerfDMF • Three ways to incorporate metadata • Measured hardware/system information (TAU, PERI-DB) • CPU speed, memory in GB, MPI node IDs, … • Application instrumentation (application-specific) • TAU_METADATA() used to insert any name/value pair • Application parameters, input data, domain decomposition • PerfDMF data management tools can incorporate an XML file of additional metadata • Compiler flags, submission scripts, input files, … • Metadata can be imported from / exported to PERI-DB • PERI SciDAC project (UTK, NERSC, UO, PSU, TAMU) • Performance data and metadata integration
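A sketch of what such an external metadata file could contain, written as flat name/value pairs like the ones the slides describe. The exact XML schema the PerfDMF loader expects is not shown in the deck, so this layout is only an illustration of the idea; the attribute names reuse examples that appear elsewhere in the deck.

import xml.etree.ElementTree as ET

# Hypothetical flat name/value layout for a metadata.xml file.
# "input:micell" appears as a GTC input parameter later in the deck.
metadata = {
    "input:micell": "100",
    "compiler flags": "-O3",
}

root = ET.Element("metadata")
for name, value in metadata.items():
    attr = ET.SubElement(root, "attribute")
    ET.SubElement(attr, "name").text = name
    ET.SubElement(attr, "value").text = value
ET.ElementTree(root).write("metadata.xml")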

  22. Metadata into PerfDMF • Figure: performance measurements (profile data) enter PerfDMF along with (1) auto-collected metadata (build, runtime, and submission metadata), (2) user-specified metadata, and (3) other metadata supplied through a metadata.xml file

  23. Metadata in PerfExplorer • GTC on 1024 processors of Jaguar (Cray XT3/4)

  24. Inference Engine • Metadata and analysis results are asserted as facts • Examples: number of processors, an input parameter, a derived metric, a speedup measurement • Analysis rules with encoded “expert knowledge” of performance process the assertions • Example: “When processor count increases by a factor of X, runtime should reduce by a factor of Y” (expectation) • Example: “When an event’s cache hit ratio is less than the overall ratio, alert the user to the event” (criticality) • Processed rules can assert new facts, which can fire new rules - provides a declarative programming environment • JBoss Rules rules engine for Java • Implements efficient Rete algorithm
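The Drools syntax for such rules appears two slides later; as a plain-Python paraphrase of the "expectation" example above (a sketch only, not the actual rule engine), the check amounts to:

# Plain-Python paraphrase of the scaling-expectation rule described above.
# The real rules are written in JBoss Rules (Drools); the helper name and
# the numbers in the call below are illustrative.
def check_scaling_expectation(base_procs, base_time, new_procs, new_time):
    expected_ratio = float(base_procs) / new_procs   # ideal strong scaling
    actual_ratio = float(new_time) / base_time
    if actual_ratio > expected_ratio:
        print("runtime did not drop as expected: expected %.2f, observed %.2f"
              % (expected_ratio, actual_ratio))

check_scaling_expectation(64, 100.0, 128, 60.0)   # expected 0.50, observed 0.60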

  25. Metadata and Inference Rules

  26. Inference Rules with Application Metadata • JBoss Rules has Java-like syntax • Conditions are tested and variables bound; when the conditions are met the consequence code executes (with full access to public Java objects), and here the matched fact is de-asserted (removed)

rule "Differences in Particles Per Cell"
when
    // there exists a difference operation between metadata collections
    d : DifferenceMetadataOperation ()
    f : FactWrapper ( factName == "input:micell", factType == DifferenceMetadataOperation.NAME )
then
    String[] values = (String[])d.getDifferences().get("input:micell");
    System.out.println("Differences in particles per cell... " + values[0] + ", " + values[1]);
    double tmp = Double.parseDouble(values[0]) / Double.parseDouble(values[1]);
    // an increase in particles per cell means an increase in time
    d.setExpectedRatio(d.getExpectedRatio() / tmp);
    System.out.println("New Expected Ratio: " + d.getExpectedRatio());
    d.removeFact("input:micell");
end

  27. Data Persistence and Provenance • Analysis results should include where they came from • Data persistence captures intermediate and final analysis results and saves them for later access • Persistence allows analysis results to be reused • Some analysis operations can take a long time • Breadth-wise inference analysis and cross-workflow analysis • Storing all operations in the analysis workflow generates full provenance of the intermediate and final results • Supports confirmation and validation of analysis results • The inference engine may need persistence as well • Persistence/provenance used to create a "chain of evidence"

  28. Example: GTC (Gyrokinetic Toroidal Code) • Particle-in-cell simulation • Fortran 90, MPI • Main events: • PUSHI - update ion locations • CHARGEI - calculate ion gather-scatter coefficients • SHIFTI - redistribute ions across processors • Executed on ORNL Jaguar • Cray XT3/XT4, 512 processors • Problem: • ions are accessed regularly • grid cells have poor cache reuse • Scripted analysis, inference rules Figure: The (turbulent) electrostatic potential from a GYRO simulation of plasma microturbulence in the DIII-D tokamak. Source: http://www.scidac.gov/FES/FES_PMP/reports/PMP2004Annual.html Data: GTC on XT3/XT4 with 512 processes

  29. Example: Workflow • Load data → extract non-callpath data → extract top 10 events → merge events → extract main event → derive metrics → compare events to main → process inference rules
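The deck does not show the script behind this workflow; below is a sketch in the slide-20 Jython style. Utilities, TrialResult, TopXEvents, AbstractResult, and RuleHarness appear in the deck; the other operation names (and their signatures) are hypothetical placeholders for the steps the slide lists, and the script assumes it runs inside PerfExplorer's scripting environment, which provides these classes.

# Sketch only: GTC single-trial workflow in the slide-20 Jython style.
# ExtractNonCallpathData, MergeEvents, DeriveMetrics and CompareEventsToMain
# are hypothetical names for steps the slide names but does not show as code.
RuleHarness.useGlobalRules("rules/GeneralRules.drl")
Utilities.setSession("apart")
trial = Utilities.getTrial("gtc", "jaguar", "512")
result = TrialResult(trial)

flat = ExtractNonCallpathData(result).processData().get(0)        # hypothetical
top10 = TopXEvents(flat, trial.getTimeMetric(),
                   AbstractResult.EXCLUSIVE, 10).processData().get(0)
merged = MergeEvents(top10).processData().get(0)                  # hypothetical
derived = DeriveMetrics(merged).processData().get(0)              # hypothetical
compared = CompareEventsToMain(derived).processData().get(0)      # hypothetical

RuleHarness.getInstance().assertObject(compared)
RuleHarness.getInstance().processRules()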

  30. Example: Output

doing single trial analysis for gtc on jaguar
Loading Rules...
Reading rules: rules/GeneralRules.drl... done.
Reading rules: rules/ApplicationRules.drl... done.
Reading rules: rules/MachineRules.drl... done.
loading the data...
Getting top 10 events (sorted by exclusive time)...
Firing rules...
The event SHIFTI [{shifti.F90} {1,12}] has a lower than average L2 hit rate. Try improving cache reuse for improved performance.
Average L2 hit rate: 0.9300473381408862, Event L2 hit rate: 0.7079725336828661
Percentage of total runtime: 06.10%
The event SHIFTI [{shifti.F90} {1,12}] has a lower than average L1 hit rate. Try improving cache reuse for improved performance.
Average L1 hit rate: 0.9927455234580467, Event L1 hit rate: 0.9792368593502352
Percentage of total runtime: 06.10%
The event PUSHI [{pushi.f90} {1,12}] has a higher than average FLOP rate. This appears to be computationally dense region. If this is not a main computation loop, try performing fewer calculations in this event for improved performance.
Average MFLOP/second: 1101.4060017516147, Event MFLOP/second: 1615.1927713500654
Percentage of total runtime: 50.24%
The event CHARGEI [{chargei.F90} {1,12}] has a lower than average L1 hit rate. Try improving cache reuse for improved performance.
Average L1 hit rate: 0.9927455234580467, Event L1 hit rate: 0.9923999967335222
Percentage of total runtime: 37.70%
...done with rules.

• Identified poor cache reuse (SHIFTI) • Identified the main computation (PUSHI)

  31. Example: Sweep3D • ASCI benchmark code • 256 processors, with an 800x800x1000 problem • Problem: the number of logical neighbors in the decomposition determines communication performance • Problem: XT3/XT4 hardware imbalances • Scripted analysis, metadata, inference rules • Data: Sweep3D on XT3/XT4 with 256 processes

  32. Example: Workflow • Load data → extract non-callpath data → extract top 5 events → load metadata → correlate events with metadata → process inference rules

  33. Example: Output

--------------- JPython test script start ------------
doing single trial analysis for sweep3d on jaguar
Loading Rules...
Reading rules: rules/GeneralRules.drl... done.
Reading rules: rules/ApplicationRules.drl... done.
Reading rules: rules/MachineRules.drl... done.
loading the data...
Getting top 10 events (sorted by exclusive time)...
Firing rules...
MPI_Recv(): "CALLS" metric is correlated with the metadata field "total neighbors". The correlation is 1.0 (direct).
MPI_Send(): "CALLS" metric is correlated with the metadata field "total neighbors". The correlation is 1.0 (direct).
MPI_Send(): "P_WALL_CLOCK_TIME:EXCLUSIVE" metric is correlated with the metadata field "total neighbors". The correlation is 0.8596 (moderate).
SOURCE [{source.f} {2,18}]: "PAPI_FP_INS:EXCLUSIVE" metric is inversely correlated with the metadata field "Memory Speed (MB/s)". The correlation is -0.9792 (very high).
SOURCE [{source.f} {2,18}]: "PAPI_FP_INS:EXCLUSIVE" metric is inversely correlated with the metadata field "Seastar Speed (MB/s)". The correlation is -0.9785258663321764 (very high).
SOURCE [{source.f} {2,18}]: "PAPI_L1_TCA:EXCLUSIVE" metric is inversely correlated with the metadata field "Memory Speed (MB/s)". The correlation is -0.9818810020169854 (very high).
SOURCE [{source.f} {2,18}]: "PAPI_L1_TCA:EXCLUSIVE" metric is inversely correlated with the metadata field "Seastar Speed (MB/s)". The correlation is -0.9810373923601381 (very high).
SOURCE [{source.f} {2,18}]: "PAPI_L2_TCM:EXCLUSIVE" metric is inversely correlated with the metadata field "Memory Speed (MB/s)". The correlation is 0.9985297567878844 (very high).
SOURCE [{source.f} {2,18}]: "PAPI_L2_TCM:EXCLUSIVE" metric is inversely correlated with the metadata field "Seastar Speed (MB/s)". The correlation is 0.996415213842904 (very high).
SOURCE [{source.f} {2,18}]: "P_WALL_CLOCK_TIME:EXCLUSIVE" metric is inversely correlated with the metadata field "Memory Speed (MB/s)". The correlation is -0.9980107779462387 (very high).
SOURCE [{source.f} {2,18}]: "P_WALL_CLOCK_TIME:EXCLUSIVE" metric is inversely correlated with the metadata field "Seastar Speed (MB/s)". The correlation is -0.9959749668655212 (very high).
...done with rules.
---------------- JPython test script end -------------

• Correlated communication behavior with metadata • Identified hardware differences

  34. Example: GTC Scalability • Weak scaling example • Comparing 64 and 128 processes • Superlinear speedup observed • Can PerfExplorer detect and explain why? • Uses scripted analysis, metadata, inference rules Data: GTC on XT3/XT4 with 64 through 512 processes

  35. Example: Workflow • For each of the two trials: load data → extract non-callpath data → extract top 5 events, and load its metadata • Compare metadata between the trials → do scalability comparison → process inference rules

  36. Example: Output (64-process and 128-process runs)

--------------- JPython test script start ------------
doing single trial analysis for gtc on jaguar
<< OUTPUT DELETED >>
Firing rules...
Differences in processes... 64, 128
New Expected Ratio: 0.5
Differences in particles per cell... 100, 200
New Expected Ratio: 1.0
The comparison trial has superlinear speedup, relative to the baseline trial
Expected ratio: 1.0, Actual ratio: 1.0520054892787758
Ratio = baseline / comparison
Event / metric combinations which may contribute:
CHARGEI [{chargei.F90} {1,12}] PAPI_L1_TCM 1.1399033611634473
CHARGEI [{chargei.F90} {1,12}] PAPI_L2_TCM 1.3397803892450724
MPI_Allreduce() PAPI_L1_TCA 1.5067129486366324
MPI_Allreduce() PAPI_TOT_INS 1.9426312371742902
MPI_Sendrecv() PAPI_L2_TCM 1.0639218768220913
MPI_Sendrecv() PAPI_TOT_INS 1.054645735019196
PUSHI [{pushi.f90} {1,12}] PAPI_L1_TCM 1.0879723996543027
PUSHI [{pushi.f90} {1,12}] PAPI_L2_TCM 1.5442028731182402
---------------- JPython test script end -------------

• Identified weak scaling • Identified superlinear speedup • Identified possible reasons
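The expected-ratio arithmetic in this output follows the rule shown on slide 26: each metadata difference rescales the expected baseline/comparison runtime ratio, and the observed ratio is then compared against it. A small sketch of the same calculation, with the numbers taken from the output above:

# Reproduces the expected-ratio arithmetic from the output above
# (see the "Differences in Particles Per Cell" rule on slide 26).
expected = 1.0
expected *= 64.0 / 128.0     # processes doubled -> expect the ratio to halve (0.5)
expected /= 100.0 / 200.0    # particles per cell doubled -> more work, back to 1.0
actual = 1.0520054892787758  # baseline runtime / comparison runtime, from the output
print(expected, actual, actual > expected)   # 1.0 1.052... True -> superlinear speedup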

  37. Current Status • Working now • Several analysis operations are written • Metadata collection is available • Scripting and inference engine are in place • To be developed • Results persistence • Provenance capture and storage

  38. Conclusion • Mining performance data will depend on adding knowledge to analysis frameworks • Application, hardware, environment metadata • Analysis processes and workflow • Rules for inferencing and analysis search • Expert knowledge combined with performance results can explain performance phenomena • Redesigned PerfExplorer framework is one approach • Community performance knowledge engineering • Developing inference rules • Constructing analysis processes • Application-specific metadata and analysis

  39. Acknowledgments • US Department of Energy (DOE) • Office of Science • MICS, Argonne National Lab • NSF • Software and Tools for High-End Computing • PERI SciDAC • PERI-DB project • TAU and PerfExplorer demos: NNSA / ASC, booth #1617, various times daily • PERI-DB demo: RENCI booth #3215, Wednesday at 2:30 PM
