Automatic Performance Diagnosis of Parallel Computations with Compositional Models

Li Li, Allen D. Malony ({lili, malony}@cs.uoregon.edu), Performance Research Laboratory, Dept. of Computer and Information Science, University of Oregon

Parallel Performance Diagnosis • Performance tuning process • Process to find and fix performance problems • Performance diagnosis: detect and explain problems • Performance optimization: repair found problems • Diagnosis is critical to efficiency of performance tuning • Focus on the performance diagnosis • Capture diagnosis processes • Integrate with performance experimentation and evaluation • Formalize the (expert) performance cause inference • Support diagnosis in an automated manner

Generic Performance Diagnosis Process • Design and run performance experiments • Observe performance under a specific circumstance • Generate desirable performance evaluation data • Find symptoms • Observation deviating from performance expectation • Detect by evaluating performance metrics • Infer causes of symptoms • Relate symptoms to program • Interpret symptoms at different levels of abstraction • Iterate the process to refine performance bug search • Refine performance hypothesis based on symptoms found • Generate more data to validate the hypothesis
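The iterative process on this slide can be sketched as a small loop. This is a minimal illustration, not the authors' implementation: the `Hypothesis` type, the `diagnose` function, the 0.3 threshold, and all metric values are hypothetical placeholders chosen only to show the experiment, symptom, refinement cycle.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Hypothesis:
    """A candidate explanation tested by one experiment (hypothetical type)."""
    name: str
    # Runs an experiment and returns metric name -> fraction of the measured time.
    experiment: Callable[[], Dict[str, float]]
    # Maps a symptom (metric name) to a narrower hypothesis, if one is known.
    refinements: Dict[str, "Hypothesis"]


def diagnose(root: Hypothesis, threshold: float = 0.3) -> List[str]:
    """Iterate experiment -> symptom -> refinement until nothing is left to refine."""
    findings: List[str] = []
    current: Optional[Hypothesis] = root
    while current is not None:
        metrics = current.experiment()               # design and run an experiment
        symptom = max(metrics, key=metrics.get)      # most expensive metric
        if metrics[symptom] < threshold:             # nothing deviates enough
            break
        findings.append(f"{current.name}: {symptom}")
        current = current.refinements.get(symptom)   # refine the hypothesis
    return findings


# Illustrative values only (assumed, not measured here).
guardcell = Hypothesis("guardcell", lambda: {"restrict": 0.53, "transfer": 0.39}, {})
amr = Hypothesis("amr", lambda: {"guardcell": 0.49, "derefine_check": 0.14},
                 {"guardcell": guardcell})
top = Hypothesis("program", lambda: {"communication": 0.81, "computation": 0.19},
                 {"communication": amr})

print(diagnose(top))
```

Each pass narrows the search: a symptom found at one level selects the experiment to run at the next.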

Knowledge-Based Automatic Performance Diagnosis • Experts analyze systematically and use experience • Implicitly use knowledge of code structure and parallelism • Let that knowledge guide the diagnostic analysis • Knowledge-based approach • Capture knowledge about performance problems • Capture knowledge about how to detect and explain them • Apply the knowledge to performance diagnosis • Performance knowledge • Experiment design and specifications • Performance models • Performance metrics and evaluation rules • High-level performance factors/design parameters (causes)

Implications • Where does the knowledge come from? • Extract from parallel computational models • Structural and operational characteristics • Reusable parallel design patterns • Associate computational models with performance models • Well-defined computation and communication patterns • Model examples • Single models: Master-worker, Pipeline, AMR, ... • Compositional models • Use model knowledge to diagnose performance problems • Engineer model knowledge • Integrate model knowledge with cause inference

[Figure: Model-based Generic Knowledge Generation and Algorithm-specific Knowledge Extension. Four stages, each with a generic (model-based) part that is extended and instantiated for a specific algorithm: Behavioral Modeling (abstract events, algorithm-specific events); Performance Modeling (performance composition and coupling descriptions, refined by algorithmic performance modeling); Metrics Definition (model-based metrics, algorithm-specific metrics); Inference Modeling (metric-driven performance bug search and cause inference, performance factor library extended with algorithm-specific factors).]

Hercule Automatic Performance Diagnosis System • [Figure: Hercule system architecture. Inputs: parallel program, computational models (model knowledge), algorithm-specific info. Components: measurement system, performance data, event recognizer, metric evaluator, inference engine, knowledge base, experiment specifications, inference rules. Outputs: diagnosis results (problems and explanations).] • Goals of automation, adaptability, extension, and reuse

Single Model Knowledge Engineered • Master-worker • Divide-and-conquer • Wavefront (2D pipeline) • Adaptive Mesh Refinement • Parallel Recursive Tree • Geometric Decomposition • Related publications • L. Li and A. D. Malony, "Model-based Performance Diagnosis of Master-worker Parallel Computations", in the proceedings of Euro-Par 2006. • L. Li, A. D. Malony and K. Huck, "Model-Based Relative Performance Diagnosis of Wavefront Parallel Computations", in the proceedings of HPCC 2006. • L. Li and A. D. Malony, "Knowledge Engineering for Automatic Parallel Performance Diagnosis", to appear in Concurrency and Computation: Practice and Experience.

Characteristics of Model Composition • Compositional model • Combine two or more models • Interaction changes individual model behaviors • Composition pattern affects performance • Model abstraction for describing composition • Computational component set {C1, C2, ..., Ck} • Relative control order F(C1, C2, ..., Ck) • Integrate component sets in a compositional model • Composition forms • Model nesting • Model restructuring • Different implications for performance knowledge engineering

Model Nesting • Formal representation • Two models: root F(C1, C2, ..., Ck) and child G(D1, D2, ..., Dl) • F(C1, C2, ..., Ck) + G(D1, D2, ..., Dl) → F(C1{G(D1, D2, ..., Dl)}, C2{G(D1, D2, ..., Dl)}, ..., Ck{G(D1, D2, ..., Dl)}) • where Ci{G(D1, D2, ..., Dl)} means Ci implements the G model.
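The nesting rule on this slide can be sketched with two small data types. This is an illustrative reading of the notation, not the authors' code: `Model`, `Component`, `nest`, and `show` are hypothetical names, and every root component is assumed to implement the same child model, as the formula states.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Model:
    """A computational model: a name plus its ordered components."""
    name: str
    components: List["Component"] = field(default_factory=list)


@dataclass
class Component:
    """One component Ci; `implements` holds the nested child model, if any."""
    name: str
    implements: Optional[Model] = None


def nest(root: Model, child: Model) -> Model:
    """Build F(C1{G}, ..., Ck{G}): each root component implements the child."""
    return Model(root.name, [Component(c.name, child) for c in root.components])


def show(model: Model) -> str:
    """Render a model in the slide's notation, e.g. F(C1{G(D1, D2)}, ...)."""
    parts = []
    for c in model.components:
        if c.implements is not None:
            parts.append(f"{c.name}{{{show(c.implements)}}}")
        else:
            parts.append(c.name)
    return f"{model.name}({', '.join(parts)})"


F = Model("F", [Component("C1"), Component("C2")])
G = Model("G", [Component("D1"), Component("D2")])
print(show(nest(F, G)))
```

The root model keeps its own component order, and the child's structure appears inside each component, which is what lets analysis proceed from root to child.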

Model Nesting (contd.) • Examples • Iterative, multi-phase applications • FLASH, developed by the DOE-supported ASC/Alliances Center for Astrophysical Thermonuclear Flashes • Implications for performance diagnosis • Hierarchical model structure dictates analysis order • Refine problem discovery from root to child • Preserve performance features of individual models

Model Restructuring • Formal representation • Two models: F(C1, C2, ..., Ck) and G(D1, D2, ..., Dl) • F(C1, C2, ..., Ck) + G(D1, D2, ..., Dl) → H(({C1^F, ..., Ck^F}|{D1^G, ..., Dl^G})^+) • where {C1^F, ..., Ck^F}|{D1^G, ..., Dl^G} selects a component Ci^F or Dj^G while preserving relative component order in F and G. H is the new function ruling all components.
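The order-preservation condition on this slide can be checked mechanically. The sketch below is a simplified assumption, not the authors' formalism: it treats a restructured model H as a single pass in which each component of F and G appears exactly once, and it assumes component names are unique across the two models. The function names are hypothetical.

```python
from typing import List, Sequence


def preserves_order(h: Sequence[str], model: Sequence[str]) -> bool:
    """True if the components of `model` appear in `h` in their original order."""
    projection: List[str] = [c for c in h if c in model]
    return projection == list(model)


def is_restructuring(h: Sequence[str], f: Sequence[str], g: Sequence[str]) -> bool:
    """True if `h` interleaves all components of `f` and `g`, each used once,
    while keeping the relative order inside `f` and inside `g`."""
    return (
        len(h) == len(f) + len(g)
        and set(h) == set(f) | set(g)
        and preserves_order(h, f)
        and preserves_order(h, g)
    )


f = ["C1", "C2", "C3"]
g = ["D1", "D2"]
print(is_restructuring(["C1", "D1", "C2", "D2", "C3"], f, g))  # order kept
print(is_restructuring(["C2", "C1", "D1", "D2", "C3"], f, g))  # C2 before C1
```

Unlike nesting, the result has no root and child: components of both models sit side by side under the new control function H.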

Adapt Performance Knowledge to Composition • Objective: discover and interpret performance effects caused by model interaction • Model nesting • Behavioral modeling • Derive F(C1, C2, ..., Ck) from single model behaviors • Replace affected root component with child model behaviors • Performance modeling and metric formulation • Unite overhead categories according to nesting hierarchy • Evaluate overheads according to the model hierarchy • Inference modeling • Represent inference process with an inference tree • Merge inference steps of participant models • Extend root model inferences with implementing child model inferences
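The inference-modeling step, extending a root model's inference tree with the child model's inferences, can be sketched as attaching a subtree at a chosen node. This is only an illustration: the `Node` type and the `extend` and `paths` helpers are hypothetical, and the parent-child links in the example are assumed for demonstration rather than taken from the actual trees.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    """One step in an inference tree: a symptom, observation, or factor."""
    name: str
    children: List["Node"] = field(default_factory=list)

    def find(self, name: str) -> Optional["Node"]:
        """Depth-first search for the first node with the given name."""
        if self.name == name:
            return self
        for child in self.children:
            hit = child.find(name)
            if hit is not None:
                return hit
        return None


def extend(root: Node, attach_at: str, subtree: Node) -> None:
    """Attach a child-model subtree below a root-model inference step."""
    node = root.find(attach_at)
    if node is None:
        raise KeyError(attach_at)
    node.children.append(subtree)


def paths(node: Node, prefix: Tuple[str, ...] = ()) -> List[str]:
    """List every root-to-leaf inference path as a readable string."""
    here = prefix + (node.name,)
    if not node.children:
        return [" -> ".join(here)]
    out: List[str] = []
    for child in node.children:
        out.extend(paths(child, here))
    return out


# Assumed, simplified fragments of a root (AMR) and child (PRT) tree.
amr = Node("low speedup", [Node("communication", [Node("AMR_Guardcell", [Node("data_fetch")])])])
prt = Node("sibling_comm", [Node("data_transfer", [Node("tree_node_contiguity")])])

extend(amr, "data_fetch", prt)
print(paths(amr))
```

After the merge, a search that reaches a root-model step continues into the child model's inferences without losing either model's own structure.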

Model Nesting Case Study - FLASH • FLASH • Parallel simulations in astrophysical hydrodynamics • Use Adaptive Mesh Refinement (AMR) to manage meshes • Use a Parallel Recursive Tree (PRT) to manage mesh data • Model nesting • Root AMR model • Child PRT model • AMR implements PRT data operations

Single Model Characteristics • AMR operations • AMR_Refinement – refine a mesh grid • AMR_Derefinement – coarsen a mesh grid • AMR_LoadBalancing – even out work load after refinement or derefinement • AMR_Guardcell – update guard cells at the boundary of every grid block with data from the neighbors • AMR_Prolong – prolong the solution to newly created leaf blocks after refinement • AMR_Restrict – restrict the solution up the block tree after derefinement • AMR_MeshRedistribution – mesh redistribution when balancing workload • PRT operations • PRT_comm_to_parent – communicate to parent processor • PRT_comm_to_child – communicate to child processor • PRT_comm_to_sibling – communicate to sibling processor • PRT_build_tree – initialize tree structure, or migrate part of the tree to another processor and rebuild the connection.

[Figure: AMR Inference Tree. Legend: symptoms, intermediate observations, performance factors, inference direction. The root symptom is low speedup, with communication and computation branches. Nodes include AMR_comm., other_comm., AMR_Refine, AMR_Derefine, AMR_Guardcell, AMR_Prolong, AMR_Restrict, AMR_Workbalance, check_refine, check_neighbor, inform_neighbor, data_fetch, parent_prolong, leaf_restrict, calculate_blocks_weight, sort_blocks, migrate_blocks, and rebuild_block_connection. Performance factors include refine_freq., refine_levels, guardcell_size, physical_block_contiguity, data_contiguity_in_cache, block_weight_assign_method, and balance_strategy.]

[Figure: PRT Inference Tree. Legend: symptoms, intermediate observations, performance factors, inference direction. The root symptom is low speedup, with communication and computation branches. Nodes include parent_comm, child_comm, sibling_comm, other_comm, build_tree, rebuild_tree, init._tree, migrate_subtree, rebuild_connection, lookup_parent, lookup_child, lookup_sib., link_parent, link_child, link_sib., and data_transfer. Performance factors include tree_node_contiguity, tree_depth, data_contiguity, migrate_strategy, freq., and fetch_freq. Five subtrees are numbered 1 to 5.]

[Figure: FLASH Inference Tree. The AMR inference tree (root symptom low speedup; communication and computation branches; nodes such as AMR_comm., AMR_Refine, AMR_Guardcell, AMR_Workbalance, check_refine, check_neighbor, inform_neighbor, data_fetch, parent_prolong, leaf_restrict, calculate_blocks_weight, sort_blocks, migrate_blocks, rebuild_block_connection) extended with references to PRT subtrees. Legend: a node A annotated with numbers refines the performance problem search by following the PRT subtrees relevant to A; the numbers identify the corresponding subtrees in the PRT inference tree. Annotations shown: 1,3; 1,2,3; 3; 4; 5.]

Experiment with FLASH v3.0 • Sedov explosion simulation in FLASH3 • Test platform • IBM pSeries 690 SMP cluster with 8 processors • Execution profiles of a problematic run (ParaProf view)

Diagnosis Results Output (Step 1&2)
• Step 1: find performance symptom
Begin diagnosing ...
========================================================
Begin diagnosing AMR program ... ...
Level 1 experiment -- collect performance profiles with respect to computation and communication.
________________________________________________________
do experiment 1... ...
Communication accounts for 80.70% of run time.
Communication cost of the run degrades performance.
========================================================
• Step 2: look at root AMR model performance
========================================================
Level 2 experiment -- collect performance profiles with respect to AMR refine, derefine, guardcell-fill, prolong, and workload-balance.
________________________________________________________
do experiment 2... ...
Processes spent
4.35% of communication time in checking refinement,
2.22% in refinement,
13.83% in checking derefinement (coarsening),
1.43% in derefinement,
49.44% in guardcell filling,
3.44% in prolongating data,
9.43% in dealing with work balancing,
========================================================

Step 3: Interpret Expensive guardcell_filling with PRT Performance
• AMR model performance
========================================================
Level 3 experiment for diagnosing grid guardcell-filling related problems -- collect performance event trace with respect to restriction, intra-level and inter-level commu. associated with the grid block tree.
________________________________________________________
do experiment 3... ...
Among the guardcell-filling communication, 53.01% is spent restricting the solution up the block tree, 8.27% is spent in building tree connections required by guardcell-filling (updating the neighbor list in terms of morton order), and 38.71% in transferring guardcell data among grid blocks.
________________________________________________________
• PRT operation perf. in AMR_Restrict
The restriction communication time consists of 94.77% in transferring physical data among grid blocks, and 5.23% in building tree connections.
Among the restriction communication, 92.26% is spent in collective communications.
Looking at the performance of data transfer in restrictions from the PRT perspective, remote fetch parent data comprises 0.0%, remote fetch sibling comprises 0.0%, and remote fetch child comprises 100%.
Improving block contiguity at the inter-level of the PRT will reduce restriction data communication.
________________________________________________________
• PRT operation perf. in transferring guardcell data
Among the guardcell data transfer, 65.78% is spent in collective communications.
Looking at the performance of guardcell data transfer from the PRT perspective, remote fetch parent data comprises 3.42%, remote fetch sibling comprises 85.93%, and remote fetch child comprises 10.64%.
Improving block contiguity at the intra-level of the PRT will reduce guardcell data communication.
========================================================
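The two recommendations in this output follow a simple pattern: the dominant kind of remote fetch points to the level of the tree where block contiguity should improve. A minimal sketch of that rule follows. It is an assumption drawn only from the two cases shown, not Hercule's actual inference rule, and the function name is hypothetical. A dominant parent fetch is left undecided because the output gives no example of it.

```python
from typing import Dict, Optional


def contiguity_to_improve(fetch_share: Dict[str, float]) -> Optional[str]:
    """Map the dominant remote-fetch kind to the PRT level whose block
    contiguity should be improved. Returns None when no rule is known."""
    dominant = max(fetch_share, key=fetch_share.get)
    if dominant == "child":
        return "inter-level"   # data fetched across tree levels
    if dominant == "sibling":
        return "intra-level"   # data fetched within one tree level
    return None                # parent-dominant case: not covered by the output


# Percentages reported in the diagnosis output above.
restriction = {"parent": 0.0, "sibling": 0.0, "child": 100.0}
guardcell_transfer = {"parent": 3.42, "sibling": 85.93, "child": 10.64}

print(contiguity_to_improve(restriction))
print(contiguity_to_improve(guardcell_transfer))
```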

Conclusion and Future Directions • Model-based performance diagnosis approach • Provide performance feedback at a high level of abstraction • Support automatic problem discovery and interpretation • Enable novice programmers to use established expertise • Compositional model diagnosis • Adapt knowledge engineering approach to model integration • Disentangle cross-model performance effects • Enhance applicability of model-based approach • Future directions • Automate performance knowledge adaptation • Algorithmic knowledge, compositional model knowledge • Incorporate system utilization model • Reveal interplay between programming model and system utilization • Explain performance with the model-system relationship
