

  1. SCALASCA A Parallel Approach for Scalable, Trace-Based Performance Analysis

  2. Outline • Introduction • Parallel Trace-Based Performance Analysis • Early Results • Conclusion • Future Work

  3. Event Tracing • Post-mortem analysis of program behavior • Recording of time-stamped events at runtime • Entering/leaving a function • Sending/receiving a message • Collective operations • Synchronization • Accomplished by source-code instrumentation • High level of detail
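
  As a rough illustration, a time-stamped event record and an instrumentation hook might look like the C sketch below. The type and helper names (Event, trace_enter, the buffer) are invented for illustration and do not reflect KOJAK's actual trace format:

    #include <mpi.h>

    /* Illustrative event record -- field names are assumptions,
       not KOJAK's actual trace format. */
    typedef enum { ENTER, EXIT, SEND, RECV } EventType;

    typedef struct {
        double    timestamp;   /* when the event occurred               */
        EventType type;        /* enter/exit region, send/recv message  */
        int       region_id;   /* function entered/left, or message peer */
    } Event;

    static Event buffer[1 << 20];   /* per-process in-memory event buffer */
    static int   n_events = 0;

    /* Hook that source-code instrumentation inserts at function entry;
       a matching trace_exit() would record EXIT events. */
    void trace_enter(int region_id) {
        Event ev = { MPI_Wtime(), ENTER, region_id };
        if (n_events < (int)(sizeof buffer / sizeof buffer[0]))
            buffer[n_events++] = ev;   /* flushed to the local trace file later */
    }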

  4. Discovery of Wait States [Screenshot: time-line visualization, zoomed in on a wait state] • Time-line visualization • "Serial" human client

  5. Automatic Off-Line Trace Analysis [Diagram: reduction of a low-level event trace to a high-level profile along the problem, call-tree, and system dimensions] • Idea: • Automatic search for patterns of inefficient behavior • Quantification of significance • Data distillation • Guaranteed to cover the entire trace

  6. KOJAK Project • Software package for automatic performance analysis of parallel applications • Message passing and multi-threading (MPI, OpenMP, SHMEM) • Parallel performance • CPU and memory performance • Collaborative research project between • Forschungszentrum Jülich, Germany • University of Tennessee, USA • URLs: http://www.fz-juelich.de/zam/kojak/ http://icl.cs.utk.edu/kojak/

  7. Analysis Process [Workflow diagram] • Instrumentation: source code → automatic multilevel instrumentation → executable • Measurement: execution on parallel machine → local definition & trace files → merge → global definition & trace files • Analysis: sequential analyzer (EXPERT) → global CUBE file

  8. Analysis Report • Which type of problem? • Where in the source code? Which call path? • Which process / thread?

  9. Scalability Problems • Long traces: large problem size, high granularity / event rate, full (rather than partial or disabled) temporal coverage • Wide traces: large number of processes (trace width)

  10. SCALASCA • Follow-up project • Funded by German Helmholtz Association • Specifically addresses scalability • Basic idea: Parallelization of analysis • Current focus: Single-threaded MPI-1 applications • URL http://www.scalasca.org/

  11. Sequential Analysis Process [Same workflow diagram as slide 7] • Instrumentation: source code → automatic multilevel instrumentation → executable • Measurement: execution on parallel machine → local definition & trace files → merge → global definition & trace files • Analysis: sequential analyzer (EXPERT) → global CUBE file

  12. Parallel Analysis Process [Workflow diagram: the sequential merge step and sequential analyzer are replaced] • Instrumentation: source code → automatic multilevel instrumentation → executable • Measurement: execution on parallel machine → local definition & trace files • Analysis: parallel analyzer → global CUBE file

  13. Current Prototype [Workflow diagram] • Instrumentation: source code → automatic multilevel instrumentation → executable • Measurement: execution on parallel machine → local definition files + local trace files • Unification: local definition files → ID mapping tables • Analysis: parallel analyzer reads local trace files and mapping tables → local results → combine → global CUBE file
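
  As an illustration of the unification step, the sketch below translates local definition identifiers in a process-local event stream to unified global ones via a per-process mapping table. The data structures (IdMap, and Event from the earlier sketch) are assumptions, not the prototype's actual formats:

    /* Event is the illustrative record type from the earlier sketch. */
    typedef struct {
        int *local_to_global;  /* mapping table produced by unification */
        int  size;             /* number of local definition ids        */
    } IdMap;

    /* Rewrite local region ids in an event stream to global ids. */
    void apply_mapping(Event *events, int n, const IdMap *map) {
        for (int i = 0; i < n; ++i)
            events[i].region_id = map->local_to_global[events[i].region_id];
    }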

  14. Parallel Pattern Analysis • Analyze separate local trace files in parallel • Exploits distributed memory • Often allows keeping the whole trace in main memory • Parallel replay of the target application's communication behavior • Analyze communication with an operation of the same type • Parallel traversal of event streams • Exchange of data at synchronization points of the target application

  15. Example: Late Sender [Time-line diagram: the receiver's waiting time before a late send] • Sequential approach (EXPERT) • Scan the trace in sequential order • Watch out for the receive event • Use links to access the other constituents • Calculate the waiting time • New parallel approach • Each process identifies its local constituents • Sender sends its local constituents to the receiver • Receiver calculates the waiting time
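
  Translated into code, replaying one message might look like the C/MPI sketch below. The function and variable names are illustrative; the enter timestamps are assumed to come from the matching send/receive events in the local traces:

    #include <mpi.h>

    /* Sketch: Late Sender detection for one replayed message.
       enter_ts is the local enter timestamp of MPI_Send (on the
       sender) or MPI_Recv (on the receiver); peer/tag/comm mirror
       the original message. Names are illustrative. */
    double late_sender_waiting(int is_sender, double enter_ts,
                               int peer, int tag, MPI_Comm comm) {
        if (is_sender) {
            /* re-enact the message, carrying the send's enter timestamp */
            MPI_Send(&enter_ts, 1, MPI_DOUBLE, peer, tag, comm);
            return 0.0;
        } else {
            double enter_send;
            MPI_Recv(&enter_send, 1, MPI_DOUBLE, peer, tag, comm,
                     MPI_STATUS_IGNORE);
            /* the receiver waited if it entered MPI_Recv before the
               sender entered MPI_Send */
            return (enter_send > enter_ts) ? enter_send - enter_ts : 0.0;
        }
    }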

  16. Example: Wait at N×N • Waiting time due to inherent synchronization in N-to-N operations (e.g. MPI_Allreduce) • Parallel analysis: • Identify the local exit event of the collective N-to-N operation • Determine the timestamp of the latest enter event (a single call to MPI_Allreduce is sufficient!) • Calculate the local waiting time
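
  A minimal sketch of this step, assuming enter_ts holds this process's enter timestamp of the collective operation (names are illustrative):

    #include <mpi.h>

    /* Sketch: local waiting time in an N-to-N operation, using a
       single MPI_Allreduce as the slide suggests. */
    double wait_at_nxn(double enter_ts, MPI_Comm comm) {
        double latest_enter;
        /* determine the timestamp of the latest enter event */
        MPI_Allreduce(&enter_ts, &latest_enter, 1, MPI_DOUBLE,
                      MPI_MAX, comm);
        /* time this process spent waiting for the last one to arrive */
        return latest_enter - enter_ts;
    }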

  17. Experimental Evaluation • ASC SMG2000 benchmark • Semi-coarsening multigrid solver • Fixed 64×64×32 problem size per process, 5 solver iterations • Weak scaling behavior • PEPC-B • Parallel tree code for computing long-range forces in N-body problems • Fixed overall problem size • Strong scaling behavior

  18. Test Platform: Jülich BlueGene/L (JuBL) • 8 racks with 8192 dual-core nodes • 288 I/O nodes • p720 service and login nodes (8 × 1.6 GHz Power5 CPUs each)

  19. Results: SMG2000

  20. Results: PEPC-B

  21. Conclusion • Scalability can be addressed by parallelization • Process local trace files in parallel • Replay the target application's communication behavior • Promising results with the prototype implementation

  22. Future Work • Integrate & parallelize remaining sequential parts • Resolve file I/O issues • Address long traces • Selective tracing • Sophisticated data structures (e.g. cCCGs) • Extend to other programming paradigms • OpenMP • MPI-2

  23. Thank you! Questions?
