
Integrated MPI/OpenMP Performance Analysis



  1. Integrated MPI/OpenMP Performance Analysis. KAI Software Lab, Intel Corporation & Pallas GmbH. Bob Kuhn, bob.kuhn@intel.com; Hans-Christian Hoppe, hoppe@pallas.com

  2. Outline • Why integrated MPI/OpenMP programming? • A performance tool for MPI/OpenMP programming (Phase 1) • Integrated performance analysis capability for ASCI Apps (Phase 2)

  3. Why Integrate MPI and OpenMP? • Hardware trends • Simple example – how is it done now? • An FEA example • ASCI examples

  4. Parallel Hardware Keeps Coming – example: recent LLNL ASCI clusters
  • Parallel Capacity Resource (PCR) cluster: three clusters totaling 472 Pentium 4s, the largest with 252; theoretical peak 857 gigaFLOP/s; Linux NetworX via SGI Federal (HPCWire, 8/31/01)
  • Parallel global file system cluster: 48 Pentium 4 processors serving 1,024 clients/servers; delivers I/O rates of over 32 GB/s; fail-over and global lock manager; Linux, open source; Linux NetworX via SGI Federal (HPCWire, 8/31/01)

  5. [Diagram: positioning of OpenMP, MPI, and combined MPI/OpenMP codes along axes of effort and parallelism, together with the supporting performance tools, performance analysis, debuggers, and IDEs]

  6. Cost-Effective Parallelism for the Long Term • A wealth of parallelism experience, from single-person codes to large team efforts

  7. ASCI Ultrascale Tools Project
  • PathForward project
  • RTS – Parallel System Performance
  • Ten goals in three areas:
  • Scalability – work with 10,000+ processors
  • Integration – hardware monitors, object-oriented codes, and the runtime environment
  • Ease of use – dynamic instrumentation; be prescriptive, not just manage data

  8. Architecture for Ultrascale Performance
  • Guide – source instrumentation
  • Vampirtrace – MPI/OpenMP instrumentation
  • Vampir – MPI analysis
  • GuideView – OpenMP analysis
  • Tool flow: application source → Guide (with the Guidetrace library) → object files → linked with the Vampirtrace library → executable → tracefile → Vampir and GuideView

  9. Phase One Goal – Integrated MPI/OpenMP
  • Integrated MPI/OpenMP tracing, in the mode most compatible with ASCI systems
  • Whole-program profiling: integrate the program profile with parallelism
  • Increased scalability of performance analysis, up to 1,000 processors

  10. Vampir – Integrated MPI/OpenMP
  • SWEEP3D run on 4 MPI tasks with 4 OpenMP threads each
  • Threaded activity shown during OpenMP regions
  • The timeline marks OpenMP regions with a glyph

  11. GuideView – Integrated MPI/OpenMP & Profile
  • SWEEP3D run on 4 MPI tasks, each with 4 OpenMP threads
  • All OpenMP regions for a process are summarized into one bar
  • The highlight (red arrow) shows the speedup curve for that set of threads
  • The thread view shows the balance between MPI tasks and threads

  12. GuideView – Integrated MPI/OpenMP & Profile
  • The profile allows comparison of MPI, OpenMP, and application activity, both inclusive and exclusive
  • Sorting and filtering reduce large amounts of information to a manageable level

  13. Guide – the Compiler Workhorse
  • Automatic subroutine entry and exit instrumentation for Fortran and C/C++
  • New compiler options:
  • -WGtrace – link with the Vampirtrace library
  • -WGprof – subroutine entry/exit profiling
  • -WGprof_leafprune – minimum size of procedures to retain in the profile
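As a rough illustration of what automatic entry/exit instrumentation produces conceptually (a hypothetical sketch; trace_enter/trace_exit are placeholder names, not the actual Guide-generated calls):

  #include <stdio.h>

  /* Hypothetical stand-ins for the tracing library's entry/exit calls. */
  static void trace_enter(const char *routine) { printf("enter %s\n", routine); }
  static void trace_exit(const char *routine)  { printf("exit  %s\n", routine); }

  /* What a compiler-instrumented routine conceptually looks like. */
  double sum_array(const double *x, int n) {
      trace_enter("sum_array");      /* inserted at routine entry */
      double sum = 0.0;
      for (int i = 0; i < n; i++)
          sum += x[i];
      trace_exit("sum_array");       /* inserted before each return */
      return sum;
  }

  int main(void) {
      double x[3] = {1.0, 2.0, 3.0};
      printf("sum = %f\n", sum_array(x, 3));
      return 0;
  }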

  14. Vampirtrace – Profiling
  • Support for pruning of short routines
  • [Diagram: call tree with ROUTINE X ENTRY, ROUTINE Y ENTRY/EXIT (> Δt or < Δt), and ROUTINE Z ENTRY]
  • Routines shorter than the threshold Δt are pruned; ROUTINE X is marked as having its call-tree info summarized
  • All events that have not been pruned can then be written to the tracefile; ROUTINE Z may still be < Δt, so it cannot yet be written
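A minimal sketch of the pruning idea (assumed, not the actual Vampirtrace implementation): entry events are buffered, and on exit the entry/exit pair is dropped if the routine ran shorter than the threshold Δt, otherwise both events are written.

  #include <stdio.h>

  #define PRUNE_THRESHOLD 0.000005   /* Δt in seconds, hypothetical value */

  typedef struct {
      const char *name;
      double      entry_time;
  } PendingEntry;

  static PendingEntry stack[256];
  static int depth = 0;

  void on_entry(const char *name, double now) {
      stack[depth].name = name;            /* buffer the entry event */
      stack[depth].entry_time = now;
      depth++;
  }

  void on_exit(double now) {
      depth--;
      double duration = now - stack[depth].entry_time;
      if (duration < PRUNE_THRESHOLD) {
          /* Too short: discard both events; the caller would be marked
             as having summarized call-tree info. */
          return;
      }
      /* Long enough: emit the buffered entry and this exit. */
      printf("ENTRY %s @ %f\n", stack[depth].name, stack[depth].entry_time);
      printf("EXIT  %s @ %f\n", stack[depth].name, now);
  }

  int main(void) {
      on_entry("X", 0.0);
      on_entry("Y", 0.000001);
      on_exit(0.000002);   /* Y ran 1 microsecond: pruned  */
      on_exit(0.010000);   /* X ran 10 milliseconds: kept  */
      return 0;
  }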

  15. Scalability on Phase One
  • Timeline scaling to 256 tasks/nodes
  • Tasks within a node gathered into a group
  • Filtering by nodes
  • Expand each node
  • Message statistics by nodes

  16. Phase Two – Integrating Capabilities for ASCI Apps • Phase Two Goals – • Deployment to other platforms – • Compaq, CPlant, SGI • Thread-Safety • Scalability – • Grouping • Statistical Analysis • Integrated GuideView • Hardware performance monitors • Dynamic control of instrumentation • Environmental awareness

  17. Thread Safety
  • Thread-safe Vampirtrace library collects data from each thread
  • Per-thread profiling data (in the previous release, only the master thread logged data)
  • Improves accuracy of the data
  • Value to users: enhances integration between MPI and OpenMP, and enhances visibility into the functional balance between threads

  18. Scalability: Grouping
  • Up to the end of FY00: fixed hierarchy levels (system, nodes, CPUs) and fixed grouping of processes; e.g., impossible to reflect communicators
  • Need more levels: threads are a fourth group; systems with deeper hierarchies (30T)
  • Reduce the number of on-screen entities for scalability
  • [Diagram: hierarchy from the whole system down to nodes, quadboards, CPUs, and threads]

  19. Default Grouping
  • By nodes, by processes, by master threads, or all threads
  • Can be changed in the configuration file

  20. Scalability: Grouping
  • Filter-processes dialog with a select-groups combo-box
  • Display of groups by aggregation or by representative
  • Grouping applies to timeline bars and counter streams

  21. Scalability by Grouping
  • Parallelism display showing all threads
  • Parallelism display showing only master threads, alternating between MPI and OpenMP parallelism

  22. Statistical Information Gathering
  • Collects basic statistics at runtime
  • Saves statistics in an ASCII file
  • View statistics in your favorite spreadsheet
  • Reduced overhead compared to tracing
  • [Diagram: the parallel executable produces either a large tracefile or a small statsfile; a Perl filter feeds the statsfile into Excel or similar]

  23. Statistical Information Gathering
  • Can work independently of tracing
  • Significantly lower overhead (memory, runtime)
  • Restriction: statistics apply to the whole application run
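A minimal sketch of the idea (assumed, not the actual Vampirtrace statistics format): per-routine counters are accumulated at runtime and dumped as one ASCII line per routine at the end of the run, which a script or spreadsheet can then aggregate.

  #include <stdio.h>
  #include <string.h>

  /* Hypothetical per-routine statistics record. */
  typedef struct {
      char   name[64];
      long   calls;
      double total_time;
  } RoutineStats;

  static RoutineStats stats[128];
  static int nstats = 0;

  /* Accumulate one completed call; names and format are illustrative only. */
  void record_call(const char *name, double duration) {
      for (int i = 0; i < nstats; i++) {
          if (strcmp(stats[i].name, name) == 0) {
              stats[i].calls++;
              stats[i].total_time += duration;
              return;
          }
      }
      if (nstats >= 128) return;
      snprintf(stats[nstats].name, sizeof stats[nstats].name, "%s", name);
      stats[nstats].calls = 1;
      stats[nstats].total_time = duration;
      nstats++;
  }

  /* Write the small ASCII statsfile at the end of the run. */
  void write_statsfile(const char *path) {
      FILE *f = fopen(path, "w");
      if (!f) return;
      fprintf(f, "# routine calls total_time\n");
      for (int i = 0; i < nstats; i++)
          fprintf(f, "%s %ld %.6f\n", stats[i].name, stats[i].calls, stats[i].total_time);
      fclose(f);
  }

  int main(void) {
      record_call("MPI_Send", 0.002);
      record_call("MPI_Send", 0.003);
      record_call("solver",   1.250);
      write_statsfile("stats.txt");
      return 0;
  }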

  24. Statistical Information Gathering

  25. GuideView Integrated Inside Vampir
  • Creating an extension API in Vampir: insert menu items, include new displays, and have access to trace data & statistics
  • [Diagram: the Vampir GUI engine and Vampir menus invoke a new GuideView control display, which accesses the trace data in memory through the Motif graphics library]

  26. New GuideView Whole Program View
  • Goals: improve MPI/OpenMP integration, improve scalability, integrate look and feel
  • Works like the old GuideView!
  • Load time: fast!

  27. New GuideView Region View
  • Looks like the old region view turned on its side!
  • Scalability test: 16 MPI tasks, 16 OpenMP threads, 300 parallel regions

  28. Hardware Performance Monitors
  • The user can call the HPM API in the source code
  • The user can define events in a config file for Guide instrumentation
  • HPM counter events are also logged from the Guidetrace and Vampirtrace libraries
  • The underlying HPM library is PAPI
  • [Diagram: application source and config file pass through Guide/Guidetrace and Vampirtrace into the executable; PAPI feeds counter data into the tracefile read by Vampir and GuideView]

  29. PAPI – Hardware Performance Monitors
  • Standardizes counter names across platforms
  • Users define counter sets (counters the platform does not support cannot be provided)
  • Users could instrument by hand; but better, counters are instrumented automatically at OpenMP constructs and subroutines
  • Example: create a new event set measuring L1 & L2 data cache misses, activate it, and collect the events over two user-defined intervals:

  int main(int argc, char **argv) {
      int set_id;
      int inner, outer, other;
      set_id = VT_create_event_set("MySet");   /* create a new event set to   */
      VT_add_event(set_id, PAPI_L1_DCM);       /* measure L1 & L2 data cache  */
      VT_add_event(set_id, PAPI_L2_DCM);       /* misses                      */
      VT_symdef(outer, "OUTER", "USERSTATES");
      VT_symdef(inner, "INNER", "USERSTATES");
      VT_symdef(other, "OTHER", "USERSTATES");
      VT_change_hpm(set_id);                   /* activate the event set      */
      VT_begin(outer);                         /* collect the events over two */
      foo();                                   /* user-defined intervals      */
      VT_begin(inner);
      bar();
      VT_end(inner);
      foo();
      VT_end(outer);
      return 0;
  }
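For reference, the plain PAPI calls that such a wrapper builds on look roughly like the following; this is a minimal standalone sketch using the standard PAPI API, not the Vampirtrace wrapper itself.

  #include <stdio.h>
  #include <papi.h>

  int main(void) {
      int event_set = PAPI_NULL;
      long long values[2];

      /* Initialize PAPI and build an event set for L1 and L2 data cache misses. */
      if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
          return 1;
      PAPI_create_eventset(&event_set);
      PAPI_add_event(event_set, PAPI_L1_DCM);
      PAPI_add_event(event_set, PAPI_L2_DCM);

      /* Count the events around the region of interest. */
      PAPI_start(event_set);
      /* ... code being measured ... */
      PAPI_stop(event_set, values);

      printf("L1 DCM = %lld, L2 DCM = %lld\n", values[0], values[1]);
      return 0;
  }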

  30. Hardware Performance Example
  • MPI tasks on the timeline
  • Or, per-MPI-task activity correlated in the same window
  • Floating-point instructions correlated, but in a different window

  31. Hardware Performance Can Be Rich
  • 4 x 4 SWEEP3D run showing L1 data cache misses and cycles stalled waiting for memory accesses

  32. Hardware Performance in GuideView
  • HPM data is visible on all GuideView windows
  • L1 data cache misses and cycles stalled due to memory, shown in the per-MPI-task profile view

  33. Derived Hardware Counters
  • In this menu you can arithmetically combine measured counters into derived counters
  • Vampir and GuideView displays present the derived counters
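For example (an illustrative combination, not one taken from the slides), an L1 miss ratio could be derived as PAPI_L1_DCM / PAPI_L1_DCA, or a memory-stall fraction as cycles stalled on memory accesses divided by total cycles.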

  34. Environmental Counters
  • Select rusage information like HPMs
  • Data appears in Vampir and GuideView like HPM data
  • Time-varying OS counters:
  • A config variable sets the sampling frequency
  • Difficult to attribute to source code precisely
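As a rough illustration of where such OS counters come from (a generic POSIX sketch, not the tool's own sampling code), rusage data can be sampled periodically with getrusage():

  #include <stdio.h>
  #include <sys/resource.h>

  /* Sample OS-level counters for the calling process; a tool would do this
     periodically at the configured sampling frequency. */
  void sample_rusage(void) {
      struct rusage ru;
      if (getrusage(RUSAGE_SELF, &ru) == 0) {
          printf("user time: %ld.%06ld s, major faults: %ld, ctx switches: %ld\n",
                 (long)ru.ru_utime.tv_sec, (long)ru.ru_utime.tv_usec,
                 ru.ru_majflt, ru.ru_nvcsw + ru.ru_nivcsw);
      }
  }

  int main(void) {
      sample_rusage();
      return 0;
  }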

  35. Environmental Awareness
  • Type 1: collects IBM MPI information
  • Treated as a static (one-time) event in the tracefile
  • Over 50 parameters

  36. Dynamic Control of Instrumentation
  • In the source, the user puts VT_confsync() calls
  • At runtime, TotalView is attached and a breakpoint is inserted
  • From process #0, the user adjusts several instrumentation settings
  • The VTconfigchanged flag is set and the breakpoint is exited; the tracefile reflects the change after the next VT_confsync()
  • [Diagram: TotalView attaches to the executable built from Guide-instrumented source and the Vampirtrace library; the tracefile feeds Vampir and GuideView]
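A minimal usage sketch (assuming VT_confsync takes no arguments, which the slide's call syntax suggests but does not confirm): the application calls it at a natural synchronization point, such as once per timestep, so configuration changes made under TotalView take effect at the next call.

  #include <mpi.h>

  /* Assumed prototype; the actual Vampirtrace declaration may differ. */
  void VT_confsync(void);

  /* Stand-in for the application work done in one step. */
  static void timestep(int step) { (void)step; }

  int main(int argc, char **argv) {
      MPI_Init(&argc, &argv);
      for (int step = 0; step < 1000; step++) {
          timestep(step);
          VT_confsync();   /* instrumentation settings changed at runtime
                              take effect here */
      }
      MPI_Finalize();
      return 0;
  }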

  37. Dynamic Control of Instrumentation

  38. Structured Trace Files: Frames Manage Scalability
  • A frame can cover a section of the timeline, a set of processors, messages or collectives, OpenMP regions, or instances of a subroutine

  39. Structured Trace Files Consist of Frames
  • Frames are defined in the source code:
  int VT_framedef(char *name, unsigned int type_mask, int *frame_handle)
  int VT_framestart(int frame_handle)
  int VT_framestop(int frame_handle)
  • type_mask defines the types of data collected: VT_FUNCTION, VT_REGION, VT_PAR_REGION, VT_OPENMP, VT_COUNTER, VT_MESSAGE, VT_COLL_OP, VT_COMMUNICATION, VT_ALL
  • Analysis of time frames will be available
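A short usage sketch built from the signatures above; the mask values are placeholders and the real declarations would come from the Vampirtrace header, so the return-code conventions here are assumptions.

  #include <mpi.h>

  /* Prototypes and type-mask names as listed on the slide. */
  int VT_framedef(char *name, unsigned int type_mask, int *frame_handle);
  int VT_framestart(int frame_handle);
  int VT_framestop(int frame_handle);
  #define VT_OPENMP  0x1u   /* placeholder value */
  #define VT_MESSAGE 0x2u   /* placeholder value */

  /* Stand-in for the interesting phase of the application. */
  static void solve_iteration(void) { }

  int main(int argc, char **argv) {
      int solver_frame;
      MPI_Init(&argc, &argv);

      /* Restrict collection for this frame to OpenMP and message data. */
      VT_framedef("solver", VT_OPENMP | VT_MESSAGE, &solver_frame);

      VT_framestart(solver_frame);
      solve_iteration();
      VT_framestop(solver_frame);

      MPI_Finalize();
      return 0;
  }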

  40. Structured Trace Files: Rapid Access by Frames
  • 1) The structured tracefile holds an index file plus the individual frames
  • 2) Vampir thumbnail displays represent the frames
  • 3) Selecting a thumbnail displays that frame in Vampir

  41. Object Oriented Performance Analysis
  • How to avoid SOOX (Scalability Object Oriented eXplosion): instrument with an API
  • C++ templates and classes make it much easier
  • Can be used with or without source
  • Uses the TAU model
  • [Diagram: events such as MPI_Send, MPI_Recv, MPI_Finalize and functions Init, A, X, Y, Z map to Informers I_A ... I_D, which map to InformerMappings ImQ, ImX, ImY, ImZ (VT activities)]

  42. Example of OO Informers

  class Matrix {
  public:
      InformerMapping im;
      Matrix(int rows, int columns) {
          if (rows * columns > 500) im.Rename("LargeMatrix");
          else                      im.Rename("Matrix");
      }
      void invert() {
          Informer(im, "invert", 12, 15, "Example.C");
          #pragma omp parallel
          { /* ... */ }
          MPI_Send(/* ... */);
      }
      void compl() {
          Informer(im, "typeid(...)");
          /* ... */
      }
  };

  int main(int argc, char **argv) {
      Matrix A(10,10), B(512,512), C(1000,1000);   // line 1
      B.im.Rename("MediumMatrix");                 // line 2
      A.invert();                                  // line 3
      B.compl();                                   // line 4
      C.invert();                                  // line 5
  }

  • Line 1: creates three Matrix instances: A (mapped to the "Matrix" bin), B and C (both mapped to the "LargeMatrix" bin)
  • Line 2: remaps B to the "MediumMatrix" bin
  • Line 3: A.invert() is traced; entry and exit events are collected and associated with "Matrix:invert" in the Matrix bin
  • Line 4: B.compl() is traced; entry and exit events are collected and associated with "Matrix:void compl(void)" in the MediumMatrix bin
  • Line 5: C.invert() is traced; entry and exit events are collected and associated with "Matrix:invert" in the LargeMatrix bin

  43. Vampir OO Timeline Shows Informer Bins
  • InformerMappings: each bin is displayed as a Vampir activity
  • MPI is put into a separate activity with the same prefix
  • Renamed as a 'mangled name': InformerMapping:Informer:NormalEventName

  44. Vampir OO Profile Shows Informer Bins
  • [Screenshots: time in classes (Queens MPI) and time in class (Queens)]

  45. OO GuideView Shows Regions in Bins
  • Time and counter data per thread, by bin

  46. Parallel Performance Engineering • ASCI Ultrascale Performance Tools • Scalability • Integration • Ease of Use • Read about what was presented • ftp://ftp.kai.com/private/Lab_notes_2001.doc.gz • Contact: seon.w.kim@intel.com • Thank you for your attention!
