Paradyn Evaluation Report

Adam Leko, UPC Group, HCS Research Laboratory, University of Florida • Color encoding key: • Blue: Information • Red: Negative note • Green: Positive note

Basic Information • Name: Paradyn • Developer: University of Wisconsin-Madison • Current versions: • Paradyn: 4.1.1 • DynInst: 4.1.1 • KernInst: 2.0.1 • Website: http://www.paradyn.org/index.html • Contact: Matthew Legendre

What is Paradyn? • A performance analysis tool (PAT) for sequential and parallel programs • Uses dynamic binary instrumentation to record program metrics (may use unmodified executables) • Visualizations • Metric-focus grids (right, top) • Rows: performance metrics • Columns: resources to collect a performance metric from • Metrics can be reported as current value, statistics (min/max/average), or time-histograms (right, bottom) • Performance consultant • Automated search to identify bottlenecks in program • Uses the W3 model (where, when, why) • Also an umbrella project that includes tools related to performance analysis • Paradyn PAT • DynInst: a dynamic binary instrumentation library • KernInst: a dynamic instrumentation library for instrumenting running operating system (OS) kernels • Not very useful for a PAT unless PAT needs to be applied to an OS kernel • MRNet: a high-performance communication library supporting master-slave software architectures • “Multicast/Reduction Network” • Not immediately useful for the design phase of our PAT [Figures: example metric-focus grid; example time-histogram (axes: bandwidth, time)]

General Paradyn Architecture • Four main components • User interface (green) • Visualization (red) • Performance consultant (purple) • Instrumentation (blue) • Thick circles represent running processes, dotted circles represent threads within a single process • Each component is presented in turn using a “bottom up” approach [Figure: Paradyn architecture diagram, courtesy [1]]

Part 1: Instrumentation

Instrumentation Overview • Paradyn terminology: points, primitives, and predicates • Points – places where instrumentation code can be placed • Supported points: procedure entry, procedure exit, individual call statements • Primitives – simple operations that change the value of a counter or timer • Predicates – boolean expressions that guard execution of primitives • Using predicates and primitives, points in a program may be instrumented • Predicates and primitives are controlled via PCL and MDL (discussed later) • Paradyn uses dynamic instrumentation to record performance data at points • Paradyn attaches its performance daemons to a running process or starts a new process using an unmodified binary • Instrumentation workflow: • User or performance consultant requests a metric-focus from Paradyn • Data manager in Paradyn uses Remote Procedure Call (RPC) to communicate with remote processes asking them to start instrumentation for a specific metric focus • RPC allows heterogeneity in runtime environment • Metric manager receives instrumentation request and turns that into an abstract, machine-independent request • Instrumentation manager inserts code into executable corresponding to machine-independent abstraction • Executable is stopped • Code is inserted • Executable resumes running • Instrumented data is periodically sampled by the metric manager and sent back to the data manager
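The relationship between points, primitives, and predicates can be sketched in a few lines of Python. This is only a toy model of the terminology, not Paradyn's or DynInst's API: the `Instrumenter` class, the point name `"send:entry"`, and the `dest` field are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Counter:
    value: int = 0


@dataclass
class Snippet:
    """A primitive guarded by a predicate (hypothetical model, not Paradyn's API)."""
    predicate: Callable[[dict], bool]   # boolean guard
    primitive: Callable[[dict], None]   # simple counter/timer update


@dataclass
class Instrumenter:
    # Maps a point name (e.g. a procedure entry) to the snippets inserted there.
    points: Dict[str, List[Snippet]] = field(default_factory=dict)

    def insert(self, point: str, snippet: Snippet) -> None:
        self.points.setdefault(point, []).append(snippet)

    def remove(self, point: str, snippet: Snippet) -> None:
        self.points[point].remove(snippet)

    def hit(self, point: str, ctx: dict) -> None:
        """Called when execution reaches a point; runs each primitive whose predicate holds."""
        for snippet in self.points.get(point, []):
            if snippet.predicate(ctx):
                snippet.primitive(ctx)


msgs_to_rank0 = Counter()


def count_message(ctx: dict) -> None:
    msgs_to_rank0.value += 1


inst = Instrumenter()
snippet = Snippet(predicate=lambda ctx: ctx["dest"] == 0, primitive=count_message)

# Instrumentation is inserted while the "program" runs...
inst.insert("send:entry", snippet)
for dest in [0, 1, 0, 2]:
    inst.hit("send:entry", {"dest": dest})
count_while_instrumented = msgs_to_rank0.value

# ...and removed again once the metric is no longer wanted.
inst.remove("send:entry", snippet)
inst.hit("send:entry", {"dest": 0})
count_after_removal = msgs_to_rank0.value
```

The predicate keeps the counter focused on one resource (messages to rank 0), which mirrors how a metric-focus pair restricts what is measured.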

Binary Instrumentation • Binary instrumentation is accomplished by inserting base trampolines for each instrumentation point • Base trampolines handle storing the current state of the program so instrumentation does not affect execution • In some architectures, only registers that are used are saved (if this can be inferred from the machine calling convention) • Mini trampolines are the machine-specific realizations of predicates and primitives • One base trampoline may handle many mini trampolines, but a base trampoline is needed for every instrumentation point • Basic flow of trampoline shown in right, top • Mini trampoline assembly code for SPARC machine shown in right, bottom • Binary instrumentation is difficult! • Have to deal with • Compiler optimizations • Branch delay slots • Different sizes of instructions for x86 (may increase the number of instructions that have to be relocated) • Creating and inserting mini trampolines somewhere in program (at end?) • Limited-range jumps may complicate this • Luckily, the DynInst library is available separately for use in other applications • Paradyn’s instrumentation cost <= 80 clock cycles per base trampoline [2] [Figures: trampoline flow and SPARC mini trampoline code, both courtesy [2]]
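The division of labor between a base trampoline and its mini trampolines can be illustrated with a small simulation. This is a sketch under heavy simplification: registers are modeled as a Python dict, all mini trampolines run before the relocated instruction, and the class and function names are invented for the example rather than taken from DynInst.

```python
from typing import Callable, Dict, List

Registers = Dict[str, int]


class BaseTrampoline:
    """Toy model: save state, run mini trampolines, restore state, run relocated code."""

    def __init__(self, relocated_instruction: Callable[[Registers], None]):
        self.relocated_instruction = relocated_instruction
        self.mini_trampolines: List[Callable[[Registers], None]] = []

    def __call__(self, regs: Registers) -> None:
        saved = dict(regs)                 # save the program's state
        for mini in self.mini_trampolines:
            mini(regs)                     # instrumentation may clobber registers
        regs.clear()
        regs.update(saved)                 # restore state before resuming
        self.relocated_instruction(regs)   # instruction displaced by the jump


def original_instruction(regs: Registers) -> None:
    regs["r1"] = regs["r1"] + regs["r2"]


calls = {"count": 0}


def count_calls(regs: Registers) -> None:
    regs["r1"] = 12345                     # deliberately clobber a register
    calls["count"] += 1


tramp = BaseTrampoline(original_instruction)
tramp.mini_trampolines.append(count_calls)

plain = {"r1": 1, "r2": 2}
original_instruction(plain)                # uninstrumented execution

instrumented = {"r1": 1, "r2": 2}
tramp(instrumented)                        # instrumented execution
```

Both executions end with the same register contents, while the counter records that the point was reached once; that invariance is the property the state-saving step exists to guarantee.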

PCL & MDL • Paradyn provides a Tcl-like language to configure and add metrics without recompiling or modifying Paradyn • Stored in paradyn.rc; users may supply their own version (.paradynrc) • PCL – Paradyn Control Language • Controls available daemons (MPI, sequential, etc.) • Can add processes automatically at startup (which programs to record performance data for) • Can customize Paradyn options (colors and other “tunable constants”) • Can add visualizations (described later) • Can add metrics via MDL • MDL – Metric Description Language • Sublanguage of PCL • Describes metrics • Types provided: counters and timers • Can specify constraints for each metric that limit how they can be used/what they can be used with • May be inclusive (include time spent in procedures called from a point) or exclusive (count only the point’s own cost, excluding time spent in procedures called from it) • Language not Turing complete: no looping construct provided • Example counter shown right [Figure: counter MDL code, courtesy [2]]
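The inclusive/exclusive distinction is easiest to see on a small call tree. The sketch below is plain Python rather than MDL, and the procedure names and timings are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Call:
    name: str
    self_time: float                       # seconds spent in the procedure body itself
    children: List["Call"] = field(default_factory=list)


def exclusive_time(call: Call) -> float:
    """Cost of the procedure alone, excluding anything it calls."""
    return call.self_time


def inclusive_time(call: Call) -> float:
    """Cost of the procedure plus everything called from it."""
    return call.self_time + sum(inclusive_time(child) for child in call.children)


send = Call("send", 2.5)
solve = Call("solve", 4.0, [send])
report = Call("report", 0.5)
main = Call("main", 1.0, [solve, report])

main_exclusive = exclusive_time(main)      # only main's own body
main_inclusive = inclusive_time(main)      # main plus solve, send, and report
solve_inclusive = inclusive_time(solve)    # solve plus send
```

A procedure with a small exclusive time and a large inclusive time (like `main` here) is a dispatcher; the cost lives further down the call tree.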

Paradyn Overhead • Instrumentation overhead very low for most test programs when collecting 5 metrics on all functions • Communication metrics • Number of messages sent • Number of point-to-point messages • Number of collective messages • I/O bytes • CPU metric • CPU utilization • Instrumenting CAMEL’s main routine had 800% overhead • Instrumenting a function also instruments its call sites • main routine had many small function calls • Performance consultant (discussed later) added a large amount of overhead to most programs during searches
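A back-of-the-envelope model shows why overhead can be negligible for coarse functions yet explode when many small calls are instrumented. It assumes the 80-cycle upper bound per base trampoline quoted on the Binary Instrumentation slide; the cycle and call counts are invented for illustration and are not measurements of CAMEL.

```python
TRAMPOLINE_CYCLES = 80  # assumed upper bound per base trampoline execution


def overhead_percent(useful_cycles: int, trampoline_hits: int,
                     cycles_per_hit: int = TRAMPOLINE_CYCLES) -> float:
    """Instrumentation cycles as a percentage of the program's useful cycles."""
    return 100.0 * trampoline_hits * cycles_per_hit / useful_cycles


# Hypothetical program doing 1,000,000 cycles of useful work.
# Case 1: a few long-running functions, so only 10 instrumented points are hit.
coarse = overhead_percent(useful_cycles=1_000_000, trampoline_hits=10)

# Case 2: the same work split into 100,000 tiny calls of about 10 cycles each,
# every one of which passes through a trampoline.
fine = overhead_percent(useful_cycles=1_000_000, trampoline_hits=100_000)
```

In this made-up scenario the coarse case costs well under 1% while the fine-grained case costs several times the useful work, which is the same qualitative effect the slide reports for a main routine with many small function calls.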

Part 2: Performance Consultant

Performance Consultant (PC) Overview • PC performs an automated search on the program • Identifies bottlenecks in programs • Uses W3 search (described next slide) • Search is guided, based on program’s call graph [3] • Iterative method that tests hypotheses against sections of code • Starts with main and examines subroutine calls • “Drills down” and examines subroutines based on how frequently they are called • Call graph search method was successfully applied to several large programs containing thousands of lines of code • However, method can miss functions called by more than one parent function whose individual parent functions do not appear as “problem functions” • Call graph automatically generated from executable’s symbol table • Example PC run shown at top right, corresponding call graph for application shown at bottom right [Figures: example PC run; call graph used in PC run]
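The drill-down idea, and the blind spot for functions shared by several parents, can be sketched as follows. This is a simplified illustration rather than the algorithm from [3]: the call graph, the inclusive CPU fractions, and the 20% cutoff applied at every node are assumptions made for the example.

```python
from typing import Callable, Dict, List


def drill_down(call_graph: Dict[str, List[str]],
               hypothesis: Callable[[str], bool],
               root: str = "main") -> List[str]:
    """Return the deepest functions for which the hypothesis still tests true.

    Children are only examined if their parent tested true, so a function whose
    parents all test false is never reached.
    """
    if not hypothesis(root):
        return []
    found: List[str] = []
    visited = set()
    stack = [root]
    while stack:
        fn = stack.pop()
        if fn in visited:
            continue
        visited.add(fn)
        true_children = [c for c in call_graph.get(fn, []) if hypothesis(c)]
        if true_children:
            stack.extend(true_children)    # refine further down the call graph
        else:
            found.append(fn)               # deepest node where the hypothesis holds
    return found


# Hypothetical call graph: util is called from both a and b.
call_graph = {"main": ["a", "b", "c"], "a": ["util"], "b": ["util"], "c": ["kernel"]}

# Hypothetical inclusive CPU fractions. util totals 28%, but each parent is below 20%.
inclusive_cpu = {"main": 1.00, "a": 0.15, "b": 0.15, "c": 0.70,
                 "kernel": 0.60, "util": 0.28}


def cpu_bound(fn: str) -> bool:
    return inclusive_cpu[fn] > 0.20


bottlenecks = drill_down(call_graph, cpu_bound)
```

Here the search reports only `kernel`; `util` exceeds the cutoff on its own but is never tested because neither of its parents looked like a problem function.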

W3 Model: Why, Where, When • Paradyn’s goal: “… to assist the user in locating performance problems in a program; a performance problem is a part of the program that contributes a significant amount of time to its execution” • W3 model attempts to answer: • Why is the program performing poorly? • Where is the program performing poorly? • When is the program performing poorly? • Performance consultant shows why and where axes graphically to the user (see right) • Yellow line: why axis refinement • Purple line: where axis refinement [Figure: W3 refinements (blue = true, pink = false)]

W3 Model: Why, Where, When (2) • Why axis • Paradyn applies hypotheses to code • ExcessiveSyncWaitingTime? • CPUBound? • ExcessiveIOBlockingTime? • TooManySmallIOOps? • Each hypothesis is represented by a tunable predicate • E.g., CPUBound := CPUTime > 20% • After a hypothesis is determined to be false, no more searching is done for that type of bottleneck • Where axis • Once a hypothesis tests true (why refinement), an automated search is started to determine where the problem lies • Each subroutine is examined to see if the hypothesis is also true (where refinement) • The program’s call graph is used to guide search of subroutines • Where axis is searched iteratively until reaching the deepest node of the call graph for which the hypothesis tests true • When axis • Indirectly supported through the use of “phases” • Phases are defined by the user • Phases represent specific time intervals in a program’s execution • When axis refinement relies on the user’s interaction • While axis refinements are made, performance consultant automatically requests instrumentation • Frequency of instrumentation and a limit on number of concurrent instrumentations can be set by the user
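A why-axis hypothesis is just a threshold test on a metric, which makes the effect of tuning easy to demonstrate. In the sketch below only the CPUBound cutoff of 20% comes from the slide; the metric names, the other cutoffs, and the sample measurements are assumptions for illustration.

```python
from typing import Dict, Tuple

# hypothesis name -> (metric it examines, cutoff as a fraction of execution time)
THRESHOLDS: Dict[str, Tuple[str, float]] = {
    "CPUBound": ("cpu_time", 0.20),                        # cutoff from the slide
    "ExcessiveSyncWaitingTime": ("sync_wait_time", 0.20),  # assumed cutoff
    "ExcessiveIOBlockingTime": ("io_wait_time", 0.20),     # assumed cutoff
}


def evaluate_hypotheses(fractions: Dict[str, float],
                        thresholds: Dict[str, Tuple[str, float]] = THRESHOLDS
                        ) -> Dict[str, bool]:
    """Test each hypothesis: true when its metric exceeds the tunable cutoff."""
    return {name: fractions.get(metric, 0.0) > cutoff
            for name, (metric, cutoff) in thresholds.items()}


# Hypothetical measurements for one process, as fractions of wall-clock time.
sample = {"cpu_time": 0.65, "sync_wait_time": 0.25, "io_wait_time": 0.02}

default_result = evaluate_hypotheses(sample)

# Raising one cutoff changes the diagnosis without touching the measurements.
stricter = dict(THRESHOLDS, ExcessiveSyncWaitingTime=("sync_wait_time", 0.30))
tuned_result = evaluate_hypotheses(sample, stricter)
```

With the default cutoffs the sample is flagged as both CPU-bound and sync-bound; with the stricter sync cutoff only CPUBound remains, which shows how sensitive the diagnosis is to the chosen thresholds.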

Bottleneck Identification Test Suite • Testing metric: what did Performance Consultant tell us? • Program correctness not affected by instrumentation • CAMEL: PASSED • Identified program as CPU-bound • However, Performance Consultant added much overhead, which resulted in a misdetection on the where axis • LU: TOSS-UP • Identified as excessive sync time bottleneck • Not further resolved to too many small messages; could only be tracked down to the ssor.f source code file • Big messages: PASSED • Identified excessive sync time @ Grecv_message function • Diffuse procedure: FAILED • Identified excessive sync time at MPI_Barrier, but did not localize to bottleneck procedure • Missed picking up on diffuse CPU-bound behavior

Bottleneck Identification Test Suite (2) • Hot procedure: PASSED • Correctly identified CPU-bound bottleneck procedure • Overhead from Performance Consultant’s excessive instrumentation led to a slight misdiagnosis on the where axis • Bottleneck attributed to all nodes except one, although all nodes exhibit the problem • Intensive server: TOSS-UP • Identified excessive sync waiting time on Grecv_message from main • However, due to lack of trace view, it would be difficult/impossible to see all threads waiting on the master thread • Ping-pong: PASSED • Identified excessive sync waiting time on Grecv_message • Random barrier: TOSS-UP • Identified excessive sync waiting time on barrier in main • No trace view means it would be nearly impossible to see randomness of which node was (inconsistently) taking more time

Bottleneck Identification Test Suite (3) • Small messages: TOSS-UP • Identified excessive sync waiting time on Gsendmessage in main • Did not localize to a particular node, though • System time: FAILED • Performance Consultant failed to instrument code • Possibly due to OS being too busy with user code to handle dynamic binary instrumentation • Wrong order: TOSS-UP • Identified excessive sync waiting time on messages on main • Would best be seen by a trace, but classification here was different than other communication-based bottlenecks

Part 3: Visualizations

Visualizations Overview • Paradyn supports several types of built-in visualizations (visis) for metrics • Bar charts • Histograms (right, top) • Table (text representation, can show current/max/min values for each metric) • “Terrain” – 3D histogram (see right, bottom) • Axes are time, metric, location • Visualizations may handle multiple metrics at once • Visualizations are implemented as separate processes • Callback functions are used to provide continuous data to visualization programs • Users may add custom visualizations • Paradyn provides a simple library and RPC interface • Configured to show up in interface via PCL files (paradyn.rc, .paradynrc) [Figures: histogram visualization; terrain visualization]

Visualizations Overview (2) • When a user creates a visualization, • Paradyn automatically instruments running program accordingly • Visualization continues until user closes it • After closing, Paradyn automatically removes instrumented code • Histograms are stored using a fixed-size data structure • Metric values sorted into time “buckets” • When buckets fill, data is reorganized: adjacent buckets are merged and the time span covered by each bucket doubles (keeping the structure a fixed size) • As execution time increases, sampling rate decreases logarithmically to keep data sizes small
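One common way to keep a time-histogram at a fixed size is to "fold" it: when the last bucket is passed, adjacent buckets are merged and each bucket then covers twice as much time. The sketch below illustrates that technique in Python; it is an assumption about how such a structure can work, not Paradyn's actual implementation, and the class name and parameters are hypothetical.

```python
from typing import List


class FoldingHistogram:
    """Fixed-size time histogram that halves its resolution when it fills up."""

    def __init__(self, num_buckets: int = 4, bucket_width: float = 1.0):
        if num_buckets % 2 != 0:
            raise ValueError("num_buckets must be even so buckets can be paired")
        self.num_buckets = num_buckets
        self.bucket_width = bucket_width          # seconds covered by one bucket
        self.buckets: List[float] = [0.0] * num_buckets

    def add(self, time: float, value: float) -> None:
        # Fold until the sample's timestamp fits inside the covered time range.
        while time >= self.num_buckets * self.bucket_width:
            self._fold()
        self.buckets[int(time // self.bucket_width)] += value

    def _fold(self) -> None:
        half = self.num_buckets // 2
        merged = [self.buckets[2 * i] + self.buckets[2 * i + 1] for i in range(half)]
        # Old data now occupies the first half; the second half is free again.
        self.buckets = merged + [0.0] * (self.num_buckets - half)
        self.bucket_width *= 2


hist = FoldingHistogram(num_buckets=4, bucket_width=1.0)
for second in range(8):
    hist.add(float(second), 1.0)   # one unit of the metric per second
```

After eight seconds the four buckets each hold two units and cover two seconds apiece: memory use is unchanged, but events within a bucket can no longer be told apart, which is the accuracy loss noted later under measurement accuracy.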

Part 4: User Interface

User Interface Overview • Current interface uses Tcl/Tk for graphics (right) • Multiple windows for everything • Makes for a cluttered interface • Tcl/Tk provides a usable but crude-looking interface [Figure: example Paradyn session]

Paradyn Bugs • Can’t detect end of MPI program run (Paradyn will crash unless you start over from scratch) • Program crashes almost every time shortly after MPI program completes • Buggy startup code (starting a new process twice gives errors; program must be restarted) • Doesn’t work with code compiled with debugging information (gcc -g), see error dialog to right • “Can’t read .shstrtab section” • Pausing execution and adding a visualization crashes Paradyn (program continues execution while Paradyn thinks it is still paused) • Often leaves zombie child processes, even on error-free runs • Paradyn left unkillable processes hanging around after crashes on etas • killall -9 could not get rid of them

Paradyn Complaints • Slow startup (~5 seconds for each MPI node on etas) • Performance consultant takes a while to identify bottlenecks • Although, search is entirely automated • However, only seems to pick up on code that exhibits obvious bottlenecks • Cluttered and confusing interface • Why are there separate windows for the call graph and the where axis? • Many bugs, although most are handled by displaying a nice dialog box • However, some bugs necessitate a Paradyn restart • Function list on “where axis” dialog box contains a huge number of functions for MPI programs (100+, including MPI functions, which makes it hard to single out your application’s functions) • Phase function difficult to use • Should be easier to define phases, or base phases on subroutine entry/exit points • No “stop process” button!

Paradyn General Comments • W3 search hypotheses and threshold functions seem overly simplistic (-) • Doesn’t seem to work well on code that alternates quickly between communication and computation • Small number of hypotheses, perhaps due to large cost of evaluating each one? • Cutoff values for hypotheses seem arbitrary • Are tunable, but is a fixed cutoff appropriate? • Performance consultant was not able to detect/classify a sleep(1) statement inserted for a single MPI process • Should have labeled the receiving node as ExcessiveSyncWaitingTime, but did not label the process at all • Quick changes between computation and communication may have fooled it; perhaps adjusting thresholds would have helped • How would you know which thresholds to change? • How useful is the information provided by the W3 search? • Seems to only be able to pick out obvious things • Says what the problem is, but does it offer insight on how to fix it? • Overhead introduced by dynamic instrumentation seems very small (+++) • < 1% for 16 metrics being collected on a 16-node MPI application • However, overhead can increase dramatically for functions that call other (lightweight) functions many times over

Paradyn General Comments (2) • Platform support (--) • Paradyn: No support for 64-bit applications or Cray platforms! • DynInst: No support for 64-bit Opteron or Cray platforms! (Support for Itanium is provided though) • Dependence on DynInst combined with difficulty in porting DynInst to new platforms a potential problem • Adding and removing instrumentations is fast and works well (+++) • DynInst seems to be much more stable than Paradyn, minus the parsing bugs for executables compiled with gcc -g • Adding instrumentations to code usually takes one second or less • Helps reduce the measure stage of the “measure-modify” approach • However, time needed to start programs significantly increased, especially with many processes (-) • However, extra delays incurred during instrumentation affect the ability to gather traces of program execution • Is dynamic instrumentation necessary? • Things are greatly simplified when dynamic binary-level instrumentation is not implemented • Is it worth the added cost and complexity? • Fairly complex piece of software, takes a while to learn how to use effectively, even with tutorials (-) • This, along with its complicated installation procedure, may discourage its use • Though documentation is pretty good • PCL and MDL allow configuration and addition of user-defined metrics (+++)

Feasibility for UPC & SHMEM • In order to add support for UPC & SHMEM: • Need to create Paradyn daemons for UPC and SHMEM codes • This may be very difficult, since Paradyn daemons need to handle instrumentation • For UPC, how should communication be handled? • Instrument runtime libraries? • Which runtimes should be supported? • Is it feasible to support all runtimes of interest? • What about proprietary UPC languages and runtimes? • This could be an insurmountable problem • Paradyn has been around for a long time • Is there a lot of crufty code in the source code that is left alone because no one understands it? • Is the current user interface (Tcl/Tk) acceptable? • Also: • Would need to port DynInst to targeted architectures • This may be problematic for architectures with no publicly available information on executable file formats/etc • Should include performance metrics as recorded by PAPI • MDL should help, but • Will MDL present too large of an overhead for the level of granularity needed by PAPI? • Is a lack of tracing ability acceptable? What if more details are needed?

Evaluation (1) • Available metrics: 5/5 • Many built in • Number of CPUs, number of active threads, CPU and inclusive CPU time • Function calls to and by • Synchronization (# operations, wait time, inclusive wait time) • Overall communication (# messages, bytes sent and received), collective communication (# messages, bytes sent and received), point-to-point communication (# messages, bytes sent and received) • I/O (# operations, wait time, inclusive wait time, total bytes) • Can add more using MDL and PCL • Cost: free 5/5 • Documentation quality: 4/5 • Tutorial for using sequential and MPI programs with Paradyn • Well-written manuals • Programming guides included for DynInst, visualization library, and MDL • Extensibility: 2/5 • Creating a SHMEM daemon wouldn’t take a lot of work • Creating a UPC daemon will be problematic for proprietary runtimes • Depends on DynInst, and porting DynInst to a new platform may take an immense amount of work • Filtering and aggregation: 3/5 • Only supports rudimentary aggregation on metrics (min, max, averages) • Hardware support: 2/5 • No support for Opteron, Itanium, or Cray architectures in Paradyn • DynInst supports Itanium • Porting DynInst (which Paradyn depends on) would be very difficult • Heterogeneity support: 5/5 • Authors claim Paradyn supports heterogeneity due to use of RPC interfaces • Not directly supported by user interface for MPI programs, so cannot test

Evaluation (2) • Installation: 2/5 • Binaries are easy to find • http://www.paradyn.org/html/paradyn4.1-software.html • Compiling from source extremely difficult and error-prone • Relies on specific versions of libdwarf (Linux only) and Tcl/Tk (all), which complicates the installation if your distribution or OS uses incompatible versions • Installation time: approximately 2-3 hours for a shared environment • Need to create scripts that set about 6 environment variables before program will run correctly • Interoperability: 1/5 • Paradyn can save output in simple, documented format, but usefulness of data unknown • No detailed, trace-like information can be provided as it is not collected • Dynamic instrumentation interferes with tracing due to timing perturbations • Learning curve: 2/5 • Difficult, complex program with many parts • Took approximately 1 week to get comfortable with the program • Manuals and tutorials are very helpful • Manual overhead: 5/5 • No modification needed for executables • Measurement accuracy: 4/5 • Dynamic instrumentation incurs very low overhead (~80 cycles for trampoline overhead) • Time-histogram loses accuracy as time goes on due to fixed size • Multiple execution: 0/5 (not supported)

Evaluation (3) • Multiple views and analyses: 5/5 • Several visualization types are supported for all metrics • Histograms, bar charts, 3D histograms, tables, summary tables (min/max/average) • Users can add new visualization programs as desired using Paradyn’s RPC interface and visualization library • Call graphs and where axis give user a hierarchical view of their code • Default visualizations support zooming and panning • Performance bottleneck identification: 2.5/5 • Performance consultant can help identify “obvious” bottlenecks automatically • Due to limited search space (only 4 types of bottlenecks), bottleneck identification is limited • Tweaking thresholds used for search may be necessary to identify bottlenecks • Profiling/tracing support: 2/5 • Uses a “hybrid” approach of sampling and tracing • Detailed tracing information cannot be logged for later analysis • Cannot create a trace file of when MPI functions were called (e.g., what you’d need for Jumpshot) • Paradyn daemons report values back to main Paradyn process at time intervals • Main Paradyn process only has an approximation of “real” values for metrics at any given time • However, values recorded by main Paradyn process can be exported (uses a simple documented format) • Response time: 5/5 • Dynamic instrumentation allows arbitrarily turning instrumentation on and off without needing to restart or recompile the application • Only takes a few seconds to start collecting metrics once they are requested

Evaluation (4) • Software support: 3/5 • Supported languages: threaded C code, MPI code • Supported software platforms: Linux kernel version 2.4 & 2.6, AIX, Tru64, Windows 2000/XP, and IRIX • Source code correlation: 2/5 • Can correlate back to the function name level (reads executable symbol tables) • No line numbers or statement information available • Searching: 0/5 (not supported) • System stability: 2/5 • Many bugs in Linux version, but bugs seem to be limited to Paradyn GUI; DynInst seems very stable • Technical support: 4/5 • Helpful responses from our contact within 24 hours

References
[1] B. P. Miller et al., “The Paradyn parallel performance measurement tool,” IEEE Computer, November 1995, pp. 37-46.
[2] J. K. Hollingsworth et al., “MDL: A Language and Compiler for Dynamic Program Instrumentation,” IEEE PACT, 1997, p. 201.
[3] H. Cain, B. P. Miller, and B. J. N. Wylie, “A Callgraph-Based Search Strategy for Automated Performance Diagnosis,” European Conference on Parallel Computing (Euro-Par), Munich, Germany, August 2000, p. 108.
