290 likes | 315 Vues
Performance Analysis Tools. Performance Analysis Goal. Once we have a working parallel program, we want to tune it to run faster Hot spot – An area of code that uses a significant amount of CPU time
E N D
Performance Analysis Goal • Once we have a working parallel program, we want to tune it to run faster • Hot spot – An area of code that uses a significant amount of CPU time • Bottleneck – An area of code that uses resources inefficiently and slows the program down (e.g. communication)
Timers • One way to identify hot spots and bottlenecks is to use timers. • We’ve used it to measure the elapsed time of the entire algorithm, but this can be used to measure time spent on different parts of the algorithm
Time command • Usage: time mpirun –np # command • Result real 0m1.071s user 0m0.177s sys 0m0.045s
Time command • Meaning • Real time: the total wall clock (start to finish) time your program took to load, execute, and exit. • User time: the total amount of CPU time your program took to execute. • System time: the amount of CPU time spent on operating system calls in executing your program.
gettimeofday • gettimeofday is a system call that returns a structure that gives the time since Epoch (January 1 1970) int gettimeofday(struct timeval *tv, struct timezone *tz); • The timeval structure has seconds and microseconds: struct timeval { time_t tv_sec; /* seconds */ suseconds_t tv_usec; /* microseconds */ };
gettimeofday • Usage: #include <sys/time.h> struct timeval tv1, tv2; ... gettimeofday(&tv1, NULL); ... // Work to be timed gettimeofday(&tv2, NULL); // Convert time to seconds elapsed_time = (tv2.tv_sec - tv1.tv_sec) + ((tv2.tv_usec - tv1.tv_usec) / 1000000.0);
MPI_Wtime • Returns a single double-precision value that is the number of seconds since some time in the past (most likely Epoch) • MPI also provides a MPI_Wtick() routine that provides the resolution (most likely microseconds)
MPI_Wtime • Usage: #include "mpi.h" ... double start,end,resolution; MPI_Init(&argc, &argv); start = MPI_Wtime(); /* start time */ ... // Work to be timed end = MPI_Wtime(); /* end time */ resolution = MPI_Wtick(); printf("elapsed= %e resolution= %e\n", end-start, resolution);
MPI_Wtime • Sample output: Wallclock times(secs): start= 1384864000.067529 end= 1384864000.074005 elapsed= 6.475925e-03 resolution= 1.000000e-06 Elapsed time (seconds) Accurate to microseconds
read_real_time • read_real_time is a system call that returns a structure that gives the time since Epoch (January 1 1970) int read_real_time(timebasestruct_t *t, size_t size_of_timebasestruct_t); • Designed to measure time accurate to nanoseconds • Guarantee correct time units across different IBM RS/6000 architectures.
read_real_time #include <sys/time.h> ... timebasestruct_t start, finish; int secs, n_secs; read_real_time(&start, TIMEBASE_SZ); /* do some work */ read_real_time(&finish, TIMEBASE_SZ); /* Make sure both values are in seconds and nanoseconds */ time_base_to_time(&start, TIMEBASE_SZ); time_base_to_time(&finish, TIMEBASE_SZ);
read_real_time • Usage continued: ... /* Subtract the starting time from the ending time */ secs = finish.tb_high - start.tb_high; n_secs = finish.tb_low - start.tb_low; /* Fix carry from low-order to high-order during the measurement */ if (n_secs < 0) { secs--; n_secs += 1000000000; } printf("Sample time was %d seconds %d nanoseconds\n", secs, n_secs);
prof • Compile your program with the –p option: • gcc –p <program>.c –o <program> • Run the program • Profile file created called mon.out • Run: • prof –m mon.out
gprof • Compile your program with the –p option: • gcc –gp <program>.c –o <program> • Run the program • Profile file created called gmon.out • Run: • gprof <program> gmon.out
Sample Output from gprof ngranularity: Each sample hit covers 4 bytes. Time: 1.17 seconds called/total parents index %time self descendents called+self name index called/total children 0.44 0.72 1/1 .__start [2] [1] 99.1 0.44 0.72 1 .main [1] 0.59 0.13 1024/1024 .fft [3] 0.00 0.00 256/256 .cos [6] 0.00 0.00 256/256 .sin [7] 0.00 0.00 8/8 .gettimeofday [11] 0.00 0.00 7/7 .printf [16] 0.00 0.00 1/1 .atoi [31] 0.00 0.00 1/1 .exit [33]
xprofiler • X Windows profiler based on gprof • Compile and run the program as you would with gprof • Run: • xprofiler <program> gmon.out • Provides a graphical representation of the program execution
mpiP • Compile an MPI program with –g: • mpcc -g <program>.c –o <program> -L/usr/local/tools/mpiP/lib -lmpiP -lbfd • Run the MPI program as usual • A file is created called <program>.N.XXXXX.mpiP • Where N is the number of processors and XXXXX is the collector task processor id
Sample output from mpiP @ mpiP @ Command : sphot @ Version : 0.9 @ Build date : Mar 8 2001, 16:22:46 @ Start time : 2001 04 11 16:04:23 @ Stop time : 2001 04 11 16:04:51 @ Number of tasks : 4 @ Collector Rank : 0 @ Collector PID : 30088 @ Event Buffer Size : 100000 @ Final Trace Dir : . @ Local Trace Dir : /usr/tmp @ Task Map : 0 blue333.pacific.llnl.gov 0 @ Task Map : 1 blue334.pacific.llnl.gov 0 @ Task Map : 2 blue335.pacific.llnl.gov 0 @ Task Map : 3 blue336.pacific.llnl.gov 0
Sample output from mpiP ---------------------------------------------------------------- @--- MPI Time (seconds) ---------------------------------------- ---------------------------------------------------------------- Task AppTime MPITime MPI% 0 27.9 7.18 25.73 1 27.9 7.5 26.89 2 27.9 7.78 27.90 3 27.9 7.73 27.72 * 112 30.2 27.06
Sample output from mpiP ---------------------------------------------------------- @--- Callsites: 38 --------------------------------------- ---------------------------------------------------------- ID MPICall ParentFunction Filename Line PC 1 Barrier copyglob copyglob.f 65 10000b9c 2 Barrier copypriv@OL@1 copypriv.f 195 10001cd4 3 Barrier copypriv@OL@2 copypriv.f 237 1000213c 4 Barrier copypriv@OL@3 copypriv.f 279 10002624 5 Barrier copypriv@OL@4 copypriv.f 324 10002b04 6 Barrier sphot sphot.f 269 10008f2c 7 Bcast rdopac rdopac.f 49 10008638 8 Comm_rank copyglob copyglob.f 13 100003a8 9 Comm_rank copypriv copypriv.f 75 10000c38 10 Comm_rank genxsec genxsec.f 37 1000503c 11 Comm_rank rdinput rdinput.f 17 100071d4 …
Sample output from mpiP ----------------------------------------------------------------- @--- Aggregate Time (top twenty, descending, milliseconds) ------ ----------------------------------------------------------------- Call Site Time App% MPI% Bcast 7 1.54e+04 13.79 50.95 Barrier 1 1.42e+04 12.73 47.03 Barrier 2 563 0.50 1.87 Waitall 34 25.7 0.02 0.09 Reduce 25 7.4 0.01 0.02 Barrier 5 2.54 0.00 0.01 Barrier 6 1.55 0.00 0.01 Barrier 4 1.44 0.00 0.00 Comm_rank 13 1.22 0.00 0.00 Barrier 3 1.01 0.00 0.00 Comm_rank 9 0.967 0.00 0.00 …
Sample output from mpiP ------------------------------------------------------------------------- @--- Callsite statistics (all, milliseconds): 102 ----------------------- ------------------------------------------------------------------------- Name Site Rank Count Max Mean Min App% MPI% Barrier 1 0 1 0.087 0.087 0.087 0.00 0.00 Barrier 1 1 1 12.7 12.7 12.7 0.05 0.17 Barrier 1 2 1 7.09e+03 7.09e+03 7.09e+03 25.44 91.17 Barrier 1 3 1 7.09e+03 7.09e+03 7.09e+03 25.44 91.75 Barrier 1 * 4 7.09e+03 3.55e+03 0.087 12.73 47.03 Barrier 2 0 1 0.12 0.12 0.12 0.00 0.00 Barrier 2 1 1 0.29 0.29 0.29 0.00 0.00 Barrier 2 2 1 307 307 307 1.10 3.95 Barrier 2 3 1 255 255 255 0.92 3.31 Barrier 2 * 4 307 141 0.12 0.50 1.87 Send 31 1 1 0.169 0.169 0.169 0.00 0.00 Send 31 2 1 0.341 0.341 0.341 0.00 0.00 Send 31 3 1 0.184 0.184 0.184 0.00 0.00 Send 31 * 3 0.341 0.231 0.169 0.00 0.00 ...