
Performance Analysis Tools


cwhitney



Presentation Transcript


  1. Performance Analysis Tools

  2. Performance Analysis Goal
  • Once we have a working parallel program, we want to tune it to run faster.
  • Hot spot – an area of code that uses a significant amount of CPU time.
  • Bottleneck – an area of code that uses resources inefficiently and slows the program down (e.g. communication).

  3. Timers
  • One way to identify hot spots and bottlenecks is to use timers.
  • We have used timers to measure the elapsed time of an entire algorithm, but they can also measure the time spent in different parts of the algorithm.

  4. Timers

  5. Time command
  • Usage: time mpirun -np # command
  • Result:
      real    0m1.071s
      user    0m0.177s
      sys     0m0.045s

  6. Time command
  • Meaning:
  • Real time: the total wall-clock (start to finish) time your program took to load, execute, and exit.
  • User time: the total amount of CPU time your program took to execute.
  • System time: the amount of CPU time spent on operating system calls made while executing your program.

  7. gettimeofday
  • gettimeofday is a system call that fills in a structure giving the time since the Epoch (January 1, 1970):
      int gettimeofday(struct timeval *tv, struct timezone *tz);
  • The timeval structure has seconds and microseconds:
      struct timeval {
          time_t      tv_sec;   /* seconds */
          suseconds_t tv_usec;  /* microseconds */
      };

  8. gettimeofday
  • Usage:
      #include <sys/time.h>
      struct timeval tv1, tv2;
      ...
      gettimeofday(&tv1, NULL);
      ... // Work to be timed
      gettimeofday(&tv2, NULL);
      // Convert time to seconds
      elapsed_time = (tv2.tv_sec - tv1.tv_sec) +
                     ((tv2.tv_usec - tv1.tv_usec) / 1000000.0);

  9. MPI_Wtime
  • Returns a single double-precision value: the number of seconds since some arbitrary time in the past (often the Epoch on Unix systems).
  • MPI also provides an MPI_Wtick() routine that returns the timer resolution in seconds (e.g. 1.0e-6 for microsecond resolution).

  10. MPI_Wtime
  • Usage:
      #include "mpi.h"
      ...
      double start, end, resolution;
      MPI_Init(&argc, &argv);
      start = MPI_Wtime();  /* start time */
      ...                   /* Work to be timed */
      end = MPI_Wtime();    /* end time */
      resolution = MPI_Wtick();
      printf("elapsed= %e resolution= %e\n", end - start, resolution);

  11. MPI_Wtime
  • Sample output:
      Wallclock times (secs): start= 1384864000.067529 end= 1384864000.074005
      elapsed= 6.475925e-03 resolution= 1.000000e-06
  • Elapsed time is in seconds, accurate to microseconds.

  12. read_real_time
  • read_real_time is an IBM AIX system call that returns a structure giving the time since the Epoch (January 1, 1970):
      int read_real_time(timebasestruct_t *t, size_t size_of_timebasestruct_t);
  • Designed to measure time accurately to nanoseconds.
  • Guarantees consistent time units across different IBM RS/6000 architectures.

  13. read_real_time
  • Usage:
      #include <sys/time.h>
      ...
      timebasestruct_t start, finish;
      int secs, n_secs;
      read_real_time(&start, TIMEBASE_SZ);
      /* do some work */
      read_real_time(&finish, TIMEBASE_SZ);
      /* Make sure both values are in seconds and nanoseconds */
      time_base_to_time(&start, TIMEBASE_SZ);
      time_base_to_time(&finish, TIMEBASE_SZ);

  14. read_real_time
  • Usage continued:
      ...
      /* Subtract the starting time from the ending time */
      secs   = finish.tb_high - start.tb_high;
      n_secs = finish.tb_low  - start.tb_low;
      /* Fix carry from low-order to high-order during the measurement */
      if (n_secs < 0) {
          secs--;
          n_secs += 1000000000;
      }
      printf("Sample time was %d seconds %d nanoseconds\n", secs, n_secs);

  15. Profilers

  16. prof
  • Compile your program with the -p option:
      gcc -p <program>.c -o <program>
  • Run the program; a profile file called mon.out is created.
  • Run:
      prof -m mon.out

  17. Sample Output from prof

  18. gprof
  • Compile your program with the -pg option:
      gcc -pg <program>.c -o <program>
  • Run the program; a profile file called gmon.out is created.
  • Run:
      gprof <program> gmon.out

  19. Sample Output from gprof
      ngranularity: Each sample hit covers 4 bytes. Time: 1.17 seconds

                                          called/total    parents
      index  %time  self  descendents    called+self   name        index
                                          called/total    children

                     0.44     0.72          1/1        .__start [2]
      [1]    99.1    0.44     0.72          1          .main [1]
                     0.59     0.13       1024/1024     .fft [3]
                     0.00     0.00        256/256      .cos [6]
                     0.00     0.00        256/256      .sin [7]
                     0.00     0.00          8/8        .gettimeofday [11]
                     0.00     0.00          7/7        .printf [16]
                     0.00     0.00          1/1        .atoi [31]
                     0.00     0.00          1/1        .exit [33]

  20. xprofiler
  • An X Windows profiler based on gprof.
  • Compile and run the program as you would with gprof.
  • Run:
      xprofiler <program> gmon.out
  • Provides a graphical representation of the program execution.

  21. Library View

  22. Function View

  23. mpiP
  • Compile an MPI program with -g:
      mpcc -g <program>.c -o <program> -L/usr/local/tools/mpiP/lib -lmpiP -lbfd
  • Run the MPI program as usual.
  • A file is created called <program>.N.XXXXX.mpiP, where N is the number of processors and XXXXX is the collector task processor id.

  24. Sample output from mpiP
      @ mpiP
      @ Command           : sphot
      @ Version           : 0.9
      @ Build date        : Mar 8 2001, 16:22:46
      @ Start time        : 2001 04 11 16:04:23
      @ Stop time         : 2001 04 11 16:04:51
      @ Number of tasks   : 4
      @ Collector Rank    : 0
      @ Collector PID     : 30088
      @ Event Buffer Size : 100000
      @ Final Trace Dir   : .
      @ Local Trace Dir   : /usr/tmp
      @ Task Map          : 0 blue333.pacific.llnl.gov 0
      @ Task Map          : 1 blue334.pacific.llnl.gov 0
      @ Task Map          : 2 blue335.pacific.llnl.gov 0
      @ Task Map          : 3 blue336.pacific.llnl.gov 0

  25. Sample output from mpiP
      ----------------------------------------------------------------
      @--- MPI Time (seconds) ----------------------------------------
      ----------------------------------------------------------------
      Task   AppTime   MPITime   MPI%
         0      27.9      7.18  25.73
         1      27.9      7.5   26.89
         2      27.9      7.78  27.90
         3      27.9      7.73  27.72
         *       112     30.2   27.06

  26. Sample output from mpiP
      ----------------------------------------------------------
      @--- Callsites: 38 ---------------------------------------
      ----------------------------------------------------------
      ID  MPICall    ParentFunction  Filename    Line  PC
       1  Barrier    copyglob        copyglob.f    65  10000b9c
       2  Barrier    copypriv@OL@1   copypriv.f   195  10001cd4
       3  Barrier    copypriv@OL@2   copypriv.f   237  1000213c
       4  Barrier    copypriv@OL@3   copypriv.f   279  10002624
       5  Barrier    copypriv@OL@4   copypriv.f   324  10002b04
       6  Barrier    sphot           sphot.f      269  10008f2c
       7  Bcast      rdopac          rdopac.f      49  10008638
       8  Comm_rank  copyglob        copyglob.f    13  100003a8
       9  Comm_rank  copypriv        copypriv.f    75  10000c38
      10  Comm_rank  genxsec         genxsec.f     37  1000503c
      11  Comm_rank  rdinput         rdinput.f     17  100071d4
      …

  27. Sample output from mpiP
      -----------------------------------------------------------------
      @--- Aggregate Time (top twenty, descending, milliseconds) ------
      -----------------------------------------------------------------
      Call       Site      Time   App%   MPI%
      Bcast         7  1.54e+04  13.79  50.95
      Barrier       1  1.42e+04  12.73  47.03
      Barrier       2       563   0.50   1.87
      Waitall      34      25.7   0.02   0.09
      Reduce       25       7.4   0.01   0.02
      Barrier       5      2.54   0.00   0.01
      Barrier       6      1.55   0.00   0.01
      Barrier       4      1.44   0.00   0.00
      Comm_rank    13      1.22   0.00   0.00
      Barrier       3      1.01   0.00   0.00
      Comm_rank     9     0.967   0.00   0.00
      …

  28. Sample output from mpiP
      -------------------------------------------------------------------------
      @--- Callsite statistics (all, milliseconds): 102 -----------------------
      -------------------------------------------------------------------------
      Name     Site  Rank  Count       Max      Mean       Min   App%   MPI%
      Barrier     1     0      1     0.087     0.087     0.087   0.00   0.00
      Barrier     1     1      1      12.7      12.7      12.7   0.05   0.17
      Barrier     1     2      1  7.09e+03  7.09e+03  7.09e+03  25.44  91.17
      Barrier     1     3      1  7.09e+03  7.09e+03  7.09e+03  25.44  91.75
      Barrier     1     *      4  7.09e+03  3.55e+03     0.087  12.73  47.03
      Barrier     2     0      1      0.12      0.12      0.12   0.00   0.00
      Barrier     2     1      1      0.29      0.29      0.29   0.00   0.00
      Barrier     2     2      1       307       307       307   1.10   3.95
      Barrier     2     3      1       255       255       255   0.92   3.31
      Barrier     2     *      4       307       141      0.12   0.50   1.87
      Send       31     1      1     0.169     0.169     0.169   0.00   0.00
      Send       31     2      1     0.341     0.341     0.341   0.00   0.00
      Send       31     3      1     0.184     0.184     0.184   0.00   0.00
      Send       31     *      3     0.341     0.231     0.169   0.00   0.00
      ...

  29. Questions
