Code Tuning and Optimization
Code Tuning and Optimization Doug Sondak sondak@bu.edu Boston University Scientific Computing and Visualization
Information Services & Technology
Outline
• Introduction
• Example code
• Timing
• Profiling
• Cache
• Tuning
Introduction
• Timing
  • Where is most time being used?
• Tuning
  • How to speed it up
  • Often as much art as science
• Parallel Performance
  • How to assess how well parallelization is working
Example Code
• Simulation of response of eye to stimuli
  • Response is affected by adjacent inputs
  • A dark area next to a bright area makes the bright area look brighter
• Based on Grossberg & Todorovic paper
  • Appendix in paper contains all equations
  • Errors in eqns. (A4) and (A5): cross out “log2”
• Paper contains 6 levels of response
  • Our code only contains levels 1 through 5
  • Level 6 takes a long time to compute, and would skew our timings!
Example Code (cont’d)
• All calculations done on a square array
• Array size and other constants are defined in gt.h (C) or in the “mods” module at the top of the code (Fortran)
• Due to nature of algorithm, array is padded on all sides
  • npad is size of padding
Example Code - Level 1
• Luminance (input) distribution
• Paper (and code) use “yin-yang square” with a “bright” and a “dark” region (Fig. 4 in paper)
• Array I
  • magnitude of “bright” is ihigh
  • magnitude of “dark” is ilow
Example Code - Level 2
• Circular Concentric On and Off Units (Fig. 5 in paper)
• Excitation and inhibition vary with distance
Level 2 Equations
• Ipq = initial input (yin-yang)
Example Code - Level 3
• Oriented Direction-of-Contrast-Sensitive Units (Fig. 6(d) in paper)
• Respond to angle
  • 12 discrete angles
• Respond to direction of contrast, i.e., light-to-dark or dark-to-light
Level 3 Equations
Example Code - Level 4
• Oriented Direction-of-Contrast-Insensitive Units (Fig. 8(a) in paper)
• Respond to angle
• Do not respond to direction of contrast, i.e., light-to-dark or dark-to-light
Level 4 Equations
Example Code - Level 5
• Boundary Contour Units (Fig. 8(d) in paper)
• Pool nearby excitations
Level 5 Equation
Timing
• When tuning/parallelizing a code, need to assess effectiveness of your efforts
• Can time whole code and/or specific sections
• Some types of timers
  • unix time command
  • function/subroutine calls
  • profiler
CPU Time or Wall-Clock Time?
• CPU time
  • How much time the CPU is actually crunching away
  • User CPU time
    • Time spent executing your source code
  • System CPU time
    • Time spent in system calls such as I/O
• Wall-clock time
  • What you would measure with a stopwatch
CPU Time or Wall-Clock Time? (cont’d)
• Both are useful
• For serial runs without interaction from keyboard, CPU and wall-clock times are usually close
  • If you prompt for keyboard input, wall-clock time will accumulate if you get a cup of coffee, but CPU time will not
CPU Time or Wall-Clock Time? (3)
• Parallel runs
  • Want wall-clock time, since CPU time will be about the same or even increase as number of procs. is increased
  • Wall-clock time may not be accurate if sharing processors
    • Wall-clock timings should always be performed in batch mode
Unix Time Command
• easiest way to time code
• simply type time before your run command
• output differs between C-type shells (csh, tcsh) and Bourne-type shells (bsh, bash, ksh)
Unix Time Command (cont’d)
    twister:~ % time mycode
    1.570u 0.010s 0:01.77 89.2% 75+1450k 0+0io 64pf+0w
Fields, left to right:
• 1.570u: user CPU time (s)
• 0.010s: system CPU time (s)
• 0:01.77: wall-clock time (s)
• 89.2%: (u+s)/wc
• 75+1450k: avg. shared + unshared text space
• 0+0io: input + output operations
• 64pf+0w: page faults + no. times proc. was swapped
Unix Time Command (3)
Bourne shell results:
    $ time mycode
    Real    1.62    (wall-clock time, s)
    User    1.57    (user CPU time, s)
    System  0.03    (system CPU time, s)
Exercise 1
• Copy files from /scratch/sondak/gt
    cp /scratch/sondak/gt/* .
• Choose C (gt.c) or Fortran (gt.f90)
• Compile with no optimization (in -O0 the characters are capital oh, zero; in -o it is a small oh):
    pgcc -O0 -o gt gt.c
    pgf90 -O0 -o gt gt.f90
• Submit rungt script to batch queue
    qsub rungt
Exercise 1 (cont’d)
• Check status
    qstat -u username
• After run has completed a file will appear named rungt.o??????, where ?????? represents the process number
• File contains result of time command
• Write down wall-clock time
• Re-compile using -O3
• Re-run and check time
Function/Subroutine Calls
• often need to time part of code
• timers can be inserted in source code
• language-dependent
cpu_time
• intrinsic subroutine in Fortran
• returns user CPU time (in seconds)
  • no system time is included
• 0.01 sec. resolution on p-series

    real :: t1, t2
    call cpu_time(t1)
    ... do stuff to be timed ...
    call cpu_time(t2)
    print*, 'CPU time = ', t2-t1, ' sec.'
system_clock
• intrinsic subroutine in Fortran
• good for measuring wall-clock time
• on p-series:
  • resolution is 0.01 sec.
  • max. time is 24 hr.
system_clock (cont’d)

    integer :: t1, t2, count_rate
    call system_clock(t1, count_rate)
    ... do stuff to be timed ...
    call system_clock(t2)
    print*, 'wall-clock time = ', &
        real(t2-t1)/real(count_rate), ' sec'

• t1 and t2 are tic counts
• count_rate is optional argument containing tics/sec.
times
• can be called from C to obtain CPU time
• 0.01 sec. resolution on p-series
• can also get system time with tms_stime

    #include <stdio.h>
    #include <sys/times.h>
    #include <unistd.h>

    int main(void) {
        long tics_per_sec;
        clock_t tic1, tic2;
        struct tms timedat;

        tics_per_sec = sysconf(_SC_CLK_TCK);   /* clock tics per second */
        times(&timedat);
        tic1 = timedat.tms_utime;              /* user CPU time so far, in tics */
        /* ... do stuff to be timed ... */
        times(&timedat);
        tic2 = timedat.tms_utime;
        printf("CPU time = %5.2f\n",
               (double)(tic2 - tic1) / (double)tics_per_sec);
        return 0;
    }
gettimeofday
• can be called from C to obtain wall-clock time
• msec resolution on p-series

    #include <stdio.h>
    #include <sys/time.h>

    int main(void) {
        struct timeval t;
        double t1, t2;

        gettimeofday(&t, NULL);
        t1 = t.tv_sec + 1.0e-6*t.tv_usec;   /* seconds + microseconds */
        /* ... do stuff to be timed ... */
        gettimeofday(&t, NULL);
        t2 = t.tv_sec + 1.0e-6*t.tv_usec;
        printf("wall-clock time = %5.3f\n", t2-t1);
        return 0;
    }
MPI_Wtime
• convenient wall-clock timer for MPI codes
• msec resolution on p-series
MPI_Wtime (cont’d)
Fortran:

    double precision t1, t2
    t1 = mpi_wtime()
    ... do stuff to be timed ...
    t2 = mpi_wtime()
    print*, 'wall-clock time = ', t2-t1

C:

    double t1, t2;
    t1 = MPI_Wtime();
    /* ... do stuff to be timed ... */
    t2 = MPI_Wtime();
    printf("wall-clock time = %5.3f\n", t2-t1);
omp_get_wtime
• convenient wall-clock timer for OpenMP codes
• resolution available by calling omp_get_wtick()
• 0.01 sec. resolution on p-series
omp_get_wtime (cont’d)
Fortran:

    double precision t1, t2, omp_get_wtime
    t1 = omp_get_wtime()
    ... do stuff to be timed ...
    t2 = omp_get_wtime()
    print*, 'wall-clock time = ', t2-t1

C:

    double t1, t2;
    t1 = omp_get_wtime();
    /* ... do stuff to be timed ... */
    t2 = omp_get_wtime();
    printf("wall-clock time = %5.3f\n", t2-t1);
Timer Summary
Exercise 2
• Put wall-clock timer around each “level” in the example code
• Print time for each level
• Compile and run
Profiling
Profilers
• profile tells you how much time is spent in each routine
• gives a level of granularity not available with previous timers
  • e.g., function may be called from many places
• various profilers available, e.g.
  • gprof (GNU)
  • pgprof (Portland Group)
  • Xprofiler (AIX)
gprof
• compile with -pg
• file gmon.out will be created when you run
• gprof executable > myprof
• for multiple procs. (MPI), copy or link gmon.out.n to gmon.out, then run gprof
gprof (cont’d)
ngranularity: Each sample hit covers 4 bytes. Time: 435.04 seconds

    %  cumulative      self                self    total
 time     seconds   seconds      calls  ms/call  ms/call  name
 20.5       89.17     89.17         10  8917.00 10918.00  .conduct [5]
  7.6      122.34     33.17        323   102.69   102.69  .getxyz [8]
  7.5      154.77     32.43                               .__mcount [9]
  7.2      186.16     31.39     189880     0.17     0.17  .btri [10]
  7.2      217.33     31.17                               .kickpipes [12]
  5.1      239.58     22.25  309895200     0.00     0.00  .rmnmod [16]
  2.3      249.67     10.09        269    37.51    37.51  .getq [24]
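As a check on how the columns of this flat profile relate, the .conduct and .getxyz rows can be recomputed from the numbers shown. This is a worked illustration using only values from the listing:

```latex
\begin{align*}
\%\,\text{time (.conduct)} &= 100 \times \frac{89.17}{435.04} \approx 20.5 \\
\text{self ms/call (.conduct)} &= 1000 \times \frac{89.17}{10} = 8917.00 \\
\text{self ms/call (.getxyz)} &= 1000 \times \frac{33.17}{323} \approx 102.69 \\
\text{cumulative seconds (.getxyz)} &= 89.17 + 33.17 = 122.34
\end{align*}
```

So "self seconds" is the time spent inside the routine itself, "cumulative seconds" is a running total down the list, and the per-call columns are the corresponding times divided by the number of calls.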
gprof (3)
ngranularity: Each sample hit covers 4 bytes. Time: 435.04 seconds

                                  called/total      parents
index  %time    self descendents  called+self    name          index
                                  called/total      children

                0.00      340.50      1/1        .__start [2]
[1]     78.3    0.00      340.50      1          .main [1]
                2.12      319.50     10/10       .contrl [3]
                0.04        7.30     10/10       .force [34]
                0.00        5.27      1/1        .initia [40]
                0.56        3.43      1/1        .plot3da [49]
                0.00        1.27      1/1        .data [73]
pgprof
• compile with Portland Group compiler
  • pgf90 (pgf95, etc.)
  • pgcc
  • -Mprof=func
    • similar to -pg
• run code
• pgprof -exe executable
• pops up window with flat profile
pgprof (cont’d)
pgprof (3)
• To save profile data to a file:
  • re-run pgprof using the -text flag
  • at command prompt type p > filename
    • filename is the name you want to give the profile file
  • type quit to get out of profiler
Exercise 3
• Use pgprof to profile code
  • compile using -Mprof=func
  • run code
  • create profile using pgprof -exe gt
• Note which routines use most time
• Please close pgprof when you’re through
  • Leaving window open ties up a license
Line-Level Profiling
• Times individual lines
• For pgprof, compile with the flag -Mprof=line
• Optimizer will re-order lines
  • profiler will lump lines in some loops or other constructs
  • may or may not want to compile without optimization
• In flat profile, double-click on function to get line-level data
Line-Level Profiling (cont’d)
Exercise 4
• Compile code with -Mprof=line and -O0, then run
  • will take about 5 minutes to run due to overhead from line-level profiling and lack of optimization
• Examine line-level profile for most time-consuming routine
  • Note lines with longest time consumption
• Save your profile data to a file (we will need it later)
  • re-run pgprof using the -text flag
  • at command prompt type p > prof