This document presents parallel programming strategies for computing the sum of squares of the first n integers using both the Message Passing Interface (MPI) and OpenMP. It walks through example code showing how a serial program is parallelised with each approach: the MPI implementation distributes the work across multiple processes, while the OpenMP version uses shared memory on multi-core systems. Key topics include profiling to identify computational hotspots, reduction operations for the accumulation, and tuning the code for performance.
么石磊, Ph.D., Asia Pacific Technical Network (APTN), stone@beijing.sgi.com
[Figure: shared-memory data decomposition. One domain is split into N sub-domains, and each of N threads works on its own sub-domain of the data.]
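The decomposition idea can also be written out explicitly. The following is a minimal sketch (not from the original slides; the array name and size are illustrative): each OpenMP thread computes the bounds of its own sub-domain and updates only that slice of a shared array.

      program domain_decomposition
      integer m
      parameter (m=1000000)
      real*8 a(m)
      integer i, lo, hi, chunk, tid, nthreads
      integer omp_get_thread_num, omp_get_num_threads
!$OMP PARALLEL PRIVATE(i,lo,hi,chunk,tid,nthreads)
C     each thread works out the bounds of its own sub-domain
      nthreads = omp_get_num_threads()
      tid = omp_get_thread_num()
      chunk = m/nthreads
      lo = tid*chunk + 1
      hi = (tid+1)*chunk
      if (tid .eq. nthreads-1) hi = m
C     ... and touches only its own slice of the shared array
      do i = lo, hi
         a(i) = dble(i)**2
      end do
!$OMP END PARALLEL
      print *, 'a(1) =', a(1), '  a(m) =', a(m)
      end

Each thread's lo:hi range plays the role of one sub-domain in the figure; no two threads write to the same element, so no synchronisation is needed inside the parallel region.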
[Slide: the Auto-Parallelizing Option (APO), selected with the -apo compiler option, versus hand-inserted OpenMP directives using the C$OMP sentinel.]
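With APO the compiler inserts the parallelisation itself; the alternative is to add directives by hand. A minimal sketch of a hand-inserted fixed-form directive (the subroutine and its arguments are illustrative, not from the original slides):

      subroutine scale(a, n, s)
C     multiply every element of a by s; the iterations are
C     independent, so the loop can be shared across threads
      integer n, i
      real*8 a(n), s
C$OMP PARALLEL DO PRIVATE(i) SHARED(a,n,s)
      do i = 1, n
         a(i) = s*a(i)
      end do
      end

The C$OMP and !$OMP sentinels are equivalent in fixed-form source; the sum-of-squares example below uses the same construct with a reduction clause added.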
Example for Parallelisation

      program sum_squares
      integer I, n
      real*8 sum
      print *, 'enter n:'
      read *, n
      sum = 0
      do I = 1, n
         sum = sum + dble(I)**2
      end do
      print *, 'sum of first', n, ' squares =', sum
      end
Rewriting the Parallelisation Code

Message Passing Parallel Programming (distributed memory, with the added message-passing code):

      program sum_squares
      include 'mpif.h'
      integer I, n, myid, numprocs, ierr
      real*8 my_sum, sum
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, numprocs, ierr)
C     only rank 0 reads the input, then broadcasts it to all ranks
      if (myid .eq. 0) then
         print *, 'enter n:'
         read *, n
      endif
      call MPI_BCAST(n, 1, MPI_INTEGER, 0,
     &               MPI_COMM_WORLD, ierr)
C     each rank sums its own block of the range 1..n
      my_sum = 0
      do I = 1+myid*n/numprocs, (myid+1)*n/numprocs
         my_sum = my_sum + dble(I)**2
      end do
C     combine the partial sums on rank 0
      call MPI_REDUCE(my_sum, sum, 1, MPI_DOUBLE_PRECISION,
     &                MPI_SUM, 0, MPI_COMM_WORLD, ierr)
      if (myid .eq. 0) then
         print *, 'sum of first', n, ' squares =', sum
      endif
      call MPI_FINALIZE(ierr)
      end

Shared Memory Parallel Programming (with the added OpenMP directive):

      program sum_squares
      integer I, n
      real*8 sum
      print *, 'enter n:'
      read *, n
      sum = 0
C     the reduction clause gives each thread a private copy of sum
C     and combines the copies at the end of the loop
!$OMP PARALLEL DO reduction(+:sum)
      do I = 1, n
         sum = sum + dble(I)**2
      end do
      print *, 'sum of first', n, ' squares =', sum
      end
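As a usage note (the exact commands depend on the local toolchain and are assumptions, not part of the original slides): the MPI version is typically compiled with an MPI wrapper such as mpif77 or mpif90 and launched with something like mpirun -np 4 a.out, where -np sets the number of processes; the OpenMP version needs the compiler's OpenMP switch (for example gfortran -fopenmp, or -mp on the SGI MIPSpro compilers) and takes its thread count from the OMP_NUM_THREADS environment variable.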
Parallelisation Strategy
• Port the code if necessary, and make sure it gives the correct answer.
• Profile the code; work only on the loops and routines that use a large percentage of the CPU time (think about coarse-grained parallelism). A sample profiling workflow is noted after this list.
• Use APO and be done(!?), or
• hand-tune for loop-level parallelisation.
• To further improve performance, look for coarse-grained parallelism opportunities.
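As one way to carry out the profiling step (assuming a GNU toolchain; the original SGI environment shipped its own profiling tools), compile with gfortran -pg, run the program once on a representative input to produce gmon.out, and then run gprof on the executable. The flat profile lists routines by CPU time, which identifies the loops worth parallelising first.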