ISCM-10 High Performance Computing for Computational Mechanics Moshe Goldberg March 29, 2001 Taub Computing Center
High Performance Computing for CM Agenda: • Overview • Alternative Architectures • Message Passing • “Shared Memory” • Case Study
High Performance Computing - Overview
Some Important Points • Understanding HPC concepts • Why should programmers care about the architecture? • Do compilers make the right choices? • Nowadays, there are alternatives
Trends in computer development • Speed of calculation is steadily increasing • Memory speed may not keep pace with calculation speed • Workstations are approaching the speeds of specially designed machines • Are we approaching the limit imposed by the speed of light? • To get an answer faster, we must perform calculations in parallel
Some HPC concepts * HPC * HPF / Fortran90 * cc-NUMA * Compiler directives * OpenMP * Message passing * PVM/MPI * Beowulf
IUCC (Machba) computers (Mar 2001) • Origin2000: 112 CPUs (R12000, 400 MHz), 28.7 GB total memory • PC cluster: 64 CPUs (Pentium III, 550 MHz), 9 GB total memory • Cray J90: 32 CPUs, 4 GB memory (500 MW)
Symmetric Multiprocessors (SMP) [Diagram: several CPUs connected through a memory bus to a single shared memory] Examples: SGI Power Challenge, Cray J90/T90
Distributed Parallel Computing [Diagram: each CPU has its own local memory; CPUs communicate over a network] Examples: SP2, Beowulf
MPI commands -- examples
      call MPI_SEND(sum, 1, MPI_REAL, ito, itag,
     &              MPI_COMM_WORLD, ierror)
      call MPI_RECV(sum, 1, MPI_REAL, ifrom, itag,
     &              MPI_COMM_WORLD, istatus, ierror)
Some basic MPI functions • Setup: mpi_init, mpi_finalize • Environment: mpi_comm_size, mpi_comm_rank • Communication: mpi_send, mpi_recv • Synchronization: mpi_barrier
Other important MPI functions • Asynchronous communication: mpi_isend, mpi_irecv, mpi_iprobe, mpi_wait, mpi_test • Collective communication: mpi_barrier, mpi_bcast, mpi_gather, mpi_scatter, mpi_reduce, mpi_allreduce • Derived data types: mpi_type_contiguous, mpi_type_vector, mpi_type_indexed, mpi_type_pack, mpi_type_commit, mpi_type_free • Creating communicators: mpi_comm_dup, mpi_comm_split, mpi_intercomm_create, mpi_comm_free
Fortran directives -- examples
CRAY:
CMIC$ DO ALL
      do i=1,n
        a(i)=i
      enddo
SGI:
C$DOACROSS
      do i=1,n
        a(i)=i
      enddo
OpenMP:
C$OMP parallel do
      do i=1,n
        a(i)=i
      enddo
OpenMP Summary • OpenMP standard first published Oct 1997 • Directives • Run-time library routines • Environment variables • Versions for Fortran 77, Fortran 90, C, C++
OpenMP Summary: Parallel Do Directive
c$omp parallel do private(I) shared(a)
      do I=1,n
        a(I)=I+1
      enddo
c$omp end parallel do      (the end directive is optional)
OpenMP Summary: Defining a Parallel Region - Individual Do Loops
c$omp parallel shared(a,b)
c$omp do private(j)
      do j=1,n
        a(j)=j
      enddo
c$omp end do nowait
c$omp do private(k)
      do k=1,n
        b(k)=k
      enddo
c$omp end do
c$omp end parallel
OpenMP Summary Parallel Do Directive - Clauses shared private default(private|shared|none) reduction({operator|intrinsic}:var) if(scalar_logical_expression) ordered copyin(var)
OpenMP Summary Run-Time Library Routines Execution environment omp_set_num_threads omp_get_num_threads omp_get_max_threads omp_get_thread_num omp_get_num_procs omp_set_dynamic/omp_get_dynamic omp_set_nested/omp_get_nested
OpenMP Summary Run-Time Library Routines Lock routines omp_init_lock omp_destroy_lock omp_set_lock omp_unset_lock omp_test_lock
OpenMP Summary Environment Variables OMP_NUM_THREADS OMP_DYNAMIC OMP_NESTED
RISC memory levels - Single CPU [Diagram: CPU → Cache → Main memory]
RISC memory levels - Multiple CPUs [Diagram: CPU 0 → Cache 0 and CPU 1 → Cache 1, both connected to a shared Main memory]
A sample program
      subroutine xmult(x1,x2,y1,y2,z1,z2,n)
      real x1(n),x2(n),y1(n),y2(n),z1(n),z2(n)
      real a,b,c,d
      do i=1,n
        a=x1(i)*x2(i); b=y1(i)*y2(i)
        c=x1(i)*y2(i); d=x2(i)*y1(i)
        z1(i)=a-b; z2(i)=c+d
      enddo
      end
A sample program - with OpenMP directive
      subroutine xmult(x1,x2,y1,y2,z1,z2,n)
      real x1(n),x2(n),y1(n),y2(n),z1(n),z2(n)
      real a,b,c,d
c$omp parallel do
      do i=1,n
        a=x1(i)*x2(i); b=y1(i)*y2(i)
        c=x1(i)*y2(i); d=x2(i)*y1(i)
        z1(i)=a-b; z2(i)=c+d
      enddo
      end
A sample program - Run on Technion Origin2000 • Vector length = 1,000,000 • Loop repeated 50 times • Compiler optimization: low (-O1)
Elapsed time, sec, by number of threads:
Compile        1     2     4
No parallel   15.0  15.3
Parallel      16.0  26.0  26.8
Is this running in parallel? WHY NOT?
A sample program - Is this running in parallel? WHY NOT?
c$omp parallel do
      do i=1,n
        a=x1(i)*x2(i); b=y1(i)*y2(i)
        c=x1(i)*y2(i); d=x2(i)*y1(i)
        z1(i)=a-b; z2(i)=c+d
      enddo
Answer: by default, the variables a,b,c,d are SHARED, so every thread reads and writes the same copies and the results are corrupted (and the threads serialize on them)
A sample program - Solution: define a,b,c,d as PRIVATE:
c$omp parallel do private(a,b,c,d)
Elapsed time, sec, by number of threads:
Compile        1     2     4
No parallel   15.0  15.3
Parallel      16.0   8.5   4.6
This is now running in parallel