This article explores the fundamental concepts of shared and distributed memory architectures in parallel computing. It covers various systems, including symmetric multi-processors (SMP), distributed memory systems, and hybrid models. Highlights include the benefits and limitations of each architecture, cache coherence mechanisms in SMPs, and programming models such as OpenMP and High Performance Fortran (HPF). Real-world examples illustrate the scalability and communication approaches in these models, providing a comprehensive understanding for developers and engineers in the field.
Parallel Architecture Models
- Shared Memory: Dual/Quad Pentium, Cray T90, IBM Power3 node
- Distributed Memory: Cray T3E, IBM SP2, networks of workstations
- Distributed-Shared Memory: SGI Origin 2000, Convex Exemplar
Shared Memory Systems (SMP)
[Diagram: processors P, each with a cache c, connected by a bus to a shared memory]
- Symmetric Multi-Processor: any processor can access any memory location at equal cost
- Tasks "communicate" by writing/reading common locations
- Easier to program
- Cannot scale beyond around 30 PEs (bus bottleneck)
- Most workstation vendors make SMPs today (SGI, Sun, HP, Digital; Pentium)
- Cray Y-MP, C90, T90 (crossbar between PEs and memory)
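The "communicate by writing/reading common locations" idea can be sketched in a few lines of Python with threads sharing one address space. This is an illustrative sketch only: the names (`shared`, `producer`, `consumer`) are hypothetical, and the `threading.Event` stands in for synchronization, not for hardware cache coherence.

```python
import threading

shared = {"value": None}        # the common memory location
ready = threading.Event()       # signals that the write has happened

def producer():
    shared["value"] = 42        # task 1 writes the shared location
    ready.set()

def consumer(out):
    ready.wait()                # wait until the write is visible
    out.append(shared["value"]) # task 2 reads the same location

result = []
t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer, args=(result,))
t2.start(); t1.start()
t1.join(); t2.join()
print(result[0])                # -> 42
```

No explicit message is sent: the second task sees the value simply because both tasks address the same memory, which is exactly what distinguishes this model from message passing.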
Cache Coherence in SMPs
[Diagram: processors P, each with a cache c, connected by a bus to a shared memory]
- Each processor's cache holds the most recently accessed values
- If a word cached by multiple processors is modified, all copies must be made consistent
- Bus-based SMPs use an efficient mechanism: the snoopy bus
- The snoopy bus monitors all writes and marks the other copies invalid
- When a processor finds an invalid cache word, it fetches a fresh copy from shared memory
Distributed Memory Systems
[Diagram: each processor P has its own cache c, memory M, and network interface card (NIC); the nodes are connected by an interconnection network]
- Each processor can only access its own memory
- Explicit communication by sending and receiving messages
- More tedious to program
- Can scale to hundreds/thousands of processors
- Cache coherence is not needed
- Examples: IBM SP-2, Cray T3E, workstation clusters
Distributed Shared Memory
[Diagram: processors P with caches c and memories M, joined by an interconnection network]
- Each processor can directly access any memory location
- Physically distributed memory; many simultaneous accesses possible
- Non-uniform memory access costs
- Examples: Convex Exemplar, SGI Origin 2000
- Complex hardware and high cost of cache coherence
- Software DSM systems (e.g. TreadMarks) implement the shared-memory abstraction on top of distributed-memory systems
Parallel Programming Models
- Shared-address-space models: BSP (Bulk Synchronous Parallel model), HPF (High Performance Fortran), OpenMP
- Message passing (partitioned address space): PVM, MPI [Ch. 8 of I. Foster's book Designing and Building Parallel Programs, available online]
- Higher-level programming environments: PETSc (Portable, Extensible Toolkit for Scientific Computation), POOMA (Parallel Object-Oriented Methods and Applications)
OpenMP
- Standard sequential Fortran/C model
- Single global view of data
- Automatic parallelization by the compiler
- User can provide loop-level directives
- Easy to program
- Only available on shared-memory machines
High Performance Fortran (HPF)
- Global shared address space, similar to the sequential programming model
- User provides data-mapping directives
- User can provide information on loop-level parallelism
- Portable: available on all three types of architectures
- Compiler automatically synthesizes message-passing code where needed
- Restricted to dense arrays and regular distributions
- Performance is not consistently good
Message Passing
- Program is a collection of tasks
- Each task can only read/write its own data
- Tasks communicate data by explicitly sending/receiving messages
- Porting a sequential program requires translating from the global shared view to a local partitioned view
- Tedious to program/debug
- Very good performance
Illustrative Example (global view; a(20,20), b(20,20))

      Real a(n,n), b(n,n)
      Do k = 1, NumIter
        Do i = 2, n-1
          Do j = 2, n-1
            a(i,j) = (b(i-1,j) + b(i,j-1) + b(i+1,j) + b(i,j+1))/4
          End Do
        End Do
        Do i = 2, n-1
          Do j = 2, n-1
            b(i,j) = a(i,j)
          End Do
        End Do
      End Do
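For reference, the same nearest-neighbour averaging kernel can be transcribed into Python. This is an illustrative, 0-based sketch (the slide uses 1-based Fortran indexing and n = 20); the function name `smooth` is hypothetical.

```python
def smooth(b, num_iter):
    """Repeatedly replace each interior point by the average of its
    four nearest neighbours, mirroring the Fortran loops."""
    n = len(b)
    a = [row[:] for row in b]
    for _ in range(num_iter):
        for i in range(1, n - 1):
            for j in range(1, n - 1):
                a[i][j] = (b[i-1][j] + b[i][j-1] + b[i+1][j] + b[i][j+1]) / 4
        for i in range(1, n - 1):           # copy a back into b
            for j in range(1, n - 1):
                b[i][j] = a[i][j]
    return b

# Tiny demo: a 4x4 grid whose top boundary row is 4.0, all else 0.0
grid = [[4.0]*4] + [[0.0]*4 for _ in range(3)]
print(smooth(grid, 1)[1][1])   # -> 1.0 (only the top neighbour is nonzero)
```

The two-phase structure (compute into `a`, then copy back into `b`) is what makes every iteration read only old values, which is the property the parallel versions below must preserve.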
Example: OpenMP (global shared view of data; a(20,20), b(20,20))

      Real a(n,n), b(n,n)
c$omp parallel shared(a,b) private(i,j,k)
      Do k = 1, NumIter
c$omp do
        Do i = 2, n-1
          Do j = 2, n-1
            a(i,j) = (b(i-1,j) + b(i,j-1) + b(i+1,j) + b(i,j+1))/4
          End Do
        End Do
c$omp do
        Do i = 2, n-1
          Do j = 2, n-1
            b(i,j) = a(i,j)
          End Do
        End Do
      End Do
c$omp end parallel
Example: HPF (1D partition) -- global shared view of data; a(20,20), b(20,20) block-distributed by rows across P0..P3

      Real a(n,n), b(n,n)
chpf$ Distribute a(block,*), b(block,*)
      Do k = 1, NumIter
chpf$ independent, new(j)
        Do i = 2, n-1
          Do j = 2, n-1
            a(i,j) = (b(i-1,j) + b(i,j-1) + b(i+1,j) + b(i,j+1))/4
          End Do
        End Do
chpf$ independent, new(j)
        Do i = 2, n-1
          Do j = 2, n-1
            b(i,j) = a(i,j)
          End Do
        End Do
      End Do
Example: HPF (2D partition) -- global shared view of data; a(20,20), b(20,20) block-distributed in both dimensions

      Real a(n,n), b(n,n)
chpf$ Distribute a(block,block)
chpf$ Distribute b(block,block)
      Do k = 1, NumIter
chpf$ independent, new(j)
        Do i = 2, n-1
          Do j = 2, n-1
            a(i,j) = (b(i-1,j) + b(i,j-1) + b(i+1,j) + b(i,j+1))/4
          End Do
        End Do
chpf$ independent, new(j)
        Do i = 2, n-1
          Do j = 2, n-1
            b(i,j) = a(i,j)
          End Do
        End Do
      End Do
Message Passing: Local View
[Diagram: the global shared view a(20,20), b(20,20) is block-partitioned by rows across P0..P3; each processor holds local arrays al(5,20) and bl(5,20), the latter extended to bl(0:6,20) with ghost cells; communication is required to fill the ghost rows from neighbouring processors]
Example: Message Passing (local partitioned view with ghost cells: al(5,20), bl(0:6,20); ghost cells are communicated by message passing)

      Real al(NdivP,n), bl(0:NdivP+1,n)
      me = get_my_procnum()
      Do k = 1, NumIter
        if (me .ne. P-1) send(me+1, bl(NdivP,1:n))
        if (me .ne. 0)   recv(me-1, bl(0,1:n))
        if (me .ne. 0)   send(me-1, bl(1,1:n))
        if (me .ne. P-1) recv(me+1, bl(NdivP+1,1:n))
        if (me .eq. 0)   then i1 = 2        else i1 = 1
        if (me .eq. P-1) then i2 = NdivP-1  else i2 = NdivP
        Do i = i1, i2
          Do j = 2, n-1
            al(i,j) = (bl(i-1,j) + bl(i,j-1) + bl(i+1,j) + bl(i,j+1))/4
          End Do
        End Do
        ……...
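The ghost-cell exchange can be simulated in a single Python process, which makes the index bookkeeping easy to check without an actual message-passing library: the n rows are block-partitioned across P "ranks", each local block gets one ghost row above and below, and each send/recv pair becomes a copy of a neighbour's boundary row. The names `halo_step` and `global_step` are illustrative, and P is assumed to divide n evenly.

```python
def global_step(b):
    """One averaging sweep on the full (global-view) array."""
    n = len(b)
    out = [r[:] for r in b]
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            out[i][j] = (b[i-1][j] + b[i][j-1] + b[i+1][j] + b[i][j+1]) / 4
    return out

def halo_step(b, P):
    """The same sweep, computed on P row-blocks with ghost rows."""
    n = len(b)
    rows = n // P                     # rows per rank (assume P divides n)
    # Local blocks: index 0 and rows+1 are the ghost rows (like bl(0:NdivP+1,n))
    local = [[[0.0]*n] + [r[:] for r in b[p*rows:(p+1)*rows]] + [[0.0]*n]
             for p in range(P)]
    for me in range(P):               # "communication" phase: fill ghost rows
        if me > 0:
            local[me][0] = local[me-1][rows][:]       # recv from rank me-1
        if me < P - 1:
            local[me][rows+1] = local[me+1][1][:]     # recv from rank me+1
    out = [r[:] for r in b]
    for me in range(P):               # local computation phase
        bl = local[me]
        i1 = 2 if me == 0 else 1                 # skip global boundary rows
        i2 = rows - 1 if me == P - 1 else rows
        for i in range(i1, i2 + 1):
            for j in range(1, n - 1):
                out[me*rows + i - 1][j] = (bl[i-1][j] + bl[i][j-1]
                                           + bl[i+1][j] + bl[i][j+1]) / 4
    return out

b = [[float(i + j) for j in range(8)] for i in range(8)]
print(halo_step(b, 2) == global_step(b))   # -> True
```

The comparison against `global_step` is the key check: after the ghost rows are filled, every rank computes exactly the values the global-view loop would, which is the correctness argument behind the Fortran version above.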
Comparison of Models
- Program porting/development effort: OpenMP = HPF << MPI
- Portability across systems: HPF = MPI >> OpenMP (shared memory only)
- Applicability: MPI = OpenMP >> HPF (dense arrays only)
- Performance: MPI > OpenMP >> HPF
PETSc
- Higher-level parallel programming model
- Aims to provide both ease of use and high performance for numerical PDE solution
- Uses an efficient message-passing implementation underneath, but:
  - provides a global view of data arrays
  - the system takes care of the needed message passing
- Portable across shared- and distributed-memory systems