
Cluster OpenMP Benchmark of 64-bit PC Cluster
Yao-Huan Tseng, Institute of Astronomy and Astrophysics, Academia Sinica, ROC

The main code:

C$OMP PARALLEL private(i,j,k,rx,ry,r,ir3,Fxk,Fyk)
C$OMP DO schedule(guided)
      do j = 1, N
         do k = 1, j-1
            .... Direct nbody calculation ....
         enddo
      enddo
C$OMP BARRIER
C$OMP CRITICAL
      do j = 1, N
         Fx(j) = Fx(j) + Fxk(j)
         Fy(j) = Fy(j) + Fyk(j)
      enddo
C$OMP END CRITICAL
C$OMP END PARALLEL
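The pairwise force evaluation itself is elided in the loop above. As an illustration only, here is a minimal sketch of what such an inner loop body typically looks like for softened 2-D gravity, reusing the names from the private() list (rx, ry, r, ir3, Fxk, Fyk); the position arrays x(:), y(:), the mass array m(:), and the softening length eps are assumed names that do not appear in the poster:

! Illustrative sketch only, not the poster's code.
! x(:), y(:), m(:), and eps are assumed names.
      rx  = x(k) - x(j)
      ry  = y(k) - y(j)
      r   = sqrt(rx*rx + ry*ry + eps*eps)
      ir3 = 1.0d0/(r*r*r)
! Each thread accumulates into its private partial arrays Fxk and Fyk
! (Newton's third law lets one pair update both particles); the
! CRITICAL section in the main code then merges them into Fx and Fy.
      Fxk(j) = Fxk(j) + m(k)*rx*ir3
      Fyk(j) = Fyk(j) + m(k)*ry*ir3
      Fxk(k) = Fxk(k) - m(j)*rx*ir3
      Fyk(k) = Fyk(k) - m(j)*ry*ir3

The private partial arrays are needed precisely because the k index of one thread's pair can belong to a j value handled by another thread, so accumulating directly into shared Fx, Fy would race.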



Abstract

Although MPI has become accepted as a portable style of parallel programming, it has several significant weaknesses that limit its effectiveness and scalability. MPI is generally difficult to program and does not support incremental parallelization of an existing sequential program. The current trend in parallel computer architectures is towards clusters of shared-memory symmetric multi-processors (SMP), e.g. dual-core, quad-core, and dual dual-core machines. Clusters of SMP nodes support a wide range of parallel programming paradigms, and the shared address space within each node is well suited to OpenMP parallelization. I have implemented a direct n-body program with OpenMP and tested it on the IAA PC cluster to demonstrate its feasibility and efficiency. The greatest benefit is ease of use.

Hardware

• 8 Intel single-CPU PCs (3 GHz P4, 2 GB DDR, 80 GB IDE, Gbps NIC)
• 4 AMD2 PCs (1.0 GHz AMD dual-core 5000+, 2 GB DDR2, 120 GB IDE, Gbps NIC)
• 4 dual AMD PCs (1.8 GHz AMD Opteron 244, 2 GB DDR, 120 GB IDE, Gbps NIC)
• 2 dual AMD2 PCs (2 GHz Opteron 270 dual-core, 8 GB DDR2, 120 GB IDE, Gbps NIC)
• 2 Intel dual-core PCs (3 GHz Pentium, 2 GB DDR, 80 GB IDE, Gbps NIC)

Software

• A simple direct n-body program (particle number = 100,000).
• OpenMP directives. OpenMP is a specification for a set of compiler directives, library routines, and environment variables that can be used to specify shared-memory parallelism in Fortran and C/C++ programs.
• Intel Fortran compiler v9.1 with Cluster OpenMP. Cluster OpenMP is an implementation of OpenMP that can make use of multiple SMP machines without resorting to MPI.

Node x thread configurations

1 x 2 : 1 node with 2 threads
2 x 1 : 2 nodes, each with 1 thread
2 x 2 : 2 nodes, each with 2 threads
4 x 1 : 4 nodes, each with 1 thread
1 x 4 : 1 node with 4 threads
4 x 2 : 4 nodes, each with 2 threads
2 x 4 : 2 nodes, each with 4 threads

[Figures 1-5 (images not reproduced in this transcript): relative speed and efficiency versus number of threads, for the node x thread configurations above, on the Intel single-CPU, Intel dual-core, AMD2, dual AMD, and dual AMD2 machines. Figures 6-10: comparisons of the individual machines (AMD2, Intel dual-core, dual AMD2, dual AMD, Intel single-CPU).]

Summary

• With only six OpenMP directives added to the code, a direct n-body simulation speeds up by a factor of 7 with 8 AMD CPUs running simultaneously (Fig. 4).
• Although the clock frequency of the AMD CPUs is lower than that of the Intel CPUs, the performance per GHz is almost the same for the different CPUs, except for the AMD2 dual-core, which is about 3 times that of the others (Fig. 6).
• Because of limited memory bandwidth, the efficiency of a dual-CPU machine is better than that of a dual-core CPU (Fig. 2 and Fig. 4). The Hyper-Threading technology of the Intel CPU is the worst case: do not submit more than one thread to a single CPU.
• In the Linpack test, the performance of two CPUs is two times that of one CPU for a dual AMD machine (Fig. 10),
but the ratio is only 1.01 for an Intel dual-core machine (Fig. 8).

2007/3/20
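A side note on the six directives mentioned in the Summary: the main code merges each thread's private partial forces inside a CRITICAL section. A minimal sketch of a common alternative is shown below, assuming the compiler supports OpenMP array reductions in Fortran; this is an illustration, not the poster's code, and the elided kernel would then accumulate directly into Fx and Fy.

C     Illustrative alternative, not the poster's code: let OpenMP
C     combine the per-thread force contributions with a reduction
C     clause instead of private copies plus a CRITICAL section.
C$OMP PARALLEL DO schedule(guided)
C$OMP& private(j,k,rx,ry,r,ir3) reduction(+:Fx,Fy)
      do j = 1, N
         do k = 1, j-1
C           .... direct n-body calculation, accumulating into Fx, Fy ....
         enddo
      enddo
C$OMP END PARALLEL DO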
