
Introduction to Scientific Computing






Presentation Transcript


  1. Introduction to Scientific Computing Shubin Liu, Ph.D. Research Computing Center University of North Carolina at Chapel Hill

  2. Course Goals • An introduction to high-performance computing and UNC Research Computing Center • Available Research Computing hardware facilities • Available software packages • Serial/parallel programming tools and libraries • How to efficiently make use of Research Computing facilities on campus

  3. Agenda • Introduction to High-Performance Computing • Hardware Available • Servers, storage, file systems, etc. • How to Access • Programming Tools Available • Compilers & Debugger Tools • Utility Libraries • Parallel Computing • Scientific Packages Available • Job Management • Hands-on Exercises – 2nd hour • The PPT format of this presentation is available at http://its2.unc.edu/divisions/rc/training/scientific/ and in AFS at /afs/isis/depts/its/public_html/divisions/rc/training/scientific/short_courses/

  4. Pre-requisites • An account on the Emerald cluster • UNIX Basics: • Getting started: http://help.unc.edu/?id=5288 • Intermediate: http://help.unc.edu/?id=5333 • vi Editor: http://help.unc.edu/?id=152 • Customizing: http://help.unc.edu/?id=208 • Shells: http://help.unc.edu/?id=5290 • ne Editor: http://help.unc.edu/?id=187 • Security: http://help.unc.edu/?id=217 • Data Management: http://help.unc.edu/?id=189 • Scripting: http://help.unc.edu/?id=213 • HPC Application: http://help.unc.edu/?id=4176

  5. About Us • ITS – Information Technology Services • http://its.unc.edu • http://help.unc.edu • Physical locations: 401 West Franklin St. and 211 Manning Drive • 10 Divisions/Departments: Information Security, IT Infrastructure and Operations, Research Computing Center, Teaching and Learning, User Support and Engagement, Office of the CIO, Communication Technologies, Enterprise Resource Planning, Enterprise Applications, Finance and Administration

  6. Research Computing Center • Where and who are we, and what do we do? • Location: ITS Manning, 211 Manning Drive • Website: http://its.unc.edu • Research Groups: Infrastructure (Hardware), User Support (Software), Engagement (Collaboration)

  7. About Myself • Ph.D. in Chemistry, UNC-CH • Currently Senior Computational Scientist @ Research Computing Center, UNC-CH • Responsibilities: • Support Computational Chemistry/Physics/Material Science software • Support programming tools (FORTRAN/C/C++), code porting, parallel computing, etc. • Offer short training courses for campus users • Conduct research and engagement projects in Computational Chemistry • Development of DFT theory and concept tools • Applications in biological and material science systems

  8. What is Scientific Computing? • Short version: to use high-performance computing (HPC) facilities to solve real scientific problems. • Long version, from Wikipedia: Scientific computing (or computational science) is the field of study concerned with constructing mathematical models and numerical solution techniques and using computers to analyze and solve scientific and engineering problems. In practical use, it is typically the application of computer simulation and other forms of computation to problems in various scientific disciplines.

  9. What is Scientific Computing? [Diagram: scientific computing shown at the intersection of several disciplines and layers] • From the scientific discipline viewpoint: Natural Sciences, Engineering Sciences, Computer Science, Applied Mathematics • From the operational viewpoint: Theory/Model Layer, Algorithm Layer, Application Layer, Hardware/Software • From the computing perspective: High-Performance Computing, Parallel Computing, Scientific Computing

  10. What is HPC? • Computing resources which provide more than an order of magnitude more computing power than current top-end workstations or desktops – generic, widely accepted. • HPC ingredients: • large capability computers (fast CPUs) • massive memory • enormous (fast & large) data storage • highest capacity communication networks (Myrinet, 10 GigE, InfiniBand, etc.) • specifically parallelized codes (MPI, OpenMP) • visualization

  11. Why HPC? • What are the three-dimensional structures of all of the proteins encoded by an organism's genome and how does structure influence function, both spatially and temporally? • What patterns of emergent behavior occur in models of very large societies? • How do massive stars explode and produce the heaviest elements in the periodic table? • What sort of abrupt transitions can occur in Earth's climate and ecosystem structure? How do these occur and under what circumstances? • If we could design catalysts atom-by-atom, could we transform industrial synthesis? • What strategies might be developed to optimize management of complex infrastructure systems? • What kind of language processing can occur in large assemblages of neurons? • Can we enable integrated planning and response to natural and man-made disasters that prevent or minimize the loss of life and property? http://www.nsf.gov/pubs/2005/nsf05625/nsf05625.htm

  12. Measure of Performance (single CPU; units in MFLOPS, x10^6)

  Machine/CPU Type                        LINPACK Performance    Peak Performance
  Intel Pentium 4 (2.53 GHz)                     2355                 5060
  NEC SX-6/1 (1 proc., 2.0 ns)                   7575                 8000
  HP rx5670 Itanium2 (1 GHz)                     3528                 4000
  IBM eServer pSeries 690 (1300 MHz)             2894                 5200
  Cray SV1ex-1-32 (500 MHz)                      1554                 2000
  Compaq ES45 (1000 MHz)                         1542                 2000
  AMD Athlon MP1800+ (1530 MHz)                  1705                 3060
  Intel Pentium III (933 MHz)                     507                  933
  SGI Origin 2000 (300 MHz)                       533                  600
  Intel Pentium II Xeon (450 MHz)                 295                  450
  Sun UltraSPARC (167 MHz)                        237                  333

  FLOPS prefixes: Mega (x10^6), Giga (x10^9), Tera (x10^12), Peta (x10^15), Exa (x10^18), Zetta (x10^21), Yotta (x10^24); see http://en.wikipedia.org/wiki/FLOPS
  Reference: http://performance.netlib.org/performance/html/linpack.data.col0.html

  13. How to Quantify Performance? TOP500 • A list of the 500 most powerful computer systems in the world • Established in June 1993 • Compiled twice a year (June & November) • Uses the LINPACK benchmark code (solving the dense linear system Ax = b) • Organized by world-wide HPC experts, computational scientists, manufacturers, and the Internet community • Homepage: http://www.top500.org

  14. TOP500, November 2007: Top 5 plus the UNC entry. Units are GFLOPS (1 GFLOPS = 1,000 MFLOPS); values given as Rmax / Rpeak.
  1. DOE/NNSA/LLNL, United States, 2007: BlueGene/L eServer Blue Gene Solution, IBM, 212,992 procs; 478,200 / 596,378
  2. Forschungszentrum Juelich (FZJ), Germany, 2007: JUGENE Blue Gene/P Solution, IBM, 65,536 procs; 167,300 / 222,822
  3. SGI/New Mexico Computing Applications Center (NMCAC), United States, 2007: SGI Altix ICE 8200, Xeon quad core 3.0 GHz, SGI, 14,336 procs; 126,900 / 172,032
  4. Computational Research Laboratories, TATA SONS, India, 2007: EKA Cluster Platform 3000 BL460c, Xeon 53xx 3 GHz, Infiniband, Hewlett-Packard, 14,240 procs; 117,900 / 170,800
  5. Government Agency, Sweden, 2007: Cluster Platform 3000 BL460c, Xeon 53xx 2.66 GHz, Infiniband, Hewlett-Packard, 13,728 procs; 102,800 / 146,430
  36. University of North Carolina, United States, 2007: Topsail, PowerEdge 1955, 2.33 GHz, Cisco/Topspin Infiniband, Dell, 4,160 procs; 28,770 / 38,821.1

  15. TOP500: June 2008 Rmax and Rpeak values are in TFlops; power data are in kW for the entire system.

  16. TOP500: June 2009

  17. TOP500 History of UNC-CH Entry

  18. Shared/Distributed-Memory Architecture [Diagram: shared memory places all CPUs on a bus to a single memory pool; distributed memory gives each CPU its own memory, connected by a network] • Shared memory - single address space; all processors have access to a pool of shared memory (examples: chastity/zephyr, happy/yatta, cedar/cypress, sunny) • Distributed memory - each processor has its own local memory, so message passing must be used to exchange data between processors (examples: Baobab, the new Dell cluster) • Methods of memory access: bus and crossbar

  19. What is a Beowulf Cluster? • A Beowulf system is a collection of personal computers constructed from commodity-off-the-shelf hardware components, interconnected with a system-area network, and configured to operate as a single parallel computing platform (e.g., via MPI), using an open-source network operating system such as LINUX. • Main components: • PCs running the LINUX OS • Inter-node connection with Ethernet, Gigabit Ethernet, Myrinet, InfiniBand, etc. • MPI (Message Passing Interface)

  20. LINUX Beowulf Clusters

  21. What is Parallel Computing ? • Concurrent use of multiple processors to process data • Running the same program on many processors. • Running many programs on each processor.

  22. Advantages of Parallelization • Cheaper, in terms of price/performance ratio • Faster than equivalently expensive uniprocessor machines • Handles bigger problems • More scalable: the performance of a particular program may be improved by running it on a larger machine • More reliable: in theory, if some processors fail, we can simply use the others

  23. Catch: Amdahl's Law • Speedup = 1/(s + p/n), where s is the fraction of the code that must run serially, p = 1 - s is the fraction that can be parallelized, and n is the number of processors. • Even as n grows without bound, the speedup can never exceed 1/s, so the serial portion ultimately limits what parallelization can buy.
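To make the bound concrete, here is a small illustrative C sketch (not from the slides) that evaluates the formula for a hypothetical code whose serial fraction is 10%; the speedup saturates near 1/s = 10 no matter how many processors are added:

#include <stdio.h>

/* Amdahl's law: speedup = 1 / (s + p/n), where s is the serial
   fraction of the run time, p = 1 - s is the parallel fraction,
   and n is the number of processors. */
static double amdahl_speedup(double s, int n)
{
    return 1.0 / (s + (1.0 - s) / n);
}

int main(void)
{
    const double s = 0.10;                       /* assumed 10% serial work */
    const int procs[] = {1, 2, 4, 8, 16, 64, 256};

    for (int i = 0; i < 7; i++)
        printf("n = %3d   speedup = %5.2f\n", procs[i], amdahl_speedup(s, procs[i]));
    return 0;
}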

  24. Parallel Programming Tools • Shared-memory architecture • OpenMP • Distributed-memory architecture • MPI, PVM, etc.

  25. OpenMP • An Application Program Interface (API) that may be used to explicitly direct multi-threaded, shared-memory parallelism • What does OpenMP stand for? Open specifications for Multi Processing, via collaborative work between interested parties from the hardware and software industry, government, and academia • Comprised of three primary API components: Compiler Directives, Runtime Library Routines, Environment Variables • Portable: the API is specified for C/C++ and Fortran; implementations exist for most Unix platforms and Windows NT • Standardized: jointly defined and endorsed by a group of major computer hardware and software vendors; expected to become an ANSI standard at some point • Many compilers can automatically parallelize a code with OpenMP!

  26. OpenMP Example (FORTRAN)

      PROGRAM HELLO
      INTEGER NTHREADS, TID, OMP_GET_NUM_THREADS,
     +        OMP_GET_THREAD_NUM
C     Fork a team of threads giving them their own copies of variables
!$OMP PARALLEL PRIVATE(TID)
C     Obtain and print thread id
      TID = OMP_GET_THREAD_NUM()
      PRINT *, 'Hello World from thread = ', TID
C     Only master thread does this
      IF (TID .EQ. 0) THEN
         NTHREADS = OMP_GET_NUM_THREADS()
         PRINT *, 'Number of threads = ', NTHREADS
      END IF
C     All threads join master thread and disband
!$OMP END PARALLEL
      END
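Since the API is also specified for C/C++, the same hello-world can be written in C. A minimal sketch (not from the slides; assumes an OpenMP-capable compiler, e.g. built with gcc -fopenmp or an equivalent vendor flag):

#include <stdio.h>
#include <omp.h>

int main(void)
{
    int nthreads, tid;

    /* Fork a team of threads, each with its own private copy of tid */
    #pragma omp parallel private(tid)
    {
        /* Obtain and print the thread id */
        tid = omp_get_thread_num();
        printf("Hello World from thread = %d\n", tid);

        /* Only the master thread reports the team size */
        if (tid == 0) {
            nthreads = omp_get_num_threads();
            printf("Number of threads = %d\n", nthreads);
        }
    }   /* all threads join the master thread and disband */

    return 0;
}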

  27. The Message Passing Model • Parallelization scheme for distributed memory • Parallel programs consist of cooperating processes, each with its own memory • Processes send data to one another as messages • Messages can be passed among compute processes • Messages may carry tags that can be used to sort them • Messages may be received in any order

  28. MPI: Message Passing Interface • Message-passing model • Standard (specification) • Many implementations (almost every vendor has one) • MPICH and LAM/MPI, from the public domain, are the most widely used • GLOBUS MPI for grid computing • Two phases: • MPI 1: traditional message passing • MPI 2: remote memory, parallel I/O, and dynamic processes • Online resources • http://www-unix.mcs.anl.gov/mpi/index.htm • http://www-unix.mcs.anl.gov/mpi/mpich/ • http://www.lam-mpi.org/ • http://www.mpi-forum.org • http://www-unix.mcs.anl.gov/mpi/tutorial/learning.html

  29. A Simple MPI Code

  C version:

#include "mpi.h"
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    printf("Hello world\n");
    MPI_Finalize();
    return 0;
}

  FORTRAN version:

      include 'mpif.h'
      integer myid, ierr, numprocs
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, numprocs, ierr)
      write(*,*) 'Hello from ', myid
      write(*,*) 'Numprocs is ', numprocs
      call MPI_FINALIZE(ierr)
      end
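The C version above only initializes and finalizes MPI, while the FORTRAN version also queries the process rank and communicator size. As an illustrative sketch (not from the slides), the fragment below does the same in C and additionally passes one tagged message between processes 0 and 1, matching the model described on the previous slide. With MPICH or LAM/MPI such a program is typically built with the mpicc wrapper and launched with mpirun (for example, mpirun -np 4 ./a.out); the exact commands depend on the installation.

#include <stdio.h>
#include "mpi.h"

int main(int argc, char **argv)
{
    int myid, numprocs, value, tag = 99;
    MPI_Status status;

    MPI_Init(&argc, &argv);                      /* start up MPI */
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);        /* rank of this process */
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);    /* total number of processes */

    printf("Hello from %d of %d\n", myid, numprocs);

    /* Pass one tagged message between the first two processes */
    if (numprocs > 1) {
        if (myid == 0) {
            value = 42;
            MPI_Send(&value, 1, MPI_INT, 1, tag, MPI_COMM_WORLD);
        } else if (myid == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, tag, MPI_COMM_WORLD, &status);
            printf("Process 1 received %d (tag %d)\n", value, status.MPI_TAG);
        }
    }

    MPI_Finalize();                              /* shut down MPI */
    return 0;
}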

  30. Other Parallelization Models • VIA: Virtual Interface Architecture, standards-based cluster communications • PVM: a portable message-passing programming system designed to link separate host machines to form a "virtual machine", a single, manageable computing resource; largely an academic effort, with little development since the 1990s • BSP: Bulk Synchronous Parallel model, a generalization of the widely researched PRAM (Parallel Random Access Machine) model • Linda: a concurrent programming model from Yale, built around the concept of a "tuple space" • HPF: High Performance Fortran, a standard parallel programming language for shared- and distributed-memory systems (available here through PGI's pghpf compiler)

  31. RC Servers @ UNC-CH • SGI Altix 3700 – SMP, 128 CPUs, cedar/cypress • Emerald LINUX Cluster – Distributed memory, ~500 CPUs, emerald • yatta/p575 IBM AIX nodes • Dell LINUX cluster – Distributed memory 4160 CPUs, topsail

  32. IBM P690/P575 SMP • IBM pSeries 690/P575 Model 6C4, Power4+ Turbo, 32 1.7 GHz processors • Access to 4 TB of NetApp NAS RAID array used for scratch space, mounted as /nas and /netscr • OS: IBM AIX 5.3, Maintenance Level 04 • Login node: emerald.isis.unc.edu • Compute nodes: • yatta.isis.unc.edu, 32 CPUs • P575-n00.isis.unc.edu, 16 CPUs • P575-n01.isis.unc.edu, 16 CPUs • P575-n02.isis.unc.edu, 16 CPUs • P575-n03.isis.unc.edu, 16 CPUs

  33. SGI Altix 3700 SMP • Servers for Scientific Applications such as Gaussian, Amber, and custom code • Login node: cedar.isis.unc.edu • Compute node: cypress.isis.unc.edu • Cypress: SGI Altix 3700bx2 - 128 Intel Itanium2 Processors (1600MHz), each with 16k L1 cache for data, 16k L1 cache for instructions, 256k L2 cache, 6MB L3 cache, 4GB of Shared Memory (512GB total memory) • Two 70 GB SCSI System Disks as /scr

  34. SGI Altix 3700 SMP • Cedar: SGI Altix 350 - 8 Intel Itanium2 Processors (1500MHz), each with 16k L1 cache for data, 16k L1 cache for instructions, 256k L2 cache, 4MB L3 cache, 1GB of Shared Memory (8GB total memory), two 70 GB SATA System Disks. • RHEL 3 with Propack 3, Service Pack 3 • No AFS (HOME & pkg space) access • Scratch Disk: /netscr, /nas, /scr

  35. Emerald Cluster • General purpose Linux cluster for scientific and statistical applications • Machine name: emerald.isis.unc.edu • 2 Login Nodes: IBM BladeCenter, one Xeon 2.4GHz, 2.5GB RAM and one Xeon 2.8GHz, 2.5GB RAM • 18 Compute Nodes: Dual AMD Athlon 1600+ 1.4GHz MP Processor, Tyan Thunder MP Motherboard, 2GB DDR RAM on each node • 6 Compute Nodes: Dual AMD Athlon 1800+ 1.6GHz MP Processor, Tyan Thunder MP Motherboard, 2GB DDR RAM on each node • 25 Compute Nodes: IBM BladeCenter, Dual Intel Xeon 2.4GHz, 2.5GB RAM on each node • 96 Compute Nodes: IBM BladeCenter, Dual Intel Xeon 2.8GHz, 2.5GB RAM on each node • 15 Compute Nodes: IBM BladeCenter, Dual Intel Xeon 3.2GHz, 4.0GB RAM on each node • Access to 10 TB of NetApp NAS RAID array used for scratch space, mounted as /nas and /scr • OS: RedHat Enterprise Linux 3.0 • TOP500: 395th place in the June 2003 release

  36. Dell LINUX Cluster, Topsail • 520 dual nodes (4,160 CPUs), Xeon (EM64T) 3.6 GHz, 2 MB L2 cache, 2 GB memory per CPU • InfiniBand inter-node connection • Not AFS mounted; not open to the general public • Access based on peer-reviewed proposal • HPL: 6.252 teraflops, 74th in the June 2006 TOP500 list, 104th in the November 2006 list, and 25th in the June 2007 list (28.77 teraflops after upgrade)

  37. Topsail • Login node: topsail.unc.edu, 8 CPUs @ 2.3 GHz Intel EM64T with 2x4M L2 cache (Model E5345/Clovertown), 12 GB memory • Compute nodes: 4,160 CPUs @ 2.3 GHz Intel EM64T with 2x4M L2 cache (Model E5345/Clovertown), 12 GB memory • Shared disk (/ifs1): 39 TB IBRIX Parallel File System • Interconnect: InfiniBand 4x SDR • Resource management is handled by LSF v7.2, through which all computational jobs are submitted for processing

  38. File Systems • AFS (Andrew File System): AFS is a distributed network file system that enables files on any AFS machine across the campus to be accessed as easily as files stored locally • Serves as the ISIS HOME for all users with an ONYEN (the Only Name You'll Ever Need) • Limited quota: 250 MB for most users (type "fs lq" to view) • Current production version: openafs-1.3.8.6 • Files backed up daily (~/OldFiles) • Directory/file tree: /afs/isis/home/o/n/onyen; for example, /afs/isis/home/m/a/mason, where "mason" is the ONYEN of the user • Accessible from emerald and happy/yatta, but not from cedar/cypress or topsail • Not suitable for research computing tasks! It is recommended to compile and run I/O-intensive jobs on /scr or /netscr • More info: http://help.unc.edu/?id=215#d0e24

  39. Basic AFS Commands • To add or remove packages: ipm add pkg_name, ipm remove pkg_name • To find out space quota/usage: fs lq • To see and renew AFS tokens (read/write-able), which expire in 25 hours: tokens, klog • Over 300 packages installed in AFS pkg space: /afs/isis/pkg/ • More info available at http://its.unc.edu/dci/dci_components/afs/

  40. Data Storage • Local Scratch: /scr – local to a machine • Cedar/cypress: 2x500 GB SCSI System Disks • Topsail: /ifs1/scr 39 TB IBRIX Parallel File System • Happy/yatta: 2x500 GB Disk Drives • For running jobs, temporary data storage, not backed up • Network Attached Storage (NAS) – for temporary storage • /nas/uncch, /netscr • >20TB of NetApp NAS RAID array used for scratch space, mounted as /nas and /scr • For running jobs, temporary data storage, not backed up • Shared by all login and compute nodes (cedar/cypress, happy/yatta, emerald) • Mass Storage (MS) – for permanent storage • Mounted for long term data storage on all scientific computing servers’ login nodes as ~/ms ($HOME/ms) • Never run jobs in ~/ms (compute nodes do not have ~/ms access)

  41. Subscription of Services • Have an ONYEN ID • The Only Name You’ll Ever Need • Eligibility: Faculty, staff, postdoc, and graduate students • Go to http://onyen.unc.edu

  42. Access to Servers • To Emerald • ssh emerald.isis.unc.edu • To cedar • ssh cedar.isis.unc.edu • To Topsail • ssh topsail.unc.edu

  43. Programming Tools • Compilers • FORTRAN 77/90/95 • C/C++ • Utility Libraries • BLAS, LAPACK, FFTW, SCALAPACK • IMSL, NAG, • NetCDF, GSL, PETSc • Parallel Computing • OpenMP • PVM • MPI (MPICH, LAM/MPI, OpenMPI, MPICH2)

  44. Compilers: SMP Machines • Cedar/Cypress – SGI Altix 3700, 128 CPUs • 64-bit Intel compilers, versions 9.1 and 10.1, in /opt/intel • FORTRAN 77/90/95: ifort/ifc/efc • C/C++: icc/ecc • 64-bit GNU compilers • FORTRAN 77: f77/g77 • C and C++: gcc/cc and g++/c++ • Yatta/P575 – IBM P690/P575, 32/64 CPUs • XL FORTRAN 77/90 8.1.0.3: xlf, xlf90 • C and C++ AIX 6.0.0.4: xlc, xlC

  45. Compilers: LINUX Cluster • Absoft ProFortran Compilers • Package Name: profortran • Current Version: 7.0 • FORTRAN 77 (f77): Absoft FORTRAN 77 compiler version 5.0 • FORTRAN 90/95 (f90/f95): Absoft FORTRAN 90/95 compiler version 3.0 • GNU Compilers • Package Name: gcc • Current Version: 4.1.2 • FORTRAN 77 (g77/f77): 3.4.3, 4.1.2 • C (gcc): 3.4.3, 4.1.2 • C++ (g++/c++): 3.4.3, 4.1.2 • Intel Compilers • Package Names: intel_fortran, intel_CC • Current Version: 10.1 • FORTRAN 77/90 (ifc): Intel LINUX compiler versions 8.1, 9.0, 10.1 • C/C++ (icc): Intel LINUX compiler versions 8.1, 9.0, 10.1 • Portland Group Compilers • Package Name: pgi • Current Version: 7.1.6 • FORTRAN 77 (pgf77): The Portland Group, Inc. pgf77 v6.0, 7.0.4, 7.1.3 • FORTRAN 90 (pgf90): The Portland Group, Inc. pgf90 v6.0, 7.0.4, 7.1.3 • High Performance FORTRAN (pghpf): The Portland Group, Inc. pghpf v6.0, 7.0.4, 7.1.3 • C (pgcc): The Portland Group, Inc. pgcc v6.0, 7.0.4, 7.1.3 • C++ (pgCC): The Portland Group, Inc. pgCC v6.0, 7.0.4, 7.1.3

  46. LINUX Compiler Benchmark (lower CPU time and higher MFLOPS are better; rank in parentheses)

  Benchmark                      Absoft ProFortran 90   Intel FORTRAN 90   Portland Group FORTRAN 90   GNU FORTRAN 77
  Molecular Dynamics (CPU time)        4.19 (4)              2.83 (2)             2.80 (1)                 2.89 (3)
  Kepler (CPU time)                    0.49 (1)              0.93 (2)             1.10 (3)                 1.24 (4)
  Linpack (CPU time)                   98.6 (4)              95.6 (1)             96.7 (2)                 97.6 (3)
  Linpack (MFLOPS)                    182.6 (4)             183.8 (1)            183.2 (3)                183.3 (2)
  LFK (CPU time)                       89.5 (4)              70.0 (3)             68.7 (2)                 68.0 (1)
  LFK (MFLOPS)                        309.7 (3)             403.0 (2)            468.9 (1)                250.9 (4)
  Total Rank                           20                    11                   12                       17

  For reference only; performance is code and compilation flag dependent. For each benchmark, three identical runs were performed and the best CPU timing of the three is listed in the table. Optimization flags: Absoft -O, Portland Group -O4 -fast, Intel -O3, GNU -O.

  47. Profilers & Debuggers • SMP machines • Happy/yatta: dbx, prof, gprof • Cedar/cypress: gprof • LINUX Cluster • PGI: pgdebug, pgprof, gprof • Absoft: fx, xfx, gprof • Intel: idb, gprof • GNU: gdb, gprof

  48. Utility Libraries • Mathematical libraries: IMSL, NAG, etc. • Scientific computing: • Linear algebra: BLAS, ATLAS, EISPACK, LAPACK, ScaLAPACK • Fast Fourier Transform: FFTW • The GNU Scientific Library (GSL) • Other utility libraries: netCDF, PETSc, etc.
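As a concrete illustration of calling one of these libraries, the following minimal C sketch (not from the slides) computes a dot product through the CBLAS interface to BLAS. It assumes a CBLAS-providing installation such as ATLAS supplies cblas.h; the required link flags (e.g. -lcblas) vary by system and package.

#include <stdio.h>
#include <cblas.h>

int main(void)
{
    /* Dot product of two length-3 vectors with Level-1 BLAS (ddot) */
    double x[3] = {1.0, 2.0, 3.0};
    double y[3] = {4.0, 5.0, 6.0};

    /* Arguments: length, vector x, stride of x, vector y, stride of y */
    double dot = cblas_ddot(3, x, 1, y, 1);

    printf("x . y = %f\n", dot);   /* expected: 32.0 */
    return 0;
}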

  49. Utility Libraries • SMP Machines • Yatta/P575: ESSL (Engineering and Scientific Subroutine Library), -lessl • BLAS • LAPACK • EISPACK • Fourier Transforms, Convolutions and Correlations, and Related Computations • Sorting and Searching • Interpolation • Numerical Quadrature • Random Number Generation • Utilities
