
Parallel Programming on the SGI Origin2000



Presentation Transcript


  1. Parallel Programming on the SGI Origin2000. Taub Computer Center, Technion. Anne Weill-Zrahia. With thanks to Moshe Goldberg (TCC) and Igor Zacharov (SGI). Mar 2005

  2. Parallel Programming on the SGI Origin2000 • Parallelization Concepts • SGI Computer Design • Efficient Scalar Design • Parallel Programming - OpenMP • Parallel Programming - MPI

  3. 4) Parallel Programming - OpenMP

  4. Is this your joint bank account? Initial amount: IL500. Limor in Haifa reads IL500 and takes IL150 (writes IL350); at the same time Shimon in Tel Aviv reads IL500 and takes IL400 (writes IL100). Final amount: IL350 or IL100, depending on whose write lands last, so one of the two withdrawals is lost. Two threads updating shared data without coordination: this is the hazard that parallel programming must avoid.

  5. Introduction • Parallelization instruction to the compiler: f77 -o prog -mp prog.f, or: f77 -o prog -pfa prog.f • Now try to understand what a compiler has to determine when deciding how to parallelize • Note that when we talk loosely about parallelization, what is meant is: “Is the program as presented here parallelizable?” • This is an important distinction, because sometimes rewriting can transform non-parallelizable code into a parallelizable form, as we will see…

  6. Data dependency types
1) Iteration i depends on values calculated in the previous iteration i-1 (loop-carried dependence):
      do i=2,n
         a(i) = a(i-1)
      enddo
   cannot be parallelized
2) Data dependence within a single iteration (non-loop-carried dependence):
      do i=2,n
         c = . . .
         a(i) = . . . c . . .
      enddo
   parallelizable
3) Reduction:
      do i=1,n
         s = s + x
      enddo
   parallelizable
All data dependencies in programs are variations on these fundamental types.
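
Looking ahead to the OpenMP directives introduced later in these slides, here is a minimal sketch (not from the original slides; the arrays a, x and the scalar s are illustrative) of how the two parallelizable types are typically expressed:

c     Type 2: dependence only within one iteration - parallelizable,
c     provided the temporary c is private to each thread
c$omp parallel do private(c)
      do i = 2, n
         c    = 2.0*x(i)
         a(i) = c + 1.0
      enddo
c$omp end parallel do

c     Type 3: reduction - parallelizable with a reduction clause
c$omp parallel do reduction(+:s)
      do i = 1, n
         s = s + x(i)
      enddo
c$omp end parallel do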

  7. Data dependency analysis Question: Are the following loops parallelizable?
      do i=2,n
         a(i) = b(i-1)
      enddo
   YES!
      do i=2,n
         a(i) = a(i-1)
      enddo
   NO! Why?

  8. Data dependency analysis
      do i=2,n
         a(i) = b(i-1)
      enddo
   YES!
            CPU1        CPU2        CPU3
   cycle1   A(2)=B(1)   A(3)=B(2)   A(4)=B(3)
   cycle2   A(5)=B(4)   A(6)=B(5)   A(7)=B(6)

  9. Data dependency analysis
      do i=2,n
         a(i) = a(i-1)
      enddo
   Scalar (non-parallel) run:
            CPU1
   cycle1   A(2)=A(1)
   cycle2   A(3)=A(2)
   cycle3   A(4)=A(3)
   cycle4   A(5)=A(4)
   In each cycle, NEW data from the previous cycle is read

  10. Data dependency analysis
      do i=2,n
         a(i) = a(i-1)
      enddo
   No!
            CPU1        CPU2        CPU3
   cycle1   A(2)=A(1)   A(3)=A(2)   A(4)=A(3)
   Will probably read OLD data

  11. Data dependency analysis
      do i=2,n
         a(i) = a(i-1)
      enddo
   No!
            CPU1        CPU2        CPU3
   cycle1   A(2)=A(1)   A(3)=A(2)   A(4)=A(3)   (will probably read OLD data)
   cycle2   A(5)=A(4)   A(6)=A(5)   A(7)=A(6)   (may read NEW data)

  12. Data dependency analysis Another question: Are the following loops parallelizable?
      do i=3,n,2
         a(i) = a(i-1)
      enddo
   YES!
      do i=1,n
         s = s + a(i)
      enddo
   Depends!

  13. Data dependency analysis
      do i=3,n,2
         a(i) = a(i-1)
      enddo
   YES!
            CPU1        CPU2         CPU3
   cycle1   A(3)=A(2)   A(5)=A(4)    A(7)=A(6)
   cycle2   A(9)=A(8)   A(11)=A(10)  A(13)=A(12)

  14. Data dependency analysis
      do i=1,n
         s = s + a(i)
      enddo
   Depends!
            CPU1       CPU2       CPU3
   cycle1   S=S+A(1)   S=S+A(2)   S=S+A(3)
   cycle2   S=S+A(4)   S=S+A(5)   S=S+A(6)
• The value of S will be undetermined, and typically it will vary from one run to the next
• This bug in parallel programming is called a “race condition”
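
The standard fix for this particular race, sketched here with the reduction clause that is covered later in these slides, lets each thread accumulate into a private copy of s that is combined at the end of the loop:

c$omp parallel do reduction(+:s)
      do i = 1, n
         s = s + a(i)
      enddo
c$omp end parallel do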

  15. Data dependency analysis What is the principle involved here? The examples shown fall into two categories:
1) Data being read is independent of data that is written:
      a(i) = b(i-1)   i=2,3,4. . .
      a(i) = a(i-1)   i=3,5,7. . .
2) Data being read depends on data that is written:
      a(i) = a(i-1)   i=2,3,4. . .
      s = s + a(i)    i=1,2,3. . .

  16. Data dependency analysis Here is a typical situation: Is there a data dependency in the following loop?
      do i = 1,n
         a(i) = sin(x(i))
         result = a(i) + b(i)
         c(i) = result * c(i)
      enddo
   No! Clearly, “result” is a temporary variable that is reassigned in every iteration. Note: “result” must be a “private” variable (this will be discussed later).
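
As a hedged sketch (using the parallel do directive covered later), the loop would typically be parallelized with result declared private, so that every thread works with its own copy:

c$omp parallel do private(result)
      do i = 1, n
         a(i)   = sin(x(i))
         result = a(i) + b(i)
         c(i)   = result * c(i)
      enddo
c$omp end parallel do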

  17. Data dependency analysis Here is a (slightly different) typical situation: Is there a data dependency in the following loop?
      do i = 1,n
         a(i) = sin(result)
         result = a(i) + b(i)
         c(i) = result * c(i)
      enddo
   Yes! The value of “result” is carried over from one iteration to the next. This is the classical read/write situation, but now it is somewhat hidden.

  18. Data dependency analysis The loop could (symbolically) be rewritten:
      do i = 1,n
         a(i) = sin(result(i-1))
         result(i) = a(i) + b(i)
         c(i) = result(i) * c(i)
      enddo
   Now substitute the expression for a(i):
      do i = 1,n
         a(i) = sin(result(i-1))
         result(i) = sin(result(i-1)) + b(i)
         c(i) = result(i) * c(i)
      enddo
   This is really of the type “a(i)=a(i-1)”!

  19. Data dependency analysis One more: Can the following loop be parallelized?
      do i = 3,n
         a(i) = a(i-2)
      enddo
   If this is parallelized, there will probably be different answers from one run to another. Why?

  20. Data dependency analysis
      do i = 3,n
         a(i) = a(i-2)
      enddo
   This looks like it will be safe:
            CPU1        CPU2
   cycle1   A(3)=A(1)   A(4)=A(2)
   cycle2   A(5)=A(3)   A(6)=A(4)

  21. Data dependency analysis
      do i = 3,n
         a(i) = a(i-2)
      enddo
   HOWEVER: what if there are 3 CPUs and not 2?
            CPU1        CPU2        CPU3
   cycle1   A(3)=A(1)   A(4)=A(2)   A(5)=A(3)
   In this case, a(3) is written by one thread and read by another at the same time

  22.-23. RISC memory levels, single CPU (diagram): CPU -> Cache -> Main memory

  24.-26. RISC memory levels, multiple CPUs (diagram): CPU 0 -> Cache 0 and CPU 1 -> Cache 1, with both caches connected to the shared main memory

  27. Definition of OpenMP • Application Program Interface (API) for shared memory parallel programming • Directive-based approach with library support • Targets existing applications and widely used languages: Fortran API first released October 1997; C/C++ API first released October 1998 • Multi-vendor/platform support

  28. Why was OpenMP developed? • Parallel programming before OpenMP: standards existed for distributed memory (MPI and PVM), but there was no standard for shared memory programming • Vendors had different directive-based APIs for SMP (SGI, Cray, Kuck & Assoc, DEC): vendor proprietary, similar but not the same, and most targeted at loop-level parallelism • Commercial users and high-end software vendors have a big investment in existing codes • End result: users wanting portability were forced to use MPI even for shared memory, which sacrifices built-in SMP hardware benefits and requires major effort

  29. The Spread of OpenMP Organization: Architecture Review Board. Web site: www.openmp.org. Software: Portland (PGI), NAG, Intel, Kuck & Assoc (KAI), Absoft. Hardware: HP/DEC, IBM, Intel, SGI, Sun

  30. OpenMP Interface model • Directives and pragmas: control structures; work sharing; data scope attributes (private, firstprivate, lastprivate, shared, reduction) • Runtime library routines: control and query (number of threads, nested parallelism?, throughput mode); lock API • Environment variables: runtime environment (schedule type, max number of threads, nested parallelism, throughput mode)

  31. OpenMP execution model • An OpenMP program starts in a single thread, in sequential mode • To create additional threads, the user opens a parallel region: additional slave threads are launched, the master thread is part of the team, and the threads “disappear” at the end of the parallel region • This model is repeated as needed (e.g. a parallel region with 4 threads, then one with 2 threads, then one with 3 threads, with the master thread running sequentially in between)

  32. Creating parallel threads
Fortran:
c$omp parallel [clause,clause]
      code to run in parallel
c$omp end parallel
C/C++:
#pragma omp parallel [clause,clause]
{ code to run in parallel }
Replicated execution:
      i=0
c$omp parallel
      call foo(i,a,b)
c$omp end parallel
      print*,i
Every thread in the team executes its own call to foo. Number of threads: set by a library call or an environment variable
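
A minimal, self-contained sketch of this execution model (this program is illustrative, not from the slides; it uses the thread-identification routines listed later in the course):

      program hello
      integer iam, nt
      integer omp_get_thread_num, omp_get_num_threads
c$omp parallel private(iam, nt)
c     every thread in the team executes this block
      iam = omp_get_thread_num()
      nt  = omp_get_num_threads()
      print *, 'thread ', iam, ' of ', nt
c$omp end parallel
      end

Compiled with f77 -mp, it prints one line per thread, in an order that may change from run to run.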

  33. OpenMP on the Origin 2000 Switches, formats:
f77 -mp
c$omp parallel do
c$omp+shared(a,b,c)
   OR
c$omp parallel do shared(a,b,c)
Conditional compilation:
c$    iam = omp_get_thread_num()+1

  34. OpenMP on the Origin 2000 - C Switches, formats:
cc -mp
#pragma omp parallel for \
   shared(a,b,c)
   OR
#pragma omp parallel for shared(a,b,c)

  35. OpenMP on the Origin 2000 Parallel Do Directive
c$omp parallel do private(i)
      do i=1,n
         a(i) = i+1
      enddo
c$omp end parallel do      <-- optional
Topics: Clauses, Detailed construct

  36. OpenMP on the Origin 2000 Parallel Do Directive - Clauses: shared, private, default(private|shared|none), firstprivate, lastprivate, reduction({operator|intrinsic}:var), schedule(type[,chunk]), if(scalar_logical_expression), ordered, copyin(var)
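
A hedged sketch combining several of these clauses on one loop (the array names, the scalar total, and the chunk size are illustrative):

c$omp parallel do shared(a,b) private(tmp)
c$omp+ firstprivate(scale) reduction(+:total)
c$omp+ schedule(dynamic,100) if(n.gt.1000)
      do i = 1, n
         tmp   = scale*b(i)
         a(i)  = a(i) + tmp
         total = total + a(i)
      enddo
c$omp end parallel do

The if clause keeps short loops sequential, where the cost of spawning threads would outweigh the gain.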

  37. Allocating private and shared variables (diagram): S = shared variable, P = private variable. A single copy of S exists before, inside, and after the parallel region; inside the parallel region each thread additionally gets its own private copy P

  38. Clauses in OpenMP - 1 Clauses for the “parallel” directive specify data association rules and conditional computation:
• shared (list) - data accessible by all threads; all threads refer to the same storage
• private (list) - data private to each thread; a new storage location is created with that name for each thread, and the contents of that storage are not available outside the parallel region
• default (private | shared | none) - default association for variables not otherwise mentioned
• firstprivate (list) - same as private(list), but the contents are given an initial value from the variable with the same name outside the parallel region
• lastprivate (list) - available only for work-sharing constructs; a shared variable with that name is set to the last computed value of a thread-private variable in the work-sharing construct
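
A hedged sketch of firstprivate and lastprivate in use (the names x, imax, a, b are illustrative):

      x = 5.0
c$omp parallel do firstprivate(x) lastprivate(imax)
      do i = 1, n
c        each thread's private copy of x starts with the value 5.0
         a(i) = x*b(i)
c        after the loop, imax holds the value from iteration i=n
         imax = i
      enddo
c$omp end parallel do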

  39. Clauses in OpenMP - 2
• reduction ({op|intrinsic}:list) - variables in the list are named scalars of intrinsic type; a private copy of each variable is made in each thread and initialized according to the intended operation; at the end of the parallel region or other synchronization point all private copies are combined
• the operation must be of one of the forms:
   x = x op expr
   x = intrinsic(x,expr)
   if (x.LT.expr) x = expr
   x++; x--; ++x; --x;
  where expr does not contain x
• Fortran operators/intrinsics and their initial values:
   + or -        0
   *             1
   .AND.         .TRUE.
   .OR.          .FALSE.
   .EQV.         .TRUE.
   .NEQV.        .FALSE.
   MAX           smallest representable number
   MIN           largest representable number
   IAND          all bits on
   IOR or IEOR   0
• C operators and their initial values:
   + or -   0
   *        1
   &        ~0 (all bits on)
   |        0
   ^        0
   &&       1
   ||       0
• Example: c$omp parallel do reduction(+:a,y) reduction(.OR.:s)
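
A hedged sketch of the intrinsic form, combining a sum and a MAX reduction in one loop (variable names are illustrative):

      total = 0.0
      amax  = a(1)
c$omp parallel do reduction(+:total) reduction(max:amax)
      do i = 1, n
         total = total + a(i)
         amax  = max(amax, a(i))
      enddo
c$omp end parallel do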

  40. Clauses in OpenMP - 3
• copyin(list) - the list must contain common block (or global) names that have been declared threadprivate; data in the master thread's copy of that common block is copied to the thread-private storage at the beginning of the parallel region; there is no “copyout” clause - data in a private common block is not available outside of that thread
• if (scalar_logical_expression) - when an “if” clause is present, the enclosed code block is executed in parallel only if the scalar_logical_expression is .TRUE.
• ordered - only for do/for work-sharing constructs; the code in the ORDERED block is executed in the same sequence as in sequential execution
• schedule (kind[,chunk]) - only for do/for work-sharing constructs; specifies the scheduling discipline for loop iterations
• nowait - the end of a work-sharing construct and the SINGLE directive imply a synchronization point unless “nowait” is specified
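
A hedged sketch of schedule and ordered used together (heavy_work is a hypothetical function standing in for an expensive computation whose cost varies per iteration):

c$omp parallel do ordered schedule(dynamic,10)
      do i = 1, n
         a(i) = heavy_work(b(i))
c$omp ordered
c        output appears in the same order as in a sequential run
         print *, i, a(i)
c$omp end ordered
      enddo
c$omp end parallel do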

  41. OpenMP on the Origin 2000 Parallel Sections Directive
c$omp parallel sections private(i)
c$omp section
      block1
c$omp section
      block2
c$omp end parallel sections
Topics: Clauses, Detailed construct
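
A hedged sketch with two concrete sections (the two initialization loops are illustrative); each section is handed to one thread of the team, and the two run concurrently:

c$omp parallel sections private(i)
c$omp section
      do i = 1, n
         a(i) = 0.0
      enddo
c$omp section
      do i = 1, n
         b(i) = 1.0
      enddo
c$omp end parallel sections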

  42. OpenMP on the Origin 2000 Parallel Sections Directive - Clauses: shared, private, default(private|shared|none), firstprivate, lastprivate, reduction({operator|intrinsic}:var), if(scalar_logical_expression), copyin(var)

  43. OpenMP on the Origin 2000 Defining a Parallel Region - Individual Do Loops
c$omp parallel shared(a,b)
c$omp do private(j)
      do j=1,n
         a(j)=j
      enddo
c$omp end do nowait
c$omp do private(k)
      do k=1,n
         b(k)=k
      enddo
c$omp end do
c$omp end parallel

  44. OpenMP on the Origin 2000 Defining a Parallel Region - Explicit Sections
c$omp parallel shared(a,b)
c$omp section
      block1
c$omp single
      block2
c$omp section
      block3
c$omp end parallel

  45. OpenMP on the Origin 2000 Synchronization Constructs: master/end master, critical/end critical, barrier, atomic, flush, ordered/end ordered
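
A hedged sketch using critical, barrier, and master together to sum an array by hand, as an alternative to the reduction clause (variable names are illustrative):

      total = 0.0
c$omp parallel private(mysum)
      mysum = 0.0
c$omp do
      do i = 1, n
         mysum = mysum + a(i)
      enddo
c$omp end do
c$omp critical
c     only one thread at a time adds its partial sum to the total
      total = total + mysum
c$omp end critical
c$omp barrier
c$omp master
      print *, 'total = ', total
c$omp end master
c$omp end parallel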

  46. OpenMP on the Origin 2000 Run-Time Library Routines - Execution environment: omp_set_num_threads, omp_get_num_threads, omp_get_max_threads, omp_get_thread_num, omp_get_num_procs, omp_in_parallel, omp_set_dynamic/omp_get_dynamic, omp_set_nested/omp_get_nested
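
A hedged sketch of the query/control routines (the declarations are needed in Fortran because these are typed functions):

      integer omp_get_num_procs, omp_get_max_threads
      logical omp_in_parallel
c     ask for one thread per processor on the machine
      call omp_set_num_threads(omp_get_num_procs())
      print *, 'max threads: ', omp_get_max_threads()
      print *, 'in parallel? ', omp_in_parallel()
c$omp parallel
      print *, 'in parallel? ', omp_in_parallel()
c$omp end parallel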

  47. OpenMP on the Origin 2000 Run-Time Library Routines - Lock routines: omp_init_lock, omp_destroy_lock, omp_set_lock, omp_unset_lock, omp_test_lock
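
A hedged sketch of the lock API (the exact declaration of the lock variable depends on the OpenMP version and compiler; shown here assuming an omp_lib.h include file that defines omp_lock_kind):

      include 'omp_lib.h'
      integer (kind=omp_lock_kind) lck
      integer omp_get_thread_num, id
      call omp_init_lock(lck)
c$omp parallel private(id)
      id = omp_get_thread_num()
      call omp_set_lock(lck)
c     only one thread at a time executes this protected print
      print *, 'thread ', id, ' holds the lock'
      call omp_unset_lock(lck)
c$omp end parallel
      call omp_destroy_lock(lck)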
