
PL/B: A Programming Language for the Blues

  1. PL/B: A Programming Language for the Blues George Almasi, Luiz DeRose, Jose Moreira, and David Padua

  2. Objective of this Project • To develop a programming system for the Blue Gene series (and distributed-memory machines in general) that is: • Convenient to use for program development and maintenance. Resulting parallel programs should be very readable – easy to modify. • Not too difficult to implement so that the system could be completed in a reasonable time. • Main focus is numerical computing.

  3. Most Popular Parallel Programming Models Today
  • SPMD
    • Library-based: MPI, BSP
    • Compiler-based: Co-Array Fortran, UPC
  • Single thread of execution, loop/array-based
    • Implicit code partitioning and communication; sophisticated compiler transformations: HPF, ZPL
    • Explicit code partitioning and communication; simple compiler transformations: OpenMP

  4. Design Principles (1 of 2) • Wanted to avoid the SPMD programming model. • In its more general form, it can lead to 4D spaghetti code; more structure is needed. • The most popular of the SPMD implementations, MPI, is cumbersome to use, in part due to the lack of compiler support. It has been called the “assembly language of parallel computing”. • It is difficult to reason about SPMD programs; when we do reason about them, we rely on an (implicit) global view of communication and computation. • Wanted to avoid the compiler complexity and limitations that led to the failure of HPF.

  5. Design Principles (2 of 2) • OpenMP is a good model, but Blue Gene is not a shared-memory machine. OpenMP could be implemented on top of TreadMarks, but efficient implementations would require untested, sophisticated compilers. • We propose a few language extensions and a programming model based on a single thread of execution. We would like the extensions to require only simple transformations similar to those used by OpenMP compilers. • We would like the extensions to be of a general nature so that they can be applied to any conventional language.

  6. MATLAB Implementation • Next we present our solution in the context of MATLAB. • There are at least two reasons why developing a parallel extension to MATLAB is a good idea. • MATLAB is an excellent language for prototyping conventional algorithms. There is nothing equivalent for parallel algorithms. • Programmers of parallel machines should appreciate the MATLAB environment as much as conventional programmers do. • In fact, during a site visit to TJ Watson, several government representatives requested that a parallel MATLAB be included in the PERCS project. • It is relatively simple to develop a prototype of the proposed extensions in MATLAB.

  7. PL/B Data Types and Statements • The constructs of PL/B are • A new type of object: hierarchically tiled arrays. • Array (and hierarchical array) operations and assignment statements similar to those found in MATLAB, Fortran 90, and APL. • Hierarchically tiled arrays (HTAs) are used • to specify data distribution in parallel programs and • to facilitate the writing of blocked sequential algorithms (for locality).

  8. Example 1: Matrix Multiplication (1 of 4)
1. Tiled Matrix Multiplication in a Conventional Language

for I=1:q:n
  for J=1:q:n
    for K=1:q:n
      for i=I:I+q-1
        for j=J:J+q-1
          for k=K:K+q-1
            c(i,j) = c(i,j) + a(i,k)*b(k,j);
          end
        end
      end
    end
  end
end

  9. Example 1: Matrix Multiplication (2 of 4)
2. Tiled Matrix Multiplication Using HTAs

for i=1:m
  for j=1:m
    T = 0;
    for k=1:m
      T = T + a{i,k}*b{k,j};
    end
    c{i,j} = T;
  end
end

Here c{i,j}, a{i,k}, b{k,j}, and T represent submatrices. The * operator represents matrix multiplication in MATLAB.
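
For reference, the tile-level loop above can be emulated sequentially in plain MATLAB with a cell array of submatrices standing in for the HTA. The following is a minimal sketch (not PL/B itself), assuming n is a multiple of the tile size q:

n = 8; q = 2; m = n/q;                         % n assumed divisible by q
A = rand(n); B = rand(n);
a = mat2cell(A, repmat(q,1,m), repmat(q,1,m)); % m-by-m cell array of q-by-q tiles
b = mat2cell(B, repmat(q,1,m), repmat(q,1,m));
c = cell(m,m);
for i=1:m
  for j=1:m
    T = zeros(q);
    for k=1:m
      T = T + a{i,k}*b{k,j};                   % tile-level matrix multiply
    end
    c{i,j} = T;
  end
end
err = norm(cell2mat(c) - A*B)                  % should be about 0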

  10. Example 1: Matrix Multiplication (3 of 4)
3. Parallel Matrix Multiplication Using HTAs

for j=1:m
  for k=1:m
    c{:,j} = c{:,j} + a{:,k}*b{k,j};
  end
end

• Use the middle-product method to do the parallel matrix multiply.
• Assume a two-dimensional processor mesh.
• Assume {i,j} submatrices are stored in processor (i,j).
• All communications are explicit in separate array assignment statements.
• Parallel computations are also explicit in separate array assignment statements and involve no communication.

% block {i,j} is in processor (i,j)
for j=1:m
  for k=1:m
    r{:,j} = a{:,k};                    % copy kth column of a
    s{:,j} = b{k,j};                    % broadcast b{k,j}
    c{:,j} = c{:,j} + r{:,j}*s{:,j};    % compute w/o communication
  end
end

  11. Example 1: Matrix Multiplication (4 of 4)
4. A Second Version of Parallel Matrix Multiplication

for i=1:m
  for j=1:m
    c{i,j} = dotproduct(a,b,i,j);
  end
end

function x = dotproduct(a,b,i,j)
  t{i,:} = b{:,j};                       % communication
  t{i,:} = t{i,:}*a{i,:};                % computation
  s{i,:} = t{i,:};
  for k=1:ceil(log2(size(a)))
    s{i,:} = eoshift(t{i,:},2^k,eye);
    t{i,:} = s{i,:}*t{i,:};
  end
  x = t{i,1};
end

• Use now the inner-product method.
• Collective operations that involve both computations and communications can be represented using intrinsic functions or user functions.

  12. HTAs • HTAs are n-dimensional arrays that have been tiled (tiles are d-dimensional, d ≤ n). • The tiles can themselves be recursively tiled, with the tiles at each level having the same shape except for “boundary cases”. • The topmost levels can be distributed across the modules of a parallel system.

  13. Examples of HTAs • A(1:4,1:18) tiled as A{1:2}{1:4,1:3}(1:3). • A(1:4,1:17) tiled as A{1}{1:4,1:3}(1:3) and A{2}{1:4,1:3}(1:2), a “boundary case” in which the tiles of the second top-level tile are narrower.

  14. Flattening • Elements of an HTA are referenced using a tile index for each level in the hierarchy followed by an array index. Each tile index tuple is enclosed within {}s and the array index is enclosed within parentheses. • In the matrix multiplication code, c{i,j}(3,4) would represent element (3,4) of submatrix {i,j}. • In the first example of the previous page, A{2}{4,3}(3) represents the element in the bottom right corner. • Alternatively, the tiled array can be accessed as a flat array. • In the matrix multiplication example, assuming that each tile is p × p, c(r,s) would represent element (3,4) of submatrix {i,j} if r=(i-1)*p+3 and s=(j-1)*p+4.
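
The correspondence between the two views can be written down explicitly. The sketch below defines illustrative helpers (tile2flat and flat2tile are not part of PL/B) for a single level of p x p tiles:

p = 4;                                              % tile size assumed to be 4 x 4
tile2flat = @(i,j,r,s) deal((i-1)*p + r, (j-1)*p + s);
flat2tile = @(R,S) deal(ceil(R/p), ceil(S/p), mod(R-1,p)+1, mod(S-1,p)+1);

[R,S] = tile2flat(2,1,3,4)      % tile {2,1}, element (3,4) -> flat (7,4)
[i,j,r,s] = flat2tile(7,4)      % flat (7,4) -> tile {2,1}, element (3,4)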

  15. Two Ways of Referencing the Elements of an 8 x 8 Array. The array C is partitioned into four 4 x 4 tiles; every element can be referenced either through its tile, C{i,j}(r,s), or through the flat view, C((i-1)*4+r,(j-1)*4+s). For example, C{1,1}(1,1) = C(1,1), C{1,2}(1,1) = C(1,5), C{2,1}(3,4) = C(7,4), and C{2,2}(4,4) = C(8,8).

  16. Parallel Communication and Computation in PL/B (1 of 2) • PL/B programs are single-threaded and contain array operations on HTAs. • As in HPF and MPI, processors can be arranged in different forms. In HPF and MPI, processors are arranged into meshes; in PL/B we also consider (virtual) nodes, so that processors can be arranged into hierarchical meshes. • The top levels of HTAs can be distributed onto a subset of nodes.

  17. Parallel Communication and Computation in PL/B (2 of 2) • Array operations on HTAs can represent communication or computation. • Assignment statements where all HTA indices are identical are computations executed in the home of each of the HTA elements involved. • Assignment statements where this is not the case represent communication operations.
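
A minimal sequential sketch of this distinction, with plain MATLAB cell arrays standing in for HTAs (the arrays and sizes are made up for illustration):

u = {rand(4), rand(4); rand(4), rand(4)};
v = {rand(4), rand(4); rand(4), rand(4)};
w = cell(2,2);
for i=1:2
  for j=1:2
    w{i,j} = u{i,j} + v{i,j};   % identical tile indices on both sides: local computation
  end
end
for j=1:2
  u{1,j} = v{2,j};              % different tile indices: tile {2,j} moves to the home
end                             % of tile {1,j}, i.e. communication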

  18. HTA Distribution Examples (1 of 4) • Consider a 3x6 processor arrangement. • An HTA with shape {1:3,1:6}(1:10) will be distributed on the processor mesh by assigning each 10-element vector to a different processor. • An HTA with shape {1:6,1:18}(1:10) will be distributed cyclically on both dimensions: for example, processor (1,1) hosts tiles A{1,1}, A{1,7}, A{1,13}, A{4,1}, A{4,7}, and A{4,13}, and processor (1,2) hosts A{1,2}, A{1,8}, A{1,14}, A{4,2}, A{4,8}, and A{4,14}.

  19. HTA Distribution Examples (2 of 4) Consider the two-level processor arrangement shown on the left: a 3x2 arrangement of 2x2 meshes of processors. An HTA with shape {1:3,1:2}{1:2,1:2}(1:10,1:10) could be distributed by assigning each 10x10 array to a different processor. Alternatively, the processor arrangement could be flattened, so that an HTA with shape {1:6,1:4}(1:10,1:10) could be distributed by assigning each 10x10 array to a different processor.

  20. HTA Distribution Examples (3 of 4)
• An HTA b with shape {1:3,1:2}{1:2,1:20}(1:5) would be distributed cyclically on the arrangement on the left. The pair of processors {i,j}{k,1:2} would host the following vectors:
    {i,j}{k,1}: b{i,j}{k,1}(1:5), b{i,j}{k,3}(1:5), …, b{i,j}{k,19}(1:5)
    {i,j}{k,2}: b{i,j}{k,2}(1:5), b{i,j}{k,4}(1:5), …, b{i,j}{k,20}(1:5)
• An HTA b with shape {1:6,1:40}(1:5) would be distributed cyclically on the flattened arrangement, so that processors {i,1:4} would host the following vectors:
    {i,1}: b{i,1}(1:5), b{i,5}(1:5), …, b{i,17}(1:5)
    {i,2}: b{i,2}(1:5), b{i,6}(1:5), …, b{i,18}(1:5)
    {i,3}: b{i,3}(1:5), b{i,7}(1:5), …, b{i,19}(1:5)
    {i,4}: b{i,4}(1:5), b{i,8}(1:5), …, b{i,20}(1:5)
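
The cyclic mapping used in these examples is easy to state explicitly. A one-dimensional sketch (the helper name owner is illustrative, not part of PL/B): with P processors along a dimension, tile t is hosted by processor mod(t-1,P)+1:

owner = @(t, P) mod(t-1, P) + 1;   % cyclic distribution along one dimension
owner(1:8, 4)                      % -> 1 2 3 4 1 2 3 4
owner(1:2:20, 2)                   % odd-numbered tiles all map to processor 1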

  21. Advantages of the PL/B Programming Model • We believe that by relying on a single thread our programming model facilitates the transition from sequential to parallel form. • In fact, a conventional program can be incrementally modified by converting one data structure at a time into distributed cell array form (cf. OpenMP).

  22. Additional Examples • We now present several additional examples of PL/B. • Except where indicated, we assume that: • Each HTA has a single level of tiling, • All HTAs have identical sizes and shapes, • The HTAs are identically distributed on a processor mesh that matches all the HTAs in size and shape. • Our objective in these examples is to show the expressiveness of the language for both dense and sparse computations.

  23. Example 2: Cannon’s Algorithm (1 of 3)

  24. Example 2: Cannon’s Algorithm (2 of 3)
forall version

c{1:n,1:n} = zeros(p,p);              %communication
forall i=1:n
  a{i,:} = cshift(a{i,:},i-1);        %communication
end
forall i=1:n
  b{:,i} = cshift(b{:,i},i-1);        %communication
end
for k=1:n
  forall i=1:n, j=1:n
    c{i,j} = c{i,j}+a{i,j}*b{i,j};    %computation
  end
  forall i=1:n
    a{i,:} = cshift(a{i,:},1);        %communication
  end
  forall i=1:n
    b{:,i} = cshift(b{:,i},1);        %communication
  end
end

  25. Example 2: Cannon’s Algorithm (3 of 3)
Triplet version

c{1:n,1:n} = zeros(p,p);                         %communication
for i=2:n
  a{i:n,:} = cshift(a{i:n,:},dim=2,shift=1);     %communication
  b{:,i:n} = cshift(b{:,i:n},dim=1,shift=1);     %communication
end
for k=1:n
  c{:,:} = c{:,:}+a{:,:}*b{:,:};                 %computation
  a{:,:} = cshift(a{:,:},dim=2,shift=1);         %communication
  b{:,:} = cshift(b{:,:},dim=1,shift=1);         %communication
end
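
The skew-and-shift pattern of Cannon's algorithm can be checked sequentially with plain MATLAB cell arrays and circshift. The following sketch verifies the result against the ordinary product (the sizes are made up for illustration):

n = 4; p = 3;                                % n-by-n grid of p-by-p tiles
A = rand(n*p); B = rand(n*p);
a = mat2cell(A, repmat(p,1,n), repmat(p,1,n));
b = mat2cell(B, repmat(p,1,n), repmat(p,1,n));
c = repmat({zeros(p)}, n, n);
for i=2:n                                    % initial skew
  a(i,:) = circshift(a(i,:), -(i-1), 2);     % rotate tile row i left by i-1
  b(:,i) = circshift(b(:,i), -(i-1), 1);     % rotate tile column i up by i-1
end
for k=1:n
  for i=1:n
    for j=1:n
      c{i,j} = c{i,j} + a{i,j}*b{i,j};       % local tile multiply-accumulate
    end
  end
  a = circshift(a, -1, 2);                   % rotate every tile row left by 1
  b = circshift(b, -1, 1);                   % rotate every tile column up by 1
end
err = norm(cell2mat(c) - A*B)                % should be about 0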

  26. Example 2: The SUMMA Algorithm (1 of 6) Use now the outer-product method (n² parallelism). Interchanging the loop headers of the loop in Example 1 produces:

for k=1:n
  for i=1:n
    for j=1:n
      C{i,j}=C{i,j}+A{i,k}*B{k,j};
    end
  end
end

To obtain n² parallelism, the inner two loops should take the form of a block operation:

for k=1:n
  C{:,:}=C{:,:}+A{:,k} ⊗ B{k,:};
end

where the operator ⊗ represents the outer-product operation.

  27. Example 2: The SUMMA Algorithm (2 of 6) Switch orientation: using a column of A and a row of B broadcast to all processors, compute the “next” terms of the dot products. For example, broadcasting a11, a21 and b11, b12 produces the partial products a11*b11, a11*b12, a21*b11, and a21*b12.

  28. Example 2: The SUMMA Algorithm (3 of 6)

c{1:n,1:n} = zeros(p,p);                      % communication
for i=1:n
  t1{:,:} = spread(a(:,i),dim=2,ncopies=N);   % communication
  t2{:,:} = spread(b(i,:),dim=1,ncopies=N);   % communication
  c{:,:} = c{:,:}+t1{:,:}*t2{:,:};            % computation
end
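
As a sequential check of the outer-product formulation, the spread operations can be imitated with repmat over a cell array; a sketch with made-up sizes, not a parallel implementation:

n = 4; p = 3;
A = rand(n*p); B = rand(n*p);
a = mat2cell(A, repmat(p,1,n), repmat(p,1,n));
b = mat2cell(B, repmat(p,1,n), repmat(p,1,n));
c = repmat({zeros(p)}, n, n);
for k=1:n
  t1 = repmat(a(:,k), 1, n);              % "spread" tile column k of a across the mesh
  t2 = repmat(b(k,:), n, 1);              % "spread" tile row k of b across the mesh
  for i=1:n
    for j=1:n
      c{i,j} = c{i,j} + t1{i,j}*t2{i,j};  % purely local rank-p update
    end
  end
end
err = norm(cell2mat(c) - A*B)             % should be about 0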

  29. Example 2: The SUMMA Algorithm (4 of 6)

  30. Example 2: The SUMMA Algorithm (5 of 6)

  31. Example 2: The SUMMA Algorithm (6 of 6)

  32. Example 4: Jacobi Relaxation

while dif > epsilon
  v{2:,:}(0,:) = v{:n-1,:}(p,:);          % communication
  v{:n-1,:}(p+1,:) = v{2:,:}(1,:);        % communication
  v{:,2:}(:,0) = v{:,1:n-1}(:,p);         % communication
  v{:,:n-1}(:,p+1) = v{:,2:}(:,1);        % communication
  u{:,:}(1:p,1:p) = a * (v{:,:}(1:p,0:p-1) + v{:,:}(0:p-1,1:p) +
                         v{:,:}(1:p,2:p+1) + v{:,:}(2:p+1,1:p));   % computation
  dif = max(max(abs(v - u)));
  v = u;
end
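
For reference, the same four-point stencil written against the flat (untiled) view of the array; a minimal sequential sketch with made-up sizes:

N = 32; a = 0.25; epsilon = 1e-6;
v = rand(N+2);                         % N-by-N interior plus a fixed boundary ring
u = v;
dif = inf;
while dif > epsilon
  u(2:N+1,2:N+1) = a * ( v(2:N+1,1:N) + v(1:N,2:N+1) ...
                       + v(2:N+1,3:N+2) + v(3:N+2,2:N+1) );   % 4-point update
  dif = max(max(abs(v - u)));
  v = u;
end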

  33. Example 5: Sparse MVM/copy vector (1 of 2) Figure: the sparse matrix A is partitioned into block rows owned by processors P1 through P4; each block row is multiplied by the vector b.

  34. Example 5: Sparse MVM/copy vector (2 of 2)

% Distribute a
c{1:n} = a(DIST(1:n):DIST(2:n+1)-1,:);
% Broadcast vector b
v{1:n} = b;
% Local multiply
t{:} = c{:} * v{:};
% Get result
forall i=1:N
  v{i} = t(:);    % flattened t
end

Important observation: in MATLAB, sparse computations can be represented as dense computations. The interpreter only performs the necessary operations.
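
A plain MATLAB sketch of the same block-row scheme, run sequentially: DIST(i) holds the first row of block i, as in the code above (the sizes and the way DIST is built are made up for illustration):

n = 4;                                  % number of processors / row blocks
m = 1000;                               % matrix dimension
A = sprand(m, m, 0.01); b = rand(m, 1);
DIST = round(linspace(1, m+1, n+1));    % rows DIST(i):DIST(i+1)-1 belong to block i

c = cell(n,1); t = cell(n,1);
for i=1:n
  c{i} = A(DIST(i):DIST(i+1)-1, :);     % distribute block row i of A
  t{i} = c{i} * b;                      % local multiply against the full copy of b
end
y = cell2mat(t);                        % gather the result
err = norm(y - A*b)                     % should be about 0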

  35. Example 6: Sparse MVM/distribute vector (1 of 7) Figure: as in Example 5, A is partitioned into block rows owned by processors P1 through P4, but now the vector b is distributed across the processors as well.

  36. Example 6: Sparse MVM/distribute vector (2 of 7) Step 1: Compute the set of indices needed in processor i from the chunk of the vector in processor j.

  37. Example 6: Sparse MVM/distribute vector (3 of 7) Step 2: Perform collective communication by transposing the matrix of index sets.

  38. Example 6: Sparse MVM/distribute vector (4 of 7) Step 3: In each processor, collect the vector elements needed by the other processors.

  39. Example 6: Sparse MVM/distribute vector (5 of 7) Step 4: Transpose again to communicate the vector elements to the processor where they are needed.

  40. Example 6: Sparse MVM/distribute vector (6 of 7)

c{1:n} = a(DIST(1:n):DIST(2:n+1)-1,:);
v{1:n}(DIST(1:n):DIST(2:n+1)-1,1) = b(DIST(1:n):DIST(2:n+1)-1);
% Step 1: P{I}{J} contains the indices of the elements of vector v from
% processor J needed by processor I. The first component of P is distributed
% and the second is a regular cell array.
forall I=1:N
  P{I} = fun(c{I},DIST);
end

  41. Example 6: Sparse MVM/distribute vector (7 of 7)

% Step 2: R{I}{J} contains the indices of the elements of vector v from
% processor I needed by processor J
R{:}{:} = transpose(P{:}{:});
% Step 3: Q{I}{J} contains all elements of vector v from
% processor I needed by processor J
forall I=1:N, k=1:N, (k != I)
  Q{I}{k} = v{I}(R{I}{k}(:));
end
% Step 4: R{I}{J} contains all elements of vector v from
% processor J needed by processor I
R{:}{:} = transpose(Q{:}{:});
forall I=1:N
  v{I}(P{I}(:)) = R{I}(:);
end
v{:} = c{:} * v{:};
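
The index computation of Step 1 (what the code above calls fun) can be sketched sequentially: for each block row I, the indices of v it needs are the nonzero columns of its block of A, grouped by the block J that owns them. The code below is an illustrative guess at that computation, not the actual fun:

n = 4; m = 1000;
A = sprand(m, m, 0.01);
DIST = round(linspace(1, m+1, n+1));
P = cell(n,1);
for I=1:n
  blk = A(DIST(I):DIST(I+1)-1, :);      % block row I of A
  [~, cols] = find(blk);                % columns touched by block row I
  cols = unique(cols);
  P{I} = cell(n,1);
  for J=1:n
    P{I}{J} = cols(cols >= DIST(J) & cols < DIST(J+1));   % indices owned by block J
  end
end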

  42. Example 7: SuperLU_DIST Step 1: Apply LU decomposition to corner block C{K,K}

  43. Example 7: SuperLU_DIST Step 2: Factorize block column L{…,K} = C{…,K}/U{K,K}

  44. Example 7: SuperLU_DIST Step 3: Factorize block row U{K,…}

  45. Example 7: SuperLU_DIST Step 4: Update trailing submatrix C{I,J}=C{I,J}-L{I,K}*U{K,J}

  46. Example 7: SuperLU_DIST

for K=1:N
  [L{K,K},U{K,K}] = lu(C{K,K});                    %Step 1
  L{K+1:N,K} = C{K+1:N,K}/U{K,K};                  %Step 2
  U{K,K+1:N} = L{K,K}\C{K,K+1:N};                  %Step 3
  forall I=K+1:N
    L{I,K+1:} = L{I,K};                            % broadcast L{I,K} across row I
    U{K+1:,I} = U{K,I};                            % broadcast U{K,I} down column I
  end
  C{K+1:,K+1:} = C{K+1:,K+1:}-L{K+1:,K+1:}*U{K+1:,K+1:};   %Step 4
end
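
Ignoring the data distribution and the broadcasts, the block factorization expressed above can be checked sequentially. A sketch without pivoting, so it assumes a matrix (here made diagonally dominant) for which that is safe:

N = 4; p = 5;                               % N-by-N grid of p-by-p tiles
A0 = rand(N*p) + N*p*eye(N*p);              % diagonally dominant example matrix
C = mat2cell(A0, repmat(p,1,N), repmat(p,1,N));
L = repmat({zeros(p)}, N, N);  U = repmat({zeros(p)}, N, N);
for K=1:N
  [L{K,K}, U{K,K}] = lu(C{K,K});            % Step 1: factor the corner block
  for I=K+1:N
    L{I,K} = C{I,K} / U{K,K};               % Step 2: block column of L
  end
  for J=K+1:N
    U{K,J} = L{K,K} \ C{K,J};               % Step 3: block row of U
  end
  for I=K+1:N                               % Step 4: trailing-submatrix update
    for J=K+1:N
      C{I,J} = C{I,J} - L{I,K}*U{K,J};
    end
  end
end
err = norm(cell2mat(L)*cell2mat(U) - A0)    % should be about 0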

  47. Example 7: SuperLU_DIST, Avoiding Unnecessary Communication. Figure: several blocks of the matrix are zero, which is what makes some of the communication unnecessary.

  48. Example 7: SuperLU_DIST Figure: a boolean mask (T/F) marking the nonzero blocks is spread along one dimension and then transposed and sent, so that each processor knows which blocks actually need to be communicated.

  49. Example 7: SuperLU_DIST

for K=1:N
  [L{K,K},U{K,K}] = lu(C{K,K});                    %Step 1
  T{K:N,K} = C{K:N,K} != 0;
  S{K:N,K}(K:N) = spread(T{K:N,K},…);
  R{K,K:N} = transpose(S{K+1:N,K}(K:N));
  L{K+1:N,K} = C{K+1:N,K}/U{K,K};                  %Step 2
  U{K,K+1:N} = L{K,K}\C{K,K+1:N};                  %Step 3
  forall I=K+1:N
    L{I,K+1:} = L{I,K};
  end
  forall I=K+1:N
    U{R{K,I},I} = U{K,I};
  end
  C{K+1:,K+1:} = C{K+1:,K+1:}-L{K+1:,K+1:}*U{K+1:,K+1:};   %Step 4
end
