DATA PARALLEL LANGUAGES (Chapter 4b)

1. DATA PARALLEL LANGUAGES (Chapter 4b): MultiC, Fortran 90, and HPF

2. The MultiC Language
• References:
  • "The multiC Programming Language", Preliminary Documentation, WaveTracer, PUB-00001-001-00.80, Jan. 1991.
  • "The multiC Programming Language", User Documentation, WaveTracer, PUB-00001-001-1.02, June 1992.
  • Note: This presentation is based on the 1991 manual unless otherwise noted. (The term "manuals" refers to both versions.)
• MultiC is the language used on the WaveTracer and the Zephyr SIMD computers.
  • The Zephyr is a second-generation WaveTracer, but was never commercially available.
  • We were given 10 Zephyrs and several other incomplete Zephyrs to use for spare parts.
  • A MultiC++ was designed for their third-generation computer, but neither the language nor the machine was released.
• Both MultiC and a parallel language designed for the MasPar are fairly similar to an earlier parallel language called C*.
  • C* was designed by Guy Steele for the Connection Machine.
  • All are data parallel extensions of the C language.
• An assembler was also written for the WaveTracer (and probably the Zephyr).
  • It was intended for use only by company technicians.

3. • Information about the assembler was released to WaveTracer customers on a "need to know" basis.
  • No manual was distributed, but some details were recorded in a short report.
  • Professor Potter was given some details needed to put the ASC language on the WaveTracer.
• MultiC is an extension of ANSI C, as documented by the following book:
  • The C Programming Language, Second Edition, 1988, Kernighan & Ritchie.
• The WaveTracer computer is called a Data Transport Computer (DTC) in the manual:
  • a large amount of data can be moved in parallel using interprocessor communications.
• Primary expected uses for the WaveTracer were scientific modeling and scientific computation:
  • acoustic waves
  • heat flow
  • fluid flow
  • medical imaging
  • molecular modeling
  • neural networks
• The 3D applications are supported by a 3D mesh on the WaveTracer.
  • Done by sampling a finite set of points (nodes) in space.

4. WaveTracer Architecture Background
• The architecture of the Zephyr is fairly similar.
  • Exceptions will be mentioned whenever known.
• Each board has 4096 bit-serial processors, which can be connected in any of the following ways:
  • 16x16x16 cube in 3D space
  • 64x64 square in 2D space
  • 4096-element array in 1D space
• The 3D architecture is native on the WaveTracer; the other networks are supported in hardware, primarily using the 3D hardware.
  • The Zephyr probably has a 2D network and only simulates the more expensive 3D network using system software.
• The WaveTracer was available with 1, 2, or 4 boards, arranged as follows:
  • 2 boards were arranged as a 16x32x16 cube
    • one cube stacked on top of the other
    • 8192 processors overall

5. WaveTracer Architecture (Cont.)
• Four boards are arranged as a 32x32x16 cube:
  • 16,384 processors
  • arranged as two columns of stacked cubes
• The computer supports automatic creation of virtual processors and of the network connections linking these virtual processors.
  • If each processor supports k nodes, this slows down execution speed by a factor of k.
    • Each processor performs each operation k times.
  • Limited by the amount of memory required for each virtual node.
  • In practice, the slowdown is usually less than k.
• The set of virtual processors supported by a physical processor is called its territory.

6. Specifiers for MultiC Variables
• Any data type in C except pointers can be declared to be a parallel variable using the declaration multi.
  • This replicates the data object for each processor to produce a 1, 2, or 3 dimensional data object.
  • In a parallel execution, all multi objects must have the same dimension.
  • The multi declaration follows the same format as ANSI C, e.g.
        multi int imag, buffer;
• The uni declaration is used to declare a scalar variable.
  • It is the default and need not be shown.
  • The following are equivalent:
        uni int ptr;
        int ptr;
• Bit-Length Variables
  • can be of type uni or multi
  • allow the user to save memory
  • all operations can be performed on these bit-length values
  • Example: A 2-color image can be declared by
        multi unsigned int image:1;
    and an 8-color image by
        multi unsigned int picture:3;
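To make the declarations concrete, here is a minimal sketch (not from the manual) combining the forms above; the function and variable names are illustrative, and the <multi.h> header follows the usage in the manual example on slide 14:

        #include <multi.h>             /* MultiC library header */

        void work(void)
        {
            uni int count;             /* scalar: identical to plain "int count;" */
            multi float height;        /* parallel: one component per virtual PE  */
            multi unsigned int flag:1; /* parallel 1-bit field, saving PE memory  */

            count = 0;                 /* uni assignment: executed once           */
            flag = 0;                  /* multi assignment: every active PE       */
            height = 0.0;              /* likewise replicated across active PEs   */
        }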

7. Some Control Flow Commands
• For uni data structures, control flow in MultiC is identical to that in ANSI C.
• The parallel IF-ELSE statement
  • As in ASC, both the IF and ELSE portions of the code are executed.
  • As with ASC, the IF is a mask-setting operation rather than a branching command.
  • FORMAT: same as for C.
  • WARNING: In contrast to ASC, both sets of statements are executed. Even if no responders are active in one part, the sequential (uni) commands in that part are still executed.
    • Example: count = count + 1;
• The parallel WHILE statement
  • The format used is while(condition).
  • The repetition continues as long as 'condition' is satisfied by one or more responders.
  • Only the responders (i.e., the PEs that satisfy 'condition' prior to a given pass through the body of the 'while') are active during the execution of the body.
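A minimal sketch (not from the manual) of the masking semantics just described; x is an illustrative multi variable assumed to hold a value on each PE:

        multi int x;

        if (x > 0)
            x = x - 1;   /* runs with only the PEs where x > 0 active          */
        else
            x = 0;       /* then runs with only the remaining PEs active;      */
                         /* uni statements in either branch execute regardless */

        while (x > 0)    /* repeats while at least one responder remains       */
            x = x - 1;   /* body executes with only the responders active      */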

8. Other Commands
• Jump statements
  • goto, return, continue, break
  • These commands conflict with structured programming and should be used with restraint.
• Parallel Reduction Operators
        *=   Accumulative Product
        /=   Reciprocal Accumulative Product
        +=   Accumulative Sum
        -=   Negate & then Accumulative Sum
        &=   Accumulative bitwise AND
        |=   Accumulative bitwise OR
        >?=  Accumulative Maximum
        <?=  Accumulative Minimum
  • Each of the above reduction operations returns a uni value and provides a powerful arithmetic operation.
  • Each accumulative operation would otherwise require one or more ANSI C loop constructs.
  • Example: If A is a multi data type:
        largest_value = >?= A;
        smallest_value = <?= A;
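For instance, a global average can be obtained with two reductions. A sketch using only the operators above; the reduction-as-expression syntax follows the slide's example:

        multi float height;          /* one value per active PE          */
        uni float total, avg;
        uni int n;

        total = += height;           /* accumulative sum over active PEs */
        n = += (multi int)1;         /* counts the active PEs            */
        avg = total / n;             /* ordinary scalar C arithmetic     */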

9. Data Replication
• Example:
        multi int A = 0;
        - - -
        A = 2;
  • The first statement stores 0 in every A field (at compile time).
  • The last statement stores 2 in the A field of every active PE.
• Interprocessor Communications
  • Operators have the form [dx; dy; dz]m
  • This operator shifts the components of the multi variable m of all active processors along one or more coordinate dimensions.
  • Example: A = [-1; 2; 1]B
  • Causes each active processor to move the data in its B field to the A field of the processor at the following location:
    • one unit in the negative X direction
    • two units in the positive Y direction
    • one unit in the positive Z direction
• (Figure: coordinate axes, with origin O and axes X, Y, and Z.)
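As a worked use of the shift operator (a sketch, not from the manual): the heat-flow style applications from slide 3 typically replace each node's value with the average of its neighbors. This assumes the two-coordinate form [dx; dy]V described on the next slide and that shifted multi expressions can be combined arithmetically; boundary handling is ignored:

        multi float T, Tnew;         /* temperature at each grid node */

        /* Four-neighbor average: each PE combines the T values shifted
           in from its +/-X and +/-Y neighbors.                         */
        Tnew = ([1; 0]T + [-1; 0]T + [0; 1]T + [0; -1]T) / 4.0;
        T = Tnew;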

10. Conventions
• If the value of the dz offset is not specified, it is assumed to be 0.
• If the values of the dy and dz offsets are not specified, both are assumed to be 0.
  • Example: [x; y]V is the same as [x; y; 0]V
• Inactive processor actions
  • An inactive processor does not send its data to another processor.
  • It does participate in moving the data of other processors along.
• Transmission of data occurs in lock step (SIMD fashion) without congestion or buffering.
• Coordinate Functions
  • Used to return a coordinate for each active virtual processor.
  • Format: multi_x(), multi_y(), and multi_z()
  • Example:
        if (multi_x() == 0 && multi_y() == 2 && multi_z() == 1)
            u = += A;
  • Note that all processors except the one at (0,2,1) are inactive within the body of the IF.
  • The accumulated sum of the active components of the multi variable A is just the value of the component of A at processor (0,2,1).
  • The effect of this example is to store the value of A at (0,2,1) in the uni variable u.

11. • If the second command in the example is changed to A = u; the effect is to store the contents of the uni variable u into the multi variable A at location (0,2,1).
  • (See manual pages 11-13 and 11-14 for more details.)
• Arrays
  • Multi pointers are not supported.
    • You cannot have a parallel variable containing a pointer in each component.
  • uni pointers to multi variables are allowed.
  • Array examples:
        int array_1 [10];
        int array_2 [5][5];
        multi int array_3 [5];
    • array_1 is a 1-dimensional standard C array
    • array_2 is a 2-dimensional standard C array
    • array_3 is a 1-dimensional array of multi variables
• The MULTI_PERFORM Command
  • The command gives the size of each dimension of all multi values prior to calling for a parallel execution.
  • Format: multi_perform(func, xsize, ysize, zsize)
  • Here, "func" is the function being executed.
  • "xsize", "ysize", and "zsize" are positive integers specifying the DTC network configuration.
  • If "zsize" is 1, then multi_perform creates a 2D grid of size xsize × ysize.

12. • multi_perform is normally called within the main program.
  • It usually calls a subroutine that includes all of the
    • parallel work
    • parallel I/O
  • The main program usually includes
    • opening and closing of files
    • some of the scalar I/O
    • #define and #include statements
• When multi_perform is called, it initializes any extern and static multi objects.
• In the previous example, multi_perform calls func. After func returns, the multi space created for it becomes undefined.
• The perror function is extended to print error messages corresponding to errno numbers resulting from the execution of MultiC extensions.
  • It has the following format:
        if (multi_perform(func, x, y, z))
            perror(argv[0]);
  • See usage in the examples in Appendix A.
  • More information on page 11-2 of the manual.
• Examples in the Manual
  • Many examples in the manual (17 in the appendices alone).
  • Also stored under exname.mc in the MultiC package.
  • They can be compiled and executed.

13. The AnyResponder Function
• Code segment for tallying responders:
        unsigned int short_count, tall;  /* "short" is a C keyword, so a
                                            different name is needed here */
        multi float height;

        load_height();       /* assigns a value in inches to height */
        if (height >= 70)
            tall = += (multi int)1;
        else
            short_count = += (multi int)1;
        printf("There are %d tall people\n", tall);
• Comments on the code segment
  • Note that the construct += (multi int)1 counts the active PEs (i.e., the responders).
  • This technique avoids setting up a bit field to tally active PEs.
    • Instead it sets up a temporary multi variable.
  • Can be used to see if there is at least one responder at any given time:
    • check whether the resulting sum is positive.
  • This provides a technique to define the AnyResponder function needed for associative programming.
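A hedged sketch of how such an AnyResponder function might be written with this idiom; the function name is illustrative, and only the reduction syntax is taken from the slides:

        /* Returns a nonzero uni value iff at least one PE is active. */
        uni int any_responder(void)
        {
            uni int n;
            n = += (multi int)1;     /* tally the active PEs */
            return n > 0;
        }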

14. Accessing Components of Multi Variables
• Code from page 11-13 or 11-14 of the MultiC manuals:
        #include <multi.h>    /* includes the multi library */
        #include <stdlib.h>
        #include <stdio.h>

        void work (void)
        {
            uni int a, b, c, u;
            multi int n;
            /* Code goes here to assign values to n */
            /* Code goes here to assign values to a, b, c */
            if (multi_x() == a && multi_y() == b && multi_z() == c)
                u = += n;    /* assigns the value of n at PE (a,b,c) */
        }

        int main (int argc, char *argv[])
        {
            if (multi_perform(work, 7, 7, 7))
                perror(argv[0]);
            exit(EXIT_SUCCESS);
        }
• To place a value of 5 into the selected location, replace the line "u = += n;" with the line "n = 5;".
• The capability to read or place a value in a parallel variable at a selected position is essential for MultiC to execute associative programs.

15. The oneof and next Functions
• The function oneof provides a way of selecting one out of several active processors.
  • Defined in the Multi Struct program (A.15) in the manual.
  • The procedure is essential for associative programming.
• Code for oneof:
        multi unsigned oneof(void):1
        {
            /* Store the coordinate values in multi variables x and y */
            multi unsigned x = multi_x(), y = multi_y(), uno:1 = 0;
            /* Next select the processor with the highest coordinate value */
            if (x == >?= x)
                if (y == >?= y)
                    uno = 1;
            return uno;
        }
• Note that the multi variable uno stores a 1 for exactly one processor, and all the other components of uno store a 0.
• The function oneof can be used by another procedure, which is called by multi_perform.
  • An example of oneof being called by another procedure is given on pages A46-50 of the manuals.
• It should be usable in the form
        if (oneof())   /* check that an active responder exists */
• If we assign a = >?= x; b = >?= y; c = >?= z; then (a,b,c) stores the location of the PE selected by oneof.

16. • The preceding procedure assumed a 2D configuration of processors with z = 1.
  • If the configuration is 3D, the selection process can be continued by also selecting the highest z-coordinate.
• Stepping through the active PEs (i.e., next)
  • Provides the MultiC equivalent of the ASC next command.
  • An additional one-bit multi integer variable called bi (for "busy-idle") is needed.
  • First set bi to zero.
  • Activate the PEs you wish to step through.
  • Next, have the active PEs write a 1 into bi.
  • Use if(oneof()) to restrict the mask to one of the active PEs.
  • Perform all desired operations with the active PE.
  • Have the active PE set its bi value to 0 and then exit the preceding if statement.
  • Use the += (accumulative sum) operator to see if any PEs remain to be processed.
  • If so, return to the step above that calls oneof.
    • This return can be implemented using a while loop, as in the sketch below.
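A hedged sketch of the stepping loop just outlined (not from the manual). It assumes the oneof function from slide 15, reuses the height example from slide 13, and relies on the parallel while semantics of slide 7 in place of an explicit += test:

        multi float height;          /* assume values already loaded        */
        multi unsigned int bi:1 = 0; /* the busy-idle bit                   */

        if (height >= 70)            /* activate the PEs to step through;   */
            bi = 1;                  /* each active PE writes a 1 into bi   */

        while (bi)                   /* continues while any bi == 1 remains */
        {                            /* body runs with the bi==1 PEs active */
            if (oneof())             /* restrict the mask to a single PE    */
            {
                /* ... perform the desired operations with this PE ... */
                bi = 0;              /* the selected PE marks itself idle   */
            }
        }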

17. Sequential Printing of Multi Variable Values
• Example: print a block of the 2D bit array image.
• A function select_int is used, which returns the value of image at the specified (x,y,z) coordinate.
• The printing occurs in two loops which
  • increment the value of x from 0 to some specified constant, and
  • increment the value of y from 0 to some specified constant.
• This example is from page 8-1 of the manuals and is used in an example on pages A16-18 of the 1991 manual and pages A12-14 of the 1992 manual.
• The select_int function:
        int select_int (multi int *mptr, int x, int y, int z)
        /* Here, mptr is a uni pointer to a multi variable */
        {
            int r;
            if (multi_x() == x && multi_y() == y && multi_z() == z)
                /* restricts the scope to the one PE at (x,y,z) */
                r = |= *mptr;  /* the OR reduction operator transfers the
                                  binary value of the multi variable at
                                  (x,y,z) to the uni variable            */
            return r;
        }

18. • The two loops that print a block of values of the image multi variable:
        for (y = 0; y < ysize; y++)
        {
            for (x = 0; x < xsize; x++)
                printf("%d", select_int(&image, x, y, z));
            printf("\n");
        }
• The above technique can be adapted to print or read multi variables, or parts of multi variables.
  • It is efficient as long as the number of locations accessed is small.
• If I/O operations involve a large amount of data, the more efficient data transfer functions described in the manuals (Chapter 8 and Sections 11.2 and 11.13) should be used.
  • The functions multi_fread and multi_fwrite are analogous to fread and fwrite in C. Information about them is given on pages 11-1 to 11-4 of the manuals.

19. Moving Data between Uni Arrays and Multi Variables
• The following functions allow the user to move data between uni arrays and multi variables:
        multi_from_uni ...
        multi_to_uni ...
• The above "..." may be replaced with a data type such as
  • char
  • short
  • int
  • long
  • float
  • double
  • cfloat
  • cdouble
• These functions are illustrated in several of the examples; a hedged sketch of a call follows this list.
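A sketch of what a call might look like. The slide does not give the parameter lists, so the signatures below (destination, source, element count) are assumptions; only the naming pattern with an appended type is taken from the text:

        int buffer[64 * 64];         /* ordinary uni array */
        multi int image;             /* parallel variable  */

        /* Hypothetical signature: load the uni array into the multi variable. */
        multi_from_uni_int(&image, buffer, 64 * 64);

        /* ... parallel processing of image ... */

        /* Hypothetical signature: copy the results back to the uni array. */
        multi_to_uni_int(buffer, &image, 64 * 64);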

20. Compiling and Executing Programs on the Zephyr
• A 4K Zephyr machine is available for use in the Parallel and Associative Computing Lab.
• It is presently connected to a Windows 2003 Server which supports Remote Desktop for interactive use. However, you may use the computer directly at the console while the lab is open.
• Visual Studio 2002 has been installed on the server. The MultiC language uses a "compiler wrapper" to translate MultiC code into Visual C code.
• Programming the Zephyr on a Windows 2003 system is similar to using command-line programming tools in UNIX.
  • You can edit your program using "Edit" or "Notepad".
  • You can compile and create an executable using "nmake".
  • You can execute your program using the Visual Studio Command Shell.
    • This is a special DOS shell that has extra path, include, and library environment variables used by the compiler and linker.

21. Compiling and Executing Programs on the Zephyr
• Log in or use Remote Desktop Connection to zserver.cs.kent.edu.
  • From Windows XP choose Start | Programs | Accessories | Communications | Remote Desktop Connection.
  • Enter your login name and password and click on "OK".
• Open a command window and run the DTC Monitor program.
  • Type dtcmonitor at the command prompt.
  • This is a "daemon" program that serializes and controls executables that use the Zephyr.
  • When it is 100% complete, you can then execute programs on the Zephyr.
  • You can minimize this command shell.
  • Important: When you are finished, enter CTRL-C to end the dtcmonitor.
• Create a folder on your desktop for programs. You can copy the example Zephyr MultiC program from D:\Common\zephyrtest to your local folder and rename it for your programming assignment.

22. Compiling and Executing Programs on the Zephyr
• Create or edit your MultiC program using DOS edit or Windows Notepad.
  • From the Visual Studio Command Shell type
        edit anyprog.mc
    or
        notepad anyprog.mc
  • Make sure that the file extension is .mc.
  • Save your work before compiling.
• Modify the makefile template and change the names of the MultiC file and object file to those used in your programming assignment.
• Compile and link your program by typing
        nmake /f anyprog.mak
    or
        nmake          (for the default Makefile)
• Execute your program by typing the name of your executable at the command prompt.
• When you are finished, enter CTRL-C to end the dtcmonitor.

23. OMIT FOR PRESENT (MultiC Recursion)
• It is possible to write recursive multi functions in MultiC, but you have to test whether there are active PEs still working.
• Consider the following MultiC function:
        multi int factorial( multi int n )
        {
            multi int r;
            if( n != 1 )
                r = factorial( n-1 ) * n;
            else
                r = 1;
            return( r );
        }
• What happens? Recall from slide 7 that both branches of a parallel if are always executed, so the recursive call is made even after every PE has become inactive and the recursion never terminates. The corrected version on the next slide adds the missing test.

24. OMIT FOR PRESENT (MultiC Recursion Example)
• Recursion with an active-PE test:
        multi int factorial( multi int n )
        {
            multi int r;
            /* stop calculating if every component has been computed */
            if( ! |= (multi int) 1 )
                return( (multi int) 0 );
            /* otherwise, continue calculating */
            if( n > 1 )
                r = factorial( n-1 ) * n;
            else
                r = 1;
            return( r );
        }

25. Fortran 90 and HPF (High Performance Fortran)
• A de facto standard for scientific and engineering computations.

26. Fortran 90 and HPF
• References:
  • [19] Ian Foster, Designing and Building Parallel Programs (online copy), Chapter 7.
  • [8] Jordan and Alaghband, Fundamentals of Parallel Processing, Section 3.6.
• Recall that data parallelism refers to the concurrency that occurs when the same operation is executed on some or all elements of a data set.
  • A data parallel program is a sequence of such operations.
• Fortran 90 (or F90) is a data-parallel programming language.
  • Some job control algorithms cannot be expressed in a data parallel language.
• F90's array assignment statement and array functions can be used to specify certain types of data parallel computation.
• F90 forms the basis of HPF (High Performance Fortran), which augments F90 with a small set of extensions.
• In F90 and HPF, the (parallel) data structures operated on are restricted to arrays.
  • E.g., data types such as trees, sets, etc. are not supported.
  • All array elements must be of the same type.
  • Fortran arrays can have up to 7 dimensions.

27. • Parallelism in F90 can be expressed explicitly, as in the array assignment statement
        A = B*C    ! A, B, C are arrays
• Compilers may be able to detect implicit parallelism, as in the following example:
        do i = 1, m
           do j = 1, n
              A(i,j) = B(i,j) * C(i,j)
           enddo
        enddo
  • Parallel execution of the above code depends on the fact that the loop iterations are independent,
    • i.e., one iteration does not write a variable that another iteration reads or writes.
• Compilation can also introduce communication operations when the computation mapped to one PE requires data mapped to another PE.
  • Communication operations in F90 (and HPF) are inferred by the compiler and do not need to be specified by the programmer.
  • They are derived by the compiler from the data decomposition specified by the programmer.
• F90 allows a variety of scalar operations (i.e., operations defined on single values) to be applied to entire arrays.

28. • All of F90's unary and binary operations can be applied to arrays as well, as illustrated in the examples below (the declarations are made conformable so every statement is legal):
        real A(10,20), B(10,20), c
        logical L(10,20)

        A = B + c
        A = A + 1.0
        A = sqrt(A)
        L = A .EQ. B
• The function of the mask is handled in F90 by the where statement, which has two forms.
• The first form uses where to restrict the array elements on which an assignment is performed.
  • For example, the following replaces each nonzero entry of the array X with its reciprocal:
        where (X /= 0) X = 1.0/X
• The second form of where is block structured and has the form
        where (mask-expression)
           array-assignments
        elsewhere
           array-assignments
        end where
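A small worked example of the block form (a sketch, not from the references): take square roots of the nonnegative entries and zero out the rest:

        real A(10,20)

        where (A >= 0.0)
           A = sqrt(A)
        elsewhere
           A = 0.0
        end where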

29. Some F90 Array Intrinsic Functions
• The array intrinsic functions below assume a vector version of an array is formed using "column major" ordering.
• Some F90 array intrinsic functions:
  • RESHAPE(A, ...)        converts array A into a new array with a specified shape and "fill"
  • PACK(A, MASK, FILL)    forms a vector from the masked elements of A, using "fill" as needed
  • UNPACK(A, MASK, FILL)  replaces masked elements with elements from the FILL vector
  • MERGE(A, B, MASK)      returns an array of the masked entries of A and the unmasked entries of B
  • SPREAD(A, DIM, N)      replicates array A, using N copies to form a new array of one larger dimension
  • CSHIFT(A, SHIFT, DIM)  circular shift (rotation) of the elements of A along dimension DIM
  • EOSHIFT(A, ...)        elements of A are shifted off the end along the specified dimension; vacated positions are filled from either a specified scalar or an array of dimension one less than A
  • TRANSPOSE(A)           returns the transpose of array A
• Some array intrinsic functions that perform computation:
  • MAXVAL(A)          returns the maximum value in A
  • MINVAL(A)          returns the minimum value in A
  • SUM(A)             returns the sum of the elements of A
  • PRODUCT(A)         returns the product of the elements of A
  • MAXLOC(A)          indices of the maximum value in A
  • MINLOC(A)          indices of the minimum value in A
  • MATMUL(A, B)       matrix multiplication A*B
  • DOT_PRODUCT(A, B)  vector dot product
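A few of these intrinsics in action (a short sketch; the values in the comments follow from column-major ordering):

        real A(2,3), v(6)
        integer loc(2)

        A = reshape( (/1., 4., 2., 5., 3., 6./), (/2,3/) )
        v = reshape( A, (/6/) )     ! column-major flattening: 1 4 2 5 3 6
        loc = maxloc(A)             ! indices of the largest element: 2 3
        print *, sum(A), maxval(A)  ! prints 21.0 and 6.0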

30. The HPF Data Distribution Extension
• Reference: [19] Ian Foster, Designing and Building Parallel Programs (online copy), Chapter 7.
• F90 array expressions specify opportunities for parallel execution, but provide no control over how data are distributed so that communication can be minimized.
• HPF handles data distribution with three directives:
  • The PROCESSORS directive specifies the shape and size of an array of abstract processors.
  • The ALIGN directive is used to align elements of different arrays with each other, indicating that they should be distributed in the same manner.
  • The DISTRIBUTE directive is used to distribute an object (and all objects aligned with it) onto an abstract processor array.
• The data distribution directives can have a major impact on a program's performance (but not on the results computed), affecting
  • partitioning of data to processors
  • agglomeration (considering the value of combining tasks to produce fewer, larger tasks)
  • communications required to coordinate task execution
  • mapping of tasks to processors

31. HPF Data Distribution (Cont.)
• Data distribution directives are recommendations to an HPF compiler, not instructions.
  • The compiler can ignore them if it determines that this will improve performance.
• The PROCESSORS directive
  • Creates an arrangement of abstract processors and gives this arrangement a name.
  • Example: !HPF$ PROCESSORS P(4,8)
  • Normally one abstract processor is created for each physical processor.
    • There could be more abstract processors than physical ones.
    • However, HPF does not specify a way of mapping abstract to physical processors.
• The ALIGN directive
  • Specifies array elements that should, if possible, be mapped to the same processor.
  • Operations involving aligned data objects are likely to be more efficient, due to reduced communication costs when they reside on the same PE.
  • Example:
              real B(50), C(50)
        !HPF$ ALIGN C(:) WITH B(:)

32. HPF Data Distribution (Cont.)
• The ALIGN directive (cont.)
  • A "*" can be used to collapse dimensions (i.e., to match one element with many elements).
  • Considerable flexibility is allowed in specifying which array elements are to be aligned:
    • dummy variables can be used for dimensions
    • integer formulas can specify offsets (see the sketches after this slide)
  • An ALIGN statement can be used to specify that elements of an array should be replicated over certain processors.
    • Costly if replicated arrays are updated often.
    • Increases communication or redundant computation.
• The DISTRIBUTE directive
  • Indicates how data are to be distributed among processor memories.
  • Specifies, for each dimension of an array, one of three ways that the array elements will be distributed among the processors:
        *          no distribution
        BLOCK(n)   block distribution (default n = N/P)
        CYCLIC(n)  cyclic distribution (default n = 1)
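Hedged sketches of the ALIGN forms just listed; the arrays are illustrative, and the exact syntax should be checked against [19] or the HPF specification:

              real A(20), B(20)
        !HPF$ ALIGN A(i) WITH B(i+2)   ! dummy variable plus an integer offset

              real C(20,20), D(20)
        !HPF$ ALIGN C(i,*) WITH D(i)   ! "*" collapses: every C(i,j) maps with D(i)

              real E(20), F(20,20)
        !HPF$ ALIGN E(i) WITH F(i,*)   ! "*" on the target: E is replicated
                                       ! along the second dimension of F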

33. HPF Data Distribution (Cont.)
• The DISTRIBUTE directive (cont.)
  • Block distribution divides the items/indices in a dimension into equal-sized blocks of size N/P.
  • Cyclic distribution maps every Pth index to the same processor.
  • A DISTRIBUTE directive applies not only to the named array but also to any array that is aligned with it.
• The following directives therefore specify a mapping for both A and B (B because it is aligned with A; C would need its own ALIGN or DISTRIBUTE directive):
        !HPF$ PROCESSORS p(20)
              real A(100,100), B(100,100), C(100,100)
        !HPF$ ALIGN B(:,:) WITH A(:,:)
        !HPF$ DISTRIBUTE A(BLOCK,*) ONTO p

34. HPF Concurrency
• The F90 array assignment statements provide a convenient way of specifying data parallel operations.
• However, this does not cover all data parallel operations, as the array on the right-hand side must have the same shape as the one on the left-hand side.
• HPF provides two other constructs to exploit data parallelism, namely the FORALL statement and the INDEPENDENT directive.
• The FORALL statement
  • Allows more general assignments to sections of an array.
  • The general form is
        FORALL (triplet, ..., triplet, mask) assignment
  • Examples:
        FORALL (i=1:m, j=1:n) X(i,j) = i+j
        FORALL (i=1:n, j=1:n, i<j) Y(i,j) = 0.0
• The INDEPENDENT directive and do-loops
  • The INDEPENDENT directive can be used to assert that the iterations of a do-loop can be performed independently, that is,
    • they can be performed in any order, and
    • they can be performed concurrently.
  • The INDEPENDENT directive must immediately precede the do-loop it applies to, as in the sketch below.
  • Examples of independent and non-independent do-loops are given in [19, Foster, pp. 258-9].
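A minimal sketch of the directive in use; the permutation array perm is illustrative, and the programmer (not the compiler) is responsible for the assertion being true:

        !HPF$ INDEPENDENT
              do i = 1, n
                 A(perm(i)) = B(i)   ! independent only if perm has no repeats
              enddo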

35. Additional HPF Comments
• An HPF program typically consists of a sequence of calls to subroutines and functions.
• The data distribution that is best for a subroutine may be different from the data distribution used in the calling program.
• Two possible strategies for handling this situation are:
  • Specify a local distribution using DISTRIBUTE and ALIGN, even if this requires expensive data movement on entry.
    • The cost normally occurs on return as well.
  • Use whatever data distribution is used in the calling program, even if it is not optimal. This requires use of the INHERIT directive.
• Both F90 and HPF intrinsic functions (e.g., SUM, MAXVAL) combine data from entire arrays and involve considerable communication.
• Some other F90/HPF intrinsic functions, such as DOT_PRODUCT, involve communication cost only if their arguments are not aligned.
• Array operations involving the FORALL statement can result in communication if the computation of a value for an element A(i) requires data values that are not on the same processor (e.g., B(j)).
