Performance models of heterogeneous platforms and design of heterogeneous algorithms
Constant performance models of heterogeneous processors • Heterogeneity of processors • The processors run at different speeds • The simplest performance model • p, the number of the processors, • $S=\{s_1, s_2, \ldots, s_p\}$, the speeds of the processors (positive constants). • The speed • Absolute: the number of computational units performed by the processor per one time unit • Relative: the speed normalized so that $\sum_{i=1}^{p} s_i = 1$ • Some use the execution time per computational unit: $t_i = 1/s_i$ Heterogeneous and Grid Computing
Distribution of independent units of computation with the constant model • Given n independent equal units of computation, • Assign the units to p (p<n) physical processors $P_1, P_2, \ldots, P_p$ of respective speeds $s_1, s_2, \ldots, s_p$ so that the workload is best balanced • The speed is understood as the number of units of computation performed per one time unit • Simple but the most fundamental problem • Its solution is a basic building block in solutions of more complicated optimization problems
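Distribution of independent units of computation with the constant model (ctd) • Intuition: • The load $n_i$ of $P_i$ should be proportional to $s_i$ • The overall execution time is given by $\max_{1 \le i \le p} n_i / s_i$ • Algorithm 1: • Step 1: Initialization: Approximate the $n_i$ so that $n_i / s_i \approx \text{const}$ and $n_1 + n_2 + \ldots + n_p \le n$. Namely, we let $n_i = \lfloor \frac{s_i}{\sum_{j=1}^{p} s_j} \times n \rfloor$ for $1 \le i \le p$.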
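Distribution of independent units of computation with the constant model (ctd) • Algorithm 1 (ctd): • Step 2: Refining: Iteratively increment some $n_i$ until $n_1 + n_2 + \ldots + n_p = n$ as follows: while ( $n_1 + n_2 + \ldots + n_p < n$ ) { find $k \in \{1, \ldots, p\}$ such that $\frac{n_k + 1}{s_k} = \min_{1 \le i \le p} \frac{n_i + 1}{s_i}$; $n_k = n_k + 1$; }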
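The two steps of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the function name `distribute_units` is hypothetical, exact `Fraction` arithmetic is an added assumption to avoid floating-point rounding in the floor step, and ties in the refining step go to the lowest-numbered processor.

```python
from fractions import Fraction
from math import floor

def distribute_units(n, speeds):
    """Split n equal units over processors with constant speeds (naive O(p^2) refinement)."""
    s = [Fraction(x) for x in speeds]
    total = sum(s)
    # Step 1 (initialization): n_i = floor(s_i / sum(s) * n), so that sum(n_i) <= n.
    counts = [floor(si * n / total) for si in s]
    # Step 2 (refining): give one more unit to the processor that would finish
    # soonest after receiving it, until all n units are assigned.
    while sum(counts) < n:
        k = min(range(len(s)), key=lambda i: (counts[i] + 1) / s[i])
        counts[k] += 1
    return counts

print(distribute_units(10, [1, 2, 3]))
```

With 10 units and speeds 1, 2, 3, the initialization gives 1, 3, 5 and the single refining step adds one unit to the first processor.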
Distribution of independent units of computation with the constant model (ctd) • Proposition 1: Algorithm 1 gives the optimal solution. • Proposition 2: The complexity of Algorithm 1 is $O(p^2)$. • Proposition 3: The complexity of Algorithm 1 can be reduced to $O(p \times \log p)$ using ad hoc data structures. • Example of application: • 1D block multiplication of two dense square n×n matrices on p heterogeneous processors
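The slides do not say which data structure Proposition 3 has in mind. One plausible choice, assumed here, is a binary min-heap keyed by the cost $(n_i + 1)/s_i$: at most p refining steps, each costing O(log p). The function name `distribute_units_heap` is hypothetical.

```python
import heapq
from fractions import Fraction
from math import floor

def distribute_units_heap(n, speeds):
    """Same distribution as the naive refinement, using a min-heap of candidate costs."""
    s = [Fraction(x) for x in speeds]
    total = sum(s)
    counts = [floor(si * n / total) for si in s]
    # Heap entries are (cost of giving processor i one more unit, i).
    heap = [((counts[i] + 1) / s[i], i) for i in range(len(s))]
    heapq.heapify(heap)
    # Fewer than p units remain after the floor step, so this loop runs < p times.
    for _ in range(n - sum(counts)):
        _, k = heapq.heappop(heap)
        counts[k] += 1
        heapq.heappush(heap, ((counts[k] + 1) / s[k], k))
    return counts

print(distribute_units_heap(10, [1, 2, 3]))
```

Because heap entries compare as (cost, index) pairs, ties are broken toward the lowest processor index.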
Distribution of independent units of computation with the constant model (ctd) • Matrices A, B, and C partitioned in horizontal slices • One-to-one mapping between the slices and the processors • Optimal partitioning of the matrices is the key step of the algorithm
Distribution of independent units of computation with the constant model (ctd) • Partitioning problem • One unit of computation: the multiplication of one row of matrix A by matrix B • Its size is constant during the execution of the algorithm • $n^2$ multiplications and $n \times (n-1)$ additions • Processor $P_i$ performs $n_i$ units • $n_i$ is the number of rows in its slice of C ($\sum_{i=1}^{p} n_i = n$) • Algorithm 1 solves the problem
Distribution of independent units of computation with the constant model (ctd) • Revisiting Algorithm 1: If n is big enough and p<<n, • Many straightforward refining algorithms will return a satisfactory approximate solution • Round-robin incrementing, etc. • Asymptotically optimal solutions • The larger the matrix size, the closer the solutions to the optimal • Advantages of such modifications of Algorithm 1 • The complexity can be reduced to O(p). • We can use the relative speed • The absolute speed is only needed at the refining step
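Round-robin incrementing, mentioned above as one of the simpler refining strategies, might look like the sketch below. The function name is hypothetical, and exact `Fraction` arithmetic is an added assumption. Fewer than p units remain after the floor step, so the refinement is O(p), and only relative speeds are needed.

```python
from fractions import Fraction
from math import floor

def distribute_units_round_robin(n, relative_speeds):
    """Approximate distribution: floor step, then hand out leftovers one per processor."""
    s = [Fraction(x) for x in relative_speeds]
    total = sum(s)
    counts = [floor(si * n / total) for si in s]
    leftover = n - sum(counts)  # always smaller than the number of processors
    for i in range(leftover):
        counts[i % len(counts)] += 1
    return counts

print(distribute_units_round_robin(11, [1, 1, 1]))
```

The result is not guaranteed to be optimal, but each processor's load differs from the ideal proportional share by less than one unit, which is negligible when n is much larger than p.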
Distribution of independent units of computation with the constant model (ctd) • Using relative speeds makes the application (implementation) of the algorithms easier • In the matrix multiplication example • The size of the computation unit increases with the increase of n • => the absolute speed will decrease with the increase of n • => the application programmer has to obtain the speed for each value of n • The relative speed often does not depend on n for a wide range of n • => the relative speed can be obtained once for some particular n and used for other values of n
Data distribution problems with constant models of heterogeneous processors • Typical design of heterogeneous parallel algorithms • Problem of distribution of computations in proportion to the speed of processors => Problem of partitioning of some mathematical objects (sets, matrices, graphs, etc.) • Typical partitioning problem in a generic form • Given a set of p processors $P_1, P_2, \ldots, P_p$, the speed of each of which is characterized by a positive constant, $s_i$, • Partition a mathematical object of size n into p sub-objects of the same type so that • There is a one-to-one mapping between the partitions and the processors
Data distribution problems with constant models of heterogeneous processors (ctd) • Typical partitioning problem in a generic form (ctd) • The size of each partition is approximately proportional to the speed of its processor, $n_i / s_i \approx \text{const}$ ($n_i$ is the size of the i-th partition) • The volume of computation is assumed proportional to the size of the processed mathematical object • The notion of approximate proportionality is supposed to be defined for each problem (otherwise, any partitioning is ok) • The partitioning satisfies some additional restrictions on the relationship between the partitions • For example, sub-matrices may be required to form columns
Data distribution problems with constant models of heterogeneous processors (ctd) • Typical partitioning problem in a generic form (ctd) • The partitioning minimizes some functional(s) • The functional estimates each partitioning • For example, the sum of half-perimeters of sub-matrices may estimate the total volume of communication for some algorithms • The problem of optimal distribution of independent equal computational units can be formulated as an instantiation of the generic partitioning problem.
Data distribution problems with constant models of heterogeneous processors (ctd) • Given a set of p processors $P_1, P_2, \ldots, P_p$, the speed of each of which is characterized by a positive constant, $s_i$, • Partition a set of n elements into p sub-sets so that • There is a one-to-one mapping between the partitions and the processors. • The number of elements $n_i$ in each partition is approximately proportional to $s_i$, the speed of the processor owning the partition, so that $n_i / s_i \approx \text{const}$ • The partitioning minimizes $\max_{1 \le i \le p} n_i / s_i$
Data distribution problems with constant models of heterogeneous processors (ctd) • Another important set partitioning problem • Given a set of p processors $P_1, P_2, \ldots, P_p$, the speed of each of which is characterized by a positive constant, $s_i$, • Given a set of n unequal elements, the weight of each of which is characterized by a positive constant • Partition the set into p sub-sets so that • There is a one-to-one mapping between the partitions and the processors • The total weight of each partition, $w_i$, is approximately proportional to $s_i$, the speed of the processor owning the partition • The partitioning minimizes $\max_{1 \le i \le p} w_i / s_i$
Data distribution problems with constant models of heterogeneous processors (ctd) • Solution of the latter problem • Would contribute to optimal mapping or scheduling of tasks on heterogeneous platforms • Known to be NP-complete • An efficient algorithm solving this problem is not likely to exist • Design of efficient algorithms giving sub-optimal solutions • Active research area • Main applications are in scheduling components of operating environments for heterogeneous platforms • Not in the design of parallel algorithms for high performance heterogeneous computing
Partitioning well-ordered sets with constant models of processors • Problem of partitioning of a well-ordered set • Can occur during design of many parallel algorithms • Different algorithms => different formulations • A single particular problem has been formulated and solved so far • Occurs when designing parallel algorithms of dense matrix factorization on heterogeneous processors with constant relative speeds • By modification of homogeneous prototypes • The modification is in data distribution
Partitioning well-ordered sets with constant models of processors (ctd) • LU factorization of a dense matrix => partitioning of a well-ordered set with a constant model • Parallel LU factorization on homogeneous processors (see handouts)
Partitioning well-ordered sets with constant models of processors (ctd) • Modifications of this parallel LU algorithm for 1-D arrangement of heterogeneous processors • Unevenly distribute column panels over the processors • The corresponding partitioning problem • Given a dense (n×b)×(n×b) matrix A • Assign n columns of size n×b of the matrix A to p (n>p) heterogeneous processors $P_1, P_2, \ldots, P_p$ of relative speeds $S=\{s_1, s_2, \ldots, s_p\}$, so that the workload at each step of the parallel LU factorization is best balanced ($\sum_{i=1}^{p} s_i = 1$)
Partitioning well-ordered sets with constant models of processors (ctd) • Relative speeds • Should be accurately measurable and constant • The relative speed $s_i$ of processor $P_i$ is obtained by normalization of its (absolute) speed $v_i$: $s_i = v_i / \sum_{j=1}^{p} v_j$ • $v_i$ is the number of column panels updated by $P_i$ per one time unit • $v_i$ will increase with each next step of the LU factorization • $s_i$ are assumed to be constant • Optimal solution minimizes $\max_{1 \le i \le p} n_i^{(k)} / s_i$ for each step k of the LU factorization, where $n_i^{(k)}$ is the number of column panels updated by $P_i$ at step k
Partitioning well-ordered sets with constant models of processors (ctd) • Why this formulation? • Should minimize the overall execution time • If a solution minimizes the cost of each step k, it will also minimize the sum of the per-step costs over all steps • We will see that such an allocation always exists
Partitioning well-ordered sets with constant models of processors (ctd) • Proposition 4. The solution of the problem of partitioning of a well-ordered set always exists. • Proof. The following 2 algorithms always return an optimal solution. • Dynamic Programming (DP) algorithm. • Input: p, n, $S=\{s_1, s_2, \ldots, s_p\}$ ($\sum_{i=1}^{p} s_i = 1$) • Output: • c, an integer array of size p, the i-th element of which contains the number of elements assigned to processor i, • d, an integer array of size n, the i-th element of which contains the processor to which the element $a_i$ is assigned
Partitioning well-ordered sets with constant models of processors (ctd) • DP algorithm:
    (c[1],…,c[p]) = (0,…,0);
    (d[1],…,d[n]) = (0,…,0);
    for (k=1; k<=n; k=k+1) {
        Costmin = ∞;
        for (i=1; i<=p; i=i+1) {
            Cost = (c[i]+1)/s[i];
            if (Cost < Costmin) { Costmin = Cost; j = i; }
        }
        d[n-k+1] = j;
        c[j] = c[j]+1;
    }
Partitioning well-ordered sets with constant models of processors (ctd) • The complexity of the DP algorithm is O(p×n). • The DP algorithm returns an optimal allocation • Not a trivial fact • At each iteration the next element of the set is allocated to one of the processors (minimizing the cost of the allocation) • There may be several processors with the same cost • The selection between them is random • Not obvious that allocation of the element to any of them will result in a globally optimal allocation • For this partitioning problem, this has been proved.
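A direct Python transcription of the DP pseudocode is sketched below. The function name `dp_partition` is hypothetical, indices are zero-based (processor 0 is $P_1$, `d[0]` is the owner of $a_1$), and ties are resolved toward the lowest-numbered processor rather than at random. The slide states that any tie-breaking choice yields an optimal allocation.

```python
def dp_partition(n, speeds):
    """Assign elements a_n, a_(n-1), ..., a_1 one at a time to the cheapest processor."""
    p = len(speeds)
    c = [0] * p  # c[i]: number of elements assigned to processor i
    d = [0] * n  # d[i]: processor that owns element a_(i+1)
    for k in range(1, n + 1):
        # Processor whose cost (c[i] + 1) / s[i] after one more element is smallest.
        j = min(range(p), key=lambda i: (c[i] + 1) / speeds[i])
        d[n - k] = j  # zero-based position of a_(n-k+1)
        c[j] += 1
    return c, d

counts, owners = dp_partition(4, [2.0, 1.0, 1.0])
print(counts, owners)
```

The elements are filled in from the end of the set, matching the assignment to d[n-k+1] in the pseudocode, so the last elements of the set go to the fastest processor first.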
Partitioning well-ordered sets with constant models of processors (ctd) • Reverse algorithm • Generates the optimal distribution of the elements of the set over p heterogeneous processors for each k=1,…,n • Allocates the elements to the processors by comparing these distributions • The algorithm extracts the optimal allocation from a sequence of optimal distributions of the elements for successive subsets
Partitioning well-ordered sets with constant models of processors (ctd) • Reverse algorithm • Input: p, n, $S=\{s_1, s_2, \ldots, s_p\}$ ($\sum_{i=1}^{p} s_i = 1$) • Output: d, an integer array of size n, the i-th element of which contains the processor to which the element $a_i$ is assigned • The summary of the Reverse algorithm (see handouts) • HSP(p, n, S) returns the optimal distribution of n elements over p heterogeneous processors of the relative speeds $S=\{s_1, s_2, \ldots, s_p\}$ by applying Algorithm 1. • First, we find the optimal distributions for A(1) and A(2) • If they differ only for one processor, we assign $a_1$ to this processor
Partitioning well-ordered sets with constant models of processors (ctd) • Reverse algorithm (ctd) • If they differ for more than one processor, we postpone allocation of $a_1$ and find the optimal distribution for A(3) and compare it with the optimal distribution for A(1). • If the number of elements distributed to each processor for A(3) does not exceed that for A(1), we allocate $a_1$ and $a_2$ so that the distribution for each next subset is obtained from the distribution for the immediate previous subset by addition of one more element to one of the processors. • If not, we delay allocation of the first two elements and find the optimal distribution for A(4), and so on.
Partitioning well-ordered sets with constant models of processors (ctd) • The Reverse algorithm returns an optimal solution. • If the algorithm assigns element $a_k$ at each iteration, then the resulting allocation will be optimal by design • If a group of elements is assigned, it can be proved that each incremental distribution will be optimal. • The complexity of the Reverse algorithm is $O(p \times n \times \log p)$ • At each iteration, we apply the HSP ($O(p \times \log p)$). • Testing the first condition is of complexity O(p). • Testing the second condition is of complexity O(p). • The total number of iterations of the inner loop cannot exceed the total number of allocations of elements, n.
Partitioning well-ordered sets with constant models of processors (ctd) • The complexity of the DP algorithm is a bit better. • The Reverse algorithm is better suited to extension to more complicated, non-constant, performance models of heterogeneous processors. • Other known algorithms of partitioning of well-ordered sets do not guarantee the return of an optimal solution.
Partitioning matrices with constant models of heterogeneous processors • Matrices • Most widely used mathematical objects in scientific computing • Studied data partitioning problems mainly deal with matrices • Matrix partitioning in one dimension over a 1D arrangement of processors • Often reduced to partitioning sets or well-ordered sets • Design of algorithms often results in matrix partitioning problems not imposing the restriction of partitioning in one dimension • E.g., in parallel linear algebra for heterogeneous platforms • We will use matrix multiplication • A simple but very important linear algebra kernel
Partitioning matrices with constant models of heterogeneous processors (ctd) • A heterogeneous matrix multiplication algorithm • A modification of some homogeneous one • Most often, of the 2D block cyclic ScaLAPACK algorithm
Partitioning matrices with constant models of heterogeneous processors (ctd) • 2D block cyclic ScaLAPACK MM algorithm (ctd)
Partitioning matrices with constant models of heterogeneous processors (ctd) • 2D block cyclic ScaLAPACK MM algorithm (ctd) • The matrices are identically partitioned into rectangular generalized blocks of the size (p×r)×(q×r) • Each generalized block forms a 2D p×q grid of r×r blocks • There is a one-to-one mapping between this grid of blocks and the p×q processor grid • At each step of the algorithm • Each processor not owning the pivot row and column receives horizontally (n/p)×r elements of matrix A and vertically (n/q)×r elements of matrix B • => in total, $(n/p + n/q) \times r$ elements, i.e., ~ the half-perimeter of the rectangle area allocated to the processor
Partitioning matrices with constant models of heterogeneous processors (ctd) • General design of heterogeneous modifications • Matrices A, B, and C are identically partitioned into equal rectangular generalized blocks • The generalized blocks are identically partitioned into rectangles so that • There is one-to-one mapping between the rectangles and the processors • The area of each rectangle is (approximately) proportional to the speed of the processor which has the rectangle • Then, the algorithm follows the steps of its homogeneous prototype
Partitioning matrices with constant models of heterogeneous processors (ctd) • Why partition the GBs in proportion to the speed • At each step, updating one r×r block of matrix C needs the same amount of computation for all the blocks • => the load will be perfectly balanced if the number of blocks updated by each processor is proportional to its speed • The number = $n_i \times N_{GB}$ • $n_i$ = the area of the GB partition allocated to the i-th processor (measured in r×r blocks) • => if the area of each GB partition is ~ to the speed of the owning processor, their load will be perfectly balanced
Partitioning matrices with constant models of heterogeneous processors (ctd) • Two new groups of algorithmic parameters • The size l×m of a generalized block • The accuracy of load-balancing • The level of potential parallelism in execution of successive steps • Optimization of the parameter: an open problem • The size and the location of the rectangular partition of the generalized block allocated to each processor • Optimal partitioning of a GB: a well studied problem • Studies have been interested in asymptotically optimal solutions • The problem could be formulated in terms of the relative speed
Partitioning matrices with constant models of heterogeneous processors (ctd) • A generalized block from the partitioning point of view • An integer-valued rectangle • If we need an asymptotically optimal solution, the problem can be reduced to a geometrical problem of optimal partitioning of a real-valued rectangle • The asymptotically optimal integer-valued solution can be obtained by rounding off an optimal real-valued solution of the geometrical partitioning problem
Partitioning matrices with constant models of heterogeneous processors (ctd) • The general geometrical partitioning problem • Given a set of p processors $P_1, P_2, \ldots, P_p$, the relative speed of each of which is characterized by a positive constant, $s_i$ ($\sum_{i=1}^{p} s_i = 1$), • Partition a unit square into p rectangles so that • There is a one-to-one mapping between the rectangles and the processors • The area of the rectangle allocated to processor $P_i$ is equal to $s_i$ • The partitioning minimizes $\sum_{i=1}^{p} (w_i + h_i)$, where $w_i$ is the width and $h_i$ is the height of the rectangle allocated to processor $P_i$
Partitioning matrices with constant models of heterogeneous processors (ctd) • Motivation behind the formulation • Proportionality of the areas to the speeds • Balancing the load of the processors • Minimization of the sum of half-perimeters • Multiple partitionings can balance the load • Minimizes the total volume of communications • At each step of MM, each receiving processor receives data ~ the half-perimeter of its rectangle • => In total, the communicated data ~ $\sum_{i=1}^{p} (w_i + h_i)$
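A small numeric illustration of why the sum of half-perimeters matters: two partitionings of the unit square between four processors of equal relative speed 0.25 balance the load equally well but differ in communication volume. The helper name `sum_half_perimeters` and the two example layouts are illustrative assumptions, not taken from the slides.

```python
def sum_half_perimeters(rects):
    """rects: list of (width, height) pairs, one rectangle per processor."""
    return sum(w + h for w, h in rects)

# Four equal-speed processors, each owning area 0.25 of the unit square.
vertical_strips = [(0.25, 1.0)] * 4   # one column of width 0.25 per processor
two_by_two_grid = [(0.5, 0.5)] * 4    # 2x2 arrangement of squares

print(sum_half_perimeters(vertical_strips))
print(sum_half_perimeters(two_by_two_grid))
```

The strips give a total of 5.0 and the 2×2 grid gives 4.0, so the grid is the better of the two under this cost model even though both are perfectly load-balanced.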
Partitioning matrices with constant models of heterogeneous processors (ctd) • Motivation behind the formulation (ctd) • Option: minimizing the maximal half-perimeter • Parallel communications • The use of a unit square instead of a rectangle • No loss of generality • The optimal solution for an arbitrary rectangle is obtained by straightforward scaling of that for the unit square • Proposition. The general geometrical partitioning problem is NP-complete.
Partitioning matrices with constant models of heterogeneous processors (ctd) • Restricted problems having polynomial solutions • Column-based • Grid-based • Column-based partitioning • Rectangles make up columns • Has an optimal solution of complexity $O(p^3)$
Partitioning matrices with constant models of heterogeneous processors (ctd) • The algorithm is based on two observations • There always exists an optimal column-based partitioning with rectangles going in the non-increasing order of areas • The first k columns of an optimal partitioning of a unit square will be an optimal partitioning of the corresponding rectangle of height 1, whose width is the total area of the rectangles in those columns
Partitioning matrices with constant models of heterogeneous processors (ctd) • Algorithm 2: Optimal column-based partitioning of a unit square between p heterogeneous processors: • First, the processors are re-indexed in the non-increasing order of their speeds, $s_1 \ge s_2 \ge \ldots \ge s_p$. • The algorithm iteratively builds the optimal c-column partitioning P(c,q) of a rectangle of height 1 and width $\sum_{j=1}^{q} s_j$ for all c=1,…,p and q=c,…,p: • P(1,q) is trivial
Partitioning matrices with constant models of heterogeneous processors (ctd) • Algorithm 2: (ctd) • For c>1, P(c,q) is built in two steps: • First, (q-c+1) candidate partitionings $\{P_j(c,q)\}$ (j=1,…,q-c+1) are constructed such that $P_j(c,q)$ is obtained by combining the partitioning P(c-1,q-j) with the straightforward partitioning of the last, c-th, column of the width $\sum_{i=q-j+1}^{q} s_i$ into j rectangles of the corresponding areas $s_{q-j+1} \ge s_{q-j+2} \ge \ldots \ge s_q$. • Then, $P(c,q) = P_k(c,q)$ where $P_k(c,q) \in \{P_j(c,q)\}$ and minimizes the sum of half-perimeters of rectangles. • The optimal partitioning will be a partitioning from the set {P(c,p)} (c=1,…,p) that minimizes the sum of half-perimeters of rectangles
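The recurrence behind Algorithm 2 can be sketched as a small dynamic program. It uses a fact that follows from the geometry: a column of width w and height 1 split into j rectangles contributes j×w + 1 to the sum of half-perimeters, since the heights in a column add up to 1. The function name `column_based_partition`, the normalization of the input speeds, and the return format (cost plus the number of processors in each column) are illustrative assumptions.

```python
def column_based_partition(speeds):
    """Return (minimal sum of half-perimeters, processors per column) for sorted speeds."""
    s = sorted(speeds, reverse=True)      # non-increasing order of speeds
    total = sum(s)
    s = [x / total for x in s]            # relative speeds summing to 1
    p = len(s)
    prefix = [0.0]
    for x in s:
        prefix.append(prefix[-1] + x)     # prefix[q] = s_1 + ... + s_q

    inf = float("inf")
    cost = [[inf] * (p + 1) for _ in range(p + 1)]  # cost[c][q] for P(c, q)
    last = [[0] * (p + 1) for _ in range(p + 1)]    # size of the last column of P(c, q)
    for q in range(1, p + 1):             # P(1, q): a single column of q rectangles
        cost[1][q] = q * prefix[q] + 1.0
        last[1][q] = q
    for c in range(2, p + 1):
        for q in range(c, p + 1):
            for j in range(1, q - c + 2):  # j rectangles in the last column
                width = prefix[q] - prefix[q - j]
                cand = cost[c - 1][q - j] + j * width + 1.0
                if cand < cost[c][q]:
                    cost[c][q], last[c][q] = cand, j

    best_c = min(range(1, p + 1), key=lambda c: cost[c][p])
    sizes, c, q = [], best_c, p
    while c >= 1:                          # walk back through the chosen columns
        sizes.append(last[c][q])
        q -= last[c][q]
        c -= 1
    return cost[best_c][p], sizes[::-1]

print(column_based_partition([1, 1, 1, 1]))
```

The three nested loops over c, q, and j give the $O(p^3)$ complexity quoted for the column-based problem. For four equal speeds the sketch selects two columns of two rectangles each.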
Partitioning matrices with constant models of heterogeneous processors (ctd) • A more restricted form of the column-based partitioning problem • The processors are already arranged into a set of columns • Algorithm 3: Optimal partitioning of a unit square between p heterogeneous processors arranged into c columns, each of which is made of $r_j$ processors, j=1,…,c: • Let the relative speed of the i-th processor from the j-th column, $P_{ij}$, be $s_{ij}$. • Then, we first partition the unit square into c vertical rectangular slices such that the width of the j-th slice is proportional to $\sum_{i=1}^{r_j} s_{ij}$, the total speed of the j-th processor column
Partitioning matrices with constant models of heterogeneous processors (ctd) • Algorithm 3 (ctd): • Second, each vertical slice is partitioned independently into rectangles in proportion with the speed of the processors in the corresponding processor column. • Algorithm 3 is of linear complexity.
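Both steps of Algorithm 3 fit in a few lines. In this sketch the function name `partition_fixed_columns` and the input format (one list of speeds per processor column) are assumptions made for illustration. Each slice width is the column's share of the total speed, and each rectangle height is the processor's share of its column's speed.

```python
def partition_fixed_columns(column_speeds):
    """column_speeds[j][i] is the speed of processor i in column j.

    Returns one (slice_width, [rectangle_heights]) pair per column of the unit square.
    """
    total = sum(sum(col) for col in column_speeds)
    layout = []
    for col in column_speeds:
        col_total = sum(col)
        width = col_total / total                 # step 1: vertical slices
        heights = [s / col_total for s in col]    # step 2: split each slice
        layout.append((width, heights))
    return layout

print(partition_fixed_columns([[0.25, 0.25], [0.5]]))
```

Every speed is visited a constant number of times, consistent with the linear complexity stated on the slide. The area width × height of each rectangle equals the processor's relative speed.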
Partitioning matrices with constant models of heterogeneous processors (ctd) • Grid-based partitioning problem • The heterogeneous processors form a two-dimensional grid - There exist p and q such that any vertical line crossing the unit square will pass through exactly p rectangles and any horizontal line crossing the square will pass through exactly q rectangles