
Processes Distribution of Homogeneous Parallel Linear Algebra Routines on Heterogeneous Clusters

Javier Cuenca, Luis Pedro García, Domingo Giménez (Scientific Computation Research Group, University of Murcia, Spain) and Jack Dongarra (Innovative Computing Laboratory, University of Tennessee, USA)



  1. Processes Distribution of Homogeneous Parallel Linear Algebra Routines on Heterogeneous Clusters
  Javier Cuenca, Luis Pedro García, Domingo Giménez, Scientific Computation Research Group, University of Murcia, Spain
  Jack Dongarra, Innovative Computing Laboratory, University of Tennessee, USA

  2. Introduction
  • Automatically Optimised Linear Algebra Software
  • Objective:
    • Software capable of tuning itself to the execution environment
  • Motivation:
    • Non-expert users have to take decisions about the computation
    • Software should adapt to the continuous evolution of hardware
    • Developing efficient code by hand consumes a large quantity of resources
    • The computational capabilities of systems vary widely
  • Some examples of auto-tuning software: ATLAS, LFC, FFTW, I-LIB, FIBER, mpC, BeBOP, FLAME, ...

  3. Automatic Optimisation on Heterogeneous Parallel Systems
  • Two possibilities on heterogeneous systems:
    • HoHe: heterogeneous algorithms (heterogeneous distribution of data)
    • HeHo: homogeneous algorithms with a heterogeneous assignment of processes:
      • A variable number of processes is assigned to each processor, depending on the relative speeds (see the sketch below)
      • A mapping processes → processors must be made, without spending a large execution time on taking the decision
  • Theoretical models: parameters which represent the characteristics of the system
  • The general assignment problem is NP-hard → use of heuristic approximations
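To make the "relative speeds" idea concrete, here is a minimal Python sketch (function and variable names are illustrative, not from the talk) that derives a per-processor process count d_i proportional to measured speeds. This naive proportional rule is only a baseline: as the slide notes, the general assignment problem is NP-hard, and the methodology below searches an assignment tree instead.

```python
# Illustrative sketch: speed-proportional process counts (a naive baseline,
# not the tree search actually used by the methodology).

def proportional_assignment(speeds, total_processes):
    """Distribute total_processes among processors proportionally to speed."""
    total_speed = sum(speeds)
    # Floored share for each processor.
    d = [int(total_processes * s / total_speed) for s in speeds]
    # Hand out the processes lost to flooring, fastest processors first.
    remaining = total_processes - sum(d)
    for i in sorted(range(len(speeds)), key=lambda i: speeds[i], reverse=True):
        if remaining == 0:
            break
        d[i] += 1
        remaining -= 1
    return d

# Example: four processors, the first twice as fast as the rest.
print(proportional_assignment([2.0, 1.0, 1.0, 1.0], 10))  # [4, 2, 2, 2]
```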

  4. Our previous HoHo methodology
  • Routine model:
    • n: problem size
    • SP: system parameters
      • Computation and communication characteristics of the system
    • AP: algorithm parameters
      • Block size, number of processors to use, logical configurations of the processors, ... (with one process per processor)
    • Values are chosen when the routine begins to run

  5. Our previous HoHo methodology → our HeHo methodology
  • Modifications in the routine model:
    • New APs:
      • Number of processes to generate
      • Mapping of processes to processors
    • Changes in the SP values (see the sketch below):
      • More than one process per processor: each SP_i on processor i is taken as d_i times higher, where d_i is the number of processes assigned to processor i
      • Implicit synchronization: the global value of each SP_i is taken as the maximum value over all the processors
      • The slowest process forces the other ones to reduce their speed, since they wait for it at the different synchronization points of the routine
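A minimal sketch of the two SP adjustments just listed, assuming illustrative names and hypothetical measured values; only the rules themselves (scale SP_i by d_i, then take the maximum over processors) come from the slide:

```python
# Sketch: adjust a system parameter when processor i runs d_i processes.

def adjusted_sp(sp_per_processor, d):
    """sp_per_processor[i]: parameter value measured with one process on
    processor i; d[i]: number of processes assigned to processor i."""
    # Each of the d_i processes on processor i runs roughly d_i times slower.
    scaled = [sp * di for sp, di in zip(sp_per_processor, d) if di > 0]
    # Implicit synchronization: the slowest process sets the global value.
    return max(scaled)

# Hypothetical k3_DGEMM values (time per flop) on three processors:
k3_dgemm = [0.9e-9, 1.2e-9, 2.0e-9]
print(adjusted_sp(k3_dgemm, d=[2, 1, 0]))  # 1.8e-09: the loaded node dominates
```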

  6. Our HeHo methodology: an example of routine model
  • LU factorisation, parallel version. Model (see the sketch below):
    • SP: system parameters
      • k3_DGEMM, k3_DTRSM, k2_DGETF2
      • ts, tw
    • AP: algorithm parameters
      • b: block size
      • P: number of processors
      • p: number of processes
      • Mapping of the p processes onto the P processors
      • p = r × c: logical configuration of the processes as a 2D mesh
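The model formula itself appeared as an image on the slide and is not reproduced in this transcript. The sketch below only shows how an estimated execution time (EET) of this kind could combine the listed SPs and APs; the shapes of the individual terms are generic assumptions for blocked parallel LU, not the authors' exact formula:

```python
import math

# Illustrative EET for blocked parallel LU on an r x c process mesh.
# Term shapes are assumptions; the parameter roles follow the slide.

def eet_lu(n, b, r, c, k3_dgemm, k3_dtrsm, k2_dgetf2, ts, tw):
    p = r * c
    steps = math.ceil(n / b)  # number of block iterations
    # Arithmetic: the dominant 2n^3/3 matrix-matrix flops shared by the
    # p processes, plus lower-order triangular-solve and panel terms.
    t_arith = ((2 * n**3) / (3 * p)) * k3_dgemm \
            + ((n**2 * b) / (2 * c)) * k3_dtrsm \
            + (n * b**2 / 3) * k2_dgetf2
    # Communication: broadcasts along mesh rows and columns at each step,
    # with start-up cost ts and per-word cost tw.
    t_comm = steps * (r + c) * ts + (n**2 / math.sqrt(p)) * tw
    return t_arith + t_comm

# Hypothetical use: pick the AP values minimising the model for n = 2048.
candidates = [(b, r, c) for b in (32, 64) for r in (1, 2) for c in (1, 2)]
best = min(candidates,
           key=lambda ap: eet_lu(2048, *ap, 1e-9, 1.2e-9, 2e-9, 1e-4, 1e-7))
print(best)
```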

  7. Our HeHo methodology: an example of routine model
  • Platforms:
    • SUNEt:
      • Five SUN Ultra 1
      • One SUN Ultra 5
      • Interconnection network: Ethernet
    • TORC (Innovative Computing Laboratory):
      • 21 nodes of different types
        • Dual- and single-processor nodes
        • Pentium II, III and 4
        • AMD Athlon
      • Interconnection networks:
        • FastEthernet
        • Giganet
        • Myrinet

  8. Our HeHo methodology: an example of routine model
  • Figure: theoretical vs. experimental time on SUNEt (n = 2048)

  9. Our HeHo methodology: an example of routine model
  • Figure: theoretical vs. experimental time on TORC (n = 4096)

  10. Our HeHo methodology
  • Our approach: an assignment tree (see the enumeration sketch below)
    • A limit on the height of the tree (the number of processes) is necessary
    • Each node represents a possible solution: a mapping processes → processors
    • The other APs (block size, logical topology) are chosen at each node
  • Tree diagram: with P processors, the root branches to processors 1, 2, ..., P, and a node for processor i branches to processors i, ..., P, down to a depth of p processes
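A small sketch of how this tree can be enumerated: requiring non-decreasing processor indices (combinations with repetition) yields every assignment of up to p processes exactly once, which is exactly the combinatorial tree of the diagram. Names are illustrative:

```python
from itertools import combinations_with_replacement

def assignments(num_processors, max_processes):
    """Yield every node of the assignment tree up to the height limit."""
    for p in range(1, max_processes + 1):
        for combo in combinations_with_replacement(range(num_processors), p):
            yield combo  # e.g. (0, 0, 2): two processes on proc 0, one on 2

# Example: P = 3 processors, tree height limited to p = 2 processes.
for node in assignments(3, 2):
    print(node)
```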

  11. Our HeHo methodology
  • For each node:
    • EET(node): Estimated Execution Time
    • Optimization problem: finding the node with the lowest EET
    • LET(node): Lowest Execution Time
    • GET(node): Greatest Execution Time
    • LET and GET are lower and upper bounds on the optimum solution of the subtree below the node
  • LET and GET limit the number of nodes evaluated (see the sketch below):
    • MEET = min{GET(node) : node already evaluated}
    • If LET(node) > MEET → do not work below this node
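A sketch of the resulting branch-and-bound scheme, directly following the rule on the slide: MEET keeps the minimum GET over the nodes evaluated so far, and a subtree is discarded when its LET exceeds MEET. The functions eet, let_bound, get_bound and children stand in for the model-based estimators and tree traversal described on the next slides:

```python
def search(root, children, eet, let_bound, get_bound):
    """Find the assignment with the lowest estimated execution time."""
    best_node, best_eet = root, eet(root)
    meet = float("inf")          # MEET: min GET over evaluated nodes
    stack = [root]
    while stack:
        node = stack.pop()
        t = eet(node)
        if t < best_eet:
            best_node, best_eet = node, t
        meet = min(meet, get_bound(node))   # tighten the upper bound
        for child in children(node):
            if let_bound(child) <= meet:    # LET > MEET: prune the subtree
                stack.append(child)
    return best_node, best_eet
```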

  12. Our HeHo methodology
  • Automatic searching strategies in the assignment tree:
    • Method 1:
      • Backtracking
      • GET = EET
    • Method 2:
      • Backtracking
      • GET obtained with a greedy approach
    • Method 3:
      • Backtracking
      • GET obtained with a greedy approach
      • LET obtained with a greedy approach
    • Method 4:
      • Greedy method on the current assignment tree (a combinatorial tree with repetitions)
    • Method 5:
      • Greedy method on a permutational tree with repetitions

  13. Our HeHo methodology
  • Automatic searching strategies in the assignment tree:
    • Method 1:
      • Backtracking
      • GET = EET
      • LET = LET_ari + LET_com
        • LET_ari: the sequential time divided by the maximum achievable speed-up when using all the processors not yet discarded (see the sketch below)
        • LET_com: assumes the best logical topology of processes that can be obtained from this node
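For instance, LET_ari can be computed as in the following sketch (illustrative names; the slide fixes only the ratio "sequential time over maximum achievable speed-up with the processors not yet discarded"):

```python
def let_arith(seq_time, speeds, discarded):
    """Lower bound on the arithmetic time below a node.

    seq_time: sequential execution time on the reference processor whose
    speed defines the unit; speeds: relative speeds of all processors;
    discarded: indices of processors no longer usable below this node."""
    available = (s for i, s in enumerate(speeds) if i not in discarded)
    max_speedup = sum(available)  # ideal case: all remaining capacity used
    return seq_time / max_speedup

print(let_arith(100.0, [1.0, 1.0, 0.5, 0.5], discarded={3}))  # 40.0
```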

  14. Our HeHo methodology
  • Automatic searching strategies in the assignment tree:
    • Method 2:
      • Backtracking
      • GET obtained with a greedy approach: the EET of each child of the node is calculated, and the child with the lowest EET is included in the solution (see the sketch below)
      • LET = LET_ari + LET_com
        • LET_ari: the sequential time divided by the maximum achievable speed-up when using all the processors not yet discarded
        • LET_com: assumes the best logical topology of processes that can be obtained from this node
      • Fewer nodes are analyzed, but the cost of evaluating each node increases
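A sketch of that greedy upper bound (the traversal helpers are placeholders): from the current node, repeatedly descend to the child with the lowest EET; the best EET seen on the path is a valid GET for the subtree:

```python
def get_greedy(node, children, eet, max_level, level):
    """Greedy upper bound on the optimum EET below `node`."""
    best = eet(node)
    while level < max_level:
        kids = children(node)
        if not kids:
            break
        node = min(kids, key=eet)   # cheapest child according to the model
        level += 1
        best = min(best, eet(node))
    return best
```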

  15. Our HeHo methodology
  • Automatic searching strategies in the assignment tree:
    • Method 3:
      • Backtracking
      • GET obtained with a greedy approach: the EET of each child of the node is calculated, and the child with the lowest EET is included in the solution
      • LET = LET_ari + LET_com
        • LET_ari obtained with a greedy approach: for each node, the child that least increases the cost of the arithmetic operations is included in the solution to obtain the lower bound
        • LET_com: assumes the best logical topology of processes that can be obtained from this node
      • It is possible that a branch leading to an optimal solution will be discarded

  16. Our HeHo methodology
  • Automatic searching strategies in the assignment tree:
    • Method 4: greedy method on the current assignment tree (a combinatorial tree with repetitions)
    • Method 5: greedy method on a permutational tree with repetitions
    • In both methods 4 and 5, to obtain better logical topologies of the processes, the traversal continues (through the best child of each node) until the established maximum level is reached (see the sketch below)
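The sketch below contrasts the two greedy traversals, under the assumption that a node is a tuple of processor indices and eet is the model of the previous slides: Method 4 only adds processors with an index not lower than the last one chosen (the combinatorial tree), while Method 5 may add any processor at every step (the permutational tree with repetitions), so it inspects more children per node and can reach assignments the combinatorial descent misses:

```python
def greedy_descent(P, max_level, eet, permutational=False):
    """Descend greedily to max_level; return the best assignment seen."""
    node, best = (), ((), float("inf"))
    for _ in range(max_level):
        # Method 4: children keep indices non-decreasing (combinatorial).
        # Method 5: any processor may be appended (permutational).
        start = 0 if permutational or not node else node[-1]
        kids = [node + (i,) for i in range(start, P)]
        node = min(kids, key=eet)        # follow the cheapest child
        if eet(node) < best[1]:
            best = (node, eet(node))
    return best
```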

  17. Experimental Results
  • Human searching strategies in the assignment tree:
    • Greedy User (GU):
      • Use ALL the available processors
      • One process per processor
    • Conservative User (CU):
      • Use HALF of the available processors
      • One process per processor
    • Expert User (EU):
      • Use 1 processor, HALF or ALL of the processors, depending on the problem size
      • One process per processor

  18. Experimental Results
  • Figure: automatic decisions vs. users on SUNEt (n = 7680)

  19. Experimental Results
  • Figure: automatic decisions vs. users on TORC (n = 2048)

  20. Simulations
  • Virtual platforms: variations and/or enlargements of the real platforms:
    • mTORC-01:
      • The quantity of 17P4 is increased to 11
      • Number of processors: 29. Types of processors: 4
    • mTORC-02:
      • The quantities of DPIII, SPIII, Ath and 17P4 are increased to 10, 10, 10 and 20, respectively
      • Number of processors: 50. Types of processors: 4
    • mTORC-03:
      • The quantities of DPIII, SPIII, Ath and 17P4 are increased to 10, 15, 5 and 10, respectively
      • Additional processors have been included
      • Number of processors: 100. Types of processors: 10

  21. Simulations
  • Figure: automatic decisions vs. users on the virtual platform mTORC-01 (n = 20000)
    • The quantity of 17P4 is increased to 11
    • Number of processors: 29. Types of processors: 4

  22. Simulations
  • Figure: automatic decisions vs. users on the virtual platform mTORC-02 (n = 20000)
    • The quantities of DPIII, SPIII, Ath and 17P4 are increased to 10, 10, 10 and 20, respectively
    • Number of processors: 50. Types of processors: 4

  23. Simulations
  • Figure: automatic decisions vs. users on the virtual platform mTORC-03 (n = 20000)
    • The quantities of DPIII, SPIII, Ath and 17P4 are increased to 10, 15, 5 and 10, respectively
    • Additional processors have been included
    • Number of processors: 100. Types of processors: 10

  24. Conclusions
  • Extension of our previous self-optimisation methodology for homogeneous systems
  • On heterogeneous systems, new decisions:
    • Number of processes
    • Mapping processes → processors
  • Good results with the parallel LU factorisation
  • The same methodology could be applied to other linear algebra routines
