High Performance LU Factorization for Non-dedicated Clusters

Presentation Transcript


  1. High Performance LU Factorization for Non-dedicated Clusters and the future Grid Toshio Endo, Kenji Kaneda, Kenjiro Taura, Akinori Yonezawa (University of Tokyo)

  2. Background • Computing nodes on clusters/Grid are shared by multiple applications • To obtain good performance, HPC applications must cope with • Background processes • Dynamically changing sets of available nodes • Large latencies on the Grid

  3. Performance limiting factor: background processes • Other processes may run in the background • Network daemons, interactive shells, etc. • Many typical applications are written in a synchronous style • In such applications, the delay of a single node degrades the overall performance

  4. Performance limiting factor: large latencies on the Grid • In future Grid environments, bandwidth will accommodate HPC applications • Large latencies (>100 ms) will remain obstacles • Synchronous applications suffer from large latencies

  5. Available nodes change dynamically • Many HPC applications assume that the set of computing nodes is fixed • If applications support dynamically changing nodes, we can harness computing resources more efficiently!

  6. Goal of this work • An LU factorization algorithm that • Tolerates background processes & large latencies • Supports dynamically changing nodes • Key ingredients: overlapping multiple iterations, data mapping for dynamically changing nodes, and implementation in the Phoenix model → A fast HPC application on non-dedicated clusters and the Grid

  7. Outline of this talk • The Phoenix model • Our LU Algorithm • Overlapping multiple iterations • Data mapping for dynamically changing nodes • Performance of our LU and HPL • Related work • Summary

  8. Phoenix model [Taura et al. 03] • A message passing model for dynamically changing environments • Concept of virtual nodes • Virtual nodes as destinations of messages • [Figure: virtual nodes mapped onto physical nodes]
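
The slide only names the concept, so here is a minimal C sketch of the virtual-node idea (the names vn_owner, claim, and route_message are hypothetical; this is not the actual Phoenix API): messages are addressed to virtual node IDs, and whichever physical node currently claims a virtual node receives its messages.

    /* Hedged sketch of the virtual-node idea behind Phoenix.  The names
     * below are hypothetical and only illustrate that messages go to
     * virtual nodes, which physical nodes claim and hand over. */
    #include <stdio.h>

    #define NUM_VIRTUAL 1024                 /* fixed virtual node space */

    static int vn_owner[NUM_VIRTUAL];        /* current physical owner of each VN */

    /* A physical node claims a range of virtual nodes ... */
    static void claim(int phys, int vn_begin, int vn_end)
    {
        for (int v = vn_begin; v < vn_end; v++)
            vn_owner[v] = phys;
    }

    /* ... and a message to virtual node v is delivered to its current owner. */
    static void route_message(int v, const char *msg)
    {
        printf("deliver \"%s\" to physical node %d (virtual node %d)\n",
               msg, vn_owner[v], v);
    }

    int main(void)
    {
        claim(0, 0, NUM_VIRTUAL);            /* at first, node 0 owns everything */
        claim(1, 512, NUM_VIRTUAL);          /* node 1 joins and takes half      */
        route_message(100, "update block");  /* still goes to node 0             */
        route_message(900, "update block");  /* now goes to node 1               */
        return 0;
    }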

  9. Overview of our LU • Like typical implementations, • Based on message passing • The matrix is decomposed into small blocks • A block is updated by its owner node • Unlike typical implementations, • Asynchronous data-driven style for overlapping multiple iterations • Cyclic-like data mapping for any & dynamically changing number of nodes • (Currently, pivoting is not performed)

  10. LU factorization
      for (k=0; k<B; k++) {
        A(k,k) = fact(A(k,k));
        for (i=k+1; i<B; i++) A(i,k) = update_L(A(i,k), A(k,k));
        for (j=k+1; j<B; j++) A(k,j) = update_U(A(k,j), A(k,k));
        for (i=k+1; i<B; i++)
          for (j=k+1; j<B; j++)
            A(i,j) = A(i,j) - A(i,k) * A(k,j);
      }
      [Figure: the block matrix partitioned into the diagonal block, L part, U part, and trail part]
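
For reference, a self-contained sequential C sketch of the same blocked right-looking LU (no pivoting, no BLAS, no message passing), where block A(i,j) is the NB x NB submatrix starting at row i*NB, column j*NB of a row-major N x N array. It only illustrates the block algorithm on the slide, not the authors' parallel code.

    /* Sequential sketch of the blocked right-looking LU on slide 10. */
    #include <stdio.h>
    #include <stdlib.h>

    #define N  8                      /* matrix order (small, for illustration) */
    #define NB 2                      /* block size                             */
    #define B  (N / NB)               /* number of block rows/columns           */

    #define A(r, c) a[(r) * N + (c)]  /* row-major element access               */

    static void fact_diag(double *a, int k)          /* A(k,k) = fact(A(k,k))   */
    {
        int o = k * NB;
        for (int p = o; p < o + NB; p++)
            for (int r = p + 1; r < o + NB; r++) {
                A(r, p) /= A(p, p);
                for (int c = p + 1; c < o + NB; c++)
                    A(r, c) -= A(r, p) * A(p, c);
            }
    }

    static void update_L(double *a, int i, int k)    /* A(i,k) = A(i,k) * U(k,k)^-1 */
    {
        int oi = i * NB, ok = k * NB;
        for (int c = ok; c < ok + NB; c++)
            for (int r = oi; r < oi + NB; r++) {
                for (int p = ok; p < c; p++)
                    A(r, c) -= A(r, p) * A(p, c);
                A(r, c) /= A(c, c);
            }
    }

    static void update_U(double *a, int j, int k)    /* A(k,j) = L(k,k)^-1 * A(k,j) */
    {
        int oj = j * NB, ok = k * NB;
        for (int r = ok; r < ok + NB; r++)
            for (int c = oj; c < oj + NB; c++)
                for (int p = ok; p < r; p++)
                    A(r, c) -= A(r, p) * A(p, c);
    }

    static void update_trail(double *a, int i, int j, int k)  /* A(i,j) -= A(i,k)*A(k,j) */
    {
        int oi = i * NB, oj = j * NB, ok = k * NB;
        for (int r = oi; r < oi + NB; r++)
            for (int c = oj; c < oj + NB; c++)
                for (int p = ok; p < ok + NB; p++)
                    A(r, c) -= A(r, p) * A(p, c);
    }

    int main(void)
    {
        double *a = malloc(sizeof(double) * N * N);
        for (int r = 0; r < N; r++)              /* diagonally dominant test matrix */
            for (int c = 0; c < N; c++)
                A(r, c) = (r == c) ? N : 1.0 / (1 + r + c);

        for (int k = 0; k < B; k++) {            /* the loop nest from slide 10 */
            fact_diag(a, k);
            for (int i = k + 1; i < B; i++) update_L(a, i, k);
            for (int j = k + 1; j < B; j++) update_U(a, j, k);
            for (int i = k + 1; i < B; i++)
                for (int j = k + 1; j < B; j++)
                    update_trail(a, i, j, k);
        }
        printf("L and U are stored in place; A(0,0) = %f\n", A(0, 0));
        free(a);
        return 0;
    }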

  11. Naïve implementation and its problem • Iterations are separated • Not tolerant to latencies/background processes! • [Figure: number of executable tasks (diagonal, U, L, trail) over time for the k-th, (k+1)-th, and (k+2)-th iterations]

  12. Latency hiding techniques • Overlapping iterations hides latencies • Computation of the diagonal/L/U parts is advanced (started early) • If computation of the trail parts is kept separate, only two adjacent iterations are overlapped • There is room for further improvement

  13. Overlapping multiple iterations for more tolerance • We overlap multiple iterations • by computing all blocks, including the trail parts, asynchronously • A data-driven style & prioritized task scheduling are used

  14. Prioritized task scheduling • We assign a priority to the updating task of each block • The k-th update of block A(i,j) has priority min(i-S, j-S, k) (a smaller number means higher priority), where S is the desired overlap depth • We can control overlapping by changing the value of S
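
A small C sketch of this priority rule, assuming a generic sorted ready queue rather than the authors' actual scheduler: each ready task carries (i, j, k), its priority is min(i-S, j-S, k), and tasks run in increasing priority value.

    /* Hedged sketch of the priority rule min(i-S, j-S, k) from slide 14.
     * Only the rule itself comes from the slides; the ready queue below
     * (plain qsort) is a generic illustration. */
    #include <stdio.h>
    #include <stdlib.h>

    #define S 5                               /* desired overlap depth */

    typedef struct { int i, j, k; } task_t;   /* k-th update of block A(i,j) */

    static int priority(task_t t)
    {
        int p = t.i - S;
        if (t.j - S < p) p = t.j - S;
        if (t.k     < p) p = t.k;
        return p;                             /* smaller number = higher priority */
    }

    static int by_priority(const void *x, const void *y)
    {
        return priority(*(const task_t *)x) - priority(*(const task_t *)y);
    }

    int main(void)
    {
        /* A few ready tasks from different iterations. */
        task_t ready[] = { {9, 9, 0}, {1, 1, 1}, {6, 2, 2}, {3, 8, 3} };
        int n = sizeof(ready) / sizeof(ready[0]);

        qsort(ready, n, sizeof(task_t), by_priority);   /* execution order */
        for (int t = 0; t < n; t++)
            printf("update k=%d of A(%d,%d), priority %d\n",
                   ready[t].k, ready[t].i, ready[t].j, priority(ready[t]));
        return 0;
    }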

  15. Typical data mapping and its problem • Two-dimensional block-cyclic distribution • [Figure: matrix blocks distributed cyclically over nodes P0-P5] • Good load balance and small communication, but • The number of nodes must be fixed and factored into two small numbers • How to support dynamically changing nodes?
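
For contrast with the next slide, a minimal C sketch of the conventional 2-D block-cyclic owner formula (one common convention; the 2 x 3 grid and row-major grid ordering are assumptions): the P nodes must be arranged as a Pr x Pc grid, which is why the node count has to factor into two small numbers.

    /* Conventional 2-D block-cyclic mapping: block A(i,j) goes to grid
     * position (i mod Pr, j mod Pc).  Grid shape and ordering here are
     * assumptions for illustration. */
    #include <stdio.h>

    static int owner_2d(int i, int j, int Pr, int Pc)
    {
        return (i % Pr) * Pc + (j % Pc);      /* node id of block A(i,j) */
    }

    int main(void)
    {
        /* 6 nodes arranged as a 2 x 3 grid, like P0..P5 in the figure. */
        for (int i = 0; i < 4; i++) {
            for (int j = 0; j < 6; j++)
                printf("P%d ", owner_2d(i, j, 2, 3));
            printf("\n");
        }
        return 0;
    }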

  16. Our data mapping for dynamically changing nodes • A random permutation is applied to the block rows and columns • The permutation is common among all nodes • [Figure: original matrix (blocks A00-A77), the random permutation, and the resulting permuted matrix]
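
A hedged C sketch of the idea on this slide: every node derives the same pseudo-random permutation of block indices from a shared seed and then deals the permuted blocks out in a cyclic-like fashion, so any node count works. The exact owner formula is not given in the slides; the one below is only illustrative.

    /* Hedged sketch of the permuted, cyclic-like data mapping on slide 16.
     * The owner formula is an assumption for illustration only. */
    #include <stdio.h>
    #include <stdlib.h>

    #define BLOCKS 8                          /* B x B blocks, as in the figure */

    static int perm[BLOCKS];

    static void make_shared_permutation(unsigned seed)
    {
        srand(seed);                          /* same seed on all nodes => same perm */
        for (int v = 0; v < BLOCKS; v++) perm[v] = v;
        for (int v = BLOCKS - 1; v > 0; v--) {        /* Fisher-Yates shuffle */
            int w = rand() % (v + 1);
            int t = perm[v]; perm[v] = perm[w]; perm[w] = t;
        }
    }

    /* Cyclic-like assignment of permuted blocks to nnodes nodes. */
    static int owner(int i, int j, int nnodes)
    {
        return (perm[i] * BLOCKS + perm[j]) % nnodes;
    }

    int main(void)
    {
        make_shared_permutation(42);
        for (int nnodes = 5; nnodes <= 6; nnodes++) {  /* works for any node count */
            int count[16] = {0};
            for (int i = 0; i < BLOCKS; i++)
                for (int j = 0; j < BLOCKS; j++)
                    count[owner(i, j, nnodes)]++;
            printf("%d nodes: blocks per node =", nnodes);
            for (int p = 0; p < nnodes; p++) printf(" %d", count[p]);
            printf("\n");
        }
        return 0;
    }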

  17. Dynamically joining nodes • A new node sends a steal message to one of the existing nodes • The receiver abandons some virtual nodes and sends the corresponding blocks to the new node • The new node takes over those virtual nodes and blocks • For better load balance, the stealing process is repeated • [Figure: original and permuted matrices on each node before and after stealing]
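
A speculative C sketch of the bookkeeping behind this join protocol (message passing and the actual block transfer are elided, and all names are hypothetical): a joining node repeatedly picks the most loaded node and steals enough virtual nodes to roughly equalize their loads.

    /* Hedged sketch of the node-join / stealing bookkeeping on slide 17. */
    #include <stdio.h>

    #define NUM_VIRTUAL 1024

    static int vn_owner[NUM_VIRTUAL];          /* physical owner of each virtual node */
    static int nphys = 0;

    static int load(int p)                     /* virtual nodes owned by node p */
    {
        int n = 0;
        for (int v = 0; v < NUM_VIRTUAL; v++) if (vn_owner[v] == p) n++;
        return n;
    }

    static void steal(int thief, int victim)   /* victim hands over virtual nodes */
    {
        int give = (load(victim) - load(thief)) / 2;   /* roughly equalize */
        for (int v = 0; v < NUM_VIRTUAL && give > 0; v++)
            if (vn_owner[v] == victim) { vn_owner[v] = thief; give--; }
    }

    static void join(void)                     /* a new physical node joins */
    {
        int me = nphys++;
        for (int round = 0; round < 3; round++) {      /* repeated stealing */
            int victim = 0;                    /* pick the most loaded node */
            for (int p = 0; p < nphys; p++)
                if (load(p) > load(victim)) victim = p;
            if (victim != me) steal(me, victim);
        }
    }

    int main(void)
    {
        nphys = 1;                             /* node 0 starts with everything */
        for (int v = 0; v < NUM_VIRTUAL; v++) vn_owner[v] = 0;

        join(); join(); join();                /* three nodes join dynamically */
        for (int p = 0; p < nphys; p++)
            printf("node %d owns %d virtual nodes\n", p, load(p));
        return 0;
    }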

  18. Experimental environments (1) • 112-node IBM BladeCenter cluster • Dual 2.4 GHz Xeon: 70 nodes + dual 2.8 GHz Xeon: 42 nodes • 1 CPU per node is used • The slower CPUs (2.4 GHz) determine the overall performance • Gigabit Ethernet

  19. Experimental environments (2) • High Performance Linpack (HPL) by Petitet et al. • GOTO BLAS by Kazushige Goto (UT-Austin) • Ours (S=0): no explicit overlap • Ours (S=1): overlap with one adjacent iteration • Ours (S=5): overlap multiple (5) iterations

  20. Scalability • Ours (S=5) achieves 190 GFlops with 108 nodes • 65 times speedup • Matrix size N=61440 • Block size NB=240 • Overlap depth S=0 or 5 • [Graph annotations: x65 and x72 speedups]

  21. Tolerance to background processes (1) • We run LU/HPL with background processes • We run 3 background processes per randomly chosen node • The background processes are short-term • They move to other random nodes every 10 secs

  22. Tolerance to background processes (2) • HPL slows down heavily • Ours (S=0) and Ours (S=1) also suffer • By overlapping multiple iterations (S=5), our LU becomes more tolerant! • [Graph annotations: slowdowns of 16%, 26%, 31%, and 36%] • 108 nodes for computation • N=46080

  23. Tolerance to large latencies (1) • We emulate the future Grid environment with high bandwidth & large latencies • Experiments are done on a cluster • Large latencies are emulated by software • +0ms, +200ms, +500ms

  24. Tolerance to large latencies (2) • S=0 suffers by 28% • Overlapping of iterations makes our LU more tolerant • Both S=1 and S=5 work well (about 19-20% slowdown) • 108 nodes for computation • N=46080

  25. Performance with joining nodes (1) • 16 nodes at first, then 48 nodes are added dynamically (64 nodes in total)

  26. Performance with joining nodes (2) • Flexibility in the number of nodes is useful for obtaining higher performance (1.9x faster in this experiment) • Compared with Fixed-64, Dynamic suffers from migration overhead, etc. • N=30720 • S=5

  27. Related Work Dyn-MPI [Weatherly et al. 03] • An extended MPI library that supports dynamically changing nodes

  28. Summary • An LU implementation suitable for non-dedicated clusters and the Grid • Scalable • Supports dynamically changing nodes • Tolerates background processes & large latencies

  29. Future Work • Perform pivoting • More data dependencies are introduced • Is our LU still tolerant? • Improve dynamic load balancing • Choose better target nodes for stealing • Take CPU speeds into account • Apply our approach to other HPC applications • CFD applications

  30. Thank you!
