
Block LU Factorization Lecture 24


Presentation Transcript


  1. Block LU Factorization, Lecture 24, MA471 Fall 2003

  2. Example Case
  • Suppose we are faced with the solution of a linear system Ax = b
  • Further suppose:
    • A is large (dim(A) > 10,000)
    • A is dense
    • A is full
    • We have a sequence of different b vectors.

  3. Problems
  • Suppose we are able to compute the matrix:
  • It costs N² doubles to store the matrix.
  • E.g. for N = 100,000 we require 76.3 gigabytes of storage for the matrix alone.
  • 32-bit processors are limited to 4 gigabytes of memory.
  • Most desktops (even 64-bit) do not have 76.3 gigabytes.
  • What to do?

  4. Divide and Conquer
  One approach is to assume we have a square number of processors. We then divide the matrix into blocks, storing one block per processor:
  P0  P1  P2  P3
  P4  P5  P6  P7
  P8  P9  P10 P11
  P12 P13 P14 P15
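  One way to see this partitioning in MATLAB is mat2cell; the 4-by-4 grid, matrix size, and processor numbering below are an illustrative sketch, not part of the original slides.

    % Partition an N-by-N matrix into a 4x4 grid of blocks, one per "processor"
    nproc = 16;  q = sqrt(nproc);        % q-by-q processor grid (here q = 4)
    N = 8;  nb = N/q;                    % block size (illustrative)
    A = rand(N);
    blocks = mat2cell(A, nb*ones(1,q), nb*ones(1,q));
    % blocks{i,j} would be stored on processor P((i-1)*q + j - 1), row by row
    size(blocks)                         % 4x4 cell array of nb-by-nb blocks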

  5. Back to the Linear System
  • We are now faced with LU factorization of a distributed matrix.
  • This calls for a modified LU routine which acts on blocks of the matrix.
  • We will demonstrate this algorithm for one level.
  • i.e. we need to construct matrices L, U such that A = LU and we only store single blocks of A, L, U on any processor.

  6. Constructing the Block LU Factorization
  [ A00 A01 A02 ]   [ L00  0   0 ]   [ U00 U01 U02 ]
  [ A10 A11 A12 ] = [ L10  1   0 ] * [  0  Ã11 Ã12 ]
  [ A20 A21 A22 ]   [ L20  0   1 ]   [  0  Ã21 Ã22 ]
  First we LU factorize A00 and look for the above block factorization. However, we need to figure out what each of the entries is:
  A00 = L00*U00 (compute L00, U00 by LU factorization)
  A01 = L00*U01 => U01 = L00\A01
  A02 = L00*U02 => U02 = L00\A02
  A10 = L10*U00 => L10 = A10/U00
  A20 = L20*U00 => L20 = A20/U00
  A11 = L10*U01 + Ã11 => Ã11 = A11 – L10*U01 ...

  7. cont.
  A00 = L00*U00 (compute L00, U00 by LU factorization)
  A01 = L00*U01 => U01 = L00\A01
  A02 = L00*U02 => U02 = L00\A02
  A10 = L10*U00 => L10 = A10/U00
  A20 = L20*U00 => L20 = A20/U00
  A11 = L10*U01 + Ã11 => Ã11 = A11 – L10*U01
  A12 = L10*U02 + Ã12 => Ã12 = A12 – L10*U02
  A21 = L20*U01 + Ã21 => Ã21 = A21 – L20*U01
  A22 = L20*U02 + Ã22 => Ã22 = A22 – L20*U02
  In the general case: Anm = Ln0*U0m + Ãnm => Ãnm = Anm – Ln0*U0m
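  These first-stage formulas translate directly into MATLAB's backslash and slash operators. The sketch below works an example on a 3x3 block partition; the block size and test matrix are illustrative, not from the original slides.

    % First stage of the block LU on a 3x3 block partition (illustrative)
    nb = 4;                                      % hypothetical block size
    A  = rand(3*nb) + 3*nb*eye(3*nb);            % test matrix, made diagonally dominant
    B  = mat2cell(A, [nb nb nb], [nb nb nb]);    % B{n+1,m+1} holds block Anm

    [L00, U00] = lu(B{1,1});     % A00 = L00*U00 (MATLAB folds row pivoting into L00)
    U01 = L00 \ B{1,2};          % U01 = L00\A01
    U02 = L00 \ B{1,3};          % U02 = L00\A02
    L10 = B{2,1} / U00;          % L10 = A10/U00
    L20 = B{3,1} / U00;          % L20 = A20/U00

    At11 = B{2,2} - L10*U01;     % Ã11 = A11 - L10*U01 (Schur complement blocks)
    At12 = B{2,3} - L10*U02;     % Ã12 = A12 - L10*U02
    At21 = B{3,2} - L20*U01;     % Ã21 = A21 - L20*U01
    At22 = B{3,3} - L20*U02;     % Ã22 = A22 - L20*U02

    norm(B{2,1} - L10*U00)       % sanity check: should be near machine precision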

  8. Summary First Stage
  [ A00 A01 A02 ]   [ L00  0   0 ]   [ U00 U01 U02 ]
  [ A10 A11 A12 ] = [ L10  1   0 ] * [  0  Ã11 Ã12 ]
  [ A20 A21 A22 ]   [ L20  0   1 ]   [  0  Ã21 Ã22 ]
  First step: LU factorize the uppermost diagonal block.
  Second step: a) compute U0n = L00\A0n, n>0
               b) compute Ln0 = An0/U00, n>0
  Third step: compute Ãnm = Anm – Ln0*U0m, (n,m>0)

  9. Now Factorize Lower SE Block
  [ Ã11 Ã12 ]   [ L11  0 ]   [ U11 U12 ]
  [ Ã21 Ã22 ] = [ L21  1 ] * [  0  Ã̃22 ]
  We repeat the previous algorithm, this time on the two-by-two SE block.
  Here Ã̃22 = Ã22 – L21*U12 is the twice-updated corner block; in the final step it is itself LU factorized as Ã̃22 = L22*U22.

  10. End Result
  [ A00 A01 A02 ]   [ L00  0   0  ]   [ U00 U01 U02 ]
  [ A10 A11 A12 ] = [ L10 L11  0  ] * [  0  U11 U12 ]
  [ A20 A21 A22 ]   [ L20 L21 L22 ]   [  0   0  U22 ]

  11. Matlab Version
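  The original slide showed the course's MATLAB implementation (blocklu.m, linked on the last slide), which did not survive in this transcript. Below is a minimal one-level block LU sketch along the lines of the previous slides; the function name, loop structure, and absence of pivoting across blocks are assumptions, not the original code.

    function [L, U] = blocklu_sketch(A, nb)
    % One-level block LU: A is N-by-N with N a multiple of the block size nb.
    % Illustrative sketch only, not the original blocklu.m.
      N = size(A,1);  p = N/nb;                  % p-by-p grid of nb-by-nb blocks
      L = zeros(N);   U = zeros(N);
      for k = 1:p
        kk = (k-1)*nb+1 : k*nb;                  % indices of block row/column k
        [L(kk,kk), U(kk,kk)] = lu(A(kk,kk));     % factorize diagonal block: Akk = Lkk*Ukk
        for n = k+1:p
          nn = (n-1)*nb+1 : n*nb;
          U(kk,nn) = L(kk,kk) \ A(kk,nn);        % Ukn = Lkk\Akn
          L(nn,kk) = A(nn,kk) / U(kk,kk);        % Lnk = Ank/Ukk
        end
        for n = k+1:p
          nn = (n-1)*nb+1 : n*nb;
          for m = k+1:p
            mm = (m-1)*nb+1 : m*nb;
            A(nn,mm) = A(nn,mm) - L(nn,kk)*U(kk,mm);   % Schur update: Ãnm = Anm - Lnk*Ukm
          end
        end
      end
    end

  For example, with A = rand(12) + 12*eye(12) and [L,U] = blocklu_sketch(A,4), norm(A - L*U) comes out at the level of rounding error; L is only block lower triangular because MATLAB's two-output lu folds row pivoting into the diagonal blocks.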

  12. Parallel Algorithm
  Processor grid (one block per processor):
  P0 P1 P2
  P3 P4 P5
  P6 P7 P8
  P0: A00 = L00*U00 (compute L00, U00 by LU factorization)
  P1: U01 = L00\A01
  P2: U02 = L00\A02
  P3: L10 = A10/U00
  P6: L20 = A20/U00
  P4: A11 <- A11 – L10*U01
  P5: A12 <- A12 – L10*U02
  P7: A21 <- A21 – L20*U01
  P8: A22 <- A22 – L20*U02
  In the general case: Anm = Ln0*U0m + Ãnm => Ãnm = Anm – Ln0*U0m

  13. Parallel Communication
  P0: L00, U00 = lu(A00)
  P1: U01 = L00\A01
  P2: U02 = L00\A02
  P3: L10 = A10/U00
  P6: L20 = A20/U00
  P4: A11 <- A11 – L10*U01
  P5: A12 <- A12 – L10*U02
  P7: A21 <- A21 – L20*U01
  P8: A22 <- A22 – L20*U02
  In the general case: Anm = Ln0*U0m + Ãnm => Ãnm = Anm – Ln0*U0m
  After this stage the processor grid holds:
  L00,U00  U01  U02
  L10      A11  A12
  L20      A21  A22

  14. Communication Summary
  Processor grid contents:
  P0: L00,U00   P1: U01   P2: U02
  P3: L10       P4: A11   P5: A12
  P6: L20       P7: A21   P8: A22
  P0: L00, U00 = lu(A00)
  P1: U01 = L00\A01
  P2: U02 = L00\A02
  P3: L10 = A10/U00
  P6: L20 = A20/U00
  P4: A11 <- A11 – L10*U01
  P5: A12 <- A12 – L10*U02
  P7: A21 <- A21 – L20*U01
  P8: A22 <- A22 – L20*U02
  Communication required:
  P0: sends L00 to P1, P2; sends U00 to P3, P6
  P1: sends U01 to P4, P7
  P2: sends U02 to P5, P8
  P3: sends L10 to P4, P5
  P6: sends L20 to P7, P8

  15. Upshot
  1st stage:
  (a) P0: sends L00 to P1, P2; sends U00 to P3, P6
  (b) P1: sends U01 to P4, P7
  (c) P2: sends U02 to P5, P8
  (d) P3: sends L10 to P4, P5
  (e) P6: sends L20 to P7, P8
  2nd stage:
  (f) P4: sends L11 to P5; sends U11 to P7
  (g) P5: sends U12 to P8
  (h) P7: sends L21 to P8
  Notes:
  • I added an MPI_Barrier purely to separate the LU factorization and the backsolve.
  • In terms of efficiency we can see that quite a bit of time is spent in MPI_Wait compared to compute time.
  • The compute part of this code can be optimized much more, making the parallel efficiency even worse.

  16. Block Back Solve
  • After factorization we are left with the task of using the distributed L and U to compute the backsolve.
  Block distribution of L and U on the processor grid:
  P0: L00, U00   P1: U01        P2: U02
  P3: L10        P4: L11, U11   P5: U12
  P6: L20        P7: L21        P8: L22, U22

  17. Recall
  • Given an LU factorization of A, namely L, U such that A = LU
  • Then we can solve Ax = b by:
  • y = L\b
  • x = U\y
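  In MATLAB this is the familiar pair of triangular solves following a factorization (a small illustrative example, not from the slides):

    % Solve Ax = b with one LU factorization and two triangular solves
    A = rand(5) + 5*eye(5);      % small test matrix (illustrative)
    b = rand(5,1);
    [L, U] = lu(A);              % A = L*U (row pivoting folded into L)
    y = L \ b;                   % forward solve  L*y = b
    x = U \ y;                   % backward solve U*x = y
    norm(A*x - b)                % should be near machine precision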

  18. Distributed Back Solve
  [ L00  0   0  ]   [ y0 ]   [ b0 ]
  [ L10 L11  0  ] * [ y1 ] = [ b1 ]
  [ L20 L21 L22 ]   [ y2 ]   [ b2 ]
  P0: solve L00*y0 = b0; send y0 to P3, P6
  P3: send L10*y0 to P4
  P4: solve L11*y1 = b1 – L10*y0; send y1 to P7
  P6: send L20*y0 to P8
  P7: send L21*y1 to P8
  P8: solve L22*y2 = b2 – L20*y0 – L21*y1
  Results: y0 on P0, y1 on P4, y2 on P8

  19. Matlab Code
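  As with the factorization slide, the MATLAB code shown here did not survive in the transcript. A minimal serial sketch of the block forward and backward substitution follows; the function name, block size argument, and variable names are assumptions, not the original course code.

    function x = block_trisolve_sketch(L, U, b, nb)
    % Solve (L*U)*x = b blockwise: forward solve L*y = b, then back solve U*x = y.
    % Illustrative sketch only.
      N = length(b);  p = N/nb;
      y = zeros(N,1);  x = zeros(N,1);
      % forward substitution over block rows: Lnn*yn = bn - sum_{m<n} Lnm*ym
      for n = 1:p
        nn = (n-1)*nb+1 : n*nb;
        rhs = b(nn);
        for m = 1:n-1
          mm = (m-1)*nb+1 : m*nb;
          rhs = rhs - L(nn,mm)*y(mm);
        end
        y(nn) = L(nn,nn) \ rhs;
      end
      % backward substitution over block rows: Unn*xn = yn - sum_{m>n} Unm*xm
      for n = p:-1:1
        nn = (n-1)*nb+1 : n*nb;
        rhs = y(nn);
        for m = n+1:p
          mm = (m-1)*nb+1 : m*nb;
          rhs = rhs - U(nn,mm)*x(mm);
        end
        x(nn) = U(nn,nn) \ rhs;
      end
    end

  Paired with the factorization sketch above, x = block_trisolve_sketch(L, U, b, nb) should agree with A\b up to rounding. Note how y1 cannot be computed before y0 arrives, which is the serial dependence highlighted on the next slides.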

  20. Back Solve
  After the factorization we use it to compute the solution of Ax = b. This consists of two distributed block triangular systems to solve: first Ly = b, then Ux = y.

  21. Barrier Between Back Solves
  This time I inserted an MPI_Barrier call between the backsolves. This highlights the serial nature of the backsolves.

  22. Example Code
  http://www.math.unm.edu/~timwar/MA471F03/blocklu.m
  http://www.math.unm.edu/~timwar/MA471F03/parlufact2.c
