
Implementing Parallel CG Algorithm on the EARTH Multi-threaded Architecture


Presentation Transcript


  1. Implementing Parallel CG Algorithm on the EARTH Multi-threaded Architecture. Fei Chen, Kevin B. Theobald, Guang R. Gao. CAPSL, Electrical and Computer Engineering, University of Delaware. Cluster 2004, Thursday, September 23rd, 2004

  2. Outline • Introduction • Algorithm • Results • Conclusion • Future Work

  3. Introduction The Conjugate Gradient (CG) method is the most popular iterative method for solving large systems of linear equations Ax = b [1].

  4. Introduction (continued) The matrix A is usually large and sparse, and as previous studies have shown, the matrix-vector multiply (MVM) accounts for about 95% of the CPU time, with the remaining 5% spent on vector-vector products (VVP) [2]. Parallel CG algorithm: distribute A and x among the nodes; perform the local MVMs (A1x1, A2x2, ..., Anxn); reduce and broadcast the scalar variables; calculate the new local vectors; redistribute the new local vectors.

  5. Introduction (continued) EARTH (Efficient Architecture for Running Threads) architecture [3]. EARTH supports fibers, which are non-preemptive and scheduled in response to dataflow-like synchronization operations. Data and control dependences among those fibers are explicitly programmed (in Threaded-C) with EARTH operations [4].

  6. Algorithm Design objectives: find a matrix blocking method that reduces the overall communication cost, and overlap communication with computation to reduce the communication cost further. We propose a two-dimensional pipelined method combined with the EARTH multi-threading technique.

  7.–12. Algorithm (continued) Horizontal Blocking Method [animation frames: the matrix and vector are split into four horizontal blocks (1–4) across the nodes; figure only]

  13. Algorithm (continued) Horizontal Blocking Method [final animation frame]. P is the number of nodes and N is the size of the matrix. Inner-phase communication cost: Cn = 0. Inter-phase communication cost: Ct = P (P − 1) N / P = N (P − 1). Overall communication cost: C = Cn + Ct = N (P − 1) ≈ NP.

  14.–27. Algorithm (continued) Vertical Blocking Method (Pipelined) [animation frames: the matrix is split into four vertical blocks (1–4) and the result vector is built up in pipeline steps 1–4; figure only]

  28. Algorithm (continued) Vertical Blocking Method (Pipelined) [final animation frame]. P is the number of nodes and N is the size of the matrix. Inner-phase communication cost: Cn = (N / P) · P · P = NP. Inter-phase communication cost: Ct = 0. Overall communication cost: C = Cn + Ct = NP.

  29.–34. Algorithm (continued) Two-dimensional Pipelined Method [animation frames: the matrix is split into a two-dimensional grid of blocks across nodes 1–4 and processed in pipeline steps; figure only]

  35.–39. Algorithm (continued) Two-dimensional Pipelined Method [animation frames]. P is the number of nodes and N is the size of the matrix. Inner-phase communication cost: Cn = (N / P) · sqrt(P) · P = N sqrt(P).

  40. Algorithm (continued) Two-dimensional Pipelined Method [final animation frame]. P is the number of nodes and N is the size of the matrix. Inner-phase communication cost: Cn = (N / P) · sqrt(P) · P = N sqrt(P). Inter-phase communication cost: Ct = (N / P) · sqrt(P) · P = N sqrt(P). Overall communication cost: C = Cn + Ct = 2 N sqrt(P).

  41. Algorithm (continued) Multi-threading Technique. There is no data dependence between the two halves. When the first-half MVM finishes in the Execution Unit (EU) and the request to send out the first half of the result vector is written to the Event Queue (EQ), the second-half MVM can start on the EU immediately. The EARTH system dedicates a Synchronization Unit (SU) to handling communication requests across the network, so communication and computation can be overlapped.

  42. Scalability Results Test platforms: Chiba City at Argonne National Laboratory (ANL), a cluster of 256 dual-CPU nodes connected with Fast Ethernet; SEMi, a MANNA machine simulator [5]. We used the same matrices as the NAS parallel CG benchmark [6].

  43. Scalability Results (continued) Threaded-C implementation scalability results on Chiba City

  44. Scalability Results (continued) NAS CG (MPI) benchmark scalability results on Chiba City

  45. Scalability Results (continued) Scalability Comparison with NAS Parallel CG on Chiba City

  46. Scalability Results (continued) Threaded-C implementation scalability results on SEMi

  47. Conclusion With the two-dimensional pipelined method, the overall communication cost can be reduced to 2 / sqrt(P) of that of a one-dimensional blocking method (vertical or horizontal). The underlying EARTH system, an adaptive, event-driven multi-threaded execution model, makes it possible to overlap communication and computation in our implementation. A notable scalability improvement was achieved by implementing the two-dimensional pipelined method on the EARTH multi-threaded architecture.

  48. Future Work Port the EARTH runtime system to clusters with Myrinet interconnects. Provide a set of simple programming interfaces to help users reduce the coding effort. Investigate how to use the two-dimensional pipelined method and EARTH system support to improve the performance of parallel scientific computing tools.

  49. References [1] Jonathan R. Shewchuk, An Introduction to the Conjugate Gradient Method Without the Agonizing Pain. [2] P. Kloos, P. Blaise, F. Mathey, OpenMP and MPI Programming with a CG Algorithm, page 5, CEA, http://www.epcc.ed.ac.uk/ewomp2000/Presentations/KloosSlides.pdf. [3] Herbert H. J. Hum, Olivier Maquelin, Kevin B. Theobald, Xinmin Tian, Guang R. Gao, and Laurie J. Hendren, A Study of the EARTH-MANNA Multi-threaded System, International Journal of Parallel Programming, 24(4):319–347, August 1996. [4] Kevin B. Theobald, EARTH: An Efficient Architecture for Running Threads, PhD thesis, McGill University, Montreal, Quebec, May 1999. [5] Kevin B. Theobald, SEMi: A Simulator for EARTH, MANNA, and i860 (Version 0.23), CAPSL Technical Memo 27, March 1, 1999. ftp://ftp.capsl.udel.edu/pub/doc/memos. [6] R. C. Agarwal, B. Alpern, L. Carter, F. G. Gustavson, D. J. Klepacki, R. Lawrence, M. Zubair, High-Performance Parallel Implementations of the NAS Kernel Benchmarks on the IBM SP2, IBM Systems Journal, 34(2), 1995.
