
Parallel Algorithm Design Case Study: Tridiagonal Solvers



  1. Parallel Algorithm Design Case Study: Tridiagonal Solvers Xian-He Sun Department of Computer Science Illinois Institute of Technology sun@iit.edu CS546

  2. Outline • Problem Description • Parallel Algorithms • The Partition Method • The PPT Algorithm • The PDD Algorithm • The LU Pipelining Algorithm • The PTH Method and PPD Algorithm • Implementations

  3. Problem Description • Tridiagonal linear system
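The slide's equation did not survive the transcript; in standard notation (a hedged reconstruction), the problem is the tridiagonal linear system

```latex
A x = d, \qquad
A =
\begin{pmatrix}
b_1 & c_1    &         &         \\
a_2 & b_2    & c_2     &         \\
    & \ddots & \ddots  & \ddots  \\
    &        & a_N     & b_N
\end{pmatrix}.
```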

  4. Sequential Solver • Problem (k = 2, …, N) • Forward step (k = 2, …, N) • Backward step (k = N−1, …, 1)
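The forward/backward steps named above are the classic sequential (Thomas) elimination; a minimal Python sketch, my own illustration rather than code from the slides:

```python
def thomas_solve(a, b, c, d):
    """Solve a tridiagonal system: a = sub-diagonal (a[0] unused),
    b = diagonal, c = super-diagonal (c[-1] unused), d = right-hand side.
    Forward elimination then back substitution, O(N) work."""
    n = len(b)
    cp = [0.0] * n   # modified super-diagonal
    dp = [0.0] * n   # modified right-hand side
    cp[0] = c[0] / b[0]
    dp[0] = d[0] / b[0]
    # Forward step (k = 2, ..., N in the slide's 1-based numbering)
    for k in range(1, n):
        m = b[k] - a[k] * cp[k - 1]
        cp[k] = c[k] / m if k < n - 1 else 0.0
        dp[k] = (d[k] - a[k] * dp[k - 1]) / m
    # Backward step (k = N-1, ..., 1)
    x = [0.0] * n
    x[-1] = dp[-1]
    for k in range(n - 2, -1, -1):
        x[k] = dp[k] - cp[k] * x[k + 1]
    return x

# Example: A = [[2,1,0],[1,2,1],[0,1,2]], d = [3,4,3] has solution [1,1,1]
x = thomas_solve([0.0, 1.0, 1.0], [2.0, 2.0, 2.0], [1.0, 1.0, 0.0], [3.0, 4.0, 3.0])
```

The forward step destroys the opportunity for naive parallelism, since each k depends on k−1; that data dependence is what the partition methods below work around.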

  5. Partition [Figure: parallelization stages Decomposition, Assignment, Orchestration, Mapping, taking a sequential computation through tasks and processes to a parallel program on processors P0-P3]

  6. The Matrix Modification Formula: The Partition of Tridiagonal Systems [matrix equation garbled in transcript]
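The matrix on this slide is unrecoverable from the transcript. The modification formula in question is presumably the standard Sherman-Morrison-Woodbury identity (slide 48 describes PPT as "based on rank-one modification"); a hedged reconstruction: split A into a block-diagonal part (the independent subsystems) plus a low-rank term carrying the coupling entries,

```latex
A = \tilde{A} + V E^{T}, \qquad
A^{-1} = \tilde{A}^{-1}
       - \tilde{A}^{-1} V \left( I + E^{T} \tilde{A}^{-1} V \right)^{-1} E^{T} \tilde{A}^{-1},
```

so a solve with A costs one solve per subsystem plus a small reduced system (the Zy = h of slide 10).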

  7. [Equation garbled in transcript] The vectors referred to are column vectors with the ith element being one and all other entries being zero.

  8. The Solving Process • Solve the subsystems in parallel • Solve the reduced system • Modification

  9. The Solving Procedure

  10. The Reduced System (Zy = h) • Needs global communication

  11. The Parallel Partition LU (PPT) Algorithm
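The PPT steps can be sketched for the smallest case, p = 2: solve the two decoupled blocks, form the 2x2 reduced system Zy = h, and correct. This pure-Python sketch is my own illustration (the helper names `thomas`, `blk1`, `blk2` are assumptions); it checks that the modification is exact by comparing against a direct sequential solve.

```python
def thomas(a, b, c, d):
    """Sequential tridiagonal solve; a = sub-diagonal (a[0] unused),
    b = diagonal, c = super-diagonal (c[-1] unused), d = right-hand side."""
    n = len(b)
    cp, dp, x = [0.0] * n, [0.0] * n, [0.0] * n
    cp[0], dp[0] = c[0] / b[0], d[0] / b[0]
    for k in range(1, n):
        m = b[k] - a[k] * cp[k - 1]
        cp[k] = c[k] / m if k < n - 1 else 0.0
        dp[k] = (d[k] - a[k] * dp[k - 1]) / m
    x[-1] = dp[-1]
    for k in range(n - 2, -1, -1):
        x[k] = dp[k] - cp[k] * x[k + 1]
    return x

# A 6x6 tridiagonal system split into two 3x3 blocks (p = 2, split at m = 3).
a = [0.0, -1.0, -1.0, -1.0, -1.0, -1.0]   # sub-diagonal
b = [4.0] * 6                              # main diagonal
c = [-1.0, -1.0, -1.0, -1.0, -1.0, 0.0]   # super-diagonal
d = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
m = 3

# Step 1: drop the coupling entries c[m-1] and a[m]; the two blocks of the
# modified matrix A~ are independent tridiagonal systems (solved in parallel
# in the real algorithm, sequentially in this simulation).
blk1 = lambda rhs: thomas([0.0, a[1], a[2]], b[:m], [c[0], c[1], 0.0], rhs)
blk2 = lambda rhs: thomas([0.0, a[4], a[5]], b[m:], [c[3], c[4], 0.0], rhs)
xt = blk1(d[:m]) + blk2(d[m:])             # x~ = A~^{-1} d

# Step 2: Y = A~^{-1} V with V = [c[m-1]*e_{m-1}, a[m]*e_m];
# each column of V lives entirely inside one block.
v1 = blk1([0.0, 0.0, c[m - 1]])
v2 = blk2([a[m], 0.0, 0.0])

# Step 3: reduced system Z y = h, Z = I + E^T Y with E = [e_m, e_{m-1}];
# for p = 2 it is a single 2x2 system with ones on the diagonal.
z01, z10 = v2[0], v1[m - 1]
h0, h1 = xt[m], xt[m - 1]
det = 1.0 - z01 * z10
y0, y1 = (h0 - z01 * h1) / det, (h1 - z10 * h0) / det

# Step 4: back-substitute the correction x = x~ - Y y.
x = [xt[i] - (v1[i] * y0 if i < m else v2[i - m] * y1) for i in range(6)]

# The modification is exact: x matches a direct sequential solve.
xd = thomas(a, b, c, d)
```

For general p the reduced system has one such 2x2 coupling per interface, which is where the global communication of slide 10 comes from.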

  12. Orchestration [Figure: Decomposition, Assignment, Orchestration, Mapping stages]

  13. Orchestration Orchestration is implied in the PPT algorithm • Intuitively, the reduced system should be solved on one node • A tree-reduction communication to gather the data • Solve • A reversed tree-reduction communication to scatter the results • 2 log(p) communication steps, one solve • In the PPT algorithm (step 3) • One total data exchange • All nodes solve the reduced system concurrently • log(p) communication steps, one solve
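The two orchestration choices above can be compared with a toy step count. This is my own sketch, not from the slides; it assumes ceil(log2 p) steps for a recursive-halving all-to-all, as on a hypercube (slide 17).

```python
from math import ceil, log2

def tree_reduction_steps(p):
    # Gather the reduced-system data to one node and scatter the results
    # back: two tree traversals, 2 * ceil(log2 p) message steps.
    return 2 * ceil(log2(p))

def total_exchange_steps(p):
    # Recursive-halving all-to-all: every node obtains all reduced-system
    # data in ceil(log2 p) steps, then all nodes solve it redundantly.
    return ceil(log2(p))

counts = {p: (tree_reduction_steps(p), total_exchange_steps(p)) for p in (4, 16, 64)}
```

The total exchange trades redundant computation (every node solves the small reduced system) for half the communication steps.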

  14. Tree Reduction (Data Gathering/Scattering)

  15. All-to-All Total Data Exchange

  16. Mapping [Figure: Decomposition, Assignment, Orchestration, Mapping stages]

  17. Mapping • Try to reduce the communication • Reduce time • Reduce message size • Reduce cost: distance, contention, congestion, etc. • In total data exchange • Try to make every communication a direct communication • Can be achieved in a hypercube architecture

  18. The PPT Algorithm • Advantage • Perfectly parallel • Disadvantages • Increased computation (vs. the sequential algorithm) • Global communication • Sequential bottleneck

  19. Problem Description • Parallel codes have been developed during the last decade • The performance of many codes suffers in a scalable computing environment • Need to identify and overcome the scalability bottlenecks

  20. Diagonally Dominant Systems

  21. The Reduced System of Diagonally Dominant Systems • Decay bound for inverses of band matrices [matrix figure garbled in transcript]

  22. The Reduced Communication • Generally needs global communication • Entries decay for diagonally dominant systems [matrix figure garbled in transcript]
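The decay property can be checked numerically: for a diagonally dominant tridiagonal matrix, the entries of the inverse shrink roughly geometrically away from the diagonal, which is why far-apart couplings in the reduced system can be dropped. A small pure-Python check, my own illustration:

```python
def thomas(a, b, c, d):
    """Sequential tridiagonal solve; a = sub-diagonal (a[0] unused),
    b = diagonal, c = super-diagonal (c[-1] unused), d = right-hand side."""
    n = len(b)
    cp, dp, x = [0.0] * n, [0.0] * n, [0.0] * n
    cp[0], dp[0] = c[0] / b[0], d[0] / b[0]
    for k in range(1, n):
        m = b[k] - a[k] * cp[k - 1]
        cp[k] = c[k] / m if k < n - 1 else 0.0
        dp[k] = (d[k] - a[k] * dp[k - 1]) / m
    x[-1] = dp[-1]
    for k in range(n - 2, -1, -1):
        x[k] = dp[k] - cp[k] * x[k + 1]
    return x

# Strongly diagonally dominant example: diagonal 4, off-diagonals -1.
n = 8
a = [0.0] + [-1.0] * (n - 1)
b = [4.0] * n
c = [-1.0] * (n - 1) + [0.0]

# The first column of the inverse solves A x = e_1.
col = thomas(a, b, c, [1.0] + [0.0] * (n - 1))

# Successive entries shrink by a roughly constant factor (< 1/2 here),
# so entries far from the diagonal are negligible.
ratios = [abs(col[i + 1]) / abs(col[i]) for i in range(n - 1)]
```

The stronger the dominance, the faster the decay, and the more accurate the truncated reduced system becomes.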

  23. [Matrix figure: the structure of the reduced system Z; garbled in transcript]

  24. The Parallel Diagonal Dominant (PDD) Algorithm
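In the notation of the modification formula, the PDD steps can be summarized as follows (a hedged sketch, with the block-diagonal part and coupling written as on the partition slides):

```latex
\tilde{A}\,\tilde{x} = d, \qquad
Z\,y = E^{T}\tilde{x} \quad \bigl(Z = I + E^{T}\tilde{A}^{-1}V\bigr), \qquad
x = \tilde{x} - \bigl(\tilde{A}^{-1}V\bigr)\,y .
```

PDD exploits the decay of slide 21 to drop the off-block entries of Z, so each small block of the reduced system is solved locally with only neighbor communication.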

  25. Orchestration [Figure: Decomposition, Assignment, Orchestration, Mapping stages]

  26. Computing/Communication of PDD • Non-periodic • Periodic

  27. Orchestration • Orchestration is implied in the algorithm design • Only two one-to-one neighboring communications

  28. Mapping • Communication has been reduced • Takes advantage of the special mathematical property • Formal analysis can be performed based on the mathematical partition formula • Two neighboring communications • Can be achieved on an array communication network

  29. The PDD Algorithm • Advantages • Perfectly parallel • Constant, minimum communication • Disadvantage • Increased computation (vs. the sequential algorithm) • Applicability • Diagonally dominant • Subsystems are reasonably large

  30. [Figures] Scaled speedup of the PDD algorithm on Paragon: 1024 systems of order 1600, periodic & non-periodic. Scaled speedup of the reduced PDD algorithm on SP2: 1024 systems of order 1600, periodic & non-periodic.

  31. Problem Description • For tridiagonal systems we may need new algorithms

  32. Problem Description • Tridiagonal linear systems

  33. The Pipelined Method [figure: tasks in order vs. time] • Exploit temporal parallelism of multiple systems • Pass the results from solving a subset to the next before continuing • Communication is high, 3p • Pipelining delay, p • Optimal computing
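The pipelining delay and throughput above can be illustrated with a unit-time cost model. This is my own sketch, not a formula from the slides: each of m independent systems passes through p stages, one stage per processor per step.

```python
def pipelined_steps(p, m):
    # p pipeline stages (one per processor), m independent systems:
    # the first system finishes after p steps, then one more per step.
    return p + m - 1

def serial_steps(p, m):
    # Without pipelining, the p stages of the m systems run back to back.
    return p * m

span = pipelined_steps(4, 100)   # 103 unit steps
base = serial_steps(4, 100)      # 400 unit steps
```

With many systems (m much larger than p) the startup delay p is amortized and the pipeline approaches perfect efficiency, which is why the method is mathematically efficient; its weakness is the communication volume, not the arithmetic.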

  34. The Parallel Two-Level Hybrid Method • PDD is scalable but has limited applicability • The pipelined method is mathematically efficient but not scalable • Combine these two algorithms: outer PDD, inner pipelining • Can combine with other algorithms too

  35. The Parallel Two-Level Hybrid Method • Use an accurate parallel tridiagonal solver to solve the m super-subsystems concurrently, each with k processors • Modify the PDD algorithm and consider communications only between the m super-subsystems.

  36. The Partition Pipeline Diagonal Dominant (PPD) Algorithm [matrix figure garbled in transcript]

  37. Evaluation of Algorithms

  38. Practical Motivation • NLOM (NRL Layered Ocean Model) is a widely used naval parallel ocean simulation code (see http://www7320.nrlssc.navy.mil/global_nlom/index.html). • Fine-tuned with the best algorithms available at the time • Efficiency goes down as the number of processors increases • The Poisson solver is the identified scalability bottleneck

  39. Project Objectives • Incorporate the best scalable solver, the PDD algorithm, into NLOM • Increase the scalability of NLOM • Accumulate experience toward a general toolkit solution for other naval simulation codes

  40. Experimental Testing • Fast Poisson solver (FACR) (Hockney, 1965) • One of the most successful rapid elliptic solvers; reduces to tridiagonal systems • Large number of systems; each node has a piece of each system • NLOM implementation: highly optimized pipelining • Burn At Both Ends (BABE): trades computation for communication (p, 2p)

  41. NLOM Implementation • NLOM has a special data structure and partition • Large number of systems; each node has a piece of each system • Pipelined method, highly optimized • Burn At Both Ends (BABE): pipelining from both sides, trades computation for communication (p, 2p)

  42. [Figure] Tridiagonal solver runtime: pipelining (square) and PDD (delta)

  43. [Figure] Accuracy: BABE (circle), PDD (square), PPD (diamond)

  44. NLOM Application • The pipelined method is not scalable • PDD is scalable but loses accuracy because the subsystems are very small • Need the two-level combined method

  45. [Figure] Tridiagonal solver time: pipelining (square), PDD (delta), PPD (circle)

  46. [Figure] Total runtime: pipelining (square), PDD (delta), PPD (circle)

  47. Parallel Two-Level Hybrid (PTH) Method • Use an accurate parallel tridiagonal solver to solve the m super-subsystems concurrently, each with k processors (p = mk), solving three unknowns as given in Step 2 of the PDD algorithm. • Modify the solutions of Step 1 with Steps 3-5 of the PDD algorithm, or of the PPT algorithm if PPT is chosen as the outer solver.

  48. The PTH method and related algorithms

  Abbreviation | Full Name | Explanation
  PPT | Parallel ParTition LU Algorithm | A parallel solver based on rank-one modification
  PDD | Parallel Diagonal Dominant Algorithm | A variant of PPT for diagonally dominant systems
  PTH | Parallel Two-level Hybrid Method | A novel two-level approach
  PPD | Partition Pipelined Diagonal Dominant Algorithm | A PTH that uses PDD and pipelining as the outer/inner solvers

  49. Performance Evaluation of Algorithms

  50. Algorithm Analysis: 1. LU-Pipelining 2. The PDD Algorithm 3. The PPD Algorithm
