
Applying Automated Memory Analysis to improve the iterative solver in the Parallel Ocean Program


Presentation Transcript


  1. Applying Automated Memory Analysis to improve the iterative solver in the Parallel Ocean Program
     John M. Dennis: dennis@ucar.edu
     Elizabeth R. Jessup: jessup@cs.colorado.edu
     April 5, 2006

  2. Motivation
     • Outgrowth of PhD thesis
       • Memory-efficient iterative solvers
       • Data movement is expensive
     • Developed techniques to improve memory efficiency
     • Apply Automated Memory Analysis to POP
     • Parallel Ocean Program (POP) solver
       • Large percentage of execution time
       • Scalability issues

  3. Outline: • Motivation • Background • Data movement • Serial Performance • Parallel Performance • Space-Filling Curves • Conclusions

  4. Automated Memory Analysis?
     • Analyzes an algorithm written in Matlab
     • Predicts the data movement the algorithm would incur if written in C/C++ or Fortran, i.e. the minimum required
     • Predictions allow us to:
       • Evaluate design choices
       • Guide performance tuning

  5. POP using 20x24 blocks (gx1v3)
     • POP data structure
       • Flexible block structure
       • Land-block elimination (a minimal sketch follows below)
     • Small blocks
       • Better load balancing and land-block elimination
       • Larger halo overhead
     • Larger blocks
       • Smaller halo overhead
       • Load imbalance and no land-block elimination
     • Grid resolutions:
       • test: 128x192
       • gx1v3: 320x384
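
      To make the land-block elimination concrete, here is a minimal sketch of the idea: count the ocean points in each block of the decomposition and keep only blocks that contain ocean. All names and the mask convention are illustrative assumptions, not the actual POP data structures.

        ! Sketch only: mask(i,j) > 0 marks an ocean point (assumed convention).
        ! Blocks containing no ocean points do no useful work and are dropped.
        subroutine keep_ocean_blocks(nx, ny, bx, by, mask, nkeep, keep_i, keep_j)
          implicit none
          integer, intent(in)  :: nx, ny, bx, by       ! grid size and block size
          integer, intent(in)  :: mask(nx, ny)
          integer, intent(out) :: nkeep                ! number of retained blocks
          integer, intent(out) :: keep_i(*), keep_j(*) ! block coordinates retained
          integer :: ib, jb, i, j, nocean

          nkeep = 0
          do jb = 1, ny/by
             do ib = 1, nx/bx
                nocean = 0
                do j = (jb-1)*by + 1, jb*by
                   do i = (ib-1)*bx + 1, ib*bx
                      if (mask(i,j) > 0) nocean = nocean + 1
                   end do
                end do
                if (nocean > 0) then                   ! all-land blocks eliminated
                   nkeep = nkeep + 1
                   keep_i(nkeep) = ib
                   keep_j(nkeep) = jb
                end if
             end do
          end do
        end subroutine keep_ocean_blocks

      The trade-off on the slide falls out of this directly: a smaller bx x by lets more all-land blocks be dropped and improves load balance, at the price of more halo points per owned point.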

  6. Alternate data structure
     • 2D data structure
       • Advantages: regular stride-1 access; compact form of the stencil operator
       • Disadvantages: includes land points; problem-specific data structure
     • 1D data structure
       • Advantages: no more land points; general data structure
       • Disadvantages: indirect addressing; larger stencil operator
     (Both forms are sketched below.)
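
      The two variants can be sketched as follows. This is an illustrative 5-point stencil in Fortran, not the actual POP 9-point operator, and all names are assumptions. The 2D form enjoys regular stride-1 access but sweeps over land points; the 1D form stores only ocean points but pays for indirect addressing through a neighbour index array.

        ! 2D form: regular stride-1 access, compact stencil, land points included.
        ! (Halo/boundary handling omitted.)
        subroutine stencil_2d(nx, ny, a_c, a_n, a_s, a_e, a_w, x, y)
          implicit none
          integer, intent(in)  :: nx, ny
          real(8), intent(in)  :: a_c(nx,ny), a_n(nx,ny), a_s(nx,ny), a_e(nx,ny), a_w(nx,ny)
          real(8), intent(in)  :: x(nx,ny)
          real(8), intent(out) :: y(nx,ny)
          integer :: i, j
          do j = 2, ny-1
             do i = 2, nx-1
                y(i,j) = a_c(i,j)*x(i,j) + a_n(i,j)*x(i,j+1) + a_s(i,j)*x(i,j-1) &
                       + a_e(i,j)*x(i+1,j) + a_w(i,j)*x(i-1,j)
             end do
          end do
        end subroutine stencil_2d

        ! 1D form: land points removed, neighbours found through an index array.
        subroutine stencil_1d(n, nnbr, coef, nbr, x, y)
          implicit none
          integer, intent(in)  :: n, nnbr
          real(8), intent(in)  :: coef(nnbr,n)   ! stencil coefficients per ocean point
          integer, intent(in)  :: nbr(nnbr,n)    ! indices of neighbouring ocean points
          real(8), intent(in)  :: x(n)
          real(8), intent(out) :: y(n)
          integer :: i, k
          do i = 1, n
             y(i) = 0.0d0
             do k = 1, nnbr
                y(i) = y(i) + coef(k,i)*x(nbr(k,i))   ! indirect addressing
             end do
          end do
        end subroutine stencil_1d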

  7. Outline: • Motivation • Background • Data movement • Serial Performance • Parallel Performance • Space-Filling Curves • Conclusions

  8. Data movement
     • Working-set load size (WSL): data moved from main memory into the L1 cache
     • Measured using PAPI (WSLM); see the sketch below
     • Compute platforms:
       • Sun Ultra II (400 MHz)
       • IBM POWER4 (1.3 GHz)
       • SGI R14K (500 MHz)
     • Compared with the prediction (WSLP)
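
      As a rough sketch of how WSLM can be measured, the program below uses PAPI's classic high-level Fortran counter interface to count L1 data-cache misses around a region of interest and converts them to bytes. The include file name, the availability of the PAPI_L1_DCM preset, and the 64-byte line size all depend on the installation and platform; treat every name here as an assumption rather than the instrumentation actually used in the study.

        ! Sketch: estimate working-set load (WSLM) from L1 data-cache misses.
        program measure_wsl
          implicit none
          include 'fpapi.h'                        ! PAPI Fortran presets (assumed path)
          integer :: events(1), check, i
          integer(kind=8) :: misses(1)
          integer, parameter :: nwords = 4*1000*1000
          integer, parameter :: line_size = 64     ! bytes per cache line (platform dependent)
          real(8), allocatable :: x(:)
          real(8) :: s, wsl_mbytes

          allocate(x(nwords))
          x = 1.0d0

          events(1) = PAPI_L1_DCM                  ! L1 data-cache misses
          call PAPIF_start_counters(events, 1, check)

          s = 0.0d0                                ! region of interest: one sweep over x,
          do i = 1, nwords                         ! standing in for a solver iteration
             s = s + x(i)
          end do

          call PAPIF_stop_counters(misses, 1, check)

          wsl_mbytes = real(misses(1), 8)*line_size/1.0d6
          print *, 'checksum:', s, '  approx. data moved into L1 (MB):', wsl_mbytes
        end program measure_wsl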

  9. Predicting data movement
     SLAMM-predicted working-set load (WSLP):
       solver w/2D (Matlab): 4902 Kbytes
       solver w/1D (Matlab): 3218 Kbytes
     The 1D data structure gives a 34% reduction in predicted data movement.

  10. Measured versus Predicted data movement

  11. Measured versus Predicted data movement (annotation: excessive data movement)

  12. Two blocks of source code

      PCG2+2D v1 (the w0 array is written in the loop and read again after it):
        do i=1,nblocks
           p(:,:,i) = z(:,:,i) + p(:,:,i)*beta
           q(:,:,i) = A*p(:,:,i)
           w0(:,:,i) = q(:,:,i)*p(:,:,i)
        enddo
        delta = gsum(w0,lmask)

      PCG2+2D v2 (extra access of w0 eliminated by folding the local sum into the loop):
        ldelta = 0
        do i=1,nblocks
           p(:,:,i) = z(:,:,i) + p(:,:,i)*beta
           q(:,:,i) = A*p(:,:,i)
           w0 = q(:,:,i)*p(:,:,i)
           ldelta = ldelta + lsum(w0,lmask)
        enddo
        delta = gsum(ldelta)

  13. Measured versus Predicted data movement (annotation: data movement now matches the prediction)

  14. Outline: • Motivation • Background • Data movement • Serial Performance • Parallel Performance • Space-Filling Curves • Conclusions

  15. Using 1D data structures in the POP2 solver (serial)
      • Replace solvers.F90
      • Measure execution time on cache-based microprocessors
      • Examine two CG algorithms with a diagonal preconditioner (sketched below):
        • PCG2 (2 inner products)
        • PCG1 (1 inner product) [D'Azevedo 93]
      • Grid: test (128x192 grid points with 16x16 blocks)
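
      For reference, the PCG2 shape is sketched below on a toy symmetric positive definite tridiagonal system with a diagonal (Jacobi) preconditioner. This is not the POP solver; it only shows where the two inner products fall in each iteration, i.e. the two sums that PCG1 [D'Azevedo 93] reorganises into a single one.

        ! Minimal, self-contained sketch of diagonally preconditioned CG (PCG2).
        program pcg2_sketch
          implicit none
          integer, parameter :: n = 100, maxiter = 200
          real(8) :: x(n), b(n), r(n), z(n), p(n), q(n), diag(n)
          real(8) :: delta_new, delta_old, alpha
          integer :: iter

          ! SPD model problem: A = tridiag(-1, 2, -1), purely illustrative.
          b = 1.0d0
          x = 0.0d0
          diag = 2.0d0
          r = b                               ! r = b - A*x with x = 0
          delta_old = 1.0d0

          do iter = 1, maxiter
             z = r / diag                     ! diagonal (Jacobi) preconditioner
             delta_new = sum(r*z)             ! inner product #1
             if (iter == 1) then
                p = z
             else
                p = z + (delta_new/delta_old)*p
             end if
             q = matvec(p)
             alpha = delta_new / sum(p*q)     ! inner product #2
             x = x + alpha*p
             r = r - alpha*q
             delta_old = delta_new
             if (sqrt(sum(r*r)) < 1.0d-10) exit
          end do
          print *, 'iterations:', iter, '  residual:', sqrt(sum(r*r))

        contains
          function matvec(v) result(w)        ! w = A*v for the tridiagonal model matrix
            real(8), intent(in) :: v(n)
            real(8) :: w(n)
            integer :: i
            do i = 1, n
               w(i) = 2.0d0*v(i)
               if (i > 1) w(i) = w(i) - v(i-1)
               if (i < n) w(i) = w(i) - v(i+1)
            end do
          end function matvec
        end program pcg2_sketch

      In the parallel code each of those two sums becomes a global reduction across all tasks, which is why reducing from two reductions per iteration to one matters at scale.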

  16. Serial execution time on IBM POWER4 (test): 56% reduction in cost per iteration

  17. Outline: • Motivation • Background • Data movement • Serial Performance • Parallel Performance • Space-Filling Curves • Conclusions

  18. Using the 1D data structure in the POP2 solver (parallel)
      • New parallel halo update (sketched below)
      • Examine several CG algorithms with a diagonal preconditioner:
        • PCG2 (2 inner products)
        • PCG1 (1 inner product)
      • Existing solver/preconditioner technology: Hypre (LLNL), http://www.llnl.gov/CASC/linear_solvers
        • PCG solver
        • Preconditioners: diagonal
        • Hypre integration is work in progress
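
      A halo update for the 1D data structure cannot rely on regular 2D array sections. One common shape, sketched below under assumed names and precomputed index lists, gathers boundary points into per-neighbour buffers, exchanges them with nonblocking MPI, and scatters the received values into halo slots. This is an illustration, not the actual POP2 routine.

        ! Sketch: halo update for gathered (land-free) ocean points.
        ! send_idx lists local points each neighbour needs; recv_idx lists the
        ! local halo slots its data fills. All names are illustrative.
        subroutine halo_update_1d(x, nnbrs, nbr_rank, send_cnt, send_idx, &
                                  recv_cnt, recv_idx, comm)
          use mpi
          implicit none
          real(8), intent(inout) :: x(:)               ! owned points followed by halo slots
          integer, intent(in) :: nnbrs, comm
          integer, intent(in) :: nbr_rank(nnbrs)
          integer, intent(in) :: send_cnt(nnbrs), recv_cnt(nnbrs)
          integer, intent(in) :: send_idx(:,:), recv_idx(:,:)   ! (max_cnt, nnbrs)
          real(8), allocatable :: sbuf(:,:), rbuf(:,:)
          integer :: reqs(2*nnbrs), ierr, n, k

          allocate(sbuf(size(send_idx,1), nnbrs), rbuf(size(recv_idx,1), nnbrs))

          ! Post receives, then pack and post sends, one pair per neighbour task.
          do n = 1, nnbrs
             call mpi_irecv(rbuf(:,n), recv_cnt(n), MPI_DOUBLE_PRECISION, &
                            nbr_rank(n), 101, comm, reqs(n), ierr)
          end do
          do n = 1, nnbrs
             do k = 1, send_cnt(n)
                sbuf(k,n) = x(send_idx(k,n))           ! gather boundary points
             end do
             call mpi_isend(sbuf(:,n), send_cnt(n), MPI_DOUBLE_PRECISION, &
                            nbr_rank(n), 101, comm, reqs(nnbrs+n), ierr)
          end do
          call mpi_waitall(2*nnbrs, reqs, MPI_STATUSES_IGNORE, ierr)

          ! Scatter received values into local halo locations.
          do n = 1, nnbrs
             do k = 1, recv_cnt(n)
                x(recv_idx(k,n)) = rbuf(k,n)
             end do
          end do
          deallocate(sbuf, rbuf)
        end subroutine halo_update_1d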

  19. Solver execution time for POP2 (20x24 blocks) on BG/L (gx1v3): 48% and 27% reductions in cost per iteration

  20. 64 processors != PetaScale

  21. Outline: • Motivation • Background • Data movement • Serial Performance • Parallel Performance • Space-Filling Curves • Conclusions

  22. 0.1 degree POP
      • Global eddy-resolving configuration
      • Computational grid: 3600 x 2400 x 40
      • Land creates problems: load imbalance and poor scalability
      • Alternative partitioning algorithm: space-filling curves
      • Evaluated using a benchmark: 1 simulated day, internal grid, 7-minute timestep

  23. Partitioning with space-filling curves
      • Map 2D -> 1D
      • Curves come in a variety of sizes Nb:
        • Hilbert (Nb = 2^n)
        • Peano (Nb = 3^m)
        • Cinco (Nb = 5^p) [new]
        • Hilbert-Peano (Nb = 2^n 3^m)
        • Hilbert-Peano-Cinco (Nb = 2^n 3^m 5^p) [new]
      • Partition the resulting 1D array (see the sketch below)
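
      Once land blocks are removed and the surviving blocks are ordered along the space-filling curve, partitioning reduces to cutting a 1D list into npes contiguous chunks of roughly equal work. The sketch below assumes a per-block work estimate (e.g. ocean points per block); all names are illustrative, not the POP2 routine.

        ! Sketch: cut the SFC-ordered block list into npes contiguous chunks.
        subroutine partition_sfc(nblk, work, npes, owner)
          implicit none
          integer, intent(in)  :: nblk, npes
          real(8), intent(in)  :: work(nblk)     ! e.g. ocean points per block, in SFC order
          integer, intent(out) :: owner(nblk)    ! task assigned to each block
          real(8) :: total, target, acc
          integer :: i, pe

          total  = sum(work)
          target = total / npes
          acc = 0.0d0
          pe  = 1
          do i = 1, nblk
             ! Move on to the next task once the current one has roughly its share of work.
             if (acc >= pe*target .and. pe < npes) pe = pe + 1
             owner(i) = pe - 1                   ! zero-based task ids
             acc = acc + work(i)
          end do
        end subroutine partition_sfc

      Keeping each task's blocks contiguous along the curve keeps them spatially clustered, so halo exchanges stay mostly between neighbouring tasks.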

  24. Partitioning with SFC: partition for 3 processors

  25. POP using 20x24 blocks (gx1v3)

  26. POP (gx1v3) + Space-filling curve

  27. Space-filling curve (Hilbert, Nb = 2^4)

  28. Remove Land blocks

  29. Space-filling curve partition for 8 processors

  30. POP 0.1 degree benchmark on Blue Gene/L

  31. POP 0.1 degree benchmark (courtesy of Y. Yoshida, M. Taylor, P. Worley)

  32. Conclusions
      • 1D data structures in the barotropic solver
        • No more land points
        • Reduces execution time vs the 2D data structure
          • 48% reduction in solver time (64 processors, BG/L)
          • 9.5% reduction in total time (64 processors, POWER4)
        • Allows use of solver/preconditioner packages
        • Implementation quality is critical!
      • Automated Memory Analysis (SLAMM)
        • Evaluate design choices
        • Guide performance tuning

  33. Conclusions (cont'd)
      • Good scalability to 32K processors on BG/L
        • Simulation rate increased by 2x on 32K processors
        • SFC partitioning
        • 1D data structure in the solver
        • Only 7 source files modified
      • Future work
        • Improve scalability (55% efficiency going from 1K to 32K processors)
        • Better preconditioners
        • Improve load balance
          • Different block sizes
          • Improved partitioning algorithm

  34. Acknowledgements / Questions?
      Thanks to: F. Bryan (NCAR), J. Edwards (IBM), P. Jones (LANL), K. Lindsay (NCAR), M. Taylor (SNL), H. Tufo (NCAR), W. Waite (CU), S. Weese (NCAR)
      Blue Gene/L time: NSF MRI Grant, NCAR, University of Colorado, IBM (SUR) program, BGW Consortium Days, IBM Research (Watson)

  35. Serial Execution time on Multiple platforms (test)

  36. Total execution time for POP2 (40x48 blocks) on POWER4 (gx1v3): 9.5% reduction, eliminating the need for ~216,000 CPU hours per year at NCAR

  37. POP 0.1 degree (chart annotations: increasing parallelism -->, decreasing overhead -->)
