Presentation Transcript


  1. High performance parallel computing of climate models towards the Earth Simulator --- computing science activities at CRIEPI --- Yoshikatsu Yoshida and Koki Maruyama, Central Research Institute of Electric Power Industry (CRIEPI)

  2. Outline • Atmosphere model, CCM3 • evaluation of vector/parallel computing performance • improvement of parallel computing performance • load balance, communication, MPI/multi-thread • performance prediction on GS40 • improvement of communication performance • Ocean model, POP • evaluation of vector/parallel computing performance • vector/parallel tuning • performance prediction on GS40 • Coupled model, CSM-1 • ported to SX-4 and being ported to VPP5000 • performance evaluation on SX-4

  3. Performance prediction of CCM3 on GS40 (1) • Method for performance prediction • communication • load imbalance • non-parallel (serial) sections and computation overheads • dependence of vector performance on vector length • CCM3.6.6 w/ 2D domain decomposition assumed • resolution: T341 (~40 km) • MPI/multi-thread hybrid parallelism • e.g., north-south decomposition by MPI and east-west decomposition by OpenMP (see the sketch below) • GS40 • 640 nodes x 8 vector PEs (8 Gflops/PE, 40 Tflops in total) • communication bandwidth: 16 GB/s, communication latency: 5 ms
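
The hybrid scheme mentioned on this slide can be pictured with a minimal sketch: MPI ranks own north-south latitude bands, and OpenMP threads split the east-west direction inside each band. The grid sizes and the column_physics routine below are placeholders for illustration, not taken from CCM3 (which is a Fortran code).

```c
#include <mpi.h>
#include <omp.h>

#define NLON 1024   /* illustrative number of east-west grid points   */
#define NLAT  512   /* illustrative number of north-south grid points */

static double field[NLAT][NLON];

/* stand-in for the per-column physics/dynamics work of the real model */
static void column_physics(int i, int j) { field[j][i] += 1.0; }

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* north-south decomposition by MPI: each rank owns a latitude band */
    int jbeg =  rank      * NLAT / nprocs;
    int jend = (rank + 1) * NLAT / nprocs;

    for (int j = jbeg; j < jend; ++j) {
        /* east-west decomposition by OpenMP threads within the band */
        #pragma omp parallel for
        for (int i = 0; i < NLON; ++i)
            column_physics(i, j);
    }

    MPI_Finalize();
    return 0;
}
```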

  4. Performance prediction of CCM3 on GS40 (2) • Predicted wallclock time • 100-year integration needs ~10 days • estimated execution rate is ~1.5 Tflops when using 4096 PEs • comm. startup is a principal cause of performance degradation → should be improved

  5. How to improve communication performance? (1) • “all-gather” communication in the original CCM3.6.6 • each processor (or node) sends its own data to all the other processors (or nodes), and then receives data from all the other processors (or nodes) • # of communications per PE is O(P) • Modification of “all-gather” communication • # of communications per PE is reduced from O(P) to O(log P), as sketched below
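
A standard way to obtain the O(log P) message count described here is recursive doubling; the sketch below assumes a power-of-two number of PEs and is an illustrative routine, not the modification actually applied to CCM3.6.6 (MPI_Allgather implementations often use a similar algorithm internally).

```c
#include <mpi.h>
#include <string.h>
#include <stddef.h>

/* Recursive-doubling all-gather: each of the P ranks contributes `count`
 * doubles and ends up with all P blocks in `recvbuf`; every rank sends
 * and receives log2(P) messages instead of P-1. Assumes P is a power of
 * two. */
void allgather_recursive_doubling(const double *sendbuf, double *recvbuf,
                                  int count, MPI_Comm comm)
{
    int rank, nprocs;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nprocs);

    /* place our own block at its final position in the result buffer */
    memcpy(recvbuf + (size_t)rank * count, sendbuf,
           (size_t)count * sizeof(double));

    for (int dist = 1; dist < nprocs; dist <<= 1) {
        int partner = rank ^ dist;  /* exchange partner in this step */

        /* blocks already gathered by us / by the partner are contiguous */
        size_t send_off = (size_t)(rank    & ~(dist - 1)) * count;
        size_t recv_off = (size_t)(partner & ~(dist - 1)) * count;

        MPI_Sendrecv(recvbuf + send_off, dist * count, MPI_DOUBLE, partner, 0,
                     recvbuf + recv_off, dist * count, MPI_DOUBLE, partner, 0,
                     comm, MPI_STATUS_IGNORE);
    }
}
```

For P = 4096 PEs this means 12 messages per PE instead of 4095, while the total transferred volume stays the same.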

  6. How to improve communication performance? (2) • Estimated performance of “all-gather” communication • much improved communication performance, expected [figure: a) original “all-gather”, b) modified “all-gather”]
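
The expected improvement can be rationalized with a textbook cost model; t_s (startup time), t_w (per-word transfer time) and m (block size in words) are assumed symbols, not the authors' notation:

```latex
\begin{align*}
T_{\mathrm{orig}} &\approx (P-1)\,(t_s + m\,t_w)
  &&\text{one message per remote PE: } O(P)\ \text{startups},\\
T_{\mathrm{mod}}  &\approx t_s\log_2 P + (P-1)\,m\,t_w
  &&\text{recursive doubling: } O(\log P)\ \text{startups, same volume}.
\end{align*}
```

With thousands of PEs the startup term dominates, which is why cutting it from O(P) to O(log P) shows up directly in the turnaround times on the next slide.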

  7. Predicted performance of original and modified CCM3 • CCM3 turnaround time w/ and w/o modified “all-gather” • 100-year integration can be done within a week • 2.2 Tflops expected when using 4096 PEs [figure: a) w/ original “all-gather”, b) w/ modified “all-gather”]

  8. Vector computing performance of POP (Ocean model) • Vector computing performance on SX-4 • 192 x 128 x 20 grid division • vector processor of SX-4 • peak rate: 2 Gflops • length of vector register: 256 words • relatively minor modifications for vectorization resulted in good performance (an example is sketched below) • POP is well suited even to vector platforms
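
As an illustration of the kind of "minor modification" that pays off on a vector machine, the sketch below collapses the two horizontal loops into one long, stride-1, dependence-free loop so the vector length far exceeds the 256-word registers; array names and sizes are placeholders, not actual POP code.

```c
#define NX 192   /* illustrative horizontal grid size */
#define NY 128

/* t and rhs hold one NY x NX horizontal slab stored contiguously.
 * Instead of nested loops of length 128 and 192, run a single loop of
 * length NX*NY = 24576, which the vector compiler strip-mines into
 * full-length (256-word) vector operations. */
void update_tracer(double *restrict t, const double *restrict rhs, double dt)
{
    for (int n = 0; n < NX * NY; ++n)
        t[n] += dt * rhs[n];
}
```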

  9. Parallel computing performance of modified POP (1) • # of simulated years that can be integrated within a day • 192x128x20 grid division, measured on SX-4 • 1.5-fold speedup achieved by vector/parallel tuning

  10. Parallel computing performance of modified POP (2) • Parallel efficiency at various model resolutions • measured on SX-4 • efficiency on 16 PEs reaches 80% in the 768x512x20 grid case • communication in the PCG solver is a performance bottleneck (see the sketch below)
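
Why the conjugate-gradient solver becomes communication-bound can be seen from a sketch of one unpreconditioned CG step: apart from the halo exchange hidden in the operator, every iteration needs global reductions for its dot products, and that latency cost does not shrink as PEs are added. apply_operator and local_dot are assumed helpers, and this is not the POP source (which uses a preconditioned solver).

```c
#include <mpi.h>

/* halo exchange + barotropic operator; hypothetical helper, not shown */
void apply_operator(const double *p, double *ap, int n);

static double local_dot(const double *a, const double *b, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; ++i) s += a[i] * b[i];
    return s;
}

/* One CG step on the local sub-domain of size n. */
void cg_step(double *x, double *r, const double *p, double *ap,
             int n, MPI_Comm comm)
{
    double loc[2], glob[2];

    apply_operator(p, ap, n);       /* needs a halo exchange            */

    loc[0] = local_dot(r, r, n);    /* pack both dot products into one  */
    loc[1] = local_dot(p, ap, n);   /* MPI_Allreduce to save startups   */
    MPI_Allreduce(loc, glob, 2, MPI_DOUBLE, MPI_SUM, comm);

    double alpha = glob[0] / glob[1];        /* (r.r) / (p.Ap) */
    for (int i = 0; i < n; ++i) {
        x[i] += alpha * p[i];
        r[i] -= alpha * ap[i];
    }
    /* ... the beta update needs a further global reduction ... */
}
```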

  11. Performance prediction method of POP on GS40 (1) • Execution time of the POP model consists of • computation time • communication time (startup, transfer) • no significant load imbalance, because land and bottom topography are handled by mask operations • Computation time is estimated from timing results for a decomposed sub-domain • measurements were done on a single processor of SX-4 • Communication time is estimated from the # of communications and the amount of transferred data (see the cost model below) • latency ~5 msec, bandwidth ~16 GB/s, assumed
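
One way to write the prediction model described on this slide as a formula, with symbols assumed for illustration (t_sub is the measured single-PE time for one sub-domain, N_msg the number of messages, V_data the transferred volume, B the bandwidth):

```latex
\begin{align*}
T_{\mathrm{total}} &\approx T_{\mathrm{comp}} + T_{\mathrm{comm}},\\
T_{\mathrm{comp}}  &\approx t_{\mathrm{sub}}\!\left(\tfrac{n_x}{P_x},\,\tfrac{n_y}{P_y},\,n_z\right)
  &&\text{measured on one SX-4 PE},\\
T_{\mathrm{comm}}  &\approx N_{\mathrm{msg}}\,t_{\mathrm{lat}} + \frac{V_{\mathrm{data}}}{B}
  &&t_{\mathrm{lat}} \simeq 5~\mathrm{msec},\ \ B \simeq 16~\mathrm{GB/s}.
\end{align*}
```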

  12. Performance prediction method of POP on GS40 (2) • POP performance on SX-4 • predicted results agreed very well with observations

  13. Predicted performance of modified POP on GS40 (1) • COWbench grid (large): 992x1280x40 • parallel efficiency ~16%, 0.8 Tflops, expected on 2048 PEs [figure: a) wallclock time per simulated day, b) Tflops and parallel efficiency]

  14. Predicted performance of modified POP on GS40 (2) • ~1/10 degree model (3072x2048x40) • 100-year integration can be done in 8 days when using 4096 PEs • predicted execution rate reaches 3 Tflops [figure legend: maximum, minimum]

  15. Predicted performance of modified POP on GS40 (3) • Prediction of turnaround time for the POP model (~1/10 deg) • communication startup cost (latency) is a main reason for the performance degradation, even in the case of the POP model

  16. Summary • Performance prediction of CCM3 on GS40 • ~7 days per simulated century w/ a minor modification of “all-gather” communication; 2.2 Tflops expected • Performance evaluation of POP on SX-4 • the POP code can sustain ~50% of the peak rate of SX-4’s vector processor • communication in the CG solver for the barotropic mode is the bottleneck of its parallel computing performance • Performance prediction of POP on GS40 • COWbench grid (large): ~16% efficiency, 0.8 Tflops expected • ~1/10 degree model: 100 years can be integrated in 8 days • In terms of CPU resource requirements, a coupled simulation using a ~T341 atmosphere model and a ~1/10 degree ocean model is well matched to GS40 (a simulated century takes ~10 days)
