PSI-SIM: System Performance Evaluation Environment for Next-Generation Supercomputers

PSI-SIM: System Performance Evaluation Environment for Next-Generation Supercomputers. K. Inoue, H. Shibamura, R. Susukita, Y. Inadomi, H. Honda, Y. Yu, and M. Aoyagi Kyusyu University, ISIT, IST. Background. “Peta” is tremendous! Compared with “Giga or Tera” scale machines.

Presentation Transcript


  1. PSI-SIM: System Performance Evaluation Environment for Next-Generation Supercomputers • K. Inoue, H. Shibamura, R. Susukita, Y. Inadomi, H. Honda, Y. Yu, and M. Aoyagi • Kyusyu University, ISIT, IST

  2. Background • “Peta” is tremendous! • Compared with “Giga or Tera” scale machines • [Cartoon dialogue: “How are you, Mr. Tera?” “I am fine! How about you, Mr. Peta?”]

  3. Background • “Peta” is tremendous! • Compared with “Giga or Tera” scale machines • If you would like to develop a “Peta-scale” supercomputer, you need to… • Explore the design space of both the computation nodes and the interconnection network! • Verify the effective performance that can be achieved! • So, we need a performance evaluation environment for peta-scale supercomputers!

  4. Our Goal! • Problem… • Simulations are 3 orders of magnitude slower than real machines! • “Peta-scale” is 3 orders of magnitude larger than “Tera-scale” (i.e. available machines)! • How can we bridge the gap? • Develop an efficient performance evaluation environment: PSI-SIM • Separate compute-node simulation from network simulation! • Abstract the target application program to accelerate simulation speed!

  5. Performance-Evaluation Flow of PSI-SIM • Input: parallelized application (e.g. peta-scale) • Step 1: Generate a skeleton code • BSIM-Parser, using the DB for processors (target machine), produces the skeleton code • Step 2: Execute on an existing machine • BSIM-Logger, using the interconnect architecture (target machine), produces a comm. profile (w/o latency) • Step 3: Simulate the interconnection network • NSIM, using the interconnect configuration, produces a comm. profile (w/ latency) and performance info. • Step 4: Visualize and analyze the results • ANA provides visualization and hints for optimization

  7. What is the Skeleton Code? • Computation blocks are replaced by “estimated” execution times! • Other modifications (e.g. reducing required memory size)

     Original code:

       foo( ) {
         Inst. Block A
         for (i=0; i<n; i++) {
           Inst. Block B
           if (hoge) {
             Inst. Block C
           } else {
             Inst. Block D
           }
           Inst. Block E
         }
         MPI_Comm.
         Inst. Block F
         for (j=0; j<n; j++)
           for (k=0; k<n; k++)
             Func( );
       }

     Skeleton code:

       foo( ) {
         BSIM_ADD_TIME(10ms)
         MPI_Comm.
         BSIM_ADD_TIME(1ms)
         BSIM_ADD_TIME(15s)
       }
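The idea on this slide can be illustrated with a minimal Python sketch. It is not BSIM code: `VirtualClock`, `add_time`, and `skeleton_foo` are hypothetical names, and the mapping of the three estimated times to specific instruction blocks is an assumption read off the slide's ordering.

```python
class VirtualClock:
    """Accumulates estimated execution time instead of running real work."""

    def __init__(self):
        self.now = 0.0  # virtual time in seconds

    def add_time(self, seconds):
        # Hypothetical stand-in for BSIM_ADD_TIME: advance the timer, execute nothing.
        self.now += seconds


def skeleton_foo(clock, log):
    # Assumed mapping: 10 ms covers Block A and the whole i-loop.
    clock.add_time(0.010)
    # The communication call is kept and stamped with the virtual time.
    log.append(("MPI_Comm", clock.now))
    # Assumed mapping: 1 ms covers Block F.
    clock.add_time(0.001)
    # Assumed mapping: 15 s covers the j/k loop nest that calls Func().
    clock.add_time(15.0)


clock = VirtualClock()
log = []
skeleton_foo(clock, log)
print(f"virtual time: {clock.now:.3f} s, events: {log}")
```

The skeleton finishes almost instantly on the host even though it accounts for about 15 s of target-machine time, which is the source of the speed-up.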

  9. Generating Communication Profile • BSIM-Logger • Executes the skeleton code on an existing machine • Emulates the behavior of the target machine • Generates a communication profile under the assumption of a ZERO-latency ideal network • Why Fast? • Abstracted computation blocks are NOT executed (only the virtual timers are updated) • Real communications are masked, but accurate logs are still generated
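A rough sketch of zero-latency logging follows, assuming a simplified operation list in place of a real skeleton run. The function name `log_zero_latency`, the operation tuples, and the profile fields are all illustrative assumptions, not the BSIM-Logger format.

```python
def log_zero_latency(num_ranks, ops):
    """Replay skeleton operations and return (clocks, profile).

    ops holds ("compute", rank, seconds) or ("send", src, dst, nbytes) tuples.
    """
    clocks = [0.0] * num_ranks  # one virtual timer per rank
    profile = []
    for op in ops:
        if op[0] == "compute":
            _, rank, seconds = op
            clocks[rank] += seconds  # computation is not executed, only timed
        elif op[0] == "send":
            _, src, dst, nbytes = op
            t_send = clocks[src]
            # Ideal network: the message arrives the moment it is sent,
            # so the receiver waits only if it reached the receive first.
            clocks[dst] = max(clocks[dst], t_send)
            profile.append({"src": src, "dst": dst, "bytes": nbytes,
                            "t_send": t_send, "t_recv": clocks[dst]})
        else:
            raise ValueError(f"unknown operation: {op[0]}")
    return clocks, profile


ops = [
    ("compute", 0, 2.0),
    ("compute", 1, 0.5),
    ("send", 0, 1, 1024),
    ("compute", 1, 1.0),
]
clocks, profile = log_zero_latency(2, ops)
print(clocks, profile)
```

Because no payload is actually transferred and no computation runs, the cost of producing the log depends on the number of events rather than on the work they represent.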

  10. How Fast? How Accurate? • [Charts: ERI (Electron Repulsion Integral) and NAS Parallel FT, each comparing original and skeleton code; axes: time for logging (s) and predicted execution time (s)]

  12. Fast, Flexible Interconnection Network Simulator • NSIM • Takes the communication profile and a network configuration file as input • Generates a communication profile with estimated interconnect latency • Why Fast? Why Flexible? • Parallelized implementation • Supports a number of parameters • Topology, specification of routers/switches, buffer size, and so on
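To make the input/output relationship concrete, here is a deliberately simplified sketch that annotates each logged message with an estimated latency. The ring topology, the hop-count-plus-bandwidth model, and every configuration key are assumptions chosen for illustration; NSIM's actual model and file formats are not described on the slide. The sketch also does not propagate delays into later events, which a real network simulator would have to do.

```python
def ring_hops(src, dst, num_nodes):
    """Shortest hop count between two nodes on a bidirectional ring."""
    d = abs(src - dst)
    return min(d, num_nodes - d)


def message_latency(src, dst, nbytes, cfg):
    """Assumed model: per-hop router delay plus serialization time."""
    hops = ring_hops(src, dst, cfg["num_nodes"])
    return hops * cfg["hop_latency_s"] + nbytes / cfg["bandwidth_bytes_per_s"]


def add_latency(profile, cfg):
    """Return a copy of the zero-latency profile with latency fields added."""
    out = []
    for ev in profile:
        lat = message_latency(ev["src"], ev["dst"], ev["bytes"], cfg)
        out.append({**ev, "latency": lat, "t_recv": ev["t_send"] + lat})
    return out


# Hypothetical network configuration and a one-message profile.
cfg = {"num_nodes": 8, "hop_latency_s": 1e-6, "bandwidth_bytes_per_s": 1e9}
profile = [{"src": 0, "dst": 6, "bytes": 1000, "t_send": 2.0}]
timed = add_latency(profile, cfg)
print(timed)
```

Swapping `ring_hops` for another distance function is one way such a simulator could expose topology as a parameter.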

  13. Performance of BSIM + NSIM • Performance prediction for HPL execution on a 16-node PC cluster • <120 s (problem size = 5,000) @ 8 CPUs • About 9,000 MPI-Comm./s @ 8 CPUs • Not skeleton execution • [Chart: measured vs. predicted execution time (s); error = 5.3%]
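The slide reports the error without giving its definition. One common reading is the signed relative error of the predicted time against the measured time, sketched below. Both the formula and the sample numbers are assumptions; the measured and predicted times behind the 5.3% figure are not given in the transcript.

```python
def relative_error_percent(predicted_s, measured_s):
    """Signed relative error of a prediction, in percent of the measured value."""
    if measured_s <= 0:
        raise ValueError("measured time must be positive")
    return 100.0 * (predicted_s - measured_s) / measured_s


def accuracy_percent(predicted_s, measured_s):
    """Assumed definition: 100% minus the absolute relative error."""
    return 100.0 - abs(relative_error_percent(predicted_s, measured_s))


# Hypothetical values chosen only to show the arithmetic.
err = relative_error_percent(105.3, 100.0)
acc = accuracy_percent(105.3, 100.0)
print(f"error = {err:.1f}%, accuracy = {acc:.1f}%")
```

Under this reading, a positive error means the prediction overestimates the execution time and a negative one means it underestimates.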

  15. ANA: GroupWork Viewer • Performance Indicator • Execution time after load-balance optimization • Group Work • Indicates load balance • Communication Indicator • Amount of communications per second
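The three indicators could be derived from a profile roughly as follows. This is only a sketch: the slide does not define the metrics, so the max-over-mean imbalance ratio, the use of the mean busy time as the perfectly balanced execution time, and all input values are assumptions.

```python
def indicators(busy_time_per_rank_s, num_messages, elapsed_s):
    """Compute illustrative load-balance and communication indicators."""
    if not busy_time_per_rank_s or elapsed_s <= 0:
        raise ValueError("need at least one rank and a positive elapsed time")
    mean_busy = sum(busy_time_per_rank_s) / len(busy_time_per_rank_s)
    max_busy = max(busy_time_per_rank_s)
    return {
        # 1.0 means perfectly balanced; larger values mean more imbalance.
        "load_imbalance": max_busy / mean_busy,
        # Assumed estimate of execution time if work were spread evenly.
        "ideal_time_s": mean_busy,
        # Communication events per second of elapsed time.
        "comm_per_s": num_messages / elapsed_s,
    }


# Hypothetical per-rank busy times and message count.
stats = indicators([4.0, 2.0, 3.0, 3.0], num_messages=1200, elapsed_s=4.0)
print(stats)
```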

  16. Conclusions • PSI-SIM • Performance evaluation environment for supercomputers • BSIM + NSIM + ANA • Ongoing work: performance prediction for • a “Tera-scale” machine (1K CPU cores) by using a “Giga-scale” machine (e.g. 32 CPU cores) • a “Peta-scale” machine (4K PSI-SIMD CPUs) by using a “Giga-scale” machine

  17. Backup Slides

  18. Peta-scale Performance Prediction • Assumptions • HPL problem size: 3 million • # of nodes: 4K (PSI-SIMD) • BSIM: uses 32 CPUs (3 GHz Xeon) • NSIM: 10,000 MPI-Comm./s @ 8 CPUs • How long do we need to spend? • BSIM: about 300 h (<2 weeks) • NSIM: about ?? • Still being estimated…

  19. Predicted Execution Time (FT) • Target machine?: rscc • Used machine?: rscc • Error: -11.3% • Error: -11.6%

  20. Communication Profiling Time (FT) • Target machine?: rscc • Used machine?: rscc • 19% reduction • 86% reduction

  21. Predicted Execution Time (ERI) • Target machine?: rscc • Used machine?: rscc • Error: -0.6% • Error: 1.5% • Error: -0.2%

  22. Communication Profile Generation Time (ERI) • Target machine?: rscc • Used machine?: rscc • 97% reduction • 96% reduction • 91% reduction

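  23. Prediction Performance for Execution Time • Communication latency • Prediction accuracy: 94.7% • Larger scale of the evaluated application ⇒ higher prediction accuracy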

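  24. Simulation Time (problem size fixed at 2000) • 1,024 processes • 256 processes • 16 processes • Portion due to recent results (speed-up) • More processes in the evaluated application ⇒ higher parallel processing efficiency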

  25. Performance of NSIM • 7.92, 8.36, 8.04 • Accuracy: 94.7% • Target machine?: PSI-hexa • Used machine?: PSI-hexa • 114 s
