PSI-SIM: System Performance Evaluation Environment for Next-Generation Supercomputers

K. Inoue, H. Shibamura, R. Susukita, Y. Inadomi, H. Honda, Y. Yu, and M. Aoyagi
Kyusyu University, ISIT, IST

Background
• “Peta” is tremendous!
  • Compared with “Giga” or “Tera” scale machines
  • [Cartoon: “How are you, Mr. Tera?” “I am fine! How about you, Mr. Peta?”]
• To develop a “Peta-scale” supercomputer, we must…
  • Explore the design space of both the computation nodes and the interconnection network!
  • Verify the effective performance that will be achieved!
• So, we need a performance evaluation environment for peta-scale supercomputers!

Our Goal!
• Problem…
  • Simulations are three orders of magnitude slower than real machines!
  • “Peta-scale” is three orders of magnitude larger than “Tera-scale” (i.e. the machines available today)!
• How can we bridge the gap?
• Develop an efficient performance evaluation environment: PSI-SIM
  • Separate compute-node simulation from network simulation!
  • Abstract the target application program to accelerate simulation!

Performance-Evaluation Flow of PSI-SIM
Input: parallelized application (e.g. peta-scale)
• Step 1: Generate a skeleton code
  • BSIM-Parser, with a DB for processors (target machine)
  • Output: skeleton code
• Step 2: Execute on an existing machine
  • BSIM-Logger, with the interconnect architecture (target machine)
  • Output: comm. profile (w/o latency)
• Step 3: Simulate the interconnection network
  • NSIM, with the interconnect configuration
  • Output: comm. profile (w/ latency), performance info.
• Step 4: Visualize and analyze the results
  • ANA
  • Output: visualization, hints for optimization

What is the Skeleton Code?

Original code:

    foo( ) {
      Inst. Block A
      for (i=0; i<n; i++) {
        Inst. Block B
        if (hoge) {
          Inst. Block C
        } else {
          Inst. Block D
        }
        Inst. Block E
      }
      MPI_Comm.
      Inst. Block F
      for (j=0; j<n; j++)
        for (k=0; k<n; k++)
          Func( );
    }

Skeleton code:

    foo( ) {
      BSIM_ADD_TIME(10ms)
      MPI_Comm.
      BSIM_ADD_TIME(1ms)
      BSIM_ADD_TIME(15s)
    }

• Computation blocks are replaced by “estimated” execution times!
• Other modifications are also applied (e.g. reducing the required memory size)
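The replacement of computation blocks by time estimates can be sketched as follows. This is only an illustration in Python, not the BSIM implementation: the `VirtualClock` class and its methods are hypothetical stand-ins for whatever `BSIM_ADD_TIME` does internally, and the mapping of each estimate to the original blocks is an assumption read off the slide.

```python
class VirtualClock:
    """Hypothetical stand-in for the timer that BSIM_ADD_TIME updates."""

    def __init__(self):
        self.now = 0.0      # virtual seconds elapsed on this process
        self.events = []    # (virtual time, label) of each communication

    def add_time(self, seconds):
        # Advance virtual time instead of executing the computation block.
        self.now += seconds

    def comm(self, label):
        # Record where the communication happens; the call itself is kept.
        self.events.append((self.now, label))


def foo_skeleton(clock):
    # Assumed mapping to the original code on the slide:
    clock.add_time(10e-3)    # Inst. Block A plus the i-loop (B, C/D, E)
    clock.comm("MPI_Comm.")  # communication stays in the skeleton
    clock.add_time(1e-3)     # Inst. Block F
    clock.add_time(15.0)     # nested j/k loops calling Func()


clock = VirtualClock()
foo_skeleton(clock)
print(f"virtual time: {clock.now:.3f} s, comm events: {clock.events}")
```

The skeleton finishes almost instantly on the host, while the virtual clock still reports the estimated time of the target machine.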

Generating Communication Profile
• BSIM-Logger
  • Executes the skeleton code on an existing machine
  • Emulates the behavior of the target machine
  • Generates a communication profile under the assumption of a ZERO-latency ideal network
• Why fast?
  • Abstracted computation blocks are NOT executed (only virtual timers are updated)
  • Real communications are masked, but accurate logs are still generated
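A zero-latency communication profile of this kind might be produced as in the sketch below. It is a toy model under stated assumptions, not BSIM-Logger itself: the per-rank "scripts", the eager (non-blocking) sends, the blocking receives, and the record layout are all hypothetical. With zero latency, a receive completes at the later of the time it was posted and the time the matching send was issued.

```python
from collections import defaultdict, deque


def log_zero_latency(scripts):
    """scripts maps rank -> list of ("compute", s), ("send", dst, nbytes), ("recv", src)."""
    clock = {rank: 0.0 for rank in scripts}   # virtual timer per rank
    pc = {rank: 0 for rank in scripts}        # next operation per rank
    in_flight = defaultdict(deque)            # (src, dst) -> (sent_at, nbytes)
    profile = []
    progressed = True
    while progressed:
        progressed = False
        for rank, ops in scripts.items():
            while pc[rank] < len(ops):
                op = ops[pc[rank]]
                if op[0] == "compute":
                    clock[rank] += op[1]      # only the virtual timer moves
                elif op[0] == "send":
                    _, dst, nbytes = op
                    in_flight[(rank, dst)].append((clock[rank], nbytes))
                else:                         # "recv"
                    _, src = op
                    queue = in_flight[(src, rank)]
                    if not queue:
                        break                 # matching send not issued yet
                    sent_at, nbytes = queue.popleft()
                    clock[rank] = max(clock[rank], sent_at)   # zero latency
                    profile.append({"src": src, "dst": rank, "bytes": nbytes,
                                    "sent_at": sent_at, "recv_at": clock[rank]})
                pc[rank] += 1
                progressed = True
    if any(pc[rank] < len(ops) for rank, ops in scripts.items()):
        raise RuntimeError("a receive never found its matching send")
    return profile, clock


scripts = {
    0: [("compute", 2.0), ("send", 1, 1024), ("recv", 1)],
    1: [("compute", 0.5), ("recv", 0), ("compute", 1.0), ("send", 0, 8)],
}
profile, clocks = log_zero_latency(scripts)
```

Here rank 1 posts its receive at 0.5 s but waits until rank 0 sends at 2.0 s; no payload is moved, only the timestamps are logged.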

How Fast? How Accurate?
[Figures: ERI (Electron Repulsion Integral) and NAS Parallel FT. For each benchmark, the original code and the skeleton code are compared on time for logging (s) and predicted execution time (s).]

Fast, Flexible Interconnection Network Simulator
• NSIM
  • Takes the communication profile and a network configuration file as input
  • Generates a communication profile with estimated interconnect latency
• Why fast? Why flexible?
  • Parallelized implementation
  • Supports many parameters: topology, router/switch specifications, buffer size, and so on
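The input/output relationship of this step can be illustrated with a deliberately simple latency model. This is not how NSIM works internally: the ring topology, the configuration keys (`num_nodes`, `hop_latency_s`, `bandwidth_bytes_per_s`), and the record layout are assumptions made for the example, and contention, buffering, and the knock-on delay of later events are ignored.

```python
def ring_hops(src, dst, num_nodes):
    # Shortest distance between two nodes on a bidirectional ring.
    distance = abs(src - dst) % num_nodes
    return min(distance, num_nodes - distance)


def add_latency(profile, config):
    """Annotate each zero-latency record with an estimated latency."""
    annotated = []
    for msg in profile:
        hops = ring_hops(msg["src"], msg["dst"], config["num_nodes"])
        latency = (hops * config["hop_latency_s"]
                   + msg["bytes"] / config["bandwidth_bytes_per_s"])
        annotated.append({**msg, "hops": hops, "latency_s": latency,
                          "arrive_at": msg["sent_at"] + latency})
    return annotated


# Hypothetical configuration and a one-message profile without latency.
config = {"num_nodes": 8, "hop_latency_s": 1e-6, "bandwidth_bytes_per_s": 1e9}
zero_latency_profile = [{"src": 0, "dst": 6, "bytes": 1000, "sent_at": 2.0}]

with_latency = add_latency(zero_latency_profile, config)
```

Swapping `ring_hops` for another distance function is the kind of change a topology parameter would control in a configurable simulator.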

Performance of BSIM + NSIM
• Performance prediction for HPL execution on a 16-node PC cluster
  • < 120 s (problem size = 5,000) @ 8 CPUs
  • About 9,000 MPI-Comm./s @ 8 CPUs
[Figure: measured vs. predicted execution time (s); error = 5.3%; not skeleton execution]
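An error figure like the one above is typically a relative difference between predicted and measured time. The sketch below assumes that definition, which the slides do not spell out; the function names and the input times are hypothetical and are not the measurements behind the reported 5.3%.

```python
def signed_error_percent(measured_s, predicted_s):
    # Positive when the prediction overestimates the measured time.
    return (predicted_s - measured_s) / measured_s * 100.0


def accuracy_percent(measured_s, predicted_s):
    # Assumed convention: accuracy = 100% minus the absolute error.
    return 100.0 - abs(signed_error_percent(measured_s, predicted_s))


# Hypothetical times chosen only to illustrate the arithmetic.
measured_s, predicted_s = 100.0, 105.3
error = signed_error_percent(measured_s, predicted_s)
accuracy = accuracy_percent(measured_s, predicted_s)
print(f"error = {error:+.1f}%, accuracy = {accuracy:.1f}%")
```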

ANA GroupWork Viewer
• Performance Indicator
  • Execution time after load-balance optimization
• Group Work
  • Indicates load balance
• Communication Indicator
  • Amount of communications per second
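Indicators of this kind could be derived from per-process timings roughly as sketched below. The formulas are common conventions and an assumption here, not ANA's documented definitions: imbalance is taken as the busiest process over the mean, and the time after ideal load balancing as the mean busy time. All input numbers are hypothetical.

```python
def load_balance_indicators(busy_s):
    """busy_s: computation time of each process, in seconds."""
    mean = sum(busy_s) / len(busy_s)
    worst = max(busy_s)
    return {
        "imbalance": worst / mean,      # 1.0 means perfectly balanced
        "current_time_s": worst,        # the slowest process sets the pace
        "balanced_time_s": mean,        # if work were spread evenly
    }


def comms_per_second(num_comms, elapsed_s):
    return num_comms / elapsed_s


# Hypothetical per-process busy times and communication count.
indicators = load_balance_indicators([8.0, 10.0, 12.0, 10.0])
rate = comms_per_second(450, 0.5)
```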

Conclusions
• PSI-SIM
  • Performance evaluation environment for supercomputers
  • BSIM + NSIM + ANA
• Ongoing work: performance prediction for
  • a “Tera-scale” machine (1K CPU cores) by using a “Giga-scale” machine (e.g. 32 CPU cores)
  • a “Peta-scale” machine (4K PSI-SIMD CPUs) by using a “Giga-scale” machine

Backup Slides

Peta-scale Performance Prediction
• Assumptions
  • HPL problem size: 3 million
  • Number of nodes: 4K (PSI-SIMD)
  • BSIM: uses 32 CPUs (3 GHz Xeon)
  • NSIM: 10,000 MPI-Comm./s @ 8 CPUs
• How long will it take?
  • BSIM: about 300 h (< 2 weeks)
  • NSIM: about ?? (under estimation…)

Predicted execution time (FT) Target machine?: rscc Used machine?: rscc Error: -11.3% Error: -11.6%

Communication profile time (FT) Target machine?: rscc Used machine?: rscc 19% reduction 86% reduction

Predicted execution time (ERI) Target machine?: rscc Used machine?: rscc Error: -0.6% Error: 1.5% Error: -0.2%

Communication profile generation time (ERI) Target machine?: rscc Used machine?: rscc 97% reduction 96% reduction 91% reduction

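Prediction performance for execution time
• Communication latency prediction accuracy: 94.7%
• Larger evaluated applications ⇒ higher prediction accuracy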

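Simulation time (problem size fixed at 2000)
[Figure: simulation time in minutes for 16, 256, and 1,024 processes; recent result (speedup)]
• More processes in the evaluated application ⇒ higher parallel processing efficiency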

Performance of NSIM 7.92, 8.36, 8.04 Accuracy: 94.7% Target machine?: PSI-hexa Used machine?: PSI-hexa 114 s
