
LFTI: A Performance Metric for Assessing Interconnect topology and routing design


ralph

Presentation Transcript


  1. LFTI: A Performance Metric for Assessing Interconnect topology and routing design • Background • Innovation in interconnect topology and routing design is essential for future-generation ultra-scale supercomputers. • Current methods for evaluating topology and routing design are not ideal.

  2. Current methods for evaluating interconnect topology and routing design • Topology and routing are evaluated separately • Topology • Diameter, bisection bandwidth, nodal degree, etc. • Not directly related to application-level performance • Routing with topology • Simulation to obtain throughput and packet latency • Limited to small network sizes and a small number of scenarios • Simulation sees the trees, but not the forest. • Two kinds of metrics: simple metrics that do not directly relate to performance, and detailed metrics that are too expensive to obtain.

  3. Impact of evaluation methods • Evaluation methods set the design optimization objective • Recent proposals (dragonfly, jellyfish) all have large bisection bandwidth and support certain traffic patterns effectively. • Think of how the designs are justified! • They are excellent designs by traditional metrics. • Are these designs good for typical HPC workloads? • There is no metric that can be used to compare different topology and routing designs for HPC workloads.

  4. What kind of metrics are we looking for? • Desirable properties: • Reflect overall network performance • Simple enough that it can be computed quickly – we do not want to do simulation. • A related attempt -- effective bisection bandwidth: summarize network performance by the average performance for all bisection communication patterns. • Is this metric reflective?

  5. LFTI: LANL-FSU throughput indices • A metric for throughput performance • High-level ideas • Use modeling to obtain the average throughput for one communication pattern. • Find the set of representative communication patterns to be used in the metric. • Summarize the overall network performance using the average throughput for a large number of communication patterns common to HPC applications.

  6. LFTI: LANL-FSU throughput indices • High-level ideas • Once the patterns to be included are determined, LFTI can be derived from most topology and routing specifications without detailed simulation. • If an interconnect can achieve high overall performance for many common HPC patterns, it is likely to provide high performance for HPC workloads. • Unlike some other metrics, LFTI is much harder to game.

  7. LFTI: LANL-FSU throughput index • LFTI is the summary of the throughput of an interconnect for a large number of common communication patterns in HPC applications. • For each communication pattern, a metric (sustained throughput) is used that is closely related to the application level performance for that pattern to quantify the performance of the interconnect. • For a class of patterns (e.g. 2DNN patterns), the expected sustained throughput is used to quantify the performance. • LFTI is the aggregate of the performance of many classes of patterns.

  8. Computing the sustained throughput for a pattern (single-path routing) • Compute the link load (the number of flows going through each link). • The sustained throughput for each flow is its share of the bandwidth on its bottleneck link, or is determined by max-min fairness. • The sustained throughput for the pattern is the aggregate throughput of all flows in the pattern. • Normalize by dividing the per-flow throughput by the input link bandwidth.
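
The steps on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names (`sustained_throughput`, `ring_route`), the unidirectional ring topology, and the uniform link bandwidth are all hypothetical. It uses the simpler "equal share of the most loaded link on the path" rule, not full max-min fairness.

```python
from collections import Counter


def sustained_throughput(flows, route, link_bw=1.0):
    """Model one communication pattern under single-path routing.

    flows: list of (src, dst) pairs with src != dst.
    route: function (src, dst) -> list of links on the single path.
    Returns (aggregate throughput, per-flow throughput normalized by link_bw).
    """
    paths = [route(src, dst) for src, dst in flows]
    # Link load: number of flows going through each link.
    load = Counter(link for path in paths for link in path)
    # Each flow gets an equal share of its most loaded (bottleneck) link.
    per_flow = [link_bw / max(load[link] for link in path) for path in paths]
    normalized = [t / link_bw for t in per_flow]
    return sum(per_flow), normalized


def ring_route(n):
    """Hypothetical routing: clockwise path on an n-node unidirectional ring."""
    def route(src, dst):
        hops = (dst - src) % n
        return [((src + i) % n, (src + i + 1) % n) for i in range(hops)]
    return route


# Flows 0->2 and 1->2 share link (1, 2); flow 3->0 has link (3, 0) to itself.
aggregate, normalized = sustained_throughput([(0, 2), (1, 2), (3, 0)], ring_route(4))
print(aggregate, normalized)  # 2.0 [0.5, 0.5, 1.0]
```

Any topology and single-path routing scheme can be plugged in through the `route` argument, as long as it returns the list of links a flow traverses.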

  9. Computing the throughput index for a class of patterns • The throughput index for a class of patterns (e.g. 2DNN patterns) is the expected sustained throughput across all patterns of that class. • The index can be obtained by randomly sampling a large number of patterns (e.g. 10000 patterns). • A statistical method may be applied to obtain the index with a given confidence without sampling a large number of patterns.
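
As a rough sketch of the sampling idea on this slide, the index for a class can be estimated as a sample mean with a normal-approximation confidence interval. Everything here is an assumption made for illustration: the helper names (`estimate_index`, `ring_throughput`, `random_permutation`), the toy 8-node unidirectional ring, the use of random permutations as the pattern class, and the 95% interval (z = 1.96). The slides do not say which statistical method the authors use.

```python
import math
import random
import statistics
from collections import Counter


def estimate_index(sample_pattern, throughput_of, n_samples=10000, seed=0, z=1.96):
    """Estimate the expected throughput of a pattern class by random sampling.

    Returns (sample mean, half-width of the approximate confidence interval).
    """
    rng = random.Random(seed)
    values = [throughput_of(sample_pattern(rng)) for _ in range(n_samples)]
    mean = statistics.fmean(values)
    half_width = z * statistics.stdev(values) / math.sqrt(n_samples)
    return mean, half_width


def ring_throughput(perm):
    """Toy model: mean normalized per-flow throughput on a unidirectional ring."""
    n = len(perm)
    flows = [(s, d) for s, d in enumerate(perm) if s != d]
    if not flows:
        return 1.0  # nothing to send, so no contention
    paths = [[((s + i) % n, (s + i + 1) % n) for i in range((d - s) % n)]
             for s, d in flows]
    load = Counter(link for path in paths for link in path)
    return statistics.fmean(1.0 / max(load[l] for l in path) for path in paths)


def random_permutation(rng, n=8):
    """Hypothetical pattern class: a uniformly random permutation of n nodes."""
    return rng.sample(range(n), n)


mean, half_width = estimate_index(random_permutation, ring_throughput, n_samples=1000)
print(f"index ~ {mean:.3f} +/- {half_width:.3f}")
```

If the half-width is already small enough after a few hundred samples, sampling can stop early, which is one way to avoid drawing a very large number of patterns.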

  10. Communication Patterns in LFTI indices • Patterns with history • All-to-all • Bisect – effective bisection bandwidth • Low-dimensional stencil patterns • 2DNN, 2DNN_DIAG, 3DNN, 3DNN_DIAG • Random patterns – for applications with unstructured meshes or adaptive mesh refinement methods • RANDOM 50, RANDOM N50 • Commonly used sub-communication patterns • Permutation, shift
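
To make a few of these pattern classes concrete, the sketch below generates flow lists for shift, permutation, and 2D stencil patterns. The slides do not give exact definitions, so the details are assumptions: 2DNN is read here as 2D nearest-neighbor communication on a wraparound process grid, the `diag` flag adds the four diagonal neighbors as a guess at 2DNN_DIAG, and all function names are hypothetical.

```python
import random


def shift_pattern(n, k):
    """Shift: node i sends to node (i + k) mod n."""
    return [(i, (i + k) % n) for i in range(n)]


def permutation_pattern(n, rng):
    """Permutation: each node sends to one distinct, randomly chosen node."""
    dst = list(range(n))
    rng.shuffle(dst)
    return [(s, d) for s, d in enumerate(dst) if s != d]


def nn2d_pattern(px, py, diag=False):
    """2D nearest-neighbor stencil on a px-by-py grid with wraparound (assumed)."""
    offsets = [(1, 0), (-1, 0), (0, 1), (0, -1)]
    if diag:
        offsets += [(1, 1), (1, -1), (-1, 1), (-1, -1)]
    flows = []
    for x in range(px):
        for y in range(py):
            for dx, dy in offsets:
                nx, ny = (x + dx) % px, (y + dy) % py
                # Ranks are numbered row by row: rank = x * py + y.
                flows.append((x * py + y, nx * py + ny))
    return flows


print(shift_pattern(4, 1))               # [(0, 1), (1, 2), (2, 3), (3, 0)]
print(len(nn2d_pattern(4, 4)))           # 64
print(len(nn2d_pattern(4, 4, diag=True)))  # 128
```

Each generator returns a list of (source, destination) pairs, which is the form a throughput model for a given topology and routing scheme would take as input.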

  11. LFTI categories • Trying to reflect how the machine is used • Whole system direct map LFTI • Whole system random map LFTI • Job allocation trace-based LFTI • Largest job based on some job traces

  12. Evaluating interconnects using LFTI • Fat-tree (ftree), dragonfly (dfly), hypercube (hcube), 6D torus (6D), 3D torus (3D), and jellyfish (jfish) topologies of 25K-35K nodes, the size of a next-generation supercomputer.

  13. Throughput index and communication time

  14. Whole system direct map LFTI

  15. Whole system direct map LFTI

  16. Whole system random map LFTI

  17. Whole system random map LFTI

  18. Job allocation based

  19. Job allocation based

  20. LFTI summary

  21. Conclusion • Traditional performance metrics such as bisection bandwidth (BB) and effective bisection bandwidth (EBB) are not indicative of an interconnect’s performance. • Optimizing for BB and EBB may not lead to high-performance interconnects. • LFTI is indicative of application-level performance, yet can be derived rapidly without detailed simulation. • It is a much better metric than the current metrics.

  22. LFTI weakness • Communication patterns and weights • Heavily concentrated on simulation-type applications • Not much for data-intensive applications • Calls for performance characterization work • To find the truly “representative” workload to be included in the index.

  23. LFTI weakness • LFTI relies on fast modeling of throughput performance for each communication pattern. • Depending on the routing algorithm, the modeling can be problematic. • Indirect adaptive routing is an example: there is no effective modeling method other than simulation. • New models need to be developed for all existing and future routing schemes, and for whatever else can affect the “sustained throughput”.
