Locality-aware Connection Management and Rank Assignment for Wide-area MPI


Presentation Transcript


  1. Locality-aware Connection Management and Rank Assignment for Wide-area MPI Hideo Saito Kenjiro Taura The University of Tokyo May 16, 2007

  2. Background • Increase in the bandwidth of WANs ➭ More opportunities to perform parallel computation using multiple clusters

  3. Requirements for Wide-area MPI • Wide-area connectivity • Firewalls and private addresses • Only some nodes can connect to each other • Perform routing using the connections that happen to be possible (Figure: NAT, firewall)

  4. Reqs. for Wide-area MPI (2) • Scalability • The number of conns. must be limited in order to scale to thousands of nodes • Various allocation limits of the system (e.g., memory, file descriptors, router sessions) • Simplistic schemes that may potentially result in O(n²) connections won’t scale • Lazy connect strategies work for many apps, but not for those that involve all-to-all communication

  5. Reqs. for Wide-area MPI (3) • Locality awareness • To achieve high performance with few conns., select conns. in a locality-aware manner • Many connections with nearby nodes, few connections with faraway nodes (Figure: many conns. within a cluster, few conns. between clusters)

  6. Reqs. for Wide-area MPI (4) • Application awareness • Select connections according to the application’s communication pattern • Assign ranks* according to the application’s communication pattern • Adaptivity • Automatically, without tedious manual configuration * rank = process ID in MPI

  7. Contributions of Our Work • Locality-aware connection management • Uses latency and traffic information obtained from a short profiling run • Locality-aware rank assignment • Uses the same info. to discover rank-process mappings with low comm. overhead ➭ Multi-Cluster MPI (MC-MPI) • Wide-area-enabled MPI library

  8. Outline • Introduction • Related Work • Proposed Method • Profiling Run • Connection Management • Rank Assignment • Experimental Results • Conclusion

  9. Grid-enabled MPI Libraries • MPICH-G2 [Karonis et al. ‘03], MagPIe [Kielmann et al. ‘99] • Locality-aware communication optimizations • E.g., wide-area-aware collective operations (broadcast, reduction, ...) • Don’t work with firewalls

  10. Grid-enabled MPI Libraries (cont’d) • MPICH/MADIII [Aumage et al. ‘03], StaMPI [Imamura et al. ‘00] • Forwarding mechanisms that allow nodes to communicate even in the presence of FWs • Manual configuration • Amount of necessary config. becomes overwhelming as more resources are used (Figure: forwarding through a firewall)

  11. P2P Overlays • Pastry [Rowstron et al. ’00] • Each node maintains just O(log n) connections • Messages are routed using those connections • Highly scalable, but routing properties are unfavorable for high performance computing • Few connections between nearby nodes • Messages between nearby nodes need to be forwarded, causing large latency penalties

  12. Adaptive MPI • Huang et al. ‘06 • Performs load balancing by migrating virtual processors • Balance the exec. times of the physical processors • Minimize inter-processor communication • Adapts to apps. by tracking the amount of communication performed between procs. • Assumes that the communication cost of every processor pair is the same • MC-MPI takes differences in communication costs into account (Figure: virtual processors mapped to physical processors)

  13. Lazy Connect Strategies • MPICH [Gropp et al. ‘96], Scalable MPI over InfiniBand [Yu et al. ‘06] • Establish connections only on demand • Reduces the number of conns. if each proc. only communicates with a few other procs. • Some apps. generate all-to-all comm. patterns, resulting in many connections • E.g., IS in the NAS Parallel Benchmarks • Don’t extend to wide-area environments where some communication may be blocked

  14. Outline • Introduction • Related Work • Proposed Method • Profiling Run • Connection Management • Rank Assignment • Experimental Results • Conclusion

  15. Overview of Our Method • Short profiling run ➭ latency matrix (L) and traffic matrix (T) • Optimized real run ➭ locality-aware connection management and locality-aware rank assignment

  16. Outline • Introduction • Related Work • Proposed Method • Profiling Run • Connection Management • Rank Assignment • Experimental Results • Conclusion

  17. Latency Matrix • Latency matrix L = {lij} • lij: latency between processes i and j in the target environment • Each process autonomously measures the RTT between itself and other processes • Reduce the num. of measurements by using the triangle inequality to estimate RTTs: if rtt_pr > α·rtt_rq, then estimate rtt_pq = rtt_pr (α: constant) (Figure: processes p, q, r with rtt_pr, rtt_rq, rtt_pq)
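A minimal sketch of this measurement-reduction rule in C; the function name, the measure() callback, and the value of α are illustrative assumptions, not part of MC-MPI:

    /* If peer q is much closer to an already-measured reference r than r is
     * to us, our RTT to q is dominated by our RTT to r, so reuse that value
     * instead of probing q directly. ALPHA is an assumed constant. */
    #define ALPHA 4.0

    double estimate_rtt(double rtt_pr, double rtt_rq,
                        double (*measure)(int peer), int q)
    {
        if (rtt_pr > ALPHA * rtt_rq)
            return rtt_pr;      /* estimate: rtt_pq ~= rtt_pr, skip the probe */
        return measure(q);      /* otherwise, measure the RTT for real */
    }

In other words, processes inside a faraway cluster can all inherit the RTT measured to one representative of that cluster.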

  18. Traffic Matrix • Traffic matrix T = {tij} • tij: traffic between ranks i and j in the target application • Many applications repeat similar communication patterns ➭ Execute the application for a short amount of time and make tij the number of transmitted messages (e.g., one iteration of an iterative app.)
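One way a row of T could be collected during the profiling run, as a hedged sketch; the wrapper name and the global counter are assumptions, not the MC-MPI API:

    #include <mpi.h>
    #include <stdlib.h>

    static long *traffic_row;   /* traffic_row[j] = messages this rank sent to rank j */

    void traffic_init(int nprocs)
    {
        traffic_row = calloc(nprocs, sizeof *traffic_row);
    }

    /* Drop-in counting wrapper around MPI_Send used only during profiling. */
    int profiled_send(const void *buf, int count, MPI_Datatype type,
                      int dest, int tag, MPI_Comm comm)
    {
        traffic_row[dest]++;    /* record one message to 'dest' */
        return MPI_Send(buf, count, type, dest, tag, comm);
    }

After the profiling iterations, the rows can be exchanged (e.g., with MPI_Allgather) so every process knows the full traffic matrix.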

  19. Outline • Introduction • Related Work • Proposed Method • Profiling Run • Connection Management • Rank Assignment • Experimental Results • Conclusion

  20. Connection Management • During MPI_Init: select candidate connections ➭ build the bounding graph ➭ build a spanning tree • During the application body: lazy connection establishment (candidate connections are established on demand)

  21. Selection of Candidate Connections • Each process selects O(log n) neighbors based on L and T • λ: parameter that controls connection density • n: number of processes (Figure: many nearby processes and few faraway processes are selected)
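A rough illustration (not the paper's exact algorithm) of choosing connections densely among nearby processes and sparsely among faraway ones, assuming this process's row of the latency matrix has already been measured; all names and the halving policy are assumptions:

    #include <stdlib.h>

    struct peer { int rank; double lat; };   /* lat = this process's RTT to 'rank' */

    static int by_latency(const void *a, const void *b)
    {
        double d = ((const struct peer *)a)->lat - ((const struct peer *)b)->lat;
        return (d > 0) - (d < 0);
    }

    /* Mark candidates: every peer from the nearest group, half of the next
     * group, a quarter of the next, and so on, so faraway nodes end up with
     * few connections. 'candidate' must be zero-initialized by the caller. */
    void select_candidates(struct peer *peers, int n, int group, int *candidate)
    {
        qsort(peers, n, sizeof *peers, by_latency);
        int stride = 1;
        for (int start = 0; start < n; start += group) {
            int end = (start + group < n) ? start + group : n;
            for (int i = start; i < end; i += stride)
                candidate[peers[i].rank] = 1;   /* keep as a candidate connection */
            stride *= 2;                        /* thin out more with distance */
        }
    }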

  22. Bounding Graph • Procs. try to establish temporary conns. to their selected neighbors • The collective set of successful connections ➭ Bounding graph • (Some conns. may fail due to FWs) (Figure: temporary connections, bounding graph)

  23. Routing Table Construction • Construct a routing table using just the bounding graph • Close the temporary connections • Conns. of the bounding graph are reestablished lazily as “real” conns. • Temporary conns. => small bufs. • Real conns. => large bufs.
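A sketch of one plausible way to derive a next-hop table from the bounding graph, using a BFS from the local process; the data layout and function names are assumptions, not MC-MPI's actual routing code:

    #include <string.h>

    #define MAX_PROCS 1024

    /* adj[i][j] != 0 iff the bounding graph has a connection between i and j.
     * next_hop[v] = first neighbor to forward to when sending toward v. */
    void build_routes(int n, int me, const char adj[][MAX_PROCS], int next_hop[])
    {
        int queue[MAX_PROCS], head = 0, tail = 0;
        memset(next_hop, -1, n * sizeof next_hop[0]);
        next_hop[me] = me;
        queue[tail++] = me;
        while (head < tail) {
            int u = queue[head++];
            for (int v = 0; v < n; v++) {
                if (adj[u][v] && next_hop[v] == -1) {
                    /* direct neighbor: first hop is v itself; otherwise inherit
                     * the first hop that leads toward u */
                    next_hop[v] = (u == me) ? v : next_hop[u];
                    queue[tail++] = v;
                }
            }
        }
    }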

  24. Lazy Connection Establishment • Lazy connect fails due to FW ➭ send connect request using spanning tree ➭ connect in reverse direction (Figure: bounding graph, spanning tree, firewalls)
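A hedged sketch of this firewall fallback; every helper below is hypothetical and only stands in for whatever MC-MPI actually uses internally:

    /* Hypothetical helpers, declared here only so the sketch is self-contained. */
    int  try_direct_connect(int peer);                 /* returns a socket, or -1 if blocked */
    void send_via_spanning_tree(int peer, int msg);    /* tree conns. already exist, so always deliverable */
    int  wait_for_reverse_connect(int peer);           /* blocks until the peer connects back */

    enum { CONNECT_REQUEST = 1 };

    int lazy_connect(int peer)
    {
        int sock = try_direct_connect(peer);
        if (sock >= 0)
            return sock;                               /* direct (lazy) connect succeeded */

        /* Blocked, e.g. by a firewall: route a small "connect to me" request
         * to the peer along the spanning tree... */
        send_via_spanning_tree(peer, CONNECT_REQUEST);

        /* ...and wait for the peer to open the connection in the reverse
         * direction, which works when only inbound connections are blocked. */
        return wait_for_reverse_connect(peer);
    }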

  25. Outline • Introduction • Related Work • Proposed Method • Profiling Run • Connection Management • Rank Assignment • Experimental Results • Conclusion

  26. Commonly-used Method • Sort the processes by host name (or IP address) and assign ranks in that order • Assumptions • Most communication takes place between processes with close ranks • The communication cost between processes with close host names is low • However, • Applications have various comm. patterns • Host names don’t necessarily correlate with communication costs
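For comparison, a minimal sketch of this commonly-used baseline (illustrative names only):

    #include <stdlib.h>
    #include <string.h>

    struct proc { char host[64]; int id; };

    static int by_host(const void *a, const void *b)
    {
        return strcmp(((const struct proc *)a)->host, ((const struct proc *)b)->host);
    }

    /* rank_of[p.id] receives the rank assigned to process p. */
    void assign_ranks_by_hostname(struct proc *procs, int n, int *rank_of)
    {
        qsort(procs, n, sizeof *procs, by_host);
        for (int r = 0; r < n; r++)
            rank_of[procs[r].id] = r;   /* consecutive ranks follow host-name order */
    }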

  27. Our Rank Assignment Scheme • Find a rank-process mapping with low communication overhead • Map the rank assignment problem to the Quadratic Assignment Problem (QAP) • QAP: given two n×n cost matrices, L and T, find a permutation p of {0, 1, ..., n-1} that minimizes the total traffic-weighted latency (see the formula below)
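One way to write the objective as a formula, assuming the standard QAP form with the traffic matrix T as the flow matrix and the latency matrix L as the distance matrix:

    \min_{p} \sum_{i=0}^{n-1} \sum_{j=0}^{n-1} t_{ij} \, l_{p(i)\,p(j)}

Here t_ij is the traffic between ranks i and j, and l_{p(i)p(j)} is the latency between the processes those two ranks are mapped to.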

  28. Solving QAPs • NP-hard, but there are heuristics for finding good suboptimal solutions • Library based on GRASP [Resende et al. ’96] • Tested against QAPLIB [Burkard et al. ’97] • Instances of up to n = 256 • n processors for problem size n • Approximate solutions that were within one to two percent of the best known solution in under one second
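Whatever heuristic produces a candidate permutation, its cost can be checked directly against L and T; a small sketch with illustrative names:

    /* Evaluate the QAP objective for a candidate rank-to-process mapping 'p'
     * (p[i] = process assigned to rank i), given the latency matrix L and the
     * traffic matrix T, both stored row-major as n*n arrays. */
    double qap_cost(int n, const double *L, const long *T, const int *p)
    {
        double cost = 0.0;
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                /* traffic between ranks i and j, weighted by the latency
                 * between the processes those ranks are mapped to */
                cost += (double)T[i * n + j] * L[p[i] * n + p[j]];
        return cost;
    }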

  29. Outline • Introduction • Related Work • Profiling Run • Connection Management • Rank Assignment • Experimental Results • Conclusion

  30. Experimental Environment • Xeon/Pentium M • Linux • Intra-cluster RTT: 60-120 microsecs • TCP send/recv bufs: 256KB ea. (Figure: four 64-node clusters (chibaXXX, hongoXXX, istbsXXX, sheepXX) with inter-cluster RTTs from 0.3 ms to 10.8 ms; one cluster is behind a FW)

  31. Experiment 1: Conn. Management • Measure the performance of the NPB with limited numbers of connections • MC-MPI • Limit the number of connections to 10%, 20%, ..., 100% by varying λ • Random • Establish a comparable number of connections randomly

  32. BT, LU, MG and SP (Figure: SOR (Successive Over-Relaxation) and LU (Lower-Upper) results)

  33. BT, LU, MG and SP (2) (Figure: MG (Multi-Grid) and BT (Block Tridiagonal) results)

  34. BT, LU, MG and SP (3) • % of connections actually established was lower than that shown by the x-axis • B/c of lazy connection establishment • To be discussed in more detail later (Figure: SP (Scalar Pentadiagonal) results)

  35. EP • EP involves very little communication (Figure: EP (Embarrassingly Parallel) results)

  36. IS • Performance decrease due to congestion! (Figure: IS (Integer Sort) results)

  37. Experiment 2: Lazy Conn. Establish. • Compare our lazy conn. establishment method with an MPICH-like method • MC-MPI • Select λ so that the maximum number of allowed connections is 30% • MPICH-like • Establish connections on demand without preselecting candidate connections (we can also say that we preselect all connections)

  38. Experiment 2: Results • Comparable number of conns. except for IS • Comparable performance except for IS (Figures: relative performance, connections established)

  39. Experiment 3: Rank Assignment • Compare 3 assignment algorithms • Random • Hostname (24 patterns) • Real host names (1) • What if istbsXXX were named sheepXX, etc. (23) • MC-MPI (QAP) (Figure: clusters chibaXXX, sheepXX, hongoXXX, istbsXXX)

  40. LU and MG (Figure: LU and MG with Random, Hostname (Best), Hostname (Worst), and MC-MPI (QAP) assignments)

  41. BT and SP (Figure: BT and SP with Random, Hostname (Best), Hostname (Worst), and MC-MPI (QAP) assignments)

  42. BT and SP (cont’d) • Rank Assignment • Traffic Matrix (Figure: traffic matrix (source rank vs. destination rank) and rank assignments for Hostname and MC-MPI (QAP) across Clusters A-D)

  43. EP and IS (Figure: EP and IS with Random, Hostname (Best), Hostname (Worst), and MC-MPI (QAP) assignments)

  44. Outline • Introduction • Related Work • Profiling Run • Connection Management • Rank Assignment • Experimental Results • Conclusion

  45. Conclusion • MC-MPI • Connection management • High performance with connections between just 10% of all process pairs • Rank assignment • Up to 300% faster than locality-unaware assignments • Future Work • An API to perform profiling w/in a single run • Integration of adaptive collectives
