Locality-aware Connection Management and Rank Assignment for Wide-area MPI


Presentation Transcript


  1. Locality-aware Connection Management and Rank Assignment for Wide-area MPI Hideo Saito Kenjiro Taura The University of Tokyo May 16, 2007

  2. Background • Increase in the bandwidth of WANs ➭ More opportunities to perform parallel computation using multiple clusters

  3. Requirements for Wide-area MPI • Wide-area connectivity • Firewalls and private addresses • Only some nodes can connect to each other • Perform routing using the connections that happen to be possible (Figure: NAT, firewall)

  4. Reqs. for Wide-area MPI (2) • Scalability • The number of conns. must be limited in order to scale to thousands of nodes • Various allocation limits of the system (e.g., memory, file descriptors, router sessions) • Simplistic schemes that may potentially result in O(n²) connections won’t scale • Lazy connect strategies work for many apps, but not for those that involve all-to-all communication

  5. Reqs. for Wide-area MPI (3) • Locality awareness • To achieve high performance with few conns., select conns. in a locality-aware manner • Many connections with nearby nodes, few connections with faraway nodes (Figure: many conns. within a cluster, few conns. between clusters)

  6. Reqs. for Wide-area MPI (4) • Application awareness • Select connections according to the application’s communication pattern • Assign ranks* according to the application’s communication pattern • Adaptivity • Automatically, without tedious manual configuration * rank = process ID in MPI

  7. Contributions of Our Work • Locality-aware connection management • Uses latency and traffic information obtained from a short profiling run • Locality-aware rank assignment • Uses the same info. to discover rank-process mappings with low comm. overhead ➭ Multi-Cluster MPI (MC-MPI) • Wide-area-enabled MPI library

  8. Outline • Introduction • Related Work • Proposed Method • Profiling Run • Connection Management • Rank Assignment • Experimental Results • Conclusion

  9. Grid-enabled MPI Libraries • MPICH-G2 [Karonis et al. ‘03], MagPIe [Kielmann et al. ‘99] • Locality-aware communication optimizations • E.g., wide-area-aware collective operations (broadcast, reduction, ...) • Don’t work with firewalls

  10. Grid-enabled MPI Libraries (cont’d) • MPICH/MADIII [Aumage et al. ‘03], StaMPI [Imamura et al. ‘00] • Forwarding mechanisms that allow nodes to communicate even in the presence of FWs • Manual configuration • Amount of necessary config. becomes overwhelming as more resources are used (Figure: forwarding through a firewall)

  11. P2P Overlays • Pastry [Rowstron et al. ’00] • Each node maintains just O(log n) connections • Messages are routed using those connections • Highly scalable, but routing properties are unfavorable for high performance computing • Few connections between nearby nodes • Messages between nearby nodes need to be forwarded, causing large latency penalties

  12. Adaptive MPI • Huang et al. ‘06 • Performs load balancing by migrating virtual processors • Balance the exec. times of the physical processors • Minimize inter-processor communication • Adapts to apps. by tracking the amount of communication performed between procs. • Assumes that the communication cost of every processor pair is the same • MC-MPI takes differences in communication costs into account (Figure: virtual processors mapped to physical processors)

  13. Lazy Connect Strategies • MPICH [Gropp et al. ‘96], Scalable MPI over InfiniBand [Yu et al. ‘06] • Establish connections only on demand • Reduces the number of conns. if each proc. only communicates with a few other procs. • Some apps. generate all-to-all comm. patterns, resulting in many connections • E.g., IS in the NAS Parallel Benchmarks • Don’t extend to wide-area environments where some communication may be blocked

  14. Outline • Introduction • Related Work • Proposed Method • Profiling Run • Connection Management • Rank Assignment • Experimental Results • Conclusion

  15. Overview of Our Method • Short profiling run ➭ latency matrix (L) and traffic matrix (T) • Optimized real run ➭ locality-aware connection management and locality-aware rank assignment

  16. Outline • Introduction • Related Work • Proposed Method • Profiling Run • Connection Management • Rank Assignment • Experimental Results • Conclusion

  17. Latency Matrix • Latency matrix L = {lij} • lij: latency between processes i and j in the target environment • Each process autonomously measures the RTT between itself and other processes • Reduce the num. of measurements by using the triangle inequality to estimate RTTs: if rtt_pr > α·rtt_rq, then estimate rtt_pq = rtt_pr (α: constant) (Figure: processes p, q, r with rtt_pr, rtt_rq, rtt_pq)
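A minimal sketch of this measurement-reduction rule in C; the function name, the measure() callback, and the value of α are illustrative assumptions, not part of MC-MPI:

    /* If peer q is much closer to an already-measured reference r than r is
     * to us, our RTT to q is dominated by our RTT to r, so reuse that value
     * instead of probing q directly. ALPHA is an assumed constant. */
    #define ALPHA 4.0

    double estimate_rtt(double rtt_pr, double rtt_rq,
                        double (*measure)(int peer), int q)
    {
        if (rtt_pr > ALPHA * rtt_rq)
            return rtt_pr;      /* estimate: rtt_pq ~= rtt_pr, skip the probe */
        return measure(q);      /* otherwise, measure the RTT for real */
    }

In other words, processes inside a faraway cluster can all inherit the RTT measured to one representative of that cluster.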

  18. Traffic Matrix • Traffic matrix T = {tij} • tij: traffic between ranks i and j in the target application • Many applications repeat similar communication patterns ➭ Execute the application for a short amount of time and make tij the number of transmitted messages (e.g., one iteration of an iterative app.)
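One way a row of T could be collected during the profiling run, as a hedged sketch; the wrapper name and the global counter are assumptions, not the MC-MPI API:

    #include <mpi.h>
    #include <stdlib.h>

    static long *traffic_row;   /* traffic_row[j] = messages this rank sent to rank j */

    void traffic_init(int nprocs)
    {
        traffic_row = calloc(nprocs, sizeof *traffic_row);
    }

    /* Drop-in counting wrapper around MPI_Send used only during profiling. */
    int profiled_send(const void *buf, int count, MPI_Datatype type,
                      int dest, int tag, MPI_Comm comm)
    {
        traffic_row[dest]++;    /* record one message to 'dest' */
        return MPI_Send(buf, count, type, dest, tag, comm);
    }

After the profiling iterations, the rows can be exchanged (e.g., with MPI_Allgather) so every process knows the full traffic matrix.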

  19. Outline • Introduction • Related Work • Proposed Method • Profiling Run • Connection Management • Rank Assignment • Experimental Results • Conclusion

  20. Connection Management • During MPI_Init: select candidate connections ➭ build the bounding graph ➭ build a spanning tree • During the application body: lazy connection establishment (candidate connections are established on demand)

  21. Selection of Candidate Connections • Each process selects O(log n) neighbors based on L and T • λ: parameter that controls connection density • n: number of processes (Figure: many nearby processes and few faraway processes are selected)
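A rough illustration (not the paper's exact algorithm) of choosing connections densely among nearby processes and sparsely among faraway ones, assuming this process's row of the latency matrix has already been measured; all names and the halving policy are assumptions:

    #include <stdlib.h>

    struct peer { int rank; double lat; };   /* lat = this process's RTT to 'rank' */

    static int by_latency(const void *a, const void *b)
    {
        double d = ((const struct peer *)a)->lat - ((const struct peer *)b)->lat;
        return (d > 0) - (d < 0);
    }

    /* Mark candidates: every peer from the nearest group, half of the next
     * group, a quarter of the next, and so on, so faraway nodes end up with
     * few connections. 'candidate' must be zero-initialized by the caller. */
    void select_candidates(struct peer *peers, int n, int group, int *candidate)
    {
        qsort(peers, n, sizeof *peers, by_latency);
        int stride = 1;
        for (int start = 0; start < n; start += group) {
            int end = (start + group < n) ? start + group : n;
            for (int i = start; i < end; i += stride)
                candidate[peers[i].rank] = 1;   /* keep as a candidate connection */
            stride *= 2;                        /* thin out more with distance */
        }
    }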

  22. Bounding Graph • Procs. try to establish temporary conns. to their selected neighbors • The collective set of successful connections ➭ Bounding graph • (Some conns. may fail due to FWs) (Figure: temporary connections, bounding graph)

  23. Routing Table Construction • Construct a routing table using just the bounding graph • Close the temporary connections • Conns. of the bounding graph are reestablished lazily as “real” conns. • Temporary conns. => small bufs. • Real conns. => large bufs.
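A sketch of one plausible way to derive a next-hop table from the bounding graph, using a BFS from the local process; the data layout and function names are assumptions, not MC-MPI's actual routing code:

    #include <string.h>

    #define MAX_PROCS 1024

    /* adj[i][j] != 0 iff the bounding graph has a connection between i and j.
     * next_hop[v] = first neighbor to forward to when sending toward v. */
    void build_routes(int n, int me, const char adj[][MAX_PROCS], int next_hop[])
    {
        int queue[MAX_PROCS], head = 0, tail = 0;
        memset(next_hop, -1, n * sizeof next_hop[0]);
        next_hop[me] = me;
        queue[tail++] = me;
        while (head < tail) {
            int u = queue[head++];
            for (int v = 0; v < n; v++) {
                if (adj[u][v] && next_hop[v] == -1) {
                    /* direct neighbor: first hop is v itself; otherwise inherit
                     * the first hop that leads toward u */
                    next_hop[v] = (u == me) ? v : next_hop[u];
                    queue[tail++] = v;
                }
            }
        }
    }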

  24. Lazy Connection Establishment • Lazy connect fails due to FW ➭ send connect request using spanning tree ➭ connect in reverse direction (Figure: bounding graph, spanning tree, firewalls)
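A hedged sketch of this firewall fallback; every helper below is hypothetical and only stands in for whatever MC-MPI actually uses internally:

    /* Hypothetical helpers, declared here only so the sketch is self-contained. */
    int  try_direct_connect(int peer);                 /* returns a socket, or -1 if blocked */
    void send_via_spanning_tree(int peer, int msg);    /* tree conns. already exist, so always deliverable */
    int  wait_for_reverse_connect(int peer);           /* blocks until the peer connects back */

    enum { CONNECT_REQUEST = 1 };

    int lazy_connect(int peer)
    {
        int sock = try_direct_connect(peer);
        if (sock >= 0)
            return sock;                               /* direct (lazy) connect succeeded */

        /* Blocked, e.g. by a firewall: route a small "connect to me" request
         * to the peer along the spanning tree... */
        send_via_spanning_tree(peer, CONNECT_REQUEST);

        /* ...and wait for the peer to open the connection in the reverse
         * direction, which works when only inbound connections are blocked. */
        return wait_for_reverse_connect(peer);
    }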

  25. Outline • Introduction • Related Work • Proposed Method • Profiling Run • Connection Management • Rank Assignment • Experimental Results • Conclusion

  26. Commonly-used Method • Sort the processes by host name (or IP address) and assign ranks in that order • Assumptions • Most communication takes place between processes with close ranks • The communication cost between processes with close host names is low • However, • Applications have various comm. patterns • Host names don’t necessarily correlate with communication costs
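For comparison, a minimal sketch of this commonly-used baseline (illustrative names only):

    #include <stdlib.h>
    #include <string.h>

    struct proc { char host[64]; int id; };

    static int by_host(const void *a, const void *b)
    {
        return strcmp(((const struct proc *)a)->host, ((const struct proc *)b)->host);
    }

    /* rank_of[p.id] receives the rank assigned to process p. */
    void assign_ranks_by_hostname(struct proc *procs, int n, int *rank_of)
    {
        qsort(procs, n, sizeof *procs, by_host);
        for (int r = 0; r < n; r++)
            rank_of[procs[r].id] = r;   /* consecutive ranks follow host-name order */
    }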

  27. Our Rank Assignment Scheme • Find a rank-process mapping with low communication overhead • Map the rank assignment problem to the Quadratic Assignment Problem (QAP) • QAP: given two n×n cost matrices, L and T, find a permutation p of {0, 1, ..., n-1} that minimizes the total traffic-weighted latency (see the formula below)
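One way to write the objective as a formula, assuming the standard QAP form with the traffic matrix T as the flow matrix and the latency matrix L as the distance matrix:

    \min_{p} \sum_{i=0}^{n-1} \sum_{j=0}^{n-1} t_{ij} \, l_{p(i)\,p(j)}

Here t_ij is the traffic between ranks i and j, and l_{p(i)p(j)} is the latency between the processes those two ranks are mapped to.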

  28. Solving QAPs • NP-hard, but there are heuristics for finding good suboptimal solutions • Library based on GRASP [Resende et al. ’96] • Tested against QAPLIB [Burkard et al. ’97] • Instances of up to n = 256 • n processors for problem size n • Approximate solutions that were within one to two percent of the best known solution in under one second
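Whatever heuristic produces a candidate permutation, its cost can be checked directly against L and T; a small sketch with illustrative names:

    /* Evaluate the QAP objective for a candidate rank-to-process mapping 'p'
     * (p[i] = process assigned to rank i), given the latency matrix L and the
     * traffic matrix T, both stored row-major as n*n arrays. */
    double qap_cost(int n, const double *L, const long *T, const int *p)
    {
        double cost = 0.0;
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                /* traffic between ranks i and j, weighted by the latency
                 * between the processes those ranks are mapped to */
                cost += (double)T[i * n + j] * L[p[i] * n + p[j]];
        return cost;
    }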

  29. Outline • Introduction • Related Work • Profiling Run • Connection Management • Rank Assignment • Experimental Results • Conclusion

  30. Experimental Environment • Xeon/Pentium M • Linux • Intra-cluster RTT: 60-120 microsecs • TCP send/recv bufs: 256KB ea. (Figure: four 64-node clusters (chibaXXX, hongoXXX, istbsXXX, sheepXX) with inter-cluster RTTs from 0.3 ms to 10.8 ms; one cluster is behind a FW)

  31. Experiment 1: Conn. Management • Measure the performance of the NPB with limited numbers of connections • MC-MPI • Limit the number of connections to 10%, 20%, ..., 100% by varying λ • Random • Establish a comparable number of connections randomly

  32. BT, LU, MG and SP (Figure: SOR (Successive Over-Relaxation) and LU (Lower-Upper) results)

  33. BT, LU, MG and SP (2) (Figure: MG (Multi-Grid) and BT (Block Tridiagonal) results)

  34. BT, LU, MG and SP (3) • % of connections actually established was lower than that shown by the x-axis • B/c of lazy connection establishment • To be discussed in more detail later (Figure: SP (Scalar Pentadiagonal) results)

  35. EP • EP involves very little communication (Figure: EP (Embarrassingly Parallel) results)

  36. IS • Performance decrease due to congestion! (Figure: IS (Integer Sort) results)

  37. Experiment 2: Lazy Conn. Establish. • Compare our lazy conn. establishment method with an MPICH-like method • MC-MPI • Select λ so that the maximum number of allowed connections is 30% • MPICH-like • Establish connections on demand without preselecting candidate connections (we can also say that we preselect all connections)

  38. Experiment 2: Results • Comparable number of conns. except for IS • Comparable performance except for IS (Figures: relative performance, connections established)

  39. Experiment 3: Rank Assignment • Compare 3 assignment algorithms • Random • Hostname (24 patterns) • Real host names (1) • What if istbsXXX were named sheepXX, etc. (23) • MC-MPI (QAP) (Figure: clusters chibaXXX, sheepXX, hongoXXX, istbsXXX)

  40. LU and MG (Figure: LU and MG with Random, Hostname (Best), Hostname (Worst), and MC-MPI (QAP) assignments)

  41. BT and SP (Figure: BT and SP with Random, Hostname (Best), Hostname (Worst), and MC-MPI (QAP) assignments)

  42. BT and SP (cont’d) • Rank Assignment • Traffic Matrix (Figure: traffic matrix (source rank vs. destination rank) and rank assignments for Hostname and MC-MPI (QAP) across Clusters A-D)

  43. EP and IS (Figure: EP and IS with Random, Hostname (Best), Hostname (Worst), and MC-MPI (QAP) assignments)

  44. Outline • Introduction • Related Work • Profiling Run • Connection Management • Rank Assignment • Experimental Results • Conclusion

  45. Conclusion • MC-MPI • Connection management • High performance with connections between just 10% of all process pairs • Rank assignment • Up to 300% faster than locality-unaware assignments • Future Work • An API to perform profiling w/in a single run • Integration of adaptive collectives
