
Designing Parallel Operating Systems using Modern Interconnects

Presentation Transcript


  1. Designing Parallel Operating Systems using Modern Interconnects Eitan Frachtenberg (eitanf@lanl.gov), with Fabrizio Petrini, Juan Fernandez, Dror Feitelson, Jose-Carlos Sancho, and Kei Davis. Computer and Computational Sciences Division, Los Alamos National Laboratory. Ideas that change the world

  2. Cluster Supercomputers • Growing in prevalence and performance: 7 of the top 10 supercomputers are clusters • Running parallel applications • Advanced, high-end interconnects

  3. Distributed vs. Parallel Distributed and parallel applications (including operating systems) can be distinguished by their use of global and collective operations: • Distributed: local information, relatively small number of point-to-point messages • Parallel: global synchronization (barriers, reductions, exchanges)
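
The distinction can be made concrete with standard MPI calls. The following minimal C sketch (not from the talk) contrasts the two styles: a "distributed" token pass over point-to-point messages, and a "parallel" barrier plus reduction.

    /* Contrast of point-to-point vs. collective communication (illustrative). */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size, token = 0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* "Distributed" style: local information, point-to-point messages
         * (pass a token along a chain of ranks). */
        if (rank > 0)
            MPI_Recv(&token, 1, MPI_INT, rank - 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        token += rank;
        if (rank < size - 1)
            MPI_Send(&token, 1, MPI_INT, rank + 1, 0, MPI_COMM_WORLD);

        /* "Parallel" style: global synchronization and collective operations. */
        int local = rank, sum = 0;
        MPI_Barrier(MPI_COMM_WORLD);                                       /* barrier */
        MPI_Allreduce(&local, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);  /* reduction */

        if (rank == 0)
            printf("sum of ranks = %d\n", sum);
        MPI_Finalize();
        return 0;
    }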

  4. System Software Components: Resource Management, Job Scheduling, Communication Library, Fault Tolerance, Parallel I/O

  5. Problems with System Software Independent single-node OSes (e.g. Linux) connected by distributed dæmons: • Redundant components • Performance hits • Scalability issues • Load-balancing issues

  6. OS’s Collective Operations Many OS tasks are inherently global or collective operations: • Job launching, data dissemination • Context switching • Job termination (normal and forced) • Load balancing

  7. Global Parallel Operating System [Diagram: each of Node 1 and Node 2 runs a local operating system and user-level communication, with per-node resource management, job scheduling, fault tolerance, and parallel I/O; together these form a global parallel operating system providing job scheduling, fault tolerance, communication, parallel I/O, and resource management.]

  8. The Vision • Modern interconnects are very powerful: collective operations, programmable NICs, on-board RAM • Use a small set of network mechanisms as parallel OS infrastructure • Build upon this infrastructure to create unified system software • System software inherits scalability and performance from network features

  9. Example: ASCI Q Barrier [HotI’03]

  10. Parallel OS Primitives • System software built atop three primitives • Xfer-And-Signal • Transfer block of data to a set of nodes • Optionally signal local/remote event upon completion • Compare-And-Write • Compare global variable on a set of nodes • Optionally write global variable on the same set of nodes • Test-Event • Poll local event
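
To make the primitives concrete, here is a hypothetical C interface for them. The type and function names are illustrative only; this is not the actual Quadrics Elan/QsNet API.

    #include <stddef.h>

    typedef struct { int first_node, last_node; } node_set_t;  /* a set of destination nodes */
    typedef int event_t;                                        /* a local event handle */

    /* Xfer-And-Signal: transfer a block of data to a set of nodes, optionally
     * signalling a local and/or remote event upon completion. */
    void xfer_and_signal(node_set_t dests, const void *buf, size_t len,
                         event_t *local_event, event_t *remote_event);

    /* Compare-And-Write: compare a global variable against a value on a set of
     * nodes; returns nonzero if the comparison holds on every node, and can
     * optionally write the variable on the same set of nodes. */
    int compare_and_write(node_set_t nodes, volatile int *global_var,
                          int compare_value, int write_value);

    /* Test-Event: poll a local event; returns nonzero once it has fired. */
    int test_event(event_t *event);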

  11. Core Primitives on QsNet [Diagram: source node S with a source event; destination nodes D1–D4 with destination events] • System software built atop three primitives • Xfer-And-Signal (QsNet): Node S transfers a block of data to nodes D1, D2, D3, and D4 • Events triggered at source and destinations

  12. Core Primitives (cont.) [Diagram: node S queries nodes D1–D4] • System software built atop three primitives • Compare-And-Write (QsNet): Node S compares variable V on nodes D1, D2, D3, and D4 • Is V {<, =, >} Value?

  13. Core Primitives (cont.) [Diagram: node S queries nodes D1–D4 through the network switches] • System software built atop three primitives • Compare-And-Write (QsNet): Node S compares variable V on nodes D1, D2, D3, and D4 • Partial results are combined in the switches

  14. System Software Components: Resource Management, Job Scheduling, Communication Library, Fault Tolerance, Parallel I/O

  15. STORM: Scalable Tool for Resource Management • Inherits scalability from network primitives: • Data dissemination and coordination • Interactive job-launching speeds • Context switching at the millisecond level • Described in [SC’02]
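
A rough sketch of how job launching could be layered on these primitives follows; it assumes the hypothetical interface sketched after slide 10 and illustrates the idea, not STORM's actual code.

    /* Disseminate the binary with Xfer-And-Signal, then detect global
     * completion by polling a flag with Compare-And-Write. */
    volatile int launch_done = 0;        /* set to 1 on each node after exec */

    void launch_job(node_set_t nodes, const void *binary, size_t len)
    {
        event_t sent = 0;

        /* 1. Disseminate the binary image to all nodes in one collective transfer. */
        xfer_and_signal(nodes, binary, len, &sent, NULL);
        while (!test_event(&sent))
            ;                            /* wait for the local transfer to complete */

        /* 2. Poll the global condition "launch_done == 1 on every node";
         *    the write_value argument is unused in this sketch. */
        while (!compare_and_write(nodes, &launch_done, 1, 0))
            ;                            /* in practice, back off between polls */
    }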

  16. State of the Art in Resource Management Resource managers (e.g. PBS, LSF, RMS, LoadLeveler, Maui) are typically implemented using: • TCP/IP, favoring portability over performance • Poorly scaling algorithms for the distribution/collection of data and control messages • Designs that favor development time over performance Scalable performance is not important for small clusters but is crucial for large ones: there is a need for fast and scalable resource management.

  17. Experimental Setup • 64-node/256-processor AlphaServer ES40 cluster • 2 independent network rails of Quadrics Elan3 • Files are placed in a RAM disk to avoid I/O bottlenecks and expose the performance of the resource-management algorithms

  18. Launch Times (Unloaded System) Launch time remains constant as the number of processors increases: STORM is highly scalable

  19. Launch Times (Loaded System, 12 MB) Worst case: 1.5 seconds to launch a 12 MB file on 256 processors

  20. Measured and Estimated Launch Times The model shows that on an ES40-based AlphaServer cluster, a 12 MB binary can be launched in 135 ms on 16,384 nodes

  21. Comparative Evaluation (Measured & Modeled)

  22. System Software Components: Resource Management, Job Scheduling, Communication Library, Fault Tolerance, Parallel I/O

  23. Job Scheduling • Controls the allocation of space and time resources to jobs • HPC apps have special requirements: • Multiple processing and network resources • Synchronization (< 1 ms granularity) • Potential memory hogs with little locality • Has a significant effect on throughput, responsiveness, and utilization

  24. First-Come-First-Serve (FCFS)

  25. Gang Scheduling (GS)

  26. Implicit CoScheduling

  27. Hybrid Methods • Combine global synchronization & local information • Rely on scalable primitives for global coordination and information exchange • First implementation of two novel algorithms: • Flexible CoScheduling (FCS) • Buffered CoScheduling (BCS)

  28. Flexible CoScheduling (FCS) • Measure communication characteristics, such as granularity and wait times • Classify processes based on synchronization requirements • Schedule processes based on class • Described in [IPDPS’03]

  29. FCS Classification [Chart: process classes by granularity and blocking time]
  Class  Granularity  Block times  Scheduling
  CS     Fine         Short        Always gang-scheduled
  F      Fine         Long         Preferably gang-scheduled
  DC     Coarse       Any          Locally scheduled
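
The classification rule implied by this chart can be sketched in C as follows; the thresholds are placeholders, not the values used by FCS.

    typedef enum { CLASS_CS, CLASS_F, CLASS_DC } fcs_class_t;

    fcs_class_t classify(double granularity_ms, double block_time_ms)
    {
        const double FINE_GRAIN_MS = 1.0;   /* placeholder threshold */
        const double LONG_BLOCK_MS = 1.0;   /* placeholder threshold */

        if (granularity_ms >= FINE_GRAIN_MS)
            return CLASS_DC;                /* coarse-grained: schedule locally */
        if (block_time_ms < LONG_BLOCK_MS)
            return CLASS_CS;                /* fine-grained, short waits: always gang-schedule */
        return CLASS_F;                     /* fine-grained, long waits: prefer gang scheduling */
    }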

  30. Methodology • Synthetic, controllable MPI programs • Workload • Static: all jobs start together • Dynamic: different sizes, arrival and run times • Various schedulers implemented: • FCFS, GS, FCS, SB (ICS), BCS • Emulation vs. simulation • Actual implementation takes into account all the overhead and factors of a real system

  31. Hardware Environment • Environment ported to three architectures and clusters: • Crescendo: 32x2 Pentium III, 1GB • Accelerando: 32x2 Itanium II, 2GB • Wolverine: 64x4 Alpha ES40, 8GB

  32. Synthetic Application • Bulk-synchronous, 3 ms basic granularity • Can control granularity, variability, and communication pattern
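
A minimal MPI sketch of such a bulk-synchronous synthetic program is shown below; the command-line handling and busy-wait loop are illustrative, not the actual benchmark code.

    #include <mpi.h>
    #include <stdlib.h>

    static void compute(double ms)           /* burn CPU for roughly ms milliseconds */
    {
        double t0 = MPI_Wtime();
        while ((MPI_Wtime() - t0) * 1000.0 < ms)
            ;
    }

    int main(int argc, char **argv)
    {
        double grain_ms = (argc > 1) ? atof(argv[1]) : 3.0;  /* 3 ms basic granularity */
        int    iters    = (argc > 2) ? atoi(argv[2]) : 1000;

        MPI_Init(&argc, &argv);
        for (int i = 0; i < iters; i++) {
            compute(grain_ms);               /* computation phase */
            MPI_Barrier(MPI_COMM_WORLD);     /* communication/synchronization phase */
        }
        MPI_Finalize();
        return 0;
    }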

  33. Synthetic Scenarios: Balanced, Complementing, Imbalanced, Mixed

  34. Turnaround Time

  35. Dynamic Workloads [JSSPP’03] • Static workloads are simple and offer insights, but are not realistic • Most real-life workloads are more complex • Users submit jobs dynamically, with varying time and space requirements

  36. Dynamic Workload Methodology • Emulation using a workload model [Lublin03] • 1000 jobs, approx. 12 days, shrunk to 2 hrs • Varying load by factoring arrival times • Using the same synthetic application, with random: • Arrival time, run time, and size, based on the model • Granularity (fine, medium, coarse) • Communication pattern (ring, barrier, none) • Recent study with scientific apps (as yet unpublished)
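
The load-factoring step can be illustrated with a small sketch; the job fields and the way they are produced are made up here, since the real workload comes from the Lublin model.

    typedef struct {
        double arrival;   /* arrival time (s) */
        double runtime;   /* requested run time (s) */
        int    size;      /* number of processors */
    } job_t;

    /* Dividing arrival times by load_factor > 1 compresses the arrival
     * stream and therefore raises the offered load. */
    void scale_load(job_t *jobs, int njobs, double load_factor)
    {
        for (int i = 0; i < njobs; i++)
            jobs[i].arrival /= load_factor;
    }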

  37. Load – Response Time

  38. Load – Bounded Slowdown

  39. Timeslice – Response Time

  40. System Software Components: Resource Management, Job Scheduling, Communication Library, Fault Tolerance, Parallel I/O

  41. Buffered CoScheduling (BCS) • Buffer all communications • Exchange information about pending communication every time slice • Schedule and execute communication • Implemented mostly on the NIC • Requires fine-grained heartbeats • Described in [SC’03]

  42. Design and Implementation • Global synchronization • Strobe sent at regular intervals (time slices) • Compare-And-Write + Xfer-And-Signal (Master) • Test-Event (Slaves) • All system activities are tightly coupled • Global Scheduling • Exchange of communication requirements • Xfer-And-Signal + Test-Event • Communication scheduling • Real transmission • Xfer-And-Signal + Test-Event
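
The strobe pattern described above can be sketched with the hypothetical primitives from slide 10; the variable names and time-slice handling here are illustrative, not BCS's actual code.

    volatile int current_slice = 0;    /* global time-slice counter */

    /* Master: confirm that every node reached the current slice
     * (Compare-And-Write), then announce the next one (Xfer-And-Signal). */
    void master_strobe(node_set_t all_nodes, event_t *strobe_event)
    {
        while (!compare_and_write(all_nodes, &current_slice, current_slice, 0))
            ;                                       /* wait for global agreement */
        int next = current_slice + 1;
        xfer_and_signal(all_nodes, &next, sizeof next, NULL, strobe_event);
    }

    /* Slave: poll the local strobe event (Test-Event), then run the phases of
     * the slice: exchange communication requirements, schedule, transmit. */
    void slave_on_strobe(event_t *strobe_event)
    {
        while (!test_event(strobe_event))
            ;
        /* ... exchange requirements, schedule communication, transmit ... */
    }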

  43. Design and Implementation • Implementation in the NIC • Application processes interact with NIC threads • MPI primitive → descriptor posted to the NIC • Communications are buffered • Cooperative threads running in the NIC: • Synchronize • Partial exchange of control information • Schedule communications • Perform real transmissions and reduce computations • Computation/communication completely overlapped
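
The "MPI primitive → descriptor posted to the NIC" step might look like the following; the descriptor layout and the nic_post() queue interface are made-up illustrations, not the actual BCS MPI or Elan data structures.

    #include <stddef.h>

    typedef enum { DESC_SEND, DESC_RECV } desc_op_t;

    typedef struct {
        desc_op_t op;      /* send or receive */
        int       peer;    /* destination or source rank */
        int       tag;
        void     *buf;
        size_t    len;
    } nic_descriptor_t;

    int nic_post(const nic_descriptor_t *d);   /* enqueue a descriptor for the NIC threads */

    /* A non-blocking send only posts a descriptor; the NIC threads buffer it
     * and perform the real transmission during the next communication slice. */
    int bcs_isend(void *buf, size_t len, int dest, int tag)
    {
        nic_descriptor_t d = { DESC_SEND, dest, tag, buf, len };
        return nic_post(&d);
    }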

  44. Design and Implementation • Non-blocking primitives: MPI_Isend/Irecv

  45. Design and Implementation • Blocking primitives: MPI_Send/Recv

  46. Performance Evaluation • BCS MPI vs. Quadrics MPI • Experimental Setup • Benchmarks and Applications • NPB (IS,EP,MG,CG,LU) - Class C • SWEEP3D - 50x50x50 • SAGE - timing.input • Scheduling parameters • 500μs communication scheduling time slice (1 rail) • 250μs communication scheduling time slice (2 rails)

  47. Performance Evaluation • Benchmarks and Applications (Class C)
  Application        Slowdown
  IS (32 PEs)        10.40%
  EP (49 PEs)         5.35%
  MG (32 PEs)         4.37%
  CG (32 PEs)        10.83%
  LU (32 PEs)        15.04%
  SWEEP3D (49 PEs)   -2.23%
  SAGE (62 PEs)      -0.42%

  48. Performance Evaluation • SAGE - timing.input (IA32): 0.5% speedup

  49. Blocking Communication • Blocking vs. non-blocking SWEEP3D (IA32) • MPI_Send/Recv → MPI_Isend/Irecv + MPI_Waitall
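
The transformation named on this slide can be shown as a standard MPI fragment; the buffer names and neighbour-exchange pattern are illustrative, not SWEEP3D's actual code.

    #include <mpi.h>

    /* Blocking form: each transfer is tied to the matching call on the peer. */
    void exchange_blocking(double *out, double *in, int n, int up, int down)
    {
        MPI_Send(out, n, MPI_DOUBLE, down, 0, MPI_COMM_WORLD);
        MPI_Recv(in,  n, MPI_DOUBLE, up,   0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    /* Non-blocking form: post both operations, then wait, so the transfers can
     * be buffered and scheduled within a communication time slice. */
    void exchange_nonblocking(double *out, double *in, int n, int up, int down)
    {
        MPI_Request req[2];
        MPI_Isend(out, n, MPI_DOUBLE, down, 0, MPI_COMM_WORLD, &req[0]);
        MPI_Irecv(in,  n, MPI_DOUBLE, up,   0, MPI_COMM_WORLD, &req[1]);
        MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
    }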

  50. System Software Components: Resource Management, Job Scheduling, Communication Library, Fault Tolerance, Parallel I/O
