
Designing Parallel Operating Systems using Modern Interconnects

Presentation Transcript


  1. Designing Parallel Operating Systems using Modern Interconnects Eitan Frachtenberg (eitanf@lanl.gov), with Fabrizio Petrini, Juan Fernandez, Dror Feitelson, Jose-Carlos Sancho, and Kei Davis. Computer and Computational Sciences Division, Los Alamos National Laboratory. Ideas that change the world

  2. Cluster Supercomputers • Growing in prevalence and performance: 7 of the top 10 supercomputers are clusters • Running parallel applications • Advanced, high-end interconnects

  3. Distributed vs. Parallel Distributed and parallel applications (including operating systems) can be distinguished by their use of global and collective operations: • Distributed: local information, relatively small number of point-to-point messages • Parallel: global synchronization (barriers, reductions, exchanges)
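
The distinction can be made concrete with standard MPI calls. The following minimal C sketch (not from the talk) contrasts the two styles: a "distributed" token pass over point-to-point messages, and a "parallel" barrier plus reduction.

    /* Contrast of point-to-point vs. collective communication (illustrative). */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size, token = 0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* "Distributed" style: local information, point-to-point messages
         * (pass a token along a chain of ranks). */
        if (rank > 0)
            MPI_Recv(&token, 1, MPI_INT, rank - 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        token += rank;
        if (rank < size - 1)
            MPI_Send(&token, 1, MPI_INT, rank + 1, 0, MPI_COMM_WORLD);

        /* "Parallel" style: global synchronization and collective operations. */
        int local = rank, sum = 0;
        MPI_Barrier(MPI_COMM_WORLD);                                       /* barrier */
        MPI_Allreduce(&local, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);  /* reduction */

        if (rank == 0)
            printf("sum of ranks = %d\n", sum);
        MPI_Finalize();
        return 0;
    }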

  4. System Software Components: Resource Management, Job Scheduling, Communication Library, Fault Tolerance, Parallel I/O

  5. Problems with System Software Independent single-node OSes (e.g. Linux) connected by distributed dæmons: • Redundant components • Performance hits • Scalability issues • Load-balancing issues

  6. OS’s Collective Operations Many OS tasks are inherently global or collective operations: • Job launching, data dissemination • Context switching • Job termination (normal and forced) • Load balancing

  7. Global Parallel Operating System [Diagram: each of Node 1 and Node 2 runs a local operating system and user-level communication, with per-node resource management, job scheduling, fault tolerance, and parallel I/O; together these form a global parallel operating system providing job scheduling, fault tolerance, communication, parallel I/O, and resource management.]

  8. The Vision • Modern interconnects are very powerful: collective operations, programmable NICs, on-board RAM • Use a small set of network mechanisms as parallel OS infrastructure • Build upon this infrastructure to create unified system software • System software inherits scalability and performance from network features

  9. Example: ASCI Q Barrier [HotI’03]

  10. Parallel OS Primitives • System software built atop three primitives • Xfer-And-Signal • Transfer block of data to a set of nodes • Optionally signal local/remote event upon completion • Compare-And-Write • Compare global variable on a set of nodes • Optionally write global variable on the same set of nodes • Test-Event • Poll local event
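
To make the primitives concrete, here is a hypothetical C interface for them. The type and function names are illustrative only; this is not the actual Quadrics Elan/QsNet API.

    #include <stddef.h>

    typedef struct { int first_node, last_node; } node_set_t;  /* a set of destination nodes */
    typedef int event_t;                                        /* a local event handle */

    /* Xfer-And-Signal: transfer a block of data to a set of nodes, optionally
     * signalling a local and/or remote event upon completion. */
    void xfer_and_signal(node_set_t dests, const void *buf, size_t len,
                         event_t *local_event, event_t *remote_event);

    /* Compare-And-Write: compare a global variable against a value on a set of
     * nodes; returns nonzero if the comparison holds on every node, and can
     * optionally write the variable on the same set of nodes. */
    int compare_and_write(node_set_t nodes, volatile int *global_var,
                          int compare_value, int write_value);

    /* Test-Event: poll a local event; returns nonzero once it has fired. */
    int test_event(event_t *event);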

  11. Core Primitives on QsNet [Diagram: source node S with a source event; destination nodes D1–D4 with destination events] • System software built atop three primitives • Xfer-And-Signal (QsNet): Node S transfers a block of data to nodes D1, D2, D3, and D4 • Events triggered at source and destinations

  12. Core Primitives (cont.) [Diagram: node S queries nodes D1–D4] • System software built atop three primitives • Compare-And-Write (QsNet): Node S compares variable V on nodes D1, D2, D3, and D4 • Is V {<, =, >} Value?

  13. Core Primitives (cont.) [Diagram: node S queries nodes D1–D4 through the network switches] • System software built atop three primitives • Compare-And-Write (QsNet): Node S compares variable V on nodes D1, D2, D3, and D4 • Partial results are combined in the switches

  14. System Software Components: Resource Management, Job Scheduling, Communication Library, Fault Tolerance, Parallel I/O

  15. STORM: Scalable Tool for Resource Management • Inherits scalability from network primitives: • Data dissemination and coordination • Interactive job-launching speeds • Context switching at the millisecond level • Described in [SC’02]
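
A rough sketch of how job launching could be layered on these primitives follows; it assumes the hypothetical interface sketched after slide 10 and illustrates the idea, not STORM's actual code.

    /* Disseminate the binary with Xfer-And-Signal, then detect global
     * completion by polling a flag with Compare-And-Write. */
    volatile int launch_done = 0;        /* set to 1 on each node after exec */

    void launch_job(node_set_t nodes, const void *binary, size_t len)
    {
        event_t sent = 0;

        /* 1. Disseminate the binary image to all nodes in one collective transfer. */
        xfer_and_signal(nodes, binary, len, &sent, NULL);
        while (!test_event(&sent))
            ;                            /* wait for the local transfer to complete */

        /* 2. Poll the global condition "launch_done == 1 on every node";
         *    the write_value argument is unused in this sketch. */
        while (!compare_and_write(nodes, &launch_done, 1, 0))
            ;                            /* in practice, back off between polls */
    }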

  16. State of the Art in Resource Management Resource managers (e.g. PBS, LSF, RMS, LoadLeveler, Maui) are typically implemented using: • TCP/IP, favoring portability over performance • Poorly scaling algorithms for the distribution/collection of data and control messages • Designs that favor development time over performance Scalable performance is not important for small clusters but is crucial for large ones: there is a need for fast and scalable resource management.

  17. Experimental Setup • 64-node/256-processor AlphaServer ES40 cluster • 2 independent network rails of Quadrics Elan3 • Files are placed in a RAM disk to avoid I/O bottlenecks and expose the performance of the resource-management algorithms

  18. Launch Times (Unloaded System) Launch time remains constant as the number of processors increases: STORM is highly scalable

  19. Launch Times (Loaded System, 12 MB) Worst case: 1.5 seconds to launch a 12 MB file on 256 processors

  20. Measured and Estimated Launch Times The model shows that on an ES40-based AlphaServer cluster, a 12 MB binary can be launched in 135 ms on 16,384 nodes

  21. Comparative Evaluation (Measured & Modeled)

  22. System Software Components: Resource Management, Job Scheduling, Communication Library, Fault Tolerance, Parallel I/O

  23. Job Scheduling • Controls the allocation of space and time resources to jobs • HPC apps have special requirements: • Multiple processing and network resources • Synchronization (< 1 ms granularity) • Potential memory hogs with little locality • Has a significant effect on throughput, responsiveness, and utilization

  24. First-Come-First-Serve (FCFS)

  25. Gang Scheduling (GS)

  26. Implicit CoScheduling

  27. Hybrid Methods • Combine global synchronization & local information • Rely on scalable primitives for global coordination and information exchange • First implementation of two novel algorithms: • Flexible CoScheduling (FCS) • Buffered CoScheduling (BCS)

  28. Flexible CoScheduling (FCS) • Measure communication characteristics, such as granularity and wait times • Classify processes based on synchronization requirements • Schedule processes based on class • Described in [IPDPS’03]

  29. FCS Classification [Chart: process classes by granularity and blocking time]
  Class  Granularity  Block times  Scheduling
  CS     Fine         Short        Always gang-scheduled
  F      Fine         Long         Preferably gang-scheduled
  DC     Coarse       Any          Locally scheduled
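
The classification rule implied by this chart can be sketched in C as follows; the thresholds are placeholders, not the values used by FCS.

    typedef enum { CLASS_CS, CLASS_F, CLASS_DC } fcs_class_t;

    fcs_class_t classify(double granularity_ms, double block_time_ms)
    {
        const double FINE_GRAIN_MS = 1.0;   /* placeholder threshold */
        const double LONG_BLOCK_MS = 1.0;   /* placeholder threshold */

        if (granularity_ms >= FINE_GRAIN_MS)
            return CLASS_DC;                /* coarse-grained: schedule locally */
        if (block_time_ms < LONG_BLOCK_MS)
            return CLASS_CS;                /* fine-grained, short waits: always gang-schedule */
        return CLASS_F;                     /* fine-grained, long waits: prefer gang scheduling */
    }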

  30. Methodology • Synthetic, controllable MPI programs • Workload • Static: all jobs start together • Dynamic: different sizes, arrival and run times • Various schedulers implemented: • FCFS, GS, FCS, SB (ICS), BCS • Emulation vs. simulation • Actual implementation takes into account all the overhead and factors of a real system

  31. Hardware Environment • Environment ported to three architectures and clusters: • Crescendo: 32x2 Pentium III, 1GB • Accelerando: 32x2 Itanium II, 2GB • Wolverine: 64x4 Alpha ES40, 8GB

  32. Synthetic Application • Bulk-synchronous, 3 ms basic granularity • Can control granularity, variability, and communication pattern
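
A minimal MPI sketch of such a bulk-synchronous synthetic program is shown below; the command-line handling and busy-wait loop are illustrative, not the actual benchmark code.

    #include <mpi.h>
    #include <stdlib.h>

    static void compute(double ms)           /* burn CPU for roughly ms milliseconds */
    {
        double t0 = MPI_Wtime();
        while ((MPI_Wtime() - t0) * 1000.0 < ms)
            ;
    }

    int main(int argc, char **argv)
    {
        double grain_ms = (argc > 1) ? atof(argv[1]) : 3.0;  /* 3 ms basic granularity */
        int    iters    = (argc > 2) ? atoi(argv[2]) : 1000;

        MPI_Init(&argc, &argv);
        for (int i = 0; i < iters; i++) {
            compute(grain_ms);               /* computation phase */
            MPI_Barrier(MPI_COMM_WORLD);     /* communication/synchronization phase */
        }
        MPI_Finalize();
        return 0;
    }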

  33. Synthetic Scenarios: Balanced, Complementing, Imbalanced, Mixed

  34. Turnaround Time

  35. Dynamic Workloads [JSSPP’03] • Static workloads are simple and offer insights, but are not realistic • Most real-life workloads are more complex • Users submit jobs dynamically, with varying time and space requirements

  36. Dynamic Workload Methodology • Emulation using a workload model [Lublin03] • 1000 jobs, approx. 12 days, shrunk to 2 hrs • Varying load by factoring arrival times • Using the same synthetic application, with random: • Arrival time, run time, and size, based on the model • Granularity (fine, medium, coarse) • Communication pattern (ring, barrier, none) • Recent study with scientific apps (as yet unpublished)
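
The load-factoring step can be illustrated with a small sketch; the job fields and the way they are produced are made up here, since the real workload comes from the Lublin model.

    typedef struct {
        double arrival;   /* arrival time (s) */
        double runtime;   /* requested run time (s) */
        int    size;      /* number of processors */
    } job_t;

    /* Dividing arrival times by load_factor > 1 compresses the arrival
     * stream and therefore raises the offered load. */
    void scale_load(job_t *jobs, int njobs, double load_factor)
    {
        for (int i = 0; i < njobs; i++)
            jobs[i].arrival /= load_factor;
    }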

  37. Load – Response Time

  38. Load – Bounded Slowdown

  39. Timeslice – Response Time

  40. System Software Components: Resource Management, Job Scheduling, Communication Library, Fault Tolerance, Parallel I/O

  41. Buffered CoScheduling (BCS) • Buffer all communications • Exchange information about pending communication every time slice • Schedule and execute communication • Implemented mostly on the NIC • Requires fine-grained heartbeats • Described in [SC’03]

  42. Design and Implementation • Global synchronization • Strobe sent at regular intervals (time slices) • Compare-And-Write + Xfer-And-Signal (Master) • Test-Event (Slaves) • All system activities are tightly coupled • Global Scheduling • Exchange of communication requirements • Xfer-And-Signal + Test-Event • Communication scheduling • Real transmission • Xfer-And-Signal + Test-Event
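
The strobe pattern described above can be sketched with the hypothetical primitives from slide 10; the variable names and time-slice handling here are illustrative, not BCS's actual code.

    volatile int current_slice = 0;    /* global time-slice counter */

    /* Master: confirm that every node reached the current slice
     * (Compare-And-Write), then announce the next one (Xfer-And-Signal). */
    void master_strobe(node_set_t all_nodes, event_t *strobe_event)
    {
        while (!compare_and_write(all_nodes, &current_slice, current_slice, 0))
            ;                                       /* wait for global agreement */
        int next = current_slice + 1;
        xfer_and_signal(all_nodes, &next, sizeof next, NULL, strobe_event);
    }

    /* Slave: poll the local strobe event (Test-Event), then run the phases of
     * the slice: exchange communication requirements, schedule, transmit. */
    void slave_on_strobe(event_t *strobe_event)
    {
        while (!test_event(strobe_event))
            ;
        /* ... exchange requirements, schedule communication, transmit ... */
    }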

  43. Design and Implementation • Implementation in the NIC • Application processes interact with NIC threads • MPI primitive → descriptor posted to the NIC • Communications are buffered • Cooperative threads running in the NIC: • Synchronize • Partial exchange of control information • Schedule communications • Perform real transmissions and reduce computations • Computation/communication completely overlapped
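
The "MPI primitive → descriptor posted to the NIC" step might look like the following; the descriptor layout and the nic_post() queue interface are made-up illustrations, not the actual BCS MPI or Elan data structures.

    #include <stddef.h>

    typedef enum { DESC_SEND, DESC_RECV } desc_op_t;

    typedef struct {
        desc_op_t op;      /* send or receive */
        int       peer;    /* destination or source rank */
        int       tag;
        void     *buf;
        size_t    len;
    } nic_descriptor_t;

    int nic_post(const nic_descriptor_t *d);   /* enqueue a descriptor for the NIC threads */

    /* A non-blocking send only posts a descriptor; the NIC threads buffer it
     * and perform the real transmission during the next communication slice. */
    int bcs_isend(void *buf, size_t len, int dest, int tag)
    {
        nic_descriptor_t d = { DESC_SEND, dest, tag, buf, len };
        return nic_post(&d);
    }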

  44. Design and Implementation • Non-blocking primitives: MPI_Isend/Irecv

  45. Design and Implementation • Blocking primitives: MPI_Send/Recv

  46. Performance Evaluation • BCS MPI vs. Quadrics MPI • Experimental Setup • Benchmarks and Applications • NPB (IS,EP,MG,CG,LU) - Class C • SWEEP3D - 50x50x50 • SAGE - timing.input • Scheduling parameters • 500μs communication scheduling time slice (1 rail) • 250μs communication scheduling time slice (2 rails)

  47. Performance Evaluation • Benchmarks and Applications (Class C)
  Application        Slowdown
  IS (32 PEs)        10.40%
  EP (49 PEs)         5.35%
  MG (32 PEs)         4.37%
  CG (32 PEs)        10.83%
  LU (32 PEs)        15.04%
  SWEEP3D (49 PEs)   -2.23%
  SAGE (62 PEs)      -0.42%

  48. Performance Evaluation • SAGE - timing.input (IA32): 0.5% speedup

  49. Blocking Communication • Blocking vs. non-blocking SWEEP3D (IA32) • MPI_Send/Recv → MPI_Isend/Irecv + MPI_Waitall
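
The transformation named on this slide can be shown as a standard MPI fragment; the buffer names and neighbour-exchange pattern are illustrative, not SWEEP3D's actual code.

    #include <mpi.h>

    /* Blocking form: each transfer is tied to the matching call on the peer. */
    void exchange_blocking(double *out, double *in, int n, int up, int down)
    {
        MPI_Send(out, n, MPI_DOUBLE, down, 0, MPI_COMM_WORLD);
        MPI_Recv(in,  n, MPI_DOUBLE, up,   0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    /* Non-blocking form: post both operations, then wait, so the transfers can
     * be buffered and scheduled within a communication time slice. */
    void exchange_nonblocking(double *out, double *in, int n, int up, int down)
    {
        MPI_Request req[2];
        MPI_Isend(out, n, MPI_DOUBLE, down, 0, MPI_COMM_WORLD, &req[0]);
        MPI_Irecv(in,  n, MPI_DOUBLE, up,   0, MPI_COMM_WORLD, &req[1]);
        MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
    }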

  50. System Software Components: Resource Management, Job Scheduling, Communication Library, Fault Tolerance, Parallel I/O
