
Empirical Evaluation of Shared Parallel Execution on Independently Scheduled Clusters

Mala Ghanesh, Satish Kumar, Jaspal Subhlok. University of Houston. CCGrid, May 2005.


Presentation Transcript


  1. Empirical Evaluation of Shared Parallel Execution on Independently Scheduled Clusters. Mala Ghanesh, Satish Kumar, Jaspal Subhlok. University of Houston. CCGrid, May 2005

  2. Scheduling Parallel Threads. Space Sharing/Gang Scheduling: • All parallel threads of an application are scheduled together by a global scheduler. Independent Scheduling: • Threads are scheduled independently on each node of a parallel system by the local scheduler

  3. Space Sharing and Gang Scheduling [Figure: two schedule grids, one for space sharing and one for gang scheduling, each with nodes N1 to N4 on one axis and time slices T1 to T6 on the other; cells show which thread runs on each node in each time slice.] Threads of application A are a1, a2, a3, a4. Threads of application B are b1, b2, b3, b4.

  4. Independent Scheduling and Gang Scheduling [Figure: two schedule grids, one for independent scheduling and one for gang scheduling, each with nodes N1 to N4 and time slices T1 to T6. Under gang scheduling every time slice runs threads of a single application (a1 a2 a3 a4 or b1 b2 b3 b4); under independent scheduling a time slice can mix threads of both applications (for example a1 a2 b3 a4).]

  5. Gang versus Independent Scheduling. Gang scheduling is the de facto standard for parallel computation clusters. How does independent scheduling compare? + More flexible: no central scheduler required + Potentially uses resources more efficiently - Potentially increases synchronization overhead

  6. Synchronization/Communication with Independent Scheduling [Figure: schedule grid with nodes N1 to N4 and time slices T1 to T6, alternating between rows a1 a2 b3 b4 and b1 b2 a3 a4.] With strict independent round robin scheduling, parallel threads may never be able to communicate! Fortunately scheduling is never strictly round robin, but this is a significant performance issue.
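The worst case on slide 6 can be reproduced with a small sketch. It is a toy model and not the authors' code: it assumes one CPU per node, two applications A and B, and a strict round robin that alternates between them every time slice. The names `running_app`, `coscheduled_slices`, and the per-node `phases` list are hypothetical.

```python
# Toy model (assumption): one CPU per node, two applications, strict
# round robin that alternates between them every time slice.
APPS = ("A", "B")


def running_app(node_phase: int, time_slice: int) -> str:
    """Application whose thread runs on a node during a given time slice."""
    return APPS[(time_slice + node_phase) % len(APPS)]


def coscheduled_slices(phases, app, num_slices):
    """Time slices in which `app` is running on every node at once."""
    return [
        t
        for t in range(num_slices)
        if all(running_app(p, t) == app for p in phases)
    ]


# All four nodes in phase: behaves like gang scheduling.
in_phase = [0, 0, 0, 0]
# Nodes N3 and N4 one slice out of phase, as in the slide's grid.
out_of_phase = [0, 0, 1, 1]

print(coscheduled_slices(in_phase, "A", 6))      # [0, 2, 4]
print(coscheduled_slices(out_of_phase, "A", 6))  # []
```

In the out-of-phase case application A never has all four threads running in the same time slice, which is the situation in which its threads could never communicate under this strict model.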

  7. Research in This Paper How does node sharing with independent scheduling perform in practice ? • Improved resource utilization versus higher synchronization overhead ? • Dependence on application characteristics ? • Dependence on CPU time slice values ?

  8. Experiments All experiments with NAS benchmarks on 2 clusters Benchmark programs executed: • Dedicated mode on a cluster • With node sharing with competing applications • Slowdown due to sharing analyzed Above experiments conducted with • Various node and thread counts • Various CPU time slice values

  9. Experimental Setup • Two clusters are used: • 10 node, 1 GB RAM, dual Pentium Xeon processors, RedHat Linux 7.2, GigE interconnect • 18 node, 1 GB RAM, dual AMD Athlon processors, RedHat Linux 7.3, GigE interconnect • NAS Parallel Benchmarks 2.3, Class B MPI Versions • CG, EP, IS, LU, MG compiled for 4, 8, 16, 32 threads • SP and BT compiled for 4, 9, 16, 36 threads • IS (Integer Sort) and CG (Conjugate Gradient) are the most communication intensive benchmarks. • EP (Embarrassingly Parallel) has no communication.

  10. Experiment # 1 • NAS Benchmarks compiled for 4, 8/9 and 16 threads • Benchmarks first executed in dedicated mode with one thread per node • Then executed with 2 additional competing threads on each node • Each node has 2 CPUs, so a minimum of 3 total threads is needed to cause contention • Competing load threads are simple compute loops with no communication • Slowdown (percentage increase in execution time) plotted • Nominal slowdown is 50%, used for comparison as the gang scheduling slowdown
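One way to arrive at the 50% nominal figure is sketched below. It assumes that all threads on a node are compute bound and share the CPUs fairly, so 3 threads on 2 CPUs each get 2/3 of a CPU and run 1.5 times longer. The function names are hypothetical, and the second function simply restates the slide's definition of slowdown as a percentage increase in execution time.

```python
def nominal_slowdown_pct(total_threads: int, cpus: int) -> float:
    """Slowdown under fair sharing (assumption: all threads compute bound)."""
    if total_threads <= cpus:
        return 0.0  # every thread has its own CPU, no contention
    # Each thread gets cpus/total_threads of a CPU, so time grows by the inverse.
    return (total_threads / cpus - 1.0) * 100.0


def measured_slowdown_pct(dedicated_time: float, shared_time: float) -> float:
    """Percentage increase in execution time when sharing nodes."""
    return (shared_time - dedicated_time) / dedicated_time * 100.0


# 1 application thread + 2 load threads on a dual-CPU node.
print(nominal_slowdown_pct(total_threads=3, cpus=2))  # 50.0
# Hypothetical timings: 100 s dedicated, 150 s shared.
print(measured_slowdown_pct(100.0, 150.0))            # 50.0
```

The same arithmetic covers the later configuration with 2 application threads and 1 load thread per node, since it also puts 3 threads on 2 CPUs.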

  11. Results: 10 node cluster [Figure: bar chart of percentage slowdown (0 to 80) per benchmark (CG, EP, IS, LU, MG, SP, BT, Avg) for 4 nodes and 8/9 nodes, with a reference line marking the expected slowdown with gang scheduling.] • Slowdown ranges around 50% • Some increase in slowdown going from 4 to 8 nodes

  12. Slowdown Results: 18 node cluster • Broadly similar • Slow increase in slowdown from 4 to 16 nodes

  13. Remarks • Why is slowdown not much higher ? • Scheduling is not strict round robin – a blocked application thread will get scheduled again on message arrival • leads to self synchronization - threads of the same application across nodes get scheduled together • Applications often have significant wait times that are used by competing applications with sharing • Increase in slowdown with more nodes is expected as communication operations are more complex • The rate of increase is modest

  14. Experiment # 2 • Similar to the previous batch of experiments, except… • 2 Application threads per node • 1 load thread per node • Nominal slowdown is still 50%

  15. Performance: 1 and 2 app threads/node [Figure: bar chart of percentage slowdown (0 to 80) per benchmark (CG, EP, IS, LU, MG, SP, BT, Avg) for four configurations: 1 app thread per node on 4 nodes; 2 app threads per node on 4/5 nodes; 1 app thread per node on 8/9 nodes; 2 app threads per node on 8 nodes. A reference line marks the expected slowdown with gang scheduling.] Slowdown is lower for 2 threads/node

  16. Performance: 1 and 2 app threads/node [Figure: same bar chart as the previous slide.] • Slowdown is lower for 2 threads/node • competing with one 100% compute thread (not 2) • scaling a fixed size problem to more threads means each thread uses CPU less efficiently • hence more free cycles available

  17. Experiment # 3 • Similar to the previous batch of experiments, except… • CPU time slice quantum varied from 30 to 200 ms. • (default was 50 msecs) • CPU time slice quantum is the amount of time a process gets when others are waiting in ready queue • Intuitively, longer time slice quantum means • a communication operation between nodes is less likely to be interrupted due to swapping – good • a node may have to wait longer for a peer to be scheduled, before communicating - bad
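The two opposing effects of the time slice quantum can be put into rough numbers. This is a back-of-the-envelope sketch under strong assumptions that are not from the talk: a single CPU, strict round robin, and every runnable thread using its full quantum. The function name and the default of 2 runnable threads are hypothetical.

```python
def quantum_tradeoff(quantum_ms: float, runnable_threads: int = 2):
    """Return (preemptions per second, worst-case wait in ms) for one CPU.

    Assumes strict round robin where every thread uses its whole quantum.
    """
    # Longer quantum: a running thread is interrupted less often (good).
    preemptions_per_sec = 1000.0 / quantum_ms
    # Longer quantum: a thread that just lost the CPU waits longer for
    # its next turn, and so does any peer waiting to talk to it (bad).
    worst_case_wait_ms = (runnable_threads - 1) * quantum_ms
    return preemptions_per_sec, worst_case_wait_ms


# Quanta spanning the range used in Experiment # 3.
for q in (30, 50, 100, 200):
    rate, wait = quantum_tradeoff(q)
    print(f"quantum={q} ms: {rate:.1f} preemptions/s, worst-case wait {wait} ms")
```

Real schedulers also wake blocked threads on message arrival, so the actual wait is usually shorter than this worst case.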

  18. Performance with different CPU time slice quanta [Figure: bar chart of percentage slowdown (0 to 100) per benchmark (CG, EP, IS, LU, MG, SP, BT) for CPU time slices of 30 ms, 50 ms, 100 ms and 200 ms.] • Small time slices are uniformly bad • Medium time slices (50 ms and 100 ms) generally good • Longer time slice good for communication intensive codes

  19. Conclusions • Performance with independent scheduling competitive with gang scheduling for small clusters. • Key is passive self synchronization of application threads across the cluster • Steady but slow increase in slowdown with larger number of nodes • Given the flexibility of independent scheduling, it may be a good choice for some scenarios

  20. Broader Picture: Distributed Applications on Networks: Resource selection, Mapping, Adapting [Figure: an application with components labeled Model, Data, Sim 2, Vis, Sim 1, Pre and Stream, shown alongside a network.] Which nodes offer best performance?

  21. End of Talk! FOR MORE INFORMATION: www.cs.uh.edu/~jaspal, jaspal@uh.edu

  22. Mapping Distributed Applications on Networks: “state of the art” [Figure: an application with components labeled Model, Data, Sim 2, Vis, Sim 1, Pre and Stream.] Mapping for Best Performance • Measure and model network properties, such as available bandwidth and CPU loads (with tools like NWS, Remos) • Find “best” nodes for execution based on network status • But the approach has significant limitations… • Knowing network status is not the same as knowing how an application will perform • Frequent measurements are expensive, less frequent measurements mean stale data

  23. Discovered Communication Structure of NAS Benchmarks [Figure: one communication graph over threads 0 to 3 for each of BT, CG, IS, LU, MG, SP and EP.]

  24. CPU Behavior of NAS Benchmarks
