
Parallel Computing Systems Part III: Job Scheduling



  1. Parallel Computing Systems Part III: Job Scheduling Dror Feitelson, Hebrew University

  2. Types of Scheduling • Task scheduling • Application is partitioned into tasks • Tasks have precedence constraints • Need to map tasks to processors • Need to consider communications too • Part of creating an application • Job scheduling • Scheduling competing jobs belonging to different users • Part of the operating system

  3. We’ll Focus on Job Scheduling

  4. Dimensions of Scheduling • Space slicing • Partition the machine into disjoint parts • Jobs get exclusive use of a partition • Time slicing • Multitasking on each processor • Similar to conventional systems • Use both together • Use none – batch scheduling on a dedicated machine Feitelson, RC 19790, 1997

  5. Space Slicing • Fixed: predefined partitions • Used on CM-5 • Variable: carve out the number requested • Used on most systems: Paragon, SP, … • Some restrictions may apply, e.g. torus • Adaptive: modify request size according to system considerations • Fewer nodes if more jobs are present • Dynamic: modify size at runtime too

  6. Time Slicing • Uncoordinated: each PE schedules on its own • Local queue: processes allocated to PEs • Requires load balancing • Global queue: provides automatic load sharing • Queue may become a bottleneck • Coordinated across multiple PEs • Explicit gang scheduling • Implicit co-scheduling

  7. Scheduling Framework (diagram: arriving jobs → allocation → terminating jobs) • Partitioning with run-to-completion • Order of taking jobs from the queue • Re-definition of job size

  8. Scheduling Framework (diagram: arriving jobs → preemption → terminating jobs) • Time slicing with preemption • Setting time quanta and priorities • Can jobs migrate or change size when preempted?

  9. Memory Considerations • The processes of a parallel application typically communicate • To make good progress, they should all run simultaneously • A process that suffers a page fault is unavailable for communication • Paging should therefore be avoided

  10. Scheduling Framework (diagram: memory allocation followed by dispatching) • Two stages of scheduling • Or three stages, with swapping

  11. Variable Partitioning

  12. Batch Systems • Define system of queues with different combinations of resource bounds • Schedule FCFS from these queues • Different queues active at prime vs. non-prime time • Sophisticated/complex services provided • Accounting and limits on users/groups • Staging of data in and out of machine • Political prioritization as needed

  13. Example – SDSC Paragon (chart: queue configuration over time, with 16MB and 32MB partitions and a low-priority queue) Wan et al., JSSPP 1996

  14. The Problem • Fragmentation • If the first queued job needs more processors than are available, it must wait for more to be freed • Available processors remain idle during the wait • FCFS (first come, first served) • Short jobs may be stuck behind long jobs in the queue

  15. The Solution • Out of order scheduling • Allows for better packing of jobs • Allows for prioritization according to desired considerations

  16. Backfilling • Allow jobs from the back of the queue to jump over previous jobs • Make reservations for jobs at the head of the queue to prevent starvation • Requires estimates of job runtimes Lifka, JSSPP 1995

  17. Example (diagram: FCFS schedule of jobs 1–4, processors vs. time)

  18. Example (diagram: backfilled schedule of jobs 1–4, with a reservation for job 1; processors vs. time)

  19. Parameters • Order for going over the queue • FCFS • Some prioritized order (Maui) • How many reservations to make • Only one (EASY) • For all skipped jobs (Conservative) • According to need • Lookahead • Consider one job at a time • Look deeper into the queue

  20. EASY Backfilling Extensible Argonne Scheduling System (first large IBM SP installation) • Definitions: • Shadow time: time at which the first queued job can run • Extra processors: processors left over when the first job runs • Backfill if • The job will terminate by the shadow time • The job needs no more than the extra processors Lifka, JSSPP 1995
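The two backfill conditions can be sketched as a small check. This is an illustrative simplification, not the actual EASY code; the `Job` fields and helper names are assumptions.

```python
# Sketch of the EASY backfill test. The Job fields and helper names are
# illustrative assumptions, not the actual EASY implementation.
from dataclasses import dataclass

@dataclass
class Job:
    size: int      # processors requested
    estimate: int  # user runtime estimate

def shadow_info(running, free, first):
    """Return (shadow time, extra processors) for the first queued job.

    running: list of (finish_time, size) pairs for currently running jobs.
    free: processors idle now.  Times are relative to now = 0.
    """
    if free >= first.size:
        return 0, free - first.size   # the first job can start immediately
    avail = free
    for finish, size in sorted(running):
        avail += size
        if avail >= first.size:       # first job can start when this job ends
            return finish, avail - first.size
    raise ValueError("first job can never run")

def can_backfill(job, now, free, shadow, extra):
    if job.size > free:
        return False                  # not enough idle processors right now
    if now + job.estimate <= shadow:
        return True                   # condition 1: done by the shadow time
    return job.size <= extra          # condition 2: uses only extra processors

# 10 processors total: two 4-processor jobs running, 2 idle.
running = [(10, 4), (20, 4)]
first = Job(size=6, estimate=30)
shadow, extra = shadow_info(running, free=2, first=first)
print(shadow, extra)                                   # 10 0
print(can_backfill(Job(2, 8), 0, 2, shadow, extra))    # True: finishes by t=10
print(can_backfill(Job(2, 15), 0, 2, shadow, extra))   # False
```

Note how the second candidate is rejected: it would still be running at the shadow time and there are no extra processors, so letting it run would delay the first queued job.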

  21. First Case (diagram: the backfilled job terminates before the shadow time)

  22. Second Case (diagram: the backfilled job uses only the extra processors)

  23. Properties • Unbounded delay • Backfill jobs will not delay first queued job • But they may delay other queued jobs… Mu’alem & Feitelson, IEEE TPDS 2001

  24. Delay (diagram: schedule of jobs 1–4 before backfilling)

  25. Delay (diagram: a backfilled job delays a later queued job)

  26. Properties • Unbounded delay • Backfill jobs will not delay the first queued job • But they may delay other queued jobs… • No starvation • The delay of the first queued job is bounded by the runtime of current jobs • When it runs, the second queued job becomes first • It is then immune to further delays Mu’alem & Feitelson, IEEE TPDS 2001

  27. User Runtime Estimates • Short estimates allow a job to backfill and skip ahead in the queue • Estimates that are too short risk the job being killed for exceeding its time • So estimates may be expected to be accurate

  28. They Aren’t (chart: user runtime estimates vs. actual runtimes) Mu’alem & Feitelson, IEEE TPDS 2001

  29. Surprising Consequence Performance is actually better if runtime estimates are inaccurate! Experiment: replace user estimates by up to f times the actual runtime (Data for KTH)

  30. Exercise Understand why this happens • Run simulations of EASY backfilling with real workloads • Insert instrumentation to record detailed behavior • Try to find why f=10 is better than f=1 • Try to find why user estimates are so bad

  31. Hint • It may be beneficial to look at different job classes • Example: EASY vs. Conservative • EASY favors small long jobs: can backfill despite delaying non-first jobs • This comes at expense of larger short jobs • Happens more with user estimates than with accurate estimates

  32. Another Surprise Possible to improve performance by multiplying user estimates by 2! (table shows reduction in %)

  33. The MAUI Scheduler Queue order depends on • Waiting time in queue • Promote equitable service • Fair-share status • Political priority • Job parameters • Favor small/large jobs etc. • Number of times skipped by backfill • Prevent starvation • Problem: criteria may conflict, making it hard to predict what will happen Jackson et al., JSSPP 2001

  34. Fair Share • Actually unfair: strive for specific share • Based on comparison with historical data • Parameters: • How long to keep information • How to decay old information • Specifying shares for user or group • Shares are upper/lower bound or both • Handling of multiple resources by maximal “PE equivalents” (usage out of total available)
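The decay of old usage information mentioned above can be illustrated as an exponentially weighted sum. The geometric decay factor of 0.5 is an assumed example for illustration, not Maui's actual parameter:

```python
# Illustration of decaying historical usage for fair share.  The geometric
# decay factor 0.5 is an assumed example, not Maui's actual parameter.

def decayed_usage(history, decay=0.5):
    """history[0] is the most recent interval; older intervals count less."""
    return sum(u * decay ** age for age, u in enumerate(history))

print(decayed_usage([10, 10, 10]))   # 17.5 = 10 + 5 + 2.5
```

A user with steady usage of 10 units per interval thus converges to a bounded "effective usage", while a burst of usage long ago fades away; the two parameters above correspond directly to "how long to keep information" (the list length) and "how to decay old information" (the factor).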

  35. Lookahead • EASY uses a greedy algorithm and considers jobs in one given order • The alternative is to consider a set of jobs at once and try to derive an optimal packing

  36. Dynamic Programming • Outer loop: number of jobs that are being considered • Inner loop: number of processors that are available (table: achievable utilization on 3 processors using only the first 2 jobs) Edi Shmueli, IBM Haifa

  37. Cell Update • If j.size > p, the job is too big to consider: u[j,p] = u[j-1,p] (j is not selected) • Else consider adding job j: u' = u[j-1, p-j.size] + j.size; if u' > u[j-1,p] then u[j,p] = u' (j is selected), else u[j,p] = u[j-1,p] (j is not selected)
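The cell-update rule above is a knapsack-style recurrence with processors as capacity. A minimal sketch, assuming the simple 2D version without reservations (the example job sizes and the backtracking step are illustrative):

```python
# Sketch of the lookahead dynamic program: u[j][q] is the maximal
# utilization achievable on q processors using only the first j jobs.

def max_utilization(sizes, p):
    n = len(sizes)
    u = [[0] * (p + 1) for _ in range(n + 1)]
    for j in range(1, n + 1):            # outer loop: jobs considered
        s = sizes[j - 1]
        for q in range(p + 1):           # inner loop: processors available
            u[j][q] = u[j - 1][q]                     # j not selected
            if s <= q and u[j - 1][q - s] + s > u[j][q]:
                u[j][q] = u[j - 1][q - s] + s         # j selected
    # The bottom-right cell holds the answer; walk back to recover the set.
    chosen, q = [], p
    for j in range(n, 0, -1):
        if u[j][q] != u[j - 1][q]:       # job j must have been selected
            chosen.append(j - 1)
            q -= sizes[j - 1]
    return u[n][p], chosen[::-1]

print(max_utilization([2, 3, 4], 5))     # (5, [0, 1]): jobs of sizes 2 and 3
```

The backtracking pass corresponds to the "path of selected jobs" mentioned on the next slide: wherever a cell differs from the one above it, that job was taken.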

  38. Preventing Starvation • Option I: only use jobs that will terminate by the shadow time • Option II: make a reservation for the first queued job (as in EASY) Requires a 3D data structure: • Jobs being considered • Processors being used now • Extra processors used at the shadow time

  39. Dynamic Programming • In the end the bottom-right cell contains the maximal achievable utilization • The set of jobs to schedule is obtained by the path of selected jobs

  40. Performance • Backfilling leads to significant performance gains relative to FCFS • More reservations reduce performance somewhat (EASY better than conservative) • Lookahead improves performance somewhat

  41. Dynamic Partitioning

  42. Two-Level Scheduling • Bottom level – processor allocation • Done by the system • Balance requests with availability • Can change at runtime • Top level – process scheduling • Done by the application • Use knowledge about priorities, holding locks, etc.

  43. Programming Model • Applications required to handle arbitrary changes in allocated processors • Workpile model • Easy to change number of worker threads • Scheduler activations • Any change causes an upcall into the application, which can reconsider what to run

  44. Equipartitioning • Strive to give all applications equal numbers of processors • When a job arrives, take some processors from each running job • When it terminates, give some to each other job • Fair and similar to processor sharing • Caveats • Applications may have a maximal number of processors they can use efficiently • Applications may need a minimal number of processors due to memory constraints • Reconfigurations require many process migrations (not an issue for shared memory)
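The basic policy can be sketched as follows, honoring the first caveat (per-job maxima) while ignoring minimum-size constraints. The allocation routine and its caps are an illustrative assumption, not a specific system's algorithm:

```python
# Sketch of equipartitioning: divide `total` processors as evenly as
# possible, capping each job at its own maximum usable size.

def equipartition(total, max_sizes):
    alloc = [0] * len(max_sizes)
    remaining = total
    active = set(range(len(max_sizes)))   # jobs not yet at their cap
    while remaining and active:
        share = max(1, remaining // len(active))
        for i in sorted(active):
            give = min(share, max_sizes[i] - alloc[i], remaining)
            alloc[i] += give
            remaining -= give
            if alloc[i] == max_sizes[i]:
                active.discard(i)         # saturated: stop giving it more
            if remaining == 0:
                break
    return alloc

print(equipartition(16, [10, 3, 8]))      # [7, 3, 6]
```

In the example, the job that can only use 3 processors keeps 3, and the surplus is redistributed among the others; rerunning this routine on every arrival and departure yields the take-some/give-some behavior described above.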

  45. Folding • Reduce processor preemptions by selecting a partition and dividing it in half • All partition sizes are powers of 2 • Easier for applications: when halved, multitask two processes on each processor McCann & Zahorjan, SIGMETRICS 1994

  46. The Bottom Line • Places restrictions on programming model • OK for workpile, Cray autotasking • Not suitable for MPI • Very efficient at the system level • No fragmentation • Load leads to smaller partitions and reduced overheads for parallelism • Of academic interest only, in shared memory architectures

  47. Gang Scheduling

  48. Definition • Processes are mapped one-to-one on processors • Time slicing is used for multiprogramming • Context switching is coordinated across processors • All processes are switched at the same time • Either all run or none do • This applies to gangs, typically all processes in a job
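Gang scheduling is commonly explained with an Ousterhout-style matrix: rows are time slots, columns are processors, and a job's processes all sit in one row so the whole gang is dispatched together at that slot. This representation is a standard textbook sketch, not taken from the slides:

```python
# Ousterhout-style scheduling matrix: rows = time slots, columns = processors.
# Placing a whole gang in one row guarantees its processes are switched in
# and out together.

def place_job(matrix, job_id, size):
    """Put a job's processes into the first time slot with enough free cells."""
    for row in matrix:
        free = [c for c, owner in enumerate(row) if owner is None]
        if len(free) >= size:
            for c in free[:size]:
                row[c] = job_id
            return True
    return False          # no slot has room; a real system would add a slot

slots, procs = 3, 4
matrix = [[None] * procs for _ in range(slots)]
place_job(matrix, "A", 3)
place_job(matrix, "B", 2)   # row 0 has only 1 free cell, so B lands in row 1
place_job(matrix, "C", 1)   # C fills the hole in row 0, alongside gang A
for row in matrix:
    print(row)
```

Cycling through the rows implements the coordinated context switch: at each quantum all processors move to the next row simultaneously, so either all of a gang's processes run or none do.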

  49. CoScheduling • Variant in which an attempt is made to schedule all the processes, but subsets may also be scheduled • Assumes “process working set” that should run together to make progress • Does this make sense? • All processes are active entities • Are some more important than others? Ousterhout, ICDCS 1982
