
VGrADS Tools Activities



  1. VGrADS Tools Activities Chuck Koelbel, VGrADS Workshop, February 23, 2006

  2. Tools: Where We Are • Achieved • Initial workflow scheduling methods • “Anirban scheduler” [Rice, UH, UCSD, ISI] • Supported by performance prediction, NWS • Initial fault tolerance implementations • FT-MPI [UTK] • Optimal checkpoint scheduling [UCSB] • Platform-independent application launch and optimization • LLVM, run-time reoptimization experiments [Rice] • Working On It • Virtual Grid scheduling methods • Building workflow DAGs

  3. Ongoing Tools Thrusts • Scheduling methods • Most of the rest of this talk [Rice, UCSD, UCSB, UTK, ISI] • All based on pre-scheduling (aka off-line scheduling) of workflows (aka dataflows, aka DAGs) using performance prediction • Performance prediction • Queue delay model [UCSB] • Other • Launching and reoptimization [Rice] • DAG construction [Rice]

  4. Scheduling Methods • Two-level (choose VG, map onto it) • Richard Huang (UCSD), Anirban Mandal & Ryan Zhang (Rice) • Batch queue (include est. queue delay in cost model) • Anirban Mandal (Rice), Dan Nurmi (UCSB) • Cluster (assign block of tasks to cluster) • Anirban Mandal (Rice) • Provisioning (minimize reservation time + execution time) • Gurmeet Singh (ISI) • Robust (schedule to reduce sensitivity to variability) • Zhiao Shi (UTK)

  5. Scheduling Comparison

  6. Results • Huang - 2-level scheduler, Montage DAG • Mandal - cluster scheduler, EMAN DAG • Shi - robust scheduler, ??? DAG [charts; one labeled "Maximize f"]

  7. Tools Research Going Forward • Interface between vgES and schedulers • What capabilities can schedulers expect from vgES? • How can schedulers exploit this capability? • How can schedulers work around this capability? • Some interesting operating points • vgES provisions VG / application takes what’s given • vgES returns shared VG nodes / application adapts to perf variance • vgES returns queued VG resources / application manages queues • vgES provisions VG, monitors for additional resources / application starts immediately, adapts to changes

  8. Tools Research Going Forward • Generating the vgDL request for 2-level methods • Balance request complexity vs. difficulty of scheduling onto the VG
  VG1 = ClusterOf (node) [1:N] [Rank=Cluster.nodes] {node = [CPU=Opteron]}
  VG2 = ClusterOf (node) [1:N] [Rank=Cluster.nodes*node.clock] {node = [CPU=Opteron]}
  VG3 = ClusterOf (node) [1:N] [Rank=PerfModel(Cluster.nodes,Cluster.bw,node.clock,node.mem)] {node = [CPU=Opteron]}
  • Automatic vgDL generation from DAGs • Template-driven? Heuristic-driven? • Extended vgDL capabilities • Global constraints (e.g. total # of nodes) • Temporal constraints (e.g. available within 60 min) • Probabilistic constraints (e.g. 95% likely to succeed)

  9. Tools Research Going Forward • New scheduling criteria • Deadline scheduling • Economic scheduling • Real-time scheduling • New scheduling situations • Rescheduling • Adapting to new resources • Adapting to resource failures • Incremental scheduling • Managing dynamic applications • “Horizon scheduling” for limited-time predictions • Hybrid static / dynamic scheduling • Contingency scheduling • Static planning for dynamic optimizations

  10. Backup Slides Beyond This Point

  11. Two-Level Scheduling (Huang) • Target Application • Workflows represented by DAG • Performance Metrics • Application Turn-Around Time • Resource Selection • Scheduling Time • Application Makespan • Major Assumptions of the Scheduler • Resources are dedicated • Resources available for duration of application • Scheduling Algorithms (so far) • Greedy • Modified Critical Path

  12. Experimental Setup • Use synthetic resource generator to generate 1000 clusters (33,667 hosts) • Execute one “simple” (greedy) and one “complex” (Modified Critical Path) scheduling heuristic • Tests on Montage DAG

  13. Initial Results • Two-phase scheduling is necessary to avoid excessive scheduling time • Appropriate virtual grids are necessary for better performance • Using the more complex heuristic did not improve performance if you have the appropriate resource abstractions! [charts: Original CCR; CCR = 0.1]

  14. Batch Queue Scheduling (Mandal) • Make batch-queue predictions on-the-fly from the "live" systems • New NWS functionality • Parameterize the performance models using the 95% upper bound on the median prediction as a prediction of delay • The performance models can take into account the amount of time needed to start a computation • Run a top-down (heuristic) scheduler to choose a resource set • Scheduler is smart enough to understand that the start-up delay can be amortized • Joint work with Dan Nurmi and Rich Wolski

  15. Top-Down Scheduling
  Overview: for each heuristic, until all components are mapped, map the available components to resources; then select the mapping with minimum makespan.
  Pseudocode:
  While all available components not mapped
    For each (component, resource) pair
      ECT(c,r) = rank(c,r) + EAT(r)
    End For each
    Run min-min, max-min and sufferage
    Store mapping
  End while
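
A minimal Python sketch of this mapping loop under the min-min heuristic (an illustration, not VGrADS code: rank(c, r) is assumed to be the predicted cost of component c on resource r including communication, EAT(r) the estimated availability time, and DAG dependencies and data staging are left out):

def min_min_map(available, resources, rank, EAT):
    """Map the currently available components, cheapest-first (min-min)."""
    mapping = {}
    eat = dict(EAT)                      # work on a copy of the availability times
    unmapped = set(available)
    while unmapped:
        # Estimated completion time for every (component, resource) pair
        ect = {(c, r): rank(c, r) + eat[r] for c in unmapped for r in resources}
        # For each component, its best resource; then pick the component whose
        # best completion time is smallest (the "min-min" choice)
        best = {c: min(resources, key=lambda r: ect[(c, r)]) for c in unmapped}
        c = min(unmapped, key=lambda c: ect[(c, best[c])])
        r = best[c]
        mapping[c] = r
        eat[r] = ect[(c, r)]             # resource busy until this component finishes
        unmapped.remove(c)
    used = [eat[r] for r in set(mapping.values())]
    makespan = max(used) if used else 0.0
    return mapping, makespan

# Toy usage with a hypothetical cost table standing in for the performance model
if __name__ == "__main__":
    comps = ["c1", "c2", "c3"]
    res = ["r0", "r1"]
    cost = {("c1", "r0"): 5, ("c1", "r1"): 8,
            ("c2", "r0"): 4, ("c2", "r1"): 3,
            ("c3", "r0"): 6, ("c3", "r1"): 7}
    print(min_min_map(comps, res, lambda c, r: cost[(c, r)], {"r0": 0, "r1": 2}))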

  16. Scheduling onto Batch-Queue Systems • Details: Modification of Top-Down scheduler • At every scheduling step, take into account the estimated time the job has to wait in the queue in the estimated completion time for the job [ECT(c,r) in the algorithm] • Keep track of the queue wait times for each cluster and the number of nodes that correspond to the queue wait time • With each mapping, update the estimated availability time [EAT in the algorithm] with the queue wait time, as required

  17. Scheduling onto Batch-Queue Systems: Example [figure: input DAG mapped onto Cluster 0 (resources R0, R1) and Cluster 1 (resources R2, R3) along a timeline T; Queue Wait Time [Cluster 0] = 20 with # nodes for this wait time = 1; Queue Wait Time [Cluster 1] = 10 with # nodes for this wait time = 2]

  18. Scheduling onto Batch-Queue Systems: Example (continued) [figure: same setup as slide 17]

  19. Discussions • Experiments to evaluate EMAN scheduling with batch-queues • Control experiment • Schedule with and without queue-wait estimates, run application with the two schedules on Teragrid and compare turnaround times • Accuracy of the results - how close to actual • Other future issues • Predictive/opportunistic approach • Submit to queues even before data arrives with hope that data arrives by the time job moves to the front of the queue • Point-valued predictions of probabilistic systems are problematic • Need to schedule based on ranges or distributions • Probabilistic deadline scheduling

  20. Cluster Scheduling (Mandal) • Motivation: Scheduler scaling problem for ‘large’ Grids • Idea: Schedule directly onto clusters • Input: • Workflow DAG with restricted structure - nodes at the same level do the same computation • Set of available Clusters (numNodes, arch, CPU speed etc.) and inter-cluster network connectivity (latency, bandwidth) • Per-node performance models for each cluster • Output: • Mapping: for each level the number of instances mapped to each cluster • Objective: • Minimize makespan

  21. Scheduling onto Clusters: Modeling • Abstract modeling of mapping problem for a DAG level • Given: • N instances • M clusters • r1..rM nodes/cluster • t1..tM - rank value per node per cluster (incorporates both computation and communication) • Aim: • To find a partition (n1, n2,… nM) of N such that overall time is minimized with n1+n2+..nM = N • Analytical solution: • No ‘obvious’ solution because of discrete nature of problem
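
One natural formalization of this level-mapping problem, offered only as a sketch: the ceiling term assumes the n_j instances placed on cluster j execute in rounds across its r_j nodes (consistent with the round bookkeeping on the next slide), which is my reading rather than the slide's own formula.

\min_{n_1,\dots,n_M}\; \max_{1 \le j \le M}\; t_j \left\lceil \frac{n_j}{r_j} \right\rceil
\quad \text{subject to} \quad \sum_{j=1}^{M} n_j = N, \qquad n_j \in \mathbb{Z}_{\ge 0}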

  22. Scheduling onto Clusters • Iterative solution • Big picture: • Iterative assignment of tasks to clusters • DP approach
  For each instance, i from 1 to N
    For each cluster, j from 1 to M
      Tentatively map i onto j
      Record the makespan for each j, taking round(j) into account
    End For each
    Find the cluster p with minimum makespan increase
    Map i to p
    Update round(p), numMapped(p)
  End For each
  Complexity: O(#instances * #clusters)
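
A small Python sketch of this per-instance greedy assignment (illustrative only: the round bookkeeping assumes the instances on a cluster run in rounds of r_j at a time, and the naive recomputation makes this O(#instances * #clusters^2) rather than the bound quoted above):

import math

def level_makespan(counts, clusters):
    """Makespan of a level given per-cluster instance counts (rounds of r nodes at a time)."""
    return max((t * math.ceil(n / r)
                for n, (r, t) in zip(counts, clusters) if n > 0), default=0.0)

def assign_level(N, clusters):
    """clusters: list of (r_nodes, t_per_instance). Greedily place N instances."""
    counts = [0] * len(clusters)
    for _ in range(N):                              # place one instance at a time
        best_j, best_ms = None, None
        for j in range(len(clusters)):
            counts[j] += 1                          # tentatively map this instance onto cluster j
            ms = level_makespan(counts, clusters)   # makespan if we keep this mapping
            counts[j] -= 1
            if best_ms is None or ms < best_ms:
                best_j, best_ms = j, ms
        counts[best_j] += 1                         # commit the choice with minimum makespan increase
    return counts

# Toy usage: 10 instances, two clusters (4 nodes at 3s per instance, 2 nodes at 2s per instance)
print(assign_level(10, [(4, 3.0), (2, 2.0)]))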

  23. Scheduling onto Clusters: Evaluation • Application • Representative DAGs from Montage and EMAN with varying widths • Known performance models • Simulation Platform • Resource Model: Synthetic cluster generator (Kee et al SC’04) • Network Model: BRITE to generate network topology, generate latency/bandwidth following a truncated normal distribution • Experiment • Varying number of clusters (nodes) • 250 to 1000 clusters (8.5K to 36K nodes) • Ran three scheduling approaches • Heuristic (min-min/max-min/sufferage heuristics based) • Greedy (simple greedy heuristic based) • Simple (the Cluster level scheduler) • Compared turnaround time (Makespan + Scheduling Time)

  24. Scheduling onto Clusters: Results, Montage Application • Cluster-level scheduler (Simple) offers • Scalability - scales to ‘large’ Grids • Improved turnaround time • No significant degradation of application makespan quality [charts: 717-node Montage DAG; 103-node Montage DAG]

  25. Scheduling onto Clusters: Results, EMAN Application • Cluster-level scheduler (Simple) offers • Scalability - scales to ‘large’ Grids • Improved turnaround time • No significant degradation of application makespan quality [charts: 666-node EMAN DAG; 171-node EMAN DAG]

  26. Robust Task Scheduling (Shi) • Task scheduling: assigning the tasks of a meta-task (workflow-type application) to a set of resources while achieving certain goals, e.g. minimizing the schedule length • The problem is NP-complete; finding an optimal solution is either impossible or impractical • Heuristics (list scheduling, duplication, clustering) and optimization methods (genetic algorithms, simulated annealing, etc.) • Previously we focused on a list scheduling algorithm for the case where processors have different capabilities

  27. Non-deterministic environment • The actual resource environment is inherently non-deterministic due to resource sharing • Previously we used expected values of task execution times and network speeds • The optimal solution for the task scheduling problem with expected values of resource characteristics is NOT optimal for the corresponding problem with non-deterministic values • We focus on variable execution time in this work

  28. Possible Solutions • Static scheduling • Overestimate the execution time to avoid exceeding the allotted machine time, at the expense of machine utilization • Compute schedules for various scenarios and at run time adopt the one which fits the current status • Find schedules more robust to variable execution time • Dynamic scheduling • At each scheduling point (when a task is ready to be executed), gather current resource information and compute a new schedule for the unscheduled tasks

  29. Robustness • Schedule delay • M0(s): makespan of schedule s obtained with the expected execution times • M(s): makespan of schedule s under the actual execution times • Robustness • Each realization of the actual execution times gives a different schedule delay
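
The delay formula itself did not survive the transcript; given the definitions above, it is presumably the gap between the actual and predicted makespans (a reconstruction, not verbatim from the slide):

\mathrm{delay}(s) \;=\; M(s) - M_{0}(s)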

  30. Slack • The slack of a task node is defined as follows: slack(ni) = makespan - [b_level(ni) + t_level(ni)] • Slack is closely related to robustness • A large slack means a task node can tolerate a large increase in execution time without increasing the makespan

  31. Robustness and slack [figure: disjunctive graph] • The disjunctive graph is used to calculate the expected makespan and the real makespan • slack(ni) = makespan - [b_level(ni) + t_level(ni)]

  32. Task execution time modeling • Least Time to Compute (LTC) matrix: {ltcij} • time to compute task i on processor j • generated from a single base value by applying gamma distributions along the two dimensions (machine, task) • different gamma parameters represent different degrees of machine or task heterogeneity • Uncertainty level: {ulij} • expected actual time to compute / least time to compute • generated the same way, with gamma distributions along the two dimensions (machine, task) • Actual computation time: actij = ltcij * ulij
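
A rough Python sketch of this generation scheme, under my reading of "twice gamma distribution on 2 dimensions" as drawing per-task and per-machine gamma multipliers and combining them; the shape parameters and the offset that keeps ul >= 1 are placeholders, not the values used in the study:

import numpy as np

rng = np.random.default_rng(0)

def two_dim_gamma_matrix(base, n_tasks, n_machines, task_shape, machine_shape):
    """Build an n_tasks x n_machines matrix from one base value using
    gamma-distributed multipliers along the task and machine dimensions."""
    task_factor = rng.gamma(task_shape, 1.0 / task_shape, size=(n_tasks, 1))
    machine_factor = rng.gamma(machine_shape, 1.0 / machine_shape, size=(1, n_machines))
    return base * task_factor * machine_factor        # mean stays near `base`

n_tasks, n_machines = 20, 5
ltc = two_dim_gamma_matrix(100.0, n_tasks, n_machines, task_shape=5.0, machine_shape=5.0)
# Uncertainty level is expected-actual / least time, so it must be >= 1;
# the 1.0 + ... offset below is my assumption about how that is enforced.
ul = 1.0 + two_dim_gamma_matrix(0.2, n_tasks, n_machines, task_shape=2.0, machine_shape=2.0)
act = ltc * ul                                        # actual computation time: act_ij = ltc_ij * ul_ij
print(act.shape, act.mean())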

  33. Genetic Algorithm • 1. [Start] Generate an initial population of n chromosomes (suitable solutions for the problem) • 2. [Fitness] Evaluate the fitness f(x) of each chromosome x in the population • 3. [New population] Create a new population by repeating the following steps until the new population is complete • [Selection] Select two parent chromosomes from the population according to their fitness • [Crossover] With a crossover probability, cross over the parents to form new offspring (children); if no crossover is performed, the offspring is an exact copy of the parents • [Mutation] With a mutation probability, mutate the new offspring at each locus (position in the chromosome) • [Accepting] Place the new offspring in the new population • 4. [Replace] Use the newly generated population for a further run of the algorithm • 5. [Test] If the end condition is satisfied, stop and return the best solution in the current population • 6. [Loop] Go to step 2
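
A compact, generic GA skeleton matching these steps (a sketch only; the chromosome encoding, fitness function, and genetic operators actually used by the robust scheduler are not shown on the slide and are stubbed with toy versions here):

import random

def genetic_algorithm(fitness, random_chromosome, crossover, mutate,
                      pop_size=50, p_cross=0.8, p_mut=0.02, generations=200):
    # 1. [Start] generate the initial population
    population = [random_chromosome() for _ in range(pop_size)]
    for _ in range(generations):                        # 6. [Loop] until the end condition holds
        scored = [(fitness(x), x) for x in population]  # 2. [Fitness]
        new_population = []                             # 3. [New population]
        while len(new_population) < pop_size:
            p1 = tournament(scored)                     # [Selection]
            p2 = tournament(scored)
            child = crossover(p1, p2) if random.random() < p_cross else list(p1)   # [Crossover]
            child = [mutate(g) if random.random() < p_mut else g for g in child]   # [Mutation]
            new_population.append(child)                # [Accepting]
        population = new_population                     # 4. [Replace]
    return max(population, key=fitness)                 # 5. [Test] best solution in the current population

def tournament(scored, k=3):
    """Pick the fittest of k randomly sampled (fitness, chromosome) pairs."""
    return max(random.sample(scored, k))[1]

# Toy usage: maximize the number of 1s in a 20-bit string
best = genetic_algorithm(
    fitness=sum,
    random_chromosome=lambda: [random.randint(0, 1) for _ in range(20)],
    crossover=lambda a, b: a[:10] + b[10:],
    mutate=lambda g: 1 - g,
)
print(sum(best), best)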

  34. Single-objective optimization [charts: makespan optimization; robustness optimization]

  35. Multi-objective optimization • Goal: Minimize makespan and maximize robustness at the same time. • Conflict - there cannot be a single optimum solution which simultaneously optimizes both objectives. • Solution – seek balance between the 2 objectives.

  36. Multi-objective optimization • Classical methods • weighted sum • ε-constraint • Weighted sum • scalarizes multiple objectives into a single objective • ε-constraint • optimize one of the objectives, subject to constraints imposed on the other objectives

  37. Weighted sum • Objective function: [equation] • aws: average weighted slack, where ni is scheduled on pj
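
The objective-function equation is lost in the transcript; a generic weighted-sum scalarization of the two objectives named on slide 35 (makespan ms, minimized, and average weighted slack aws, maximized) might look like the following, where the weights and the normalization by baseline values ms0 and aws0 are my assumptions:

\min_{s}\; F(s) \;=\; w_{1}\,\frac{ms(s)}{ms_{0}} \;-\; w_{2}\,\frac{aws(s)}{aws_{0}},
\qquad w_{1} + w_{2} = 1,\; w_{1}, w_{2} \ge 0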

  38. ε-constraint • Objective: • maximize aws (average weighted slack) • subject to: ms < ε*ms0 • Solutions: • feasible (ms < ε*ms0) • infeasible (ms ≥ ε*ms0) • Fitness: [equation]
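
The fitness formula is also missing from the transcript; one common way to penalize infeasible chromosomes in a GA under an ε-constraint is sketched below (an illustration, not necessarily the exact definition used here; P is a hypothetical penalty coefficient):

fitness(s) =
\begin{cases}
aws(s) & \text{if } ms(s) < \varepsilon \cdot ms_{0} \;(\text{feasible})\\
aws(s) - P\,\bigl(ms(s) - \varepsilon \cdot ms_{0}\bigr) & \text{otherwise (infeasible)},\; P > 0
\end{cases}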

  39. Summary • Studied robust scheduling in a non-deterministic environment using a GA • Provided a measurement of robustness • Robust schedules can be generated through optimization of the average weighted slack (AWS) of a task graph • Makespan and robustness are two conflicting objectives • Multi-objective optimization methods are employed • The weighted sum method is easy to use and intuitive; setting up an appropriate weight vector depends on the scaling of each objective function, so normalization of the objectives is usually required • ε-constraint methods let the user optimize one objective while imposing constraints on the other objectives
