Adviser: Frank, Yeong-Sung Lin • Presented by Sean Chou

Optimal service task partition and distribution in grid system with star topology • Gregory Levitin, Yuan-Shun Dai

Agenda • Introduction • The model • Algorithm for determining the pmf of the service time • Numerical example • Conclusions

Introduction • Grid computing is a newly developed technology for complex systems with large-scale resource sharing, wide-area communication, and multi-institutional collaboration. [1] • This is required by a range of collaborative problem-solving and resource-brokering strategies emerging in industry, science, and engineering.

Introduction • The sharing is controlled by a resource management system (RMS) [2]. • When the RMS receives a service request from a user, the task can be divided into a set of execution blocks (EBs) that are executed in parallel. • The RMS assigns those EBs to available resources for execution. • After the resources finish the assigned jobs, they return the results to the RMS.

Introduction • The above grid service process can be approximated by a structure with star topology.

Introduction • The performance of grid computing is of great concern. • Usually the measure of grid performance is the task execution time (service time). • This index can be significantly improved by using the RMS, which divides a task into a set of EBs that can be executed in parallel by multiple online resources. • Many complicated and time-consuming tasks that could not be implemented before are currently working well under the grid computing environment.

Introduction • The service time is a random variable affected by many factors [3]. • There are many resources available online that have different task processing speeds. • Some resources can fail when running the jobs. • The communication links in the grid service can fail during data transmission. • The choice of the group of subtasks assigned to the same EB and running on the same resource can influence the total amount of data transmitted between the RMS and the resource, since different subtasks can use common input data blocks.

Introduction • Most previous researchers separated performance and reliability into two different fields and studied them individually. • In fact, however, performance and reliability are closely related and affect each other, in particular when grid computing is implemented.

Introduction • For example, when a task is fully parallelized into n different EBs executed by n resources simultaneously, the performance is high but the reliability can be low, because failure of any resource makes the entire task incomplete. • Therefore, it is worth having some redundant resources execute the same EB, especially for failure-prone resources. • However, too much redundancy, even though it improves reliability, can decrease performance by not fully parallelizing the task.
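The trade-off on this slide can be illustrated with a minimal sketch. It assumes identical resources, a single combined constant failure rate per resource, work split evenly among EBs, and no data transmission time; the function name and all parameter names are hypothetical and are not taken from the paper.

```python
import math

def task_reliability(total_work, n_ebs, replicas, rate, speed=1.0):
    """Return (EB execution time, task reliability) for a toy configuration.

    Assumptions (not from the slides): work is split evenly into n_ebs EBs,
    every EB is replicated on `replicas` identical resources, each resource
    fails at constant rate `rate`, and data transmission is ignored.
    """
    eb_time = total_work / n_ebs / speed
    # Probability that one resource survives until its EB is finished.
    q = math.exp(-rate * eb_time)
    # An EB succeeds if at least one of its replicas succeeds.
    eb_success = 1.0 - (1.0 - q) ** replicas
    # The task succeeds only if every EB succeeds.
    return eb_time, eb_success ** n_ebs

# Four resources: fully parallel (4 EBs) versus 2 EBs with 2 replicas each.
fully_parallel = task_reliability(4.0, n_ebs=4, replicas=1, rate=0.1)
with_redundancy = task_reliability(4.0, n_ebs=2, replicas=2, rate=0.1)
```

Under these assumptions the redundant configuration takes twice as long per EB but yields a higher task reliability, which is the tension the slide describes.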

Introduction • Performance and reliability should be studied together in the grid service analysis. • The first model for evaluating performance (service time) of a grid with star topology, taking the service reliability into account, was presented in [4].

Introduction • Optimizing the division of a service task into EBs and the distribution of these EBs among available grid resources can considerably improve the service performance. • This paper presents an algorithm for solving these optimization problems based on the model developed in [4].

The model • 2.1. Service execution by the grid system with star architecture • 2.2. Assumptions • 2.3. Service execution time • 2.4. Service reliability and expected performance

The model • Service execution by the grid system with star architecture • Different resources are distributed in the grid system. • The considered service can use a given set of resources. • All the resources and communication channels from this set are available at the time when the request for service arrives at the RMS.

The model • Each resource is directly connected to the RMS by a single communication channel forming the star topology.

The model • The service task consists of subtasks that can be independently executed by different resources. • Different subtasks may need some common input data blocks for their execution. • The subtasks can be grouped into EBs. The input data for any EB consists of input data blocks necessary for executing all the subtasks belonging to this EB.

The model • The request for service (task execution) arrives at the RMS, which forms the EBs and assigns them to different resources for processing. Each resource gets no more than one EB for processing. • The same EB can be assigned to several resources for parallel execution. • If the same EB is processed by several resources, it is completed when the first output is returned to the RMS. • The entire task is completed when all of the EBs are completed and their results are returned to the RMS from the resources.

The model • Assumptions • Each resource starts processing the assigned EB immediately after it gets all the necessary input data from the RMS through the corresponding communication channel. Each resource sends the output data to the RMS through the same communication channel immediately after it completes the EB. • Each resource has a given constant processing speed when it is available. Each resource has a given constant failure rate.

The model • Each communication channel has constant data transmission speed (bandwidth) when it is available. Each communication channel has a constant failure rate. • The subtasks belonging to an EB are processed in sequence. The subtask processing time is proportional to its computational complexity. • The data transmission time is proportional to the amount of data transmitted between the RMS and a resource.

The model • The failure rates of the communication channels or resources are the same when they are idle or loaded (hot standby model). The failures at different resources and communication channels are independent. • The RMS is fully reliable. The time of task processing by the RMS (formation and assignment of EBs, sending them to the resources, receiving the results and integrating them into entire task output) is negligible when compared with the EBs’ processing time.

The model • Service execution time • The entire task consists of m subtasks that can be executed independently. • Any EB i consists of a set s_i of subtasks. • The EB's computational complexity:

The model • Each subtask j needs a set B_j of data blocks as its input and produces an amount O_j of output data. • The set of the input data blocks necessary for execution of EB i is ∪_{j∈s_i} B_j. • The amount of data to be transmitted from the RMS to the resource executing this EB is

The model • The total amount of data (input and output) D_i that should be transmitted between the RMS and a resource executing EB i is
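The formula itself is not reproduced on the slide. A minimal sketch consistent with the surrounding text takes the union of the input blocks needed by the EB's subtasks, so each shared block is counted once, and adds the subtasks' output amounts. The function name and the per-block size table `block_size` are assumptions introduced for illustration.

```python
def eb_data_amount(subtasks, input_blocks, block_size, output_size):
    """Total data D_i exchanged between the RMS and a resource for one EB.

    subtasks:     set of subtask ids grouped into the EB (s_i)
    input_blocks: subtask id -> set of input block ids (B_j)
    block_size:   block id -> amount of data in that block (assumed name)
    output_size:  subtask id -> amount of output data (O_j)
    """
    # Blocks shared by several subtasks of the same EB are sent only once.
    needed = set().union(*(input_blocks[j] for j in subtasks))
    input_amount = sum(block_size[k] for k in needed)
    output_amount = sum(output_size[j] for j in subtasks)
    return input_amount + output_amount

# Hypothetical data: subtasks 1 and 2 share input block 2.
input_blocks = {1: {1, 2}, 2: {2, 3}}
block_size = {1: 10, 2: 20, 3: 5}
output_size = {1: 4, 2: 6}
together = eb_data_amount({1, 2}, input_blocks, block_size, output_size)  # 45
separate = (eb_data_amount({1}, input_blocks, block_size, output_size)
            + eb_data_amount({2}, input_blocks, block_size, output_size))  # 65
```

With these made-up numbers, grouping the two subtasks into one EB transmits less data than running them as separate EBs, because the shared block is sent once.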

The model • The EB execution time is defined as the time from the beginning of input data transmission from the RMS to a resource to the end of output data transmission from the resource to the RMS. • Therefore, the random time t_ij of EB i completion by resource j can take two possible values: one if resource j and communication channel j do not fail until the subtask completion, and another otherwise.

The model • EB i can be successfully completed by resource j if this resource and communication link j do not fail before the end of subtask execution. • For constant failure rates of resource j and communication link j one can obtain the probability of EB success as
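The expressions for the completion time and the success probability are not reproduced on the slides. A minimal sketch that follows from the stated assumptions (processing time proportional to complexity, transmission time proportional to data amount, constant and independent failure rates) is given below. The function and parameter names are hypothetical, and the exponential form is the standard survival probability for a constant failure rate, not a formula quoted from the paper.

```python
import math

def eb_time(complexity, data, speed, bandwidth):
    """Failure-free completion time of an EB on one resource.

    Processing takes complexity / speed and transmission takes
    data / bandwidth, per the proportionality assumptions of the model.
    """
    return complexity / speed + data / bandwidth

def eb_success_probability(t, resource_rate, channel_rate):
    """Probability that neither the resource nor its channel fails within t.

    With constant, independent failure rates, the two survival
    probabilities multiply, so the rates add in the exponent.
    """
    return math.exp(-(resource_rate + channel_rate) * t)

# Hypothetical numbers, for illustration only.
t = eb_time(complexity=100, data=50, speed=10, bandwidth=25)
p = eb_success_probability(t, resource_rate=0.001, channel_rate=0.002)
```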

The model • Assume that each EB i is assigned to resources composing a set ω_i such that ω_i ∩ ω_j = ∅ for any i ≠ j. • The random time of EB i completion is • The entire task is completed when all of the subtasks (including the slowest one) are completed. • The random task execution time takes the form:
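The two expressions are missing from the slide. Based on the model description (an EB is completed when the first of its replicas returns, and the task is completed when all EBs are completed), they can plausibly be written as below. The symbol T_i for the EB completion time is an assumption; Y for the task completion time and h for the number of EBs follow the notation used on later slides.

```latex
\[
T_i = \min_{j \in \omega_i} t_{ij},
\qquad
Y = \max_{1 \le i \le h} T_i
\]
```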

The model • Service reliability and expected performance • In order to estimate both the service reliability and performance of a grid system, different measures can be used depending on the application. • The system reliability R(y) is defined (according to the performability concept [5,6]) as the probability that the correct output is produced in time less than y.

The model • The service reliability is defined as the probability that the service produces correct outputs regardless of the service time. This index can be referred to as R(∞). • The conditional expected service time W is considered to be a measure of its performance.
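These indices can be computed directly once the pmf of the service time is known. The sketch below assumes the pmf is stored as a dictionary mapping time to probability, with failure represented by an infinite time, and it assumes W is the expected service time conditioned on the service not failing. Both the representation and the function names are illustrative choices.

```python
import math

def reliability(pmf, y):
    """Probability that the correct output is produced in time less than y."""
    return sum(p for t, p in pmf.items() if t < y)

def service_reliability(pmf):
    """Probability of a correct output regardless of time (finite outcomes)."""
    return sum(p for t, p in pmf.items() if math.isfinite(t))

def conditional_expected_time(pmf):
    """Expected service time given that the service does not fail."""
    r = service_reliability(pmf)
    if r == 0:
        return math.inf  # the service never succeeds
    return sum(t * p for t, p in pmf.items() if math.isfinite(t)) / r

# Hypothetical pmf: done in 10 or 20 time units, or failed (infinite time).
pmf = {10.0: 0.5, 20.0: 0.3, math.inf: 0.2}
r_15 = reliability(pmf, 15.0)
r_inf = service_reliability(pmf)
w = conditional_expected_time(pmf)
```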

The model • The service task partition into EBs (represented by the sets s_i, 1 ≤ i ≤ h) and the distribution of the EBs among the resources (represented by the sets ω_i, 1 ≤ i ≤ h) determine the service reliability and performance. • Two optimization problems:

Algorithm for determining the pmf of the service time • The procedure used for the evaluation of the service time distribution is based on the universal generating function (u-function) technique. • Its high computational efficiency allows it to be used in optimization procedures where a large number of different solutions must be evaluated.

Algorithm for determining the pmf of the service time • The u-function u_{i,{j}}(z) can define the pmf of the total completion time t_ij for EB i assigned to resource j. • This u-function takes the form of
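The expression is not reproduced on the slide. Since t_ij takes only two values, a plausible form is a two-term polynomial in z. Here \hat{t}_{ij} (the failure-free completion time) and p_{ij} (the probability of success) are assumed symbols, and representing failure by an infinite completion time is an assumption as well.

```latex
\[
u_{i,\{j\}}(z) = p_{ij}\, z^{\hat{t}_{ij}} + (1 - p_{ij})\, z^{\infty}
\]
```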

Algorithm for determining the pmf of the service time • The total completion time of EB i assigned to a pair of resources k and j is equal to the minimum of the completion times for the different resources. • To obtain the u-function representing the pmf of this time, the composition operator with the minimum function should be used:

Algorithm for determining the pmf of the service time • The u-function representing the pmf of the completion time of EB i assigned to all of the resources from set ω_i can be obtained recursively:
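The recursion can be sketched by storing each u-function as a dictionary from completion time to probability and folding a pairwise composition over the resources of one EB. This is an illustrative sketch, not the authors' implementation: the dictionary representation, the function names, and the use of an infinite time for failure are all assumptions.

```python
import math
from collections import defaultdict

def u_single(t, p):
    """u-function of one EB on one resource: time t with prob. p, else failure."""
    return {t: p, math.inf: 1.0 - p}

def compose(u1, u2, op):
    """Combine two independent u-functions with the binary function op.

    Every pair of terms contributes the product of its probabilities to the
    term whose exponent is op(t1, t2); like terms are collected.
    """
    out = defaultdict(float)
    for t1, p1 in u1.items():
        for t2, p2 in u2.items():
            out[op(t1, t2)] += p1 * p2
    return dict(out)

def u_replicated(pairs):
    """u-function of one EB run on several resources (first result wins).

    pairs: list of (failure-free time, success probability), one per resource.
    """
    u = u_single(*pairs[0])
    for t, p in pairs[1:]:
        u = compose(u, u_single(t, p), min)
    return u

# Hypothetical EB replicated on two resources.
u = u_replicated([(10.0, 0.9), (15.0, 0.8)])
```

In this example the EB fails only if both replicas fail, so the probability attached to the infinite time is the product of the two failure probabilities.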

Algorithm for determining the pmf of the service time • Having the u-functions u_{i,ω_i}(z) for each EB i (1 ≤ i ≤ h), one can obtain the u-function representing the pmf of the entire task completion time Y.

Algorithm for determining the pmf of the service time • The final u-function U_h(z) represents the pmf of the random task completion time Y in the form
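The expressions are missing from the slide. Because the task is completed only when every EB is completed, a plausible reconstruction composes the per-EB u-functions with the maximum function and collects like terms. The intermediate symbols U_i(z), the number of distinct outcomes K, and the pairs (q_k, y_k) of probability and completion time are assumed notation.

```latex
\[
\begin{aligned}
U_1(z) &= u_{1,\omega_1}(z), \\
U_i(z) &= U_{i-1}(z) \underset{\max}{\otimes} u_{i,\omega_i}(z), \quad 2 \le i \le h, \\
U_h(z) &= \sum_{k=1}^{K} q_k\, z^{y_k}
\end{aligned}
\]
```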

Algorithm for determining the pmf of the service time • Algorithm for determining service performance/reliability indices for arbitrary task partition and distribution :
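The steps of the algorithm are not listed on the slide. One way to put the preceding slides together is sketched below: compute each EB's time and success probability on every assigned resource, min-compose within an EB, max-compose across EBs, then read the indices off the resulting pmf. The data layout, the function names, and the time and probability formulas are assumptions made for illustration, not the paper's own listing.

```python
import math
from collections import defaultdict
from functools import reduce

def compose(u1, u2, op):
    """Combine two independent {time: probability} u-functions with op."""
    out = defaultdict(float)
    for t1, p1 in u1.items():
        for t2, p2 in u2.items():
            out[op(t1, t2)] += p1 * p2
    return dict(out)

def service_time_pmf(ebs, resources):
    """pmf of the task completion time for a given partition and distribution.

    ebs:       list of (complexity, data amount, [assigned resource ids])
    resources: id -> (speed, bandwidth, resource failure rate, channel failure rate)
    """
    per_eb = []
    for complexity, data, assigned in ebs:
        singles = []
        for j in assigned:
            speed, bandwidth, lam, mu = resources[j]
            t = complexity / speed + data / bandwidth   # failure-free time
            p = math.exp(-(lam + mu) * t)               # success probability
            singles.append({t: p, math.inf: 1.0 - p})
        # An EB is done when its fastest surviving replica returns.
        per_eb.append(reduce(lambda a, b: compose(a, b, min), singles))
    # The task is done when its slowest EB is done.
    return reduce(lambda a, b: compose(a, b, max), per_eb)

def indices(pmf, y):
    """Return (R(y), R(inf), W) for a service-time pmf."""
    r_y = sum(p for t, p in pmf.items() if t < y)
    r_inf = sum(p for t, p in pmf.items() if math.isfinite(t))
    total = sum(t * p for t, p in pmf.items() if math.isfinite(t))
    w = total / r_inf if r_inf > 0 else math.inf
    return r_y, r_inf, w

# Hypothetical system: EB 1 on resources 1 and 2, EB 2 on resource 3.
resources = {1: (10, 5, 0.01, 0.005), 2: (20, 10, 0.02, 0.005), 3: (10, 10, 0.01, 0.01)}
ebs = [(100, 50, [1, 2]), (60, 20, [3])]
pmf = service_time_pmf(ebs, resources)
r_y, r_inf, w = indices(pmf, y=15.0)
```

Because each candidate solution is evaluated with a handful of dictionary compositions, a routine like this is cheap enough to call repeatedly inside a search procedure.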

Numerical example • Formulations (9) and (10) define a complicated NP-complete partitioning/allocation problem. • An exhaustive examination of all possible solutions is not realistic, considering reasonable time limitations.

Numerical example • A heuristic search algorithm is needed which uses only estimates of solution quality and which does not require derivative information to determine the next direction of the search. • The genetic algorithm (GA) has been proven to be an effective optimization tool for a large number of complicated problems in reliability engineering [10,11].

Numerical example • Consider a grid service that uses six resources distributed in the grid system.

Numerical example • The entire service task can be divided into eight independent subtasks.

Numerical example • The amount of data in each input data block is presented in Table 4.

Numerical example • First, the optimal task partition and distribution problem was solved by the GA for formulation (9). • The solutions for different allowed service times y are presented in Tables 5 and 6.

Numerical example • Table 5 contains the obtained task partition into EBs and their distribution among the resources.

Numerical example • Table 6 contains the minimal and maximal possible service times, the service reliability, and the conditional expected service time for each obtained solution.

Numerical example • Functions for the obtained solutions are presented in Fig. 2. • It can be seen that the best solutions obtained for a certain y provide the greatest reliability for this value of service time, whereas for other values of y they provide lower reliability than the solutions obtained for those values.

Numerical example
