
Enhancement of Reliability and Dynamic Load Balancing for Distributed Parallel Computations.






Presentation Transcript


  1. Enhancement of Reliability and Dynamic Load Balancing for Distributed Parallel Computations. Galyuk² Yu.P., Memnonov¹ V.P., and Zolotarev² V.I. ¹St. Petersburg State University, Inst. Math. Mech., Universitetskij pr. 28, St. Petersburg, 198504, Russia, pokusa@star.math.spbu.ru ²Petrodv. Telecommun. Center, Ulyanovskaya st. 1, St. Petersburg, 198504, Russia.

  2. ABSTRACT For distributed multicluster computations connected through metacomputing or Grid, one is always confronted with a reliability problem: the trouble-free operation of someone else's remote hardware. In this paper that problem is resolved with the help of purely algorithmic procedures, developed by us for distributed computations under MPI, which provide fault detection and fault management entirely within the application. Additionally, by monitoring all stages of the computation on all the clusters, and on the nodes inside them, dynamic load balancing of the whole computational system is achieved through automatic redistribution of load from lagging processors to the others. The paper describes these algorithms and gives some evaluation of their performance and cost. They are shown to enhance the reliability of Monte Carlo simulations, only partly diminishing the total statistical sample in the case of a computer breakdown in any cluster, and they can also be built into other user programs. If checkpointing is provided in the user program, the sample losses are diminished considerably by a modification of our fault-tolerant procedures appropriate for this case, which is also described in the paper. The dynamic load balancing system developed in the paper noticeably reduces the idling time of processors and even of whole clusters.

  3. Introduction Various Grid projects such as "Damien", "Nordugrid" or "Crossgrid" are good examples of the growing importance of such distributed computations. But at the same time some specific issues related to the trouble-free operation of someone else's remote hardware come into play, demanding the development of adequate precautions. First of all it should be noted in this connection that the Globus tool "Heartbeat monitor" (http://www-unix.globus.org/), originally designed for monitoring Globus system processes exclusively, has been expanded to allow simultaneous monitoring of both Globus system processes and the application processes associated with user computations. A more advanced system of this kind was also suggested by D.W. Lee et al. [1]. Yet no fault management services were supplied in either of them. On the other hand, the IBM middleware OptimalGrid (http://www.alphaworks.ibm.com/tech/optimalgrid/) is expected to appear in the near future and promises to address system fault tolerance and recovery among many other things, though in order to use it the user application must as yet be rewritten in Java. The papers of V. Alexandrov et al. [2] and J.M.D. Hill et al. [3] should also be mentioned in this connection. More general projects have now appeared, such as the Fault Tolerant MPI of the University of Tennessee, and other general solutions: S. Louca et al. [4], H.M. Lee et al. [5]. Yet they do not seem quite ready for user practice, and it may also happen that they will be too expensive, demanding superfluous computing and communication resources. In the present paper, in the hope of reaching less expensive solutions, the fault tolerance issues for multicluster computational systems are solved by automatic algorithmic procedures built entirely within the application, for a specific but rather large class of Monte Carlo simulations.

  4. Monte Carlo simulations • Large statistical scattering, producing the error r_N, is a standard difficulty in all applications of Monte Carlo simulations, but it is especially severe for flows whose mean gas velocities are small relative to the thermal molecular speeds. The statistical scattering r_N in our filter problem, which is of this kind, with the variance being D(u) = 0.5 V_T^2, has been strongly diminished
  • r_N = x [D(u)/N]^{1/2}   (1)
  • by enlarging the number of computers and the sample N through the use of several parallel clusters, sometimes connected through the Internet into a metacomputer. In this case separate parallel MPI implementations of the DSMC method for this problem were employed on three clusters, operated under different interconnects: 100 Mb/s Ethernet, 1 Gb/s SCI and 1 Gb/s Myrinet. All of them had distributed memory; two were linked into a local net by a 100 Mb/s channel, while the Internet connection with the third cluster, "Paritet" of the Institute for High Performance Computing and Data Base (P-cluster), was realized through a 1 Mb/s Internet channel. Each processor produced independent realizations; the results were then sent to the corresponding leading processor and finally gathered at one of them for averaging and output. A special dynamic load balancing technique for our distributed memory conditions was developed by Yu. Galyuk et al. [8] in order to improve the efficiency of processor employment and maintain optimal performance of each processor at our disposal. Though quantitative performance estimates confirmed the very good efficiency of this technique, it could do nothing about computer breakdowns in the clusters, which were not so rare. Therefore another technique was developed to diminish the catastrophic consequences of such accidents for the simulations.
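The error bound (1) can be illustrated with a small self-contained sketch (the sampling problem, seed, and function name are illustrative assumptions, not the paper's DSMC code): estimating a mean from N Gaussian samples of known variance and comparing the observed departure with r_N = x [D(u)/N]^{1/2}, with x = 3 corresponding to the 0.997 confidence level mentioned later in the paper.

```python
import random
import statistics

def mc_error_estimate(n_samples, variance, x=3.0, seed=1):
    """Estimate a zero mean by Monte Carlo sampling and compare the
    observed departure with the bound r_N = x*(D(u)/N)**0.5 of eq. (1).
    The Gaussian toy model and the seed are illustrative assumptions."""
    rng = random.Random(seed)
    sigma = variance ** 0.5
    samples = [rng.gauss(0.0, sigma) for _ in range(n_samples)]
    observed = abs(statistics.fmean(samples))   # true mean is 0
    bound = x * (variance / n_samples) ** 0.5   # eq. (1), x = 3 -> 99.7% level
    return observed, bound

obs, bnd = mc_error_estimate(10_000, variance=0.5)
```

Note how the bound scales as N^{-1/2}: quadrupling the sample only halves r_N, which is why the paper enlarges N by adding whole clusters.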

  5. Problem description We consider the flow through a two-dimensional channel, infinite in the Z-direction, which connects two reservoirs containing the same gas, their particle densities being n_1 on the left and n_2 = 0.8 n_1 on the right. Molecular interaction is assumed to be of the hard-sphere type, and interaction with the channel walls is taken to be diffuse reflection. In both reservoirs the temperatures and Maxwell velocity distribution functions are considered the same and unperturbed by the outgoing streams. With the help of the DSMC method we computed profiles of the mean velocity and density inside the channel, along with some other parameters.

  6. Testing of Pseudorandom Sequences via Numerical Simulation of a Problem with Known Solution. • Statistical Monte Carlo methods are now widely used for numerical simulation of different problems in science and engineering. An important part of these methods is the use of pseudorandom sequences to simulate different statistical distributions and the outcomes of certain events. These sequences are usually produced by special computer programs and thus only approximate true random sequences. The quality of this approximation is checked by different local and global tests [9]-[10], yet positive results are never exhaustive. So it is always advisable to check a generator additionally by numerical simulation of a problem with an exact solution in the particular application field. In the present paper we used for this purpose a numerical simulation of the energy redistribution between translational and rotational degrees of freedom of molecules through their collisions in a volume, with mirror reflection on the walls. We produced a parallel version of the source program of Bird [11] and first of all compared linear multiplicative congruential generators
  • Z_{i+1} = A * Z_i (modulo M)
  • with different multipliers A and period M = 2^31 - 1, and another generator with the very large period 2^126, with M = 2^128 and A = 5^100109.
  • In the final state, departures of the rotational and translational temperatures from the known equilibrium values may exist, but only because of limitations on the size of the statistical sample N used. This sample departure error is evaluated by the above-mentioned Monte Carlo expression (1). So if the temperature departures obtained in the simulations are larger than this value, then the generator utilized is bad with probability 0.997.
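A minimal sketch of the multiplicative congruential recurrence above, with a basic uniformity check. The multiplier A = 16807 with M = 2^31 - 1 is the classical Park-Miller choice, used here only as an illustrative example; the paper compares several multipliers, and its equilibrium-temperature test is far more stringent than the simple chi-square test sketched here.

```python
class LCG:
    """Multiplicative congruential generator Z_{i+1} = A*Z_i (mod M).
    A = 16807, M = 2**31 - 1 (Park-Miller) is an illustrative choice."""
    def __init__(self, seed=1, a=16807, m=2**31 - 1):
        self.z, self.a, self.m = seed % m, a, m

    def next_u(self):
        self.z = (self.a * self.z) % self.m
        return self.z / self.m          # uniform in (0, 1)

def chi_square_uniformity(gen, n=100_000, bins=10):
    """Chi-square statistic of binned draws against a flat histogram;
    for a healthy generator it should be of order bins - 1."""
    counts = [0] * bins
    for _ in range(n):
        counts[min(int(gen.next_u() * bins), bins - 1)] += 1
    expected = n / bins
    return sum((c - expected) ** 2 / expected for c in counts)

stat = chi_square_uniformity(LCG(seed=12345))
```

A local test like this can pass for a generator that still fails in a physical simulation, which is exactly the paper's motivation for testing against a problem with a known solution.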

  7. Computer Breakdown Resistant Multicluster DSMC Simulations Our technique includes two algorithmic procedures, displayed as a scheme in Fig. 1 (for reasons of space, only for the two main outputting clusters) and explained in what follows. First, for monitoring the state of any processor participating in a particular Monte Carlo simulation, the node information from all of the clusters is needed. So every script starting the simulation on a cluster simultaneously transmits this node information to the others; this is shown by the arrows at the top of the scheme in Fig. 1. In the present parallel program every processor receives a task n_i according to its capacity, and after producing n_i independent realizations the monitoring service comes into play. To begin it, any processor of either of the two leading clusters, after completing its task, opens the special control file TERM.myrank and simultaneously determines, with the help of the command "ls TERM.* | wc -l", how many such files have already been opened by the others, in order to find out whether it is the last in this cluster. If so, it checks through the system command fping the current state of the processors in all of the other clusters participating in this problem solution. Thus the communication cost of this monitoring is O(2K), with K being the total number of computers. If it discovers a computer breakdown, our fault management procedure comes into force, directing the further development of the simulation through its system commands into the emergency algorithmic branch of the application, which disconnects all relations with the "misfortune" cluster; from then on the simulation does not depend upon the possible results of the latter. In the normal case the final results of all clusters are copied to the two main clusters for averaging and output. This doubling is again for the sake of reliability enhancement.
The simulations of the above-mentioned filter problem confirmed the correct operation of the algorithms in both the normal and the emergency cases.
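The control-file part of this monitoring can be sketched as follows (a minimal sketch, not the authors' implementation: the directory handling and function name are assumptions, and the fping check of remote clusters is deliberately left out as a stub, since it is a system command run only by the last finisher).

```python
import glob
import os
import tempfile

def finish_and_check(workdir, myrank, nprocs):
    """On completing its task, a processor creates the control file
    TERM.<myrank> and counts the TERM.* files already present (the
    paper uses a wc -l pipeline for this).  The processor that sees
    all nprocs files is the last finisher in the cluster and is the
    one that should run the remote-cluster check via fping (stubbed)."""
    open(os.path.join(workdir, f"TERM.{myrank}"), "w").close()
    n_done = len(glob.glob(os.path.join(workdir, "TERM.*")))
    return n_done == nprocs   # True only for the last processor

# Simulate four processors finishing in rank order in a scratch directory.
workdir = tempfile.mkdtemp()
flags = [finish_and_check(workdir, rank, 4) for rank in range(4)]
```

Each processor touches one file and reads a directory listing, so the cost grows linearly with the number of machines, matching the O(2K) estimate in the text.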

  8. Quantitative evaluation of the presented technique In order to get a quantitative evaluation of the performance of the presented scheme, it is desirable to have a quantity that combines both the Monte Carlo error (1) and the computational cost of the corresponding simulation. So let us consider the inverse error Ier = 1/r_N. This value may be understood as the amount of quantitative uncertainty reduced by the carried-out simulation, which used up the computational cost C_j = P Q t_j for that purpose. Here P is the number of clusters (for simplicity now considered to have equal performances), Q is the number of floating point operations per second and, finally, t_j is the total time of the problem simulation. Let us introduce also a specific cost SC_j = C_j / Ier, or SC_j = C_j r_N, which is the computational cost of reducing one unit of quantitative uncertainty. Observing that the sample N can now be expressed as N = P M, where M is the sample quantity produced by one cluster, we have, by making use of expression (1), SC_j = P^{1/2} t_j L, with the P-independent quantity L = x Q (D(u)/M)^{1/2}. So if one wants, in spite of one possible computer failure, to get an accuracy not worse than r_N, it is enough to take, for instance in the Grid, P + 1 clusters. This increases the specific cost by dSC_j = t_j L / (2 P^{1/2}), which is the difference between its values for P + 1 and P, and for the corresponding relative increase dSC_j / SC_j one obtains the dependence 1/(2P). In order to appreciate this result properly it is useful, for instance, to visit http://www.nordugrid.org/monitor/ and, at the LDAP server, see the computational resources employed by Nordugrid users.
Most of the cluster resources are occupied by their native users, and only a small part is employed by the users of Nordugrid. Thus, to obtain even as few as 50 processors, a Nordugrid user may have to start his application at perhaps 5 different clusters of Nordugrid. Then the relative cost increase of employing our reliability algorithm is quite low. And more than that: in the case of a computer breakdown among the small part of a cluster given to the Grid, will any cluster administrator bother much to restart the whole cluster with the large-scale migration procedures demanded when employing, for example, the Dynamite system, which moreover is not ready for MPI? So reliability issues for the Grid are a problem for the user himself!
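The first-order estimate dSC_j/SC_j ≈ 1/(2P) can be checked numerically; t_j and L cancel in the ratio, so a two-line sketch suffices (the function name is ours, chosen for illustration).

```python
import math

def relative_cost_increase(p):
    """Exact relative increase of the specific cost SC_j = P**0.5 * t_j * L
    when going from P to P + 1 clusters; the factors t_j and L cancel."""
    return (math.sqrt(p + 1) - math.sqrt(p)) / math.sqrt(p)

# For P = 5 clusters (the slide's Nordugrid example) the exact value is
# about 0.0955, close to the first-order approximation 1/(2*5) = 0.1.
exact = relative_cost_increase(5)
approx = 1 / (2 * 5)
```

The agreement improves as P grows, so for realistic Grid runs the extra cluster's cost overhead is indeed small.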

  9. Modification of the scheme for checkpointing employment In our simulation one processor produces many independent realizations of the problem solution, so that between two checkpoints it completes several of them. For this case we have developed modified procedures which, when built into the user program, save the intermediate computational results in a special file on the cluster disk, as is usual for checkpointing; in a failure event at that cluster this file is automatically copied to the two main outputting clusters by our failure management system commands and afterwards profitably used to enlarge the total sample. A fragment of this modified technique is represented in Fig. 2. In this case each processor of the cluster carries out its task as in the previous scheme. But now the leader of each cluster, with myrank equal to 0, after the MPI_Reduce operation at the checkpoint (which simultaneously synchronizes the work of the processors), saves the results obtained for the current interval on disk. This is shown in the second rectangle. Then at this checkpoint the monitoring procedure begins and proceeds as before; this is shown in Fig. 2 by the arrows at the lowest boxes. If a computer breakdown is detected somewhere, the results accumulated on disk before the failure are automatically transferred from the "misfortune" cluster to the two main outputting clusters by their failure management system commands. There they are added to the already accumulated results, thus substantially reducing the failure consequences.
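The save-and-salvage logic can be sketched as below. This is a minimal sketch under stated assumptions: the file name, JSON format, and function names are ours, and ordinary directories stand in for the cluster disks (the paper copies between clusters with system commands).

```python
import json
import os
import shutil
import tempfile

def save_checkpoint(cluster_dir, interval, partial_sums):
    """Leader (myrank == 0) saves the results accumulated up to the
    current checkpoint interval into a special file on the cluster disk."""
    path = os.path.join(cluster_dir, "checkpoint.json")
    with open(path, "w") as f:
        json.dump({"interval": interval, "sums": partial_sums}, f)
    return path

def salvage_on_failure(failed_cluster_dir, main_cluster_dirs):
    """On a detected breakdown, copy the failed cluster's checkpoint
    file to both main outputting clusters, so the sample accumulated
    before the failure is not lost (doubling is for reliability)."""
    src = os.path.join(failed_cluster_dir, "checkpoint.json")
    copies = []
    for d in main_cluster_dirs:
        dst = os.path.join(d, "salvaged_checkpoint.json")
        shutil.copy(src, dst)
        copies.append(dst)
    return copies

# Hypothetical directories standing in for the three cluster disks.
fail_dir, main1, main2 = (tempfile.mkdtemp() for _ in range(3))
save_checkpoint(fail_dir, 3, [1.0, 2.0])
copies = salvage_on_failure(fail_dir, [main1, main2])
```

The salvaged sums are then simply added to each main cluster's own accumulated results before the final averaging.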

  10. Scheme fragment for checkpointing

  11. Dynamical load balancing for distributed computations Distributed computations are often realized on several parallel clusters with quite different performances and running maintenance; this is particularly so for Grid computations. Thus initial static load balancing may not be sufficient, and after some perturbations of the performance regimes the idle waiting times may become too long, even prohibitive, when the application involves interconnections with exchange of results between clusters. In order to prevent such cases, a special algorithmic system was developed which saves information on the consecutive stages of the computational jobs in special files at all the clusters, monitors it on appropriate occasions, and automatically redistributes the remaining load when necessary. Indeed, in Monte Carlo simulations one produces a large number of independent realizations of the same processes. Thus, for example, the number of realizations R_i which remain to be carried out by a particular cluster i, or the corresponding quantity n_j for processor j, are natural measures of the progress of their computational jobs. So, when the quickest processor k finishes its initial task, then by simply sorting the remaining realization counts n_j of the other processors it finds the most backward processor among them, with value n_mx, and automatically takes on part of its load. At the same time, by adding all the n_j values and writing their sum S_i into a special file, it prepares the information for load balancing among the clusters themselves at a neighboring checkpoint. The second quickest processor does the same, and so on, keeping the data fresh. By introducing a threshold value m_tr, below which some backwardness is tolerable, one can express this as the following algorithm:

  12. Algorithm for dynamic load balancing
  if n_mx > m_tr then
      the quickest processor k takes the new load n_k = n_mx - m_tr
      and with this load n_k returns to computations
  else
      it enters a waiting state until all other processors finish their jobs
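The redistribution step above can be sketched in a few lines (a sketch, not the authors' MPI code: the list-based bookkeeping and function name are illustrative assumptions).

```python
def rebalance(remaining, k, m_tr):
    """One dynamic load balancing step: the quickest processor k
    (remaining[k] == 0) finds the most backward processor, with the
    largest remaining count n_mx.  If n_mx > m_tr, processor k takes
    over n_mx - m_tr realizations, leaving that processor with m_tr;
    otherwise k just waits.  Returns the number of realizations taken."""
    j_mx = max(range(len(remaining)), key=lambda j: remaining[j])
    n_mx = remaining[j_mx]
    if n_mx > m_tr:
        remaining[k] = n_mx - m_tr   # new load for the quick processor
        remaining[j_mx] = m_tr       # the slow one keeps only m_tr
        return remaining[k]
    return 0                         # below threshold: processor k waits

loads = [0, 40, 12, 7]               # remaining realizations per processor
taken = rebalance(loads, k=0, m_tr=10)
```

After the step the slowest processor's backlog is capped at m_tr, which is what flattens the distribution of finishing times.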

  13. Monitoring of starvation Here one also monitors the possible starvation of some processor: a processor may respond to the monitoring command fping but at the same time produce no computations. The simplest way to define this state quantitatively is to check, after sorting, the condition for the most backward processor: if n_mx is much greater than m_tr, then there is a starvation case. In that case we disconnect all relations inside the cluster to this processor, if possible. If it is not possible, for example because the communicator MPI_COMM_WORLD is not yet dynamically changeable, then it is better to disconnect all relations to this cluster, and the simulation will from then on not depend upon its possible results. If there is no starvation, the second quickest processor goes through the same procedure, and so on. For dynamic load redistribution among the clusters we do the same with the help of the S_j value and a suitable threshold value M_tr. In this way a flattening of the distribution of finishing times, for all the processors inside a cluster and for the clusters themselves, is achieved.
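The starvation criterion "n_mx much greater than m_tr" might be quantified as in the sketch below; the multiplicative factor is a hypothetical tuning choice of ours, not specified in the paper.

```python
def detect_starvation(remaining, m_tr, factor=10):
    """Starvation check: a processor that answers pings but computes
    nothing falls ever further behind.  'Much greater than m_tr' is
    quantified here by the assumed threshold factor * m_tr.  Returns
    (is_starved, index of the most backward processor)."""
    j_mx = max(range(len(remaining)), key=lambda j: remaining[j])
    return remaining[j_mx] > factor * m_tr, j_mx

starved, j = detect_starvation([3, 500, 7], m_tr=10)
```

The same check, applied to the per-cluster sums S_j against a threshold M_tr, handles starvation at the cluster level.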

  14. Conclusion. The presented algorithmic procedures, when built into a user program, enhance the reliability of distributed statistical Monte Carlo simulations. A special modification of these procedures was developed for the case when the user application is supplied with checkpointing. The quantitative evaluation of the performance and communication cost of the described algorithms has shown their usefulness for distributed computing on the Grid. An additional algorithmic procedure was also developed for dynamic load balancing of distributed computations, producing a flattening of the distribution of finishing times for all processors inside a cluster and for the clusters themselves.

  15. 2. Computer breakdown resistant load balancing. • It should be mentioned that the DLB technique presented above was tested on clusters whose processor numbers p were not large, p < 12. For clusters with a large number of processors, however, it displays certain limitations: first of all, a bottleneck effect appears. When many processors try nearly simultaneously to open the same control file, they inevitably fall into an idle queue. We found that already with p = 32 the bottleneck exhibits itself as a 10% time lag while executing a fixed program, in comparison with simple static load balancing. So for the case p >> 1 a new load balancing technique was developed which avoids the bottleneck by combining initial static load balancing with the dynamic one. The latter is accomplished by monitoring the appearance or non-appearance of certain files and by utilizing system commands for their quick recording. The same technical tricks allow us to diminish the crucial consequences of a computer breakdown in one of the clusters by automatically detecting this event and disconnecting that cluster from the others. Thus the whole problem run is not interrupted or stopped; only the total statistical sample is diminished by the portion initially intended for the misfortune cluster.

  16. Efficiency E_p and mean times T_av in seconds versus the computer number p: solid curves – the cluster alone, dashed curve – m_p, the mean time of one realisation under metacomputing with DLB.

  17. Quantitative performance estimate for the first DLB technique • For this purpose we compared values of the mean execution time t_av of a fixed simulation problem realization. First we calculated t_av and the corresponding efficiency E for its execution at the cluster "Paritet" alone, using different numbers of processors p. In the following slide, on the left, the efficiency E is represented by the solid curve with the E-axis in percent; the solid curve on the right shows the corresponding times t_av, with the t_av axis in seconds. Then we connected several clusters by our DLB technique and measured these times under metacomputing, t_mv, for the same cluster again, including of course the time for final averaging and output. The result is represented by the dashed curve, which exceeds the solid one only slightly, by several percent. This means that the efficiency of processor utilization in "Paritet" under our DLB technique is likewise only the same few percent lower.

  18. Scheme fragment for checkpointing
