Comprehensive Guide to Discrete Event Simulation Techniques

Simulation Modeling • Imitation of the operation of a real-world process or system over time • Objective: to collect data as if a real system were being observed • Data collected from the simulation are used to estimate the performance/dependability measures of the system

Discrete Event Simulation • modeling of the system as it evolves over time by a representation in which the state variables change only at a countable number of points in time • terminology: • simulation clock: a variable that gives the current value of the simulated time • event: an instantaneous occurrence which may change the state of the system

Simulation Terminology • event list: a list (data structure) consisting of event records with each record containing the time of occurrence of a particular event, e.g., the arrival time, the departure time of a client • timing routine: a subroutine which determines and removes the most imminent event record from the event list and advances the simulation clock to the time when the corresponding event is to occur • event routine: a subroutine which updates the state of the system when a particular type of event occurs • one event routine for each type of event

Event Scheduling • Determine the number of event types in the system, e.g., 1: arrival, 2: request for service, 3: service completion, 4: timer, etc. • Place one or more initial event records in the event list, each containing • event time, event type, customer class, etc. • Determine the most imminent event in the event list (by the timing routine) in a loop until a specified stopping rule is satisfied • update the simulation clock when an event record is removed from the event list

Event Scheduling (cont.) • Pass the control to the event routine corresponding to the event type • Update the state of the system • Gather the statistics if necessary • Report the simulation results when the simulation is completed • For example • the average response time per client • the loss probability of calls • the system throughput • the average number of clients served over a time period

Simulation using smpl • In the smpl view of systems, there are three types of entities: • resources: facilities • smpl provides functions to define, request, release and preempt (queueing) facilities • tokens: active entities of the systems, e.g., tasks, users (indistinguishable or distinguishable) • events: a change of state of any system entity is an event • smpl provides functions for scheduling and for selecting events in the order of event occurrence time

Initialization routine; timing control routine to select the most imminent event from the event list (event clock is updated implicitly) { event type 1: event routine for event type 1; event type 2: event routine for event type 2; . . event type n: event routine for event type n; } statistics reporting routine; Structure of An smpl Program

smpl(m, s) int m=0; /* always 0 */ char *s; smpl provides seeds for 15 streams for generating random numbers. To collect a set of 15 sample values of a particular performance measure, one can invoke smpl() 15 times: loop: repeat 15 times { smpl(0, “hw1”); } One can also use stream(1), stream(2), etc. to specify the stream number to be used in a simulation run Initialization Routine

fd = facility(s, n) char *s; int n; /* # of servers */ => define a queueing server with “n” servers; smpl automatically manages enqueueing/dequeueing activities r = request(fd_id, token_id, pri) intfd_id; inttoken_id; intpri; => request a server of facility “fd_id” be reserved for the token designated by “token_id” with priority “pri” (higher is better) r=0: facility is reserved r=1: facility is busy and the request is blocked in the queue ordered on priority Facility Definition and Control

r = preempt(fd_id, tkn_id, pri) intfd_id, tkn_id, pri; => same as request() except that it will preempt the server if it is busy serving a task with priority < “pri” => the event record corresponding to the preempted token (for the service completion event) is removed from the event list and a queue entry with the residual time is created r=0: facility is reserved r=1: facility is busy and the request is blocked in the queue ordered on priority release(fd_id, tkn_id) intfd_id; inttkn_id; => release the facility and if the queue is not empty, reschedule an event with the event occurrence time at NOW for a blocked task, and reschedule an event with the event occurrence time at NOW+ the residual time for a preempted task. Facility Definition and Control create an event of the same type and put it in the event list

schedule(event_id, te, tkn_id) int event_id; real te; /* time interval relative to the current time */ int tkn_id; => schedule the event with id “event_id” to occur at NOW+te => this essentially inserts an event record with the event occurrence time NOW+te into the event list => part of the information in the event record is event_id, tkn_id and the event occurrence time NOW+te Example: schedule(2, 0.0, token_id) => schedule event type #2 associated with token id “token_id” to occur NOW Scheduling Events

cause(event_id, tkn_id) int *event_id; int *tkn_id; => remove the most imminent event from the event list and automatically advance the simulation clock to the event occurrence time => return the event number (type) and token id to the caller Typically in the smpl program, we use a select statement on the event_id returned, so as to transfer the control to the appropriate event routine. Timing Routine

cancel(event_id) int event_id; => search the event list and remove the first event with the event number = event_id Canceling Events Get Current Simulation Time t = real time() => return the current simulation clock value => real is a predefined type; it is the same as double in C

n= int inq(fd); => returning # of tokens currently in queue (not including the ones in service) r = int status(fd) => r=0: facility is free; r=1: facility is busy u = real U(fd) => mean # of tokens in service n = real Lq(fd) => mean # of tokens in queue excluding the ones in service b = real B(fd) => mean busy period = accumulated busy time/release counts Status Functions

r = real drand48(); /* available on UNIX machines */ => return r in the range of (0,1) r = real expntl(x) double x; => return an exponentially distributed sample value with mean x r = real uniform(a,b) double a,b; => return a real number r in the range of (a,b) k = int random(i,j) int i, j; => return an integer k in the range of (i,j) r = real normal(x,s) => return a normally distributed sample value with mean x and standard deviation s Random Variate Generation (rand.c)

trace(n) int n; => generate trace messages when a facility is defined, requested, or released, or whenever an event is scheduled or caused n=0: trace is off n=1; free-running, i.e., trace messages are generated continuously n=2: screen by screen running (press any key to resume tracing) n=3: message by message running (press any key to resume) Traces and Debugging

#include “smpl.h” main() { real Ta=200, Ts=100, te=200000; int customer=1, event, server; smpl(0, “M/M/1 Queue”); server = facility(“server”,1); schedule(1, 0.0, customer); while (time()< te) { cause(&event, &customer); switch(event) { case 1: /* arrival */ schedule(2,0.0, customer); schedule(1, expntl(Ta), customer); break; case 2: /* request server */ if (request(server, customer,0)==0) schedule(3, expntl(Ts), customer); break; case 3: /* completion */ release(server, customer); break; } } report(); } M/M/1 smpl program

Confidence Interval and Level • Suppose we collect N sample values Y1, Y2, …, YN from N simulation runs • sample mean Y = (Y1 + Y2 + …+ YN )/N • true mean is m • Define 1-a as the probability that the absolute value of the difference between the Y and m is equal to or less than H • that is, prob[ Y-H <= m <= Y+H] = 1- a Confidence interval half-width Confidence interval Confidence level

Confidence Interval and Level (cont.) • When Y1, Y2, …, YN are independent random variables from a normal distribution with the mean m, H is defined by H = ta/2;N-1* s/sqrt(N) where t is the student’s t distribution and s2 is the sample variance given by s2 = Si (Yi - Y)2 /(N-1) (and thus sis the standard deviation).

Batch Mean Analysis by smpl m m samples to generate Y1 samples to generate Y2 • Use a batch size m around 2000 observations to collect a sample value Yito justify the normal distribution assumption (by central limit theorem). • Delete d = 0.1 m initial observations • Collect k = 10 batches and compute the confidence interval half-width H • If the desired accuracy has not been reached, collect another batch and compute H again. Repeat as necessary.

BMA: stat.c and bmeans.c • Based on 95% confidence level (a = 0.05) with 10% confidence accuracy (H/Y = 10%) • The following three routines are provided: • init_bm(d, m): d is number of initial observations to be discarded and m is the number of observations to collect one sample Yi • obs(y): y is the observation value generated out of a simulation run • if the returning value is 1, it means that the required confidence level and accuracy have been reached; otherwise, need to continue calling this function obs(y) • civals(Y, H, k): Y, H and k are passed in by reference. This function returns the final result.

M/M/1 smpl program with BMA case 1: /* arrival */ ts[customer] = time(); schedule(2,0.0, customer); if (++tk_id >= TOKENS) tk_id=0; schedule(1,expntl(Ta),tk_id); break; case 2: /* request server */ if (request(server, customer,0)==0) schedule(3,expntl(Ts),customer); break; case 3: /* release server */ release(server, customer); if (obs(time()-ts[customer])== 1) cont = FALSE; break; } } /* end while */ civals(&mean, &hw, &nb); printf(”Y= %f; H= %f after %d batches\n”, mean, hw, nb); } #include “smpl.h” #define TOKENS 1000 #define TRUE 1 #define FALSE 0 main() { real Ta=200.0,Ts=100.0,mean,hw; inttk_id=0,customer=0,event,server,nb; real ts[TOKENS]; /* start time stamp */ intcont=TRUE; smpl(0,"M/M/1 Queue with BMA"); init_bm(200,2000); /* d=200; m=2000 */ server=facility("server",1); schedule(1,0.0,tk_id); while (cont) { cause(&event,&customer); switch(event) {

Bmeans.c printf("batch %2d mean = %.3f",k,smy); smy=0.0; n=0; /* reset batch variables */ if (k>=10) then { /* compute grand mean & half width */ Y=smY/k; var=(smY2-k*Y*Y)/(k-1); h=T(0.025,k-1)*sqrt(var/k); printf(", rel. HW = %.3f",h/Y); if (h/Y<=0.1) then r=1; } printf("\n"); } return(r); } civals(mean,hw,nb) real *mean,*hw; int *nb; { /* return batch means analysis results */ *mean=Y; *hw=h; *nb=k; } #include "smpl.h" #include "stat.c" static intd,k,m,n; static real smy,smY,smY2,Y, h; init_bm(m0,mb) int m0,mb; { /* set deletion amount & batch size */ d=m0; m=mb; smy=smY=smY2=0.0; k=n=0; } obs(y) real y; { int r=0; real var; if (d) then {d--; return(r);} smy+=y; n++; if (n==m) then { /* batch complete: update sums & counts */ smy/=n; smY+=smy; smY2+=smy*smy; k++;

Chap 2: Reliability and Availability Models Reliability R(t) = prob{S is fully functioning in [0,t]} Suppose from [0,t] time period, we measure out of N components, of which N0 (t): # of components operating correctly at time t Nf (t): # of components which have failed at time t Physical meaning: instantaneous rate at which components are failing

Failing rate of a single component How many unfailing components are there at time t ? N Nf(t+dt) - Nf(t) dt t time t+dt (2.3) text p.28 Z(t) is also called the hazard rate.

For electronic components, Z(t)’s relationship with respect to time is a bathtub curve. Burn-in period Z(t) High failure rate due to faulty design, manufacturing or assembly. Useful time phase Wear-out phase (due to aging) Infant mortality phase Weak components are removed during the “burn-in” period. Failure rate can be assumed to be a constant during the useful time period, say  time

Exponential Failure Law e.g.  = 0.01 hr-1, what is R(t) at t = 100 hrs? Ans: e-0.01*100  For hardware components, exponential failure law is frequently assumed.  For software components, the reliability may grow as the software’s design faults are removed during the testing/debugging phase.

In general, we can assume Z(t) =  ( t )-1 Weibull dist.  R(t) = e- (t ) Z(t) Use this to model software failure rate >1 <1 e.g.  = -1 R(t) = =1 -1 t et This yields  t  , R(t)  1 t  0, R(t)  0  Reliability improves as a function of t

Formal Definition of R(t): Let x be a .. representing the life of a system and let F be the cumulative distribution function (CDF) of x. Then,  For a component obeying the exponential failure law

Mean Time to Failure (MTTF): The expected time that a system will operate before the first failure occurs i first failing time 1 t1 Identical systems N tN Discrete case Continuous case failure time Q: what is the reliability of a system obeying the exponential failure law at t = MTTF? Ans:

MTTR (Mean Time to Repair) If also assume a failed system obeys “Exponential Repair Law”, then Relationship between MTBF (Mean time between failure) , MTTR & MTTF: time MTTF MTTR MTTF MTTR Time of 2nd failure Time of first failure MTBF = MTTR + MTTF  If MTTF >> MTTR Then MTBF  MTTF

Availability Instantaneous (or point) Availability A(t) = prob {the system is functioning at time t} Regardless of the # of times it may have failed & been repaired during [0, t] A(t) A time t 0 Steady-State Availability = MTTF MTTF+MTTR  For a system without repair, A(t) = R(t)

Assume exponential “failure” & “repair” law  See also page 67, text chapter 4 O F  Time domain: Po’(t) = - Po(t) + PF(t) PF’(t) = Po(t) - PF(t) with initial state Po(0) = 1 & PF(0) = 0 Laplace domain: SPo(S) - 1 = - Po(S) + PF(S) SPF(S) - 0 = Po(S) - PF(S) Po(0) PF(0)

LT F(t) L(F(t)) = f(S) Inverse LT 1/S 1/S2 n!/Sn+1 1/(S-a) 1 t tn eat   Po(t) = + e-(+)t + + Physical meaning: Po(t) = prob {the system is functioning at time t }   PF(t) = - e-(+)t + +   + e-(+)t  A(t) = + +  Po(S) Similarly PF(S) Inverse LT to return to time domain

e.g.  = 0.01 &  = 0.1 1.0 0.9 0.90909... A(t) t Q1: unavailability? Ans: PF(t) Q2: R(t)? Still e-t Q3: Steady-state availability?  Ans: A( t) = +

Modeling: Series-Parallel Reliability Block Diagrams  A series-parallel block diagram represents the logical structure of a system with regard to how the reliabilities of its components affect the system reliability.  Components are combined into blocks in  series  parallel or  k-of-n configurations

1 1 2 n A. Serial system: each element of the system is required to function correctly for the system to function correctly. 1 2 3 n B. Parallel system: only one of several elements must be operational for the system to be operational. Assumptions: independent random variables perfect coverage so up to n-1 failures can be tolerated. 2

Computer 1 Interface 1 Display 1 Bus 1 Computer 2 Interface 2 Display 2 Bus 2 Parallel 1 2 3 4 Where Rj,i is for jth component of ith unit C. Combination of series & parallel systems e.g. Numerical ex: R=0.9 then Rsystem = [1-(1-0.9)2]4 = 0.96 v.s. Rnon-redundant = (0.9)4 = 0.6561

e.g. R1 R1 1 R2 R2 R2 2 R3 3 Rparallel = 1 - ( 1 - Rseries, 1 )( 1 - Rseries, 2 )( 1 - Rseries, 3 ) = 1 - ( 1 - R12 )( 1 - R23 )( 1 - R3 ) Q: Prove the following theorem: Replication at the component level is more effective than replication at the system level in improving system reliability using the same # of components. Ans: Show that is better than Assume R=1/2 for each component

D. k-out-of-n e.g. TMR ( Triple Module Redundancy ) is a 2-out-of-3 system. RTMR(t) = R1(t) * R2(t) * R3(t) + R1(t) R2(t) ( 1 - R3(t) ) + R1(t) R3(t) ( 1 - R2(t) ) + R2(t) R3(t) ( 1 - R1(t) ) when R1(t) = R2(t) = R3(t) = R(t) R2-out-of-3(t) = 3R2(t) - 2R3(t) all are functioning 1 failed & 2 are functioning In general, 3 e.g.

3 2  R2 - R + 0.5 = 0  R = 1.0 or 0.5 n n ! å - l - l - = - t i t n i R ( t ) ( e ) ( 1 e ) system - ( n i )! i ! = i k ¥ æ ö 1 1 1 ò \ = = + + ç ÷ MTTF R ( t ) dt ... system l n k è ø 0 Rsystem RTMR(t) Q: Is RTMR(t) > Rsystem with a single component(t) ? 1.0 Let 3R2 - 2R3 = R R(t) single RTMR(t) = R(t) when the reliability of a single module is 1.0 or 0.5 More realistic region In fact, when R(t) < 0.5, R(t) > RTMR(t) R(t) 1.0 0.5 Q: what is the MTTF of a k-out-of-n system when each single module follows the exponential failure law with a failure rate of  ? e.g. 2-out-of-3: MTTF= 2-out-of-5: MTTF=

m1 P1 m2 P2 m3 MTTF = 6 0.00903 3 0.01042 6 0.01667 3 0.01806 2 0.02431 1 0.0257 - - - = = + + 226.09 Q: what is the reliability & MTTF of the following structure? Fig. 2.5, p36 A parallel system too Also a parallel system m = 0.00139/day ( 1-out-of-3 ) p = 0.00764/day ( 1-out-of-2 ) Rsystem(t) = [ 1 - ( 1 - e-0.00764t )2 ][ 1 - ( 1 - e-0.00139t )3 ] = 6e-0.00903t-3e-0.01042t-6e-0.01667t+3e-0.01806t+2e-0.02431t-e-0.0257t  Recall that Repair rate Failure rate of component i Ai(t) = Equations & obtained above can also be used to compute 1 2 3 the system availability  by replacing Ri(t) with Ai(t)

4 5 6 Assuming For steady state availability i i + i Use Ai( t   ) = into equations above & 4 5 6 Specifically,

C D A B C E D C block name {( param-list )} optional < block line > end can be one of the following: Chap 9: Reliability & Availability Modeling Reliability Block Diagramusing sharpe P.358 Appendix B

2) parallel name name-1 name-2 {name3 name4 …} Name 1 The parallel system is assigned to the first name. Name 2 3) series name name-1 name-2 {name3 name4 …} 4) kofn name expression-1, expression-2, component-name The series system is assigned to the first name. Name 1 Name 2 k n identical components 1) comp name exponential-polynomial Referring to the cumulative dist function ( cdf ) F(t) = 1-e-t R(t) = e-t exp (lambda) meaning cdf (component-name{,state}{;arg-list}) which has been defined before gen triple 1, triple 2, ... in the form of (aj, kj, bj) or See p. 352

Ex: m1 Sharing memory: a k-out-of-n device CPU1 Fig. 2.5 (p. 36) m2 CPU2 m3 Output CDF for system 1-6e-0.00903t+3e-0.0104t+… (1-out-of-3) k=1.00 mean(system;k,3,…) = 2.26*102 rel (10,k,3…) = 9.9981*10-1 rel (365,k,3…) = 8.33*10-1 k=2.00 in semi symbolic form 5) kofn name k-expression, n-expression, name1 name2 {name3…} representing a k-out-of-n system having possibly different components Components do not have identical failure-time dist. block system ( k, n, pfrate, mfrate ) comp CPU exp ( pfrate ) comp mem exp ( mfrate ) parallel CPUs CPU CPU kofn mems k, n, mem series subsystem CPUs mems end

Comments: any line starting with the symbol “*”  printing output.  printing F(t) in symbolic form cdf (system;1, 3, 0.00139, 0.00764)  define reliability function at time t func rel(t, k, n, pf, mf)\ 1-value (t; system; k, n, pf, mf) \ means continuation to the next line Returning F(t) at time t numerically loop k, 1, 3, 1 exprmean (system; k, 3, 0.00139, 0.00764) exprrel (10, k, 3, 0.00139, 0.00764) exprrel (365, k, 3, 0.00139, 0.00764) end end

=0.3 =0.05 C =0.1 D A B C E =0.01 D C =0.25 block block1a comp A exp(0.05) comp B exp(0.01) comp C exp(0.3) comp D exp(0.25) comp E exp(0.1) parallel threeC C C C parallel twoD D D series sys1 A B threeC twoD E end Fig. 9.1 p. 156  printing: 5 decimal places format 5  cdf cdf (block1a) expr 1-value(10; block1a) end Print 1-F(t)=R(t) at t=10

Availability Modeling See p.354 text on a user-defined distribution syntax: poly name(param-list) dist. gen\ triple\ triple of the form (aj, kj, bj ) polyunavail(mu, lambda)\ gen\ 1, 0, 0\ -mu/(lambda+mu), 0, 0\ -lambda/(lambda+mu), 0, -(lambda+mu) block block1a comp A unavail(muA, lambdaA) comp B unavail(muB, lambdaB) . . . end bindmuA 1 bindlambdaA0.0001 . To define 1-Ai(t)  print steady state  availability A() expr pinf(block1a)  print instantaneous  availability at t=100 expr 1-value(100; block1a) end

e.g. failure or M1 M2 M3 P1 P2  AND AND M1 M2 M3 P1 P2 Fault Trees P.39, chap.2  A pictorial representation of events that can cause the occurrence of an undesirable event.  An event at a high level is reduced to a combination of lower level events by means of logic gates AND: when all fail, the failure event occurs. OR: when one fails, the failure event occurs. K out of n: when at least k out of n components fail, the failure event occurs.

Comprehensive Guide to Discrete Event Simulation Techniques