CPSC 531:Input Modeling

Instructor: Anirban Mahanti
Office: ICT 745
Email: mahanti@cpsc.ucalgary.ca
Class Location: TRB 101
Lectures: TR 15:30 – 16:45 hours
Class web page: http://pages.cpsc.ucalgary.ca/~mahanti/teaching/F05/CPSC531
Notes from “Discrete-event System Simulation” by Banks, Carson, Nelson, and Nicol, Prentice Hall, 2005 [BCNN05], and “Simulation Modeling and Analysis” by Law and Kelton, McGraw Hill, 2000 [LK00].

Outline
• Quality of output depends on the input models driving the simulation
• This module discusses the following:
  • Data collection from the real system
  • Hypothesizing probability distributions
  • Choosing parameters for the distributions
  • Goodness-of-fit tests: how well does the fitted distribution model the available data?
  • Selecting distributions in the absence of data
  • Models of arrival processes (Poisson process, non-stationary Poisson process, batch arrivals)

Data Collection
• Plan ahead: begin with a practice or pre-observation session, and watch for unusual circumstances
• Analyze the data as they are being collected: check adequacy
• Combine homogeneous data sets, e.g. successive time periods, or the same time period on successive days
• Be aware of data censoring: the quantity is not observed in its entirety, so there is a danger of leaving out long process times
• Check for relationships between variables, e.g. build a scatter diagram
• Check for autocorrelation
• Collect input data, not performance data

Identifying Probability Distributions
• Several possible techniques (may use a combination of these)
• Prior knowledge of the random variable’s role
  • Inter-arrival times are exponential if arrivals occur one at a time, have a constant mean rate, and are independent
  • Service times are not normally distributed because a service time cannot be negative
  • A product of many independent pieces may imply a lognormal distribution
• Use the physical basis of the distribution as a guide
• Summary statistics
• Histograms

Distribution Guide
• Use the physical basis of the distribution as a guide, for example:
  • Binomial: number of successes in n trials
  • Poisson: number of independent events that occur in a fixed amount of time or space
  • Normal: distribution of a process that is the sum of a number of component processes
  • Exponential: time between independent events, or a process time that is memoryless
  • Weibull: time to failure for components
  • Discrete or continuous uniform: models complete uncertainty
  • Triangular: a process for which only the minimum, most likely, and maximum values are known
  • Empirical: resamples from the actual data collected

Summary Statistics

Histograms
• A frequency distribution or histogram is useful in determining the shape of a distribution
• The number of class intervals depends on:
  • The number of observations
  • The dispersion of the data
  • Suggested: the square root of the sample size
• For continuous data:
  • Corresponds to the probability density function of a theoretical distribution
• For discrete data:
  • Corresponds to the probability mass function
• If few data points are available: combine adjacent cells to eliminate the ragged appearance of the histogram
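The square-root rule above can be sketched in a few lines of Python. This is a minimal illustration, not part of the course material: the function name `histogram`, the seed, and the exponential sample are all hypothetical choices made for the example.

```python
import math
import random

def histogram(data, num_bins=None):
    """Return (edges, counts) for equal-width bins; default bin count is sqrt(n)."""
    n = len(data)
    k = num_bins or max(1, round(math.sqrt(n)))
    lo, hi = min(data), max(data)
    width = (hi - lo) / k or 1.0          # guard against all-equal data
    counts = [0] * k
    for x in data:
        j = min(int((x - lo) / width), k - 1)   # the maximum goes in the last bin
        counts[j] += 1
    edges = [lo + i * width for i in range(k + 1)]
    return edges, counts

# Hypothetical data: 100 exponential values with mean 0.5 (fixed seed).
rng = random.Random(531)
sample = [rng.expovariate(1 / 0.5) for _ in range(100)]
edges, counts = histogram(sample)          # sqrt(100) = 10 bins
for left, c in zip(edges, counts):
    print(f"{left:6.2f} | {'*' * c}")
```

Passing a different `num_bins` shows how strongly the apparent shape depends on the bin width, which is why it is worth trying more than one.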

Histogram for Continuous Data
• Sample of n = 100 interarrival times of requests to a Web server during a 1-minute period (see Web page)
• Request arrivals are approximately stationary: the number of requests arriving in each 10-second period is approximately equal
• Sample mean = 0.534 s; median = 0.398 s; CV = 0.98
• Exponential distribution? (A CV close to 1 is consistent with an exponential.)
• The right-hand side shows two histograms: the top one with an interval (bin) size of 0.1 s, the bottom one with a bin size of 0.25 s.

Histogram for Discrete Data
• Sample of n = 100 observations of the number of items requested from a job shop per week over a long time period
• (number requested, number of observations): {(0,1), (1,3), (2,8), (3,14), (4,18), (5,17), (6,16), (7,10), (8,8), (9,4), (10,1)}
• Mean = 4.94, variance = 4.4, Lexis ratio (variance/mean) = 0.9
• Poisson distribution? (A Lexis ratio close to 1 is consistent with a Poisson.)
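The summary statistics quoted for the job-shop data can be reproduced directly from the frequency table. The sketch below uses the data from the slide; the function name `grouped_stats` is an illustrative choice, and the variance uses the usual n − 1 divisor.

```python
# Frequency table from the slide: value -> number of observations.
freq = {0: 1, 1: 3, 2: 8, 3: 14, 4: 18, 5: 17, 6: 16, 7: 10, 8: 8, 9: 4, 10: 1}

def grouped_stats(freq):
    """Sample size, mean, and sample variance (n - 1 divisor) of frequency data."""
    n = sum(freq.values())
    mean = sum(x * f for x, f in freq.items()) / n
    var = (sum(f * x * x for x, f in freq.items()) - n * mean ** 2) / (n - 1)
    return n, mean, var

n, mean, var = grouped_stats(freq)
lexis = var / mean                     # Lexis ratio: variance divided by mean
print(n, round(mean, 2), round(var, 2), round(lexis, 2))   # 100 4.94 4.4 0.89
```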

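Parameter Estimation
• Next step after selecting a family of distributions
• If the observations in a sample of size n are $X_1, X_2, \ldots, X_n$ (discrete or continuous), the sample mean and variance are: $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$ and $S^2 = \frac{\sum_{i=1}^{n} X_i^2 - n\bar{X}^2}{n-1}$
• If the data are discrete and have been grouped in a frequency distribution: $\bar{X} = \frac{1}{n}\sum_{j=1}^{k} f_j X_j$ and $S^2 = \frac{\sum_{j=1}^{k} f_j X_j^2 - n\bar{X}^2}{n-1}$, where k is the number of distinct values and $f_j$ is the observed frequency of value $X_j$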

Parameter Estimation
• When raw data are unavailable (data are grouped into class intervals), the approximate sample mean and variance are: $\bar{X} \approx \frac{1}{n}\sum_{j=1}^{c} f_j m_j$ and $S^2 \approx \frac{\sum_{j=1}^{c} f_j m_j^2 - n\bar{X}^2}{n-1}$, where $f_j$ is the observed frequency in the jth class interval, $m_j$ is the midpoint of the jth interval, and c is the number of class intervals
• A parameter is an unknown constant, whereas an estimator is a statistic (a function of the sample).

How Representative are the fits?
• Continuous data: plot the fitted density over the histogram and look for similarities
• Discrete data: compare the observed frequency with the expected frequency
• Try a Quantile-Quantile plot
[Figure: fitted distribution plotted against the observed data]

Quantile-Quantile Plots
• A Q-Q plot is a useful tool for evaluating distribution fit
• If X is a random variable with cdf F, then the q-quantile of X is the $\gamma$ such that $F(\gamma) = P(X \le \gamma) = q$, for $0 < q < 1$
• When F has an inverse, $\gamma = F^{-1}(q)$
• Let $\{x_i, i = 1, 2, \ldots, n\}$ be a sample of data from X and $\{y_j, j = 1, 2, \ldots, n\}$ be the observations in ascending order: $y_j$ is approximately $F^{-1}\left(\frac{j - 0.5}{n}\right)$, where j is the ranking or order number

Quantile-Quantile Plots
• The plot of $y_j$ versus $F^{-1}\left(\frac{j - 0.5}{n}\right)$ is:
  • Approximately a straight line if F is a member of an appropriate family of distributions
  • A line with slope 1 if F is a member of an appropriate family of distributions with appropriate parameter values
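The construction of the Q-Q points can be sketched with the standard library's `statistics.NormalDist`, whose `inv_cdf` method plays the role of $F^{-1}$. This is a minimal sketch for a normal candidate distribution; the function name `qq_points`, the seed, and the sample are hypothetical.

```python
import random
from statistics import NormalDist, mean, stdev

def qq_points(data, dist):
    """Pair each ordered observation y_j with the model quantile F^{-1}((j - 0.5)/n)."""
    ys = sorted(data)
    n = len(ys)
    return [(dist.inv_cdf((j - 0.5) / n), y) for j, y in enumerate(ys, start=1)]

# Hypothetical sample (fixed seed); the fitted normal uses the sample mean and std dev.
rng = random.Random(2005)
data = [rng.gauss(100.0, 0.3) for _ in range(20)]
fitted = NormalDist(mean(data), stdev(data))
points = qq_points(data, fitted)
for q, y in points:
    print(f"model quantile {q:8.3f}   observed {y:8.3f}")
```

Plotting the pairs (model quantile on one axis, observed value on the other) and judging how close they lie to the line y = x gives the visual check described above.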

Quantile-Quantile Plots
• Example: Check whether the door installation times follow a normal distribution [BCNN05]
• The observations are now ordered from smallest to largest:
• $y_j$ are plotted versus $F^{-1}\left(\frac{j - 0.5}{n}\right)$, where F has a normal distribution with the sample mean (99.99 sec) and sample variance ($0.2832^2$ sec$^2$)

Quantile-Quantile Plots [BCNN05]
• Example (continued): Check whether the door installation times follow a normal distribution.
[Figure: Q-Q plot showing an approximately straight line, supporting the hypothesis of a normal distribution]
[Figure: histogram with the density function of the normal distribution superimposed]

Quantile-Quantile Plots [BCNN05]
• Consider the following while evaluating the linearity of a Q-Q plot:
  • The observed values never fall exactly on a straight line
  • The ordered values are ranked and hence not independent, so the points are unlikely to be scattered about the line
  • The variance of the extremes is higher than that of the middle, so linearity of the points in the middle of the plot is more important.
• A Q-Q plot can also be used to check homogeneity
  • Check whether a single distribution can represent both sample sets
  • Plot the ordered values of the two data samples against each other

Goodness-of-Fit Tests [BCNN05]
• Conduct hypothesis testing on the input data distribution using:
  • Kolmogorov-Smirnov (KS) test
  • Chi-square test
• No single correct distribution exists in a real application.
  • If very little data are available, a test is unlikely to reject any candidate distribution
  • If a lot of data are available, a test is likely to reject all candidate distributions

Chi-Square Test [BCNN05]
• Compare the histogram of the data to the shape of the candidate distribution function
• Valid for large sample sizes when parameters are estimated by maximum likelihood
• Arrange the n observations into a set of k class intervals or cells; the test statistic is: $\chi_0^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}$, which approximately follows the chi-square distribution with k − s − 1 degrees of freedom, where s is the number of parameters of the hypothesized distribution estimated by the sample statistics.
  • $O_i$ is the observed frequency in the ith interval
  • $E_i = n p_i$ is the expected frequency, where $p_i$ is the theoretical probability of the ith interval (suggested minimum $E_i$ = 5)

Chi-Square Test
• The null hypothesis (the observations come from a specified distribution) cannot be rejected at a significance level of α if: $\chi_0^2 < \chi^2_{\alpha,\,k-s-1}$, where the critical value $\chi^2_{\alpha,\,k-s-1}$ is obtained from a table
• Comments:
  • Errors in cells with small $E_i$’s affect the test statistic more than cells with large $E_i$’s.
  • The minimum size of $E_i$ is debated: [BCNN05] recommends a value of 3 or more; if this is not met, combine adjacent cells.
  • The test is designed for discrete distributions and large sample sizes only. For continuous distributions, the chi-square test is only an approximation (i.e., the level of significance holds only as n → ∞).

Chi-Square Test
• Example 1: 500 random numbers are generated using a random number generator, and the observations are categorized into cells of width 0.1 between 0 and 1. At a level of significance of 0.1, are these numbers IID U(0,1)?
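The mechanics of Example 1 can be sketched as follows. The slide's own 500 numbers are not reproduced here, so the sketch draws hypothetical numbers from Python's `random` module with a fixed seed; the helper names are illustrative. With k = 10 cells and no estimated parameters (s = 0) there are 10 − 0 − 1 = 9 degrees of freedom, and the tabulated chi-square critical value for α = 0.10 with 9 degrees of freedom is 14.68.

```python
import random

def chi_square_statistic(observed, expected):
    """Sum over cells of (O_i - E_i)^2 / E_i."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def uniform_cell_counts(numbers, k=10):
    """Count how many numbers in [0, 1) fall in each of k equal-width cells."""
    counts = [0] * k
    for u in numbers:
        counts[min(int(u * k), k - 1)] += 1
    return counts

# Hypothetical stand-in for the slide's data: 500 numbers from a seeded generator.
rng = random.Random(42)
numbers = [rng.random() for _ in range(500)]
observed = uniform_cell_counts(numbers)
expected = [len(numbers) / 10] * 10        # E_i = n * p_i = 500 * 0.1 = 50

chi2 = chi_square_statistic(observed, expected)
CRITICAL_0_10_DF9 = 14.68                  # table value: alpha = 0.10, 9 degrees of freedom
verdict = "reject" if chi2 > CRITICAL_0_10_DF9 else "cannot reject"
print(observed, round(chi2, 2), verdict)
```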

Chi-Square Test [BCNN05]
• Example 2: Vehicle arrivals
  • H0: the random variable is Poisson distributed.
  • H1: the random variable is not Poisson distributed.
• Some cells were combined because of the minimum $E_i$ requirement.
• The degrees of freedom are k − s − 1 = 7 − 1 − 1 = 5; the hypothesis is rejected at the 0.05 level of significance.

Chi-Square Test
• If the distribution tested is continuous: $p_i = \int_{a_{i-1}}^{a_i} f(x)\,dx = F(a_i) - F(a_{i-1})$, where $a_{i-1}$ and $a_i$ are the endpoints of the ith class interval, f(x) is the assumed pdf, and F(x) is the assumed cdf.
• Recommended number of class intervals (k):
• Caution: Different groupings of the data (i.e., different k) can affect the hypothesis testing result.
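For a continuous candidate, the cell probabilities come from differences of the assumed cdf at the interval endpoints. The sketch below does this for an exponential distribution and, as one common choice that the slide does not prescribe, picks endpoints that make every cell equally probable. The function names and the rate value are hypothetical.

```python
import math

def exp_cdf(x, rate):
    """cdf of the exponential distribution with the given rate."""
    return 1.0 - math.exp(-rate * x)

def cell_probabilities(cdf, edges):
    """p_i = F(a_i) - F(a_{i-1}) for consecutive endpoints a_0 < a_1 < ... < a_k."""
    return [cdf(b) - cdf(a) for a, b in zip(edges, edges[1:])]

def equiprobable_exp_edges(rate, k):
    """Endpoints a_0..a_k with F(a_i) = i/k; the last endpoint is infinity."""
    return [-math.log(1.0 - i / k) / rate for i in range(k)] + [math.inf]

rate = 2.0                                   # hypothetical fitted rate
edges = equiprobable_exp_edges(rate, k=8)
probs = cell_probabilities(lambda x: exp_cdf(x, rate), edges)
print([round(a, 3) for a in edges])
print([round(p, 3) for p in probs])          # each cell has probability 1/8
```

With equal probabilities every expected frequency is n/k, which makes it easy to see in advance whether a minimum expected frequency is met.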

Kolmogorov-Smirnov (KS) Test
• The difference between the observed CDF $F_0(x)$ and the expected CDF $F_e(x)$ should be small; this formalizes the idea behind the Q-Q plot.
• Step 1: Rank the observations from smallest to largest: $Y_1 \le Y_2 \le Y_3 \le \ldots \le Y_n$
• Step 2: Define the observed CDF $F_0(x) = (\#i: Y_i \le x)/n$
• Step 3: Compute K as follows: $K^+ = \sqrt{n}\,\max_x\,[F_0(x) - F_e(x)]$ and $K^- = \sqrt{n}\,\max_x\,[F_e(x) - F_0(x)]$

KS Test
• Example: Test whether a given population is exponential with parameter β = 0.01, that is, $F_e(x) = 1 - e^{-\beta x}$; the critical value read from the KS table for n = 15 is $K_{[0.9,15]} = 1.0298$.
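A KS computation for the exponential hypothesis above can be sketched as follows. The sketch assumes the scaled one-sided statistics K+ = √n · max_j (j/n − Fe(Yj)) and K− = √n · max_j (Fe(Yj) − (j − 1)/n), which is one standard way to evaluate the largest deviations at the ordered observations; the 15 sample values are hypothetical, not the slide's data, and the function name is illustrative.

```python
import math

def ks_statistics(sample, cdf):
    """Return (K_plus, K_minus): sqrt(n)-scaled largest deviations between the
    observed step CDF and the expected CDF, evaluated at the ordered data."""
    ys = sorted(sample)
    n = len(ys)
    d_plus = max(j / n - cdf(y) for j, y in enumerate(ys, start=1))
    d_minus = max(cdf(y) - (j - 1) / n for j, y in enumerate(ys, start=1))
    return math.sqrt(n) * d_plus, math.sqrt(n) * d_minus

beta = 0.01                                   # parameter from the slide

def expected_cdf(x):
    """F_e(x) = 1 - exp(-beta * x)."""
    return 1.0 - math.exp(-beta * x)

# Hypothetical sample of n = 15 observations.
sample = [5, 12, 23, 31, 44, 58, 71, 85, 102, 121, 143, 170, 205, 260, 390]
k_plus, k_minus = ks_statistics(sample, expected_cdf)
CRITICAL = 1.0298                             # K[0.9,15] quoted on the slide
print(round(k_plus, 4), round(k_minus, 4), max(k_plus, k_minus) < CRITICAL)
```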

KS Test
• The KS test is suitable for small samples and for continuous as well as discrete distributions
• The KS test, unlike the chi-square test, uses each observation in the given sample without grouping the data into cells (intervals).
• The KS test is exact provided all parameters of the expected distribution are known.

Selecting Model without Data
• If data are not available, some possible sources of information about the process are:
  • Engineering data: a product or process often has performance ratings provided by the manufacturer, or company rules specify time or production standards.
  • Expert opinion: people who are experienced with the process or similar processes can often provide optimistic, pessimistic, and most-likely times, and they may know the variability as well.
  • Physical or conventional limitations: physical limits on performance, or limits or bounds that narrow the range of the input process.
  • The nature of the process.
• The uniform, triangular, and beta distributions are often used as input models [see LK00 for details].

Models of Arrival Processes
• Poisson process
• Non-stationary Poisson process
• Batch arrivals

Poisson Process
• Definition: Let N(t) denote the number of arrivals that occur in the time interval [0,t].
• The stochastic process {N(t), t ≥ 0} is a Poisson process with mean rate $\lambda$ if:
  • N(0) = 0
  • Arrivals occur one at a time
  • {N(t), t ≥ 0} has stationary increments: the number of arrivals in a given interval depends only on the length of the interval, not its location
  • {N(t), t ≥ 0} has independent increments: the numbers of arrivals in disjoint time intervals are independent.
  • And …

Poisson Process: Interarrival Times
• Consider the interarrival times of a Poisson process ($A_1, A_2, \ldots$), where $A_i$ is the elapsed time between arrival i − 1 and arrival i (with $A_1$ measured from time 0)
• The 1st arrival occurs after time t iff there are no arrivals in the interval [0,t], hence: $P\{A_1 > t\} = P\{N(t) = 0\} = e^{-\lambda t}$ and $P\{A_1 \le t\} = 1 - e^{-\lambda t}$ [cdf of exp($\lambda$)]
• Interarrival times $A_1, A_2, \ldots$ are exponentially distributed and independent with mean $1/\lambda$
• Summary: arrival counts are Poisson with rate $\lambda$ (stationary and independent increments); interarrival times are exponential with mean $1/\lambda$ (memoryless)
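The link between exponential interarrival times and a Poisson process gives a direct way to generate arrivals in a simulation: keep adding exponential gaps until the time horizon is passed. A minimal sketch, with a hypothetical function name, rate, horizon, and seed:

```python
import random

def poisson_arrivals(rate, horizon, rng):
    """Arrival times on [0, horizon] built from IID exponential interarrival times."""
    times, t = [], 0.0
    while True:
        t += rng.expovariate(rate)        # exponential gap with mean 1/rate
        if t > horizon:
            return times
        times.append(t)

rng = random.Random(7)
arrivals = poisson_arrivals(rate=2.0, horizon=1000.0, rng=rng)
gaps = [b - a for a, b in zip([0.0] + arrivals, arrivals)]
print(len(arrivals), sum(gaps) / len(gaps))   # count near rate*horizon, mean gap near 1/rate
```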

Poisson Process: Splitting and Pooling
• Splitting:
  • Suppose each event of a Poisson process with rate $\lambda$ can be classified as Type I, with probability p, and Type II, with probability 1 − p.
  • N(t) = N1(t) + N2(t), where N1(t) and N2(t) are both Poisson processes, with rates $\lambda p$ and $\lambda(1-p)$
• Pooling:
  • Suppose two independent Poisson processes with rates $\lambda_1$ and $\lambda_2$ are pooled together
  • N1(t) + N2(t) = N(t), where N(t) is a Poisson process with rate $\lambda_1 + \lambda_2$
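Splitting and pooling translate into very small operations on lists of arrival times: an independent coin flip per event, and a merge of two sorted streams. The sketch below is illustrative only; the function names, rate, probability, and seed are hypothetical.

```python
import random

def split_process(times, p, rng):
    """Classify each event as Type I with probability p, otherwise Type II."""
    type1, type2 = [], []
    for t in times:
        (type1 if rng.random() < p else type2).append(t)
    return type1, type2

def pool_processes(times_a, times_b):
    """Merge two streams of arrival times into one ordered stream."""
    return sorted(times_a + times_b)

# Hypothetical arrival stream with rate 2 (exponential gaps, fixed seed).
rng = random.Random(11)
times, t = [], 0.0
for _ in range(1000):
    t += rng.expovariate(2.0)
    times.append(t)

type1, type2 = split_process(times, p=0.3, rng=rng)
print(len(type1), len(type2))                 # roughly 30% / 70% of the events
print(pool_processes(type1, type2) == times)  # pooling the parts restores the stream
```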

Non-stationary Poisson Process (NSPP)
• A Poisson process without the stationary increments, characterized by $\lambda(t)$, the arrival rate at time t.
• The expected number of arrivals by time t, $\Lambda(t)$: $\Lambda(t) = \int_0^t \lambda(s)\,ds$
• Relating a stationary Poisson process n(t) with rate $\lambda = 1$ and an NSPP N(t) with rate $\lambda(t)$:
  • Let the arrival times of a stationary process with rate $\lambda = 1$ be $t_1, t_2, \ldots$, and the arrival times of an NSPP with rate $\lambda(t)$ be $T_1, T_2, \ldots$; then $t_i = \Lambda(T_i)$ and $T_i = \Lambda^{-1}(t_i)$

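Non-stationary Poisson Process (NSPP)
• Example: Suppose arrivals to a Post Office have rates 2 per minute from 8 am until 12 pm, and then 0.5 per minute until 4 pm.
• Let t = 0 correspond to 8 am, with t measured in hours. The worked figures on this slide apply the values 2 and 0.5 per unit of t, so they are consistent only if the rates are read per hour. The NSPP N(t) then has rate function: $\lambda(t) = 2$ for $0 \le t < 4$, and $\lambda(t) = 0.5$ for $4 \le t \le 8$
• Expected number of arrivals by time t: $\Lambda(t) = 2t$ for $0 \le t < 4$, and $\Lambda(t) = \int_0^4 2\,ds + \int_4^t 0.5\,ds = \frac{t}{2} + 6$ for $4 \le t \le 8$
• Hence, the probability distribution of the number of arrivals between 11 am and 2 pm is: $P[N(6) - N(3) = k] = P[n(\Lambda(6)) - n(\Lambda(3)) = k] = P[n(9) - n(6) = k] = \frac{e^{-(9-6)}(9-6)^k}{k!} = \frac{e^{-3}\,3^k}{k!}$, where n(t) is the stationary Poisson process with rate 1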
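The Post Office calculation, and the inversion idea for generating NSPP arrivals, can be sketched in code. The sketch assumes t is measured in hours after 8 am and that the rate values 2 and 0.5 apply per unit of t, which is what the slide's figures Λ(3) = 6 and Λ(6) = 9 imply; the function names and the seed are hypothetical.

```python
import math
import random

def cumulative_rate(t):
    """Lambda(t): expected arrivals by time t (rate 2 on [0, 4), rate 0.5 on [4, 8])."""
    return 2.0 * t if t < 4 else 6.0 + 0.5 * t

def cumulative_rate_inv(u):
    """Inverse of Lambda: the time at which the expected count reaches u."""
    return u / 2.0 if u < 8 else 2.0 * (u - 6.0)

def count_pmf(k, start, end):
    """P[N(end) - N(start) = k]: Poisson with mean Lambda(end) - Lambda(start)."""
    m = cumulative_rate(end) - cumulative_rate(start)
    return math.exp(-m) * m ** k / math.factorial(k)

def nspp_arrivals(rng, end=8.0):
    """Generate rate-1 arrival times t_i and map them to T_i = Lambda^{-1}(t_i)."""
    times, u = [], 0.0
    while True:
        u += rng.expovariate(1.0)
        t = cumulative_rate_inv(u)
        if t > end:
            return times
        times.append(t)

print(cumulative_rate(3), cumulative_rate(6))            # 6.0 9.0
print([round(count_pmf(k, 3, 6), 4) for k in range(5)])  # e^-3 * 3^k / k!
print(len(nspp_arrivals(random.Random(3))))              # expected count is Lambda(8) = 10
```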

Batch Process
• Let N(t) be the number of batches that have arrived by time t.
• If the interarrival times of batches are IID exponential random variables, {N(t), t ≥ 0} can be modeled as a Poisson process.
• Let X(t) be the total number of individual customers to arrive by time t, and let $B_i$ be the number of customers in the ith batch; then $X(t) = \sum_{i=1}^{N(t)} B_i$
• If the $B_i$’s are IID random variables independent of {N(t), t ≥ 0}, and if {N(t), t ≥ 0} is a Poisson process, then the stochastic process {X(t), t ≥ 0} is a compound Poisson process
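A compound Poisson arrival stream can be simulated by generating batch arrivals as a Poisson process and drawing an independent batch size at each arrival. The sketch below is a minimal illustration; the function name, the rate, the horizon, and the batch-size distribution (uniform on 1 to 4) are hypothetical choices.

```python
import random

def compound_poisson_total(rate, horizon, batch_size, rng):
    """Return (N, X): batches and total customers arriving in [0, horizon]."""
    t, batches, customers = 0.0, 0, 0
    while True:
        t += rng.expovariate(rate)        # exponential time to the next batch
        if t > horizon:
            return batches, customers
        batches += 1
        customers += batch_size(rng)      # IID batch size, independent of arrivals

rng = random.Random(9)
n_batches, n_customers = compound_poisson_total(
    rate=3.0, horizon=100.0, batch_size=lambda r: r.randint(1, 4), rng=rng
)
print(n_batches, n_customers)             # X(t) is the sum of N(t) batch sizes
```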
