Performance Engineering

Looking at Random Data & A Simulation Example
Prof. Jerry Breecher

Goals:
• Look at the nature of random data. What happens as random data is used in multiple operations?
• Look at how network arrivals really work: are arrivals random or do they follow some other pattern?
• Use our simulation techniques to study these patterns (so this is really an example of simulation usage).
• Determine the difference in behavior as a result of network arrival patterns.

Random Arrivals Random Data Suppose we have a random number generator. And suppose we run a program using that data multiple times. Do the results of those multiple program executions converge or diverge? There is no simple intuitive answer to this question, so let’s try it.

Random Data Let’s take a very simple piece of code:

    if ( random() >= 0.5 )
        HeadsGreaterThanTails++;
    else
        HeadsGreaterThanTails--;

When we run the program, we collect the value of the variable every 100 million iterations, and do it for a total of 1 billion iterations. Here’s a sample run. After 400 million iterations, there were 3192 more “heads” than “tails”.
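To make the experiment easy to repeat, here is a minimal sketch of the counting loop as a self-contained function. The generator `NextUniform` and the function name `CountHeadsMinusTails` are hypothetical stand-ins (the slides only call `random()`), and the fixed seed is an assumption chosen so that runs are repeatable.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for random(): a 64-bit linear congruential
   generator (Knuth's MMIX constants) returning a double in [0, 1). */
static uint64_t rng_state = 12345u;

double NextUniform(void)
{
    rng_state = rng_state * 6364136223846793005ULL + 1442695040888963407ULL;
    return (double)(rng_state >> 11) / 9007199254740992.0;   /* divide by 2^53 */
}

/* Flip `iterations` coins and return (number of heads) - (number of tails). */
long CountHeadsMinusTails(long iterations)
{
    long HeadsGreaterThanTails = 0;
    long i;

    assert(iterations >= 0);
    for (i = 0; i < iterations; i++) {
        if (NextUniform() >= 0.5)
            HeadsGreaterThanTails++;
        else
            HeadsGreaterThanTails--;
    }
    return HeadsGreaterThanTails;
}
```

Calling `CountHeadsMinusTails(100000000)` ten times in a row and printing each running total would mimic the sampling interval described above; the difference is expected to wander on the scale of the square root of the number of flips rather than settle at zero.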

Random Data Now let’s do that same thing for 8 processes. What do you think will happen to the numbers? Will some process always have more heads than tails? Will the difference between results for processes depend on how many iterations have been done? Here’s the result for 8 processes:

Random Data And here’s the graph for those 8 processes – note there’s been a constant amount added to each value to get all the outputs positive.

Random Data As you can see in the last graph, the statistics are terrible: it’s hard to determine the pattern from only a few runs. So the program was run 10,000 times, and the minimum and maximum counts were taken at each time interval across those 10,000 runs.
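The minimum/maximum bookkeeping described above can be sketched as follows. This is an illustration under stated assumptions, not the program used for the slides: the names `MinMaxEnvelope`, `CHECKPOINTS`, and `NextUniform` are hypothetical, and the run lengths are left to the caller so the example can be tried at a much smaller scale than 10,000 runs of a billion flips.

```c
#include <assert.h>
#include <stdint.h>

#define CHECKPOINTS 10   /* number of sampling points per run (assumed) */

static uint64_t rng_state = 2024u;

/* Hypothetical uniform generator in [0, 1), 64-bit LCG. */
static double NextUniform(void)
{
    rng_state = rng_state * 6364136223846793005ULL + 1442695040888963407ULL;
    return (double)(rng_state >> 11) / 9007199254740992.0;
}

/* Run `runs` independent coin-flip experiments. At every checkpoint,
   record the smallest and largest heads-minus-tails seen across runs. */
void MinMaxEnvelope(int runs, long flips_per_checkpoint,
                    long min_seen[CHECKPOINTS], long max_seen[CHECKPOINTS])
{
    int r, c;
    long i, diff;

    assert(runs > 0 && flips_per_checkpoint > 0);
    for (r = 0; r < runs; r++) {
        diff = 0;                               /* each run starts from zero */
        for (c = 0; c < CHECKPOINTS; c++) {
            for (i = 0; i < flips_per_checkpoint; i++)
                diff += (NextUniform() >= 0.5) ? 1 : -1;
            if (r == 0 || diff < min_seen[c]) min_seen[c] = diff;
            if (r == 0 || diff > max_seen[c]) max_seen[c] = diff;
        }
    }
}
```

Plotting `min_seen` and `max_seen` against the checkpoint index gives the envelope of all runs; the gap between the two curves is expected to widen as more flips accumulate.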

Random Data But, what happens if the processes doing random events interact with each other? This is the case if the programs are all accessing the same disk – we randomly choose which block in a large file is being written to. But each process must compete for the file lock and for disk access. Here’s the behavior of 10 disk-writing processes for 10,000 seconds. The numbers represent disk writes for that process during the time interval.

Random Data The accesses are clearly very close to each other.

Random Data Comparing the 10 processes. This is the spread: the maximum number of accesses less the minimum number of accesses across the processes.

Random Data Comparing the 10 processes. Here’s how their relative performance varies over time. Note that no one process is always the minimum or the maximum performer.

Another Numerical Example I have two virtual cats, who share a single can of food at each meal. My cats are very finicky and get angry if their portions are unequal. I am finicky too, and I don't like dirtying dishes when I divvy it up. To split the food, then, I upend the open can of food onto a flat plate, then carefully lift the can off, leaving a perfectly formed virtual cylinder of food. Then I use the vanishingly small circular edge of the can to carefully cut the food into two exactly equal portions, one of which is shaped like a crescent moon, the other a cat's eye, or mandorla.

Another Numerical Example [Figure: two overlapping circles, with labels X, A, B, and a]

Another Numerical Example

// //////////////////////////////////////////////////////////////////////
// We're trying to solve the following problem.
// Given two circles, how close should the centers of the circles be such
// that the area subtended by the arcs of the two circles is exactly one
// half the total area of the circle.
//
// See example 2.3.8 in Leemis & Park.
// We use the book's definition for Uniform - see 2.3.3
// Here's how this works. Try a number of different distances between
// the two circle centers. Then for the ones that are most successful,
// zoom in to do them in more detail.
// //////////////////////////////////////////////////////////////////////
#include <math.h>
#include <stdio.h>     // Needed for printf in main
#include <stdlib.h>

#define PI 3.1415927
#define TRUE 1
#define FALSE 0

// Prototypes
double GetRandomNumber( void );
void InitializeRandomNumber( void );
double ModelTwoCircles( double, int );

// Return a random value uniformly distributed between min and max
double Uniform( double min, double max) {
    return( min + (max - min)*GetRandomNumber() );
}

int main( int argc, char *argv[] ) {
    double Distance, Result = 0;
    double FirstSample = 0.1, LastSample = 1.9;
    double Increment;
    double NewFirstSample = FirstSample;   // Fallback if no sample exceeds 0.5
    double BestDistance = 0;
    int NumberOfSamples = 5000;
    int AnswerIsFound = FALSE;

    InitializeRandomNumber();
    while ( !AnswerIsFound ) {
        printf( "\nNext Iteration starts at %f\n", FirstSample );
        Increment = (LastSample - FirstSample)/10;
        NumberOfSamples = 2 * NumberOfSamples;   // Use more samples on each pass
        for ( Distance = FirstSample; Distance <= LastSample; Distance += Increment ) {
            Result = ModelTwoCircles( Distance, NumberOfSamples );
            // Remember the largest distance that still overlaps more than half
            if ( Result - 0.5000 > 0 )
                NewFirstSample = Distance;
            // We are done once the fraction is within 0.0001 of one half
            if ( (0.5 - Result) < 0.0001 && (Result - 0.5) < 0.0001 ) {
                AnswerIsFound = TRUE;
                BestDistance = Distance;
            }
            printf( "Distance = %8.6f, Fraction = %8.6f\n", Distance, Result );
        }
        // Zoom in around the crossover point for the next pass
        FirstSample = NewFirstSample - 2 * Increment;
        LastSample = FirstSample + 4 * Increment;
    }
    printf( "\nThe best Distance is at %f using %d samples\n",
            BestDistance, NumberOfSamples );
    return 0;
}

double ModelTwoCircles( double Distance, int NumberOfSamples ) {
    double HitsInOneCircle = 0, HitsInTwoCircles = 0;
    double x, y, SecondDistance;
    int Samples;

    for ( Samples = 0; Samples < NumberOfSamples; Samples++ ) {
        do {
            x = Uniform( -1, 1 );
            y = Uniform( -1, 1 );
        } while ( (x * x) + (y * y) >= 1 );   // Loop until value in circle
        HitsInOneCircle++;
        // Distance from the sample to the center of the second circle
        SecondDistance = sqrt( ( x - Distance ) * (x - Distance ) + (y * y) );
        if ( SecondDistance < 1.0 ) {
            HitsInTwoCircles++;
            // printf( "Samples: Second Distance = %8.6f\n", SecondDistance );
        }
    } // End of for
    return( HitsInTwoCircles / HitsInOneCircle );
}

Random Arrivals Network Arrivals In our queueing analysis, we’ve assumed random arrivals (a Poisson distribution, with exponentially distributed inter-arrival times). This leads to our analysis of M/M/1 queues with Utilization U = Service Time / Inter-arrival Time (using mean values) and with Queue Length = U / ( 1 - U ). We generated uniformly distributed random numbers and, based on those, were able to derive the exponential arrival times and Poisson distributions. But is this how networks behave?
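The two M/M/1 formulas quoted above are easy to evaluate directly. The helper names `Utilization` and `MeanQueueLength` below are hypothetical, and the sketch assumes that both times are mean values in the same units.

```c
#include <assert.h>

/* U = mean service time / mean inter-arrival time. */
double Utilization(double service_time, double interarrival_time)
{
    assert(service_time > 0.0 && interarrival_time > 0.0);
    return service_time / interarrival_time;
}

/* Mean queue length (including the request in service) for an M/M/1
   queue: U / (1 - U). Only defined for U below 1. */
double MeanQueueLength(double utilization)
{
    assert(utilization >= 0.0 && utilization < 1.0);
    return utilization / (1.0 - utilization);
}
```

For example, a service time of 0.8 and an inter-arrival time of 1.0 give U = 0.8 and a mean queue length of 4; pushing U to 0.9 raises it to 9, which shows how sharply the queue grows near saturation.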

Self-Similar Arrivals
Network Arrivals: Is this how networks really behave?
• On the Self-Similar Nature of Ethernet Traffic
• Leland, Taqqu, Willinger, Wilson. IEEE/ACM ToN, Vol. 2, pp 1-15, 1994
• Establish self-similar nature of Ethernet traffic
• Illustrate the differences between self-similar and standard models
• Show serious implications of self-similar traffic for design, control and performance analysis of packet-based communication systems

What Did Leland et al. Measure? Millions of packets from many workstations, as recorded on Bellcore internal networks.

What Did Leland et al. Measure?
Significance of self-similarity
• Nature of traffic generated by individual Ethernet users. Aggregate traffic study provides insights into traffic generated by individual users.
• Nature of congestion produced by self-similar models differs drastically from that predicted by standard formal models. We will show this by the simulation we perform here.
Why is Ethernet traffic self-similar?
• Plausible physical explanation of self-similarity in Ethernet traffic. (People don’t generate traffic randomly. They come to work at the same time, get tired at the same time, etc.)
Mathematical Result
• Superposition of many ON/OFF sources whose ON-periods and OFF-periods have high variability or infinite variance produces aggregate network traffic that is self-similar or long-range dependent. (Infinite variance here means that there are some samples with a very long inter-arrival time; lunch hour is a very long time!)

What Did Leland et al. Measure? So are these bursts “random”? Can you tell by looking at the data? The answer is that the data is bunched together (it’s not spread uniformly), and to be self-similar, the “bunches” themselves form “super-bunches”.

Where does “Self-Similar” Data Occur?
• It occurs throughout nature.
• Also called Pareto Distribution, Bradford, Zipf, and various other names.
• Distribution of books checked out of a library.
• Distribution of lengths of rivers in the world.
• It’s NOT the same as an exponential distribution! (But it can look fairly close.)
• Fractals are an example of self-similarity.

Exponential and Self-Similar Data
Exponential Cumulative Function: $F(x) = 1 - e^{-ax}$
Exponential Probability Density Function (PDF): $f(x) = a e^{-ax}$
Pareto Cumulative Function: $F(x) = 1 - \left( \frac{X_0}{X_0 + x} \right)^{b}$
Pareto Probability Density Function (PDF): $f(x) = \frac{b X_0^{b}}{(X_0 + x)^{b+1}}$
In these equations:
• a = 1 (the exponential falls to 1/e when x = 1). The mean of these values is 1. Turns out the variance is also 1. The exponential is special that way.
• $X_0$ = 2. Then b was adjusted so that it gave a mean of 1. Arrivals for both distributions therefore have the same mean value.
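The value of b that gives a mean of 1 can also be worked out by hand. The derivation below assumes the mean is taken over the whole range from 0 to infinity; the slides do not say how b was actually tuned, so treat the resulting b as a cross-check rather than the value used.

```latex
\begin{align*}
E[X] &= \int_0^\infty \bigl(1 - F(x)\bigr)\,dx
      = \int_0^\infty \left(\frac{X_0}{X_0 + x}\right)^{b} dx
      = \frac{X_0}{b - 1}, \qquad b > 1 \\
E[X] = 1,\ X_0 = 2 \;&\Rightarrow\; b = 3 \\
\operatorname{Var}[X] &= \frac{X_0^{2}\, b}{(b-1)^{2}(b-2)}
      = \frac{4 \cdot 3}{4 \cdot 1} = 3, \qquad b > 2
\end{align*}
```

Under this assumption the Pareto arrivals have the same mean as the exponential ones but three times the variance. The variance only becomes infinite for b at or below 2.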

Exponential and Self-Similar Data Note that the Pareto data has a higher value at the limits (the tail falls off more slowly); this is what leads to it being self-similar and to the data having a large variance. [Figure: Pareto PDF (purple) and exponential PDF (black)]

Simulation
So I wrote a simulator. There are two parts I especially want to show you:
• The “guts” of the simulator: how events are taken off a queue and are processed; that processing generates new events.
• How data is generated: starting with a random number in the range 0 to 1, how do we get an exponential distribution?
Here’s the code I used for the simulation. It’s not beautiful, but the price is right. http://www.cs.wpi.edu/~jb/CS533/Lectures/ArrivalSimulation.c
Simulation Example

Simulation
SCHEMATIC OF EVENT DRIVEN SIMULATION OF A NETWORK
1. Initialize Event Queue.
2. Determine Next Event.
3. Set current time to the time of this event.
4. Is it arrival or completion?
   • Arrival (packet approaches network): Put packet on network; if queue WAS empty, generate a completion event. Determine future time for next packet arriving. Generate event for “Packet arrives at Q”.
   • Completion (network service completed): Take packet off queue; if queue still has a packet, then generate completion. Determine when next packet will finish. Generate event for “Service Completed”.
5. Update Statistics, then return to step 2.

The Guts of the Simulation

    while( Iterations < RequestedArrivals ) {
        // Note: Iterations is not updated in this excerpt; presumably it
        // is advanced elsewhere in the full program.
        RemoveEvent( &CurrentSimulationTime, &EventType );
        if ( EventType == ARRIVAL ) {
            // Pick the interval until the next arrival from the chosen distribution
            if ( ArrivalDiscipline == EXPONENTIAL )
                NextEventTimeInterval = GetExponentialArrival( ExponentialArrivalValue );
            if ( ArrivalDiscipline == PARETO )
                NextEventTimeInterval = GetParetoArrival( ParetoArrivalValue );
            StoreStats( NextEventTimeInterval );
            AddEvent( CurrentSimulationTime + NextEventTimeInterval, ARRIVAL );
            if ( QueueLength == 0 ) {
                // Schedule completion event for this request
                NextEventTimeInterval = GetExponentialArrival( ServiceRate );
                AddEvent( CurrentSimulationTime + NextEventTimeInterval, COMPLETION );
            }
            // Do counting of state for stats purposes
            QueueLength++;
        } // End of EventType == ARRIVAL
        if ( EventType == COMPLETION ) {
            QueueLength--;
            if ( QueueLength > 0 ) {
                // Something else needs service
                NextEventTimeInterval = GetExponentialArrival( ServiceRate );
                AddEvent( CurrentSimulationTime + NextEventTimeInterval, COMPLETION );
            }
        } // End of EventType == COMPLETION
    } // End of while iterations
    // Print out the statistics:
    PrintStats();

Data Generation
Here’s the question we want to answer: given a PDF, how do we find what value of x generates a particular value of that PDF? For instance, applying this question to the Exponential Probability Density Function (PDF) $f(x) = a e^{-ax}$, or $f(x) = e^{-x}$ for a == 1, what value of x produces the resultant f(x)?
We generate random numbers in the range 0 to 1. These are the f(x). So what values of x will give us this range of f(x)? For x = 0, f(x) == 1; for x = infinity, f(x) = 0. This inverse mapping is most easily accomplished by taking the inverse function:
x = -ln( f(x) ), so x = -ln( rand() )
Strictly speaking, the function being inverted is the cumulative function $F(x) = 1 - e^{-ax}$, which gives x = -ln( 1 - rand() ) / a. For a == 1 the density equals 1 - F(x), and 1 - rand() is just as uniform as rand(), so both forms produce the same distribution.
Here’s the essence of this code:

    double GetExponentialArrival( double Argument ) {
        return( -log( 1.0 - GetRandomNumber() )/ Argument );
    } // End of GetExponentialArrival

Data Generation
So having an inverse function is very nice; it’s one reason that using the exponential function is so handy, and so universal. But for the Pareto PDF
$f(x) = \frac{b X_0^{b}}{(X_0 + x)^{b+1}}$
the inverse function is much more difficult to find. I solved this by doing a search. The binary search algorithm goes like this:
1. Pick a random number in the range 0 to 1; R = random();
2. Calculate an f(y) and f(z) such that one of these is larger than R and one is smaller than R.
3. Calculate f( (y + z)/2 ) for a value halfway between y and z.
4. Determine y and z such that f(y) and f(z) again straddle R.
5. Loop to Step 3 until the value of ( R - f(y) ) is arbitrarily small.
All this is messy and compute intensive, but that’s the way it is when there’s no inverse function.

Simulation Results Results look very similar to the analytical functions.

Simulation Results The queue lengths are larger for Pareto data. Does this make sense?

Graphs The utilization is larger for Pareto data. Does this make sense?

Marriage & Divorce Simulation
The goal of this exercise is to show the simulation of a “society”. In the larger context, it’s an example of how students might perform a simulation. Given a body of data, how do we arrange that data in order to represent how the society is behaving? This is essentially a “model” using the data.
There are three ways we can go about putting numerical values on this model:
1. Given a series of equations, can we simply solve the equations?
2. If the equations don’t have a closed form solution, can we solve them recursively? There are no statistics involved here; all we do is solve each equation over and over again and hope that it converges. This method gives us no details about the population since we’re simply solving equations.
3. We can try for a “real” simulation. In this case, we use the probabilities and a random generator to try to simulate good years and bad years. This allows us to answer much more complex questions. We could now track characteristics for each individual in our society. We could, possibly, see how long a person in our society stays married, for instance.

Marriage & Divorce Simulation
There’s lots of stuff on the web, confusing and maybe contradictory. All data is for the US.
• In 2007, there were 2,200,000 marriages. This represents a rate of 7.5 per 1000 total population. Note this is 2.2M / 296M, or about 7.5 per 1000. (Total US population is higher but some states don’t report.) Another metric which may be saying the same thing is that there are 39.9 marriages per 1000 single women. We’re going to use the first number here.
• In 2007, there were 856,000 divorces. This is 3.6 per 1000 total population.
• Interesting numbers, but not used here: 41% of 1st marriages end in divorce. 60% of 2nd marriages end in divorce. 74% of 3rd marriages end in divorce. The average remarriage occurs 3.3 years after a divorce.
• In 2007 there were 2,400,000 deaths, representing a rate of 8.2 per 1000. Details of this on next page.
• 60% of all marriages last until 1 partner dies.
• Birth rate is 13.8 per 1,000 population.
• Recent statistics say that 51% of the adult population is married. This is important: we don’t use it directly as one of our equations; we use it to test whether our model gives approximately this answer.

Marriage & Divorce Simulation
In 2007 there were 2,400,000 deaths, representing a rate of 8.2 per thousand. Details on this mortality data are for men and women 65+. Each death rate is expressed relative to that of a married person of the same sex, which is defined as 1.00:

Marital status           Men     Women
Married                  1.00    1.00
Widowed                  1.06    1.15
Divorced or separated    1.14    1.26
Never married            1.05    1.18

This information is from “US Mortality by Economic, Demographic, and Social Characteristics: The National Longitudinal Mortality Study”, Sorlie, Backlund, and Keller, 1995.
We use rates above and below the national average of 8.2 per 1000 to take into account single and married rates:
DeathMarriedRate = 7.6 per 1000
DeathSingleRate = 8.7 per 1000

Marriage & Divorce Simulation
[State diagram: states Zombie, Single, and Married; transition labels Reincarnation = 100%, Birth Rate, Marriage Rate, Divorce Rate, Widowed, Death while Single, Death while Married]

Marriage & Divorce Simulation
Leaving Zombie:    DZ = - Rbirth * ( S + M )
Entering Zombie:   DZ = + Rdeath-single * S + Rdeath-married * M
Leaving Single:    DS = - 2 * Rmarriage * ( S + M ) - Rdeath-single * S
Entering Single:   DS = + Rbirth * ( S + M ) + 2 * Rdivorce * ( S + M ) + Rdeath-married * M
Leaving Married:   DM = - 2 * Rdivorce * ( S + M ) - Rdeath-married * M
Entering Married:  DM = + 2 * Rmarriage * ( S + M )
In Steady State, leaving equals entering:
+ Rdeath-single * S + Rdeath-married * M - Rbirth * ( S + M ) = 0
+ Rbirth * ( S + M ) + 2 * Rdivorce * ( S + M ) + Rdeath-married * M - 2 * Rmarriage * ( S + M ) - Rdeath-single * S = 0
+ 2 * Rmarriage * ( S + M ) - 2 * Rdivorce * ( S + M ) - Rdeath-married * M = 0
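The "solve the equations over and over" approach can be sketched by applying the entering and leaving terms once per simulated year. This is an illustration, not the author's MarriageAndDivorceSimulation1.c. It assumes the rates map directly onto the figures quoted on the data slides (per person per year), that the Zombie pool never runs dry, and that a one-year step is fine enough. The function name `MarriedFractionAfter` is hypothetical.

```c
#include <assert.h>

/* Rates per person per year, taken from the data slides (assumed mapping). */
#define R_BIRTH          0.0138
#define R_MARRIAGE       0.0075
#define R_DIVORCE        0.0036
#define R_DEATH_SINGLE   0.0087
#define R_DEATH_MARRIED  0.0076

/* Apply the Single and Married equations once per year for `years` years
   and return the fraction of the living population that is married. */
double MarriedFractionAfter(int years, double single, double married)
{
    int y;

    assert(years >= 0 && single >= 0.0 && married >= 0.0);
    assert(single + married > 0.0);
    for (y = 0; y < years; y++) {
        double total = single + married;
        /* Entering Single minus Leaving Single */
        double dS = R_BIRTH * total + 2.0 * R_DIVORCE * total
                  + R_DEATH_MARRIED * married
                  - 2.0 * R_MARRIAGE * total - R_DEATH_SINGLE * single;
        /* Entering Married minus Leaving Married */
        double dM = 2.0 * R_MARRIAGE * total - 2.0 * R_DIVORCE * total
                  - R_DEATH_MARRIED * married;
        single += dS;
        married += dM;
    }
    return married / (single + married);
}
```

Under these assumptions the population itself keeps growing, but the married fraction settles to a little under one half regardless of the starting mix. That fraction is the quantity to compare against the 51% check figure, bearing in mind that the check figure counts adults only.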

Marriage & Divorce Simulation
Rearranging the steady-state equations gives:
- Rbirth * ( S + M ) + Rdeath-single * S + Rdeath-married * M = 0
+ Rbirth * ( S + M ) - 2 * Rmarriage * ( S + M ) + 2 * Rdivorce * ( S + M ) - Rdeath-single * S + Rdeath-married * M = 0
+ 2 * Rmarriage * ( S + M ) - 2 * Rdivorce * ( S + M ) - Rdeath-married * M = 0
Maybe there’s a solution, but they seem redundant to me.
Here are links to the code and executables for this simulation:
MarriageAndDivorceSimulation1.c     // Recursively solves the equations
MarriageAndDivorceSimulation1.exe
MarriageAndDivorceSimulation2.c     // Does a statistical simulation
MarriageAndDivorceSimulation2.exe

WRAPUP This section has shown the result of a simulation. It’s gone through the coding, the data generation, and the interpretation of results. If network arrivals are self-similar, what about all kinds of other data generated by computers? What about requests arriving at a disk? What about processes arriving at a ready queue? Is there any computer data that REALLY is random, or is it all self-similar?
