Probability Distributions


Histograms

Histograms…contd. • A histogram helps us study the shape of a frequency distribution • For example, we expect that length (or weight) of most organisms follows the familiar bell-shape • The bell-shape gives us biological information about the variable for the organism.

Histograms…contd. • However, a given histogram can only help us obtain information that is specific to the sample used. • If we can fit a curve to the histogram, we can then not only infer values for data-points within the range of the sample but also outside the range.

Interpolation and Extrapolation

Interpolation and Extrapolation • If length (x) = 3.75 cm, what is the weight (y) expected to be? • If length (x) = 3.75 cm, weight (y) = 0.0113 × (3.75)^3.3409 = 0.9341 g
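The calculation on this slide can be sketched in a few lines of Python. This is only an illustration: the function name `expected_weight_g` is hypothetical, and the coefficient and exponent are the values printed on the slide, which are presumably rounded. With these rounded values the result comes out near 0.935 g, within about 0.001 g of the figure quoted on the slide.

```python
# Power-law length-weight curve, using the coefficients as printed on the slide.
A = 0.0113   # coefficient (presumably rounded)
B = 3.3409   # exponent (presumably rounded)

def expected_weight_g(length_cm: float) -> float:
    """Expected weight (g) for a given length (cm): y = A * x**B."""
    return A * length_cm ** B

weight = expected_weight_g(3.75)
print(f"length 3.75 cm -> expected weight {weight:.3f} g")
```

The same function can be called with a length outside the sampled range, which is the extrapolation case; such values should be treated with more caution than interpolated ones.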

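Probability of inclusion in distribution
[Figure: histogram of 100 fry by length class; x-axis: Fry length class (cm); y-axis: number of fry]
Fry length class (cm) and frequency: 1.1-1.2: 1; 1.2-1.3: 4; 1.3-1.4: 9; 1.4-1.5: 12; 1.5-1.6: 15; 1.6-1.7: 18; 1.7-1.8: 15; 1.8-1.9: 12; 1.9-2.0: 9; 2.0-2.1: 4; 2.1-2.2: 1
• What is the chance that a random fry you pick is in the length range 1.15-1.27? (This range does not coincide with the class boundaries, so the histogram alone cannot answer it.)
• What is the chance that a random fry that you pick is in the class 1.6-1.7? = 18/100 = 0.18 = 18%
• What is the chance that a random fry that you pick is in the classes 1.4-1.9? = (12+15+18+15+12)/100 = 72/100 = 0.72 = 72%
• What is the chance that a random fry that you pick is in the class 1.1-1.2? = 1/100 = 0.01 = 1%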
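The class probabilities can be checked with a short sketch. The class boundaries and frequencies below are read off the histogram labels (11 classes of width 0.1 cm, 100 fry in total); the helper name `class_probability` is hypothetical and only handles ranges made up of whole classes.

```python
# Fry length classes as (lower bound, upper bound, frequency), read off the histogram.
classes = [
    (1.1, 1.2, 1), (1.2, 1.3, 4), (1.3, 1.4, 9), (1.4, 1.5, 12),
    (1.5, 1.6, 15), (1.6, 1.7, 18), (1.7, 1.8, 15), (1.8, 1.9, 12),
    (1.9, 2.0, 9), (2.0, 2.1, 4), (2.1, 2.2, 1),
]
total = sum(freq for _, _, freq in classes)

def class_probability(lo: float, hi: float) -> float:
    """Probability that a random fry falls in the whole classes spanning [lo, hi]."""
    eps = 1e-9  # tolerance for floating-point class boundaries
    inside = sum(f for a, b, f in classes if a >= lo - eps and b <= hi + eps)
    return inside / total

print("total fry:", total)
print("P(1.6-1.7):", class_probability(1.6, 1.7))
print("P(1.4-1.9):", class_probability(1.4, 1.9))
print("P(1.1-1.2):", class_probability(1.1, 1.2))
```

A range such as 1.15-1.27 cuts through classes, so this helper cannot answer it; that is the gap a fitted probability distribution fills.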

Probability distribution • Mathematical equations are fit to frequency distributions or histograms. • They are called probability distributions.

Note that all the relative frequencies in a frequency table must add up to 1 or 100%. • In other words, the total area under the probability distribution curve = 1.

Some common probability distributions • Continuous distributions: • The Normal distribution • The t distribution (Student’s t distribution) • The F distribution • The chi-square (χ²) distribution • The Gamma distribution

Some common probability distributions…contd. • Discrete distributions: • The Binomial distribution • The Hypergeometric distribution • The Negative Binomial distribution • The Poisson distribution

The Normal (Gaussian) Distribution • Perhaps the distribution model that most often fits the frequency distributions of continuous variables. • Familiar symmetric, bell-shaped curve • What does this shape mean, biologically? • Many other distributions, including discrete distributions, approximate the normal distribution under certain conditions

The Normal distributions • The location and shape of the normal probability curve are completely defined by just two parameters and two constants:
$$f(Y) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(Y-\mu)^2}{2\sigma^2}}$$
• Parameters: μ = true mean of the variable, and σ = true standard deviation of the variable • Constants: π = approx. 3.14159, and e = approx. 2.71828
[Figure: normal probability curve; x-axis: Variable; y-axis: Probability]

Let us fit a normal distribution • Data • Steps: • Construct a frequency table for the data and draw a histogram. • Obtain the f(Y) values for the mid points of each class based on the formula for the normal distribution. • Superimpose the f(Y) values for the midpoints on the histogram. Note: This is only an approximate fitting of the normal curve to the frequency table.
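The steps above can be sketched as follows. The data are an assumption: the fry length classes from the earlier histogram are reused because the slide's own data set is not reproduced here. The mean and standard deviation are estimated from the grouped data (using n - 1 in the variance, also an assumption), and the expected frequency of each class is approximated as n × class width × f(midpoint).

```python
import math

# Assumed example data: fry length class midpoints (cm) and frequencies.
midpoints = [1.15, 1.25, 1.35, 1.45, 1.55, 1.65, 1.75, 1.85, 1.95, 2.05, 2.15]
freqs = [1, 4, 9, 12, 15, 18, 15, 12, 9, 4, 1]
width = 0.1  # class width in cm

n = sum(freqs)
# Mean and standard deviation estimated from the grouped data.
mean = sum(m * f for m, f in zip(midpoints, freqs)) / n
var = sum(f * (m - mean) ** 2 for m, f in zip(midpoints, freqs)) / (n - 1)
sd = math.sqrt(var)

def normal_pdf(y: float, mu: float, sigma: float) -> float:
    """Normal density f(Y) at y for mean mu and standard deviation sigma."""
    return math.exp(-((y - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# Expected frequency per class, approximated as n * class width * f(midpoint).
expected = [n * width * normal_pdf(m, mean, sd) for m in midpoints]

for m, f, e in zip(midpoints, freqs, expected):
    print(f"{m:.2f}  observed={f:3d}  expected={e:6.2f}")
```

Plotting `expected` over the histogram bars gives the superimposed curve; as the slide notes, evaluating the density only at the midpoints is an approximate fit.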

Fitting of the normal curve to the histogram

Shapes of the normal distribution
[Figure: normal curves of different shapes; x-axis: Variable; y-axis: Relative frequency (probability)]

The Normal distributions…contd. • There isn’t just one normal distribution: there is a normal distribution for each combination of values of μ and σ. • The value of μ decides the location of the distribution and the value of σ decides the shape of the distribution.

Location and Shape of the Normal distribution
[Figure: two panels of normal curves, one with the same μ and different σ, the other with the same σ and different μ; x-axis: Value of variable]

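Useful properties of the normal distribution • μ ± σ contains 68.27% of the items • μ ± 2σ contains 95.45% of the items • μ ± 3σ contains 99.73% of the items
[Figure: normal curve divided at 1σ, 2σ and 3σ from the mean; on each side the successive bands hold 34.13%, 13.59% and 2.14% of the items]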
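These percentages can be reproduced with the standard library's `statistics.NormalDist`. The sketch below works on the standard normal curve, which is enough because the proportion within μ ± kσ is the same for every normal distribution; the helper name `coverage_within` is hypothetical.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)

def coverage_within(k: float) -> float:
    """Fraction of a normal population lying within mu ± k*sigma."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"mu ± {k} sigma contains {coverage_within(k) * 100:.2f}% of the items")
```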

Useful properties of the normal distribution…contd. • 50% of the items fall between μ ± 0.674σ • 95% of the items fall between μ ± 1.960σ • 99% of the items fall between μ ± 2.576σ
[Figure: normal curve with the central 50%, 95% and 99% regions marked]
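Going the other way, from a central percentage to the multiplier of σ, uses the inverse of the cumulative distribution. A minimal sketch, assuming Python's `statistics.NormalDist.inv_cdf`; the helper name `central_multiplier` is hypothetical.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

def central_multiplier(fraction: float) -> float:
    """Return k such that mu ± k*sigma contains the given central fraction."""
    upper_tail = (1.0 - fraction) / 2.0   # area left out in each tail
    return std_normal.inv_cdf(1.0 - upper_tail)

for fraction in (0.50, 0.95, 0.99):
    k = central_multiplier(fraction)
    print(f"{fraction:.0%} of the items fall between mu ± {k:.3f} sigma")
```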

These properties are useful only if • It is known that the variable follows the normal distribution – not usually a serious problem • The true mean and standard deviation are known for the population – unfortunately almost never true. • Even if they are known, we then need to obtain the cut-off values (as in previous slide) for each variable (because there is a different normal distribution for each variable). • Of what use are these properties then?

The Standard Normal Distribution • As mentioned before, there are an infinite number of normal distributions, one for each combination of values of μ and σ. • It would indeed be very tedious if the cut-off values had to be computed for each distribution. • Fortunately, there is another property of the normal distribution that allows us to standardize it.

The Standard Normal Distribution…contd. • If μ and σ are known then it is possible to compute the following:
$$Z = \frac{Y_i - \mu}{\sigma}$$
• This quantity, known as the standard normal deviate, gives the distance of an observation, Yi, from the mean, in terms of the standard deviation. • Thus, there is a change of units.

The Standard Normal Distribution…contd. • Z, known as the standard normal deviate, is also normally distributed, but with μ = 0 and σ = 1 • This distribution is called the Standard Normal Distribution (SND). • The SND can be generated from any variable with any μ and σ as long as the variable is normally distributed.
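A small simulation can illustrate this. The parameters below (mean 10, standard deviation 2) are hypothetical values chosen only for the demonstration: a normally distributed sample is drawn, each value is converted to a standard normal deviate, and the resulting values should have a mean close to 0 and a standard deviation close to 1.

```python
from statistics import NormalDist, mean, stdev

MU, SIGMA = 10.0, 2.0   # hypothetical population parameters, for illustration only

# Draw a reproducible sample from a normal population with these parameters.
sample = NormalDist(MU, SIGMA).samples(10_000, seed=1)

# Standardize: distance of each observation from the mean, in units of sigma.
z_values = [(y - MU) / SIGMA for y in sample]

print(f"mean of Z: {mean(z_values):.3f}")
print(f"standard deviation of Z: {stdev(z_values):.3f}")
```

Changing `MU` and `SIGMA` to any other values should leave the two printed numbers close to 0 and 1.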

+ -  = 0 Standard Normal Distribution

The Standard Normal Distribution…contd. • For example, let us assume that five different labs work on fruit flies in a university, and the fruit flies in each collection are of different sizes. For one of the collections (ours), we know that the population mean wing length, μ = 4.55 mm and the standard deviation, σ = 0.39 mm. • Furthermore, we know that the variable follows the normal distribution. • Then an individual wing length of 4.1 mm will be (4.1 - 4.55)/0.39 = -1.1538 standard deviations from the mean
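The conversion from a wing length to a standard normal deviate is a one-line calculation. A minimal sketch using the parameters of this example; the function name `standard_normal_deviate` is hypothetical.

```python
def standard_normal_deviate(y: float, mu: float, sigma: float) -> float:
    """Distance of observation y from the mean mu, in units of sigma."""
    return (y - mu) / sigma

# Wing length of 4.1 mm in a collection with mu = 4.55 mm and sigma = 0.39 mm.
z = standard_normal_deviate(4.1, mu=4.55, sigma=0.39)
print(f"Z = {z:.4f}")
```

A negative value means the observation lies below the mean; the magnitude says by how many standard deviations.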

The Standard Normal Distribution…contd. • A typical question can be: • You have found a fly in the cafeteria and you are not sure if it belongs to our lab collection. • You know the values of the following parameters for the lab collection: μ = 4.55 mm and σ = 0.39 mm

The Standard Normal Distribution…contd. • You measure the wing length of the fly (let’s assume that wing length is an important distinguishing character) and find it to be 3.7 mm. • Let’s assume that the chances of a fly escaping the collection are NOT related to its wing-size. • In other words, every fly has an equal chance of escaping the collection.

The Standard Normal Distribution…contd. • If the fly had a wing length of 4.55 mm, the logical conclusion would be that it is likely from the collection, because that wing length is representative of the flies in the collection. • But it is 3.7 mm. • That is, it is rather smaller than the mean. • So we ask: how likely is a fly with wing length 3.7 mm to belong to our collection?

The Standard Normal Distribution…contd. • Of course, the only information we may have is the parameters, μ = 4.55 mm and σ = 0.39 mm, and that the variable follows a normal distribution. • With this information, we know that we can find the area under the normal curve between any two given cut-off points. • Recall the graphs (next slide)

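The Standard Normal Distribution…contd. • μ ± σ contains 68.27% of the items • μ ± 2σ contains 95.45% of the items • μ ± 3σ contains 99.73% of the items • 50% of the items fall between μ ± 0.674σ • 95% of the items fall between μ ± 1.960σ • 99% of the items fall between μ ± 2.576σ
[Figure: the two normal curves shown earlier, one with bands of 34.13%, 13.59% and 2.14% on each side of the mean, the other with the central 50%, 95% and 99% regions marked]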

The Standard Normal Distribution…contd. • We know that 3.7mm is less than the known parametric mean, 4.55mm. • But is it so low as to make it unlikely to belong to the collection? • So we ask – what proportion are as low as 3.7mm or lower?

The Standard Normal Distribution…contd.
[Figure: normal curve with mean 4.55 mm; the area we are interested in is the lower tail, at wing lengths of 3.7 mm or lower]

The Standard Normal Distribution…contd. • But how do we find that area if we do not have the original distribution? • Fortunately, we have all the information we need: • Variable follows normal distribution • Parametric mean = 4.55mm, and • Parametric standard deviation = 0.39mm

The Standard Normal Distribution…contd. • With that information, we can obtain the Z value for our fly:
$$Z = \frac{3.7 - 4.55}{0.39} \approx -2.18$$

The Standard Normal Distribution…contd. • Because the standard normal distribution is unique, the area between any two points on the X-axis has already been computed • These tabulated areas are available in any statistics textbook

The Standard Normal Distribution…contd. • Look up the standard normal tables with the value -2.18. • We know that the area under the curve from 0 to -∞ is 0.5. • The table gives us a value of 0.4854 as the area from 0 to -2.18. • Therefore, the area from -2.18 to -∞ is 0.5 - 0.4854 = 0.0146. • That is, approximately 1.5% of the flies from the collection are expected to have a wing length as small as 3.7 mm (Z = -2.18) or lower.
[Figure: standard normal curve marked at Z = -2.18 (wing length = 3.7 mm); the area given in the table lies between 0 and -2.18, and the area needed lies below -2.18]
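The table lookup can be cross-checked in software. The sketch below assumes Python's `statistics.NormalDist`, whose `cdf` method returns the area from -∞ up to a given value, so the lower tail is obtained directly rather than as 0.5 minus a tabulated area. It computes the area twice, once on the standard normal scale and once on the original millimetre scale, to show that the two agree.

```python
from statistics import NormalDist

MU, SIGMA = 4.55, 0.39   # collection parameters (mm)
wing_length = 3.7        # the unknown fly (mm)

# Route 1: standardize, then use the standard normal distribution.
z = (wing_length - MU) / SIGMA
lower_tail = NormalDist().cdf(z)

# Route 2: use the collection's own normal distribution directly.
same_area = NormalDist(MU, SIGMA).cdf(wing_length)

print(f"Z = {z:.2f}")
print(f"area at or below Z: {lower_tail:.4f}")
print(f"area at or below 3.7 mm: {same_area:.4f}")
```

Both routes should give an area of roughly 0.0146, in line with the table-based result; tiny differences arise because the table uses Z rounded to two decimals.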

The Standard Normal Distribution…contd. • The logic in making a decision about the question of whether the fly belongs to the lab collection is simple.

The Standard Normal Distribution…contd. • We don’t know which population the fly belongs to. • Even in our collection, flies with wing length ≤ 3.7 mm are rare (~1.5%). • So is it reasonable to accept that an unknown fly of that wing length belongs to our collection?

The Standard Normal Distribution…contd. • General rule of thumb: if the probability is < 0.05 (i.e., < 5%), then you conclude that the chances that an observation belongs to that distribution are low. • Here the probability is about 0.015, so we conclude that the fly is unlikely to belong to our lab collection.
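The rule of thumb can be written as a small check. This is a sketch under stated assumptions: the function name `unlikely_from_collection` is hypothetical, and it looks only at the lower tail (wing lengths this small or smaller), as in the fly example; a fly that is unusually large would need the upper tail instead.

```python
from statistics import NormalDist

ALPHA = 0.05  # rule-of-thumb threshold from the slide

def unlikely_from_collection(wing_length: float, mu: float, sigma: float,
                             alpha: float = ALPHA) -> bool:
    """True if the chance of a wing length this small or smaller is below alpha."""
    p = NormalDist(mu, sigma).cdf(wing_length)  # lower-tail probability
    return p < alpha

print(unlikely_from_collection(3.7, mu=4.55, sigma=0.39))
print(unlikely_from_collection(4.55, mu=4.55, sigma=0.39))
```

The first call corresponds to the cafeteria fly; the second is a fly exactly at the mean, for which the lower-tail probability is 0.5.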

Some exercises • Show schematic diagrams of the area of interest, and the probabilities associated with the following: • What if the wing length of the fly was: 5.16 mm? 5.26 mm? 4.55 mm? 3.63 mm? 2.38 mm? • What proportion of flies in the collection have wing lengths between: 3 mm and 4 mm? 5 mm and 6 mm? 3 mm and 7 mm?
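For the "between two values" questions, the area is the difference of two cumulative areas. The helper below is a hypothetical sketch, demonstrated on an interval that is not one of the exercises: μ ± σ, i.e. 4.16 mm to 4.94 mm, which should reproduce the familiar figure of roughly 68.27%.

```python
from statistics import NormalDist

def proportion_between(lower: float, upper: float, mu: float, sigma: float) -> float:
    """Proportion of a normal population with values between lower and upper."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(upper) - dist.cdf(lower)

MU, SIGMA = 4.55, 0.39   # collection parameters (mm)

# Check on a known case: the interval mu ± sigma.
p = proportion_between(MU - SIGMA, MU + SIGMA, MU, SIGMA)
print(f"proportion between {MU - SIGMA:.2f} mm and {MU + SIGMA:.2f} mm: {p:.4f}")
```

The same call with the exercise cut-offs gives the requested proportions; sketching the area first, as the exercise asks, is a useful check on whether the answer is plausible.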
