Descriptive Statistics-II

Descriptive Statistics-II Mahmoud Alhussami, MPH, DSc., PhD.

Shapes of Distribution • A third important property of data, after location and dispersion, is its shape. • Distributions of quantitative variables can be described in terms of a number of features, many of which are related to the distributions’ physical appearance or shape when presented graphically: • Modality • Symmetry and skewness • Degree of skewness • Kurtosis

Modality • The modality of a distribution concerns how many peaks or high points there are. • A distribution with a single peak, that is, one value with a high frequency, is a unimodal distribution.

Modality • A distribution with two or more peaks is called a multimodal distribution.

Symmetry and Skewness • A distribution is symmetric if it could be split down the middle to form two halves that are mirror images of one another. • In asymmetric distributions, the peaks are off center, with a bulk of scores clustering at one end and a tail trailing off at the other end. Such distributions are often described as skewed. • When the longer tail trails off to the right, this is a positively skewed distribution, e.g. annual income. • When the longer tail trails off to the left, this is a negatively skewed distribution, e.g. age at death.

Symmetry and Skewness • Shape can be described by degree of asymmetry (i.e., skewness). • mean > median: positive or right skewness • mean = median: symmetric or zero skewness • mean < median: negative or left skewness • Positive skewness can arise when the mean is increased by some unusually high values. • Negative skewness can arise when the mean is decreased by some unusually low values.

Skewness [Figure: three example curves labeled left skewed, right skewed, and symmetric.]

Shapes of the Distribution Three common shapes of frequency distributions: (A) symmetrical and bell shaped; (B) positively skewed, or skewed to the right; (C) negatively skewed, or skewed to the left. 3 November 2014

Shapes of the Distribution Three less common shapes of frequency distributions: (A) bimodal; (B) reverse J-shaped; (C) uniform.

Example: number of hours to complete a task [Figure: distribution of completion times with one extreme value annotated "This guy took a VERY long time!"]

Degree of Skewness • A skewness index can readily be calculated by most statistical computer programs in conjunction with frequency distributions. • The index has a value of 0 for a perfectly symmetric distribution. • It has a positive value if there is a positive skew, and a negative value if there is a negative skew. • A skewness index that is more than twice the value of its standard error can be interpreted as a departure from symmetry.

Measures of Skewness or Symmetry • Pearson’s skewness coefficient • It is nonalgebraic and easily calculated. It is also useful for quick estimates of symmetry. • It is defined as: skewness = (mean − median)/SD • Fisher’s measure of skewness • It is based on deviations from the mean to the third power.

Pearson’s skewness coefficient • For a perfectly symmetrical distribution, the mean will equal the median, and the skewness coefficient will be zero. If the distribution is positively skewed, the mean will be more than the median and the coefficient will be positive. If the coefficient is negative, the distribution is negatively skewed and the mean is less than the median. • Skewness values will fall between -1 and +1 SD units. Values falling outside this range indicate a substantially skewed distribution. • Hildebrand (1986) states that skewness values above 0.2 or below -0.2 indicate severe skewness.
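The coefficient defined above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `pearson_skewness` and the task-time data are hypothetical, and the sample standard deviation (n − 1 denominator) is an assumption, since the slides do not say which SD is meant.

```python
import statistics

def pearson_skewness(values):
    """Pearson's skewness coefficient as defined above: (mean - median) / SD."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    sd = statistics.stdev(values)  # assumption: sample SD (n - 1 denominator)
    return (mean - median) / sd

# Hypothetical task-completion times in hours; one person took very long.
hours = [2, 3, 3, 4, 4, 4, 5, 5, 6, 20]
skew = pearson_skewness(hours)

# The single high value pulls the mean above the median, so the sign is positive.
print(round(skew, 3), "positive skew" if skew > 0 else "not positively skewed")
```

For a symmetric data set such as 1, 2, 3, 4, 5 the mean equals the median and the function returns zero.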

Fisher’s Measure of Skewness • The formula for Fisher’s skewness statistic is based on deviations from the mean to the third power. • The measure of skewness can be interpreted in terms of the normal curve. • A symmetrical curve will result in a value of 0. • If the skewness value is positive, then the curve is skewed to the right, and vice versa for a distribution skewed to the left. • A z-score is calculated by dividing the measure of skewness by the standard error for skewness. Values above +1.96 or below -1.96 are significant at the 0.05 level because 95% of the scores in a normal distribution fall between +1.96 and -1.96 standard deviations from the mean. • E.g. if Fisher’s skewness = 0.195 and its standard error = 0.197, the z-score = 0.195/0.197 = 0.99, which lies within ±1.96, so the skewness is not significant at the 0.05 level.
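The z-score check above can be sketched as follows. The slides do not give the formula for Fisher's statistic, so the small-sample-adjusted form and the standard error used here are assumptions (they are the versions commonly reported by statistical packages), and all function names are illustrative.

```python
import math
import statistics

def fisher_skewness(values):
    """Adjusted skewness based on cubed deviations from the mean (assumed form)."""
    n = len(values)
    mean = statistics.mean(values)
    s = statistics.stdev(values)
    cubed = sum((x - mean) ** 3 for x in values)
    return n / ((n - 1) * (n - 2)) * cubed / s ** 3

def skewness_standard_error(n):
    """Standard error of skewness for a sample of size n (assumed formula)."""
    return math.sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))

def z_for_skewness(skewness, standard_error):
    """z-score: the skewness divided by its standard error."""
    return skewness / standard_error

# The worked example from the slide: skewness 0.195, standard error 0.197.
z = z_for_skewness(0.195, 0.197)
print(round(z, 2), abs(z) > 1.96)  # 0.99 False
```

Because |z| is below 1.96, the example distribution shows no significant departure from symmetry at the 0.05 level.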

Kurtosis • A distribution’s kurtosis concerns how pointed or flat its peak is. • Two types: • Leptokurtic distribution (lepto means thin). • Platykurtic distribution (platy means flat).

Kurtosis • There is a statistical index of kurtosis that can be computed when computer programs are instructed to produce a frequency distribution. • For the kurtosis index, a value of zero indicates a shape that is neither flat nor pointed. • Positive values of the kurtosis statistic indicate greater peakedness, and negative values indicate greater flatness.

Fisher’s Measure of Kurtosis • Fisher’s measure is based on deviations from the mean to the fourth power. • A z-score is calculated by dividing the measure of kurtosis by the standard error for kurtosis.
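A rough sketch of a kurtosis index built from fourth-power deviations is shown below. It uses the plain moment-based excess kurtosis, which is an assumption made for brevity: statistical packages usually apply a small-sample correction that the slides do not spell out. The function name and both data sets are hypothetical.

```python
import statistics

def excess_kurtosis(values):
    """Moment-based kurtosis index: zero for a normal shape (simplified form)."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m4 = sum((x - mean) ** 4 for x in values) / n  # fourth central moment
    return m4 / m2 ** 2 - 3

# Hypothetical data: one flat set and one with a sharp central peak.
flat = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
peaked = [5, 5, 5, 5, 5, 5, 5, 5, 0, 10]

print(excess_kurtosis(flat) < 0)    # flat shape gives a negative index
print(excess_kurtosis(peaked) > 0)  # pointed shape gives a positive index
```

As with skewness, dividing the index by its standard error gives a z-score that can be compared with ±1.96.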

Normal Distribution • Also called the bell-shaped curve, normal curve, or Gaussian distribution. • A normal distribution is one that is unimodal, symmetric, and not too peaked or flat. • Given its name by the Belgian mathematician Quetelet who, in the early 19th century, noted that many human attributes, e.g. height, weight, and intelligence, appeared to be distributed normally.

Normal Distribution The normal curve is unimodal and symmetric about its mean (μ). In this distribution the mean, median, and mode are all identical. The standard deviation (σ) specifies the amount of dispersion around the mean. The two parameters μ and σ completely define a normal curve. The curve is also called a probability density function. The probability is interpreted as "area under the curve." The area under the whole curve = 1.

Sampling Distribution • A sample statistic is often unequal to the value of the corresponding population parameter because of sampling error. • Sampling error reflects the tendency for statistics to fluctuate from one sample to another. • The amount of sampling error is the difference between the obtained sample value and the population parameter. • Inferential statistics allow researchers to estimate how close to the population value the calculated statistic is likely to be. • The concept of sampling distributions, which are actually probability distributions, is central to estimates of sampling error.

Characteristics of Sampling Distribution • Sampling error = sample mean − population mean. • Every sample size has a different sampling distribution of the mean. • Sampling distributions are theoretical because, in practice, no one draws an infinite number of samples from a population. • Their characteristics can be modeled mathematically and are determined by a formulation known as the central limit theorem. • This theorem stipulates that the mean of the sampling distribution is identical to the population mean. • A consequence of the central limit theorem is that if we average measurements of a particular quantity, the distribution of our average tends toward a normal one. • The average sampling error, i.e. the mean of the (sample mean − μ) values over all samples, would always equal zero.

Standard Error of the Mean • The standard deviation of a sampling distribution of the mean has a special name: the standard error of the mean (SEM). • The smaller the SEM, the more accurate are the sample means as estimates of the population value.

Central Limit Theorem • The theorem describes the characteristics of the "population of the means", which is created from the means of an infinite number of random samples of size N, all of them drawn from a given "parent population". • It predicts that, regardless of the distribution of the parent population: • The mean of the population of means is always equal to the mean of the parent population from which the samples were drawn. • The standard deviation of the population of means is always equal to the standard deviation of the parent population divided by the square root of the sample size (N). • The distribution of means will increasingly approximate a normal distribution as the sample size N increases.
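The three predictions above can be checked with a small simulation. This is only a sketch under stated assumptions: the parent population is an exponential distribution with mean 1 and standard deviation 1 (chosen because it is strongly skewed), and a finite number of samples stands in for the "infinite number" in the theorem. All names and constants are illustrative.

```python
import math
import random
import statistics

random.seed(0)          # fixed seed so the run is repeatable

N = 30                  # size of each sample
NUM_SAMPLES = 5000      # finite stand-in for "an infinite number of samples"
PARENT_MEAN = 1.0       # mean of the assumed exponential parent population
PARENT_SD = 1.0         # standard deviation of that parent population

# Build the "population of means": one mean per random sample of size N.
sample_means = [
    statistics.fmean([random.expovariate(1.0) for _ in range(N)])
    for _ in range(NUM_SAMPLES)
]

mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.stdev(sample_means)   # empirical standard error of the mean
predicted_sem = PARENT_SD / math.sqrt(N)       # sigma / sqrt(N)

print(f"mean of means: {mean_of_means:.3f} (parent mean {PARENT_MEAN})")
print(f"SD of means:   {sd_of_means:.3f} (predicted SEM {predicted_sem:.3f})")
```

Rerunning with a larger N shrinks the spread of the sample means, which is the sense in which a smaller SEM means more accurate estimates.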

Standard Normal Variable It is customary to call a standard normal random variable Z. The outcomes of the random variable Z are denoted by z. The table in the coming slide gives the area under the curve (probabilities) between the mean and z. The probabilities in the table refer to the likelihood that a randomly selected value Z is equal to or less than a given value of z and greater than 0 (the mean of the standard normal).

Normal Distribution [Table: areas under the standard normal curve between the mean and z.] Source: Levine et al, Business Statistics, Pearson.

The 68-95-99.7 Rule for the Normal Distribution • 68% of the observations fall within one standard deviation of the mean. • 95% of the observations fall within two standard deviations of the mean. • 99.7% of the observations fall within three standard deviations of the mean. • When applied to ‘real data’, these estimates are considered approximate!
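The three percentages in the rule can be reproduced from the standard normal cumulative distribution function. The sketch below uses `statistics.NormalDist` from the Python standard library; the variable names are illustrative.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

# Area within k standard deviations of the mean, for k = 1, 2, 3.
within = {
    k: standard_normal.cdf(k) - standard_normal.cdf(-k)
    for k in (1, 2, 3)
}

for k, area in within.items():
    print(f"within {k} SD: {area:.4f}")
# within 1 SD: 0.6827
# within 2 SD: 0.9545
# within 3 SD: 0.9973
```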

Normal Distribution Remember these probabilities (percentages): [table not preserved in this transcript]. Practice: Find these values yourself using the Z table.

Standard Normal Curve [Figure: the standard normal curve.]

Standard Normal Distribution [Figure: 50% of the probability (0.5) lies below the mean and 50% (0.5) lies above it.]

Standard Normal Distribution [Figure: standard normal distribution with the central 95% area marked and 2.5% of the probability in each tail.]

Calculating Probabilities Probability calculations are always concerned with finding the probability that the variable assumes any value in an interval between two specific points a and b. The probability that a continuous variable assumes a value between a and b is the area under the graph of the density between a and b.

Example: Weight If the weight of males is normally distributed (N.D.) with μ=150 and σ=10, what is the probability that a randomly selected male will weigh between 140 lbs and 155 lbs?

Example: Weight Solution: • Z = (140 – 150)/10 = -1.00 s.d. from the mean; area under the curve = .3413 (from the Z table). • Z = (155 – 150)/10 = +.50 s.d. from the mean; area under the curve = .1915 (from the Z table). • Answer: .3413 + .1915 = .5328
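The table lookups in this solution can be mirrored in code. In the sketch below, the helper `area_mean_to_z` is a hypothetical stand-in for the Z table: it returns the area between the mean (0) and z, computed from the standard library's `NormalDist`.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist(mu=0, sigma=1)

def area_mean_to_z(z):
    """Area under the standard normal curve between 0 and z (the Z-table value)."""
    return abs(STANDARD_NORMAL.cdf(z) - 0.5)

mu, sigma = 150, 10
z_low = (140 - mu) / sigma    # -1.0
z_high = (155 - mu) / sigma   # +0.5

# The two limits lie on opposite sides of the mean, so the two areas add.
probability = area_mean_to_z(z_low) + area_mean_to_z(z_high)

print(round(area_mean_to_z(z_low), 4),
      round(area_mean_to_z(z_high), 4),
      round(probability, 4))
# 0.3413 0.1915 0.5328
```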

Example: IQ If IQ is N.D. with a mean of 100 and an S.D. of 10, what percentage of the population will have (a) IQs ranging from 90 to 110? (b) IQs ranging from 80 to 120? Solution to (a): Z = (90 – 100)/10 = -1.00 and Z = (110 – 100)/10 = +1.00. The area between 0 and 1.00 in the Z-table is .3413; the area between 0 and -1.00 is also .3413 (the Z-distribution is symmetric). Answer to part (a) is .3413 + .3413 = .6826.

Example: IQ (b) IQs ranging from 80 to 120? Solution: Z = (80 – 100)/10 = -2.00 and Z = (120 – 100)/10 = +2.00. The area between 0 and 2.00 in the Z-table is .4772; the area between 0 and -2.00 is also .4772 (the Z-distribution is symmetric). Answer is .4772 + .4772 = .9544.
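Both parts of the IQ example can also be computed directly on the IQ scale, without converting to Z first. This is a sketch using `statistics.NormalDist`; the helper name `proportion_between` is illustrative.

```python
from statistics import NormalDist

iq = NormalDist(mu=100, sigma=10)

def proportion_between(dist, low, high):
    """Area under the density of `dist` between low and high."""
    return dist.cdf(high) - dist.cdf(low)

part_a = proportion_between(iq, 90, 110)   # within one S.D. of the mean
part_b = proportion_between(iq, 80, 120)   # within two S.D. of the mean

print(round(part_a, 4), round(part_b, 4))
# 0.6827 0.9545
```

These differ from the table answers (.6826 and .9544) only in the last digit, because the table values are rounded to four decimals before being added.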

Example: Salary Suppose that the salary of college graduates is N.D. with μ=$40,000 and σ=$10,000. (a) What proportion of college graduates will earn $24,800 or less? (b) What proportion of college graduates will earn $53,500 or more? (c) What proportion of college graduates will earn between $45,000 and $57,000? (d) Calculate the 80th percentile. (e) Calculate the 27th percentile.

Example: Salary (a) What proportion of college graduates will earn $24,800 or less? Solution: Convert the $24,800 to a Z-score: Z = ($24,800 - $40,000)/$10,000 = -1.52. Always DRAW a picture of the distribution to help you solve these problems.

Example: Salary [Figure: normal curve with X = $24,800 at Z = -1.52 and X = $40,000 at Z = 0; the area between them is .4357.] First, find the area between 0 and -1.52 in the Z-table. From the Z table, that area is .4357. Then, the area from -1.52 to -∞ is .5000 - .4357 = .0643. Answer: 6.43% of college graduates will earn $24,800 or less.

Example: Salary [Figure: normal curve with $40,000 at Z = 0 and $53,500 at Z = +1.35; area .4115 between them and .0885 in the upper tail.] (b) What proportion of college graduates will earn $53,500 or more? Solution: Convert the $53,500 to a Z-score: Z = ($53,500 - $40,000)/$10,000 = +1.35. Find the area between 0 and +1.35 in the Z-table: .4115 is the table value. When you DRAW A PICTURE (above) you see that you need the area in the tail: .5 - .4115 = .0885. Answer: .0885. Thus, 8.85% of college graduates will earn $53,500 or more.

Example: Salary [Figure: normal curve with $40k at Z = 0, $45k at Z = .5, and $57k at Z = 1.7; area .1915 between 0 and .5, area .4554 between 0 and 1.7.] (c) What proportion of college graduates will earn between $45,000 and $57,000? Z = ($45,000 – $40,000)/$10,000 = .50 and Z = ($57,000 – $40,000)/$10,000 = 1.70. From the table, we can get the area under the curve between the mean (0) and .5, and the area between 0 and 1.7. From the picture we see that neither one is what we need. What do we do here? Subtract the small piece from the big piece to get exactly what we need. Answer: .4554 − .1915 = .2639
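Parts (a) to (c) each reduce to a lower tail, an upper tail, or a difference of two cumulative areas. The following sketch checks them with `statistics.NormalDist`; the variable names are illustrative.

```python
from statistics import NormalDist

salary = NormalDist(mu=40_000, sigma=10_000)

# (a) $24,800 or less: area in the lower tail.
part_a = salary.cdf(24_800)

# (b) $53,500 or more: area in the upper tail.
part_b = 1 - salary.cdf(53_500)

# (c) Between $45,000 and $57,000: the big piece minus the small piece.
part_c = salary.cdf(57_000) - salary.cdf(45_000)

print(f"(a) {part_a:.4f}  (b) {part_b:.4f}  (c) {part_c:.4f}")
# (a) 0.0643  (b) 0.0885  (c) 0.2640
```

Part (c) prints .2640 rather than the table answer of .2639 because the table values are rounded before the subtraction.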

Z-scores and percentiles Parts (d) and (e) of this example ask you to compute percentiles. Every Z-score is associated with a percentile. A Z-score of 0 is the 50th percentile. This means that if you take any test that is normally distributed (e.g., the SAT exam) and your Z-score on the test is 0, you scored at the 50th percentile. In fact, your score is the mean, median, and mode.

Example: Salary [Figure: normal curve with $40,000 at Z = 0 and the answer at Z = .84; area .5000 below the mean and .3000 between 0 and .84.] (d) Calculate the 80th percentile. Solution: First, what Z-score is associated with the 80th percentile? A Z-score of approximately +.84 will give you about .3000 of the area under the curve. Also, the area under the curve between -∞ and 0 is .5000. Therefore, a Z-score of +.84 is associated with the 80th percentile. Now to find the salary (X) at the 80th percentile, just solve for X: +.84 = (X − $40,000)/$10,000, so X = $40,000 + $8,400 = $48,400.

Example: Salary [Figure: normal curve with $40,000 at Z = 0 and the answer at Z = -.61; area .2700 in the lower tail, .2300 between -.61 and 0, and .5000 above the mean.] (e) Calculate the 27th percentile. Solution: First, what Z-score is associated with the 27th percentile? A Z-score of approximately -.61 will give you about .2300 of the area under the curve, with .2700 in the tail. (The area under the curve between 0 and -.61 is .2291, which we are rounding to .2300.) Also, the area under the curve between 0 and ∞ is .5000. Therefore, a Z-score of -.61 is associated with the 27th percentile. Now to find the salary (X) at the 27th percentile, just solve for X: -.61 = (X − $40,000)/$10,000, so X = $40,000 - $6,100 = $33,900
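Finding a percentile is the inverse of finding an area: start from the cumulative probability and solve for X. A minimal sketch with `statistics.NormalDist.inv_cdf` follows; the variable names are illustrative.

```python
from statistics import NormalDist

salary = NormalDist(mu=40_000, sigma=10_000)
standard_normal = NormalDist()

# Z-scores associated with the 80th and 27th percentiles.
z_80 = standard_normal.inv_cdf(0.80)
z_27 = standard_normal.inv_cdf(0.27)

# Salaries at those percentiles: X = mu + Z * sigma.
p80 = salary.inv_cdf(0.80)
p27 = salary.inv_cdf(0.27)

print(f"80th percentile: Z = {z_80:.2f}, salary = {p80:,.0f}")
print(f"27th percentile: Z = {z_27:.2f}, salary = {p27:,.0f}")
```

The salaries come out within a few tens of dollars of the hand-worked $48,400 and $33,900, because the hand solution rounds Z to two decimals before solving for X.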

Graphical Methods • Frequency Distribution • Histogram • Frequency Polygon • Cumulative Frequency Graph • Pie Chart

Presenting Data • Table • Condenses data into a form that can make them easier to understand; • Shows many details in summary fashion; BUT • Since a table shows only numbers, it may not be readily understood without comparing it to other values.

Principles of Table Construction • Don’t try to do too much in a table. • Use white space effectively to make table layout pleasing to the eye. • Make sure tables & text refer to each other. • Use some aspect of the table to order & group rows & columns.

Principles of Table Construction • If appropriate, frame table with summary statistics in rows & columns to provide a standard of comparison. • Round numbers in table to one or two decimal places to make them easily understood. • When creating tables for publication in a manuscript, double-space them unless contraindicated by journal.

Frequency Distributions A useful way to present data when you have a large data set is the formation of a frequency table or frequency distribution. Frequency: the number of observations that fall within a certain range of the data.

Frequency Table [Table: example frequency table, not preserved in this transcript.]
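Building a frequency table amounts to counting how many observations fall in each range. The sketch below is one way to do it: the function name `frequency_table`, the equal-width class intervals, and the age data are all hypothetical, and it assumes every value is at least `start`.

```python
from collections import Counter

def frequency_table(values, start, width):
    """Return rows of (lower limit, upper limit, frequency, cumulative frequency)."""
    # Map each value to the index of its class interval (assumes value >= start).
    counts = Counter(int((v - start) // width) for v in values)
    rows, cumulative = [], 0
    for k in range(max(counts) + 1):
        cumulative += counts[k]          # Counter returns 0 for empty intervals
        lower = start + k * width
        rows.append((lower, lower + width, counts[k], cumulative))
    return rows

# Hypothetical ages of 14 participants, grouped into 10-year intervals.
ages = [21, 23, 25, 28, 31, 32, 34, 35, 38, 41, 44, 47, 52, 58]
rows = frequency_table(ages, start=20, width=10)

for lower, upper, freq, cum in rows:
    print(f"{lower}-{upper - 1}: frequency {freq}, cumulative {cum}")
```

The cumulative column is the basis for a cumulative frequency graph, and the frequency column is what a histogram or frequency polygon plots.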

Presenting Data • Chart • Visual representation of a frequency distribution that helps to gain insight about what the data mean. • Built with lines, area & text: bar charts, pie charts
