Measures of Central Tendency: Some are more robust than others… The value that best represents the mid-point of a set of values, but which may not actually be found in the set of values themselves. Major types are: • Means (Arithmetic, Weighted/Grouped, Geometric, Harmonic, Trimmed) • Median • Mode
What Does Being Robust Mean? When a statistic is robust it means that deviations from the underlying assumptions of a data distribution do not affect the statistic’s ability to represent the data values that comprise a dataset’s distribution. WHAT DOES THAT MEAN? It rests on The Wisdom of the Crowd: the idea that a population sample’s values (its “sample statistics”) are a good representation of what’s happening in The Population. But to say this we need to make some assumptions about the sample’s view of The Population’s reality. These are called the underlying assumptions.
The Wisdom of the Crowd • Regression to the Mean Experiment: There are 39 guesses. The mean of the guesses is 5,933. Note that the running mean approximates the actual number more and more closely.
Difference of Actual Number to Running Mean: With all guesses included, the mean of the guesses approaches the actual number much faster. With the extreme value removed (58,000) the mean of the remaining guesses still approaches the actual number. As ‘n’ increases, the mean will approach the actual value more and more closely.
What Does Regression to the Mean, Mean? So regression to the mean (more or less) means that eventually sample values for a variable will get closer and closer to the mean of all values for the variable. Put another way, the further a given sample value is from the mean, the higher the probability that the next sample will be closer to the mean. It is what’s known as an artefact of the data – or more precisely of the sample. It occurs basically because of non-random sampling, and it can be a problem in statistical sampling as we will see later.
The Assumptions Underlying Why The Crowd’s View is a Good Approximation of The Population’s Reality (And this is the longest title in the history of PowerPoint. WooHoo!) These are the underlying assumptions for The Crowd: Data distributions are normally distributed (a.k.a. bell shaped). This means that: • They have no outliers • They have no gaps • They are not skewed (skewness) • They are not peaked (kurtosis) • They have no extreme values • They are not bi-modal (two peaks) • They are not poly-modal (many peaks) • Their measures of central tendency are equal (hmmm). This is called being “robust”.
What Does Being Robust Mean? Deviant distributions (distributions that deviate from the prior assumptions) happen because of non-normal attributes such as: • extreme values • bi-modality or poly-modality • outliers • gaps • skewness • kurtosis • extreme differences between values These, thankfully for Statistics, are rare occurrences, but we must always check for them.
What Does Being Robust Not Mean? Robustness = ability to withstand assumption violation. Being robust does not mean being better or more accurate – quite the opposite. It means the statistical tool is better able to withstand assumption violation. Discrimination = ability to accurately represent a set of values. • Statistical tools with low robustness usually have a higher ability to discriminate. • Statistical tools with high robustness usually have a lower ability to discriminate.
Parametric and Non-Parametric Tools The assumptions underlying statistical distributions are called parameters. • Parametric statistical tools usually have a low ability to withstand assumption violation. • Non-Parametric statistical tools usually have a high ability to withstand assumption violation. Therefore… • Parametric statistical tools usually have low robustness but have a higher ability to discriminate. • Non-Parametric statistical tools usually have high robustness but have a lower ability to discriminate.
Being Robust [Figure: a set of values with the middle value and an extreme value marked.] The mean is almost 3 times the median.
These are our values and where they lie on the distribution. This is an outlier – waaaaay out – and it drags the mean out with it. This is the median (5). This is the mean (14). Thus the median is the more ‘robust’ statistic because it is less affected by the extreme value; it represents the dataset more accurately.
These are our values and where they lie on the distribution. The outlier is gone, so the mean moves back. This is the median and the mean. Now the mean is the better statistic because it is more accurate, since it uses arithmetic to represent the dataset. But the median is still more robust.
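The two slides above can be sketched in a few lines of Python. The data values here are hypothetical: they were picked only so that the mean (14) and median (5) match the figures quoted on the slide, and the original dataset may well have been different.

```python
from statistics import mean, median

# Hypothetical values, chosen only to reproduce the slide's mean (14) and median (5).
with_outlier = [1, 3, 5, 7, 54]
# The same values with the extreme value (54) removed.
without_outlier = [1, 3, 5, 7]

mean_with = mean(with_outlier)
median_with = median(with_outlier)
mean_without = mean(without_outlier)
median_without = median(without_outlier)

# The mean moves a long way when the outlier is dropped; the median barely moves.
print("with outlier:    mean =", mean_with, " median =", median_with)
print("without outlier: mean =", mean_without, " median =", median_without)
print("shift in mean:  ", mean_with - mean_without)
print("shift in median:", median_with - median_without)
```

Under these assumed values the mean shifts by 10 while the median shifts by only 1, which is the sense in which the median is the more robust of the two.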
Arithmetic Mean Arithmetic lives here. Returns the arithmetic centre of the data distribution, such that the sum of all data value differences from the mean equals zero. The arithmetic mean is the arithmetic middle point of a set of values. This means that the differences between any value x in a dataset and the mean of that dataset will sum to zero. THIS DOES NOT MEAN THAT THE ARITHMETIC MEAN IS THE MID POINT OF THE DATASET’S DISTRIBUTION BECAUSE THE ARITHMETIC MEAN IS STRONGLY INFLUENCED BY EXTREME VALUES.
Median Arithmetic does not live here. Returns the exact centre value of the dataset and hence the second quartile value. Half the values of the dataset will be above the median and half will be below. The median is the middle case of a set of cases (records or rows). THIS MEANS THAT THE MEDIAN IS THE EXACT CENTRAL POINT OF A DATASET’S DISTRIBUTION – 50% of values will be below the median and 50% will be above. THIS MEANS THAT THE MEDIAN IS NOT AFFECTED BY EXTREME VALUES.
Mode (or Modality) Returns the most frequently occurring data value in the dataset. Sometimes reported as a label, when the “value” is nominal level data – e.g. religious or political affiliation. In the example sample, the “modality” would be Protestant because it is the most frequently occurring “value”.
The Arithmetic Mean $\bar{x} = \frac{\sum x}{n}$ where: $\bar{x}$ = the arithmetic mean, ∑ = the sum of all x’s, x = a value in the dataset, n = the number of values (or cases) in the dataset.
SAMPLE AND POPULATION SYMBOLOGY • In formulas for a sample (such as the arithmetic mean), Latin letters are used, and a lower case ‘n’ is used for the number of cases. • In formulas for a population (such as the arithmetic mean), Greek symbols and letters are used (here mu, μ) and an upper case ‘N’ is used for the number of cases.
THE ARITHMETIC MEAN - DIFFERENCES SUM TO ZERO e.g. 38.25 – 21 = 17.25 and 38.25 – 34 = 4.25. Remember this.
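A quick way to check the "differences sum to zero" property is to compute it. In the sketch below, only the mean (38.25) and the values 21 and 34 come from the slide; the other two values (45 and 53) are assumptions added so that the mean works out to 38.25.

```python
# 21 and 34 appear on the slide; 45 and 53 are hypothetical fill-ins
# chosen so that the mean equals the slide's 38.25.
values = [21, 34, 45, 53]

mean_value = sum(values) / len(values)

# The slide subtracts each value from the mean (mean - x).
deviations = [mean_value - x for x in values]
total = sum(deviations)

print("mean:", mean_value)
print("deviations:", deviations)
print("sum of deviations:", total)
```

The positive and negative deviations cancel exactly, which is what it means for the arithmetic mean to be the arithmetic centre of the data.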
THE ARITHMETIC MEAN - EFFECT OF EXTREME VALUES The middle dataset is unbalanced. It has one extreme value that pulls the mean higher than all but that extreme value. Note that the median and mode are not affected. An opposite extreme value balances the dataset again. When an extreme value is present the median should be used and not the arithmetic mean because the distribution will be skewed. BUT YOU ARE STILL LEFT WITH EXTREME VALUES. These will affect the deviation in the dataset (s and s²).
THE MEDIAN The median is the middle point of a set of cases, with half the values above and half below. Since there is an even number of values here, you take the mean of the centre two values: $42,000. If there were an odd number of values, then the single middle value would be the median.
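The even/odd rule described above can be written out as a small function. This is a minimal sketch: `median_of` is an illustrative name, and the salary figures are hypothetical, chosen so that the two centre values average to the slide's $42,000.

```python
def median_of(values):
    """Return the middle case of the sorted values (mean of the two centre values if n is even)."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty dataset is undefined")
    mid = n // 2
    if n % 2 == 1:
        # Odd number of cases: the single middle value is the median.
        return ordered[mid]
    # Even number of cases: take the mean of the two centre values.
    return (ordered[mid - 1] + ordered[mid]) / 2

# Hypothetical salaries; the centre two (40,000 and 44,000) average to 42,000.
even_salaries = [50_000, 35_000, 44_000, 40_000]
odd_salaries = even_salaries + [61_000]

even_median = median_of(even_salaries)
odd_median = median_of(odd_salaries)
print(even_median, odd_median)
```

Python's standard library offers `statistics.median`, which follows the same rule; the hand-written version is shown only to make the two cases visible.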
Mean and Median – Points to Remember • The Mean is the arithmetic middle point of a set of values. It is calculated arithmetically from all data values. It is not the exact mid point of a set of values because it is strongly influenced by extreme values. BECAUSE OF THIS THE ARITHMETIC MEAN IS NOT A ROBUST STATISTIC BUT IT IS A DISCRIMINATING ONE. • The Median is the middle point of a set of cases (records or rows). It is calculated by dividing the number of rows into two halves. Because it is the exact mid point of a set of cases it is not influenced by extreme values. BECAUSE OF THIS THE MEDIAN IS A ROBUST STATISTIC BUT IT IS NOT A DISCRIMINATING ONE.
The Mode The most frequently occurring data value (in this example $45,000 in each dataset)
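The mode is a frequency count, so it works for labels as well as numbers. The sketch below uses a hypothetical helper, `modes`, and hypothetical data; it returns a list because a dataset can be bi-modal or poly-modal.

```python
from collections import Counter

def modes(values):
    """Return every value that occurs with the highest frequency."""
    counts = Counter(values)
    highest = max(counts.values())
    return [value for value, count in counts.items() if count == highest]

# Hypothetical numeric data with $45,000 as the most frequent value.
salaries = [38_000, 45_000, 45_000, 52_000]
# Hypothetical nominal data: the mode is reported as a label.
affiliations = ["Protestant", "Catholic", "Protestant", "Jewish", "Protestant"]
# A bi-modal case: two values tie for the highest frequency.
bimodal = [1, 1, 2, 2, 3]

print(modes(salaries))
print(modes(affiliations))
print(modes(bimodal))
```

No arithmetic is done on the values themselves, which is why the mode remains usable for nominal level data where a mean or median would be meaningless.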
Gaps and Outliers The following histogram has outliers: there are three cities in the leftmost bar. This creates a gap where there are effectively no values.
Using Measures of Central Tendency Use the method that returns the most information about the centre of the dataset – usually the arithmetic mean. BUT… With highly skewed (such as income) or non-unimodal datasets the median should be used. Means and medians cannot be used with nominal level data – the mode can be used to describe the most frequently occurring label. USING THE MODE IS CALLED MODAL ANALYSIS.
Other means to an end… • Weighted mean: Useful when the ‘x’s have unequal weights as in grade calculations (e.g. tests worth 20% labs worth 30%, etc). • Grouped data mean: Useful when you only have data in categories, as with income classes – is a special case of the weighted mean. • Geometric mean: Useful when you have percentages, ratios, indexes or data covering several orders of magnitude. • Harmonic mean: Useful when you have rates as in calculating average speeds. • Trimmed mean: Useful for removing outliers.
Weighted Mean $\bar{x}_w = \frac{\sum w_i x_i}{\sum w_i}$ Where: $\bar{x}_w$ = weighted mean, $x_i$ = data value, $w_i$ = weight of data value. The weighted mean is used when data values have weighting schemes, as with the grades in this course.
Weighted mean example #1 (n = 3 scores). The arithmetic mean method: 246/3 = 82%. The weighted mean method: 8270/100 = 82.7%.
Weighted mean example #2: Changing the weights (n = 3 scores). The arithmetic mean method: 246/3 = 82%. The weighted mean method: 8060/100 = 80.6%.
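The two weighted mean examples can be reproduced with a short function. The individual scores and weights did not survive in the slide text, so the ones below are hypothetical: they were chosen only so that the totals agree with the slide (scores summing to 246, weights summing to 100, weighted sums of 8270 and 8060). The function name `weighted_mean` is likewise just illustrative.

```python
def weighted_mean(values, weights):
    """Sum of weight * value divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for x, w in zip(values, weights)) / total_weight

# Hypothetical scores (sum = 246) and weights (each set sums to 100).
scores = [76, 80, 90]
weights_example_1 = [45, 10, 45]   # weighted sum = 8270
weights_example_2 = [60, 10, 30]   # weighted sum = 8060

arithmetic = sum(scores) / len(scores)
weighted_1 = weighted_mean(scores, weights_example_1)
weighted_2 = weighted_mean(scores, weights_example_2)

print("arithmetic mean:", arithmetic)
print("weighted mean, example 1:", round(weighted_1, 1))
print("weighted mean, example 2:", round(weighted_2, 1))
```

The scores are identical in both examples; only the weights change, and that alone moves the result from above the arithmetic mean to below it.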
Grouped data mean example: Population in Census Tract 12345.6 (6 classes). The arithmetic mean method: $180,000/6 = $30,000. The weighted mean method: $6,410,000/249 = $25,742.97.
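The grouped data mean is the weighted mean with class midpoints as the values and class counts as the weights. The census tract's class breakdown is not recoverable from the slide text, so the sketch below uses made-up midpoints and counts purely to show the mechanics; it does not reproduce the slide's figures.

```python
# Hypothetical income classes: (class midpoint in dollars, number of people in the class).
classes = [
    (10_000, 60),
    (30_000, 30),
    (50_000, 10),
]

total_people = sum(count for _, count in classes)
total_income = sum(midpoint * count for midpoint, count in classes)

# Grouped (weighted) mean: each midpoint counts once per person in its class.
grouped_mean = total_income / total_people

# Naive arithmetic mean of the midpoints: ignores how many people are in each class.
midpoint_mean = sum(midpoint for midpoint, _ in classes) / len(classes)

print("grouped mean:", grouped_mean)
print("mean of midpoints:", midpoint_mean)
```

With most people in the lowest class, the grouped mean falls well below the plain mean of the midpoints, the same direction of difference the slide reports.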
Geometric mean $GM = \sqrt[n]{\prod_{i=1}^{n} x_i} = \sqrt[n]{x_1 \cdot x_2 \cdots x_n}$ The ∏ symbol is the upper case Greek letter pi and signifies the product of a set of multiplications. Where: GM = geometric mean, x = data values, ⁿ√ = nth root of the product of all x. Used extensively in biology and finance.
Geometric mean – use when your data: • Are percentages, ratios, indexes or growth rates; • Have an exponential distribution; • Have a high value more than 3 times the low value; • Cover several orders of magnitude. Geometric mean – do not use when your data: • Are already log scaled such as decibels or pH; • Have a high value less than 3 times the low value.
Geometric Mean Example: Example using bacteria counts (they typically vary widely). Basically the data are log transformed. Thus extreme values are tempered. GM = 42.42
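The remark that "the data are log transformed" can be made concrete: the geometric mean is the exponential of the arithmetic mean of the logs, which equals the nth root of the product. The bacteria counts from the slide are not available, so the counts below are hypothetical and will not give the slide's 42.42.

```python
import math

def geometric_mean(values):
    """exp(mean(log x)), equivalent to the nth root of the product of the values."""
    if not values:
        raise ValueError("geometric mean of an empty dataset is undefined")
    if any(x <= 0 for x in values):
        raise ValueError("geometric mean requires strictly positive values")
    log_mean = sum(math.log(x) for x in values) / len(values)
    return math.exp(log_mean)

# Hypothetical counts spanning several orders of magnitude.
counts = [10, 100, 1000]

gm = geometric_mean(counts)
am = sum(counts) / len(counts)
nth_root = math.prod(counts) ** (1 / len(counts))

print("geometric mean:", round(gm, 2))
print("nth root of product:", round(nth_root, 2))
print("arithmetic mean:", am)
```

Working in logs also avoids overflow when the product of many large counts would be enormous. Python 3.8 and later provide `statistics.geometric_mean` for the same calculation.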
Harmonic mean $HM = \frac{n}{\sum \frac{1}{x}}$ Where: HM = harmonic mean, 1/x = reciprocals of data values, n = number of data values. • Harmonic mean: useful when you have rates per unit, such as distance per unit of time (speed), to average out.
Harmonic mean: example of transportation & speed. What’s the average speed for a delivery truck given these data: Arithmetic mean speed = (60 kph + 80 kph)/2 = 70 kph. But the time taken is 3 hrs + 2.25 hrs = 5.25 hrs. Therefore the actual average speed = 360 km/5.25 hrs = 68.57 kph. This is what the harmonic mean does: HM = 2/((1/60 kph) + (1/80 kph)) = 68.57 kph. The difference is small, but over more segments it can be significant.
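The truck example can be checked directly. The speeds (60 and 80 kph), times (3 and 2.25 hrs) and total distance (360 km) come from the slide; the equal 180 km leg length is inferred from them (60 × 3 = 80 × 2.25 = 180). The function name is illustrative.

```python
def harmonic_mean(values):
    """n divided by the sum of the reciprocals of the values."""
    if not values:
        raise ValueError("harmonic mean of an empty dataset is undefined")
    if any(x <= 0 for x in values):
        raise ValueError("harmonic mean requires strictly positive values")
    return len(values) / sum(1 / x for x in values)

speeds_kph = [60, 80]
leg_km = 180  # each leg covers the same distance (inferred from the slide's times)

# Average speed from first principles: total distance over total time.
total_distance = leg_km * len(speeds_kph)
total_time = sum(leg_km / speed for speed in speeds_kph)
actual_average = total_distance / total_time

hm = harmonic_mean(speeds_kph)
am = sum(speeds_kph) / len(speeds_kph)

print(f"arithmetic mean: {am:.2f} kph")
print(f"harmonic mean:   {hm:.2f} kph")
print(f"distance / time: {actual_average:.2f} kph")
```

The harmonic mean matches distance over time only because both legs are the same length; with unequal legs a distance-weighted version would be needed.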
Trimmed Mean This is easy: it is any mean where the outliers have been stripped or trimmed away. Thus you would sort your data and drop the top and bottom 10% of your values. This is called a 10% trimmed mean. You can drop whatever proportion of the dataset you wish - within reason.
Example of Trimmed Mean [Table not recovered; surviving values: $286,000.00, $251,000.00, 7, 6]
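The sort-and-drop procedure for a trimmed mean might look like the sketch below. The function name, the rounding-down rule for how many values to drop, and the data are all assumptions made for illustration; the slide's own table is not recoverable.

```python
def trimmed_mean(values, proportion=0.10):
    """Arithmetic mean after dropping `proportion` of the values from each end."""
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be in [0, 0.5)")
    if not values:
        raise ValueError("trimmed mean of an empty dataset is undefined")
    ordered = sorted(values)
    # Number of values to drop from EACH end (rounded down; an assumed convention).
    k = int(len(ordered) * proportion)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

# Hypothetical house prices in thousands of dollars, with one extreme value.
prices = [150, 160, 170, 180, 190, 200, 210, 220, 230, 900]

plain = sum(prices) / len(prices)
trimmed = trimmed_mean(prices, 0.10)

print("arithmetic mean: ", plain)
print("10% trimmed mean:", trimmed)
```

With ten values, a 10% trim drops one value from each end, so the extreme 900 no longer pulls the result upward. Statistical packages differ on how they round the number of trimmed values, so results for small datasets can vary slightly between tools.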
Moving average trend lines produce a smoother ‘actual value’ trend line that is based on consecutive recalculations of an arithmetic mean of set size. The weakness is that the line loses values, depending on what size you make the averaging group.
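A simple moving average, and the way it "loses values", can be seen in the following sketch. The series and the function name are hypothetical; the window size plays the role of the "averaging group" mentioned above.

```python
def moving_average(values, window):
    """Arithmetic mean of each run of `window` consecutive values."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and the number of values")
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]

# Hypothetical series of 7 observations.
series = [10, 12, 11, 15, 14, 18, 17]

smoothed = moving_average(series, 3)

print("original points:", len(series))
print("smoothed points:", len(smoothed))
print([round(v, 2) for v in smoothed])
```

A window of size w turns n observations into n − w + 1 averaged points, so a wider window gives a smoother line at the cost of more lost points.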