Learn how to compute mean, variance, and analyze histograms to understand data distribution. Gain insights on weighted mean and standard deviation.
Chapter 3 Data Characterization BUS304 – Data Characterization
Types of Data Measurements • Measurements of Center and Location • Measurements of Variation
Measurements for Population and Sample
• In general, we use the same set of measurements for both population and sample.
• Population parameters: numerical measurements for a population. Usually represented using Greek letters or capital English letters.
  • “N” for population size; “$\mu$” for population mean
• Sample statistics: numerical measurements for a sample. Usually represented using lowercase English letters.
  • “n” for sample size; “$\bar{x}$” for sample mean
Most commonly used -- Mean
• Characterizes the center of the data distribution
• The most commonly used data measure
• Sample mean (“sample average”): $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$
• Population mean (“population average”): $\mu = \frac{\sum_{i=1}^{N} x_i}{N}$
• Ways to compute the mean:
  • Use a calculator.
  • Use Excel (function: AVERAGE).
Sensitivity to outliers
Compute the mean for the following 2 groups of data:
• Household income in community a (unit = $10,000):
• Household income in community b (unit = $10,000):
If the mayor decides to provide more public facilities to poor communities, and the decision is based on whether the mean income in the community is below $50,000 per year, does such a decision make sense?
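A minimal sketch of why the question above matters. The income figures are hypothetical (the slide's own tables are not reproduced here); they only illustrate how a single very high income can push the mean above the $50,000 threshold even when most households are below it.

```python
def mean(values):
    """Arithmetic mean: the sum of the values divided by their count."""
    if not values:
        raise ValueError("mean of an empty data set is undefined")
    return sum(values) / len(values)

# Hypothetical household incomes in units of $10,000 (NOT the slide's data).
community_a = [4, 5, 5, 4, 5]    # incomes clustered near the threshold
community_b = [3, 3, 4, 4, 36]   # mostly low incomes plus one outlier

mean_a = mean(community_a)
mean_b = mean(community_b)

# Threshold from the slide: $50,000 per year, i.e. 5 in these units.
print("community a mean:", mean_a, "below threshold:", mean_a < 5)
print("community b mean:", mean_b, "below threshold:", mean_b < 5)
```

In this made-up data, four of the five households in community b earn less than $50,000, yet the mean alone would classify the community as not poor.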
Exercise: The manager of a small hotel in Foster City, CA, was asked by the corporate VP to analyze the Sunday night registration information for the past eight weeks. Data on three variables were collected:
• x1 = total number of rooms rented
• x2 = total dollar revenue from the room rentals
• x3 = number of customer complaints that came from guests each Sunday
• Tasks:
  • Create a histogram for the distribution of the number of customer complaints per day.
  • Calculate the average number of rooms rented, the average revenue, and the average number of complaints per day.
  • Calculate the average number of complaints per room rented.
  • Explain the difference between “the average complaints per day” and “the average complaints per room rented” from a managerial perspective.
Compute the mean from frequency table
Below is a frequency table showing the number of days the teams take to finish their projects.
• On average, how many days does a team take to finish one project?
• Create a histogram using the data in the table, and locate the mean on the graph.
• How would you describe the shape of the histogram? What is the relationship between the mean and the peak?
• Use relative frequency to find the mean.
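Estimating the mean from Histogram
Treat the histogram as a frequency table, and use the mid-value of each range to represent that range.
Mathematical expression, where $m_i$ is the mid-value of range $i$ and $f_i$ is its frequency:
$\bar{x} = \frac{\sum_i f_i m_i}{n}$ if sample, $\mu = \frac{\sum_i f_i m_i}{N}$ if population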
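The mid-value approach can be sketched as follows. The function name `grouped_mean` and the bins are hypothetical, chosen only for illustration; the result is an estimate, since every observation in a range is treated as if it sat at the midpoint.

```python
def grouped_mean(bins):
    """Estimate the mean from (lower, upper, frequency) bins via bin midpoints."""
    total = sum(freq for _, _, freq in bins)
    if total == 0:
        raise ValueError("the histogram contains no observations")
    # Each bin contributes its midpoint, weighted by how many values fall in it.
    weighted_sum = sum((lower + upper) / 2 * freq for lower, upper, freq in bins)
    return weighted_sum / total

# Hypothetical histogram: (lower bound, upper bound, frequency).
bins = [(0, 10, 2), (10, 20, 5), (20, 30, 3)]
estimate = grouped_mean(bins)
print(estimate)  # (5*2 + 15*5 + 25*3) / 10
```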
Weighted Mean
• The ordinary mean weights each piece of information equally.
  • E.g. students’ GPA and score calculation.
• Weights are subjective.
  • E.g. different instructors assign different weights to homework and exams.
• A frequency table can be considered an example of a weighted mean (higher frequency means higher weight).
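A short sketch of a weighted mean for the course-score case mentioned above. The scores and weights are invented for illustration; any instructor's actual weighting scheme would differ.

```python
def weighted_mean(values, weights):
    """Weighted mean: sum(w_i * x_i) / sum(w_i)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total_weight

# Hypothetical scores for homework, midterm, and final exam.
scores = [90, 80, 70]
weights = [20, 30, 50]  # percentages chosen by a hypothetical instructor

course_score = weighted_mean(scores, weights)
print(course_score)
```

With equal weights the function reduces to the ordinary mean, which is one way to see that the ordinary mean is the special case where every value counts the same.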
Exercise:
• Estimate the mean based on the following histogram.
• There are 30 full-time faculty in CoBA. Their average age was 43 in 2007. In 2008, one new faculty member aged 30 was hired and one faculty member retired at 65. What is the new mean age for CoBA faculty?
Variance
• A measure of data spread.
• Also called “the average of squared deviations from the mean”.
• The larger the variance, the fatter (wider) the histogram.
• Sample variance: $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$
• Population variance: $\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$
Note the difference!
Steps to compute the variance
• Identify whether the data are from a population or a sample (the formulae are different).
• Use a table with columns for the value, the deviation, and the squared deviation:
  • Find the mean.
  • Find the distance of each value from the mean (fill out the 2nd column), e.g. 5 − 3.833 = 1.167.
  • Find the squared distance (the 3rd column), e.g. (1.167)² = 1.36.
  • Add up the 3rd column.
  • Divide by
    • the population size; or
    • the sample size − 1.
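The steps above can be sketched in code. The data set is hypothetical, and the `sample` flag is an illustrative way to switch between the two divisors; it is not something the slides define.

```python
def variance(values, sample=True):
    """Variance following the slide's steps; divisor is n-1 (sample) or N (population)."""
    n = len(values)
    if n == 0 or (sample and n < 2):
        raise ValueError("not enough data to compute the variance")
    mean = sum(values) / n                          # step 1: find the mean
    deviations = [x - mean for x in values]         # step 2: distance from the mean
    squared = [d ** 2 for d in deviations]          # step 3: squared distance
    total = sum(squared)                            # step 4: add up the column
    return total / (n - 1 if sample else n)         # step 5: divide

# Hypothetical data set (mean is 5, squared deviations sum to 32).
data = [2, 4, 4, 4, 5, 5, 7, 9]
print("population variance:", variance(data, sample=False))
print("sample variance:", variance(data, sample=True))
```

The sample variance comes out slightly larger than the population variance for the same numbers, because the sum of squared deviations is divided by a smaller number.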
Comparing variance vs. histogram Find the variance for the following groups of sample data: Compare the mean and variance. Create the histogram to compare the distribution.
What does variance mean?
• Variance indicates variation:
  • The larger the variance, the more spread out the data.
  • Indicates unpredictability.
• E.g.
  • Weather data: when the weather changes dramatically, it is hard to predict tomorrow’s temperature. (If we look at temperature data, which has the larger variance, Chicago or San Diego?)
  • Stock: more risk on returns.
  • A person’s performance: consistency, emotional…
  • Other examples?
Use frequency table to compute the population variance: Compute the weighted average
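One way to read "compute the weighted average" here, offered as an assumption rather than the slide's exact method: take the frequency-weighted mean first, then the frequency-weighted average of the squared deviations. The table below is hypothetical.

```python
def population_variance_from_table(table):
    """Population variance from (value, frequency) pairs."""
    total = sum(freq for _, freq in table)
    if total == 0:
        raise ValueError("the frequency table contains no observations")
    # Frequency-weighted mean of the values.
    mu = sum(value * freq for value, freq in table) / total
    # Frequency-weighted average of the squared deviations from that mean.
    return sum(freq * (value - mu) ** 2 for value, freq in table) / total

# Hypothetical frequency table: (days to finish a project, number of teams).
table = [(1, 2), (2, 4), (3, 2)]
sigma_squared = population_variance_from_table(table)
print(sigma_squared)
```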
Standard Deviation
• Square root of variance.
• An indicator of data deviation that can be directly compared to the mean, because it is in the same units as the data.
• Sample variance: $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$; sample standard deviation: $s = \sqrt{s^2}$
• Population variance: $\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$; population standard deviation: $\sigma = \sqrt{\sigma^2}$
Exercise: compute the standard deviation from the histogram on slide no. 5 and locate it on the histogram.
Empirical Rule
• If the data is bell shaped (as is often the case), then
  • about 68% of all data will fall in the range of $\mu \pm 1\sigma$
  • about 95% of all data will fall in the range of $\mu \pm 2\sigma$
  • about 99.7% of all data will fall in the range of $\mu \pm 3\sigma$
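The three percentages can be checked against the normal distribution, which is the idealized bell shape. This sketch assumes exactly normal data; real data that is only roughly bell shaped will match the percentages only approximately. The function name `normal_coverage` is illustrative.

```python
import math

def normal_coverage(k):
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {normal_coverage(k):.4f}")
```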
Other Numerical Measures
• Median
• Mode
• Range
• Percentiles
• Quartiles, Interquartile range
Median -- the value which divides the data in half, with equal sizes above and below (the middle value)
• Steps:
  • Put your data in an ordered array (sort).
  • If n (or N) is odd, the median is the middle number
    • (i.e. the $\frac{n+1}{2}$th number)
  • If n (or N) is even, the median is the average of the two middle numbers
    • (i.e. the average of the $\frac{n}{2}$th and the $(\frac{n}{2}+1)$th numbers)
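A minimal sketch of the median steps, using 0-based list indexing to reach the 1-based positions named above. The input lists in the usage lines are made up for illustration.

```python
def median(values):
    """Median: middle value of the sorted data, or the average of the two middle values."""
    ordered = sorted(values)              # step 1: put the data in an ordered array
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:
        # Odd n: the (n+1)/2-th value, which is index n//2 when counting from 0.
        return ordered[mid]
    # Even n: average of the (n/2)-th and (n/2 + 1)-th values.
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([3, 1, 2]))           # odd number of values
print(median([4, 1, 3, 2]))        # even number of values
print(median([1, 2, 3, 4, 100]))   # an extreme value does not move the median
```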
Sensitivity to outliers
The median is not affected by extreme values.
[Figure: three dot plots on a 0 to 10 axis, labeled Median = 3, Median = 2.5, and Median = 3]
Exercise
Mode -- the value that occurs most often
• Steps:
  • Put your data in an ordered array (sort).
  • Find the data value(s) that repeat most frequently.
• The mode is not affected by extreme values either.
[Figure: example plots on number-line axes and a chart of cities (Boston, Austin, San Diego, Los Angeles), labeled No Mode!, Mode = 5, Mode = San Diego, and Mode = 5 and 9]
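A sketch of finding the mode, covering the cases shown on the slide: no mode, one mode, several modes, and non-numeric data. Treating "no value repeats" as "no mode" is an assumption based on the slide's "No Mode!" label; the sample data is hypothetical.

```python
from collections import Counter

def modes(values):
    """Return the most frequent value(s), or an empty list if no value repeats."""
    counts = Counter(values)
    if not counts:
        return []
    highest = max(counts.values())
    if highest == 1:
        return []  # assumption: every value occurs once, so there is no mode
    return sorted(value for value, count in counts.items() if count == highest)

print(modes([1, 2, 3, 4]))                           # no mode
print(modes([1, 5, 5, 5, 9]))                        # single mode
print(modes([5, 5, 9, 9, 12]))                       # two modes
print(modes(["Boston", "San Diego", "San Diego"]))   # works for categories too
```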
Find Mode and Median from Frequency Table
Below is a frequency table showing the number of days the teams take to finish their projects.
• Find the mean, median and mode.
• Create a histogram, and locate the mean, median and mode.
• Describe the shape of the histogram, and find the relationship between mean, median and mode.
Shape of a distribution
• Right-skewed (longer tail extends to right): Mode < Median < Mean
• Symmetric: Mean = Median = Mode
• Left-skewed (longer tail extends to left): Mean < Median < Mode
Note that the mean is affected by extreme values the most, so the mean typically leans towards the tail compared to the other two measures.
Measures of center location: Mean, Median, Mode
• The mean is generally used, unless extreme values (outliers) exist;
• the next most common is the median, since the median is not sensitive to extreme values;
• the mode is sometimes used when one value has a really large frequency.
Think: Are house prices normally right-skewed or left-skewed? What measurement do people normally use to describe the housing market?
Range
• Simplest measure of variation
• Describes how widely the data spread
• Formula: Range = Maximum Value – Minimum Value
• Example (dot plot on a 0 to 14 axis): Range = 14 - 1 = 13
Disadvantage of Range
• Ignores the way in which data are distributed
  [Figure: two dot plots on a 7 to 12 axis with different distributions, each with Range = 12 - 7 = 5]
• Sensitive to outliers
  • 1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,5: Range = 5 - 1 = 4
  • 1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,120: Range = 120 - 1 = 119
Range is affected the most by outliers.
Feb 8, 2006
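The outlier example above can be reproduced directly from the two data sets on the slide. The function name `data_range` is illustrative.

```python
def data_range(values):
    """Range = maximum value - minimum value."""
    if not values:
        raise ValueError("range of an empty data set is undefined")
    return max(values) - min(values)

# The two data sets from the slide: identical except for the last value.
without_outlier = [1] * 6 + [2] * 8 + [3] * 4 + [4, 5]
with_outlier = [1] * 6 + [2] * 8 + [3] * 4 + [4, 120]

print(data_range(without_outlier))  # 5 - 1
print(data_range(with_outlier))     # 120 - 1
```

Changing one value out of twenty moves the range from 4 to 119, while 19 of the 20 observations are unchanged.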
Other measures
Percentiles: a percentile measures the percentage of data below the value.
e.g. if the 60th percentile is 1240 (SAT score), that means 60% of students got a score less than 1240. Correspondingly, 40% of students got 1240 or higher.
How to find a percentile? The pth percentile in an ordered array of n values is the value in the ith position, where $i = \frac{p}{100}(n+1)$
Example
Find the 80th percentile from the annual income data (100 people, ordered from the 1st to the 100th).
Steps:
• Sort the data.
• Find the location for the 80th percentile: $i = \frac{80}{100}(100+1) = 80.8$
• Find the 80.8th person’s income.
Where is the 80.8th person? The 80.8th should be in between the 80th and the 81st, and closer to the 81st. Combine the 80th and 81st numbers:
• 80th: 62245
• 81st: 63485
• 80.8th: 62245*20% + 63485*80% = 63237 (80% because the decimal is .8)
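The position and interpolation steps in this example can be sketched as below. The two incomes come from the slide; the sample size of 100 is inferred from the slide's "1st ... 100th" labels, and the helper names are illustrative.

```python
import math

def percentile_position(p, n):
    """Position of the p-th percentile in an ordered array of n values: p/100 * (n + 1)."""
    return p * (n + 1) / 100

def interpolate(lower, upper, fraction):
    """Value lying `fraction` of the way from `lower` to `upper`."""
    return lower + fraction * (upper - lower)

position = percentile_position(80, 100)       # about 80.8
fraction = position - math.floor(position)    # about 0.8, the decimal part

income_80th = 62245   # from the slide
income_81st = 63485   # from the slide
value = interpolate(income_80th, income_81st, fraction)

print(round(position, 1), round(value))
```

Writing the interpolation as `lower + fraction * (upper - lower)` is algebraically the same as the slide's 20% / 80% weighting of the two neighbours.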
Exercise Find the 25th percentile Find the 50th percentile Find the 75th percentile Explain the meaning of 50th percentile? Have you learnt a similar measurement? How many people have income levels between the 25th and the 50th percentiles? How many people have income levels between 50th and the 75th percentile?
Quartiles
• The 25th, 50th, and 75th percentiles
• Called the first, second, and third quartiles, respectively.
• Written as Q1, Q2, Q3, respectively.
• The quartiles split the ranked data into 4 equal groups, each containing 25% of the data.
Example: Find the first quartile in the data sample: 22 12 14 16 17 16 13 20 18
Interquartile Range
Recall: Range? Disadvantage of range?
Interquartile Range = Q3 – Q1
Example: 12 13 14 16 16 17 18 20 22
• Q1 = 13.5
• Q3 = 19
• Interquartile range = Q3 – Q1 = 19 – 13.5 = 5.5
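The quartiles in this example can be reproduced with the position rule i = p/100 × (n + 1) and linear interpolation between neighbouring values. This is a sketch under that rule only; clamping positions that fall outside 1..n is an added assumption, and spreadsheet or library functions may use other conventions and give slightly different quartiles.

```python
def percentile(values, p):
    """p-th percentile using position p/100 * (n + 1) with linear interpolation."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("percentile of an empty data set is undefined")
    position = p * (n + 1) / 100
    position = min(max(position, 1), n)   # assumption: clamp to the valid range 1..n
    k = int(position)                     # whole part: 1-based position of the lower neighbour
    fraction = position - k               # decimal part: how far towards the next value
    if k >= n:
        return ordered[-1]
    return ordered[k - 1] + fraction * (ordered[k] - ordered[k - 1])

# Data sample from the slides.
data = [22, 12, 14, 16, 17, 16, 13, 20, 18]
q1 = percentile(data, 25)
q3 = percentile(data, 75)
iqr = q3 - q1
print(q1, q3, iqr)
```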
Summary
Understand and compute the following two sets of data measures:
• Measures of central tendency: Mean, Median, and Mode
• Measures of variation: Range, Variance, and Standard deviation
Other ways to describe data: Percentiles, Quartiles, Interquartile range