Stats Section 2.7 Notes
Exploratory Data Analysis (EDA) • Exploratory Data Analysis is the process of using statistical tools (such as graphs, measures of center, and measures of variation) to investigate data sets in order to understand their important characteristics.
Outlier • An outlier is a value that is located very far away from the vast majority of the other data values. This type of value is often referred to as an extreme value.
Consider the following data set representing the prices of some chosen diamonds. Notice that the last price seems to be far removed from the rest of the data values. With a little bit of work it can be found that the mean price is $6985.20 with a standard deviation of $3535.56. However, if the outlier is removed, the mean becomes $6155.86 with a standard deviation of $1533.28. Clearly, an outlier can have a dramatic impact on the mean and on the standard deviation.
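To see this sensitivity in a small computation, here is a minimal Python sketch. The prices below are hypothetical (the diamond data set itself is not reproduced in these notes), the helper name `summarize` is made up for illustration, and the sample standard deviation is assumed.

```python
from statistics import mean, stdev

# Hypothetical prices (NOT the diamond data from the notes).
# The last value is a deliberately extreme one.
prices = [4200, 5100, 5600, 6100, 6400, 7000, 7300, 18000]

def summarize(values):
    """Return (mean, sample standard deviation) of the values."""
    return mean(values), stdev(values)

with_outlier = summarize(prices)
without_outlier = summarize(prices[:-1])  # drop the extreme last value

print(f"with outlier:    mean={with_outlier[0]:.2f}, sd={with_outlier[1]:.2f}")
print(f"without outlier: mean={without_outlier[0]:.2f}, sd={without_outlier[1]:.2f}")
```

Both the mean and the standard deviation drop once the extreme value is removed, which mirrors the pattern described for the diamond prices.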
Boxplots • This type of graphical display is good for finding the center of the data, the spread of the data, the distribution of the data, and the presence of outliers. • For a set of data, the 5-number summary consists of the minimum value, the first quartile, the median, the third quartile, and the maximum value. • A boxplot (box-and-whisker diagram) uses all five of these characteristics of a given set of data.
Boxplot for Diamond Data • Min = 3670 • Q1 = 5170 • Med = 6333 • Q3 = 7176 • Max = 18596
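A 5-number summary can be computed along the following lines. This is a sketch under one assumption: quartiles are taken as the medians of the lower and upper halves of the sorted data, which is a common textbook convention but not the only one, so other tools may report slightly different quartiles. The function name `five_number_summary` and the sample values are hypothetical.

```python
from statistics import median

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves convention.

    Needs at least two values so that both halves are non-empty.
    """
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]          # values below the median position
    upper = data[half + n % 2:]  # values above it (skips the middle value when n is odd)
    return data[0], median(lower), median(data), median(upper), data[-1]

# Hypothetical data, not the diamond prices from the notes.
sample = [7, 1, 5, 3, 6, 2, 4]
summary = five_number_summary(sample)
print(summary)
```

The five returned values are exactly what a boxplot draws: the two whisker ends, the two box edges, and the line inside the box.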
1.5 x IQR Rule for Outliers • Instead of “guessing” that a data value is an outlier, use the following rule: • Call an observation a suspected outlier if it falls more than 1.5 x IQR above the third quartile or below the first quartile.
Going back to our data on diamonds: • IQR = Q3 – Q1 = 7176 – 5170 = 2006 • 1.5 x IQR = 1.5 x 2006 = 3009 • Lower cutoff: Q1 – 3009 = 5170 – 3009 = 2161 • Upper cutoff: Q3 + 3009 = 7176 + 3009 = 10185 • No data values fall below 2161, but 18596 is above 10185, so the rule flags it as an outlier. We no longer have to guess.
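The same arithmetic can be checked with a short Python sketch. The quartiles, minimum, and maximum come from the diamond boxplot summary in these notes; the function names `iqr_fences` and `is_suspected_outlier` are hypothetical helpers, not part of any library.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return the (lower, upper) cutoffs of the k x IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def is_suspected_outlier(x, q1, q3):
    """True if x falls more than 1.5 x IQR below Q1 or above Q3."""
    low, high = iqr_fences(q1, q3)
    return x < low or x > high

# Q1 and Q3 from the diamond data summary.
low, high = iqr_fences(5170, 7176)
print(low, high)                                 # 2161.0 10185.0
print(is_suspected_outlier(18596, 5170, 7176))   # True  (the maximum)
print(is_suspected_outlier(3670, 5170, 7176))    # False (the minimum)
```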