Displaying and Describing Categorical Data. Chapter 3. Contingency Table (a.k.a. Two-way Table). A table that shows the frequency distribution across two variables. The Super Bowl Indicator. Can the winner of the Super Bowl predict the stock market?
Contingency Table (a.k.a. Two-way Table) • A table that shows the frequency distribution across two variables
The Super Bowl Indicator • Can the winner of the Super Bowl predict the stock market? • If the winner of the Super Bowl is from the original National Football League, there will be a bull market (Dow Jones Index increases). • If the winner of the Super Bowl is from the original American Football League, there will be a bear market (Dow Jones Index decreases). • Right 80% of the time
Independent • Two variables are independent when the distribution of one variable is the “same” for all categories of the other variable • You must think carefully about which variable you are treating as the Who and which you are treating as the What
Independent? • Who: Market; What: Super Bowl Winner • Who: Super Bowl Winner; What: Market Type
Not Independent • Since the distribution of winning league for a bull market is different from the marginal distribution of winning league, winning league and market type are not independent. • There appears to be an association between the original league of the Super Bowl winner and market type • Since the distribution of market type when the winner of the Super Bowl is originally from the AFL is different from the marginal distribution of market type, market type and winning league are not independent. • There appears to be an association between market type and the original league of the Super Bowl winner
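The comparison of a conditional distribution with the marginal distribution can be sketched in Python. The counts below are hypothetical placeholders, not the Super Bowl data from these slides, and the names `table`, `distribution`, `marginal`, and `conditional` are illustrative only.

```python
# Hypothetical counts for illustration only; NOT the actual Super Bowl data.
# Rows: original league of the Super Bowl winner; columns: market type.
table = {
    "NFL": {"bull": 20, "bear": 4},
    "AFL": {"bull": 6, "bear": 10},
}

def distribution(counts):
    """Convert a dict of counts into a dict of proportions."""
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}

# Marginal distribution of market type: column totals over the grand total.
column_totals = {
    market: sum(row[market] for row in table.values())
    for market in ("bull", "bear")
}
marginal = distribution(column_totals)

# Conditional distribution of market type within each winning league.
conditional = {league: distribution(row) for league, row in table.items()}

print("marginal:", marginal)
for league, dist in conditional.items():
    print(league, dist)
```

If each conditional distribution is roughly the same as the marginal one, the variables look independent. With these made-up counts the distributions differ noticeably, which is the pattern described as "not independent".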
Three Rules of Data Analysis • Make a picture • Make a picture • Make a picture
Barry Bonds’ HRs • Who: MLB seasons from 1986 to 2007 • What: Barry Bonds’ home runs (HRs) • When: From 1986 to 2007 • Where: Cities with MLB teams • Why: Mr. Gray likes baseball and needed an example • How: Data was gathered from baseball-reference.com
Quantitative Data • A quantitative variable is a measured variable (with units) that answers questions about the quantity of what is being measured (e.g. income ($), height (inches), weight (pounds)) • Quantitative Data Condition: the data are values of a quantitative variable whose units are known
Histogram • When to use: Number of variables: 1; Data type: quantitative data; Purpose: displaying data distribution • Gaps in the graph are gaps in the data
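As a rough sketch of how a histogram is built, the Python below sorts values into equal-width bins and prints one row of stars per bin. The season totals are assumed values for Barry Bonds from 1986 to 2007 (check them against baseball-reference.com), and the bin width of 10 is an arbitrary choice.

```python
from collections import Counter

# Barry Bonds' home runs per season, 1986-2007 (assumed values; verify
# against baseball-reference.com before relying on them).
hrs = [16, 25, 24, 19, 33, 25, 34, 46, 37, 33, 42,
       40, 37, 34, 49, 73, 46, 45, 45, 5, 26, 28]

BIN_WIDTH = 10  # assumed bin width; other widths give other pictures

# Map each value to the lower edge of its bin, then count values per bin.
bin_counts = Counter((x // BIN_WIDTH) * BIN_WIDTH for x in hrs)

# Print every bin from the lowest to the highest, including empty ones,
# so that gaps in the data show up as gaps in the picture.
for low in range(min(bin_counts), max(bin_counts) + BIN_WIDTH, BIN_WIDTH):
    print(f"{low:2d}-{low + BIN_WIDTH - 1:2d} | {'*' * bin_counts[low]}")
```

The empty 50-59 and 60-69 rows are a gap, and the lone 70-79 row sits apart from the rest of the data.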
What to look for • When you describe a distribution, always describe the • Shape • Center • Spread
Shape • Does the histogram have a single, central peak or several separated peaks? • Is the histogram symmetric? • Do any unusual features appear?
1. Peaks • The peaks in a histogram are called modes. • Uniform – no peaks • Unimodal – one peak • Bimodal – two peaks • Multimodal – three or more peaks
2. Symmetry Symmetric Skewed
3. Unusual • Outlier – an unusually small or large data value • Gap – space between data values
Center: “One number to rule them all” • When the distribution is skewed or has outliers, use the median • Median -- the middle number when the set is ordered • If there is an even number of data values, the median is the average of the two middle values • Has the same units as the data! • When the distribution is unimodal and symmetric, use the mean
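The median rule for an even number of values can be sketched as follows. The season totals are assumed values for Barry Bonds' home runs from 1986 to 2007 (check them against baseball-reference.com); the variable names are illustrative.

```python
import statistics

# Barry Bonds' home runs per season, 1986-2007 (assumed values; verify
# against baseball-reference.com before relying on them).
hrs = [16, 25, 24, 19, 33, 25, 34, 46, 37, 33, 42,
       40, 37, 34, 49, 73, 46, 45, 45, 5, 26, 28]

ordered = sorted(hrs)
n = len(ordered)

if n % 2 == 0:
    # Even count: average the two middle values.
    median = (ordered[n // 2 - 1] + ordered[n // 2]) / 2
else:
    # Odd count: take the single middle value.
    median = ordered[n // 2]

mean = sum(hrs) / n

# Cross-check the hand-rolled median against the standard library.
assert median == statistics.median(hrs)

print(f"median = {median} HRs, mean = {mean:.1f} HRs")
```

There are 22 seasons here, so the median is the average of the 11th and 12th ordered values.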
Quartiles • 50% of the data lies below the median, 50% of the data lies above the median • Quartile 1 (Q1) -- the number with 25% of the data below and 75% of the data above • “the median of the lower half of the data” • Quartile 3 (Q3) -- the number with 75% of the data below and 25% of the data above • “the median of the upper half of the data”
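The "median of each half" description can be sketched in Python. The helper `quartiles` is a hypothetical name, the handling of an odd number of values (leaving the middle value out of both halves) is one common convention assumed here, and the season totals are assumed values for Barry Bonds from 1986 to 2007.

```python
import statistics

def quartiles(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves.

    With an odd number of values, the middle value is left out of both
    halves (an assumed convention). Needs at least two values.
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]    # the smallest `half` values
    upper = ordered[-half:]   # the largest `half` values
    return statistics.median(lower), statistics.median(upper)

# Barry Bonds' home runs per season, 1986-2007 (assumed values; verify
# against baseball-reference.com before relying on them).
hrs = [16, 25, 24, 19, 33, 25, 34, 46, 37, 33, 42,
       40, 37, 34, 49, 73, 46, 45, 45, 5, 26, 28]

q1, q3 = quartiles(hrs)
print(f"Q1 = {q1}, Q3 = {q3}")
```

Statistical software often uses other interpolation rules for quartiles, so its results can differ slightly from this method.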
Spread • When the distribution is skewed or has outliers, use the IQR Interquartile Range (IQR) • The difference between quartile 3 and quartile 1 • IQR = Q3 – Q1 Has the same units as the data! • When the distribution is unimodal and symmetric, use the standard deviation
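Both measures of spread can be computed in a few lines. In this sketch, Q1 = 25 and Q3 = 45 are the quartiles reported for the Barry Bonds data in these slides, while the season totals used for the standard deviation are assumed values (check them against baseball-reference.com).

```python
import statistics

# Quartiles as reported in the slides' five-number summary.
q1, q3 = 25, 45
iqr = q3 - q1  # IQR = Q3 - Q1, in the same units as the data (HRs)

# Barry Bonds' home runs per season, 1986-2007 (assumed values; verify
# against baseball-reference.com before relying on them).
hrs = [16, 25, 24, 19, 33, 25, 34, 46, 37, 33, 42,
       40, 37, 34, 49, 73, 46, 45, 45, 5, 26, 28]

# Sample standard deviation (divides by n - 1), also in HRs.
sd = statistics.stdev(hrs)

print(f"IQR = {iqr} HRs, standard deviation = {sd:.1f} HRs")
```

The IQR depends only on the middle half of the data, so extreme seasons do not move it; the standard deviation uses every value, including the extremes.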
Five Number Summary • Min: 5 • Q1: 25 • Median: 34 • Q3: 45 • Max: 73
In Context • IQR -- Barry Bonds hit between 25 and 45 HRs in the middle 50% of MLB seasons from 1986 to 2007 • Median • Barry Bonds hit fewer than 34 home runs in 50% of MLB seasons from 1986 to 2007 • Barry Bonds hit more than 34 home runs in 50% of MLB seasons from 1986 to 2007