Re-cap • Last day, looked at a variety of plots • For categorical variables, most useful plots were bar charts and pie charts • Looked at time plots for quantitative variables • Key thing is to be able to quickly make a point using graphical techniques
Re-cap • Recall: • A distribution of a variable tells us what values it takes on and how often it takes these values.
Histograms • Similar to a bar chart, would like to display main features of an empirical distribution (or data set) • Histogram • Essentially a bar chart of values of data • Usually grouped to reduce “jitteriness” of picture • Groups are sometimes called “bins”
Histogram • Uses rectangles to show number (or percentage) of values in intervals • Y-axis usually displays counts or percentages • X-axis usually shows intervals • Rectangles are all the same width
Example (discrete data) • In a study of productivity, a large number of authors were classified according to the number of articles they published during a particular period of time.
Example (continuous data) • Experiment was conducted to investigate the muzzle velocity of an anti-personnel weapon (King, 1992) • Sample of size 16 was taken and the muzzle velocity (MPH) recorded
Constructing a Histogram – continuous data • Find minimum and maximum values of the data • Divide range of data into non-overlapping intervals of equal length • Count number of observations in each interval • Height of rectangle is number (or percentage) of observations falling in the interval • How many categories?
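The construction steps above can be sketched in a few lines of Python. This is a minimal illustration, not the course's software: the function name `histogram_counts` is made up for this sketch, the values are the small six-point data set used later in the quartile example (not the muzzle-velocity data), and the choice of left-closed intervals is an assumption.

```python
def histogram_counts(data, n_bins):
    """Count observations in n_bins equal-width, non-overlapping intervals."""
    lo, hi = min(data), max(data)          # step 1: minimum and maximum
    if lo == hi:
        raise ValueError("all values are equal; cannot form intervals")
    width = (hi - lo) / n_bins             # step 2: equal-length intervals
    counts = [0] * n_bins
    for x in data:                         # step 3: count per interval
        # Assumed convention: intervals are [a, b); the maximum itself
        # is placed in the last interval.
        k = min(int((x - lo) / width), n_bins - 1)
        counts[k] += 1
    edges = [lo + i * width for i in range(n_bins + 1)]
    return edges, counts


# Small illustrative data set (from the quartile example, n = 6).
values = [4, 7, 12, 18, 19, 23]
edges, counts = histogram_counts(values, 3)

# Step 4: rectangle height is the count; drawn here as a text histogram.
for i, c in enumerate(counts):
    print(f"{edges[i]:6.2f} to {edges[i + 1]:6.2f} | {'*' * c}")
```

Changing `n_bins` is a quick way to see the "how many categories?" trade-off: more bins give a more jagged picture, fewer bins hide detail.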
Example • Experiment was conducted to investigate the muzzle velocity of an anti-personnel weapon (King, 1992) • Sample of size 16 was taken and the muzzle velocity (MPH) recorded
What are the minimum and maximum values? • How do we divide up the range of data? • What happens if we have too many intervals? • Too few intervals? • Suppose we have intervals from 240-250 and 250-260. In which interval is the data point 250 included?
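The boundary question has no single right answer; what matters is picking one convention and applying it consistently. The sketch below assumes left-closed intervals, so 240-250 means 240 ≤ x < 250 and the point 250 falls in 250-260. The helper name `interval_index` is hypothetical, and other software may use the opposite convention.

```python
from bisect import bisect_right


def interval_index(x, edges):
    """Index of the left-closed interval [edges[i], edges[i+1]) containing x."""
    i = bisect_right(edges, x) - 1
    if i < 0 or i >= len(edges) - 1:
        raise ValueError(f"{x} lies outside the intervals")
    return i


edges = [240, 250, 260]
labels = ["240-250", "250-260"]

# Under the assumed convention the shared boundary belongs to the upper interval.
print(250, "->", labels[interval_index(250, edges)])
print(249.9, "->", labels[interval_index(249.9, edges)])
```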
Interpreting histograms • Gives an idea of: • Location of centre of the distribution • How spread out the data are • Shape of the distribution • Symmetric • Skewed left • Skewed right • Unimodal • Bimodal • Multimodal • Outliers • Striking deviations from the overall pattern
Example – mid-term 1 grades (2011) • Was out of 34 + a bonus question (n=344)
Example – mid-term 1 grades (2011) • Too many bins?
Example – mid-term 1 grades (2011) • Too few bins
Example – mid-term 1 grades (2011) • Potential outlier?
Numerical Summaries (Chapter 12) • Graphic procedures visually describe data • Numerical summaries can quickly capture main features
Measures of Center • Have sample of size n from some population, $x_1, x_2, \ldots, x_n$ • An important feature of a sample is its central value • Most common measures of center: Mean & Median
Sample Mean • The sample mean is the average of a set of measurements • The sample mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
Sample Median • Have a set of n measurements, $x_1, x_2, \ldots, x_n$ • Median (M) is the point in the data that divides the data in half • Viewed as the mid-point of the data • To compute the median: • Sort the data from smallest to largest • For sample size “n”, compute position = (n+1)/2 • If position is a whole number, then M is the value at this position of the sorted data • If position falls between two whole numbers, then M is the value halfway between the values at those two positions in the sorted data
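The position rule above translates directly into code. This is a minimal sketch (the function name `sample_median` is chosen for illustration); it is tried on the two small data sets that appear in the examples on these slides.

```python
def sample_median(data):
    """Median via the position rule: position = (n + 1) / 2 in the sorted data."""
    values = sorted(data)
    n = len(values)
    if n == 0:
        raise ValueError("need at least one observation")
    position = (n + 1) / 2                 # 1-based position
    if position == int(position):
        # Whole number: M is the value at that position.
        return values[int(position) - 1]
    # Otherwise: M is halfway between the two neighbouring values.
    lower = values[int(position) - 1]
    upper = values[int(position)]
    return (lower + upper) / 2


print(sample_median([7, 19, 4, 23, 18]))       # n odd: position 3 of 4, 7, 18, 19, 23
print(sample_median([4, 7, 12, 18, 19, 23]))   # n even: position 3.5, between 12 and 18
```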
Example • Finding the Median, M, when n is odd • Example: Data = 7, 19, 4, 23, 18
Muzzle Velocity Example • Data (n=16)
Muzzle Velocity Example • Mean:
Muzzle Velocity Example • Median:
Sample Mean vs. Sample Median • Sometimes sample median is better measure of center • Sample median is less sensitive to unusually large or small values, called outliers • For symmetric distributions the sample mean and median are roughly equal • For skewed distributions the mean is pulled toward the long tail: mean > median when skewed right, mean < median when skewed left
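A quick way to see the difference in sensitivity is to change one value and recompute both measures. In the sketch below the first data set is the five-point example from the median slide; the second replaces 23 with 230, a made-up value used purely to mimic an unusually large observation.

```python
from statistics import mean, median

data = [4, 7, 18, 19, 23]
with_large_value = [4, 7, 18, 19, 230]   # hypothetical: 23 replaced by 230

for label, values in [("original", data), ("one large value", with_large_value)]:
    print(f"{label:16s} mean = {mean(values):6.1f}   median = {median(values)}")
```

The mean moves a long way while the median does not move at all, which is why the median is often preferred when the data contain extreme values.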
Other Measures of interest • Maximum • Minimum
Percentiles • A percentile of a distribution is a value that cuts off the stated part of the distribution at or below that value, with the rest at or above that value. • 5th percentile: 5% of distribution is at or below this value and 95% is at or above this value. • 25th percentile: 25% at or below, 75% at or above • 50th percentile: 50% at or below, 50% at or above • 75th percentile: 75% at or below, 25% at or above • 90th percentile: 90% at or below, 10% at or above • 99th percentile: ___% at or below, ___% at or above
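For a sample, the "at or below / at or above" wording can be checked directly by counting. The sketch below is one possible reading of the definition, assuming "at least p% at or below and at least (100 − p)% at or above" because exact percentages are rarely attainable with a small sample. The function name `is_percentile` is hypothetical.

```python
def is_percentile(value, data, p):
    """True if at least p% of data is <= value and at least (100 - p)% is >= value."""
    n = len(data)
    at_or_below = sum(1 for x in data if x <= value)
    at_or_above = sum(1 for x in data if x >= value)
    # Compare counts using integer arithmetic to avoid rounding issues.
    return at_or_below * 100 >= p * n and at_or_above * 100 >= (100 - p) * n


sample = [4, 7, 18, 19, 23]
print(is_percentile(18, sample, 50))   # 18 is the median of this sample
print(is_percentile(7, sample, 50))    # 7 is too small to be the 50th percentile
print(is_percentile(7, sample, 25))
```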
Percentiles • Can be applied to a population or to a sample • Usually don’t know population • Use sample percentiles to estimate pop. percentiles • Standardized tests often measured in percentiles • Birth statistics often measured in percentiles • First daughter • 10th percentile weight • 25th percentile length • 95th percentile head circumference
Important Percentiles • First Quartile • Second Quartile • Third Quartile
Computing the quartiles • You know how to compute the median • Q1 = • Q3 =
Example • Finding the other quartiles • For Q1, find the median of all values below M. • For Q3, find the median of all values above M. • Example: 4, 7, 18, 19, 23, M=18 • Q1: • Q3: • Example: 4, 7, 12, 18, 19, 23, M=15 • Q1: • Q3:
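The "median of each half" rule can be sketched as follows. One assumption is made explicit here: "below M" and "above M" are taken by position in the sorted data, so when n is odd the middle observation is left out of both halves. Statistical software often uses other quartile rules and may give slightly different answers. The function name `quartiles` is chosen for illustration.

```python
from statistics import median


def quartiles(data):
    """Return (Q1, M, Q3) using the median-of-halves rule (needs n >= 2)."""
    values = sorted(data)
    n = len(values)
    half = n // 2
    lower = values[:half]                                  # positions below M
    upper = values[half + 1:] if n % 2 else values[half:]  # positions above M
    return median(lower), median(values), median(upper)


for sample in ([4, 7, 18, 19, 23], [4, 7, 12, 18, 19, 23]):
    q1, m, q3 = quartiles(sample)
    print(sample, "-> Q1 =", q1, " M =", m, " Q3 =", q3)
```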
5-number summary often reported: • Min, Q1, Q2 (Median), Q3, and Max • Summarizes both center and spread • What proportion of data lie between Q1 and Q3?
Box-Plot • Displays 5-number summary graphically • Box drawn spanning quartiles • Line drawn in box for median • Lines (whiskers) extend from box to max. and min. values • Some programs draw whiskers only to 1.5*IQR above and below the quartiles, where IQR = Q3 - Q1
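The numbers a box-plot needs can be computed without drawing anything. The sketch below assumes the median-of-halves quartile rule and the common convention that, under the 1.5*IQR rule (IQR = Q3 − Q1), each whisker stops at the most extreme observation still inside the limit. Both function names are hypothetical, and the value 60 in the second data set is made up to show a point falling outside the whiskers.

```python
from statistics import median


def five_number_summary(data):
    """Return (min, Q1, median, Q3, max); quartiles by the median-of-halves rule."""
    values = sorted(data)
    n = len(values)
    half = n // 2
    q1 = median(values[:half])
    q3 = median(values[half + 1:] if n % 2 else values[half:])
    return values[0], q1, median(values), q3, values[-1]


def whisker_ends(data):
    """Whisker ends under the 1.5*IQR rule (assumed convention, see lead-in)."""
    _, q1, _, q3, _ = five_number_summary(data)
    iqr = q3 - q1
    low_limit, high_limit = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [x for x in data if low_limit <= x <= high_limit]
    return min(inside), max(inside)


sample = [4, 7, 12, 18, 19, 23]
with_extreme = sample + [60]     # 60 is a made-up extreme value

print(five_number_summary(sample), whisker_ends(sample))
print(five_number_summary(with_extreme), whisker_ends(with_extreme))
```

In the second data set the maximum is 60 but the upper whisker still ends at 23, so a program using this rule would plot 60 as a separate point.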
Can compare distributions using side-by-side box-plots • What can you see from the plot?
Example - Moisture Uptake • There is a need to understand degradation of 3013 containers during long-term storage • Moisture uptake is considered a key factor in degradation due to corrosion • Calcination removes moisture • Calcination temperature requirements were written with very pure materials in mind, but the situation has evolved to include less pure materials, e.g. high in salts (Cl salts of particular concern) • Calcination temperature may need to be reduced to accommodate salts • An experiment is to be conducted to see how the calcination temperature impacts the mean moisture uptake
Working Example - Moisture Uptake • Experiment Procedure: • Two calcination temperatures; wish to compare the mean uptake for each temperature • Have 10 measurements per temperature treatment • The temperature treatments are randomly assigned to canisters • Response: Rate of change in moisture uptake in a 48-hour period (maximum time to complete packaging)
Other Common Measure of Spread: Sample Variance • Sample variance of n observations: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$, where $\bar{x}$ is the sample mean • Units are in squared units of data
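Sample Standard Deviation • Sample standard deviation of n observations: $s = \sqrt{s^2} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2}$, where $\bar{x}$ is the sample mean • Has same units as data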
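The two definitions can be computed step by step as below. This is a minimal sketch using the standard n − 1 divisor; the function names are chosen for illustration, and the data are the five values from the median example rather than the muzzle-velocity sample.

```python
import math


def sample_variance(data):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(data)
    if n < 2:
        raise ValueError("need at least two observations")
    xbar = sum(data) / n
    return sum((x - xbar) ** 2 for x in data) / (n - 1)


def sample_sd(data):
    """Square root of the sample variance; same units as the data."""
    return math.sqrt(sample_variance(data))


data = [7, 19, 4, 23, 18]
print("variance:", round(sample_variance(data), 3))   # squared units
print("std dev: ", round(sample_sd(data), 3))         # original units
```

For these five values the mean is 14.2, the squared deviations sum to 270.8, and dividing by n − 1 = 4 gives a variance of 67.7.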
Exercise • Compute the sample standard deviation and variance for the Muzzle Velocity Example
Comments • Variance and standard deviation are most useful when measure of center is the mean • As observations become more spread out, does s increase or decrease? • Both measures are sensitive to outliers • 5-number summary is better than the mean and standard deviation for describing (i) skewed distributions; (ii) distributions with outliers
Comments • Standard deviation is zero when all observations have the same value • Measures spread relative to the mean
More interpretation of s • Empirical Rule • If a distribution is bell-shaped and roughly symmetric, then • About 2/3 of data will lie within ±1s of the mean • About 95% of data will lie within ±2s of the mean • Usually all data will lie within ±3s of the mean • So you can reconstruct a rough picture of the histogram from just two numbers
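The empirical rule can be checked by counting how much of a bell-shaped data set falls within 1, 2, and 3 standard deviations of its mean. The data below are simulated from a normal distribution with a fixed seed, purely for illustration; they are not the mid-term grades. The helper name `fraction_within` is hypothetical.

```python
import random
from statistics import mean, stdev


def fraction_within(data, k):
    """Fraction of observations within k sample standard deviations of the mean."""
    xbar, s = mean(data), stdev(data)
    return sum(1 for x in data if abs(x - xbar) <= k * s) / len(data)


random.seed(0)                                            # reproducible simulation
simulated = [random.gauss(0, 1) for _ in range(10_000)]   # bell-shaped by construction

for k in (1, 2, 3):
    print(f"within ±{k}s: {fraction_within(simulated, k):.3f}")
```

For skewed data or data with outliers these fractions can be quite different, which is why the rule is stated only for bell-shaped, roughly symmetric distributions.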
Example • Mid-term:
Example • Mid-term: • Empirical rule tells us