Chapter 1: Looking at Data - Distributions
Introduction • Goal: Using Data to Gain Knowledge • Terms/Definitions: • Individuals: Units described by or used to obtain data, such as humans, animals, objects (aka experimental or sampling units) • Variables: Characteristics corresponding to individuals that can take on different values among individuals • Categorical Variable: Levels correspond to one of several groups or categories • Quantitative Variable: Takes on numeric values such that arithmetic operations make sense
Introduction • Spreadsheets for Statistical Analyses • Rows: Represent Individuals • Columns: Represent Variables • SPSS, Minitab, EXCEL are examples • Measuring Variables • Instrument: Tool used to make quantitative measurements on subjects (e.g. psychological test or physical fitness measurement) • Independent and Dependent Variables • Independent Variable: Describes the group an individual comes from (categorical) or its level (quantitative) prior to observation • Dependent Variable: Random outcome of interest
Independent and Dependent Variables • Dependent variables are also called response variables • Independent variables are also called explanatory variables • Marketing: Does amount of exposure affect attitudes? • I.V.: Exposure (in time or number), different subjects receive different levels • D.V.: Measurement of liking of a product or brand • Medicine: Does a new drug reduce heart disease? • I.V.: Treatment (Active Drug vs Placebo) • D.V.: Presence/Absence of heart disease in a time period • Psychology/Finance: Risk Perceptions • I.V.: Framing of Choice (Loss vs Gain) • D.V.: Choice Taken (Risky vs Certain)
Rates and Proportions • Categorical Variables: Typically we count the number with some characteristic in a group of individuals. • The actual count is not a useful summary. More useful summaries include: • Proportion: The number with the characteristic divided by the group size (will lie between 0 and 1) • Percent: # with characteristic per 100 individuals (proportion*100) • Rate per 100,000: proportion*100,000
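The three summaries above are simple rescalings of one another. A minimal Python sketch, assuming a made-up count (30 of 1,200 individuals) that does not come from the slides, and a hypothetical helper name `summarize_count`:

```python
def summarize_count(count, group_size):
    """Convert a raw count into a proportion, a percent, and a rate per 100,000."""
    if group_size <= 0:
        raise ValueError("group_size must be positive")
    proportion = count / group_size
    return {
        "proportion": proportion,                 # lies between 0 and 1
        "percent": proportion * 100,              # per 100 individuals
        "rate_per_100k": proportion * 100_000,    # per 100,000 individuals
    }

# Hypothetical figures, not from the slides: 30 of 1,200 individuals
# have the characteristic of interest.
summary = summarize_count(30, 1200)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

The rate per 100,000 is mostly useful for rare characteristics, where the proportion and percent would be awkwardly small numbers.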
Graphical Displays of Distributions • Graphs of Categorical Variables • Bar Graph: Horizontal axis defines the various categories, heights of bars represent numbers of individuals • Pie Chart: Breaks down a circle (pie) such that the sizes of the slices represent the numbers or percentages of individuals in the categories.
Graphical Displays of Distributions • Graphs of Numeric Variables • Stemplot: Crude, but quick method of displaying the entire set of data and observing shape of distribution • Stem: All but rightmost digit, Leaf: Rightmost Digit • Put stems in vertical column (small at top), draw vertical line • Put leaves in appropriate row in increasing order from stem • Histogram: Breaks data into equally spaced ranges on horizontal axis, heights of bars represent frequencies or percentages
Example: Time (Hours/Year) Lost to Traffic • Stems: 10s of hours • Leaves: Hours • Step 1: Stems: 1 2 3 4 5 • Step 2: Stems and Leaves:
1 | 48
2 | 01244699
3 | 0112244457778
4 | 122222245566
5 | 0336
Step 3: [final stemplot figure not reproduced] • Source: Texas Transportation Institute (5/7/2001).
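The stemplot construction can be sketched in Python. The `hours` list below is read back from the stems and leaves in the example above (39 values); the function name `stemplot` and the " | " separator are choices made for this illustration only.

```python
from collections import defaultdict

def stemplot(values):
    """Return stemplot rows: stem = all but the last digit, leaf = last digit."""
    leaves = defaultdict(list)
    for v in sorted(values):              # sorting puts leaves in increasing order
        leaves[v // 10].append(v % 10)
    rows = []
    # Include every stem from smallest to largest, even one with no leaves.
    for stem in range(min(leaves), max(leaves) + 1):
        row_leaves = "".join(str(d) for d in leaves[stem])
        rows.append(f"{stem} | {row_leaves}")
    return rows

# Values reconstructed from the stems and leaves shown in the example.
hours = [14, 18,
         20, 21, 22, 24, 24, 26, 29, 29,
         30, 31, 31, 32, 32, 34, 34, 34, 35, 37, 37, 37, 38,
         41, 42, 42, 42, 42, 42, 42, 44, 45, 45, 46, 46,
         50, 53, 53, 56]

rows = stemplot(hours)
print("\n".join(rows))
```

Keeping empty stems in the display preserves the shape of the distribution, which is the main reason to draw a stemplot at all.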
Example: Time (Hours/Year) Lost to Traffic - EXCEL Output • Note: in the histogram, each bin label is the upper bound of its range, inclusive (e.g. T ≤ 14, 14 < T ≤ 21, …, 42 < T ≤ 49, T > 49)
Comparing 2 Groups - Back-to-back Stemplots • Places Stems in Middle, group 1 to left, group 2 to right • Example: Maze Learning: • Groups (I.V.): Adults vs Children • Measured Response (D.V.): Average number of Errors in series of Trials
Example - Maze Learning (Average Errors) Stems: Integer parts Leaves: Decimal Parts
Examining Distributions • Overall Pattern and Deviations • Shape: symmetric, stretched to one direction, multiple humps • Center: Typical values • Spread: Wide or narrow • Outlier: Individual whose value is far from others (see bottom right corner of previous slide) • May be due to data entry error, instrument malfunction, or the individual being unusual relative to the others
Numeric Descriptions of Distributions • Measures of Central Tendency • Arithmetic Mean: Total equally divided among individual cases • Median: Midpoint of the distribution (M) • Measures of Spread (Dispersion) • Quartiles (first/third): Points that break out the smallest and largest 25% of distribution (Q1, Q3) • 5 Number Summary: (Minimum, Q1, M, Q3, Maximum) • Interquartile Range: IQR = Q3 - Q1 • Boxplot: Graphical summary of 5 Number Summary • Variance: “Average” squared deviation from mean ($s^2$) • Standard Deviation: Square root of variance ($s$)
Measures of Central Tendency • Arithmetic Mean: Obtain the total by summing all values and divide by sample size (“equal allotment” among individuals) • Median: Midpoint of Distribution • Sort values from smallest to largest • If n odd, take the (n+1)/2 ordered value • If n even, take average of n/2 and (n/2)+1 ordered values
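2005 Oscar Nominees (Best Picture) • Movie: Domestic Gross/Worldwide Gross • The Aviator: $103M / $214M • Finding Neverland: $52M / $116M • Million Dollar Baby: $100M / $216M • Ray: $75M / $97M • Sideways: $72M / $108M • Mean & Median Domestic Gross among nominees ($M): • Mean = (103 + 52 + 100 + 75 + 72)/5 = 402/5 = 80.4 • Median = 75 (ordered values: 52, 72, 75, 100, 103; n = 5 is odd, so take the 3rd ordered value)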
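The mean and median rules can be checked against the domestic gross figures listed above. This is a minimal sketch; the function names are illustrative, and Python's standard `statistics` module offers equivalent `mean` and `median` functions.

```python
def mean(values):
    """Arithmetic mean: total divided equally among the cases."""
    return sum(values) / len(values)

def median(values):
    """Midpoint of the sorted values, using the odd/even rule from the slides."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        # n odd: the (n+1)/2-th ordered value (subtract 1 for 0-based indexing)
        return ordered[(n + 1) // 2 - 1]
    # n even: average of the (n/2)-th and (n/2 + 1)-th ordered values
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2

# Domestic gross ($M) for the five nominees listed above.
domestic_gross = [103, 52, 100, 75, 72]

print("mean:", mean(domestic_gross))
print("median:", median(domestic_gross))
```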
Delta Flight Times - ATL/MCO Oct,2004 • N=372 Flights 10/1/2004-10/31/2004 • Total actual time: 30536 Minutes • Mean Time: 30536/372 = 82.1 Minutes • Median: 372/2=186, (372/2)+1=187 • 186th and 187th ordered times are 81 minutes: M=81
Measures of Spread • Quartiles: First (Q1, aka Lower) and Third (Q3, aka Upper) • Q1 is the median of the values below the median position • Q3 is the median of the values above the median position • Notes (see examples on next page): • If n is odd, median position is (n+1)/2, and finding quartiles does not include this value. • If n is even, median position is treated (most commonly) as (n+1)/2 and the two values (positions) used to compute median are used for quartiles.
Oscar Nominations: • # of Individuals: n=5 • Median Position: (5+1)/2=3 • Positions Below Median Position: 1-2 • Positions Above Median Position: 4-5 • Median of Lower Positions: 1.5 • Median of Upper Positions: 4.5 • ATL/MCO Flights: • # of Individuals: n=372 • Median Position: (372+1)/2=186.5 • Positions Below Median Position: 1-186 • Positions Above Median Position: 187-372 • Median of Lower Positions: 93.5 • Median of Upper Positions: 279.5
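The quartile rule described on these two slides (median of the lower half and median of the upper half, leaving out the middle value when n is odd) can be sketched as follows. Statistical packages use several different quartile conventions, so library results may differ slightly; the function names here are illustrative.

```python
def median(values):
    """Median of a list using the odd/even rule."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        return ordered[n // 2]
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2

def quartiles(values):
    """Return (Q1, Q3) as medians of the lower and upper halves."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # When n is odd, skip the middle value itself; when n is even, split evenly.
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return median(lower), median(upper)

# Oscar domestic gross ($M): positions 1-2 form the lower half, 4-5 the upper half.
q1, q3 = quartiles([103, 52, 100, 75, 72])
print("Q1:", q1, "Q3:", q3)

# Using the positions 1..372 as data reproduces the quartile positions
# quoted for the ATL/MCO flights (93.5 and 279.5).
pos_q1, pos_q3 = quartiles(list(range(1, 373)))
print("quartile positions:", pos_q1, pos_q3)
```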
Outliers - 1.5×IQR Rule • Outlier: Value that falls a long way from other values in the distribution • 1.5×IQR Rule: An observation may be considered an outlier if it falls more than 1.5 times the interquartile range above the third (upper) quartile or more than the same distance below the first (lower) quartile. • ATL/MCO Data: Q1=76, Q3=86, IQR=10, 1.5×IQR=15 • “High” Outliers: Above 86+15=101 minutes • “Low” Outliers: Below 76-15=61 minutes • 12 flights are at 102 minutes or more (highest is 122). See (modified) boxplot below
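A short sketch of the 1.5×IQR rule, using the quartiles quoted for the ATL/MCO data (Q1=76, Q3=86). The `sample_times` list is a small illustrative set chosen to sit around the fences; it is not the actual flight data, and the function names are hypothetical.

```python
def outlier_fences(q1, q3, k=1.5):
    """Return the (low, high) cut-offs of the k x IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def flag_outliers(values, q1, q3):
    """Return the values falling strictly outside the fences."""
    low, high = outlier_fences(q1, q3)
    return [v for v in values if v < low or v > high]

low_fence, high_fence = outlier_fences(76, 86)
print("fences:", low_fence, high_fence)

# Illustrative flight times in minutes (NOT the real data set).
sample_times = [58, 61, 76, 81, 86, 101, 102, 122]
flagged = flag_outliers(sample_times, 76, 86)
print("flagged:", flagged)
```

Values exactly on a fence (61 or 101 here) are not flagged, which matches the "above 101" and "below 61" wording of the rule.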
Measures of Spread - Variance and S.D. • Deviation: Difference between an observed value and the overall mean (sign is important): $x_i - \bar{x}$ • Variance: “Average” squared deviation (divides the sum of squared deviations by n-1, as opposed to n, for reasons we see later): $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$ • Standard Deviation: Positive square root of $s^2$: $s = \sqrt{s^2}$
Example - 2005 Oscar Movie Revenues • Mean: $\bar{x} = 80.4$ • The Aviator: i=1, $x_1 = 103$, Deviation: 103-80.4 = 22.6 • Finding Neverland: i=2, $x_2 = 52$, Deviation: 52-80.4 = -28.4 • Million Dollar Baby: i=3, $x_3 = 100$, Deviation: 100-80.4 = 19.6 • Ray: i=4, $x_4 = 75$, Deviation: 75-80.4 = -5.4 • Sideways: i=5, $x_5 = 72$, Deviation: 72-80.4 = -8.4
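The deviations above can be carried through to the variance and standard deviation. This is a sketch of the n-1 ("sample") form; the function name is illustrative, and Python's `statistics.variance` and `statistics.stdev` compute the same quantities.

```python
import math

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    x_bar = sum(values) / n
    squared_deviations = [(x - x_bar) ** 2 for x in values]
    return sum(squared_deviations) / (n - 1)

# Domestic gross ($M) for the five nominees.
domestic_gross = [103, 52, 100, 75, 72]

s2 = sample_variance(domestic_gross)   # variance, in ($M)^2
s = math.sqrt(s2)                      # standard deviation, in $M

print(f"variance: {s2:.1f}")
print(f"standard deviation: {s:.2f}")
```

With these five values the squared deviations sum to about 1801.2, giving a variance near 450.3 and a standard deviation of roughly 21.2 ($M).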
Computer Output of Summary Measures and Boxplot (SPSS) - ATL/MCO Data
Linear Transformations • Often work with transformed data • Linear Transformation: $x_{new} = a + bx$ for constants a and b (e.g. transforming from metric system to U.S., Celsius to Fahrenheit, etc.) • Effects: • Multiplying by b multiplies the mean by b and the standard deviation by |b| (the two coincide when b > 0) • Adding a shifts the mean and all percentiles by a but does not affect the standard deviation or spread • Note that for locations, multiplication by b precedes addition of a (e.g. new mean = a + b × old mean)
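The effects listed above can be checked numerically. The sketch below uses the Celsius-to-Fahrenheit conversion (a = 32, b = 1.8) on a hypothetical set of temperatures that is not from the slides.

```python
from statistics import mean, stdev

def linear_transform(values, a, b):
    """Apply x_new = a + b * x to every value."""
    return [a + b * x for x in values]

# Hypothetical temperatures in degrees Celsius (illustration only).
celsius = [10.0, 15.0, 20.0, 25.0, 30.0]
a, b = 32.0, 1.8
fahrenheit = linear_transform(celsius, a, b)

print("mean:", mean(celsius), "->", mean(fahrenheit))
print("sd:  ", stdev(celsius), "->", stdev(fahrenheit))

# The mean follows the full transformation; the SD only feels the scaling.
mean_matches = abs(mean(fahrenheit) - (a + b * mean(celsius))) < 1e-9
sd_matches = abs(stdev(fahrenheit) - abs(b) * stdev(celsius)) < 1e-9
print("mean rule holds:", mean_matches, "| sd rule holds:", sd_matches)
```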
Density Curves/Normal Distributions • Apply to continuous (or practically continuous) variables, which can take values along a continuous range • Obtain a histogram of data (will be irregular with rigid blocks as bars over ranges) • Density curves are smooth approximations (models) to the coarse histogram • Curve lies on or above the horizontal axis • Total area under curve is 1 • Area under the curve over a range of values represents the probability of that range • Normal Distributions - Family of density curves with very specific properties
Mean and Median of a Density Curve • Mean is the balance point of a distribution of measurements. If the height of the curve represented weight, it is where the density curve would balance • Median is the point where half the area is below and half the area is above the point • Symmetric Densities: Mean = Median • Right Skew Densities: Mean > Median • Left Skew Densities: Mean < Median • We will mainly work with means. Notation: μ (mean of the density curve)
Normal Distribution • Bell-shaped, symmetric family of distributions • Classified by 2 parameters: Mean (μ) and standard deviation (σ). These represent location and spread • Random variables that are approximately normal have the following properties with respect to individual measurements: • Approximately half (50%) fall above (and below) mean • Approximately 68% fall within 1 standard deviation of mean • Approximately 95% fall within 2 standard deviations of mean • Virtually all fall within 3 standard deviations of mean • Notation when X is normally distributed with mean μ and standard deviation σ: $X \sim N(\mu, \sigma)$
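The 68% / 95% / "virtually all" figures can be reproduced from the normal cumulative distribution function. A minimal sketch using `statistics.NormalDist` from the Python standard library (Python 3.8+); the helper name `within_k_sd` is illustrative.

```python
from statistics import NormalDist

def within_k_sd(k, mu=0.0, sigma=1.0):
    """Probability that a N(mu, sigma) value lies within k SDs of the mean."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)

coverage = {k: within_k_sd(k) for k in (1, 2, 3)}
for k, prob in coverage.items():
    print(f"within {k} SD: {prob:.4f}")
```

The result does not depend on the particular μ and σ, which is why the rule applies to every member of the normal family.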
Example - Heights of U.S. Adults • Female and male adult heights (in inches) are well approximated by normal distributions: $X_F \sim N(63.7, 2.5)$, $X_M \sim N(69.1, 2.6)$ • Source: Statistical Abstract of the U.S. (1992)
Standard Normal (Z) Distribution • Problem: Unlimited number of possible normal distributions (-∞ < μ < ∞, σ > 0) • Solution: Standardize the random variable to have mean 0 and standard deviation 1: $Z = \frac{X - \mu}{\sigma}$, so that $Z \sim N(0, 1)$ • Probabilities of certain ranges of values and specific percentiles of interest can be obtained through the standard normal (Z) distribution
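Standard Normal (Z) Distribution • [Figure: standard normal curve with a cut-off z marked; "Table Area" is the area to the left of z and "1 - Table Area" is the area to the right]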
[Standard normal (Z) table, two pages, not reproduced: rows give the integer part and 1st decimal of z, columns give the 2nd decimal place]
Finding Probabilities of Specific Ranges • Step 1 - Identify the normal distribution of interest (e.g. its mean (μ) and standard deviation (σ)) • Step 2 - Identify the range of values that you wish to determine the probability of observing ($X_L$, $X_U$), where often the upper or lower bound is ∞ or -∞ • Step 3 - Transform $X_L$ and $X_U$ into Z-values: $Z_L = \frac{X_L - \mu}{\sigma}$, $Z_U = \frac{X_U - \mu}{\sigma}$ • Step 4 - Obtain $P(Z_L \le Z \le Z_U)$ from Z-table
Example - Adult Female Heights • What is the probability a randomly selected female is 5’10” or taller (70 inches)? • Step 1 - $X \sim N(63.7, 2.5)$ • Step 2 - $X_L = 70.0$, $X_U = \infty$ • Step 3 - $Z_L = \frac{70.0 - 63.7}{2.5} = 2.52$, $Z_U = \infty$ • Step 4 - $P(X \ge 70) = P(Z \ge 2.52) = 1 - P(Z \le 2.52) = 1 - .9941 = .0059$ (≈ 1/170)
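The four steps can be reproduced without a printed table. In this sketch, `statistics.NormalDist().cdf` (Python 3.8+ standard library) plays the role of the Z-table lookup; the variable names are illustrative.

```python
from statistics import NormalDist

# Step 1: distribution of adult female heights (inches), as given above.
mu, sigma = 63.7, 2.5

# Step 2: range of interest is 70 inches and above (upper bound is infinity).
x_lower = 70.0

# Step 3: transform the lower bound into a Z-value.
z_lower = (x_lower - mu) / sigma

# Step 4: P(Z >= z) = 1 - P(Z <= z); cdf() replaces the table lookup.
p_taller = 1.0 - NormalDist().cdf(z_lower)

print(f"z = {z_lower:.2f}")
print(f"P(X >= 70) = {p_taller:.4f}")
```

The computed probability agrees with the table-based answer to about four decimal places; small differences come from the table's rounding.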
Finding Percentiles of a Distribution • Step 1 - Identify the normal distribution of interest (e.g. its mean (μ) and standard deviation (σ)) • Step 2 - Determine the percentile of interest 100p% (e.g. the 90th percentile is the cut-off where 90% of scores are below and 10% are above). • Step 3 - Find p in the body of the z-table and its corresponding z-value ($z_p$) on the outer edge: • If 100p < 50 then use left-hand page of table • If 100p ≥ 50 then use right-hand page of table • Step 4 - Transform $z_p$ back to original units: $x_p = \mu + z_p \sigma$
Example - Adult Male Heights • Above what height do the tallest 5% of males lie? • Step 1 - $X \sim N(69.1, 2.6)$ • Step 2 - Want to determine 95th percentile (p = .95) • Step 3 - $P(Z \le 1.645) = .95$ • Step 4 - $x_{.95} = 69.1 + (1.645)(2.6) = 73.4$ inches (6’1.4”)
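The percentile steps can be sketched the same way, with `NormalDist().inv_cdf` (Python 3.8+ standard library) standing in for the reverse table lookup. Variable names are illustrative.

```python
from statistics import NormalDist

# Step 1: distribution of adult male heights (inches), as given above.
mu, sigma = 69.1, 2.6

# Step 2: the tallest 5% lie above the 95th percentile.
p = 0.95

# Step 3: z-value with area p to its left (inverse of the table lookup).
z_p = NormalDist().inv_cdf(p)

# Step 4: transform back to original units.
x_p = mu + z_p * sigma

print(f"z_p = {z_p:.3f}")
print(f"95th percentile = {x_p:.1f} inches")
```

Calling `NormalDist(mu, sigma).inv_cdf(p)` directly combines steps 3 and 4 and should give the same height.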
Statistical Models • When making statistical inference it is useful to write random variables in terms of model parameters and random errors: $X = \mu + \varepsilon$ • Here μ is a fixed constant and ε is a random variable • In practice μ will be unknown, and we will use sample data to estimate or make statements regarding its value
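To illustrate the idea of estimating an unknown constant from noisy data, the sketch below simulates measurements as a fixed value plus a random error and then uses the sample mean as the estimate. Several things here are assumptions made for illustration only: the errors are taken to be normal with mean 0, the "true" values (63.7 and 2.5) are borrowed from the female height example, and the sample size and seed are arbitrary.

```python
import random
from statistics import mean

def simulate_measurements(mu, sigma, n, seed=0):
    """Generate n values of the form mu + error, with error ~ N(0, sigma)."""
    rng = random.Random(seed)      # fixed seed so the run is repeatable
    return [mu + rng.gauss(0.0, sigma) for _ in range(n)]

# Assumed "true" values; in practice mu would be unknown.
true_mu, true_sigma = 63.7, 2.5

sample = simulate_measurements(true_mu, true_sigma, n=1000)
estimate = mean(sample)            # sample mean as the estimate of mu

print(f"estimate of mu from {len(sample)} observations: {estimate:.2f}")
```

With 1,000 observations the estimate should land close to the assumed constant, though it will not match it exactly because of the random errors.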