Explore how to conduct hypothesis tests on slopes in statistics, determine population generalizability, and assess slope probabilities. Understand assumptions, normality, variances, and regression hypothesis tests in bivariate analysis.
 
                
AP Statistics Inference – Chapter 14
Hypothesis Tests: Slopes • Given: Observed slope relating Education to Job Prestige = 2.47 • Question: Can we generalize this to the population of all Americans? • How likely is it that this observed slope was actually drawn from a population with slope = 0? • Solution: Conduct a hypothesis test • Notation: sample slope = b, population slope = β • H0: Population slope β = 0 • H1: Population slope β ≠ 0 (two-tailed test)
Review: Slope Hypothesis Tests • What information lets us do a hypothesis test? • Answer: Estimates of a slope (b) have a sampling distribution, like any other statistic • It is the distribution of the values of b obtained from all possible samples (of size N) • If certain assumptions are met, the sampling distribution approximates the t-distribution • Thus, we can assess the probability that a given value of b would be observed if the population slope β = 0 • If that probability is low (below alpha), we reject H0
Review: Slope Hypothesis Tests • Visually: If the population slope (β) is zero, then the sampling distribution would center at zero • Since the sampling distribution is a probability distribution, we can identify the likely values of b if the population slope is zero • [Figure: sampling distribution of the slope b, centered at 0] • If β = 0, observed slopes should commonly fall near zero, too • If the observed slope falls very far from 0, it is improbable that β is really equal to zero. Thus, we can reject H0.
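The idea of a sampling distribution centered at zero can be illustrated with a small simulation. This is only a sketch under invented assumptions: the function names, the sample size of 50, the 0 to 20 range for X, and the mean of 40 and standard deviation of 10 for Y are all hypothetical and not taken from the Education and Job Prestige data. Y is generated independently of X, so the true population slope is 0.

```python
import random
import statistics

def ols_slope(xs, ys):
    """Least-squares slope b for a bivariate regression of ys on xs."""
    x_bar = statistics.fmean(xs)
    y_bar = statistics.fmean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return sxy / sxx

def simulate_null_slopes(n=50, draws=2000, seed=0):
    """Draw many samples from a population whose true slope is 0."""
    rng = random.Random(seed)
    slopes = []
    for _ in range(draws):
        xs = [rng.uniform(0, 20) for _ in range(n)]   # hypothetical X values
        ys = [rng.gauss(40, 10) for _ in range(n)]    # Y unrelated to X
        slopes.append(ols_slope(xs, ys))
    return slopes

slopes = simulate_null_slopes()
print("mean of simulated slopes:", round(statistics.fmean(slopes), 3))
print("std. dev. of simulated slopes:", round(statistics.stdev(slopes), 3))

# Share of simulated slopes at least as far from 0 as the observed 2.47
share_extreme = sum(abs(s) >= 2.47 for s in slopes) / len(slopes)
print("share as extreme as 2.47:", share_extreme)
```

In this made-up population the simulated slopes cluster near zero, so a slope as large as 2.47 would be very improbable under H0. With real data the spread of the sampling distribution depends on the actual error variance and the spread of X.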
Bivariate Regression Assumptions • Assumptions for bivariate regression hypothesis tests: • 1. Random sample • Ideally N > 20 • But different rules of thumb exist. (10, 30, etc.) • 2. Variables are linearly related • i.e., the mean of Y increases linearly with X • Check scatter plot for general linear trend • Watch out for non-linear relationships (e.g., U-shaped)
Bivariate Regression Assumptions • 3. Y is normally distributed at every value of X in the population • “Conditional normality” • Ex: Years of Education (X), Job Prestige (Y) • Suppose we look only at a sub-sample: X = 12 years of education • Is a histogram of Job Prestige approximately normal? • What about for people with X = 4? X = 16? • If all are roughly normal, the assumption is met
Bivariate Regression Assumptions • Normality: Examine sub-samples at different values of X. Make histograms and check for normality. • [Figure: two example histograms, one labeled “Good” and one labeled “Not very good”]
Bivariate Regression Assumptions • 4. The variances of prediction errors are identical at different values of X • Recall: Error is the deviation from the regression line • Is dispersion of error consistent across values of X? • Definition: “homoskedasticity” = error dispersion is consistent across values of X • Opposite: “heteroskedasticity” = error dispersion varies with X • Test: Compare errors for X=12 years of education with errors for X=2, X=8, etc. • Are the errors around the line similar? Or different?
Bivariate Regression Assumptions • Homoskedasticity: Equal Error Variance • Examine error at different values of X. Is it roughly equal? • [Figure: scatterplot with regression line] Here, things look pretty good.
Bivariate Regression Assumptions • Heteroskedasticity: Unequal Error Variance • At higher values of X, error variance increases a lot. • [Figure: scatterplot with regression line] This looks pretty bad.
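A rough numeric version of the visual check can be sketched by comparing the spread of the errors at low and high values of X. This is an informal illustration, not a formal test: the helper names, the split into two halves, and both simulated data sets are assumptions made up for this example.

```python
import random
import statistics

def fit_line(xs, ys):
    """Return (a, b): intercept and slope of the least-squares line."""
    x_bar = statistics.fmean(xs)
    y_bar = statistics.fmean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    return y_bar - b * x_bar, b

def error_variance_ratio(xs, ys):
    """Variance of errors in the upper half of X divided by the lower half."""
    a, b = fit_line(xs, ys)
    # Pair each X with its error (deviation from the line), sorted by X
    by_x = sorted((x, y - (a + b * x)) for x, y in zip(xs, ys))
    half = len(by_x) // 2
    low = [e for _, e in by_x[:half]]
    high = [e for _, e in by_x[half:]]
    return statistics.variance(high) / statistics.variance(low)

rng = random.Random(1)
xs = [rng.uniform(1, 20) for _ in range(400)]
# Hypothetical data: constant error spread vs. spread that grows with X
equal_spread = [10 + 2 * x + rng.gauss(0, 5) for x in xs]
growing_spread = [10 + 2 * x + rng.gauss(0, 0.5 * x + 0.5) for x in xs]

ratio_equal = error_variance_ratio(xs, equal_spread)
ratio_growing = error_variance_ratio(xs, growing_spread)
print("ratio with equal error variance:", round(ratio_equal, 2))
print("ratio with growing error variance:", round(ratio_growing, 2))
```

A ratio near 1 is consistent with homoskedasticity. A ratio well above 1 suggests the errors fan out at higher values of X.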
Bivariate Regression Assumptions • Notes/Comments: • 1. Overall, regression is robust to violations of assumptions • It often gives fairly reasonable results, even when assumptions aren’t perfectly met • 2. Variations of regression can handle situations where assumptions aren’t met • 3. But, there are also further diagnostics to help ensure that results are meaningful…
Regression Hypothesis Tests • If assumptions are met, the sampling distribution of the slope (b) approximates a t-distribution • Standard deviation of the sampling distribution is called the standard error of the slope ($\sigma_b$) • Population formula of standard error: $\sigma_b = \sqrt{\dfrac{\sigma_e^2}{\sum_{i=1}^{N}(X_i - \bar{X})^2}}$ • Where $\sigma_e^2$ is the variance of the regression error
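Regression Hypothesis Tests • Estimating $\sigma_e^2$ from the sample errors $e_i$ lets us estimate the standard error: $s_e^2 = \dfrac{\sum_{i=1}^{N} e_i^2}{N-2}$ • Now we can estimate the S.E. of the slope: $s_b = \sqrt{\dfrac{s_e^2}{\sum_{i=1}^{N}(X_i - \bar{X})^2}}$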
Regression Hypothesis Tests • Finally: A t-value can be calculated: $t_{N-2} = \dfrac{b}{s_b}$ • It is the slope divided by the standard error • Where $s_b$ is the sample point estimate of the standard error • The t-value is based on N-2 degrees of freedom
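The calculation can be traced step by step in a short sketch. It assumes the usual textbook estimates: error variance equal to the sum of squared errors divided by N-2, and standard error of the slope equal to the square root of that variance divided by the sum of squared deviations of X. The function name and the five data points are hypothetical.

```python
import math
import statistics

def slope_t_test(xs, ys):
    """Return (b, s_b, t, df) for a bivariate regression of ys on xs."""
    n = len(xs)
    x_bar = statistics.fmean(xs)
    y_bar = statistics.fmean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx                       # sample slope
    a = y_bar - b * x_bar               # constant (intercept)
    errors = [y - (a + b * x) for x, y in zip(xs, ys)]
    s_e2 = sum(e ** 2 for e in errors) / (n - 2)   # estimated error variance
    s_b = math.sqrt(s_e2 / sxx)         # estimated standard error of the slope
    t = b / s_b                         # t-value for H0: population slope = 0
    return b, s_b, t, n - 2

# Hypothetical toy data, chosen only to keep the arithmetic easy to follow
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
b, s_b, t, df = slope_t_test(xs, ys)
print(f"b = {b:.3f}, s_b = {s_b:.3f}, t = {t:.3f}, df = {df}")
```

The resulting t-value is then compared with the critical value of the t-distribution for N-2 degrees of freedom at the chosen alpha.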
Regression Confidence Intervals • You can also use the standard error of the slope to estimate confidence intervals: $\text{C.I.} = b \pm t_{N-2}\, s_b$ • Where $t_{N-2}$ is the t-value for a two-tailed test given a desired α-level • Example: Observed slope = 2.5, S.E. = .10 • 95% t-value for 102 d.f. is approximately 2 • 95% C.I. = 2.5 +/- 2(.10) • Confidence Interval: 2.3 to 2.7
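The arithmetic in this example can be checked with a few lines of code. The function name is hypothetical, and the critical value of 2 is the approximation used on the slide rather than an exact t-table lookup.

```python
def slope_confidence_interval(b, s_b, t_crit):
    """Confidence interval for the slope: b plus or minus t_crit * s_b."""
    margin = t_crit * s_b
    return b - margin, b + margin

# Values from the slide: slope 2.5, standard error .10, t-value of about 2
low, high = slope_confidence_interval(2.5, 0.10, 2.0)
print(f"95% C.I.: {low:.1f} to {high:.1f}")
```

Because the interval does not include 0, the same data would also lead to rejecting H0 in the two-tailed test at the matching alpha.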
Regression Hypothesis Tests • You can also use a t-test to determine if the constant (a) is significantly different from zero • But, this is typically less useful to do • Hypotheses (α = population parameter of a): • H0: α = 0, H1: α ≠ 0 • But, most research focuses on slopes
Regression: Outliers • Note: Even if regression assumptions are met, slope estimates can have problems • Example: Outliers -- cases with extreme values that differ greatly from the rest of your sample • Outliers can result from: • Errors in coding or data entry • Highly unusual cases • Or, sometimes they reflect important “real” variation • Even a few outliers can dramatically change estimates of the slope (b)
Regression: Outliers • Outlier Example: • [Figure: scatterplot with both axes running from -4 to 4, showing an extreme case that pulls the regression line up, and the regression line with the extreme case removed from the sample]
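The effect of a single extreme case can be seen by fitting the slope twice, once with it and once without. The data here are invented for illustration and are not the points plotted on the slide.

```python
import statistics

def ols_slope(xs, ys):
    """Least-squares slope b for a bivariate regression of ys on xs."""
    x_bar = statistics.fmean(xs)
    y_bar = statistics.fmean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return sxy / sxx

# Hypothetical sample: seven cases on a line with slope 0.5 ...
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [-1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5]
# ... plus one extreme case in the upper right corner
xs_with = xs + [4]
ys_with = ys + [4.0]

slope_without = ols_slope(xs, ys)
slope_with = ols_slope(xs_with, ys_with)
print("slope without extreme case:", round(slope_without, 3))
print("slope with extreme case:", round(slope_with, 3))
```

One case out of eight is enough to raise the estimated slope noticeably in this toy sample.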
Regression: Outliers • Strategy for dealing with outliers: • 1. Identify them • Look at scatterplots for extreme values • Or, have computer software compute outlier diagnostic statistics • There are several statistics to identify cases that are affecting the regression slope a lot • Examples: “Leverage”, Cook’s D, DFBETA • Computer software can even identify “problematic” cases for you… but it is preferable to do it yourself.
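To show what such diagnostics measure, here is a minimal sketch that computes leverage and Cook's D by hand for a bivariate regression. It assumes the standard textbook formulas, leverage h = 1/N + (x - mean of x)^2 / sum of squared deviations of X, and Cook's D = (e^2 / (2 * s_e^2)) * h / (1 - h)^2, where 2 is the number of estimated coefficients. The function name and data are hypothetical, and statistical packages may report these values in slightly different forms.

```python
import statistics

def influence_diagnostics(xs, ys):
    """Return a list of (leverage, cooks_d) pairs, one per case."""
    n = len(xs)
    x_bar = statistics.fmean(xs)
    y_bar = statistics.fmean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = y_bar - b * x_bar
    errors = [y - (a + b * x) for x, y in zip(xs, ys)]
    s_e2 = sum(e ** 2 for e in errors) / (n - 2)   # estimated error variance
    results = []
    for x, e in zip(xs, errors):
        h = 1 / n + (x - x_bar) ** 2 / sxx          # leverage of this case
        cooks_d = (e ** 2 / (2 * s_e2)) * h / (1 - h) ** 2
        results.append((h, cooks_d))
    return results

# Hypothetical sample: seven cases on a line, plus one extreme case at the end
xs = [-3, -2, -1, 0, 1, 2, 3, 4]
ys = [-1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 4.0]

diagnostics = influence_diagnostics(xs, ys)
for i, (h, d) in enumerate(diagnostics):
    print(f"case {i}: leverage = {h:.3f}, Cook's D = {d:.3f}")

most_influential = max(range(len(xs)), key=lambda i: diagnostics[i][1])
print("case with largest Cook's D:", most_influential)
```

Cases with a large Cook's D are the ones worth inspecting in the scatterplot before deciding whether to keep or drop them.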
Regression: Outliers • 2. Depending on the circumstances, either: • A) Drop cases from sample and re-do regression • Especially for coding errors, very extreme outliers • Or if there is a theoretical reason to drop cases • Example: In analysis of economic activity, communist countries differ a lot… • B) Or, sometimes it is reasonable to leave outliers in the analysis • e.g., if there are several that represent an important minority group in your data • When writing papers, state whether outliers were excluded (and the effect that had on the analysis).