This workshop, led by Bert Kritzer in Spring 2009, delves into the essentials of multiple regression analysis. Attendees will gain insights into the differences between simple and multiple regression, the implications of linear and nonlinear relationships, and various estimation methods including least squares estimation and maximum likelihood estimation. Key statistical concepts will be covered, such as R-squared, adjusted R-squared, and implications of multicollinearity. The workshop will also focus on practical challenges, including model specification and the importance of robust standard errors, using practical examples and Excel applications.
Statistics Workshop: Multiple Regression
Spring 2009
Bert Kritzer
Regression
• Simple regression vs. multiple regression
• Linear vs. nonlinear relationships
• Least squares estimation vs. maximum likelihood estimation
• b's vs. β's
• Standardized vs. unstandardized estimates
• Regression models and causation
Statistical Model: Bivariate Regression
• Y's are statistically independent
• Conditional distributions all have the same variance ("homoscedasticity")
• Linearity: "conditional expectations" fall on a straight line
Tort Reform by Citizen Liberalism
Ŷ = 12.89 − 0.10X
For every ten-point increase in citizen liberalism, one less tort reform was adopted.
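The least-squares line above comes from the usual closed-form formulas for the bivariate case. A minimal Python sketch on made-up data (the toy x values are illustrative, not the workshop's actual tort-reform data, chosen to lie exactly on the slide's line so the estimates recover its coefficients):

```python
# Bivariate least squares: b = Sxy / Sxx, a = ybar - b * xbar.
def ols_simple(x, y):
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b = sxy / sxx          # slope
    a = ybar - b * xbar    # intercept
    return a, b

# Toy data generated exactly on the line Y = 12.89 - 0.10*X.
x = [10, 20, 30, 40, 50]
y = [12.89 - 0.10 * xi for xi in x]
a, b = ols_simple(x, y)
print(round(a, 2), round(b, 2))  # 12.89 -0.1
```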
Tort Reform by Citizen Liberalism
Fit statistics: SSD_Y (total sum of squared deviations), SSD_e (residual sum of squares), and s_e (standard error of estimate), where s_e = √(SSD_e / (n − 2)).
Regression Models
1 predictor (bivariate): Y = α + βX + ε
2 predictors: Y = α + β₁X₁ + β₂X₂ + ε
3 predictors: Y = α + β₁X₁ + β₂X₂ + β₃X₃ + ε
Statistical Model: Multiple Regression
Random variables Y₁, Y₂, …, Yₙ are statistically independent with
conditional mean: E(Yᵢ) = α + β₁X₁ᵢ + … + βₖXₖᵢ
and conditional variance = σ²
Therefore: Yᵢ = α + β₁X₁ᵢ + … + βₖXₖᵢ + εᵢ, with εᵢ ~ N(0, σ²)
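In matrix terms the least-squares estimates solve the normal equations (XᵀX)b = Xᵀy. A pure-Python sketch for the two-predictor case (toy data and function names are illustrative; real work would use Stata, Excel, or a statistics library):

```python
# Least squares for Y = a + b1*X1 + b2*X2 via the normal equations (X'X)b = X'y,
# solved with Gaussian elimination (fine for tiny, well-conditioned problems).
def solve(A, b):
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for i in range(n):
        p = max(range(i, n), key=lambda r: abs(M[r][i]))  # partial pivoting
        M[i], M[p] = M[p], M[i]
        for r in range(i + 1, n):
            f = M[r][i] / M[i][i]
            for c in range(i, n + 1):
                M[r][c] -= f * M[i][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][c] * x[c] for c in range(i + 1, n))) / M[i][i]
    return x

def ols_two_predictors(x1, x2, y):
    rows = [[1.0, a, b] for a, b in zip(x1, x2)]      # design matrix with intercept
    XtX = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    Xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(3)]
    return solve(XtX, Xty)                            # [intercept, b1, b2]

x1 = [1, 2, 3, 4, 5, 6]
x2 = [2, 1, 4, 3, 6, 5]
y = [1 + 2 * a + 3 * b for a, b in zip(x1, x2)]       # exact plane, no noise
coef = ols_two_predictors(x1, x2, y)
print([round(v, 6) for v in coef])
```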
Multiple Regression: Tort Reform by Citizen and Elite Liberalism
Correlation Between Predictors: Elite Liberalism by Citizen Liberalism
Multiple Regression Coefficients as Random Variables I
ONE SAMPLE OF 100

      Source |       SS       df       MS              Number of obs =     100
-------------+------------------------------           F(  2,    97) =    2.73
       Model |  473.549401     2   236.7747            Prob > F      =  0.0701
    Residual |   8405.0406    97  86.6499031           R-squared     =  0.0533
-------------+------------------------------           Adj R-squared =  0.0338
       Total |     8878.59    99  89.6827273           Root MSE      =  9.3086

------------------------------------------------------------------------------
      police |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       stops |   .4482355   .9257543     0.48   0.629     -1.38913    2.285601
         age |    .131513    .056274     2.34   0.021     .0198247    .2432013
       _cons |   33.54175   2.883879    11.63   0.000     27.81805    39.26545
------------------------------------------------------------------------------
Multiple Regression Coefficients as Random Variables II 1,000 samples of 100 observations
Interpreting Results
• R² is the proportion of variation explained (8.8%).
• Adjusted R² corrects R² for the number of predictors.
• F provides a test of whether the results could occur by chance if all β's were 0.
• The coefficient for stops means that, holding age constant, one additional stop decreases support by about 2 points (1.922).
• The coefficient for age means that, holding the number of stops constant, support goes up about a tenth of a point (0.11) for each additional year; equivalently, a ten-year increase in age increases support by about one point.
• The t for each coefficient tests whether that individual coefficient differs from 0 (both are statistically significant).
• Each coefficient has both a point and an interval estimate.
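The fit statistics in these bullets come straight from the sums of squares. A short sketch, using the SS values from the Stata table above (n = 100 observations, k = 2 predictors):

```python
# R-squared and adjusted R-squared from sums of squares:
#   R2    = 1 - SSE/SST
#   adjR2 = 1 - (1 - R2) * (n - 1) / (n - k - 1)
def r_squared(sse, sst):
    return 1 - sse / sst

def adj_r_squared(r2, n, k):
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Residual and total SS from the Stata output above.
sse, sst = 8405.0406, 8878.59
r2 = r_squared(sse, sst)
ar2 = adj_r_squared(r2, 100, 2)
print(round(r2, 4), round(ar2, 4))  # 0.0533 0.0338 -- matches the Stata table
```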
Model Specification
• "Misspecification"
• Impact of omitting variables
  • correlated or uncorrelated with other predictors
• Impact of including irrelevant variables
• Specifying the "form" of the relationship
  • linear vs. nonlinear
Omission of Significant Variables
• Biases (in the statistical sense) the estimates for the other variables
  • can make other variables look more important
  • can make other variables look less important, or even not significant
• Cope example (p. 93)
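The direction of the bias follows a simple formula: leaving out x2 shifts the slope on x1 by β₂ times the slope from regressing the omitted x2 on x1. A small noise-free sketch (all numbers are made up for illustration):

```python
# Omitted-variable bias: true model is y = 2*x1 + 3*x2 (no noise),
# with x2 = 0.5*x1 + d, d chosen mean-zero and orthogonal to x1.
# Regressing y on x1 alone yields slope 2 + 3*gamma, where gamma is
# the slope from regressing the omitted x2 on x1.
def slope(x, y):
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    return sxy / sxx

x1 = [1, 2, 3, 4]
d = [0.1, -0.1, -0.1, 0.1]          # mean zero, orthogonal to x1
x2 = [0.5 * a + e for a, e in zip(x1, d)]
y = [2 * a + 3 * b for a, b in zip(x1, x2)]

gamma = slope(x1, x2)                # 0.5 by construction
b_short = slope(x1, y)               # biased slope from the short regression
print(round(b_short, 6), 2 + 3 * gamma)  # both 3.5, not the true 2
```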
Residuals in Multiple Regression
• Normality: small vs. large samples
• Residual plots
  • against the predicted value and/or individual predictors
• Heteroscedasticity
• Nonlinearity
• Outliers
  • complex outliers and influential observations
• "Regression diagnostics"
• Large samples
• Time series data and "autocorrelation"
Robust Standard Errors
• Heteroscedasticity
• Systematic nonindependence
  • clusters
• Alternative solutions
  • "purging" heteroscedasticity by transforming the data
  • modeling the heteroscedasticity with ML
Standardized Regression
• "beta" vs. "b"
• Standardize all variables to have a mean of 0 and a standard deviation of 1
• Interpret results in terms of how many standard deviations the dependent variable changes for a 1 standard deviation change in a predictor
• Danger of comparing across groups when means and standard deviations vary across groups
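The relationship between "b" and "beta" is just a rescaling: beta = b · sd(x)/sd(y), which is the same as refitting after z-scoring both variables. A sketch on toy numbers (the data are illustrative):

```python
# Standardized coefficient ("beta") from an unstandardized slope ("b"):
#   beta = b * sd(x) / sd(y)
# Equivalently: fit the regression after z-scoring both variables.
from statistics import mean, pstdev

def slope(x, y):
    xbar, ybar = mean(x), mean(y)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    return sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx

def zscore(v):
    m, s = mean(v), pstdev(v)
    return [(vi - m) / s for vi in v]

x = [1, 2, 3, 4, 5]
y = [2.0, 3.5, 4.0, 6.5, 7.0]
b = slope(x, y)
beta = b * pstdev(x) / pstdev(y)     # rescaled slope
beta_z = slope(zscore(x), zscore(y)) # slope on z-scored data
print(abs(beta - beta_z) < 1e-9)     # True: the two routes agree
```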
Practical Problems in Multiple Regression
• Choosing from among a large number of predictors
  • stepwise regression
• Sample size constraints on the number of predictors
• Intercorrelations among predictors and the idea of "holding constant"
  • multicollinearity
• Outliers
• Errors in variables
• Nonlinearity
Errors in Variables
• Errors in Y (the dependent variable)
  • depress fit (R²) but do not bias the coefficients
  • the depressed fit can reduce the power of significance tests because standard errors are biased upward
• Errors in the X's (predictor variables)
  • depress (attenuate) the regression coefficients
  • depress the power of significance tests
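The attenuation from error in X has a clean form: the slope shrinks toward zero by the reliability ratio var(x)/(var(x) + var(error)). A constructed example (toy data; the error pattern is chosen mean-zero and orthogonal to x so the arithmetic is exact):

```python
# Attenuation sketch: measurement error in X shrinks the estimated slope
# toward zero by the reliability ratio var(x) / (var(x) + var(error)).
def slope(x, y):
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    return sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx

x_true = [1, 2, 3, 4]
y = [2 * xi for xi in x_true]            # true slope is 2, no noise in y
err = [1, -1, -1, 1]                     # mean-zero error, orthogonal to x_true
x_obs = [xi + ei for xi, ei in zip(x_true, err)]

b_true = slope(x_true, y)                # 2.0
b_obs = slope(x_obs, y)                  # attenuated: 2 * 5/(5+4) = 10/9
print(b_true, round(b_obs, 4))
```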
Logarithms: Definition
x is called the logarithm of N to the base b if bˣ = N, where N and b are both positive numbers and b ≠ 1.
That is: log_b N = x if and only if bˣ = N.
Standard Logarithms: Base 10 and Base e
"Common" logarithm (base 10): log₁₀ 1000 = 3
"Natural" logarithm (base e): logₑ 1000 = ln 1000 = 6.908
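The same values can be checked with Python's math module:

```python
# Common log, natural log, and log to an arbitrary base.
import math

common = math.log10(1000)             # base-10 log
natural = math.log(1000)              # base-e log (ln)
any_base = math.log(1000, 10)         # log_b N for arbitrary base b
print(common, round(natural, 3))      # 3.0 6.908
```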
Log Likelihood for Binomial
ln L(p) ∝ y ln(p) + (n − y) ln(1 − p)   (where "∝" means "is proportional to")
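Maximum likelihood picks the p that makes this expression largest, which for the binomial is the sample proportion y/n. A numerical check (n and y are made-up illustrative counts):

```python
# Binomial log-likelihood (up to a constant): l(p) = y*ln(p) + (n-y)*ln(1-p).
# It is maximized at the sample proportion p_hat = y/n.
import math

def loglik(p, y, n):
    return y * math.log(p) + (n - y) * math.log(1 - p)

n, y = 20, 7
p_hat = y / n                         # 0.35
# p_hat beats every other candidate value of p:
for p in (0.1, 0.2, 0.5, 0.9):
    assert loglik(p_hat, y, n) > loglik(p, y, n)
print(p_hat)  # 0.35
```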
Causal Inference
• Spurious relationships
• Time ordering
• Elimination of alternatives
• Mutual causation
  • identification
Regression in Wage Discrimination CasesBazemore v. Friday, 478 U.S. 385 (1986)
University of Wisconsin 1997 Gender Equity Pay Study
College of Letters & Science

Regression Model: MODEL1
Dependent Variable: LNSAL  ln(Salary)

Analysis of Variance
                     Sum of        Mean
Source       DF     Squares      Square    F Value   Prob>F
Model        54    45.42831     0.84127     43.187   0.0001
Error       781    15.21362     0.01948
C Total     835    60.64193

Root MSE    0.13957    R-square   0.7491
Dep Mean   11.03245    Adj R-sq   0.7318
C.V.        1.26508

Parameter Estimates
                  Parameter     Standard   T for H0:
Variable    DF     Estimate        Error   Parameter=0   Prob > |T|   Label
INTERCEP     1   11.163387   0.07575212       147.367       0.0001   Intercept
GENDER       1   -0.021302   0.01263912        -1.685       0.0923   Male
WHITE        1   -0.010214   0.01651535        -0.618       0.5364   White/Unknown
PROF         1    0.175458   0.01853981         9.464       0.0001   Full Professor
ASST         1   -0.193622   0.02286049        -8.470       0.0001   Assistant Prof
ANYDOC       1    0.017376   0.03510405         0.495       0.6208   Any Terminal Degree
COH2         1   -0.085045   0.02458236        -3.460       0.0006   Hired 1980-88
COH3         1   -0.153097   0.03408703        -4.491       0.0001   Hired 1989-93
COH4         1   -0.168758   0.04543305        -3.714       0.0002   Hired 1994-98
DIFYRS       1    0.003513   0.00156769         2.241       0.0253   YRS SINCE DEG BEFORE UW
INASTYRS     1   -0.018596   0.00380222        -4.891       0.0001   YRS AS INSTR/ASST PROF
ASSOYRS      1   -0.020570   0.00244673        -8.407       0.0001   YRS AS UW ASSOC
FULLYRS      1    0.003528   0.00146692         2.405       0.0164   YRS AS UW FULL PROF
LNRATIO      1    0.481871   0.21528902         2.238       0.0255   ln(mkt ratio)

PLUS 41 DEPARTMENT "FIXED EFFECTS"
Equity Study: Fixed Effects
DEPARTMENT FIXED EFFECTS
                  Parameter     Standard   T for H0:
Variable    DF     Estimate        Error   Parameter=0   Prob > |T|
AFRLANG      1   -0.037307   0.07287210        -0.512       0.6088
ANTHRO       1   -0.042490   0.05677832        -0.748       0.4545
AFRAMER      1    0.067777   0.06028682         1.124       0.2613
ARTHIST      1   -0.009346   0.06446204        -0.145       0.8848
ASTRON       1    0.025805   0.05767292         0.447       0.6547
BOTANY       1   -0.023055   0.06263077        -0.368       0.7129
COMMUN       1   -0.043242   0.06234593        -0.694       0.4882
CHEM         1    0.007705   0.04325153         0.178       0.8587
CLASSICS     1   -0.013697   0.07344295        -0.186       0.8521
COMMDIS      1    0.035164   0.05853836         0.601       0.5482
COMPLIT      1   -0.027078   0.07883924        -0.343       0.7313
COMPUT       1    0.198201   0.04934743         4.016       0.0001
EASIALG      1   -0.053194   0.06957342        -0.765       0.4448
ECON         1    0.169280   0.05319197         3.182       0.0015
ENGLISH      1   -0.053755   0.05584121        -0.963       0.3360
FRENITAL     1   -0.073378   0.05724591        -1.282       0.2003
GEOG         1   -0.014052   0.05781558        -0.243       0.8080
GEOLOGY      1    0.007804   0.05502894         0.142       0.8873
GERMAN       1   -0.079744   0.06744970        -1.182       0.2375
HEBREW       1    0.016752   0.09408135         0.178       0.8587
HISTORY      1   -0.031301   0.05059288        -0.619       0.5363
HISTSC       1    0.047905   0.07102221         0.675       0.5002
JOURNAL      1   -0.045840   0.05939580        -0.772       0.4405
LIBRYSC      1   -0.079658   0.06446705        -1.236       0.2170
LINGUIS      1   -0.105136   0.07404040        -1.420       0.1560
MATH         1   -0.034484   0.04433476        -0.778       0.4369
METEOR       1   -0.020649   0.05059822        -0.408       0.6833
MUSIC        1   -0.084759   0.06710503        -1.263       0.2069
PHILOS       1   -0.060066   0.05534808        -1.085       0.2782
PHYSICS      1    0.035945   0.04208888         0.854       0.3934
POLISC       1    0.001526   0.04407509         0.035       0.9724
PSYCH        1    0.043498   0.04718937         0.922       0.3569
SCAND        1   -0.068544   0.09877777        -0.694       0.4879
SLAVIC       1    0.081673   0.06944784         1.176       0.2399
SOCWORK      1    0.038894   0.05518913         0.705       0.4812
SOCIOL       1    0.034492   0.04455797         0.774       0.4391
SASIAN       1   -0.146444   0.07595848        -1.928       0.0542
SPANPORT     1   -0.102875   0.06176804        -1.666       0.0962
THEATRE      1   -0.076231   0.06933522        -1.099       0.2719
URBPLAN      1   -0.013524   0.05830072        -0.232       0.8166
ZOOL         1   -0.055001   0.05418789        -1.015       0.3104
Extending Regression
• Qualitative predictors
• Nonlinear relationships
• Time series data
• Panel models
• "Limited" dependent variables
  • nominal (including dichotomous) dependent variables
  • count variables
  • "selection" models
  • Tobit
  • switching
• Mutual causation models