
Lecture 18: Advanced model building





  1. Lecture 18: Advanced model building March 24, 2014

  2. Question So far the pace of the class has been • Substantially slower than I expected • Somewhat slower than I expected • About what I expected • Somewhat faster than I expected • Substantially faster than I expected

  3. Administrative • Problem set 7 due Wednesday • Exam 2 in one week • covers everything through chapter 25 + online readings • Visiting alumna – regression/data analysis in the workplace • Wednesday’s class – partial review • Send questions by Tuesday (tomorrow), 12 pm noon • My review sessions are NOT me reiterating what you need to know for the exam.

  4. Last time Partial, or incremental, F statistic and p-values

  5. Partial F-test • Two equivalent formulas to calculate the partial F statistic: • F = [(SSEreduced − SSEcomplete) / (k − j)] / [SSEcomplete / (n − k − 1)] • F = [(R²complete − R²reduced) / (k − j)] / [(1 − R²complete) / (n − k − 1)] • k = # of variables in the complete model • j = # of variables in the reduced model • To get the p-value in Excel: =FDIST(F, k-j, n-k-1) • This is different from the F.DIST function! FDIST is the “old” Excel function; F.DIST.RT() is the new Excel version. FDIST = 1 – F.DIST. A needlessly confusing change from Microsoft.
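As a sketch, the same test can be done in Python instead of Excel. This is not part of the slides; the function name and the example SSE values are made up for illustration, and it assumes you already have the SSE from the reduced and complete fits.

```python
# Partial F test in Python (sketch; inputs assumed from two nested fits).
from scipy.stats import f

def partial_f_test(sse_reduced, sse_complete, k, j, n):
    """Compare a complete model (k variables) to a nested reduced model
    (j variables) fit on n observations; return (F, p-value)."""
    F = ((sse_reduced - sse_complete) / (k - j)) / (sse_complete / (n - k - 1))
    p_value = f.sf(F, k - j, n - k - 1)  # right tail, like Excel's FDIST
    return F, p_value
```

`f.sf` is the survival function (1 minus the CDF), so it matches the old FDIST / new F.DIST.RT behavior directly.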

  6. Diagnostics • What did we do about outliers in simple regression? • Ran the analysis again without the observation and checked whether the various estimates changed. • The same can be done for multiple regression • sometimes it’s much harder to identify which observation might be problematic because we’re dealing with a multi-dimensional space. • Leverage: a statistic we can calculate for each observation • A measure of the influence of the observation on the model. • Ranges from 0 to 1 (low to high influence). • Observations with leverage values larger than 3k/n are potentially problematic • k = number of parameters (# explanatory variables + 1) of the model • n is the sample size • Why is it a function of k and n?
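The slide notes that leverage is painful to compute in Excel; as a minimal sketch (not part of the slides), here is the standard computation in NumPy. Leverage hᵢ is the i-th diagonal element of the hat matrix H = X(XᵀX)⁻¹Xᵀ, where X includes a column of ones for the intercept.

```python
# Leverage values via the hat matrix (sketch; function names are ours).
import numpy as np

def leverages(X):
    """Leverage of each observation. X: (n, k-1) matrix of explanatory
    variables; the intercept column is added here."""
    X1 = np.column_stack([np.ones(len(X)), X])
    H = X1 @ np.linalg.inv(X1.T @ X1) @ X1.T
    return np.diag(H)

def high_leverage(X):
    """Flag observations above the 3k/n rule of thumb from the slide."""
    n, k = len(X), X.shape[1] + 1   # k = # explanatory variables + 1
    return leverages(X) > 3 * k / n
```

A useful sanity check: the leverages always sum to k, which is one way to see why the threshold is a function of k and n (the "average" leverage is k/n).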

  7. Leverage • Unfortunately, calculating the leverage of each observation is problematic in Excel. You can do it, but it’s a royal pain • That doesn’t mean you don’t need to know them; I would give you the calculated leverage values on an exam, etc. • Leverage is just the potential for being problematic. Once you’ve identified potentially troubling observations, try re-estimating the model without that data point. • Even if your results change, it doesn’t mean you should drop the data point(s). It might be completely legitimate. But it helps you understand your data. • A data point with high leverage isn’t necessarily an outlier. • Outliers have “unusual” values (relative to the rest of the distribution).

  8. Question • In the cafe2.csv dataset (47 obs) there are leverage values provided for the following regression model: predicting Sales from dummy variables for weekday (with Friday as the excluded group), Muffins.Sold, Cookies.Sold, Fruit.Cup.Sold, Chips, and Total.Soda.and.Coffee. • Above what leverage value would an observation be considered to have high leverage? • 0.638 • 0.447 • 0.702 • 0.347

  9. Cook’s Distance • To look at potential outliers with high leverage, we can calculate Cook’s Distance for an observation i: • Di = [ei² / (k · MSE)] · [hi / (1 − hi)²] • Cook’s Di < 0.5: usually fine • 0.5 < Di < 1: might be problematic • Di > 1: probably problematic where: ei = residual for observation i, k = number of parameters (# explanatory variables + 1) of the model, MSE = Mean Square Error = SSE / (n − k − 1) = se², hi = leverage of observation i • So I could provide the leverage values for each observation and have you calculate D, or identify high-leverage observations, etc.
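Since an exam question might hand you residuals and leverages and ask for D, here is a small sketch (function names are ours, not from the slides) that applies the formula and the rules of thumb above:

```python
# Cook's distance from residuals, leverages, MSE, and k (sketch).
import numpy as np

def cooks_distance(e, h, mse, k):
    """D_i = (e_i^2 / (k * MSE)) * (h_i / (1 - h_i)^2)."""
    e, h = np.asarray(e), np.asarray(h)
    return (e**2 / (k * mse)) * (h / (1 - h)**2)

def flag(D):
    """Apply the rules of thumb: D > 1 probably problematic,
    0.5 < D <= 1 might be, otherwise usually fine."""
    return np.where(D > 1, "probably problematic",
           np.where(D > 0.5, "might be problematic", "usually fine"))
```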

  10. Model Validation • Assuming the data is OK, how do we tell if the model is good? • We’ve looked at R² and se, as we should. • It’s always a good idea to minimize se (if we’re interested in forecasting or predicting), or to maximize R² if we’re interested in understanding changes in Y. • But both are functions of the data • With different data, they might change. • And the data we have is just a sample. What makes this sample so much better than another one? Nothing… • Another common, and very good, approach is to perform some subsampling and model validation • Split the data: • Group 1 (the bigger group): use the data to fit a model • Group 2: use the regression from Group 1 to predict “out of sample” observations from Group 2. How well does it do? It will almost surely be worse (why?), but how much worse? • Calculate residuals, R² (squared correlation between fitted and observed values), and se (standard deviation of the residuals). • How do you choose the groups?

  11. Model Validation • Often called “cross-validation” or holdout analysis • If you play with the data too much while forming hypotheses, you run the risk of “finding” things that aren’t really there – remember that the data you observe is probably a sample. • Types of cross-validation • 2-fold: split the data into 2 sets • Set 1: “training” • Set 2: “testing” • K-fold: split the data into k different subsets • Repeated random subsampling (my preferred method) • Take a random subset of the data and withhold it: “testing data” • Use the remaining data to estimate a model and then see how well it predicts the “testing” data.

  12. Model Validation Example • TeachingRatings.xls • Take a random subsample of 20% of the observations as your test data set • Use the remaining 80% of observations to estimate course evals from nnenglish, beauty, and age • Use the estimated model to predict for the withheld testing data • Calculate overall fit statistics (RMSE, etc.) On your own, try again but with a 50/50 split of the data (230 obs per set of data).
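The steps above can be sketched in Python. This is not the actual TeachingRatings analysis: the data here is simulated with made-up coefficients (the three columns merely stand in for nnenglish, beauty, and age), to show the mechanics of one repeated-random-subsampling round.

```python
# Holdout validation sketch on simulated data (coefficients are made up).
import numpy as np

rng = np.random.default_rng(0)
n = 460
X = rng.normal(size=(n, 3))   # stand-ins for nnenglish, beauty, age
y = 4 + X @ np.array([0.2, 0.3, -0.01]) + rng.normal(scale=0.5, size=n)

def out_of_sample_fit(X, y, test_frac=0.2, rng=rng):
    """Withhold a random test_frac of the data, fit OLS on the rest,
    and return (RMSE, R^2) computed on the withheld observations."""
    idx = rng.permutation(len(y))
    n_test = int(test_frac * len(y))
    test, train = idx[:n_test], idx[n_test:]
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1[train], y[train], rcond=None)
    resid = y[test] - X1[test] @ beta
    rmse = np.sqrt(np.mean(resid**2))
    r2 = np.corrcoef(X1[test] @ beta, y[test])[0, 1] ** 2
    return rmse, r2

rmse, r2 = out_of_sample_fit(X, y)
```

For "repeated" random subsampling you would call the function many times and average the out-of-sample RMSE and R² across rounds.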

  13. Causal Inference Correlation ≠ Causation • But sometimes it flirts and winks at it suggestively. • The real question is when can we use regression (i.e., a correlation analysis) to understand a causal effect? • The Economist: “That’s our job.” • The Statistician: “Almost Never.” • (both are overstatements but in the right direction)

  14. Causal Inference • To understand when we can and cannot use regression to infer something about a causal relationship, we need to understand the gold standard and why/when it might fail. • Ideal: an experiment, or randomized controlled trial (RCT) • Think about medical trials. How would you know if a drug worked? (Experimental design is harder than this.)

  15. Causal Inference • In the experimental setting, think about what the regression model would look like: something like Y = β0 + β1·Treatment + ε • What about the other covariates? • When there is no moderating influence, they’re in the error term and aren’t needed explicitly, because people are randomized to Treatment or Control. • There are more complicated situations where you’d want to include other covariates (for instance interactions) in the model. • Both of these points assume that people comply and actually “take” the treatment.

  16. Causal Inference • If we can approximate random assignment and believe subjects don’t self-select out (two heroic assumptions), then we’re back to influence diagrams:

  17. Causal Inference Relationships between variables: Type E: affects the dependent variable directly but is influenced by the treatment variable. • A problematic variable! • Don’t include it in the model • Known as “controlling for a post-treatment variable.”

  18. Causal Inference • Note: • Including the post-treatment variable in the regression model will increase adjusted R² (usually substantially) • Including the post-treatment variable might reduce the RMSE • If you’re interested in the effect, i.e., getting the “right” or best estimate of the partial slope, do not include the post-treatment variable. Forget about points 1 and 2. • The recommendation for forecasting will be different.
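A small simulation makes the warning above concrete. Everything here is hypothetical (the variable names and coefficients are invented): treatment T is randomized, M is a post-treatment variable that T influences, and Y depends on both. Regressing Y on T alone recovers the full causal effect of assignment; adding M "controls away" the part of the effect that flows through M.

```python
# Simulation (made-up numbers): controlling for a post-treatment variable.
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
T = rng.integers(0, 2, size=n).astype(float)   # randomized treatment
M = 2.0 * T + rng.normal(size=n)               # post-treatment variable
Y = 1.0 * T + 1.5 * M + rng.normal(size=n)     # total effect of T = 1 + 1.5*2 = 4

def ols_coef(cols, y):
    """OLS slope on the first regressor in cols (intercept added here)."""
    X1 = np.column_stack([np.ones(len(y))] + list(cols))
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return beta[1]

total = ols_coef([T], Y)        # ~4: the causal effect of assignment
direct = ols_coef([T, M], Y)    # ~1: conditioning on M hides the path via M
```

Note that the Y~T+M regression typically fits better (higher R², lower RMSE), exactly as the slide says, while giving the wrong answer for the causal effect.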

  19. Causal Inference All of the above said, there are still (several) fundamental problems; we’ll talk about 3 • Omitted Variable Bias • We’ve talked about this before, and the remaining problematic cases we’ll discuss are variants on OVB. • You’ll have OVB if the omitted variable is correlated with an independent variable and is a determinant of the dependent variable. • Omitting the types B, C, or D from the influence diagram is the classic OVB. • Independent of adjusted R²: you can have OVB with a high (or low) R². • How do you deal with OVB if you don’t have data on the omitted variable? • Hard… but there are some things you can do (we won’t cover them in this course)
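The two conditions for OVB (correlated with an independent variable, and a determinant of the dependent variable) can be seen in a quick simulation. All numbers here are hypothetical: Z is the omitted variable, and leaving it out biases the slope on X upward because cov(X, Z) > 0 and Z raises Y.

```python
# Simulation (made-up numbers) of omitted variable bias.
import numpy as np

rng = np.random.default_rng(2)
n = 10_000
Z = rng.normal(size=n)                  # the variable we might omit
X = 0.8 * Z + rng.normal(size=n)        # X is correlated with Z
Y = 1.0 * X + 2.0 * Z + rng.normal(size=n)   # Z also determines Y

def slope_on_x(cols, y):
    """OLS slope on the first regressor in cols (intercept added here)."""
    X1 = np.column_stack([np.ones(n)] + list(cols))
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return beta[1]

full = slope_on_x([X, Z], Y)     # ~1.0: unbiased when Z is included
omitted = slope_on_x([X], Y)     # biased upward, since cov(X, Z) > 0
```

If either condition fails (Z uncorrelated with X, or Z has no effect on Y), the bias disappears, which is exactly the slide's rule.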

  20. Causal Inference Problem 2: Selection Bias • When you don’t have random assignment. In particular, when you have one “type” or group that is disproportionately in the sample. • In my mind the key to causal inference is the counterfactual: what would the person have done had they not received the treatment? • Examples: • Positive effect of education on earnings? • Maybe, but the highest-ability individuals may get more education and would have had higher earnings regardless; hence the effect of education is overstated by a comparison of mean income conditional on education. Omitted variable bias. • CMUQ outreach activities? Or even the effect of CMU?

  21. Causal Inference Problem 3: Endogeneity • When X causes Y and Y causes X. • Unfortunately very common in economic settings. • Remember when we estimated the price elasticity early in the course? I.e., we had quantities and prices and ran a log-log simple regression. • Well… that took strong assumptions. Price and quantity are a function of supply and demand. So where did the changes in price come from? A change in demand or a change in supply? • Common econometrics technique: • Instrumental Variables (IV) and Two-Stage Least Squares (TSLS). • How it works: find something, an “instrument,” that is correlated with X but not with the error term (the omitted factors determining Y) • I’m happy to talk about these if you want. It’ll take at least one full lecture (it’s easy to get lost).
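For the curious, here is a bare-bones sketch of the two-stage mechanics on simulated data. Everything is hypothetical (variable names and coefficients are invented): X is endogenous because it shares the unobserved factor u with Y, while Z is a valid instrument (correlated with X, unrelated to u).

```python
# Two-stage least squares sketch (made-up data; true effect of X is 2).
import numpy as np

rng = np.random.default_rng(3)
n = 20_000
Z = rng.normal(size=n)                       # instrument
u = rng.normal(size=n)                       # omitted factors in Y
X = 1.0 * Z + 0.8 * u + rng.normal(size=n)   # endogenous: depends on u
Y = 2.0 * X + u                              # true causal effect is 2

def ols(cols, y):
    """OLS coefficients (intercept added here)."""
    X1 = np.column_stack([np.ones(n)] + list(cols))
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return beta

naive = ols([X], Y)[1]            # biased above 2: X is correlated with u

# Stage 1: regress X on the instrument, keep the fitted values
b1 = ols([Z], X)
X_hat = b1[0] + b1[1] * Z
# Stage 2: regress Y on the fitted values from stage 1
tsls = ols([X_hat], Y)[1]         # ~2: the endogenous part of X is purged
```

The fitted values in stage 1 contain only the variation in X that comes from Z, so the stage 2 slope is free of the feedback from u.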
