
Introduction to Data Analysis.


Presentation Transcript


  1. Introduction to Data Analysis. Non-linearity, heteroskedasticity, multicollinearity, oh my!

  2. Last week’s lecture • Extended our linear regression model to include: • Multiple independent variables (so as to get rid of spurious relationships). • Categorical independent variables (by adding dummy variables for each category minus one). • Interactions between independent variables (e.g. is there a relationship between x and y only when z is high?). • These types of regression models are used all the time.

  3. This week’s lecture • This week we have a look at what happens when some of the underlying assumptions behind linear regression are not met. • Non-linear relationships. • Different amounts of variation at different levels of an independent variable. • ‘Outliers’. • Independent variables that are highly correlated. • Reading for this week. • A & F chapter 14

  4. Before all this though… • Want to think a bit more about ‘model building’. • Why do we include some independent variables and not others in a multiple regression? • When is a model a ‘good model’ and when is it a ‘bad model’? • Is this an art or a science? Are there ‘rules’ that we should follow for fitting a model? • I have some data, how should I analyze it?

  5. What not to do (1) • I have a sample of 1000 people that measures 100 characteristics of those people regarding a variety of things including: • Belief in fairies. • Father’s occupation. • Father’s height. • Do they prefer oranges to toffee. • And so on… • I want to predict individuals’ heights. • One way of doing this with my data would be to include every variable in a multiple regression and see which were statistically significant.

  6. What not to do (2) • This is a nonsensical way to proceed, of course. • Causation: can belief in fairies really explain height? • Even if none of those 100 variables were actually related to height, we would expect about 5 to be statistically significant at the 5% level just by chance. • If we include every variable, do we include every possible interaction as well (this will be a lot of interactions)? • Independent variables are often correlated with one another (more on this later today), so including them all doesn’t really make sense.

  7. What to do (1) • Generally we have a previous empirical model (or models) based on a theory (or theories). • An appropriate way to proceed might be to start with this model. • We then might introduce any other variables that we think are actually important (according to some tweak or major change to the underlying theory), or interactions between variables. • e.g. previous work suggests that a father’s height predicts his son’s height. We think fathers who worked as miners also have shorter sons, so we include this.

  8. What to do (2) • For more exploratory models. • Include variables with reference to theory. • Interpret statistically significant variables appropriately. • Generally KEEP IT SIMPLE. • Our models are meant to help us understand reality. • Unnecessary variables are likely to increase the size of standard errors for all variables, and make everything more difficult to interpret. • If we have strong prior beliefs that a variable should be important, then it should generally be left in the model even if not statistically significant. • Small data-sets mean that you can’t include many independent variables.

  9. What to do (3) • R2 tells you something about the predictive power of your model, but don’t over-interpret R2 values for your model. • Small R2 values don’t mean that the model is ‘bad’; there could be a lot of intrinsic variation or measurement error. • What is normally more important is whether there is any structure to the variation that you can’t explain.

  10. Residuals • We’re trying to predict Y, but we’re never going to be spot on. • The deviation of the observation from our actual prediction (the e in our equation) is called the residual. • Residuals should be randomly distributed around zero (it’s inherent variation that we can’t predict). • Examining graphs of residuals can help us to work out whether we can improve our model.

  11. An example • Let’s imagine I was interested in how many people came to these lectures every week. • For the last 5 years I counted how many students turned up every week. • My hypotheses are that: • As term goes on, fewer people come to the lectures. • On days after the Oxford beer festival, fewer people come to lectures.

  12. Some data • [Scatterplot: weekly attendance against week number, showing the regression line predicting number of people from week number, with post beer-festival days marked.]

  13. Multiple regression • Before we pronounce this a brilliant model we should examine plots of the residuals against each X variable to see if there is any structure. • Each residual is the actual value we observe (y) minus the predicted value (Ŷ), which is what we have been calling e. • Is there a pattern to the residuals? Let’s look at week number.
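A minimal sketch of this residual check in Python with statsmodels and matplotlib. The data are simulated stand-ins for the attendance example; the variable names (week, post_festival, attendance), the numbers and the festival timing are all assumptions made for illustration, not the lecture's actual data.

```python
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

# Simulated attendance data: 8 weeks of term observed over 5 years.
rng = np.random.default_rng(0)
week = np.tile(np.arange(1, 9), 5)
post_festival = (week == 5).astype(int)            # assumed: festival falls before week 5
attendance = 120 - 25 * np.log(week) - 10 * post_festival + rng.normal(0, 5, week.size)

# Fit the linear model: attendance on week number and the post-festival dummy.
X = sm.add_constant(np.column_stack([week, post_festival]))
fit = sm.OLS(attendance, X).fit()

# Plot residuals (observed minus predicted) against week number;
# any visible structure suggests the model can be improved.
plt.scatter(week, fit.resid)
plt.axhline(0, color="grey")
plt.xlabel("Week number")
plt.ylabel("Residual")
plt.show()
```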

  14. Residuals against week number • [Plot of residuals against week number: in early weeks we predict too few people, in the middle weeks we predict too many people, and in later weeks we again predict too few people.]

  15. A non-linear relationship • There is a pattern here; we don’t think that what we see is just due to inherent variation. • Given that we under-predict the number of students at the beginning and end of term, and over-predict the number in the middle of term, it looks like the relationship between week number and attendance is non-linear. • i.e. for different values of week number the slope of the line is different. At the beginning of term attendance falls off sharply as the work-shy depart, but then stabilises towards the middle of term and remains at similar levels until the end. • If we had more explanatory variables this would be difficult to spot without the residuals graph.

  16. Dealing with non-linearity • Fortunately it’s not that difficult to deal with non-linear relationships. • There are three main options: • Include week number and the square of week number (i.e. week number multiplied by itself) in our model. • ‘Transform’ week number (normally by taking the log of the independent variable). • Use a different kind of regression that lets you fit a squiggly line to the data.

  17. Squared terms (1) • The most commonly used way to take account of non-linearity is to use a polynomial regression function, e.g. E(Y) = α + β1X + β2X2. • Depending on the sign of β2 the curve opens downwards (if β2 < 0, a hill or inverted-U shape) or upwards (if β2 > 0, a bowl or U shape).

  18. Squared terms (2) • If we add an X2 term to our model of lecture attendance we get a new set of coefficients, along with an SE and p-value for the X2 term. • Since the X2 term is statistically significant, we have good evidence that the relationship is indeed non-linear.
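What this looks like in practice, continuing the simulated attendance data from the earlier sketch (so week, post_festival and attendance are the hypothetical variables defined there):

```python
import numpy as np
import statsmodels.api as sm

# Add week number and its square as predictors (a quadratic in week number).
X_quad = sm.add_constant(np.column_stack([week, week ** 2, post_festival]))
fit_quad = sm.OLS(attendance, X_quad).fit()

# The summary reports a coefficient, SE and p-value for the squared term;
# a statistically significant squared term is evidence of non-linearity.
print(fit_quad.summary())
```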

  19. Squared terms (3)

  20. Another residuals graph • [Plot: residuals scattered randomly around zero with no apparent pattern, which is generally what we want to see.]

  21. Squared terms (4) • The residuals are now much more randomly scattered around zero. • Note though that there is a little more scatter for weeks at the beginning of term than at the end (more on this in a minute). • Using squared terms is particularly useful when the relationship ‘goes up and down’. • e.g. my example of frequency of sex and happiness. As X increases, there is an increase in Y and then eventually a decrease.

  22. Logs (1) • Using squared terms is not the only way of dealing with non-linearity. Sometimes we transform our independent variable by taking its log, and then use the logged value as our independent variable. • This is especially useful when we think that the relationship looks like an exponential function. • [Plots: Y against X for exponential-type relationships with β < 1 and with β > 1.]

  23. Logs (2) • If you think of a log scale, then the distance between 1 and 2 is the same as the distance between 2 and 4, and between 4 and 8, and so on.

  24. Logs (3) • Taking the log of a variable means that high values will now be more bunched together and low values will be more spread apart. • This means that some curved relationships between X and Y (like in our example) will be close to a linear relationship between log(X) and Y. • While this can be a useful procedure, you need to be careful in interpreting coefficients (since a one-unit change is now a change in log(X), not X). • Equally, when making predictions, you need to remember to input log(X) into your equation, not X.
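A sketch of the log-transform alternative, again continuing the simulated attendance data from the first sketch (week, post_festival and attendance are hypothetical stand-ins):

```python
import numpy as np
import statsmodels.api as sm

# Regress attendance on log(week number) rather than week number itself.
X_log = sm.add_constant(np.column_stack([np.log(week), post_festival]))
fit_log = sm.OLS(attendance, X_log).fit()

# The slope coefficient now refers to a one-unit change in log(week), not week,
# and predictions must also be fed log(week):
new_week = 6
prediction = fit_log.predict([[1.0, np.log(new_week), 0.0]])  # constant, log(week), no festival
print(prediction)
```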

  25. Non-linearity summary • Generally we include a squared term; this is especially good at accounting for the relationship when the slope goes up and down. • Sometimes we want to include a log of the X variable. This is particularly good when we think the relationship behaves in an exponential-like manner. • But beware… • Be very careful extrapolating from these models (our first model predicts an increase in attendance after week 8…). • There are more sophisticated methods to deal with non-linearity.

  26. Breaking the rules • Non-linearity is an example of the regression rules not holding, but there are others that we can deduce from graphs of the residuals. • An important rule breaker is something called heteroskedasticity. • This tongue-twister just means that the values of Y are more variable at some levels of X compared to others. • If these differences are big then they violate one of the assumptions we made behind OLS regression. • Remember from previous lectures:

  27. [Diagram from a previous lecture: the regression line for the population goes through the mean of Y for each value of X; the distributions of Y given X=40, X=80 and X=120 are shown, and the standard deviation at each level of X is assumed to be the same.]

  28. Constant standard deviation? • So what if it isn’t constant? We can check this by looking at graphs of the residuals. • Take an example: we wish to predict the number of tweed garments owned by university lecturers. • A sample of 100 people; obvious predictors are sex and age.

  29. Residuals against age • [Plot of residuals against age: the residuals have ‘low’ variation at younger ages and ‘high’ variation at older ages.]

  30. Tweed and variation • So in our case, for older people there appears to be more inherent variation than for younger people.

  31. Does this matter? • YES. If the residuals are more spread out for some values of X than others then the standard errors that we calculated will not be correct. • What we need to do is calculate robust standard errors. • These are generally larger than the normal standard errors, and compensate for the fact that for some levels of X there is more variation than at others. • If you don’t do this and use normal standard errors, then you could have results which look statistically significant but are not.
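A sketch of how robust standard errors are requested in practice, using Python/statsmodels. The tweed data below are simulated with heteroskedasticity deliberately built in; the variable names and numbers are made up, and 'HC1' is just one of several heteroskedasticity-robust covariance estimators.

```python
import numpy as np
import statsmodels.api as sm

# Simulated tweed example: garments owned, predicted from age and sex,
# with more variation in the outcome at older ages (heteroskedastic by construction).
rng = np.random.default_rng(1)
age = rng.uniform(25, 65, 100)
sex = rng.integers(0, 2, 100)
tweed = 0.2 * age + 1.5 * sex + rng.normal(0, 0.05 * age)

X = sm.add_constant(np.column_stack([age, sex]))
ols_fit = sm.OLS(tweed, X).fit()                    # ordinary standard errors
robust_fit = sm.OLS(tweed, X).fit(cov_type="HC1")   # heteroskedasticity-robust standard errors

print(ols_fit.bse)      # can be misleadingly small here
print(robust_fit.bse)   # typically larger; these are the ones to report
```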

  32. Outliers • Non-linearity and heteroskedasticity are two examples of structure left after we have fitted our model that we want to explain. • But, often we’re also interested in particular individual observations that don’t fit our predictions. • These ‘outlying’ observations can sometimes radically change our regression equation. • They can also help us to think about other independent variables that may be important.

  33. Another example • Let’s think about a model with only a few cases. • We might be interested in turnout (what percentage of people vote) around the world. • A reasonable hypothesis would be that the more competitive the party system is, the more likely people are to bother voting. • We can model turnout using ‘competitiveness’ as an independent variable in a regression.

  34. Turnout (1) • [Scatterplot of turnout against competitiveness with the regression line; Belgium and Australia stand out as outliers.]

  35. Turnout (2) • More generally when we had many variables in our model (say electoral system type, amount of corruption, GNP per capita and so on), we would want to plot the residuals against each independent variable. • In this particular case it looks like we need to add another independent variable. Both Australia and Belgium have compulsory voting, so we need to include this as a dummy variable. • Outliers often give you a hint about other predictors that you may want to include in your model. • But what if they don’t…

  36. What to do with outliers • Perhaps the first thing to do is to work out how outlying the outliers are, and whether it makes much difference to the model including those outlying observations. • How big is a particular residual? • How much difference does a particular observation make to the estimate of the coefficients (this is sometimes called leverage)?

  37. What’s a big residual? • We normally standardize residuals to try and work out whether they’re big or not. • These are called studentized residuals and are just the actual value of the residuals divided by the standard deviation we would expect from normal sampling variability. • This is like a z-statistic, so only about 5% of values should be above 1.96 or below −1.96. • This gives us an idea of how outlying the outliers are. • Let’s run this for the turnout data.
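A sketch of computing studentized residuals with statsmodels. The turnout data are simulated here (two observations are bumped upwards to mimic the compulsory-voting countries); the variable names are hypothetical stand-ins for the lecture's data.

```python
import numpy as np
import statsmodels.api as sm

# Simulated turnout example: turnout (%) against party-system competitiveness.
rng = np.random.default_rng(2)
competitiveness = rng.uniform(0, 1, 25)
turnout = 50 + 20 * competitiveness + rng.normal(0, 5, 25)
turnout[:2] += 30    # two artificial 'compulsory voting' countries with unusually high turnout

X = sm.add_constant(competitiveness)
turnout_fit = sm.OLS(turnout, X).fit()

# Studentized residuals: residuals scaled by their expected sampling variability.
influence = turnout_fit.get_influence()
student_resid = influence.resid_studentized_external
print(np.where(np.abs(student_resid) > 1.96)[0])   # flag candidate outliers
```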

  38. Studentized residuals • [Plot of studentized residuals: Belgium and Australia have values that we wouldn’t expect to see by chance very often at all.]

  39. Are they ‘influential’? • So we know that Australia and Belgium have ‘big’ residuals, but do they make much difference to the model? • The best way to think about this is with simple linear regression: where on our graph would a point that is far off the estimate make the most difference to the line? • We shift the line the most when such points are far away from the mean of X (just as a lever is most effective when you push the end rather than the middle, which is why this is known as leverage).

  40. An exemplary example • We have suffered from my lack of imagination and hence declining example quality over the last few weeks. • Indeed there is a strong negative relationship between week number and example quality (on a 1-10 scale). As weeks go on my examples become worse. • The interesting thing is if I improved the examples in some weeks, I would make a big difference to the relationship. • Whereas I could let things slide in some weeks and make little difference to the relationship.

  41. Leverage (1) • [Plot: regression line with week number predicting the quality of Ryan’s examples (on a 1-10 scale); two highlighted points are each moved by 2 to show how much more the line shifts when the altered point is far from the mean.]

  42. Leverage (2) • The point is that changes in observations far from the mean make more difference to the regression line. • What we often do is calculate a diagnostic called DFBETA; this tells us the effect of removing the observation on each parameter estimate in the model. • We get a high DFBETA (more than 1) for a parameter when the observation has a big residual and a lot of leverage. • What do we do with this information?
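DFBETAs can be read off the same influence object used for the studentized residuals; this sketch assumes `turnout_fit` from that earlier block. Note that statsmodels reports the standardized version (the change in each coefficient measured in standard-error units), for which 'more than 1' is a common small-sample rule of thumb.

```python
import numpy as np

# Reusing `turnout_fit` from the studentized-residuals sketch above.
influence = turnout_fit.get_influence()

# dfbetas: one row per observation, one column per coefficient; each entry is the
# change in that coefficient (in standard-error units) if the observation were dropped.
dfbetas = influence.dfbetas
influential = np.abs(dfbetas).max(axis=1) > 1    # rule-of-thumb cutoff from the lecture
print(np.where(influential)[0])
```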

  43. Dealing with outliers • Check the data. • Are the variables all measured correctly for that observation? • Is there a missing variable? • Could something else (like compulsory voting in our turnout case) account for the outliers? • Delete the observation…? • This is generally not a good idea unless you have some reason that overlaps with one of the above; it’s a real observation after all. • A good idea is to run everything without the outlier and see if anything much changes. If the ‘interesting’ relationships are dependent on this then you need to be cautious interpreting them. • ‘Robust regression’ is less affected by outliers (next semester).

  44. Some more long words • The final mouthful that we should worry about is multicollinearity. • This is a lot simpler than it sounds. It just means that the independent variables that we are interested in are closely related. • When one of them increases the others increase as well, making it fairly difficult to work out the separate effect of each predictor. • This is a common problem in social science because our variables often ‘overlap’ a lot.

  45. Multicollinearity (1) • Take attitudes to the EU: we could ask people whether they like the idea of a common foreign policy and a common defence policy, and use these questions to predict vote choice at EU elections. • In reality the answers to these questions (on a 1-10 scale, say) are highly correlated. • People either like the EU or they don’t.

  46. Multicollinearity (2) • The correlation between our two independent variables is 0.98. • When foreign policy goes up, defence policy goes up. • We can’t work out what happens when foreign policy goes up and defence stays the same.

  47. Multicollinearity (3) • We can measure this easily: if there are high correlations between our independent variables then the SEs will be high and it will be difficult to discern statistically significant results. • Regression coefficients are also much more difficult to interpret, because it may not make sense to talk about X1 changing while X2 is held constant if the two actually move together.
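One standard diagnostic is the variance inflation factor (VIF). Below is a sketch with simulated EU-attitude data; the variable names, correlation and cutoff are illustrative assumptions (VIFs well above about 10 are commonly read as a sign of serious multicollinearity).

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Simulated data: two 1-10 attitude scales that are almost perfectly correlated.
rng = np.random.default_rng(3)
foreign = rng.integers(1, 11, 200).astype(float)
defence = np.clip(foreign + rng.normal(0, 0.5, 200), 1, 10)
vote = 0.5 * foreign + 0.4 * defence + rng.normal(0, 1, 200)

X = sm.add_constant(np.column_stack([foreign, defence]))
fit = sm.OLS(vote, X).fit()
print(fit.bse)   # the SEs on the two attitude items are inflated by their correlation

# Variance inflation factor for each column of the design matrix.
for i, name in enumerate(["const", "foreign", "defence"]):
    print(name, variance_inflation_factor(X, i))
```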

  48. What to do? • We could just include one of the highly correlated independent variables. • This makes sense when the causal relationships are quite clear. • We could make a scale with our highly correlated independent variables. • This makes sense when there is an underlying variable that we haven’t measured (dislike of the EU here). • In experimental designs, make sure that the independent variables are uncorrelated.

  49. A penultimate problem • Two examples today (of ‘example quality’ and lecture attendance) used time as an independent variable. • We have to be careful when we do this, because there is often serial correlation between the residuals. • That means that when we’re bad at predicting something in August, we’d be bad at predicting it in September. • Imagine our predictor of the weather is the season. We don’t predict that it will be sunny in winter. • It is sunny on December 1st (so we have a large residual), and it’s likely to be sunny on December the 2nd (and again we have a large residual). When one residual is high, the next one is likely to be high.
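One quick check for serial correlation in the residuals of a time-indexed model is the Durbin-Watson statistic. The sketch below uses simulated stand-in data for the 'example quality' series; proper time-series methods are next semester's topic.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

# Simulated 'example quality' series: a downward trend plus drifting (serially correlated) errors.
rng = np.random.default_rng(4)
week_index = np.arange(1, 41)
quality = 9 - 0.15 * week_index + np.cumsum(rng.normal(0, 0.3, week_index.size))

fit = sm.OLS(quality, sm.add_constant(week_index)).fit()

# Values near 2 suggest no serial correlation; values well below 2 suggest
# that successive residuals are positively correlated.
print(durbin_watson(fit.resid))
```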

  50. Time-series data • This kind of data requires something called time-series analysis.
