
Linear Regression


Presentation Transcript


  1. Linear Regression

  2. 5.2 Introduction • Correlations tell us nothing about the predictive power of variables. • In regression we fit a predictive model to our data and use the model to predict values of the dependent variable from one or more independent variables. • Outcomei = (Modeli) + errori • The word 'model' in the equation gets replaced by something that defines the line we fit to the data. • With any data set there are many lines that could be used to summarise the general trend. • We need to decide which of the many possible lines to choose. • To draw accurate conclusions we want to fit the model that best describes the data. • There are several ways to fit the line, e.g. by eye, or by a mathematical technique, the 'method of least squares'.

  3. 5.2.1 Describing a Straight Line • Yi = b0 + b1Xi + ei • b1 • Regression coefficient for the predictor • Gradient (slope) of the regression line • Direction/strength of the relationship • b0 • Intercept (value of Y when X = 0) • Point at which the regression line crosses the Y-axis (ordinate)

  4. Same Intercept, Different Gradient

  5. Same Gradient, Different Intercept

  6. 5.2.2. The Method of Least Squares • Insert Figure 5.2
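
The figure is not reproduced here, but as a rough illustration of the method of least squares (not part of the original slides), the slope and intercept that minimise the squared deviations can be computed directly from the sample means. The numbers below are made up purely for illustration and NumPy is assumed to be available.

```python
import numpy as np

# Made-up data: advertising budget (X) and record sales (Y)
X = np.array([10, 40, 70, 100, 130], dtype=float)
Y = np.array([120, 135, 140, 150, 160], dtype=float)

# Least-squares estimates for the line Yi = b0 + b1*Xi:
# b1 = sum((X - mean(X)) * (Y - mean(Y))) / sum((X - mean(X))^2)
b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
b0 = Y.mean() - b1 * X.mean()   # intercept: the fitted line passes through the means
print(b0, b1)
```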

  7. 5.2.3 Assessing the Goodness of Fit: Sums of Squares, R and R2 (i.e. how good is the model?) • The regression line is only a model based on the data. • This model might not reflect reality. • We need some way of testing how well the model fits the observed data. • How?

  8. Once we have found the line of best fit it is important that we assess how well this line fits the actual data. To assess the line of best fit we need to compare it against something, and the thing we choose is the most basic model we can find. We use the equation Deviation = ∑(observed – model)² to calculate the fit of the most basic model, and then the fit of the best model. If the best model is any good it should fit the data significantly better than the basic model. We choose the mean as the basic model and calculate the differences between the observed values and the values predicted by the mean. This sum of squared differences is called the total sum of squares (SST); it represents how good the mean is as a model of the observed data. If we then fit our best model (least squares), we can again find the differences between the observed data and the new model. These differences are squared and added, giving the sum of squared residuals (errors), SSR. This represents the degree of inaccuracy that remains when the best model is fitted. We can use these two values to calculate how much better the regression line is than just using the mean as a model: SSM = SST – SSR. This difference shows the reduction in inaccuracy that results from fitting the regression model to the data. R2 = SSM / SST (variation explained by the model / total variation in the data).
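
As a minimal sketch of these calculations (again with made-up numbers, assuming NumPy), the three sums of squares and R2 follow directly from the definitions above:

```python
import numpy as np

# Made-up data and the least-squares line from the earlier sketch
X = np.array([10, 40, 70, 100, 130], dtype=float)
Y = np.array([120, 135, 140, 150, 160], dtype=float)
b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
b0 = Y.mean() - b1 * X.mean()
Y_hat = b0 + b1 * X                 # values predicted by the regression model

SST = np.sum((Y - Y.mean()) ** 2)   # total sum of squares: observed vs the mean model
SSR = np.sum((Y - Y_hat) ** 2)      # residual sum of squares: observed vs the fitted model
SSM = SST - SSR                     # model sum of squares: improvement over the mean
R2 = SSM / SST                      # proportion of total variation explained
print(SST, SSR, SSM, R2)
```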

  9. Sums of Squares • Insert Figure 5.3

  10. Summary • SST • Total variability (variability between scores and the mean). • SSR • Residual/Error variability (variability between the regression model and the actual data). • SSM • Model variability (difference in variability between the model and the mean).

  11. Testing the Model: ANOVA • SST (total variance in the data) is partitioned into SSM (improvement due to the model) and SSR (error in the model). • If the model results in better prediction than using the mean, then we expect SSM to be much greater than SSR.

  12. Testing the Model: ANOVA • Mean Squared Error • Sums of Squares are total values. • They can be expressed as averages, called Mean Squares (MS). • The F-ratio compares the variation explained by the model (MSM) with the variation not explained by the model (MSR).
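
Continuing the sketch given after slide 8 (the variables X, Y, SSM and SSR are reused), the mean squares and the F-ratio reported in the ANOVA table could be computed as follows; this is an illustrative sketch, not SPSS output.

```python
n = len(Y)   # sample size
k = 1        # number of predictors in simple regression

MSM = SSM / k              # mean square for the model (df = k)
MSR = SSR / (n - k - 1)    # mean square for the residuals (df = n - k - 1)
F = MSM / MSR              # F-ratio: variation explained vs variation not explained
print(MSM, MSR, F)
```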

  13. Testing the Model: R2 • R2 • The proportion of variance accounted for by the regression model. • The Pearson Correlation Coefficient Squared

  14. 5.2.4 Assessing Individual Predictors • Yi = b0 + b1Xi + ei • The value b1 represents the change in the outcome resulting from a unit change in the predictor. If the model is bad the regression coefficient will be 0, meaning a unit change in the predictor variable results in no change in the predicted value. • This is tested using a t-test. Null hypothesis H0: b1 = 0; alternative hypothesis Ha: b1 ≠ 0. • If the result is significant (p less than 0.05) we accept that the b value is significantly different from zero and that the predictor variable contributes significantly to predicting the outcome variable. • t = (b observed – b expected) / SEb, which simplifies to t = b observed / SEb.

  15. t test • Let us assume we take lots of samples of the data regarding adverts and sales and calculate the b value for each sample. • We could plot a frequency distribution of these b values to see whether they are relatively similar or different across samples. • We can use the standard deviation of this distribution (called the standard error, SE) as a measure of the spread of the b values. If the SE is small, most samples have a b value similar to the one in the sample selected (because there is little variation across samples). • The t-test tells us whether the b value is different from 0, relative to the variation in b values across similar samples: t = (b observed – b expected) / SEb. • b expected is the value of b we would expect to obtain if the null hypothesis were true, i.e. b expected = 0, so the equation simplifies to t = b observed / SEb.

  16. t test • The values of t have a special distribution that differs according to the degrees of freedom. • In regression the degrees of freedom are N – P – 1, where N = sample size and P = number of predictors. • Using the degrees of freedom, establish which t distribution is to be used and compare the observed value of t with the values we would expect to find by chance. • If t is very large compared to the tabulated values, it is very unlikely to have occurred by chance: it is a genuine effect. • SPSS provides the exact probability of obtaining the observed t value if the b parameter were in fact zero.
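
A rough sketch of this t-test, continuing the earlier made-up example (b1, X, MSR, n and k are reused, and SciPy is assumed for the t distribution); the standard-error formula shown is the usual one for a simple regression slope.

```python
import numpy as np
from scipy import stats

# Standard error of the slope b1, using MSR and X from the sketches above
SE_b = np.sqrt(MSR / np.sum((X - X.mean()) ** 2))
t = (b1 - 0) / SE_b                                # b expected = 0 under H0
p_value = 2 * stats.t.sf(abs(t), df=n - k - 1)     # two-tailed, d.o.f. = N - P - 1
print(t, p_value)
```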

  17. 5.3 Regression: An Example • Open: Record1.sav

  18. Regression Using SPSS

  19. 5.4.1 Interpreting a Simple Regression • The model summary tells us whether the model is successful in predicting sales.

  20. SPSS Output: ANOVA • The ANOVA tests whether the model is significantly better at predicting the outcome than using the mean as a 'best guess' model.

  21. 5.4.2 SPSS Output: Model Parameters • How do I interpret b values?

  22. 5.4.3 Using the Model • Record Salesi = 134.14 + 0.09612 × Advertising Budgeti • For an advertising budget of 100: 134.14 + 0.09612 × 100 = 143.75
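
The same prediction can be written as a one-line function; this is just a restatement of the slide's worked example, not SPSS output.

```python
def predict_sales(advertising_budget):
    # Fitted model from the slide: sales = 134.14 + 0.09612 * advertising budget
    return 134.14 + 0.09612 * advertising_budget

print(predict_sales(100))   # ≈ 143.75, as on the slide
```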

  23. 5.5 Multiple Regression: Basics • Outcomei = (Modeli) + errori • Yi = (b0 + b1X1i + b2X2i + … + bnXni) + errori

  24. 5.5.1 Example of Regression: Record Sales Company • Advertising accounts for 33% of the variation in sales; 67% of the variation remains unexplained. A new predictor is therefore introduced to explain some of the unexplained variation in sales: the number of times the record is played on the radio (airplay). • Record Salesi = (b0 + b1 Advertising Budgeti + b2 Airplayi) + errori • There is a b value for each predictor. • 3-D graphical model in Figure 5.6, page 158.
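
As an illustrative sketch of fitting a two-predictor model (with made-up numbers, not the chapter's data set; NumPy assumed), the b values can be obtained by least squares on a design matrix that includes a column of 1s for the intercept:

```python
import numpy as np

# Made-up data: advertising budget, airplay, and record sales
adverts = np.array([10.0, 100.0, 500.0, 800.0, 1200.0])
airplay = np.array([5.0, 15.0, 25.0, 30.0, 40.0])
sales = np.array([60.0, 120.0, 180.0, 220.0, 300.0])

# Design matrix: a column of 1s (for b0), then one column per predictor
design = np.column_stack([np.ones_like(adverts), adverts, airplay])
b, *_ = np.linalg.lstsq(design, sales, rcond=None)
b0, b1, b2 = b
print(b0, b1, b2)   # Record Sales_i ≈ b0 + b1*Adverts_i + b2*Airplay_i
```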

  25. 5.5.2 Sums of Squares, R and R2 • SST: represents the difference between the observed values and the mean value of the outcome variable. • SSR: represents the difference between the values of Y predicted by the model and the observed values. • SSM: represents the difference between the values of Y predicted by the model and the mean value. • Multiple R is a measure of how well the model predicts the observed data; it is used when there are multiple predictors. Multiple R is the correlation between the observed values of Y and the values of Y predicted by the regression model. A large R indicates a large correlation between the predicted and observed values of the outcome; R = 1 means the model perfectly predicts the observed data. • R2: the amount of variation in the outcome variable that is accounted for by the model.

  26. 5.5.3 Methods of Regression • Hierarchical (Blockwise Entry): predictors are selected based on past work and the experimenter decides the order of entry, entering the important ones first. • Forced Entry: all predictors are entered simultaneously (you should have good theoretical reasons to include the chosen predictors). • Stepwise Methods: the order in which predictors are added is based on a mathematical criterion. • Forward method: the initial model contains only the constant b0. The computer searches for the predictor that best predicts the outcome variable (the variable with the highest correlation with the outcome). If this predictor significantly improves the ability of the model to predict the outcome, it is retained and the computer searches for another predictor. The criterion for the second predictor is the largest semi-partial correlation with the outcome: the predictor that accounts for the most new variance is added, and if it makes a significant contribution to the predictive power of the model it is retained and another predictor is considered. • Stepwise method: similar to the forward method, except that each time a predictor is added to the equation a removal test is made of the least useful predictor; the regression equation is constantly reassessed to see whether any redundant predictors can be removed. • Backward method: the computer places all the predictors in the model and then calculates the contribution of each by looking at the significance value of the t-test for each predictor.

  27. 5.6.1 How Accurate is My Regression Model? • Fit: does the model fit the observed data well? • Generalisation: can my model generalise to other samples? • 5.6.1.1 Outliers and residuals • Outlier: a case that differs substantially from the main trend. • How to find one? Use residuals (the differences between the values of the outcome predicted by the model and the values of the outcome observed in the data). • Unstandardised residuals • Standardised residuals (z-scores: 95% should lie within ±1.96, 99% within ±2.58, 99.9% within ±3.29) • Studentised residuals • 5.6.1.2 Influential cases • Do certain cases exert undue influence over the parameters of the model, i.e. if we delete a case do we obtain different regression coefficients? There are several residual statistics for assessing the influence of a particular case: • Adjusted predicted value (the predicted value when the case is excluded from the analysis) • DFFit = adjusted predicted value – original predicted value • Standardised DFFit • Deleted residual = adjusted predicted value – original observed value • Studentised deleted residual • Statistics that assess how a case influences the model as a whole: • Cook's distance: considers the effect of a single case on the model as a whole; values greater than 1 are a cause for concern. • Leverage: values lie between 0 (the case has no influence) and 1 (the case has complete influence). • Mahalanobis distances: values above 25 are a cause for concern. • DFBeta: the difference between a parameter estimated using all cases and the same parameter estimated when one case is excluded. • Standardised DFBeta: universal cut-off points can be applied; values greater than 1 indicate cases that substantially influence the model parameters. • Covariance ratio: a measure of whether a case influences the variance of the regression parameters.
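
Several of these diagnostics are also available outside SPSS. As a hedged sketch (assuming the statsmodels package, with simulated data standing in for a real data set), Cook's distance, leverage, studentised deleted residuals and standardised DFBeta could be obtained like this:

```python
import numpy as np
import statsmodels.api as sm   # assumes statsmodels is installed

# Simulated data standing in for a real outcome and predictor
rng = np.random.default_rng(0)
x = rng.uniform(0, 100, size=30)
y = 50 + 0.5 * x + rng.normal(0, 10, size=30)

model = sm.OLS(y, sm.add_constant(x)).fit()
influence = model.get_influence()

cooks_d, _ = influence.cooks_distance                  # Cook's distance (> 1 is a concern)
leverage = influence.hat_matrix_diag                   # leverage (hat values), 0 to 1
stud_del_resid = influence.resid_studentized_external  # studentised deleted residuals
dfbetas = influence.dfbetas                            # standardised DFBeta per parameter
print(cooks_d.max(), leverage.max())
```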

  28. 5.6.2 Assessing the Generalisation of the Model • In social science we are interested in generalising our findings beyond the sample. • 5.6.2.1 Several assumptions must be true • 5.6.2.2 Cross-validation of the model • 5.6.2.3 Sample size in regression • 5.6.2.4 Multicollinearity • 5.6.2.1 Several assumptions must be true: • Variable types: predictor variables must be quantitative or categorical; the outcome variable must be quantitative, continuous and unbounded. • Non-zero variance: the predictors should have some variance. • No perfect multicollinearity: no perfect linear relationship between two or more predictors. • Predictors are uncorrelated with external variables. • Homoscedasticity: at each level of the predictor variable, the variance of the residual terms should be constant. • Independent errors (Durbin-Watson test) • Normally distributed errors • Independence • Linearity
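
The Durbin-Watson test for independent errors can likewise be run on a fitted model's residuals; a minimal sketch, reusing the statsmodels model from the previous example.

```python
from statsmodels.stats.stattools import durbin_watson   # assumes statsmodels is installed

dw = durbin_watson(model.resid)   # residuals from the model fitted above
print(dw)                         # values near 2 suggest uncorrelated (independent) errors
```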

  29. 5.6.2.2 Cross-Validation of the Model • Assessing the accuracy of the model across different samples is known as cross-validation. There are two main methods: • Adjusted R2: indicates the loss of predictive power, or shrinkage. R2 tells us how much of the variance in Y is accounted for by the regression model from our sample; the adjusted value tells us how much variance in Y would be accounted for if the model had been derived from the population from which the sample was taken. • Data splitting: randomly splitting the data in half, computing a regression equation on each half and comparing the resulting models.
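
The adjusted R2 can be expressed with the usual shrinkage formula; a small sketch, with an illustrative R2 value rather than one taken from the chapter's output.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-squared: 1 - (1 - R^2) * (n - 1) / (n - k - 1),
    where n is the sample size and k the number of predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(adjusted_r2(0.33, 200, 1))   # illustrative values only
```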

  30. 5.6.2.3 Sample Size in Regression • Rules of thumb

  31. 5.7 Multiple Regression Using SPSS • 5.7.1 Main Options • 5.7.2 Statistics • Regression Plots • Saving Regression diagnostics • Further Options

  32. 5.8 Interpreting Multiple Regression • 5.8.1 Descriptives • 5.8.2 Summary of Model • 5.8.3 Model Parameters • 5.8.4 • 5.8.5 Assessing the assumptions of multicollinearity • 5.8.6 Case wise diagnostics • Checking assumptions

  33. 5.10 Categorical Predictors and Regression • 5.10.1 Dummy coding is a way of representing categorical variables using 0s and 1s. The number of dummy variables we need is one less than the number of groups we are coding.

  34. Example: Glastonbury Festival • The biologist categorised people according to their musical affiliation: • People liking alternative music – indie kid • People liking heavy metal – metaller • People liking hippy/folky music – crusty • People not liking music – no affiliation
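
A quick sketch of dummy coding for the four Glastonbury groups (assuming pandas is available): with four groups we need three dummy variables, and 'no affiliation' is used here as the baseline group coded 0 on every dummy.

```python
import pandas as pd   # assumes pandas is available

# Illustrative affiliations; a real analysis would use the full data set
affiliation = pd.Series(["indie kid", "metaller", "crusty", "no affiliation", "metaller"])

# Four groups -> three dummy variables; the baseline group is all 0s
dummies = pd.get_dummies(affiliation).drop(columns="no affiliation")
print(dummies)
```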
