Regression Methods

Presentation Transcript


  1. Regression Methods Olawale Awe LISA Short Course, Department of Statistics, Virginia Tech. November 21, 2013.

  2. About • What? • Laboratory for Interdisciplinary Statistical Analysis • Why? • Mission: to provide statistical advice, analysis, and education to Virginia Tech researchers • How? • Collaboration requests, Walk-in Consulting, Short Courses • Where? • Walk-in Consulting in GLC and various other locations (www.lisa.stat.vt.edu/?q=walk_in) • Collaboration meetings typically held in Sandy 312 • Statistical Collaborators? • Graduate students and faculty members in VT statistics department

  3. Requesting a LISA Meeting • Go to www.lisa.stat.vt.edu • Click the link for “Collaboration Request Form” • Sign into the website using your VT PID and password • Enter your information (email, college, etc.) • Describe your project (project title, research goals, specific research questions, whether you have already collected data, special requests, etc.) • Contact your assigned LISA collaborators as soon as possible to schedule a meeting

  4. Collaboration: Visit our website to request personalized statistical advice and assistance with: Experimental Design • Data Analysis • Interpreting Results • Grant Proposals • Software (R, SAS, JMP, SPSS...) LISA statistical collaborators aim to explain concepts in ways useful for your research. Great advice right now: Meet with LISA before collecting your data. Laboratory for Interdisciplinary Statistical Analysis LISA helps VT researchers benefit from the use of Statistics. LISA also offers: Educational Short Courses: Designed to help graduate students apply statistics in their research. Walk-In Consulting: M-F 1-3 PM, GLC Video Conference Room, for questions requiring <30 mins. Also 11 AM-1 PM Port (Library/Torg Bridge) and 9:30-11:30 AM ICTAS Café X. All services are FREE for VT researchers. We assist with research, not class projects or homework. www.lisa.stat.vt.edu

  5. Outline Introduction to Regression Analysis Simple Linear Regression Multiple Linear Regression Regression Model Assumptions Residual Analysis Assessing Multicollinearity: Correlation and VIF Model Selection Procedures Illustrative Example (Brief Demo with SPSS/PASW) Model Diagnostics and Interpretation

  6. Introduction Regression is a statistical technique for investigating, describing, and predicting the relationship between two or more variables. Regression has been regarded as the most widely used technique in statistics: it is as basic to statistics as the Pythagorean theorem is to geometry (Montgomery et al., 2006).

  7. Regression: Intro Regression analysis has tremendous applications in almost every field of human endeavor. It is one of the most popular statistical techniques used by researchers. Widely used in engineering, physical and chemical sciences, economics, management, social sciences, life and biological sciences, etc. Easy to understand and interpret. Simply put, regression analysis is used to find equations that fit data.

  8. When Do We Use the Regression Technique?

  9. Simple Linear Regression

  10. Simple Linear Regression Simple Linear Regression (SLR) is a statistical method for modeling the relationship between ONLY two continuous variables. A researcher may be interested in modeling the relationship between Life Expectancy and Per Capita GDP of seven countries as follows. Scatterplots are first used to graphically examine the relationship between the two variables.

  11. Types of Relationships Between Two Continuous Variables A scatter plot is a visual representation of the relationship between two variables. Positive and negative linear relationship

  12. Other Types of Relationships… Curvilinear Relationships No Relationship

  13. Simple Linear Regression Can we describe the behavior between the two variables with a linear equation? The variable on the x-axis is often called the explanatory or predictor variable (X). The variable on the y-axis is called the response variable (Y).

  14. Simple Linear Regression Model The Simple Linear Regression model is given by yᵢ = β0 + β1xᵢ + ϵᵢ, where yᵢ is the response of the ith observation, β0 is the y-intercept, β1 is the slope, xᵢ is the value of the predictor variable for the ith observation, and ϵᵢ is the random error.

  15. Interpretation of Slope and Intercept Parameter β1 is the difference in the predicted value of Y for a one-unit difference in X. β0 is the mean response when the predictor variable is zero (it may have no practical meaning but should be included). If β1 > 0 there is a positive relationship: as variable X increases, Y also increases. If β1 < 0 there is a negative relationship between the variables: as variable X increases, Y decreases. If β1 = 0, there is no relationship between the two variables (see graphs below).

  16. Graphs of Relationships Between Two Continuous Variables β1>0 β1<0 β1=0

  17. Line of Best Fit A line of best fit is a straight line that best represents your data on a scatter plot. It has the same form as the equation of a straight line from elementary math class: y = mx + b, where m = slope and b = y-intercept. The residual is r = y - ŷ, with E(r) = 0, where y is the observed response and ŷ is the predicted response (more on residuals later).

  18. Regression Assumptions • Linearity between the dependent and independent variable(s). • Observations are independent • Based on how the data is collected. • Check by plotting residuals vs the order in which the data was collected. • Constant variance of error terms. • Check using a residual plot (plot residuals vs. ŷ, the fitted values) • The error terms are normally distributed. • Check by making a histogram or normal quantile plot of the residuals.

  19. Example 1 Consider data on 15 American women collected by a researcher, as follows. We can fit a model of the form Weight = β0 + β1Age + ϵ to the data.
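  In R (one of the packages listed on slide 4), this fit is a one-liner. A minimal sketch, assuming the slide's table has been entered into a data frame called women_df with columns Age and Weight (the name women_df is ours, not the course's):

      # Assumed: women_df is a data frame holding the slide's 15 observations,
      # with numeric columns Age and Weight.
      fit <- lm(Weight ~ Age, data = women_df)  # fit Weight = b0 + b1*Age
      summary(fit)                              # coefficients, p-values, R-squared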

  20. Scatter Plot of Weight vs Age (with the line of best fit)

  21. Model Estimation and Result The slope estimate is b1 = r·(sy/sx) and the intercept estimate is b0 = ȳ - b1·x̄. The estimated regression line is Weight = -87.52 + 3.45Age. Can you interpret these results?
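  The same estimates can be computed by hand from these formulas; a sketch, reusing the assumed women_df data frame:

      # Slope: b1 = r * (sy/sx); intercept: b0 = mean(y) - b1 * mean(x).
      b1 <- cor(women_df$Age, women_df$Weight) *
            sd(women_df$Weight) / sd(women_df$Age)
      b0 <- mean(women_df$Weight) - b1 * mean(women_df$Age)
      c(b0, b1)  # should reproduce -87.52 and 3.45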

  22. Description/Interpretation The above results can be interpreted as follows: - The Sig. (p-value) of 0.000 indicates that the model is a good fit to the data: Age makes a significant contribution to the average variability in the weights of the women. - The value of β1 (slope = 3.45) indicates a positive relationship between weight and age. The slope coefficient indicates that for every additional unit increase in age, we can expect weight to increase by an average of 3.45 kilograms. - R indicates that there is a high association between the DV and the predictor variable. The R-Squared value of 0.991 means that about 99% of the average variability in the weight of the women is explained by the model.

  23. Prediction Using the regression model above, we can predict the weight of a woman who is 75 years old: Weight = -87.52 + 3.45(75) = 171.23 ≈ 171. Exercise: Using the SLR model above, predict the weight of a woman whose age is 82. Ans: approximately 195 kg
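  In R, prediction uses predict() with the new ages placed in a data frame whose column name matches the predictor; a sketch with the fit from Example 1:

      predict(fit, newdata = data.frame(Age = 75))  # about 171
      predict(fit, newdata = data.frame(Age = 82))  # about 195 (the exercise answer)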

  24. MULTIPLE REGRESSION

  25. Frequently there are many predictors that we want to use simultaneously • Multiple linear regression model: y = β0 + β1x1 + β2x2 + … + βkxk + ϵ • Similar to simple linear regression, except now there is more than one explanatory variable. • In this situation each βj represents the partial slope of the predictor xj. • It can be interpreted as “the mean change in the response variable for one unit of change in that predictor variable, while holding the other predictors in the model constant.”

  26. Example 2: Suppose the researcher in Example 1 above is interested in knowing whether height also contributes to changes in weight:

  27. Step 1: Scatterplots

  28. Model Estimation with SPSS

  29. Multiple Regression • The new model is therefore written as Weight = β0 + β1Age + β2Height + ϵ • So the fit is: Weight = -81.53 + 3.46Age - 1.11Height
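  A sketch of the same two-predictor fit in R, assuming the women_df data frame also has a Height column:

      fit2 <- lm(Weight ~ Age + Height, data = women_df)
      summary(fit2)  # check Height's t-test: a high p-value means no significant contribution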

  30. Model Interpretation The result of the model estimation above shows that height does not contribute to the average variability in the weight of the women. The high p-value shows that it is not statistically significant (changes in height are not associated with changes in weight); no statistically significant linear dependence of the mean of weight on height was detected. Note that the values of R-Squared and Adjusted R-Squared did not decrease when we added the additional independent variable. For every one-unit increase in age, average weight increases at a rate of 3.46 units, while holding height constant.

  31. Model Diagnostics and Residual Analysis A residual measures the variability in the response variable not explained by the regression model. Analysis of the residuals is an effective way to discover violations of the model assumptions. Plotting residuals is a very effective way to investigate how well the regression model fits the data. A residual plot is used to check the assumption of constant variance and to check model fit (can the model be trusted?).

  32. Diagnostics: Residual Plot The residuals should fall in a symmetrical pattern and have a constant spread throughout their range. Good residual plot: no pattern.

  33. We Can Plot: • Residual vs Independent Variable(s) • Residual vs Predicted Values • Residual vs Order of the Data • Residual Lag Plot • Histogram of Residuals • Standardized Residual vs Standardized Predicted Value, etc. (see the sketch below)
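  All of these plots are a line or two in base R; a sketch using the fits above:

      res <- resid(fit2)
      plot(women_df$Age, res)                     # residual vs an independent variable
      plot(fitted(fit2), res); abline(h = 0)      # residual vs predicted values
      plot(res, type = "b")                       # residual vs order of the data
      lag.plot(res)                               # residual lag plot
      hist(res)                                   # histogram of residuals
      plot(as.numeric(scale(fitted(fit2))),       # standardized residual vs
           rstandard(fit2))                       #   standardized predicted value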

  34. Residuals Column 3 in the table below shows the residuals of the regression model Weight = -87.52 + 3.45Age. A residual is the deviation between the data and the fit (Actual Y - Predicted Y).
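  The residual column can be reproduced directly from the fitted model; a sketch with the Example 1 fit:

      # Residual = Actual Y - Predicted Y, matching column 3 of the slide's table.
      data.frame(Actual    = women_df$Weight,
                 Predicted = fitted(fit),
                 Residual  = resid(fit))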

  35. Residual Diagnostics: Very Important! Left: Residuals show non-constant variance. Right: Residuals show non-linear pattern.

  36. Look at the Figures Below, What Do You Think?

  37. Residual Plot

  38. Residual Plots

  39. What if the Assumptions Are Not Met? • Linearity: • Transform the dependent variable (see next slide) • Normality: • Transform the data (also when outliers are present) • Or use robust regression, where normality is not required (see the sketch below) • Increase the sample size, if possible • Homogeneity: • Try transforming the data
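  For the robust-regression option mentioned above, one common choice in R is rlm() from the MASS package (distributed with standard R installations); a sketch, not the course's prescribed method:

      library(MASS)
      fit_rob <- rlm(Weight ~ Age, data = women_df)  # M-estimation downweights outliers
      summary(fit_rob)                               # note: rlm reports no p-values by default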

  40. Some Tips on Transformation • log(Y): used if Y is positively skewed and has positive values. • √Y: used if Y has a Poisson distribution (i.e., is count data). • 1/Y: used if the variance of Y is proportional to the 4th power of E(Y). • sin⁻¹(√Y): used if Y is a proportion or rate. A sketch of these in R follows.
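  These transformations can be applied directly inside lm(); a sketch (the proportion column P is hypothetical, included only to illustrate the arcsine case):

      lm(log(Weight)   ~ Age, data = women_df)  # log for positive skew
      lm(sqrt(Weight)  ~ Age, data = women_df)  # square root for Poisson-like counts
      lm(I(1/Weight)   ~ Age, data = women_df)  # reciprocal transform
      lm(asin(sqrt(P)) ~ Age, data = women_df)  # arcsine for a proportion P (hypothetical)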

  41. Multicollinearity A common problem in multiple regression that develops when one or more of the independent variables is highly correlated with one or more of the other independent variables. How the explanatory variables relate to each other is fundamental to understanding their relationship with the response variable. Usually, when you see estimated beta weights larger than 1 in a regression analysis, consider the possibility of multicollinearity. Multicollinearity can be mild or severe (depending on how high the correlations are, or whether VIFs are above 10).

  42. Effects on P-Values You will get different p-values for the same variables in different regressions as you add/remove other explanatory variables. A variable can be significantly related to Y by itself, but not be significantly related to Y after accounting for several other variables. In that case, the variable is viewed as redundant. If all the X variables are correlated, it is possible ALL the variables may be insignificant, even if each is significantly related to Y by itself.

  43. Multicollinearity: Effect on Coefficients Similarly, the coefficients of individual explanatory variables can change depending on which other explanatory variables are present. They may change signs sporadically, and may be excessively large when there is multicollinearity.

  44. Multicollinearity Isn’t Tragic In most practical datasets there will be some degree of multicollinearity. If the degree of multicollinearity isn’t too bad (more on its assessment in the next slides) then it can be safely ignored. If you have serious multicollinearity, then your goals must be considered and there are various options. In what follows, we first focus on how to assess multicollinearity, then what to do about it should it be found to be a problem.

  45. Assessing Multicollinearity: Two Methods • There is typically some degree of multicollinearity in most experiments. • We discuss two methods for assessing multicollinearity in this course: (1) Correlation matrix (2) Variance Inflation Factor (VIF)

  46. Correlation Matrices A correlation matrix is simply a table indicating the correlations between each pair of explanatory variables. If you haven’t seen it before, the correlation between two variables is simply the square root of R2, combined with a sign indicating a positive or negative association. If you see values close to 1 or -1 that indicates variables are strongly associated with each other and you may have multicollinearity problems. If you see many correlations all greater in absolute value than 0.7, you may also have problems with your model.

  47. Correlation Matrix A cursory look at the correlation matrix of the independent variables shows whether there is multicollinearity in our experiment.

  48. Correlation Matrix Involving the DV Can help give a preliminary idea of the bivariate association of the dependent variable with the independent variables.
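  Both versions of the matrix come from cor(); a sketch with the assumed women_df data frame:

      cor(women_df[, c("Age", "Height")])            # predictors only (slide 47)
      cor(women_df[, c("Weight", "Age", "Height")])  # including the DV (slide 48)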

  49. Disadvantages of Using Correlation Matrices Correlation matrices only work with two variables at a time. Thus, we can only see pairwise relationships. If a more complicated relationship exists, the correlation matrix won’t find it. Multicollinearity is not a bivariate problem. Use VIFs!

  50. Variance Inflation Factors (VIFs) Variance inflation factors measure the relationship of all the variables simultaneously, thus they avoid the “two at a time” disadvantage of correlation matrices. They are harder to explain. There is a VIF for each variable. Loosely, the VIF is based on regressing each variable on the remaining variables. If the remaining variables can explain the variable of interest, then that variable has a high VIF.
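  In R, VIFs are usually computed with vif() from the car package (a real, widely used package); a sketch using the multiple-regression fit above:

      library(car)   # install.packages("car") if needed
      vif(fit2)      # one VIF per predictor; values above 10 signal severe multicollinearity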
