
Chapter 13 Multiple Regression Analysis


Presentation Transcript


  1. Chapter 13 Multiple Regression Analysis General Objectives: In this chapter, we extend the concepts of linear regression and correlation to a situation where the mean value of a random variable is related to several independent variables—x1, x2, x3, …, xk—in models that are more flexible than the straight-line model of Chapter 12. With multiple regression analysis, we can use the information provided by the independent variables to fit various types of models to the sample data, to evaluate the usefulness of these models, and finally to estimate the average value of y or predict the actual value of y for given values of x1, x2, x3, …, xk. ©1998 Brooks/Cole Publishing/ITP

  2. Specific Topics 1. The general linear model and assumptions 2. The method of least squares 3. Analysis of variance for multiple regression 4. Sequential sums of squares 5. The analysis of variance F test 6. The coefficient of determination, R2 7. Testing the partial regression coefficients 8. Adjusted R2 9. Residual plots ©1998 Brooks/Cole Publishing/ITP

  3. 10. Estimation and prediction using the regression model 11. Polynomial regression model 12. Qualitative variables in a regression model 13. Testing sets of regression coefficients 14. Stepwise regression analysis 15. Causality and multicollinearity ©1998 Brooks/Cole Publishing/ITP

  4. 13.1 Introduction • Multiple linear regression is an extension of the methodology of simple linear regression to allow for more than one independent variable. • For example, a company’s regional sales y of a product might be related to three factors: - x1—the amount spent on television advertising - x2—the amount spent on newspaper advertising - x3—the number of sales representatives assigned to the region. • Several questions arise: - How well does the model fit? - How strong is the relationship between y and the predictor variables? - Have any important assumptions been violated? - How good are estimates and predictions? ©1998 Brooks/Cole Publishing/ITP

  5. 13.2 The Multiple Regression Model • The general linear model for a multiple regression analysis describes a particular response y using the model given next. General Linear Model and Assumptions: y = β0 + β1x1 + β2x2 + … + βkxk + ε, where - y is the response variable that you want to predict - β0, β1, β2, …, βk are unknown constants - x1, x2, …, xk are independent predictor variables that are measured without error - ε is the random error, which allows each response to deviate from the average value of y by the amount ε. ©1998 Brooks/Cole Publishing/ITP

  6. You can assume that the values of ε (1) are independent; (2) have a mean of 0 and a common variance σ² for any set x1, x2, …, xk; and (3) are normally distributed. When these assumptions about ε are met, the average value of y for a given set of values x1, x2, …, xk is equal to the deterministic part of the model: E(y) = β0 + β1x1 + β2x2 + … + βkxk. Example 13.1 Suppose you want to relate a random variable y to two independent variables x1 and x2. The multiple regression model is y = β0 + β1x1 + β2x2 + ε, with the mean value of y given as E(y) = β0 + β1x1 + β2x2. This equation is a three-dimensional extension of the line of means from Chapter 12 and traces a plane in three-dimensional space. ©1998 Brooks/Cole Publishing/ITP

  7. The constant β0 is called the intercept—the average value of y when x1 and x2 are both 0. The coefficients β1 and β2 are called the partial slopes or partial regression coefficients. The partial slope βi (for i = 1 or 2) measures the change in y for a one-unit change in xi when all other independent variables are held constant. The value of the partial regression coefficient—say, β1—with x1 and x2 in the model is generally not the same as the slope when you fit a line with x1 alone. These coefficients are the unknown constants, which must be estimated using sample data to obtain the prediction equation. ©1998 Brooks/Cole Publishing/ITP

  8. Figure 13.1 Plane of means for Example 13.1 ©1998 Brooks/Cole Publishing/ITP

  9. 13.3 A Multiple Regression Analysis • A multiple regression analysis involves estimation, testing, and diagnostic procedures designed to fit the multiple regression model to a set of data. The Method of Least Squares The prediction equation ŷ = b0 + b1x1 + b2x2 + … + bkxk is the equation that minimizes SSE, the sum of squares of the deviations of the observed values y from the predicted values ŷ. These values are calculated using a regression program. See Example 13.2. ©1998 Brooks/Cole Publishing/ITP
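A minimal sketch of the least-squares computation described above, using NumPy on simulated stand-in data (the variable names, sample size, and coefficients are illustrative assumptions, not the textbook's data):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 15
x1 = rng.uniform(10, 30, n)                 # e.g., living area (hundreds of sq ft)
x2 = rng.integers(1, 4, n).astype(float)    # e.g., number of bedrooms
X = np.column_stack([np.ones(n), x1, x2])   # leading column of 1s for the intercept b0
y = 20 + 5 * x1 + 8 * x2 + rng.normal(0, 3, n)

# b holds the least-squares estimates b0, b1, b2 that minimize SSE
b, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ b
SSE = float(np.sum((y - y_hat) ** 2))       # sum of squared deviations y - y_hat
print("estimates:", b.round(2), " SSE:", round(SSE, 2))
```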

  10. Example 13.2 How do real estate agents decide on the asking price for a newly listed home? A computer database in a small community contains the listed selling price y (in thousands of dollars), the amount of living area x1 (in hundreds of square feet), and the number of floors x2, bedrooms x3, and bathrooms x4, for n = 15 randomly selected residences currently on the market. The data are shown in Table 13.1. The multiple regression model is y = β0 + β1x1 + β2x2 + β3x3 + β4x4 + ε, which is fit using the Minitab software package. You can find instructions for generating this output in the section “About Minitab” at the end of this chapter. The first portion of the regression output is shown in Figure 13.2. You will find the fitted regression equation in the first two lines of the printout. ©1998 Brooks/Cole Publishing/ITP
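The chapter produces this output with Minitab; a rough Python analogue using statsmodels is sketched below on simulated stand-in data (the column names and coefficients are hypothetical, not the Table 13.1 values):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 15
df = pd.DataFrame({
    "sqft": rng.uniform(10, 30, n),        # living area, hundreds of square feet
    "floors": rng.integers(1, 3, n),
    "bedrooms": rng.integers(2, 5, n),
    "baths": rng.integers(1, 4, n),
})
df["price"] = (20 + 6 * df.sqft + 10 * df.floors + 5 * df.bedrooms
               + 8 * df.baths + rng.normal(0, 5, n))

fit = smf.ols("price ~ sqft + floors + bedrooms + baths", data=df).fit()
print(fit.params)     # fitted b0, b1, ..., b4 (the prediction equation)
print(fit.summary())  # coefficient t tests, R-sq, and the ANOVA quantities
```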

  11. The Analysis of Variance for Multiple Regression The analysis of variance divides the total variation in the response variable y into two portions: - SSR (sum of squares for regression) measures the amount of variation explained by using the regression equation. - SSE (sum of squares for error) measures the residual variation in the data that is not explained by the independent variables. • The values must satisfy the equation Total SS = SSR + SSE. • There are (n - 1) total degrees of freedom. • There are k regression degrees of freedom. • There are (n - 1) - k degrees of freedom for error. • MS = SS/df ©1998 Brooks/Cole Publishing/ITP
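A small sketch of this ANOVA bookkeeping in NumPy, again on simulated stand-in data; the sample size, number of predictors, and coefficients are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 15, 2
X = np.column_stack([np.ones(n), rng.uniform(0, 10, (n, k))])
y = 3 + X[:, 1] + 2 * X[:, 2] + rng.normal(0, 1, n)

b, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ b
total_ss = np.sum((y - y.mean()) ** 2)    # Total SS, with n - 1 df
SSE = np.sum((y - y_hat) ** 2)            # error SS, with (n - 1) - k df
SSR = total_ss - SSE                      # regression SS, with k df
MSR, MSE = SSR / k, SSE / (n - 1 - k)     # MS = SS / df
print(round(SSR, 2), round(SSE, 2), round(MSR, 2), round(MSE, 2))
```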

  12. The ANOVA table for the real estate data in Table 13.1 is shown in the second portion of the printout in Figure 13.3: • The conditional or sequential sums of squares each account for one of the k = 4 regression degrees of freedom. Testing the Usefulness of the Regression Model In multiple regression, there is more than one partial slope—the partial regression coefficients. The t and F tests are no longer equivalent. ©1998 Brooks/Cole Publishing/ITP

  13. The Analysis of Variance F Test Is the regression equation that uses the information provided by the predictor variables x1, x2, …, xk substantially better than the simple predictor ȳ that does not rely on any of the x-values? - This question is answered using an overall F test with the hypotheses H0: β1 = β2 = … = βk = 0 versus Ha: At least one of β1, β2, …, βk is not 0. - The test statistic is found in the ANOVA table (Figure 13.3) as F = MSR/MSE. • The Coefficient of Determination, R2 - The regression printout provides a statistical measure of the strength of the model in the coefficient of determination. - The coefficient of determination is sometimes called multiple R2 and is found in the first line of Figure 13.3. ©1998 Brooks/Cole Publishing/ITP
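A hedged sketch of the overall F test, with the p-value taken from SciPy's F distribution; the mean squares, n, and k below are placeholder numbers rather than the Figure 13.3 output:

```python
from scipy import stats

n, k = 15, 4
MSR, MSE = 1500.0, 6.0                          # illustrative mean squares
F = MSR / MSE                                   # tests H0: b1 = ... = bk = 0
p_value = stats.f.sf(F, dfn=k, dfd=n - k - 1)   # upper-tail area of the F distribution
print(round(F, 2), p_value)
```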

  14. - The F statistic is related to R2 by the formula F = (R2/k) / [(1 - R2)/(n - k - 1)], so that when R2 is large, F is large, and vice versa. • Interpreting the Results of a Significant Regression Testing the Significance of the Partial Regression Coefficients - The individual t tests in the first section of the regression printout are designed to test the hypotheses H0: βi = 0 versus Ha: βi ≠ 0 for each of the partial regression coefficients, given that the other predictor variables are already in the model. - These tests are based on the Student’s t statistic t = bi/SE(bi), which has df = (n - k - 1) degrees of freedom. ©1998 Brooks/Cole Publishing/ITP

  15. The Adjusted Value of R2 - An alternative measure of the strength of the regression model is adjusted for degrees of freedom by using mean squares rather than sums of squares: R2(adj) = [1 - MSE/(Total SS/(n - 1))] × 100%. - For the real estate data in Figure 13.3, R2(adj) = 96.0%, which is provided right next to “R-Sq(adj).” ©1998 Brooks/Cole Publishing/ITP
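A tiny sketch computing R2, its adjusted value, and the F statistic derived from R2 (the relation on the previous slide); the sums of squares, n, and k below are placeholder numbers, not the real estate output:

```python
n, k = 15, 4
total_ss, SSE = 16000.0, 600.0                            # illustrative sums of squares

R2 = 1 - SSE / total_ss                                   # coefficient of determination
R2_adj = 1 - (SSE / (n - k - 1)) / (total_ss / (n - 1))   # uses mean squares, not SS
F = (R2 / k) / ((1 - R2) / (n - k - 1))                   # large R2 implies large F
print(round(R2, 3), round(R2_adj, 3), round(F, 1))
```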

  16. - The value of R2(adj) = 96.0% can be said to represent the percentage of variation in the response y explained by the independent variables, corrected for degrees of freedom. - The adjusted value of R2 is mainly used to compare two or more regression models that use different numbers of independent predictor variables. • Checking the Regression Assumptions - You should look at the computer-generated residual plots to make sure that all the regression assumptions are valid. - The normal probability plot and the plot of residuals versus fit are shown in Figure 13.5 for the real estate data. ©1998 Brooks/Cole Publishing/ITP

  17. Figure 13.5 Minitab diagnostic graphs ©1998 Brooks/Cole Publishing/ITP

  18. Using the Regression Model for Estimation and Prediction The model can be used for: - Estimating the average value of y—E(y)—for given values of x1, x2, …, xk - Predicting a particular value of y for given values of x1, x2, …, xk Remember the prediction interval is always wider than the confidence interval. The printout in Figure 13.6 shows confidence and prediction intervals. ©1998 Brooks/Cole Publishing/ITP
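A rough statsmodels analogue of the Minitab intervals in Figure 13.6, on simulated stand-in data; the variable names and the x-values being predicted at are hypothetical:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
df = pd.DataFrame({"x1": rng.uniform(0, 10, 15), "x2": rng.uniform(0, 5, 15)})
df["y"] = 2 + 3 * df.x1 + 4 * df.x2 + rng.normal(0, 1, 15)
fit = smf.ols("y ~ x1 + x2", data=df).fit()

new = pd.DataFrame({"x1": [6.0], "x2": [2.5]})   # the given values of x1, x2
pred = fit.get_prediction(new)
print(pred.conf_int())             # confidence interval for the mean E(y)
print(pred.conf_int(obs=True))     # prediction interval for a single y; always wider
```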

  19. 13.4 A Polynomial Regression Model • When you perform multiple regression analysis, use a step-by-step approach: 1. Obtain the fitted prediction model. 2. Use the analysis of variance F test and R2 to determine how well the model fits the data. 3. Check the t tests for the partial regression coefficients to see which ones are contributing significant information in the presence of the others. 4. If you choose to compare several different models, use R2(adj) to compare their effectiveness. 5. Use computer-generated residual plots to check for violation of the regression assumptions. ©1998 Brooks/Cole Publishing/ITP

  20. The quadratic model y = β0 + β1x + β2x^2 + ε is an example of a second-order model because it involves a term whose exponents sum to 2 (in this case, x^2). • It is also an example of a polynomial model—a model that takes the form y = β0 + β1x + β2x^2 + … + βkx^k + ε. ©1998 Brooks/Cole Publishing/ITP
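A minimal sketch of fitting the quadratic (second-order polynomial) model with statsmodels; the data are simulated stand-ins and the coefficients are illustrative:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
df = pd.DataFrame({"x": np.linspace(1, 10, 20)})
df["y"] = 5 + 2 * df.x + 0.8 * df.x ** 2 + rng.normal(0, 2, 20)

quad = smf.ols("y ~ x + I(x**2)", data=df).fit()   # I(...) adds the x-squared term
print(quad.params)                                 # estimates of b0, b1, b2
```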

  21. 13.5 Using Quantitative and Qualitative Predictor Variables in a Regression Model • The response variable y must be quantitative. • Each independent predictor variable can be either a quantitative or a qualitative variable, whose levels represent qualities or characteristics and can only be categorized. • We can allow a combination of different variables to be in the model, and we can allow the variables to interact. • A quantitative variable x can be entered as a linear term, x, or to some higher power such as x^2 or x^3. • For two quantitative variables x1 and x2, you could use the first-order model y = β0 + β1x1 + β2x2 + ε. ©1998 Brooks/Cole Publishing/ITP

  22. We can add an interaction term and create a second-order model: y = β0 + β1x1 + β2x2 + β3x1x2 + ε. • Qualitative predictor variables are entered into a regression model through dummy or indicator variables. • If each employee included in a study belongs to one of three ethnic groups—say, A, B, or C—you can enter the qualitative variable “ethnicity” into your model using two dummy variables: x1 = 1 if group B, 0 if not; x2 = 1 if group C, 0 if not. ©1998 Brooks/Cole Publishing/ITP

  23. The model E(y) = β0 + β1x1 + β2x2 allows a different average response for each group. • β1 measures the difference in the average responses between groups B and A, while β2 measures the difference between groups C and A. • When a qualitative variable involves k categories, (k - 1) dummy variables should be added to the regression model. ©1998 Brooks/Cole Publishing/ITP
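A hedged sketch of entering a three-level qualitative variable through k - 1 = 2 dummy variables with pandas; the data frame and column names are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({"ethnicity": ["A", "B", "C", "B", "A", "C"]})
# drop_first=True keeps only two indicators, so group A is the baseline category
dummies = pd.get_dummies(df["ethnicity"], prefix="grp", drop_first=True)
print(dummies)   # grp_B = 1 for group B, grp_C = 1 for group C, both 0 for group A
```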

  24. 13.6 Testing Sets of Regression Coefficients • Suppose the demand y may be related to five independent variables, but that the cost of measuring three of them is very high. • If it could be shown that these three contribute little or no information, they can be eliminated. • You want to test the null hypothesis H0: β3 = β4 = β5 = 0—that is, the independent variables x3, x4, and x5 contribute no information for the prediction of y—versus the alternative hypothesis Ha: At least one of the parameters β3, β4, or β5 differs from 0—that is, at least one of the variables x3, x4, or x5 contributes information for the prediction of y. ©1998 Brooks/Cole Publishing/ITP

  25. To explain how to test a hypothesis concerning a set of model parameters, we define two models: Model One (reduced model): y = β0 + β1x1 + … + βrxr + ε, which uses only the first r predictor variables; Model Two (complete model): y = β0 + β1x1 + … + βrxr + βr+1xr+1 + … + βkxk + ε, which contains the terms in Model One plus the additional terms xr+1, …, xk. • The test of the null hypothesis H0: βr+1 = βr+2 = … = βk = 0 versus the alternative hypothesis Ha: At least one of the parameters βr+1, βr+2, …, βk differs from 0 ©1998 Brooks/Cole Publishing/ITP

  26. uses the test statistic F = [(SSE1 - SSE2)/(k - r)] / MSE2, where SSE1 and SSE2 are the error sums of squares for the reduced and complete models and MSE2 is the mean square for error of the complete model; F is based on df1 = (k - r) and df2 = n - (k + 1). • The rejection region for the test is identical to the rejection region for all of the analysis of variance F tests, namely F > Fα.
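A sketch of this nested-model (partial) F test using statsmodels' anova_lm on simulated data; the variable names, sample size, and the fact that x3, x4, x5 carry no information are illustrative assumptions:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(5)
n = 40
df = pd.DataFrame(rng.normal(size=(n, 5)), columns=["x1", "x2", "x3", "x4", "x5"])
df["y"] = 1 + 2 * df.x1 + 3 * df.x2 + rng.normal(0, 1, n)   # x3, x4, x5 add nothing

reduced = smf.ols("y ~ x1 + x2", data=df).fit()                   # Model One
complete = smf.ols("y ~ x1 + x2 + x3 + x4 + x5", data=df).fit()   # Model Two
print(anova_lm(reduced, complete))   # F and p-value for H0: b3 = b4 = b5 = 0
```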

  27. 13.7 Interpreting Residual Plots • The variance of some types of data changes as the mean changes: - Poisson data exhibit variation that increases with the mean. - Binomial data exhibit variation that increases for values of p from 0 to .5, and then decreases for values of p from .5 to 1.0. • Residual plots for these types of data have a pattern similar to that shown in Figure 13.16. ©1998 Brooks/Cole Publishing/ITP

  28. Figure 13.16 Plots of residuals against ŷ ©1998 Brooks/Cole Publishing/ITP

  29. If the range of the residuals increases as ŷ increases and you know that the data are measurements of Poisson variables, you can stabilize the variance of the response by running the regression analysis on √y. • If the percentages are calculated from binomial data, you can use the arcsin transformation, arcsin √p. • If E(y) and a single independent variable x are linearly related, and you fit a straight line to the data, then the observed y values should vary in a random manner about ŷ, and a plot of the residuals against x will appear as shown in Figure 13.17. • If you had incorrectly used a linear model to fit the data in Example 13.3, the residual plot in Figure 13.18 would show that the unexplained variation exhibits a curved pattern, which suggests that there is a quadratic effect that has not been included in the model. ©1998 Brooks/Cole Publishing/ITP
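A tiny sketch of the two variance-stabilizing transformations mentioned above, applied to made-up Poisson counts y and binomial proportions p:

```python
import numpy as np

y = np.array([3, 7, 12, 20, 31], dtype=float)   # Poisson-like counts
p = np.array([0.05, 0.20, 0.50, 0.80, 0.95])    # binomial proportions

y_stab = np.sqrt(y)              # regress on sqrt(y) when variance grows with the mean
p_stab = np.arcsin(np.sqrt(p))   # arcsin transformation for binomial percentages
print(y_stab.round(3), p_stab.round(3))
```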

  30. Figure 13.17 Residual plot when the model provides a good approximation to reality ©1998 Brooks/Cole Publishing/ITP

  31. Figure 13.18 Residual plot for the linear fit of store size and productivity data in Example 13.3 ©1998 Brooks/Cole Publishing/ITP

  32. For the data in Example 13.6, the residuals of a linear regression of salary with years of experience x1 would show one distinct set of positive residuals corresponding to the men and a set of negative residuals corresponding to the women. See Figure 13.19: ©1998 Brooks/Cole Publishing/ITP

  33. 13.8 Stepwise Regression Analysis • Try to list all the variables that might affect a college freshman’s GPA: - Grades in high school courses, high school GPA, SAT score, ACT score - Major, number of units carried, number of courses taken - Work schedule, marital status, commute or live on campus • A stepwise regression analysis fits a variety of models to the data, adding a variable when it is significant in the presence of the variables already in the model and deleting a variable when it becomes nonsignificant. ©1998 Brooks/Cole Publishing/ITP

  34. Once the program has performed a sufficient number of iterations, so that no additional variable is significant when added to the model and no variable already in the model is nonsignificant and needs to be removed, the procedure stops. • These programs always fit first-order models and are not helpful in detecting curvature or interaction in the data. ©1998 Brooks/Cole Publishing/ITP
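A rough sketch of the "adding" half of this procedure (forward selection only, a simplified version of full stepwise regression): at each step the candidate whose t test is most significant is added, and the loop stops when nothing clears the cutoff. The data, column names, and the 0.05 entry threshold are illustrative assumptions:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(6)
n = 60
df = pd.DataFrame(rng.normal(size=(n, 4)), columns=["x1", "x2", "x3", "x4"])
df["y"] = 2 + 3 * df.x1 + 1.5 * df.x3 + rng.normal(0, 1, n)

selected, candidates, alpha = [], ["x1", "x2", "x3", "x4"], 0.05
while candidates:
    pvals = {}
    for c in candidates:
        fit = smf.ols("y ~ " + " + ".join(selected + [c]), data=df).fit()
        pvals[c] = fit.pvalues[c]            # t-test p-value for the newly added term
    best = min(pvals, key=pvals.get)
    if pvals[best] > alpha:                  # no remaining candidate is significant
        break
    selected.append(best)
    candidates.remove(best)
print("selected predictors:", selected)
```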

  35. 13.9 Misinterpreting a Regression Analysis • A second-order model in the variables might provide a very good fit to the data when a first-order model appears to be completely useless in describing the response variable y. • Causality Be careful not to deduce a causal relationship between a response y and a variable x. • Multicollinearity Neither the size of a regression coefficient nor its t-value indicates the importance of the variable as a contributor of information. This may be because two or more of the predictor variables are highly correlated with one another; this is called multicollinearity. ©1998 Brooks/Cole Publishing/ITP

  36. Multicollinearity can have these effects on the analysis: - The estimated regression coefficients will have large standard errors, causing imprecision in confidence and prediction intervals. - Adding or deleting a predictor variable may cause significant changes in the values of the other regression coefficients. • How can you tell whether a regression analysis exhibits multicollinearity? - The value of R2 is large, indicating a good fit, but the individual t-tests are nonsignificant. - The signs of the regression coefficients are contrary to what you would intuitively expect the contributions of those variables to be. - A matrix of correlations, generated by the computer, shows you which predictor variables are highly correlated with each other and with the response y. ©1998 Brooks/Cole Publishing/ITP

  37. - Consider Figure 13.20, the matrix of correlations generated for the real estate data from Example 13.2. The last three columns of the matrix show significant correlations between all but one pair of predictor variables: ©1998 Brooks/Cole Publishing/ITP
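A sketch of the multicollinearity diagnostics described above: a correlation matrix of the predictors plus variance inflation factors, on simulated stand-in data in which x1 and x2 are deliberately made nearly collinear (all names and values are hypothetical):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(7)
n = 30
x1 = rng.normal(size=n)
df = pd.DataFrame({"x1": x1,
                   "x2": x1 + rng.normal(0, 0.1, n),   # nearly collinear with x1
                   "x3": rng.normal(size=n)})

print(df.corr().round(2))              # matrix of pairwise correlations

X = sm.add_constant(df)
vif = {col: variance_inflation_factor(X.values, i)
       for i, col in enumerate(X.columns) if col != "const"}
print(vif)                             # a VIF well above 10 is a common warning sign
```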

  38. 13.10 Steps To Follow When Building a Multiple Regression Model • A step-by-step procedure for developing a multiple regression model is as follows: 1. Select the predictor variables to be included in the model. 2. Write a model using the selected predictor variables. 3. Use the analysis of variance F test and R2 to determine how well the model fits the data. 4. Check the t tests for the partial regression coefficients to see which ones are contributing significant information in the presence of the others. 5. Use computer-generated residual plots to check for violation of the regression assumptions. ©1998 Brooks/Cole Publishing/ITP

  39. Key Concepts and Formulas I. The General Linear Model 1. y = β0 + β1x1 + β2x2 + … + βkxk + ε. 2. The random error ε has a normal distribution with mean 0 and variance σ². II. Method of Least Squares 1. Estimates b0, b1, …, bk for β0, β1, …, βk are chosen to minimize SSE, the sum of squared deviations about the regression line ŷ = b0 + b1x1 + b2x2 + … + bkxk. 2. Least-squares estimates are produced by computer. ©1998 Brooks/Cole Publishing/ITP

  40. III. Analysis of Variance 1. Total SS = SSR + SSE, where Total SS = Syy. The ANOVA table is produced by computer. 2. Best estimate of σ² is s² = MSE = SSE/(n - k - 1). IV. Testing, Estimation, and Prediction 1. A test for the significance of the regression, H0: β1 = β2 = … = βk = 0, can be implemented using the analysis of variance F test: F = MSR/MSE. ©1998 Brooks/Cole Publishing/ITP

  41. 2. The strength of the relationship between x and y can be measured using R2 = SSR/Total SS, which gets closer to 1 as the relationship gets stronger. 3. Use residual plots to check for nonnormality, inequality of variances, and an incorrectly fit model. 4. Significance tests for the partial regression coefficients can be performed using the Student’s t test: t = bi/SE(bi), with error df = n - k - 1. 5. Confidence intervals can be generated by computer to estimate the average value of y, E(y), for given values of x1, x2, …, xk. Computer-generated prediction intervals can be used to predict a particular observation y for given values of x1, x2, …, xk. For given x1, x2, …, xk, prediction intervals are always wider than confidence intervals. ©1998 Brooks/Cole Publishing/ITP

  42. V. Model Building 1. The number of terms in a regression model cannot exceed the number of observations in the data set and should be considerably less! 2. To account for a curvilinear effect in a quantitative variable, use a second-order polynomial model. For a cubic effect, use a third-order polynomial model. 3. To add a qualitative variable with k categories, use (k - 1) dummy or indicator variables. 4. There may be interactions between two qualitative variables or between a quantitative and a qualitative variable. Interaction terms are entered as βxixj. 5. Compare models using R2(adj). ©1998 Brooks/Cole Publishing/ITP
