Chapter 14

Chapter 14 Multiple Regression Models

Multiple Regression Models
• A general additive multiple regression model, which relates a dependent variable $y$ to $k$ predictor variables $x_1, x_2, \ldots, x_k$, is given by the model equation
  $y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + e$
• The random deviation $e$ is assumed to be normally distributed with mean value 0 and variance $\sigma^2$ for any particular values of $x_1, x_2, \ldots, x_k$. This implies that for fixed $x_1, x_2, \ldots, x_k$ values, $y$ has a normal distribution with variance $\sigma^2$ and
  (mean $y$ value for fixed $x_1, x_2, \ldots, x_k$ values) $= \alpha + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k$

Multiple Regression Models
• The $\beta_i$'s are called population regression coefficients; each $\beta_i$ can be interpreted as the true average change in $y$ when the predictor $x_i$ increases by 1 unit and the values of all the other predictors remain fixed.
• The deterministic portion $\alpha + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k$ is called the population regression function.
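The interpretation of a regression coefficient can be checked with a few lines of arithmetic. The sketch below uses made-up coefficient values (the names `alpha` and `betas` and their values are assumptions for illustration only) and shows that raising one predictor by 1 unit, with the other held fixed, changes the mean y value by exactly that predictor's coefficient.

```python
def mean_y(alpha, betas, xs):
    """Population regression function: alpha + beta_1*x_1 + ... + beta_k*x_k."""
    return alpha + sum(b * x for b, x in zip(betas, xs))

# Hypothetical coefficients, chosen only for illustration.
alpha = 2.0
betas = [0.5, -1.5]

base = mean_y(alpha, betas, [10.0, 3.0])
bumped = mean_y(alpha, betas, [11.0, 3.0])  # x1 up by 1 unit, x2 held fixed

# The change in mean y equals beta_1.
change = bumped - base
print(change)
```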

Polynomial Regression Models
• The kth-degree polynomial regression model
  $y = \alpha + \beta_1 x + \beta_2 x^2 + \cdots + \beta_k x^k + e$
  is a special case of the general multiple regression model with $x_1 = x, x_2 = x^2, \ldots, x_k = x^k$.
• The population regression function (mean value of $y$ for fixed values of the predictors) is $\alpha + \beta_1 x + \beta_2 x^2 + \cdots + \beta_k x^k$.
• The most important special case other than simple linear regression ($k = 1$) is the quadratic regression model $y = \alpha + \beta_1 x + \beta_2 x^2 + e$. This model replaces the line of mean values $\alpha + \beta x$ with a parabolic curve of mean values $\alpha + \beta_1 x + \beta_2 x^2$. If $\beta_2 > 0$, the curve opens upward, whereas if $\beta_2 < 0$, the curve opens downward.
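Because a polynomial model is just a multiple regression on the columns x and x squared, it can be fit with any least squares routine. The following is a minimal sketch using NumPy's `numpy.linalg.lstsq`; the data are hypothetical and noise-free, generated from a known quadratic so that the recovered coefficients can be checked.

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
# Hypothetical noise-free responses generated from y = 1 + 2x - 0.5x^2.
y = 1.0 + 2.0 * x - 0.5 * x**2

# Design matrix with predictors x1 = x and x2 = x^2, plus a column of ones
# for the intercept.
X = np.column_stack([np.ones_like(x), x, x**2])

coef, *_ = np.linalg.lstsq(X, y, rcond=None)
a, b1, b2 = coef

# The sign of the x^2 coefficient determines the direction of the parabola.
opens = "upward" if b2 > 0 else "downward"
print(a, b1, b2, opens)
```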

Interaction • If the change in the mean y value associated with a 1-unit increase in one independent variable depends on the value of a second independent variable, there is interaction between these two variables. When the variables are denoted by x1 and x2, such interaction can be modeled by including x1x2, the product of the variables that interact, as a predictor variable.
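A small numerical sketch can make the idea of interaction concrete. The coefficients below are hypothetical (chosen only for illustration); with a product term x1x2 in the mean function, the change in mean y for a 1-unit increase in x1 is no longer constant but depends on x2.

```python
def mean_y(x1, x2, alpha=1.0, b1=2.0, b2=3.0, b3=0.5):
    """Mean y with an interaction term b3 * x1 * x2 (hypothetical coefficients)."""
    return alpha + b1 * x1 + b2 * x2 + b3 * x1 * x2

def change_per_unit_x1(x2):
    """Change in mean y when x1 goes from 0 to 1 with x2 held fixed."""
    return mean_y(1.0, x2) - mean_y(0.0, x2)

slope_at_0 = change_per_unit_x1(0.0)  # b1 + b3 * 0
slope_at_4 = change_per_unit_x1(4.0)  # b1 + b3 * 4
print(slope_at_0, slope_at_4)
```

Setting `b3=0.0` in this sketch makes the two values equal, which is the no-interaction case.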

Qualitative Predictor Variables
• Up to now, we have only considered the inclusion of quantitative (numerical) predictor variables in a multiple regression model. Qualitative (categorical) predictors can also be included once their categories are coded numerically.
• Two types are very common:
• Dichotomous variable: one with just two possible categories, coded 0 and 1. Examples:
  • Gender {male, female}
  • Marriage status {married, not married}
• Ordinal variables: categorical variables that have a natural ordering. Examples:
  • Activity level {light, moderate, heavy}, coded respectively as 1, 2, and 3
  • Education level {none, elementary, secondary, college, graduate}, coded respectively as 1, 2, 3, 4, 5 (or, for that matter, any 5 consecutive integers)
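The coding step can be sketched as a simple lookup. The dictionaries and the `encode` helper below are hypothetical names introduced for illustration; which category receives 0 and which receives 1 is an arbitrary choice that only needs to be applied consistently.

```python
# Illustrative codings (assumed for this sketch).
GENDER = {"female": 0, "male": 1}                   # dichotomous: 0/1
ACTIVITY = {"light": 1, "moderate": 2, "heavy": 3}  # ordinal: consecutive integers

def encode(gender, activity):
    """Turn category labels into numeric predictor values."""
    return [GENDER[gender], ACTIVITY[activity]]

row = encode("male", "heavy")
print(row)
```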

Least Squares Estimates
• According to the principle of least squares, the fit of a particular estimated regression function $a + b_1 x_1 + b_2 x_2 + \cdots + b_k x_k$ to the observed data is measured by the sum of squared deviations between the observed $y$ values and the $y$ values predicted by the estimated function:
  $\sum [y - (a + b_1 x_1 + b_2 x_2 + \cdots + b_k x_k)]^2$
• The least squares estimates of the population coefficients $\alpha, \beta_1, \beta_2, \ldots, \beta_k$ are those values of $a, b_1, b_2, \ldots, b_k$ that make this sum of squared deviations as small as possible.
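The least squares principle can be illustrated numerically. The sketch below uses a small hypothetical data set with two predictors, obtains the estimates with NumPy's `numpy.linalg.lstsq`, and confirms that nudging one of the estimates away from its least squares value makes the sum of squared deviations larger. The helper name `sum_sq_dev` is an assumption introduced here.

```python
import numpy as np

# Hypothetical data: 6 observations on 2 predictors.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
y = np.array([3.1, 3.9, 7.2, 7.8, 11.1, 11.9])

# Column of ones for the intercept a, then the predictor columns.
X = np.column_stack([np.ones_like(x1), x1, x2])

def sum_sq_dev(coef):
    """Sum of squared deviations between observed and predicted y values."""
    return float(np.sum((y - X @ coef) ** 2))

estimates, *_ = np.linalg.lstsq(X, y, rcond=None)

ss_min = sum_sq_dev(estimates)
# Any other choice of coefficients gives a larger sum of squares.
ss_perturbed = sum_sq_dev(estimates + np.array([0.0, 0.1, 0.0]))
print(estimates, ss_min, ss_perturbed)
```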

Predicted Values & Residuals

Sums of Squares

Estimate for $\sigma^2$

Coefficient of Multiple Determination, $R^2$
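The quantities named on the preceding slides can be computed directly once predicted values are available. The sketch below assumes the usual definitions: residual = y minus predicted y, SSResid = sum of squared residuals, SSTo = sum of squared deviations of y from its mean, the estimate of the error variance = SSResid / (n - (k + 1)), and R squared = 1 - SSResid/SSTo. The data are hypothetical, with k = 1 predictor, and the fitted line 2.2 + 0.6x is the least squares line for these five points.

```python
# Hypothetical data with a single predictor (k = 1).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
k = 1
n = len(y)

# Predicted values from the least squares line for these data.
y_hat = [2.2 + 0.6 * xi for xi in x]

residuals = [yi - fi for yi, fi in zip(y, y_hat)]
ss_resid = sum(r * r for r in residuals)

y_bar = sum(y) / n
ss_to = sum((yi - y_bar) ** 2 for yi in y)

se_squared = ss_resid / (n - (k + 1))  # estimate of the error variance
r_squared = 1.0 - ss_resid / ss_to     # coefficient of multiple determination
print(ss_resid, ss_to, se_squared, r_squared)
```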

Adjusted $R^2$
Generally, a model with a large $R^2$ and a small $s_e$ is desirable. If a large number of variables (relative to the number of data points) is used, those conditions may be satisfied, but the model will be unrealistic and difficult to interpret.
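Adjusted R squared penalizes the use of many predictors relative to the sample size. A minimal sketch, assuming the usual formula adjusted R² = 1 - (1 - R²)(n - 1)/(n - (k + 1)), is shown below. The inputs are the R-Sq values, n = 40, and predictor counts reported in the Minitab output of the lung capacity example in this chapter; the function name `adjusted_r_squared` is introduced here for illustration.

```python
def adjusted_r_squared(r_squared, n, k):
    """Adjusted R^2 for a model with k predictors fit to n observations."""
    return 1.0 - (1.0 - r_squared) * (n - 1) / (n - (k + 1))

# Full model in the example: R-Sq = 84.3%, k = 8 predictors, n = 40 cases.
adj_full = adjusted_r_squared(0.843, n=40, k=8)

# Height-only model (stepwise step 1): R-Sq = 74.06%, k = 1, n = 40.
adj_height = adjusted_r_squared(0.7406, n=40, k=1)

print(round(100 * adj_full, 1), round(100 * adj_height, 2))
```

Both values agree with the R-Sq(adj) figures printed by Minitab (80.2% and 73.38%).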

F Distributions
F distributions are similar to chi-square distributions, but have two parameters: the numerator degrees of freedom, dfnum, and the denominator degrees of freedom, dfden.

The F Test for Model Utility
The regression sum of squares, denoted by SSReg, is defined by SSReg = SSTo - SSResid

The F Test for Model Utility

The F Test for Utility of the Model $y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + e$
• Null hypothesis: $H_0: \beta_1 = \beta_2 = \cdots = \beta_k = 0$
  (There is no useful linear relationship between $y$ and any of the predictors.)
• Alternative hypothesis: $H_a$: At least one among $\beta_1, \beta_2, \ldots, \beta_k$ is not zero
  (There is a useful linear relationship between $y$ and at least one of the predictors.)

The F Test for Utility of the Model $y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + e$

The F Test for Utility of the Model $y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + e$
The test is upper-tailed, and the information in the table of values that capture specified upper-tail F curve areas is used to obtain a bound or bounds on the P-value, using numerator df = $k$ and denominator df = $n - (k + 1)$.
Assumptions: For any particular combination of predictor variable values, the distribution of $e$, the random deviation, is normal with mean 0 and constant variance.
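With software, the upper-tail area can be computed directly instead of bounded from a table. The sketch below assumes the standard form of the model utility statistic written in terms of R squared, F = (R²/k) / ((1 - R²)/(n - (k + 1))), and uses `scipy.stats.f.sf` for the upper-tail area. The inputs (R-Sq = 83.2%, n = 40, k = 4) are taken from the four-predictor Minitab output in the lung capacity example in this chapter; the function name `model_utility_f` is introduced here for illustration.

```python
from scipy.stats import f

def model_utility_f(r_squared, n, k):
    """F statistic, degrees of freedom, and upper-tail P-value for model utility."""
    df_num = k
    df_den = n - (k + 1)
    f_stat = (r_squared / df_num) / ((1.0 - r_squared) / df_den)
    p_value = f.sf(f_stat, df_num, df_den)  # upper-tail area under the F curve
    return f_stat, df_num, df_den, p_value

f_stat, df_num, df_den, p_value = model_utility_f(0.832, n=40, k=4)
print(f_stat, df_num, df_den, p_value)
```

A very small P-value leads to rejecting H0 in favor of the conclusion that at least one predictor is useful.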

An Example
During a summer NSF program for teachers of statistics, the participants were asked to break into groups and develop a project similar in scope to what we would like our students to develop. One of these groups decided to study the lung capacity of adult humans, measured in liters. The sample of adults was not particularly easy to obtain on campus during the summer, so we “shanghaied” everyone who was willing to stand still, be measured, and be interviewed. We collected the data using borrowed equipment (an antique liquid displacement apparatus).

An Example
This group recorded a number of variables, including gender (m or f), age (yr), height (in), weight (lb), waist (in), chest girth (in), smoking (Y or N), and activity level (1 = light, 2 = medium, 3 = heavy), along with the lung capacity (liters).
The code for gender is 0 = Female, 1 = Male.
The code for smoking is 0 = No, 1 = Yes.
The data follow on the next slides.

An Example - The Data

Analysis - 1st with Minitab

Regression Analysis: Capacity versus Age, Height, ...

The regression equation is
Capacity = - 6.17 - 0.0140 Age + 0.149 Height + 0.00636 Weight - 0.0087 Chest - 0.0220 Waist + 0.343 Activity - 0.109 Smoke - 0.409 Gender

40 cases used 1 cases contain missing values

Predictor   Coef        SE Coef    T       P
Constant    -6.172      2.653      -2.33   0.027
Age         -0.014032   0.007000   -2.00   0.054
Height       0.14856    0.03503     4.24   0.000
Weight       0.006359   0.006094    1.04   0.305
Chest       -0.00867    0.05791    -0.15   0.882
Waist       -0.02197    0.04557    -0.48   0.633
Activity     0.3427     0.1282      2.67   0.012
Smoke       -0.1092     0.1491     -0.73   0.469
Gender      -0.4086     0.2757     -1.48   0.148

S = 0.4607   R-Sq = 84.3%   R-Sq(adj) = 80.2%
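Each value in the T column of the output is the coefficient divided by its standard error. The short sketch below recomputes a few of these ratios from the Coef and SE Coef columns above; the dictionary name `output_rows` is introduced here for illustration.

```python
# (Coef, SE Coef) pairs copied from the Minitab output above.
output_rows = {
    "Age": (-0.014032, 0.007000),
    "Height": (0.14856, 0.03503),
    "Activity": (0.3427, 0.1282),
}

# t ratio = estimated coefficient / its standard error.
t_ratios = {name: coef / se for name, (coef, se) in output_rows.items()}

for name, t in t_ratios.items():
    print(f"{name}: t = {t:.2f}")
```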

Analysis - 2nd with Minitab
Notice that the P-values on the right suggest that only the predictors height (P-value = 0.000) and activity level (P-value = 0.012) are significant at the 0.05 level of significance. The only other variables that seem possibly significant are age (P-value = 0.054) and gender (P-value = 0.148). When stepwise regression techniques are applied using Minitab, the variables that remain significant are height, activity level, age, and gender. The output is on the next two slides.

Analysis - 2nd with Minitab

Stepwise Regression: Capacity versus Age, Height, ...

Alpha-to-Enter: 0.1   Alpha-to-Remove: 0.1
Response is Capacity on 8 predictors, with N = 40
N(cases with missing observations) = 1   N(all cases) = 41

Step            1         2         3         4
Constant      -10.251    -9.759    -9.787    -6.929

Height          0.209     0.191     0.198     0.161
T-Value        10.42      9.87     10.43      6.55
P-Value         0.000     0.000     0.000     0.000

Activity                  0.35      0.31      0.30
T-Value                   2.87      2.60      2.67
P-Value                   0.007     0.013     0.011

Analysis - 2nd with Minitab

Step            1         2         3         4

Activity                  0.35      0.31      0.30
T-Value                   2.87      2.60      2.67
P-Value                   0.007     0.013     0.011

Age                                -0.0109   -0.0137
T-Value                            -1.96     -2.54
P-Value                             0.057     0.016

Gender                                       -0.47
T-Value                                      -2.24
P-Value                                       0.032

S               0.534     0.490     0.472     0.448
R-Sq           74.06     78.78     80.84     83.23
R-Sq(adj)      73.38     77.63     79.24     81.32
C-p            15.1       7.8       5.8       3.0

Analysis - 2nd with Minitab
The resulting Minitab output from the regression analysis using those 4 predictors follows.

Regression Analysis: Capacity versus Height, Activity, Gender, Age

The regression equation is
Capacity = - 6.93 + 0.161 Height + 0.302 Activity - 0.466 Gender - 0.0137 Age

40 cases used 1 cases contain missing values

Predictor   Coef        SE Coef    T       P
Constant    -6.929      1.708      -4.06   0.000
Height       0.16079    0.02454     6.55   0.000
Activity     0.3025     0.1133      2.67   0.011
Gender      -0.4658     0.2082     -2.24   0.032
Age         -0.013744   0.005404   -2.54   0.016

S = 0.4477   R-Sq = 83.2%   R-Sq(adj) = 81.3%
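The fitted equation can be used to compute a predicted lung capacity for a given set of predictor values. The sketch below plugs in values for a hypothetical person (the inputs are made up for illustration and are not from the data set), using the rounded coefficients from the regression equation above and the coding 1 = Male from the example description.

```python
def predicted_capacity(height_in, activity, gender, age_yr):
    """Predicted lung capacity (liters) from the four-predictor fitted equation."""
    return (-6.93
            + 0.161 * height_in
            + 0.302 * activity
            - 0.466 * gender
            - 0.0137 * age_yr)

# Hypothetical person: 68 in tall, medium activity (2), male (1), 30 years old.
prediction = predicted_capacity(height_in=68, activity=2, gender=1, age_yr=30)
print(round(prediction, 3))
```

Such predictions are only sensible for predictor values within the range of the sampled data.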

Analysis - 2nd with Minitab
Consider the following graphs: the plot of residuals versus fitted values and the normal probability plot of the residuals.

Analysis - 2nd with Minitab

Analysis - 2nd with Minitab Notice that both of these graphs appear to indicate that the assumptions made were justifiable. This multilinear model appears to provide a reasonably acceptable model for estimating lung capacity.

Analysis - 3rd with Minitab
A number of the members of the project team felt that other variables, specifically the height/weight and chest/waist ratios as well as the square of the chest girth multiplied by the height, might be better predictor variables. When these three combination variables were calculated and added to height, activity level, age, and gender, the following Minitab output was obtained.

Analysis - 3rd with Minitab

Regression Analysis: Capacity versus Height, Activity, ...

The regression equation is
Capacity = - 6.22 + 0.160 Height + 0.307 Activity - 0.469 Gender - 0.0150 Age - 1.04 HT/WT + 0.01 CH/Waist - 0.000002 c2h

40 cases used 1 cases contain missing values

Predictor   Coef          SE Coef      T       P
Constant    -6.220        2.111        -2.95   0.006
Height       0.16012      0.02915       5.49   0.000
Activity     0.3072       0.1211        2.54   0.016
Gender      -0.4686       0.2245       -2.09   0.045
Age         -0.015039     0.006613     -2.27   0.030
HT/WT       -1.042        1.574        -0.66   0.512
CH/Waist     0.011        1.305         0.01   0.993
c2h         -0.00000221   0.00000737   -0.30   0.766

S = 0.4635   R-Sq = 83.6%   R-Sq(adj) = 80.0%

Analysis - 3rd with Minitab
None of these three variables appeared to be significant. The fact that girth² × height, which would be (approximately) proportional to the volume of the body, was not significant came as a surprise to the members of the team. As a side note, the literature on spirography suggests that height is the most significant factor in lung capacity, and this is what this particular study indicated once it was completely analyzed.
