Multiple Regression


  1. Multiple Regression PSY 4603 Research Methods Overview of the Analysis

  2. Multiple Regression • Often, a researcher selects more than one independent variable to be included in a study, assuming that more than one variable affects change in another variable. • With the addition of variables, the model becomes a multiple regression model. • Its generalized form is Y′ = a + b1X1 + b2X2 + … + bkXk + e, where k is the number of independent variables.
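The generalized form above can be sketched in code. This is a minimal illustration, not the course's software: it fits Y′ = a + b1X1 + b2X2 by ordinary least squares via the normal equations, using made-up data and hypothetical helper names (`solve`, `fit_mlr`).

```python
def solve(A, y):
    """Solve the linear system A x = y by Gauss-Jordan elimination."""
    n = len(A)
    M = [row[:] + [y[i]] for i, row in enumerate(A)]
    for col in range(n):
        # Pivot on the row with the largest entry in this column.
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * b for a, b in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def fit_mlr(x1, x2, y):
    """Return (a, b1, b2) minimizing the sum of squared errors for
    Y' = a + b1*X1 + b2*X2, via the normal equations X'X b = X'y."""
    n = len(y)
    cols = [[1.0] * n, list(x1), list(x2)]
    XtX = [[sum(p * q for p, q in zip(ci, cj)) for cj in cols] for ci in cols]
    Xty = [sum(p * q for p, q in zip(ci, y)) for ci in cols]
    return solve(XtX, Xty)
```

With data generated exactly as y = 2 + 3·x1 + 0.5·x2, the fit recovers those coefficients.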

  3. The objectives of MLR are essentially the same as those of SLR. • The sample data are used to estimate a, the b values, and the error in the model. • Second, we want to know the proportion of variance accounted for by the model. • Third, we wish to test statistical hypotheses concerning the sample estimates. • The major difference between the MLR model and the SLR model is the former's determination of the relative contribution of each X in the model in explaining Y. • When the actual computations are completed, we refer to values of beta (β) instead of values of b. • Beta values (β) are the standardized regression coefficients, obtained by standardizing the data before the calculations or by standardizing the b values after the analyses.

  4. The assumptions of the model are essentially the same as those of SLR, with the emphasis on (1) estimating the linearity of the data and (2) the effects of intercorrelation among the independent variables and the residuals.

  University students often complain that universities reward professors for research but not for teaching, and argue that professors react to this situation by devoting more time and energy to the publication of their findings and less time and energy to classroom activities. Professors counter that research and teaching go hand in hand: more research makes better teachers. A student organization at Palm Beach Atlantic University decided to investigate the issue. They randomly selected 50 psychology professors (let's pretend) employed by PBA. The students recorded the salaries of the professors, their average teaching evaluations (on a 10-point scale), and the total number of journal articles published in their careers. These data are stored in columns 1 to 3, respectively, in the file. Perform a complete analysis (produce the regression equation, assess it, and diagnose it) and report your findings.

  5. The variables that we’ve isolated as influencing the salaries: • X1 (Course Evaluations) and X2 (Number of Publications). Basically, the partitioning of variance is used to obtain the estimates of the components of the regression equation.

  6. The unstandardized regression equation is shown here: Y’ = 47.768 + 0.776(X1) + 1.062(X2)

  7. Now we must determine the fit of the model by estimating R2 and statistically testing all the estimates. • The squared multiple correlation coefficient, R2, is analogous to r2, the term used in SLR. • This value tells us how much of the variance in Y is attributable to the independent variables. • In our example, R2 is 0.721. • This indicates that a meaningful 72.1% of the variance in the model is explained – 72.1% of the variance in salaries can be attributed to teaching performance and publication record.
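The R2 just described can be sketched directly from its definition, 1 minus the ratio of residual to total sums of squares. This is a generic illustration with a hypothetical helper name, not the course's software.

```python
def r_squared(y, y_hat):
    """R^2 = 1 - SS_residual / SS_total: the proportion of variance
    in y accounted for by the model's predictions y_hat."""
    y_bar = sum(y) / len(y)
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    ss_res = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
    return 1 - ss_res / ss_tot
```

Perfect predictions give R2 = 1; predicting the mean for every case gives R2 = 0.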

  8. The remaining step in the evaluation of the regression equation is to estimate the contribution of each variable in the study. • If all the b values are significant, the easiest way to interpret the contribution of each variable is to standardize b. • The standardized b (beta; β) values can then be used to interpret the relative contribution of each variable. • As simple as this may sound, the interpretation of the relative importance of independent (predictor) variables is a very controversial topic. • One never interprets the relative importance of the variables from the unstandardized b values.

  9. Analyzing a Multiple Regression Model • Step 1 Hypothesize the deterministic component of the model. This component relates the mean, E(y), to the independent variables x1, x2, …, xk. This involves the choice of the independent variables to be included in the model. • Step 2 Use the sample data to estimate the unknown model parameters β0, β1, β2, …, βk in the model. • Step 3 Specify the probability distribution of the random error term, ε, and estimate the standard deviation of this distribution, σ.

  10. Step 4 Check that the assumptions on ε are satisfied, and make model modifications if necessary. • Step 5 Statistically evaluate the usefulness of the model. • Step 6 When satisfied that the model is useful, use it for prediction, estimation, and other purposes.

  11. Regression analysis is often used in medical research to examine the variables that affect various biological processes. A study performed by medical scientists investigated nutritional effects on preweaning mouse pups. In the experiment, the amount of nutrients was varied by rearing the pups in different litter sizes. After 32 days, the body weight and brain weight (both measured in grams) were recorded. These data are stored in the file (column 1 = brain weight; column 2 = litter size; column 3 = body weight).
  a. Conduct a multiple regression analysis where the dependent variable is the brain weight. Interpret the coefficients.
  b. Can we infer at the 5% significance level that there is a linear relationship between litter size and brain weight?
  c. Can we infer at the 5% significance level that there is a linear relationship between body weight and brain weight?
  d. What is the coefficient of determination, and what does it tell you about this model?
  e. Test the overall validity of the model. (Use a 5% significance level.)
  f. Predict with 95% confidence the brain weight of a mouse pup that came from a litter of 10 pups and whose body weight is 8 grams.
  g. Predict with 95% confidence the brain weight of a mouse pup that came from a litter of 6 pups and whose body weight is 7 grams.
  Source: D. E. Matthews and V. T. Farewell, Using and Understanding Medical Statistics (Karger, 1988).

  12. b. Can we infer at the 5% significance level that there is a linear relationship between litter size and brain weight? Yes (r = -.955) • c. Can we infer at the 5% significance level that there is a linear relationship between body weight and brain weight? Yes (r = .746)

  13. d. What is the coefficient of determination, and what does it tell you about this model? R2 = .651. The model suggests that 65.1% of the variance in brain weight can be accounted for by the linear combination of litter size and body weight. e. Test the overall validity of the model. (Use a 5% significance level.) The model is statistically significant at an alpha level of .05.

  14. Model: Y = .178 + .007(X1) + .024(X2) • f. Predict with 95% confidence the brain weight of a mouse pup that came from a litter of 10 pups and whose body weight is 8 grams. Y = .178 + .007(10) + .024(8) = .178 + .07 + .192 = .440 grams • g. Predict with 95% confidence the brain weight of a mouse pup that came from a litter of 6 pups and whose body weight is 7 grams. Y = .178 + .007(6) + .024(7) = .178 + .042 + .168 = .388 grams
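The two point predictions above can be reproduced by plugging values into the fitted equation. Note this gives only the point prediction; the 95% prediction interval would also need the standard error of prediction, which the slide does not show.

```python
def predict_brain_weight(litter_size, body_weight):
    """Point prediction (grams) from the model fitted on the slide:
    Y' = .178 + .007(X1) + .024(X2), where X1 = litter size and
    X2 = body weight."""
    return 0.178 + 0.007 * litter_size + 0.024 * body_weight
```

For a litter of 10 and body weight 8 g this returns .440 g; for a litter of 6 and body weight 7 g, .388 g, matching the slide's arithmetic.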

  15. An Example… • The owner of an apartment building in Minneapolis believed that her property tax bill was too high because of an over-assessment of the property's value by the city tax assessor. The owner hired an independent real estate appraiser to investigate the appropriateness of the city's assessment. The appraiser used regression analysis to explore the relationship between the sale prices of apartment buildings sold in Minneapolis and various characteristics of the properties. Twenty‑five apartment buildings were randomly sampled from all apartment buildings that were sold during a recent year. The SPSS datafile lists the data collected by the appraiser.

  16. The real estate appraiser hypothesized that the sale price (that is, market value) of an apartment building is related to the other variables in the table according to the model y = β0 + β1x1 + β2x2 + β3x3 + β4x4 + β5x5 + ε, where y = sale price (dollars), x1 = number of apartments, x2 = age of structure, x3 = lot size (square feet), x4 = number of on-site parking spaces, and x5 = gross building area.

  17. Significant! • The model hypothesized above is fit to the data. You can see that β0 (the constant or y-intercept) = 92787.869, and β1 = 4140.419, β2 = -853.184, β3 = .962, β4 = 2695.869, and β5 = 15.544. • Therefore, the equation that minimizes SSE for this data set (i.e., the least squares prediction equation) is y = 92787.869 + 4140.42(x1) – 853.18(x2) + .962(x3) + 2695.87(x4) + 15.54(x5)
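The least squares prediction equation can be sketched as a function; the argument names are assumed labels for x1 through x5, not part of the original output.

```python
def predict_sale_price(apartments, age, lot_size, parking, gross_area):
    """Predicted sale price (dollars) from the least squares equation
    on the slide: y = 92787.869 + 4140.42(x1) - 853.18(x2)
    + .962(x3) + 2695.87(x4) + 15.54(x5)."""
    return (92787.869 + 4140.42 * apartments - 853.18 * age
            + 0.962 * lot_size + 2695.87 * parking + 15.54 * gross_area)
```

One useful reading of a coefficient: holding the other variables constant, each additional year of age lowers the predicted price by $853.18.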

  18. Two final considerations in multiple regression analysis are the selection of independent variables and the use of partialing in the regression analysis. • In psychological research, many independent variables are initially selected for analysis. How can one determine which variables or sets of variables are best to use? • The key is to find the minimum number of variables needed to account for almost the same amount of variance as is accounted for by the entire set. • Various procedures are available to accomplish this; each has its advantages and disadvantages. • One such analysis is called stepwise multiple regression. • The stepwise procedure inserts variables into the analysis until the regression equation is satisfactory.

  19. The purpose of stepwise regression is to explain the most variance (R2) with the smallest set of important variables.

  20. Essentially, what this stepwise forward regression does is enter independent variables one at a time, as long as the p-values for their individual F-statistics stay below α = .05 (i.e., p < .05). • Three variables – Expected Grade (the most important; β = .521), Teaching Skills (β = .322), and Knowledge of the Course Material (β = -.262) – thus significantly contribute to variance in the Instructor's Evaluation (R2 = 0.886). • Naturally, there is not much difference between this analysis and a regular (simultaneous) regression, but with a larger number of variables, the benefits are enormous.
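The forward-entry idea can be sketched in simplified form. A real stepwise procedure enters a predictor only while its partial F-statistic is significant (p < .05); here, as a stand-in for that test, a predictor enters only if it raises R2 by more than a threshold. All names, data, and the threshold below are made up for illustration.

```python
def ols_r2(x_cols, y):
    """R^2 from an OLS fit of y on the given predictor columns
    (intercept included), via the normal equations X'X b = X'y."""
    n = len(y)
    cols = [[1.0] * n] + [list(c) for c in x_cols]
    k = len(cols)
    # Build the augmented normal-equation system and solve (Gauss-Jordan).
    M = [[sum(p * q for p, q in zip(ci, cj)) for cj in cols]
         + [sum(p * q for p, q in zip(ci, y))] for ci in cols]
    for c in range(k):
        piv = max(range(c, k), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(k):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * b for a, b in zip(M[r], M[c])]
    b = [M[i][k] / M[i][i] for i in range(k)]
    y_hat = [b[0] + sum(bj * col[i] for bj, col in zip(b[1:], x_cols))
             for i in range(n)]
    y_bar = sum(y) / n
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    ss_res = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
    return 1 - ss_res / ss_tot


def forward_select(predictors, y, min_gain=0.01):
    """Greedily enter the predictor that most increases R^2, stopping
    when no remaining candidate adds more than `min_gain`."""
    chosen, current_r2 = [], 0.0
    remaining = dict(predictors)
    while remaining:
        best_name, best_r2 = None, current_r2
        for name, col in remaining.items():
            trial = [predictors[n] for n in chosen] + [col]
            trial_r2 = ols_r2(trial, y)
            if trial_r2 > best_r2:
                best_name, best_r2 = name, trial_r2
        if best_name is None or best_r2 - current_r2 < min_gain:
            break
        chosen.append(best_name)
        current_r2 = best_r2
        del remaining[best_name]
    return chosen, current_r2
```

With y depending only on x1, the procedure enters x1 and then stops, since x2 adds essentially nothing.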

  21. Sometimes a researcher uses one or more categorical variables in a multiple regression. • These are called dummy variables; they usually present few problems to the experienced user. • However, the analyst should be aware that dummy variables are very difficult if not impossible to interpret if they have more than four categories. • Finally, partialing consists of using partial and semipartial correlations in the analysis for statistical control. • Sometimes, when a researcher is studying a relationship among variables, there are confounding effects that hamper the actual mathematical determination of the relationship. • These effects can be controlled in an analysis by the method of partialing (you’ll see this in grad school).
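Dummy coding of a categorical variable can be sketched as follows: a variable with c categories becomes c - 1 indicator columns, with one category left out as the baseline. The helper name is hypothetical.

```python
def dummy_code(values, reference):
    """Code a categorical variable with c levels as c-1 indicator
    (dummy) columns, leaving `reference` as the omitted baseline
    category against which the others are compared."""
    levels = [lv for lv in sorted(set(values)) if lv != reference]
    return {lv: [1 if v == lv else 0 for v in values] for lv in levels}
```

For example, a three-level variable coded with "a" as the baseline yields two 0/1 columns, one each for "b" and "c"; each column's regression coefficient is then the contrast between that category and the baseline.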

  22. Example APA-Style Results Sections
