Multiple Regression Models

Exploring an example: “Chapter 4: Multiple Regression II” data • Online stock trading through the Internet has increased dramatically during the past several years. An article discussing this new method of investing provided data on the major Internet stock brokerages who provide this service. Here we have some data for the top 10 Internet brokerages. The variables are Mshare, the market share of the firm; Accts, the number of Internet accounts in thousands; and Assets, the total assets in billions of dollars. • Describe the data: • How many variables does the data set contain? How would you describe them in terms of levels of measurement?

Explaining Assets with each predictor variable • Find the correlation between Assets and each of the explanatory variables, Mshare and Accts. • Use a Simple Linear Regression to predict Assets from the number of accounts (Accts). • What is the regression equation? • What are the results of the significance test for the regression coefficient? • Do the same using Mshare.
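A minimal sketch of the two calculations asked for above: the sample correlation and the least-squares line for one predictor. The numbers are made up for illustration (they are NOT the brokerage data from the article), and the function names are hypothetical. In class these values would normally come from Minitab output.

```python
import math


def pearson_r(x, y):
    """Sample correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def simple_regression(x, y):
    """Return (intercept, slope) of the least-squares line y = b0 + b1*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return my - slope * mx, slope


# Made-up illustrative values, NOT the real Accts/Assets data.
accts = [10.0, 20.0, 30.0, 40.0, 50.0]
assets = [3.0, 4.0, 8.0, 9.0, 11.0]

r = pearson_r(accts, assets)
b0, b1 = simple_regression(accts, assets)
print(f"r = {r:.3f}; predicted Assets = {b0:.2f} + {b1:.2f} * Accts")
```

The same two functions can be applied to Mshare in place of Accts.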

What is Multiple Regression? • Predicting an outcome (dependent variable) based upon several independent variables simultaneously. • Why is this important? • Behavior is rarely a function of just one variable, but is instead influenced by many variables. So the idea is that we should be able to obtain a more accurate predicted score if using multiple variables to predict our outcome.

Strategy for Multiple Regression

The Multiple Linear Regression Model • Many regression applications involve several independent variables, x1, x2, …, xp. • A multiple linear regression model with p independent variables has the equation y = β0 + β1x1 + β2x2 + … + βpxp + ε • β0 is the intercept, and βi determines the contribution of the independent variable xi. • ε is a random error with mean 0 and variance σ².

The Prediction Equation • The equation for this model fitted to data is ŷ = b0 + b1x1 + b2x2 + … + bpxp • Here ŷ denotes the “predicted” value computed from the equation, and bi denotes an estimate of βi. • As with Simple Linear Regression, the estimates are obtained by the method of least squares. • Among all possible values for the parameter estimates, we choose the ones that minimize the sum of squared residuals.
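The least-squares idea can be sketched in a few lines. This is only an illustration under stated assumptions: it solves the normal equations (X'X)b = X'y with a small hand-written solver, the function names are hypothetical, and the data are made up so that y = 1 + 2·x1 + 3·x2 exactly, which makes the recovered estimates easy to check. A regression program does the same job with more numerically careful methods.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], so row operations act on both sides.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_least_squares(rows, y):
    """Return [b0, b1, ..., bp] minimizing the sum of squared residuals."""
    design = [[1.0] + list(r) for r in rows]  # leading 1 for the intercept
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Made-up data: each row is (x1, x2); y equals 1 + 2*x1 + 3*x2 exactly.
rows = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 6)]
y = [9.0, 8.0, 19.0, 18.0, 29.0]

b = fit_least_squares(rows, y)
print("estimates b0, b1, b2:", [round(v, 6) for v in b])
```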

Basic Idea • With multiple regression, we form a 'linear combination' of multiple variables to best predict an outcome, and then we assess the contribution that each predictor variable makes to the equation. • My research question might be: • “How much does an independent variable contribute to explaining the dependent variable after the effect of another independent variable is taken into account?”

Doing the Calculations • Computation of the estimates by hand is tedious. • They are ordinarily obtained using a regression computer program. • Standard errors also are usually part of output from a regression program.

Let’s Return to the Example • Construct a 3-D plot. • Come up with a prediction equation for the multiple regression model.

Assessing the Utility of the Model: Hypothesis tests (see MLR handout) • Test if all of the slope parameters are zero: F-test. • Test if a particular slope parameter is zero given that all other x's remain in the model: t-test.

ANOVA: ANalysis Of VAriance • The ANOVA F test is a test of the null hypothesis that Multiple R in the population = 0.0. If the p-value is .05 or less, reject the null hypothesis. • For a multiple linear regression model with p independent variables fitted to a data set with n observations, the ANOVA table is:

Source of Variation    DF       SS     MS
Model                  p        SSM    MSM
Error                  n-p-1    SSE    MSE
Total                  n-1      SST

Sums of squares • The sums of squares SSM, SSE, and SST have the same definitions in relation to the model as in simple linear regression: • SSM = Σ(ŷi − ȳ)² • SSE = Σ(yi − ŷi)² • SST = Σ(yi − ȳ)²

SST = SSM + SSE • The value of SST does not change with the model. • It depends only on the values of the dependent variable y. • SSE decreases as variables are added to a model, and SSM increases by the same amount. • This amount of increase in SSM is the amount of variation due to variables in the larger model that was not accounted for by variables in the smaller model.
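The decomposition can be checked numerically. The sketch below is an illustration only: the data are made up (p = 2 predictors, n = 4 observations), the function name is hypothetical, and the fitted values use b0 = 4, b1 = 2, b2 = 1.5, which are the least-squares estimates for this particular data set. The identity SST = SSM + SSE holds only for least-squares fitted values from a model with an intercept.

```python
def sums_of_squares(y, y_hat):
    """Return (SSM, SSE, SST) for observed y and least-squares fitted y_hat."""
    y_bar = sum(y) / len(y)
    ssm = sum((f - y_bar) ** 2 for f in y_hat)          # explained by the model
    sse = sum((o - f) ** 2 for o, f in zip(y, y_hat))   # left in the residuals
    sst = sum((o - y_bar) ** 2 for o in y)              # total variation in y
    return ssm, sse, sst


# Made-up data (hypothetical): two predictors, four observations.
x1 = [-1, -1, 1, 1]
x2 = [-1, 1, -1, 1]
y = [1.0, 3.0, 4.0, 8.0]

# Least-squares fit for THIS data set: y-hat = 4 + 2*x1 + 1.5*x2.
y_hat = [4.0 + 2.0 * a + 1.5 * b for a, b in zip(x1, x2)]

ssm, sse, sst = sums_of_squares(y, y_hat)
n, p = len(y), 2
print(f"Model  DF={p}      SS={ssm}")
print(f"Error  DF={n - p - 1}      SS={sse}")
print(f"Total  DF={n - 1}      SS={sst}")
```

Note that SST is computed from y alone, which is why it does not change when predictors are added or removed.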

F statistic • F is the statistic used to test whether ALL the slope parameters are zero. • ANOVA gives the F statistic and p-value (be sure to set the α level) • Under the null hypothesis the F statistic has an F(p, n-p-1) distribution and the p-value is ___. According to this distribution, the chance of obtaining an F statistic of __ or larger is _(p-value). We conclude that the model is useful/not useful for predicting…
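For reference, the F statistic reported in the ANOVA output is the ratio of the two mean squares from the ANOVA table. Written out with the degrees of freedom for p independent variables and n observations:

```latex
H_0:\ \beta_1 = \beta_2 = \dots = \beta_p = 0
\qquad
F \;=\; \frac{MSM}{MSE} \;=\; \frac{SSM / p}{SSE / (n - p - 1)}
```

A large F (small p-value) means the model explains much more variation per degree of freedom than is left in the residuals.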

Proceed only if F and the corresponding p-value indicate sufficient evidence that the overall model is useful • If so, look to the individual variables to determine their contribution • We do this with t-tests • A p-value of .05 or less for a variable indicates a significant contribution

Interpreting coefficients • Constant = intercept (the predicted mean of the dependent variable when all independent variables equal zero) • Other coefficients are the regression coefficients (slopes), interpreted as the change in the mean dependent variable for each unit change in the corresponding independent variable, all other variables held constant.

Confidence Intervals • A level C confidence interval for βj is bj ± t* SE(bj) • bj is the least-squares estimate of βj, and SE(bj) is its standard error • t* is the upper (1-C)/2 critical value from the t(n-p-1) distribution.
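A small sketch of the arithmetic behind the t-test and the confidence interval for one coefficient. Everything here is hypothetical: the estimate, its standard error, and the function names are made up for illustration. The critical value t* is taken from a t table rather than computed (2.365 is the upper 0.025 critical value of t(7), which corresponds to n = 10, p = 2, C = 0.95).

```python
def t_statistic(b_j, se_j):
    """Test statistic for H0: beta_j = 0, given the other x's stay in the model."""
    return b_j / se_j


def confidence_interval(b_j, se_j, t_star):
    """Interval b_j +/- t* * SE(b_j)."""
    margin = t_star * se_j
    return b_j - margin, b_j + margin


# Hypothetical regression output for one coefficient.
b_j, se_j = 2.0, 0.5
t_star = 2.365  # from a t table: upper 0.025 critical value, 7 degrees of freedom

t = t_statistic(b_j, se_j)
low, high = confidence_interval(b_j, se_j, t_star)
print(f"t = {t:.2f}, 95% CI = ({low:.4f}, {high:.4f})")
```

An interval that excludes zero agrees with rejecting H0: βj = 0 in the two-sided t-test at level 1 − C.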

Returning to our example… • How good is the model? • Which variables contribute to the model?

What if the Relationship is Curvilinear? • Example: Application journal for chapter 4 (data- Chapter 4: Curvilinear Relationship) • Explore the relationship between IgG (y) as a function of maximal oxygen uptake (x). • Does a linear or curvilinear model better explain the variation in IgG? How do you determine this?

Basic Quadratic Model • E(y) = β0 + β1x + β2x² • β0 is the y-intercept of the curve; value of E(y) when x = 0 • β1 is the shift parameter; changing the value of β1 shifts the parabola to the right or left (the vertex lies at x = −β1/(2β2), so the direction of the shift depends on the sign of β2) • β2 is the rate of curvature

Interpreting the Coefficient (β) Estimates • The estimate of β0 can only be meaningfully interpreted if the sampled range of the independent variable includes zero. • The estimated coefficient of the first-order term no longer represents the slope and typically cannot be meaningfully interpreted. • The sign of the coefficient associated with the quadratic term (x²) indicates whether the curve is • concave downward (mound-shaped): negative • concave upward (bowl-shaped): positive • What is the prediction equation, and how would you interpret the βs for the example?
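To see how the three coefficients shape the curve, the sketch below reports the direction of concavity and the location of the vertex for a fitted quadratic. The coefficients are hypothetical (they are not estimates from the IgG data), and the function name is made up for illustration.

```python
def describe_quadratic(b0, b1, b2):
    """Return (vertex_x, vertex_y, shape) for y-hat = b0 + b1*x + b2*x**2."""
    if b2 == 0:
        raise ValueError("b2 = 0 gives a straight line, not a parabola")
    vertex_x = -b1 / (2 * b2)  # where the curve turns
    vertex_y = b0 + b1 * vertex_x + b2 * vertex_x ** 2
    if b2 > 0:
        shape = "concave upward (bowl-shaped)"
    else:
        shape = "concave downward (mound-shaped)"
    return vertex_x, vertex_y, shape


# Hypothetical estimates, for illustration only.
vertex_x, vertex_y, shape = describe_quadratic(b0=1.0, b1=4.0, b2=-1.0)
print(f"{shape}; turning point at x = {vertex_x}, y-hat = {vertex_y}")
```

If the turning point falls outside the sampled range of x, the fitted curve is only rising or only falling over the data, so extrapolating to the vertex is not justified.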

Assessing Model Utility • Again, refer to the F test statistic and associated p-value. • If these indicate that the model is useful, proceed to the t-test of the β associated with the quadratic term (x²), which is β2 here • H0: β2 = 0 (no curvature in response curve) • Ha: β2 < 0 (downward concavity exists) or • Ha: β2 > 0 (upward concavity exists) • This is a one-tailed test, so we divide the two-tailed p-value reported by the software by 2 (provided the sign of the estimate agrees with Ha). • We do not need to consider the test statistics for the coefficients associated with the y-intercept and first-order term(s)

What if I have a Qualitative Independent Variable? • Create a “dummy” variable (indicator variable.) • Instructions included on Minitab worksheet. • Example: Application journal # 3 (data- Chapter 4: Dummy Variable) • Create a dummy variable for repellent type • Is repellent type useful for predicting cost per use? Number of hours of protection?
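Outside Minitab, the coding step can be sketched as follows. The level names "lotion" and "spray" are placeholders (the actual repellent types are in the data set), and the function name is hypothetical. A qualitative variable with k levels needs k − 1 dummy variables; the omitted level is the baseline against which the others are compared.

```python
def dummy_code(values, baseline):
    """Return one 0/1 indicator list per non-baseline level."""
    levels = sorted(set(values) - {baseline})
    return {
        level: [1 if v == level else 0 for v in values]
        for level in levels
    }


# Placeholder labels, NOT the actual repellent types from the worksheet.
repellent_type = ["lotion", "spray", "spray", "lotion", "spray"]

dummies = dummy_code(repellent_type, baseline="lotion")
print(dummies)  # one indicator column: 1 = spray, 0 = lotion (baseline)
```

With a single 0/1 dummy x in the model E(y) = β0 + β1x, β0 is the mean response for the baseline level and β1 is the difference between the two level means.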

What if the relationship between E(y) and any one IV depends on the value of another IV? • In this case, the two independent variables interact, and we model this with a cross-product term of the two IVs (x1x2).

Example: Graph and interpret the following findings • Let’s say we want to study how hard students work on tests. We have some achievement-oriented students and some achievement-avoiders. We create two random halves in each sample, and give one half of each sample a challenging test and the other half an easy test. We measure how hard the students work on the test. The means of this study are:

Caution! • Once an interaction has been deemed important in a model, all associated first-order terms should be kept in the model, regardless of the magnitude of their p-values.

Conclusions • E(y) = β0 + β1x1 + β2x2 + β3x1x2 • The effect of test difficulty (x1) on effort (y) depends on a student’s achievement orientation (x2). Thus, the type of achievement orientation and test difficulty interact in their effect on effort. • This is an example of a two-way interaction between achievement orientation and test difficulty.
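To make the interaction concrete, it helps to write the model separately for each group. The 0/1 coding below is an assumption made for illustration (x1 = 1 for the challenging test, 0 for the easy test; x2 = 1 for achievement-oriented students, 0 for achievement-avoiders); the slides do not state which coding is used.

```latex
\begin{aligned}
x_2 = 0:\quad & E(y) = \beta_0 + \beta_1 x_1 \\
x_2 = 1:\quad & E(y) = (\beta_0 + \beta_2) + (\beta_1 + \beta_3)\, x_1
\end{aligned}
```

Under this coding, the effect of test difficulty on effort is β1 for achievement-avoiders and β1 + β3 for achievement-oriented students. If β3 = 0 the two lines are parallel and there is no interaction; nonparallel lines in the plot of the four cell means are the visual sign of an interaction.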
