By Dr. Olga Korosteleva.
Outline of Presentation • Basics of Simple Linear Regression Model • Formula for Multiple Linear Regression Model • Polynomial terms, interaction terms • Dummy variables • Model accuracy • Model adequacy • Interpretation of coefficients • Applications in SAS and SPSS
Simple Linear Regression Model Suppose a pair of continuous variables $(x, y)$ is observed on some individuals. A plot of $y$ against $x$ may look something like this: [Figure: scatterplot of $y$ against $x$]
What Is This Relation? We would describe this relation as linear because the points follow a general linear trend around a straight line. Each observed value of $y$ deviates randomly from this line.
How to Write This Relation? • The formula for a simple linear regression is $y = \beta_0 + \beta_1 x + \varepsilon$, where $\beta_0$ is the intercept, $\beta_1$ is the slope, and $\varepsilon$ is a random error. • The regression coefficients $\beta_0$ and $\beta_1$ are unknown and have to be estimated from the observed data. • Their estimates are denoted by $\hat{\beta}_0$ and $\hat{\beta}_1$.
What Is the Fitted Line? The equation of the fitted line is $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$, where $\hat{y}$ is a fitted value.
What Is the Multiple Linear Regression Model? The equation of the multiple linear regression model is $y = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k + \varepsilon$, where tuples $(x_1, \dots, x_k, y)$ are observed, $\beta_0, \beta_1, \dots, \beta_k$ are regression coefficients, and $\varepsilon$ is a random error. Multiple linear regression is sometimes called multivariate or multivariable.
What Are the Assumptions of This Model? The random errors $\varepsilon$ • are normally distributed (normality assumption) • have mean zero • have constant variance $\sigma^2$ (homoscedasticity assumption)
What Is Unknown in This Model? The unknowns of the model are the regression coefficients $\beta_0, \beta_1, \dots, \beta_k$ and the variance $\sigma^2$. The estimated regression coefficients define a fitted plane $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \dots + \hat{\beta}_k x_k$.
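The estimation step for the simple model can be sketched in Python. This is a minimal sketch that assumes the usual least-squares estimates (the slides do not name the estimation method); the function name and the data are made up for illustration.

```python
import numpy as np

def fit_simple_linear_regression(x, y):
    """Return least-squares estimates (b0, b1) for y = beta0 + beta1*x + error."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_bar, y_bar = x.mean(), y.mean()
    # Slope: sum of cross-deviations divided by sum of squared x-deviations.
    b1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    # Intercept: the fitted line passes through the point of means.
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Made-up illustrative data (not from the presentation).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

b0, b1 = fit_simple_linear_regression(x, y)
y_hat = b0 + b1 * x  # fitted values on the fitted line
```

Here `b0` and `b1` play the role of the estimated intercept and slope, and `y_hat` holds the fitted values.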
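A minimal sketch of estimating these unknowns, again assuming least squares and the common unbiased variance estimate (residual sum of squares divided by n - k - 1). Neither choice is stated in the slides, and the function name and data are hypothetical.

```python
import numpy as np

def fit_multiple_linear_regression(X, y):
    """Return (beta_hat, sigma2_hat) for y = beta0 + beta1*x1 + ... + betak*xk + error.

    X has one row per individual and one column per predictor (no intercept column).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, k = X.shape
    design = np.column_stack([np.ones(n), X])  # prepend the intercept column
    beta_hat, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ beta_hat
    # Assumed estimator of the error variance: RSS / (n - k - 1).
    sigma2_hat = residuals @ residuals / (n - k - 1)
    return beta_hat, sigma2_hat

# Simulated data for illustration only: true coefficients are 1, 2, -3.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.5, size=50)

beta_hat, sigma2_hat = fit_multiple_linear_regression(X, y)
```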
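General Form of Multiple Regression The variables $x_1, \dots, x_k$ don't have to be independent. They may be • Polynomial terms, e.g. $x_1^2$, $x_1^3$ • Interaction terms, e.g. $x_1 x_2$. Then the fitted surface is not necessarily a plane.
Why Would a Multiple Linear Regression of a Form Such as $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1^2 + \beta_4 x_1 x_2 + \varepsilon$ Be Called Linear? Because it is linear in the beta coefficients.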
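To illustrate linearity in the beta coefficients, the sketch below adds a squared term and an interaction term as extra columns and fits them with an ordinary linear least-squares solve. The particular terms, the function name, and the simulated data are assumptions chosen for illustration.

```python
import numpy as np

def design_with_poly_and_interaction(x1, x2):
    """Columns: 1, x1, x2, x1^2, x1*x2 (nonlinear in x, linear in the betas)."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    return np.column_stack([np.ones_like(x1), x1, x2, x1 ** 2, x1 * x2])

# Simulated noise-free data for illustration only.
rng = np.random.default_rng(1)
x1 = rng.uniform(-2, 2, size=30)
x2 = rng.uniform(-2, 2, size=30)
true_beta = np.array([1.0, 2.0, -1.0, 0.5, 3.0])

design = design_with_poly_and_interaction(x1, x2)
y = design @ true_beta

# The same linear solver works because the model is linear in beta.
beta_hat, *_ = np.linalg.lstsq(design, y, rcond=None)
```

The fitted surface is curved in $x_1$ and $x_2$, yet the coefficients come from a plain linear solve.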
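What Are Dummy Variables? Suppose variable $x_1$ is not continuous but is categorical, that is, it assumes only a few values, say, $c_1, \dots, c_m$. Then dummy (or indicator) variables have to be introduced into the model: $d_j = 1$ if $x_1 = c_j$ and $d_j = 0$ otherwise, for $j = 1, \dots, m-1$.
Example Gender (F/M) is a dummy variable already because it assumes only two values, e.g., $x_1 = 1$ for female and $x_1 = 0$ for male. The regression model is: $y = \beta_0 + \beta_1 x_1 + \varepsilon$. For female, $y = \beta_0 + \beta_1 + \varepsilon$. For male, $y = \beta_0 + \varepsilon$. The level to which no dummy variable is assigned (here, male) is termed the reference category.
Another Example Ethnicity (White/Hispanic/Black/Asian/Other) has five levels; therefore, four dummy variables $d_1, d_2, d_3, d_4$ (indicators of White, Hispanic, Black, and Asian) will be introduced into the model, which will become: $y = \beta_0 + \beta_1 d_1 + \beta_2 d_2 + \beta_3 d_3 + \beta_4 d_4 + \varepsilon$. For White, $y = \beta_0 + \beta_1 + \varepsilon$. For Hispanic, $y = \beta_0 + \beta_2 + \varepsilon$. For Black, $y = \beta_0 + \beta_3 + \varepsilon$. For Asian, $y = \beta_0 + \beta_4 + \varepsilon$. For Other, $y = \beta_0 + \varepsilon$ (reference).
A Side Note Depending on the field of study, • the $x$ variable is called a predictor, or stressor, or covariate, or factor, or independent variable, or explanatory variable. • the $y$ variable is called a response, or outcome, or dependent variable, or variate.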
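The dummy coding for ethnicity can be sketched as follows. The helper name `dummy_code` and the six sample values are hypothetical; the levels and the choice of "Other" as reference follow the example.

```python
import numpy as np

def dummy_code(values, levels, reference):
    """Return (matrix, names): one 0/1 column per non-reference level."""
    non_reference = [level for level in levels if level != reference]
    values = np.asarray(values)
    # Each column is the indicator of one non-reference level.
    matrix = np.column_stack(
        [(values == level).astype(float) for level in non_reference]
    )
    return matrix, non_reference

# Made-up sample of six individuals.
ethnicity = ["White", "Other", "Asian", "Black", "Hispanic", "White"]
levels = ["White", "Hispanic", "Black", "Asian", "Other"]

D, names = dummy_code(ethnicity, levels, reference="Other")
```

A row of all zeros in `D` marks an individual in the reference category.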
How to Check Model Accuracy? • The coefficient of multiple determination $R^2$ is computed as the ratio of the variation in $y$ accounted for by the regression model to the total variation in $y$. • $R^2$ is therefore interpreted as the proportion of total variation of $y$ that can be explained by the regression. • $R^2$ is a number between 0 and 1. • The larger $R^2$, the better fit the model has. If $R^2 = 1$, the model has a perfect fit.
How to Check Model Adequacy? • Define a residual as the difference between the observed and fitted $y$, that is, $e = y - \hat{y}$. • Residuals are interpreted as estimates of the random errors $\varepsilon$. • To check that model assumptions (normality, mean zero, homoscedasticity) are met, perform a residual analysis (also called model validation, or model validity check, or model adequacy check, or model diagnostics).
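A minimal sketch of this ratio, assuming the fitted values come from a least-squares fit that includes an intercept (in that case the ratio also equals one minus the residual variation over the total variation). The function name and data are made up.

```python
import numpy as np

def r_squared(y, y_hat):
    """Variation explained by the regression divided by total variation in y."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    ss_regression = np.sum((y_hat - y.mean()) ** 2)  # explained variation
    ss_total = np.sum((y - y.mean()) ** 2)           # total variation
    return ss_regression / ss_total

# Made-up illustrative data with a least-squares line fitted by np.polyfit.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
slope, intercept = np.polyfit(x, y, 1)
y_hat = intercept + slope * x

r2 = r_squared(y, y_hat)
```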
What Is Residual Analysis? • Run normality tests on the residuals. All or almost all of them should have P-value > 0.05, so that normality is not rejected at the 5% level. • Plot a histogram of the residuals. A bell-shaped curve centered around zero should be displayed.
What Is Residual Analysis? • Construct a normal probability plot (also called quantile-quantile plot or Q-Q plot) of the residuals. It is a plot of residuals against corresponding quantiles of the standard normal distribution. • A linear pattern should be displayed except possibly at the end-points.
What Is Residual Analysis? • Plot residuals against fitted values to check for homoscedasticity of errors. There should be no discernible pattern such as a megaphone or runs of positives and negatives. [Figure panels: no pattern; megaphone; runs of negatives and positives]
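The coordinates of a normal probability plot can be computed as in the sketch below. The plotting positions (i - 0.5)/n are one common convention and an assumption here; statistical packages may use a slightly different formula. The function name and residuals are made up.

```python
import numpy as np
from statistics import NormalDist

def normal_qq_points(residuals):
    """Return (theoretical_quantiles, sorted_residuals) for a normal Q-Q plot."""
    sorted_residuals = np.sort(np.asarray(residuals, dtype=float))
    n = sorted_residuals.size
    # Assumed plotting positions: (i - 0.5) / n for i = 1..n.
    probabilities = (np.arange(1, n + 1) - 0.5) / n
    standard_normal = NormalDist()  # mean 0, standard deviation 1
    theoretical = np.array([standard_normal.inv_cdf(p) for p in probabilities])
    return theoretical, sorted_residuals

# Simulated residuals for illustration only.
rng = np.random.default_rng(0)
residuals = rng.normal(size=200)

theoretical, ordered = normal_qq_points(residuals)
# A correlation near 1 corresponds to the linear pattern described above.
qq_correlation = np.corrcoef(theoretical, ordered)[0, 1]
```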
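Besides the plot, a crude numeric indicator of a megaphone shape is to compare the residual spread for small and large fitted values. This is only a rough heuristic introduced here for illustration, not a formal test and not a method from the presentation; the function name and the data are made up.

```python
import numpy as np

def spread_ratio(fitted, residuals):
    """Std. dev. of residuals in the upper half of fitted values over the lower half.

    Values far from 1 hint at non-constant variance (e.g. a megaphone shape).
    """
    fitted = np.asarray(fitted, dtype=float)
    residuals = np.asarray(residuals, dtype=float)
    ordered = residuals[np.argsort(fitted)]  # residuals sorted by fitted value
    half = ordered.size // 2
    return ordered[half:].std() / ordered[:half].std()

# Made-up examples: constant spread versus spread growing with the fitted value.
fitted = np.arange(1.0, 11.0)
signs = np.array([1.0, -1.0] * 5)
constant_spread = signs
megaphone = fitted * signs

ratio_constant = spread_ratio(fitted, constant_spread)
ratio_megaphone = spread_ratio(fitted, megaphone)
```

The residuals-versus-fitted plot remains the primary check; the ratio only summarizes one kind of pattern.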
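How to Interpret Beta Coefficients? • Since the mean of $\varepsilon$ is zero, the mean of $y$ may be computed as $E(y) = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k$. Here $E$ stands for expected value. • If $x_1$ is a continuous variable, then $\hat{\beta}_1$ is interpreted as the estimated change in mean response for a unit increase in $x_1$, provided all the other variables stay fixed. Indeed, $E(y \mid x_1 + 1) - E(y \mid x_1) = [\beta_0 + \beta_1 (x_1 + 1) + \beta_2 x_2 + \dots + \beta_k x_k] - [\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k] = \beta_1$.
Example Suppose reduction in blood pressure is regressed on several predictors including age. The regression coefficient for age is estimated as -0.04. This means that the estimated average reduction in blood pressure decreases by 0.04 units if a patient becomes one year older, controlling for all other predictor variables.
How to Interpret Beta Coefficients? • If $x_1$ is a categorical variable with two levels (its own dummy variable), then $\hat{\beta}_1$ is interpreted as the estimated difference in mean response for the upper level of $x_1$ (when $x_1 = 1$) and that for the lower level (when $x_1 = 0$). To see that, write $E(y \mid x_1 = 1) - E(y \mid x_1 = 0) = [\beta_0 + \beta_1 + \beta_2 x_2 + \dots + \beta_k x_k] - [\beta_0 + \beta_2 x_2 + \dots + \beta_k x_k] = \beta_1$.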
Example Suppose reduction in blood pressure is regressed on several predictors including gender (F/M). The regression coefficient for female is estimated as 1.71. This means that the estimated average reduction in blood pressure for females is 1.71 units higher than that for males, provided all the other variables are the same (same age, ethnicity, etc.)
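Both interpretations can be checked numerically with a small sketch. The age and female coefficients are the estimates quoted in the two examples, assumed here to belong to the same model; the intercept is a made-up placeholder, and the function name is hypothetical.

```python
import numpy as np

# Coefficients for: intercept, age (years), female (1 = female, 0 = male).
# Age (-0.04) and female (1.71) are the estimates quoted in the examples;
# the intercept 12.0 is a made-up placeholder for illustration only.
beta_hat = np.array([12.0, -0.04, 1.71])

def predicted_mean_reduction(age, female):
    """Estimated mean reduction in blood pressure for given predictor values."""
    return beta_hat @ np.array([1.0, age, female])

# Continuous predictor: one extra year of age, gender held fixed.
age_effect = predicted_mean_reduction(51, 1) - predicted_mean_reduction(50, 1)

# Two-level predictor: female versus male at the same age.
gender_effect = predicted_mean_reduction(50, 1) - predicted_mean_reduction(50, 0)
```

Each difference reproduces the corresponding coefficient regardless of the placeholder intercept or the age chosen.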
How to Interpret Beta Coefficients? • If $x_1$ is a categorical variable with $m$ levels $c_1, \dots, c_m$, then $m-1$ dummy variables $d_1, \dots, d_{m-1}$ with coefficients $\beta_1, \dots, \beta_{m-1}$ are included in the model that correspond to $x_1 = c_1, \dots, x_1 = c_{m-1}$. The $m$th level is the reference. Then $\hat{\beta}_j$ is interpreted as the estimated difference in mean response for the individuals with $x_1 = c_j$ and that for reference individuals. To see that, write $E(y \mid x_1 = c_j) - E(y \mid x_1 = c_m) = [\beta_0 + \beta_j + \text{other terms}] - [\beta_0 + \text{other terms}] = \beta_j$, since all dummy variables equal 0 for the reference level and the other terms are held fixed.
Example Suppose reduction in blood pressure is regressed on several predictors including ethnicity (White/Hispanic/Black/Asian/Other). Four dummy variables are entered into the model: White, Hispanic, Black, and Asian. “Other” is the reference category. The regression coefficient for White is estimated as 5.95. This means that the estimated average reduction in blood pressure for Whites is 5.95 units higher than that for Others, provided all the other variables are the same (same age, gender, etc.)
A Side Note • If $x$ is continuous, we typically talk about the same individual when interpreting $\hat{\beta}$: “age of a patient increases by one year”. • If $x$ is a categorical variable, we typically compare two individuals: “female v. male,” “White v. Other”.
SAS and SPSS Applications A study is conducted to test the efficacy of a new treatment in lowering high blood pressure. Patient ID, group (Cx/Tx), age (in years), gender (F/M), ethnicity (White/Hisp/Black/Asian/Other), and reduction in blood pressure (in mmHg) are recorded for 28 patients. The SAS code is here. The relevant SAS output is here. IBM SPSS 20 data file is here.