Chapter 2 - Simple and Multiple Regression

• 2.0 Introduction
• 2.1 Linear regression model
• 2.2 Example
• 2.3 Examining data using SAS
• 2.4 Multiple regression
• 2.5 Variable transformation
• 2.6 SAS PROC REG

2.0 Introduction
• This chapter covers a variety of topics about using SAS for regression.
• You will apply your knowledge of regression, combined with instruction on SAS, to perform, understand and interpret regression analyses.
• Simple and multiple regression.
• Supporting tasks that are important in preparing to analyze your data, e.g., data checking, getting familiar with your data file, and examining the distribution of your variables.

Introduction (2)
• We will illustrate the basics of simple and multiple regression and demonstrate the importance of inspecting, checking and verifying your data before accepting the results of your analysis.
• We hope to show that the results of your regression analysis can be misleading without further probing of your data, which could reveal relationships that a casual analysis could overlook.
• The data used in this topic is “C:\SASREG\”

2.1 A Linear Regression Model and Analysis
Univariate Regression
• A classic statistical problem is to determine the relationship between two random variables X and Y. For example, we might consider the height and weight of a sample of adults. Linear regression attempts to explain this relationship with a straight line fit to the data.
• The simplest linear regression model postulates that Y = a + bX + e, where Y is the response, X is the predictor and e is the "residual". e is a random variable with mean zero.
• The coefficients a and b are determined by the condition that the sum of the squared residuals is as small as possible.
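The least-squares condition on a and b has a standard closed-form solution. As a short sketch in the slide's notation (the sample means \bar{x} and \bar{y} and the hats on the estimates are notation introduced here, not taken from the slides):

```latex
\begin{aligned}
S(a,b) &= \sum_{i=1}^{n} \left( y_i - a - b\,x_i \right)^2 \\
\hat{b} &= \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
\hat{a} = \bar{y} - \hat{b}\,\bar{x}
\end{aligned}
```

Setting the partial derivatives of S with respect to a and b to zero gives these two expressions. The fitted line always passes through the point (\bar{x}, \bar{y}).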

The Multi-Variate Regression
• The data consist of n values, y1, y2,…,yn, of the dependent variable (response variable), y, derived from observations. The dependent variable is subject to error. This error is assumed to be a random variable with a mean of zero.
• The independent variable (explanatory variable or predictor), x, is assumed to be error-free. If this is not so, modeling should be done using errors-in-variables model techniques.
• In general there is more than one predictor, say K. Then K slope parameters, β1, β2,…, βK, are to be determined, together with the intercept β0. The model is a linear combination of these parameters: Yi = β0 + β1 X1i + β2 X2i + … + βK XKi + ei
• The elements of the matrix X, that is Xji, are constants or functions of the independent variable, xi, and ei is an observational error. Models which do not conform to this specification must be treated by nonlinear regression.

Model Fitting
• The first objective of regression analysis is to best-fit the data by adjusting the parameters of the model. Of the different criteria that can be used to define what constitutes a best fit, the least squares criterion is a very powerful one.
• The objective function, S, is defined as the sum of squared residuals, S = ∑i ri², where each residual ri is the difference between the observed value and the value calculated by the model: ri = yi − ∑j βj Xji (with X0i = 1 for the intercept).
• The best fit is obtained when S is minimized. Subject to certain conditions, the parameter estimates then have minimum variance (Gauss–Markov theorem) and may also represent a maximum likelihood solution to the optimization problem.
• In matrix notation, the resulting normal equations are written as (X′X) β̂ = X′y, and thus β̂ = (X′X)⁻¹ X′y when the matrix X′X is nonsingular. Details omitted.
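The omitted details can be sketched in a few lines. This assumes X is the n × (K+1) design matrix whose first column is all ones, y is the vector of responses, and β̂ denotes the least-squares estimate (notation introduced here for illustration):

```latex
\begin{aligned}
S(\beta) &= (y - X\beta)'(y - X\beta) \\
\frac{\partial S}{\partial \beta} &= -2\,X'(y - X\beta) = 0 \\
X'X\,\hat{\beta} &= X'y \\
\hat{\beta} &= (X'X)^{-1} X'y \quad \text{if } X'X \text{ is nonsingular}
\end{aligned}
```

X′X is singular when one predictor is an exact linear combination of the others. This is why collinearity diagnostics matter in practice.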

2.2 Example
• Let's dive right in and perform a regression analysis using the data set datamath, which has the variables math00, size_k3, meals and F_Teach.
• These measure the academic performance of the school (math00), the average class size in kindergarten through 3rd grade (size_k3), the percentage of students receiving free meals (meals), which is an indicator of poverty, and the percentage of teachers who have full teaching credentials (F_Teach).
• We expect that better academic performance would be associated with lower class size, fewer students receiving free meals, and a higher percentage of teachers having full teaching credentials.
• Below is the proc reg step for running this regression model, followed by the SAS output.

PROC REG
• It is a general-purpose procedure for regression, which performs linear regression with many diagnostic capabilities.
• It selects models using one of nine methods, produces scatter plots of raw data and statistics, highlights scatter plots to identify particular observations, and allows interactive changes in both the regression model and the data used to fit the model.

Functions
• multiple MODEL statements
• nine model-selection methods (full model, backward, forward, stepwise, …)
• interactive changes both in the model and the data used to fit the model (delete, add, etc.)
• linear equality restrictions on parameters
• tests of linear hypotheses and multivariate hypotheses
• collinearity diagnostics (TOL, VIF, COLLIN)
• predicted values, residuals, studentized residuals, confidence limits, and influence statistics
• correlation or crossproduct input
• requested statistics available for output through output data sets
• plots:
  - plot model fit summary statistics and diagnostic statistics
  - produce normal quantile-quantile (Q-Q) and probability-probability (P-P) plots for statistics such as residuals
  - specify special shorthand options to plot ridge traces, confidence intervals, and prediction intervals
  - display the fitted model equation, summary statistics, and reference lines on the plot
  - control the graphics appearance with PLOT statement options and with global graphics statements including the TITLE, FOOTNOTE, NOTE, SYMBOL, and LEGEND statements
  - "paint" or highlight line-printer scatter plots
  - produce partial regression leverage line-printer plots

Example 1: SAS Program
  proc reg data="c:\sasreg\datamath";
    model math00 = size_k3 meals F_Teach;
  run;
• Here reg is the SAS procedure that fits the regression model.

SAS OUTPUT
The SAS System
The REG Procedure
Model: MODEL1
Dependent Variable: math00 math 2000
Number of Observations Read                   400
Number of Observations Used                   313
Number of Observations with Missing Values     87

Analysis of Variance
Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               3           2634884         878295     213.41    <.0001
Error             309           1271713     4115.57673
Corrected Total   312           3906597
Root MSE           64.15276    R-Square    0.6745
Dependent Mean    596.40575    Adj R-Sq    0.6713
Coeff Var          10.75656
Adj R-Sq = 1 - [(n-1)/(n-k)] (1 - R2), where n is the number of observations used and k is the number of estimated parameters, including the intercept.
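The summary statistics in this table can be reproduced by hand from the sums of squares. The worked arithmetic below uses only the numbers printed in the output, with n = 313 observations used and k = 4 estimated parameters (intercept plus three slopes):

```latex
\begin{aligned}
R^2 &= \frac{SS_{\text{Model}}}{SS_{\text{Total}}} = \frac{2634884}{3906597} \approx 0.6745 \\
R^2_{\text{adj}} &= 1 - \frac{n-1}{n-k}\,(1 - R^2) = 1 - \frac{312}{309}\,(1 - 0.6745) \approx 0.6713 \\
F &= \frac{MS_{\text{Model}}}{MS_{\text{Error}}} = \frac{878295}{4115.57673} \approx 213.41 \\
\text{Root MSE} &= \sqrt{4115.57673} \approx 64.15
\end{aligned}
```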

Parameter Estimates
Variable     Label                     DF    Parameter Estimate    Standard Error    t Value    Pr > |t|
Intercept    Intercept                  1             906.73916          28.26505      32.08      <.0001
size_k3      avg class size k-3         1              -2.68151           1.39399      -1.92      0.0553
meals        pct free meals             1              -3.70242           0.15403     -24.04      <.0001
F_Teach      pct F_Teach credential     1               0.10861           0.09072       1.20      0.2321
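Read as an equation, the parameter estimates give the fitted model, and each t value is the estimate divided by its standard error. As an illustration, the prediction below uses the values size_k3 = 16, meals = 67, F_Teach = 76 and math00 = 693 from the first observation in the data listing. The rounding is ours.

```latex
\begin{aligned}
\widehat{\text{math00}} &= 906.73916 - 2.68151\,\text{size\_k3} - 3.70242\,\text{meals} + 0.10861\,\text{F\_Teach} \\
t_{\text{size\_k3}} &= \frac{-2.68151}{1.39399} \approx -1.92 \\
\widehat{\text{math00}}_1 &= 906.73916 - 2.68151(16) - 3.70242(67) + 0.10861(76) \approx 624.0 \\
r_1 &= 693 - 624.0 \approx 69.0
\end{aligned}
```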

Observations
1. The coefficient for average class size (size_k3, b=-2.68) is not significant at the 0.05 level (p=0.0553), but only just so. The coefficient is negative, which would indicate that larger class sizes are related to lower academic performance, which is what we would expect.
2. The effect of meals (b=-3.70, p<.0001) is significant and its coefficient is negative, indicating that the greater the proportion of students receiving free meals, the lower the academic performance. Please note that we are not saying that free meals are causing lower academic performance. The meals variable is highly related to income level and functions more as a proxy for poverty. Thus, higher levels of poverty are associated with lower academic performance. This result also makes sense.
3. The percentage of teachers with full credentials (F_Teach, b=0.11, p=.2321) seems to be unrelated to academic performance. This would seem to indicate that the percentage of teachers with full credentials is not an important factor in predicting academic performance. This result was somewhat unexpected.

2.3 Examining data using SAS
• proc contents can be used to examine the data file, for example how many observations it has and the names of the variables it contains.
  proc contents data="c:\sasreg\datamath";
  run;

The CONTENTS Procedure

Reading the Data
• There are 400 observations and 21 variables. The variables include academic performance in 2000 and 1999 and the change in performance (math00, math99 and growth, respectively).
• Also included are various characteristics of the schools, e.g., class size, parents' education, percent of teachers with full and emergency credentials, and number of students.
• Note that our original regression analysis reported 313 observations used, but the proc contents output indicates that we have 400 observations in the data file.
• If you want to learn more about the data file, use proc print to show some of the observations. For example, below we use proc print to show the first five observations.

Using Proc Print
  /* Print the first 5 observations for all 21 variables. */
  proc print data="c:\sasreg\datamath"(obs=5);
  run;
  /* Print the first 10 observations for 4 selected variables. */
  proc print data="c:\sasreg\datamath"(obs=10);
    var math00 size_k3 meals F_Teach;
  run;

The Data Used in the Regression
Obs    math00    size_k3    meals    F_Teach
  1       693         16       67         76
  2       570         15       92         79
  3       546         17       97         68
  4       571         20       90         87
  5       478         18       89         87
  6       858         20        .        100
  7       918         19        .        100
  8       831         20        .         96
  9       860         20        .        100
 10       737         21       29         96

Observations
• The meals value is missing for 4 of these 10 records. Missing values like these are probably the reason that only 313 of the 400 records were used in the regression model.
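One way to check this suspicion directly is to count the missing values per variable and the number of complete cases. The sketch below assumes the same data file path used above. The counter name complete is introduced only for illustration.

```sas
/* N and NMISS report the non-missing and missing counts for each variable. */
proc means data="c:\sasreg\datamath" n nmiss;
  var math00 size_k3 meals F_Teach;
run;

/* Count the observations with no missing value in any model variable. */
data _null_;
  set "c:\sasreg\datamath" end=last;
  /* NMISS() returns how many of its numeric arguments are missing. */
  if nmiss(math00, size_k3, meals, F_Teach) = 0 then complete + 1;
  if last then put "Complete cases: " complete;
run;
```

PROC REG drops any observation with a missing value in a model variable, so the complete-case count written to the log should agree with the "Number of Observations Used" line in the regression output.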

PROC MEANS
• The MEANS procedure provides data summarization tools to compute descriptive statistics for variables across all observations and within groups of observations. For example, PROC MEANS
  - calculates descriptive statistics based on moments
  - estimates quantiles, which includes the median
  - calculates confidence limits for the mean
  - identifies extreme values
  - performs a t test.

Proc MEANS Program
• Here we use proc means to learn more about the variables math00, size_k3, meals, and F_Teach.
  proc means data="c:\sasreg\datamath";
    var math00 size_k3 meals F_Teach;
  run;

SAS OUTPUT

Observations
• math00 scores don't have any missing values (the N is 400) and the scores range from 369 to 940. This makes sense since the math scores can range from 200 to 1000.
• Average class size (size_k3) has 398 valid values ranging from -21 to 25, and 2 are missing. It seems odd for a class size to be -21.
• The percent receiving free meals (meals) ranges from 6 to 100, but there are only 315 valid values (85 are missing). This seems like a large number of missing values.
• The percent with full credentials (F_Teach) ranges from .42 to 100 with none missing.
• Since math00 and F_Teach have no missing values, the 85 missing meals values and the 2 missing size_k3 values together account for the 87 observations that PROC REG dropped.

PROC FREQ
• The FREQ procedure produces one-way to n-way frequency and crosstabulation (contingency) tables. For two-way tables, PROC FREQ computes tests and measures of association. For n-way tables, PROC FREQ does stratified analysis, computing statistics within, as well as across, strata. Frequencies and statistics can also be output to SAS data sets.
• For one-way frequency tables, PROC FREQ can compute statistics to test for equal proportions, specified proportions, or the binomial proportion. For contingency tables, PROC FREQ can compute various statistics to examine the relationships between two classification variables, adjusting for any stratification variables. PROC FREQ automatically displays the output in a report and can also save the output in a SAS data set.
• For some pairs of variables, you may want to examine the existence or the strength of any association between the variables. To determine if an association exists, chi-square tests are computed. To estimate the strength of an association, PROC FREQ computes measures of association that tend to be close to zero when there is no association and close to the maximum (or minimum) value when there is perfect association.
• The statistics for contingency tables include
  - chi-square tests and measures
  - measures of association
  - risks (binomial proportions) and risk differences for 2×2 tables
  - odds ratios and relative risks for 2×2 tables
  - tests for trend
  - tests and measures of agreement
  - Cochran-Mantel-Haenszel statistics

Freq Program
• Use proc freq to learn more about any categorical variables, such as yr_rnd, as shown below.
  proc freq data="c:\sasreg\datamath";
    tables yr_rnd;
  run;

PROC UNIVARIATE
• The UNIVARIATE procedure provides the following:
  - descriptive statistics based on moments (including skewness and kurtosis), quantiles or percentiles (such as the median), frequency tables, and extreme values
  - histograms and comparative histograms. Optionally, these can be fitted with probability density curves for various distributions and with kernel density estimates.
  - quantile-quantile plots (Q-Q plots) and probability plots. These plots facilitate the comparison of a data distribution with various theoretical distributions.
  - goodness-of-fit tests for a variety of distributions including the normal
  - the ability to inset summary statistics on plots produced on a graphics device
  - the ability to analyze data sets with a frequency variable
  - the ability to create output data sets containing summary statistics, histogram intervals, and parameters of fitted curves

Study Data Distribution
• The variable yr_rnd is coded 0=No (not year round) and 1=Yes (year round). Of the 400 schools, 308 are non-year round and 92 are year round, and none are missing.
• The above commands have uncovered a number of peculiarities worthy of further examination. For example, let us look further into the average class size by getting more detailed summary statistics for size_k3 using proc univariate.
  proc univariate data="c:\sasreg\datamath";
    var size_k3;
  run;

Basic Information

Checking Typos
• Looking in the section labeled Extreme Observations, we see that some of the class sizes are -21 and -20, so it seems as though some of the class sizes somehow became negative, as though a negative sign was incorrectly typed in front of them.
• Let's do a proc freq for class size to see if this seems plausible.
  proc freq data="c:\sasreg\datamath";
    tables size_k3;
  run;

The FREQ Procedure Frequency Missing = 2

Data Quality Control
• It seems that some of the class sizes somehow got negative signs put in front of them. Let's look at the school and district number for these observations to see if they come from the same district.
  proc print data="c:\sasreg\datamath";
    where (size_k3 < 0);
    var snum dnum size_k3;
  run;
• Indeed, they all come from district 140.

Missing Value with .
• Notice that when we looked at the observations where (size_k3 < 0), this also included observations where size_k3 is missing (represented as a period). SAS treats a numeric missing value (.) as smaller than any number, so it compares as less than 0.
• To be more precise, the above command should exclude such observations, like this:
  proc print data="c:\sasreg\datamath";
    where (size_k3 < 0) and (size_k3 ^= .);
    var snum dnum size_k3;
  run;
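The comparison behaviour can be seen in isolation with a tiny DATA step that reads no data. This is only an illustrative sketch. The variable x is hypothetical, and MISSING() is shown as an alternative to comparing against a period.

```sas
data _null_;
  x = .;  /* a numeric missing value */
  /* A missing value sorts below every number, so this condition is true. */
  if x < 0 then put "x < 0 is true for a missing value";
  /* MISSING() tests for missingness explicitly. */
  if missing(x) then put "missing(x) is true";
run;
```

Both messages are written to the log. In a WHERE statement the same idea reads `where size_k3 < 0 and not missing(size_k3);`.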

Further Checking Schools with dnum=140
• It seems that district 140 has a data entry problem, so we print all of the observations from this district.
  proc print data="c:\sasreg\datamath";
    where (dnum = 140);
    var snum dnum size_k3;
  run;

Note
• In fact, this typo was made up for this lecture; the original data was correct. Real data errors are usually more complex, harder to spot and less predictable.
• Data quality should be checked before we analyze the data.
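If the negative signs really are data-entry errors, one possible repair is to take the absolute value of the affected class sizes in a copy of the data. The sketch below is an assumption about how such a fix could be done, not the procedure used in the lecture. The data set name datamath_fixed is hypothetical.

```sas
/* Hypothetical clean-up step: write a corrected copy to WORK.datamath_fixed
   and leave the original file untouched. */
data datamath_fixed;
  set "c:\sasreg\datamath";
  /* Flip the sign only for non-missing negative class sizes;
     missing values (.) stay missing. */
  if not missing(size_k3) and size_k3 < 0 then size_k3 = abs(size_k3);
run;

/* Re-check the range of the corrected variable. */
proc means data=datamath_fixed n nmiss min max;
  var size_k3;
run;
```

Writing to a new data set keeps the original file available for comparison in case the assumption about the sign error turns out to be wrong.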

Examine Data by Graph
• A histogram is a graphical display of tabulated frequencies. It shows what proportion of cases fall into each of several categories. A histogram differs from a bar chart in that it is the area of the bar that denotes the value, not the height, a crucial distinction when the categories are not of uniform width (Lancaster, 1974). The categories are usually specified as non-overlapping intervals of some variable. The categories (bars) must be adjacent.
• Now, we show a histogram for size_k3. This shows us the observations where the average class size is negative.
  proc univariate data="c:\sasreg\datamath";
    var size_k3;
    histogram / cfill=gray;
  run;

Box Plot & Stem Leaf Plot
• Likewise, a boxplot and stem-and-leaf plot would have called these observations to our attention as well. In SAS you can use the plot option with proc univariate to request a boxplot and stem-and-leaf plot.
• Below we show just the combined boxplot and stem-and-leaf plot from this output. You can see the outlying negative observations way at the bottom of the boxplot.
  proc univariate data="c:\sasreg\datamath" plot;
    var size_k3;
  run;

The UNIVARIATE Procedure Variable: size_k3 (avg class size k-3)

Data Checking (cont)
• It is necessary to plot all of these graphs for the variables you will be analyzing. Due to space considerations, we omit these graphs for most of the variables.
• Here, in examining the variables, the stem-and-leaf plot for F_Teach seemed rather unusual. Up to now, we have not seen anything problematic with this variable, but look at the stem-and-leaf plot for F_Teach below. It shows 104 observations where the percent with a full credential is much lower than in all other observations. This is over 25% of the schools and seems very unusual.
  proc univariate data="c:\sasreg\datamath" plot;
    var F_Teach;
  run;

The UNIVARIATE Procedure Variable: F_Teach (pct F_Teach credential)
