Regression for Data Mining Mgt. 2206 – Introduction to Analytics Matthew Liberatore Thomas Coghlan
Learning Objectives
• To understand the application of regression analysis in data mining
  • Linear/nonlinear
  • Logistic (Logit)
• To understand the key statistical measures of fit
• To learn how to run and interpret regression analyses using SAS Enterprise Miner software
Analysis of Association
In business problems, interest often goes beyond the statistical testing of differences (e.g., female versus male preferences). We are often interested in the degree of association between variables. Regression is one of the techniques that help uncover those relationships.
Linear Regression Analysis
• Analysis of the strength of the linear relationship between predictor (independent) variables and outcome (dependent/criterion) variables.
• In two dimensions (one predictor, one outcome variable) data can be plotted on a scatter diagram.
$E(y) = \beta_0 + \beta_1 x$
where $E(y)$ is the expected value of y (outcome), $\beta_0$ is the intercept term, and $\beta_1$ is the predictor variable coefficient.
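Estimation Process
• Regression Model: $y = \beta_0 + \beta_1 x + \varepsilon$
• Regression Equation: $E(y) = \beta_0 + \beta_1 x$
• Unknown Parameters: $\beta_0, \beta_1$
• Sample Data: pairs $(x_1, y_1), \ldots, (x_n, y_n)$
• Sample Statistics: $b_0, b_1$
• Estimated Regression Equation: $\hat{y} = b_0 + b_1 x$
• $b_0$ and $b_1$ provide estimates of $\beta_0$ and $\beta_1$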
Simple Linear Regression Equation: Positive Linear Relationship
[Figure: regression line of E(y) (outcome) against x (predictor), with intercept $\beta_0$; slope $\beta_1$ is positive]
Simple Linear Regression Equation: Negative Linear Relationship
[Figure: regression line of E(y) (outcome) against x (predictor), with intercept $\beta_0$; slope $\beta_1$ is negative]
Simple Linear Regression Equation: No Relationship
[Figure: scatter plot of E(y) (outcome) against x (predictor) showing no pattern]
Simple Linear Regression Equation: No Relationship
[Figure: horizontal regression line of E(y) against x, with intercept $\beta_0$; slope $\beta_1$ is 0]
Simple Linear Regression Equation: Parabolic Relationship
[Figure: scatter plot of E(y) (outcome) against x (predictor) following a parabolic curve, with intercept $\beta_0$]
Example
• List the variables we have
• Determine a dependent variable (DV) of interest
• Is there a way to predict the DV?
Least Squares Method
• Least Squares Criterion: minimize error (distance between actual data & estimated line)
$\min \sum (y_i - \hat{y}_i)^2$
where:
$y_i$ = observed value of the dependent variable for the ith observation
$\hat{y}_i$ = estimated value of the dependent variable for the ith observation
Least Squares Method
• Slope for the Estimated Regression Equation
$b_1 = \dfrac{\sum x_i y_i - (\sum x_i \sum y_i)/n}{\sum x_i^2 - (\sum x_i)^2/n}$
Least Squares Method
• y-Intercept for the Estimated Regression Equation
$b_0 = \bar{y} - b_1 \bar{x}$
where:
$x_i$ = value of independent variable for ith observation
$y_i$ = value of dependent variable for ith observation
$\bar{x}$ = mean value for independent variable
$\bar{y}$ = mean value for dependent variable
$n$ = total number of observations
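The slope and intercept calculation can be sketched in a few lines of Python. This is a minimal illustration that assumes the standard computational form of the least squares formulas; the function name `least_squares` and the four-point toy data set are hypothetical and are not part of the course material.

```python
def least_squares(x, y):
    """Return (b0, b1) for the estimated line y_hat = b0 + b1 * x."""
    n = len(x)
    sum_x = sum(x)
    sum_y = sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi * xi for xi in x)

    # Slope: (sum xy - (sum x)(sum y)/n) / (sum x^2 - (sum x)^2/n)
    b1 = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)
    # Intercept: y_bar - b1 * x_bar
    b0 = sum_y / n - b1 * sum_x / n
    return b0, b1


# Hypothetical toy data that lie exactly on the line y = 1 + 2x
x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 9.0]

b0, b1 = least_squares(x, y)
print(f"b0 = {b0:.2f}, b1 = {b1:.2f}")
```

Because the toy points fall exactly on a line, the fitted intercept and slope recover that line; with real data the fitted line only minimizes the squared error.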
Least Squares Estimation Procedure
• Least Squares Criterion: The sum of the squared vertical deviations (y axis) of the points from the line is minimal.
[Figure: actual data points and the predicted line]
Example: Kwatts vs. Temp
Temp    Kwatts
59.2     9,730
61.9     9,750
55.1    10,180
66.2    10,230
52.1    10,800
69.9    11,160
46.8    12,530
76.8    13,910
79.7    15,110
79.3    15,690
80.2    17,020
83.3    17,880
Example Results Let X = Temp, Y = Kwatts Y = 319.04 + 185.27 X
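The example result can be reproduced outside of Excel or SAS. The sketch below uses plain Python and the deviation form of the least squares formulas (an assumption on our part; it is algebraically equivalent to the computational form) on the twelve Temp/Kwatts observations from the example. The variable names and the query temperature of 70 are illustrative only.

```python
# Temp / Kwatts observations from the example slide
temp = [59.2, 61.9, 55.1, 66.2, 52.1, 69.9, 46.8, 76.8, 79.7, 79.3, 80.2, 83.3]
kwatts = [9730, 9750, 10180, 10230, 10800, 11160,
          12530, 13910, 15110, 15690, 17020, 17880]

n = len(temp)
x_bar = sum(temp) / n
y_bar = sum(kwatts) / n

# Deviation form: b1 = S_xy / S_xx, b0 = y_bar - b1 * x_bar
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(temp, kwatts))
s_xx = sum((x - x_bar) ** 2 for x in temp)
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

print(f"Y = {b0:.2f} + {b1:.2f} X")

# Illustrative prediction for a hypothetical temperature of 70
predicted = b0 + b1 * 70.0
print(f"Predicted Kwatts at Temp = 70: {predicted:.0f}")
```

The fitted coefficients agree with the equation on the slide, and the same coefficients can then be used to predict Kwatts for any temperature within the observed range.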
Coefficient of Determination
• How “strong” is the relationship between predictor & outcome? (Fraction of the observed variance of the outcome variable explained by the predictor variables.)
• Relationship Among SST, SSR, SSE:
SST = SSR + SSE
where:
SST = total sum of squares
SSR = sum of squares due to regression
SSE = sum of squares due to error
Coefficient of Determination ($r^2$)
$r^2 = \mathrm{SSR}/\mathrm{SST}$
where:
SSR = sum of squares due to regression
SST = total sum of squares
Kwatts vs. Temp Example
             df   SS
Regression    1   58784708.31
Residual     10   38696916.69
Total        11   97481625
$r^2$ = 0.603033734
Does the linear regression provide a good fit?
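The sums of squares and the coefficient of determination can be checked directly from the data. The following is a minimal Python sketch, assuming the usual definitions SST = Σ(y − ȳ)², SSR = Σ(ŷ − ȳ)², and SSE = Σ(y − ŷ)²; the variable names are our own.

```python
temp = [59.2, 61.9, 55.1, 66.2, 52.1, 69.9, 46.8, 76.8, 79.7, 79.3, 80.2, 83.3]
kwatts = [9730, 9750, 10180, 10230, 10800, 11160,
          12530, 13910, 15110, 15690, 17020, 17880]

n = len(temp)
x_bar = sum(temp) / n
y_bar = sum(kwatts) / n

# Least squares fit
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(temp, kwatts))
s_xx = sum((x - x_bar) ** 2 for x in temp)
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * x for x in temp]

# Decomposition of the total variation
sst = sum((y - y_bar) ** 2 for y in kwatts)               # total
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)              # explained by regression
sse = sum((y - yh) ** 2 for y, yh in zip(kwatts, y_hat))  # residual (error)

r2 = ssr / sst
print(f"SST = {sst:.2f}, SSR = {ssr:.2f}, SSE = {sse:.2f}")
print(f"r2 = {r2:.4f}")
```

An r2 of about 0.60 means that roughly 60% of the variation in Kwatts is explained by the linear relationship with Temp; the remaining 40% is left in the residuals.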
Assumptions About the Error Term $\varepsilon$
1. The error $\varepsilon$ is a random variable with mean of zero.
2. The variance of $\varepsilon$, denoted by $\sigma^2$, is the same for all values of the independent variable.
3. The values of $\varepsilon$ are independent.
4. The error $\varepsilon$ is a normally distributed random variable.
Significance Test for Regression
Is the value of $\beta_1$ zero?
Two tests are commonly used: the t Test and the F Test.
Both the t test and the F test require an estimate of the variance ($\sigma^2$) of the error ($\varepsilon$). As in most of our statistical work, we are working with a sample, not the population, so we use mean square error ($s^2$).
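Testing for Significance
• An Estimate of $\sigma^2$
$s^2 = \mathrm{MSE} = \mathrm{SSE}/(n - 2)$
where:
$\mathrm{SSE} = \sum (y_i - \hat{y}_i)^2$
Testing for Significance
• An Estimate of $\sigma$
• To estimate $\sigma$ we take the square root of $s^2$.
• The resulting $s$ is called the standard error of the estimate.
$s = \sqrt{\mathrm{MSE}} = \sqrt{\mathrm{SSE}/(n - 2)}$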
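As an illustration, the mean square error and the standard error of the estimate can be computed for the Kwatts vs. Temp data. This is a sketch in plain Python under the definitions above (MSE = SSE/(n − 2), s = √MSE); the variable names are our own choices.

```python
import math

temp = [59.2, 61.9, 55.1, 66.2, 52.1, 69.9, 46.8, 76.8, 79.7, 79.3, 80.2, 83.3]
kwatts = [9730, 9750, 10180, 10230, 10800, 11160,
          12530, 13910, 15110, 15690, 17020, 17880]

n = len(temp)
x_bar = sum(temp) / n
y_bar = sum(kwatts) / n

# Least squares fit
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(temp, kwatts))
      / sum((x - x_bar) ** 2 for x in temp))
b0 = y_bar - b1 * x_bar

# SSE, then MSE with n - 2 degrees of freedom (two estimated parameters)
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(temp, kwatts))
mse = sse / (n - 2)
s = math.sqrt(mse)  # standard error of the estimate

print(f"MSE = {mse:.2f}")
print(f"s   = {s:.2f}")
```

The divisor is n − 2 rather than n because two parameters (intercept and slope) are estimated from the sample.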
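Testing for Significance: t Test
• Hypotheses: $H_0: \beta_1 = 0$ (coefficient is 0: no relationship between predictor & outcome) versus $H_a: \beta_1 \neq 0$
• Calculating t Statistic: $t = b_1 / s_{b_1}$, where $s_{b_1}$ is the standard error of $b_1$
Testing for Significance: t Test
1. Determine the hypotheses: $H_0: \beta_1 = 0$, $H_a: \beta_1 \neq 0$.
2. Specify the level of significance: $\alpha = .05$.
3. Select the test statistic: $t = b_1 / s_{b_1}$.
4. State the rejection rule: reject $H_0$ if p-value < .05 or |t| > 3.182 (with 3 degrees of freedom; with the 10 degrees of freedom of the Kwatts example the critical value is 2.228).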
Alternative Test: F Test
• Same Hypotheses: $H_0: \beta_1 = 0$, $H_a: \beta_1 \neq 0$
• Different Test Statistic: F = MSR/MSE
Testing for Significance: F Test
• Reject $H_0$ if: p-value < $\alpha$ or $F > F_\alpha$
where F = MSR/MSE and $F_\alpha$ is based on an F distribution with 1 degree of freedom in the numerator and n - 2 degrees of freedom in the denominator
Testing for Significance: F Test
1. Determine the hypotheses: $H_0: \beta_1 = 0$, $H_a: \beta_1 \neq 0$.
2. Specify the level of significance: $\alpha = .05$.
3. Select the test statistic: F = MSR/MSE.
4. State the rejection rule: reject $H_0$ if p-value < .05 or F > 10.13 (with 1 d.f. in numerator and 3 d.f. in denominator; with 1 and 10 d.f., as in the Kwatts example, the critical value is 4.96).
Standard Error of the Estimate
• The standard error of the estimate ($s_{est}$) has properties analogous to those of the standard deviation.
• How “good” is our “fit”?
• Interpretation is similar:
  • ~68% of outcomes fall within one $s_{est}$ of the predicted value.
  • ~95% of outcomes fall within two $s_{est}$ of the predicted value.
Kwatts vs. Temp Example
ANOVA
             df   SS            MS            F       Significance F
Regression    1   58784708.31   58784708.31   15.19   0.002972726
Residual     10   38696916.69   3869691.669
Total        11   97481625

             Coefficients   Standard Error   t Stat        P-value
Intercept    319.0414124    3260.412811      0.097853073   0.923982528
Temp         185.2702073    47.53479059      3.897570706   0.002972726

Is the regression model statistically significant? Is the coefficient of Temp significant?
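The test statistics in this output can be reproduced by hand. The sketch below is a plain-Python illustration, not the procedure Excel or SAS uses internally: it computes the t and F statistics for the Temp coefficient and approximates the two-sided p-value by integrating the Student t density with Simpson's rule. The helper names `t_pdf` and `two_sided_p_value` are hypothetical.

```python
import math

temp = [59.2, 61.9, 55.1, 66.2, 52.1, 69.9, 46.8, 76.8, 79.7, 79.3, 80.2, 83.3]
kwatts = [9730, 9750, 10180, 10230, 10800, 11160,
          12530, 13910, 15110, 15690, 17020, 17880]

n = len(temp)
x_bar = sum(temp) / n
y_bar = sum(kwatts) / n
s_xx = sum((x - x_bar) ** 2 for x in temp)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(temp, kwatts))
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(temp, kwatts))
ssr = sum((b0 + b1 * x - y_bar) ** 2 for x in temp)
df_error = n - 2
mse = sse / df_error
msr = ssr / 1              # one predictor, so 1 numerator degree of freedom

s_b1 = math.sqrt(mse / s_xx)   # standard error of the slope
t_stat = b1 / s_b1
f_stat = msr / mse             # equals t_stat squared in simple regression


def t_pdf(t, df):
    """Density of the Student t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + t * t / df) ** (-(df + 1) / 2)


def two_sided_p_value(t_value, df, steps=2000):
    """Two-sided p-value via Simpson's rule on [0, |t|]; steps must be even."""
    upper = abs(t_value)
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3       # P(0 < T < |t|)
    return 2 * (0.5 - area)


p_value = two_sided_p_value(t_stat, df_error)
print(f"t = {t_stat:.4f}, F = {f_stat:.2f}, p-value = {p_value:.6f}")
```

With one predictor the F test and the t test are equivalent (F = t²), which is why the "Significance F" and the P-value for Temp coincide in the output. Since the p-value is below .05, both tests reject the hypothesis that the Temp coefficient is zero.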
Cautions about Interpreting Significance Tests
• Statistical significance does not by itself establish that the relationship between x and y is linear.
• A relationship between x and y does not mean a cause-and-effect relationship is present between x and y.
SAS Enterprise Miner
• These results can be obtained using Excel or using a data mining package such as SAS Enterprise Miner 5.3
• Using SAS Enterprise Miner requires the following steps:
  • Convert your data (usually in an Excel file) into a SAS data file using SAS 9.1
  • Create a project in Enterprise Miner
  • Within the project:
    • Create a data source using your SAS data file
    • Create a diagram that includes a data node, a regression node, and a multiplot node for graphs
    • Run the model in the diagram and review the results
Creating a SAS data file from an Excel file: open SAS 9.1. Select File then Import Data
This opens the import wizard. Since the source file is from Excel, click Next. Then click Browse to find the TempKWatts.xls file
Since the data are on sheet1$, click Next. Then enter SASUSER as the Library and TEMPKILOWATTL as the Member. Then click Next
Open SAS Enterprise Miner 5.3. Enter the user name and password provided
The Create New Project dialog box appears. Select the General tab, then type the short name of the project, e.g., KWattTemp0. Keep the default path.
In the Startup code tab, enter:
libname Ktemps "C:\Documents and Settings\mliberat\My Documents\My SAS Files\9.1\EM_Projects";
This code will be run each time you open the project.
Right-click on Data Source, opening the wizard. Source is SAS table, so click Next
Browse the SAS libraries to find the SAS table Tempkilowattl found in the SASuser Library (previously created)
Click Next twice. Note that the Table properties shows that we have two variables with 12 observations
The next step controls how Enterprise Miner organizes metadata for the variables in your data. Select Advanced, then click Next (you can view or change the settings if you click Customize before clicking Next).
Change Role of KWatts to target (outcome variable); change Level of both KWatts and Temp to interval (continuous values); then click Next (Other levels are possible, such as binary). You can click on Explore if you wish to look at some basic stats – we will do this later
Here Role relates to the role of the data set (raw, train, validate, score); raw is fine for our analysis of data, so click Finish