## Regression for Data Mining


**Regression for Data Mining**
Mgt. 2206 – Introduction to Analytics
Matthew Liberatore, Thomas Coghlan

**Learning Objectives**
• To understand the application of regression analysis in data mining
  • Linear/nonlinear
  • Logistic (Logit)
• To understand the key statistical measures of fit
• To learn how to run and interpret regression analyses using SAS Enterprise Miner software

**Analysis of Association**
In business problems, interest often goes beyond the statistical testing of differences (e.g., female versus male preferences) to the degree of association between variables. Regression is one of the techniques that help uncover those relationships.

**Linear Regression Analysis**
• Analysis of the strength of the linear relationship between predictor (independent) variables and outcome (dependent/criterion) variables.
• In two dimensions (one predictor, one outcome variable), the data can be plotted on a scatter diagram.

$$E(y) = \beta_0 + \beta_1 x$$

where $E(y)$ is the expected value of $y$ (the outcome), $\beta_0$ is the intercept term, and $\beta_1$ is the predictor variable coefficient.

**Sample Data:**
x y
x1 y1
. . . .
xn yn

Estimation Process (the sample data are used to estimate the unknown parameters):
• Regression Model: $y = \beta_0 + \beta_1 x + \varepsilon$
• Regression Equation: $E(y) = \beta_0 + \beta_1 x$
• Unknown Parameters: $\beta_0$, $\beta_1$
• Estimated Regression Equation: $\hat{y} = b_0 + b_1 x$, with Sample Statistics $b_0$, $b_1$
• $b_0$ and $b_1$ provide estimates of $\beta_0$ and $\beta_1$

**Simple Linear Regression Equation: Positive Linear Relationship**
[Figure: regression line of $E(y)$ (outcome) against $x$ (predictor); intercept $\beta_0$; slope $\beta_1$ is positive.]

**Simple Linear Regression Equation: Negative Linear Relationship**
[Figure: regression line of $E(y)$ (outcome) against $x$ (predictor); intercept $\beta_0$; slope $\beta_1$ is negative.]

**Simple Linear Regression Equation: No Relationship**
[Figure: scatter plot of $E(y)$ (outcome) against $x$ (predictor) showing no relationship; the regression line is horizontal, with intercept $\beta_0$ and slope $\beta_1$ equal to 0.]

**Simple Linear Regression Equation: Parabolic Relationship**
[Figure: scatter plot of $E(y)$ (outcome) against $x$ (predictor) following a parabolic pattern; intercept $\beta_0$.]

**Example**
• List the variables we have
• Determine a DV (dependent variable) of interest
• Is there a way to predict the DV?

**Least Squares Method**
• Least Squares Criterion: minimize error (distance between the actual data and the estimated line)

$$\min \sum (y_i - \hat{y}_i)^2$$

where:
$y_i$ = observed value of the dependent variable for the $i$th observation
$\hat{y}_i$ = estimated value of the dependent variable for the $i$th observation

**Least Squares Method**
• Slope for the Estimated Regression Equation

$$b_1 = \frac{\sum x_i y_i - \left(\sum x_i \sum y_i\right)/n}{\sum x_i^2 - \left(\sum x_i\right)^2/n}$$

• y-Intercept for the Estimated Regression Equation

$$b_0 = \bar{y} - b_1 \bar{x}$$

where:
$x_i$ = value of the independent variable for the $i$th observation
$y_i$ = value of the dependent variable for the $i$th observation
$\bar{x}$ = mean value of the independent variable
$\bar{y}$ = mean value of the dependent variable
$n$ = total number of observations

**Least Squares Estimation Procedure**
• Least Squares Criterion: the sum of the squared vertical deviations (y axis) of the points from the line is minimal.
[Figure: actual data points and the predicted line.]

**Example: Kwatts vs.
Temp**

| Temp | Kwatts |
|------|--------|
| 59.2 | 9,730 |
| 61.9 | 9,750 |
| 55.1 | 10,180 |
| 66.2 | 10,230 |
| 52.1 | 10,800 |
| 69.9 | 11,160 |
| 46.8 | 12,530 |
| 76.8 | 13,910 |
| 79.7 | 15,110 |
| 79.3 | 15,690 |
| 80.2 | 17,020 |
| 83.3 | 17,880 |

**Example Results**
Let X = Temp, Y = Kwatts
Y = 319.04 + 185.27 X

**Coefficient of Determination**
• How "strong" is the relationship between predictor and outcome? (Fraction of the observed variance of the outcome variable explained by the predictor variables.)
• Relationship among SST, SSR, SSE:

$$SST = SSR + SSE$$

where:
SST = total sum of squares
SSR = sum of squares due to regression
SSE = sum of squares due to error

**Coefficient of Determination ($r^2$)**

$$r^2 = SSR/SST$$

where:
SSR = sum of squares due to regression
SST = total sum of squares

**Kwatts vs. Temp Example**

| | df | SS |
|---|---|---|
| Regression | 1 | 58784708.31 |
| Residual | 10 | 38696916.69 |
| Total | 11 | 97481625 |

$r^2$ = 0.603033734
Does the linear regression provide a good fit?

**Assumptions About the Error Term $\varepsilon$**
1. The error $\varepsilon$ is a random variable with a mean of zero.
2. The variance of $\varepsilon$, denoted by $\sigma^2$, is the same for all values of the independent variable.
3. The values of $\varepsilon$ are independent.
4. The error $\varepsilon$ is a normally distributed random variable.

**Significance Test for Regression**
Is the value of $\beta_1$ zero? Two tests are commonly used: the t test and the F test. Both the t test and the F test require an estimate of the variance ($\sigma^2$) of the error ($\varepsilon$). As in most of our statistical work, we are working with a sample, not the population, so we use the mean square error ($s^2$).

**Testing for Significance**
• An Estimate of $\sigma^2$

$$s^2 = MSE = SSE/(n - 2)$$

where $SSE = \sum (y_i - \hat{y}_i)^2$.

**Testing for Significance**
• An Estimate of $\sigma$
• To estimate $\sigma$ we take the square root of $s^2$.
• The resulting $s$ is called the standard error of the estimate.

**Testing for Significance: t Test**
• Hypotheses: $H_0$: the coefficient $\beta_1$ is 0 (no relationship between predictor and outcome); $H_a$: $\beta_1 \neq 0$
• Calculating the t Statistic: $t = b_1 / s_{b_1}$, where $s_{b_1}$ is the standard error of $b_1$

**Testing for Significance: t Test**
1. Determine if $\beta_1 = 0$ ($H_0: \beta_1 = 0$ versus $H_a: \beta_1 \neq 0$).
2. Specify the level of significance. $\alpha = .05$
3. Select the test statistic. $t = b_1 / s_{b_1}$
4. State the rejection rule.
Reject $H_0$ if p-value < .05 or |t| > 3.182 (with 3 degrees of freedom)

**Alternative Test: F Test**
• Same Hypotheses: $H_0: \beta_1 = 0$, $H_a: \beta_1 \neq 0$
• Different Test Statistic: $F = MSR/MSE$

**Testing for Significance: F Test**
• Reject $H_0$ if: p-value < $\alpha$ or $F > F_\alpha$

$$F = MSR/MSE$$

where $F_\alpha$ is based on an F distribution with 1 degree of freedom in the numerator and $n - 2$ degrees of freedom in the denominator.

**Testing for Significance: F Test**
1. Determine if $\beta_1 = 0$ ($H_0: \beta_1 = 0$ versus $H_a: \beta_1 \neq 0$).
2. Specify the level of significance. $\alpha = .05$
3. Select the test statistic. $F = MSR/MSE$
4. State the rejection rule. Reject $H_0$ if p-value < .05 or F > 10.13 (with 1 d.f. in the numerator and 3 d.f. in the denominator)

**Standard Error of the Estimate**
• The standard error of the estimate has properties analogous to those of the standard deviation.
• How "good" is our "fit"?
• Interpretation is similar:
  • ~68% of outcomes/predictions fall within one $s_{est}$.
  • ~95% of outcomes/predictions fall within two $s_{est}$.

**Kwatts vs. Temp Example**

ANOVA

| | df | SS | MS | F | Significance F |
|---|---|---|---|---|---|
| Regression | 1 | 58784708.31 | 58784708.31 | 15.19 | 0.002972726 |
| Residual | 10 | 38696916.69 | 3869691.669 | | |
| Total | 11 | 97481625 | | | |

| | Coefficients | Standard Error | t Stat | P-value |
|---|---|---|---|---|
| Intercept | 319.0414124 | 3260.412811 | 0.097853073 | 0.923982528 |
| Temp | 185.2702073 | 47.53479059 | 3.897570706 | 0.002972726 |

Is the regression model statistically significant? Is the coefficient of Temp significant?

**Cautions about Interpreting Significance Tests**
• Statistical significance does not by itself mean that the relationship between x and y is linear.
• A relationship between x and y does not mean a cause-and-effect relationship is present between x and y.

**SAS Enterprise Miner**
• These results can be obtained using Excel or using a data mining package such as SAS Enterprise Miner 5.3
• Using SAS Enterprise Miner requires the following steps:
  • Convert your data (usually in an Excel file) into a SAS data file using SAS 9.1
  • Create a project in Enterprise Miner
  • Within the project:
    • Create a data source using your SAS data file
    • Create a diagram that includes a data node, a regression node, and a multiplot node for graphs
    • Run the model in the diagram and review the results

Creating a SAS data file from an Excel file: open SAS 9.1. Select File, then Import Data.

This opens the import wizard. Since the source file is from Excel, click Next. Then click Browse to find the TempKWatts.xls file.

Since the data are on sheet1$, click Next. Then enter SASUSER as the Library and TEMPKILOWATTL as the Member. Then click Next.

Open SAS Enterprise Miner 5.3. Enter the user name and password provided.

The Create New Project dialog box appears. Select the General tab, then type the short name of the project, e.g., KWattTemp0. Keep the default path.

In the Startup code tab, enter: `libname Ktemps "C:\Documents and Settings\mliberat\My Documents\My SAS Files\9.1\EM_Projects";` This code will be run each time you open the project.

Right-click on Data Source, opening the wizard. The source is a SAS table, so click Next.

Browse the SAS libraries to find the SAS table Tempkilowattl in the SASUSER library (previously created).

Click Next twice. Note that the Table properties show that we have two variables with 12 observations.

The next step controls how Enterprise Miner organizes metadata for the variables in your data.
Select Advanced, then click Next (you can view or change the settings if you click Customize before clicking Next).

Change the Role of KWatts to Target (outcome variable); change the Level of both KWatts and Temp to Interval (continuous values); then click Next. (Other levels are possible, such as Binary.) You can click Explore if you wish to look at some basic statistics; we will do this later.

Here Role relates to the role of the data set (raw, train, validate, score); raw is fine for our analysis of the data, so click Finish.
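
Before running the regression node, the Kwatts vs. Temp figures reported earlier in this deck can be cross-checked by applying the least-squares formulas directly to the 12 observations. The following is a minimal Python sketch and is not part of the original SAS or Excel workflow. The function name `simple_linear_regression` and the dictionary keys are illustrative assumptions. It uses the mean-deviation form of the slope formula, which is algebraically equivalent to the sums-based form.

```python
# Kwatts vs. Temp observations, copied from the example table in this deck.
temp = [59.2, 61.9, 55.1, 66.2, 52.1, 69.9, 46.8, 76.8, 79.7, 79.3, 80.2, 83.3]
kwatts = [9730, 9750, 10180, 10230, 10800, 11160,
          12530, 13910, 15110, 15690, 17020, 17880]


def simple_linear_regression(x, y):
    """Fit y = b0 + b1*x by least squares and return basic fit statistics.

    Hypothetical helper for illustration; the names are not from the deck.
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n

    # Sums of squares and cross-products about the means.
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

    # Least-squares slope and intercept.
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar

    # Decomposition SST = SSR + SSE.
    sst = sum((yi - y_bar) ** 2 for yi in y)
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    ssr = sst - sse

    # Mean squares: 1 d.f. for regression, n - 2 d.f. for error.
    mse = sse / (n - 2)
    msr = ssr / 1

    s = mse ** 0.5            # standard error of the estimate
    se_b1 = s / sxx ** 0.5    # standard error of the slope

    return {
        "b0": b0,
        "b1": b1,
        "sst": sst,
        "ssr": ssr,
        "sse": sse,
        "r2": ssr / sst,
        "s": s,
        "F": msr / mse,
        "t": b1 / se_b1,
    }


fit = simple_linear_regression(temp, kwatts)
for name, value in fit.items():
    print(f"{name}: {value:.4f}")
```

If the data were entered correctly, the printed slope, intercept, r2, F, and t values should agree closely with the coefficients and ANOVA output shown for the Kwatts vs. Temp example. In simple regression with one predictor, the F statistic equals the square of the t statistic for the slope, which is why both tests report the same p-value.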