
Unit 9: Dealing with Messy Data I: Case Analysis



  1. Unit 9: Dealing with Messy Data I: Case Analysis

  2. Anscombe’s Quartet

lm(y1 ~ x, data = Quartet)
Coefficients  Estimate  SE      t-statistic  Pr(>|t|)
(Intercept)   3.0001    1.1247  2.667        0.02573 *
x             0.5001    0.1179  4.241        0.00217 **
---
Sum of squared errors (SSE): 13.8, Error df: 9
R-squared: 0.6665

lm(y2 ~ x, data = Quartet)
Coefficients  Estimate  SE     t-statistic  Pr(>|t|)
(Intercept)   3.001     1.125  2.667        0.02576 *
x             0.500     0.118  4.239        0.00218 **
---
Sum of squared errors (SSE): 13.8, Error df: 9
R-squared: 0.6662

Anscombe, Francis J. (1973). Graphs in statistical analysis. American Statistician, 27, 17–21. See the Quartet data frame in the car package.
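The quartet also ships with base R as the built-in anscombe data frame (each dataset has its own x column, x1–x4), so the near-identical fits are easy to reproduce without the car package; a minimal sketch:

```r
# Anscombe's quartet via base R's built-in `anscombe` data frame.
# Datasets 1 and 2 yield essentially the same coefficients despite
# looking nothing alike when plotted.
m_a <- lm(y1 ~ x1, data = anscombe)
m_b <- lm(y2 ~ x2, data = anscombe)
round(coef(m_a), 3)   # intercept ~3.000, slope ~0.500
round(coef(m_b), 3)   # essentially identical
```

Plotting the pairs (e.g., plot(y1 ~ x1, data = anscombe); abline(m_a)) is what reveals the four radically different patterns behind the same numbers.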

  3. Case Analysis
The goal is to identify unusual or excessively influential observations. Such data points may bias results and/or reduce power to detect effects (inflate standard errors and/or decrease R2).
Three aspects of individual observations to attend to:
• Leverage
• Regression outliers
• Influence
Case analysis also provides an important first step as you get to “know” your data.

  4. Case Analysis: Unusual and Influential Data
setwd('P:\\CourseWebsites\\PSY710\\Data\\Diagnostics')
d1 = dfReadDat('DOSE2.dat')
d1$Sex = as.numeric(d1$Sex) - 1.5
m1 = lm(SP ~ BAC + TA + Sex, data=d1)
modelSummary(m1)

Coefficients  Estimate    SE        t-statistic  Pr(>|t|)
(Intercept)   21.85097    7.38361   2.959        0.00392 **
BAC           -196.07232  83.21315  -2.356       0.02058 *
TA            0.14553     0.03119   4.666        1.04e-05 ***
Sex           -20.01198   6.57956   -3.042       0.00307 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Sum of squared errors (SSE): 94386.3, Error df: 92
R-squared: 0.2950

  5. Univariate Statistics and Graphs 1: Univariate Statistics (n’s, means, SDs, min/max, shape)
varDescribe(d1)

     var  n   mean    sd      median  min    max     skew   kurtosis
BAC  1    96  0.06    0.04    0.06    0.0    0.14    -0.09  -1.09
TA   2    96  147.61  105.73  119.00  10.0   445.00  0.89   -0.06
Sex  3    96  0.01    0.50    0.50    -0.5   0.50    -0.04  -2.02
FPS  4    96  32.19   37.54   19.46   -98.1  162.74  0.62   1.93

  6. Univariate Statistics and Graphs 2: Univariate Plots (histograms, rug, and density plots)
varPlot(d1$FPS, 'FPS')
See also: hist(), rug(), density()

"Descriptive statistics: FPS"
n   mean   sd     median  min    max     skew  kurtosis
96  32.19  37.54  19.46   -98.1  162.74  0.62  1.93

  7. Bivariate Correlations
> corr.test(d1)
Correlation matrix
     BAC    TA     Sex    FPS
BAC  1.00   -0.02  -0.07  -0.19
TA   -0.02  1.00   -0.08  0.44
Sex  -0.07  -0.08  1.00   -0.29
FPS  -0.19  0.44   -0.29  1.00

  8. Univariate Statistics and Graphs 3: Bivariate Plots (Scatterplot, Rug, & Density) spm(~FPS + BAC + TA + Sex, data=d1)

  9. Leverage (Cartoon data)
4. Check for high leverage points.
Leverage is a property of the predictors only (the DV is not considered in leverage analysis). An observation has greater “leverage” on the results the farther it falls from the mean of all predictors.
Which points have the most leverage in the one-predictor example below?

  10. Leverage
Hat values (hi) provide an index of leverage. In the one-predictor case:
hi = 1/N + (Xi − X̄)² / Σj(Xj − X̄)²
With multiple predictors, hi measures the distance from the centroid (point of means) of the Xs.
Hat values are bounded between 1/N and 1. The mean hat value (h̄) is P/N.
Rules of thumb:
hi > 3h̄ for small samples (N < 100)
hi > 2h̄ for large samples
Do NOT blindly apply these rules of thumb. Hat values should be evaluated relative to the overall distribution of hi: view a histogram of the hi.
NOTE: Mahalanobis distance = (N − 1)(hi − 1/N). SPSS reports centered leverage (hi − 1/N).
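As a sanity check on the one-predictor formula above, the hat values can be computed by hand and compared against R's built-in hatvalues(); the sketch below uses the built-in anscombe data in place of the course dataset:

```r
# One-predictor hat values by hand, checked against hatvalues().
m <- lm(y1 ~ x1, data = anscombe)
x <- anscombe$x1
N <- length(x)
h_manual <- 1/N + (x - mean(x))^2 / sum((x - mean(x))^2)
all.equal(unname(hatvalues(m)), h_manual)   # TRUE
mean(h_manual)                              # mean hat value = P/N = 2/11
# Inspect the distribution rather than applying cutoffs blindly:
# hist(hatvalues(m))
```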

  11. Leverage (Cartoon data)
High leverage values are not always bad. In fact, in some cases they are good. Must also consider whether they are regression outliers. WHY?
R² = [SSE(Mean-only) − SSE(A)] / SSE(Mean-only)
SEbi = (sy/si) × √[(1 − R²Y) / ((N − k − 1)(1 − R²i))]
High leverage points that are fit well by the model increase the difference between SSE(Mean-only) and SSE(A), which increases R².
High leverage points that are fit well also increase the variance of the predictor. This reduces the SE for that predictor and yields more power.
Well fit, high leverage points do NOT alter the b’s.

  12. Leverage (Real Data) modelCaseAnalysis(m1, Type='hatvalues')

  13. Regression Outlier (Cartoon data) 5. Check for Regression Outliers An observation that is not adequately fit by the regression model (i.e., falls very far from the prediction line) In essence, a regression outlier is a discrepant score with a large residual (ei). Which point(s) are Regression Outliers?

  14. Regression Outlier
There are multiple quantitative indices for identifying regression outliers, including raw residuals (ei), standardized residuals (e′i), and studentized residuals (t′i). The preferred index is the studentized residual:
t′i = ei / (SEe(−i) × √(1 − hi))
t′i follows a t-distribution with N − P − 1 degrees of freedom.
You can apply a Bonferroni correction when testing the studentized residuals. But again, not blindly: view a histogram of the t′i.
NOTE: SPSS calls these Studentized Deleted Residuals. Cohen calls these Externally Studentized Residuals.
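The studentized residual formula can likewise be verified against R's rstudent(), using the standard leave-one-out identity for the deleted-case error variance; again sketched on the built-in anscombe data:

```r
# Studentized (deleted) residuals by hand, checked against rstudent().
m <- lm(y1 ~ x1, data = anscombe)
e <- resid(m); h <- hatvalues(m)
N <- nobs(m); P <- length(coef(m))
# Deleted-case error variance via the leave-one-out identity:
s2_del <- (sum(e^2) - e^2 / (1 - h)) / (N - P - 1)
t_manual <- e / sqrt(s2_del * (1 - h))
all.equal(unname(rstudent(m)), unname(t_manual))   # TRUE
# Bonferroni-corrected two-tailed critical value at alpha = .05:
qt(1 - .05 / (2 * N), df = N - P - 1)
```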

  15. Regression Outliers (Cartoon data)
Regression outliers are always bad, but they can have two different types of bad effects. WHY?
R² = [SSE(Mean-only) − SSE(A)] / SSE(Mean-only)
SEbi = (sy/si) × √[(1 − R²Y) / ((N − k − 1)(1 − R²i))]
Regression outliers increase SSE(A), which decreases R². Decreased R² leads to increased SEs for the b’s.
If an outlier also has leverage, it can alter (increase or decrease) the b’s.

  16. Regression Outlier (Real Data) modelCaseAnalysis(m1, Type='residuals')

  17. Regression Outlier (Real Data)
outlierTest(m1, cutoff=.05)

     rstudent  unadjusted p-value  Bonferroni p
0125 -4.39553  2.9872e-05          0.0028677

  18. Influence (Cartoon data)
An observation is “influential” if it substantially alters the fitted regression model (i.e., the coefficients and/or intercept). Two commonly used assessment methods:
• Cook’s distance
• dfBetas
Which point(s) have the most influence?

  19. Cook’s Distance
Cook’s distance (Di) provides a single summary statistic indexing how much influence each score has on the overall model. Cook’s distance is based on both the “outlierness” (standardized residual, e′i) and the leverage of the observation:
Di = (e′i² / P) × (hi / (1 − hi))
Di > 4 / (N − P) has been proposed as a very liberal cutoff (it identifies many influential points). Di > qf(.5, P, N − P) has also been employed as a very conservative cutoff.
Identification of problematic scores should be considered in the context of the overall distribution of Di.
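The formula above can be checked directly against R's cooks.distance(); the sketch below again substitutes the built-in anscombe data for the course dataset:

```r
# Cook's distance by hand, checked against cooks.distance().
m <- lm(y1 ~ x1, data = anscombe)
h <- hatvalues(m)
r <- rstandard(m)        # internally studentized (standardized) residuals
P <- length(coef(m))
D_manual <- (r^2 / P) * (h / (1 - h))
all.equal(unname(cooks.distance(m)), unname(D_manual))   # TRUE
# Liberal and conservative cutoffs from the slide:
4 / (nobs(m) - P)
qf(.5, P, nobs(m) - P)
```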

  20. Cook's Distance (Real Data) modelCaseAnalysis(m1, Type='cooksd')

  21. Influence Bubble Plot (Real Data) modelCaseAnalysis(m1,Type='influenceplot') What are the expected effects of each of these points on the model?

  22. dfBetas
dfBetaij indexes how much each regression coefficient (j = 0 … k) would change if the ith score were deleted:
dfBetaij = bj − bj(−i)
dfBetas (preferred) is the standardized form of the index:
dfBetasij = dfBetaij / SEbj(−i)
|dfBetas| > 2 may be problematic in small samples; |dfBetas| > 2/√N in larger samples (Belsley et al., 1980).
Consider the distribution with a histogram! Influence can also be visualized with an added variable plot.
One problem is that there can be many dfBetas (a set for each predictor and the intercept). They are most helpful when there is one “critical/focal effect.”
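Base R's dfbetas() returns the standardized index directly, one column per coefficient; a sketch on the built-in anscombe data:

```r
# dfbetas(): one standardized dfBeta per case per coefficient.
m <- lm(y1 ~ x1, data = anscombe)
db <- dfbetas(m)        # columns: (Intercept), x1
dim(db)                 # 11 cases x 2 coefficients
# Large-sample rule of thumb (Belsley et al., 1980):
cut <- 2 / sqrt(nobs(m))
which(abs(db[, "x1"]) > cut)
# hist(db[, "x1"])  # inspect the distribution, not just the cutoff
```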

  23. dfBetas (Real Data) modelCaseAnalysis(m1, Type='dfbetas')

  24. Added Variable Plot (Real Data)

  25. Impact on SEs
In addition to altering regression coefficients (and reducing R²), problematic scores can increase the SEs (i.e., reduce the precision of estimation) of the regression coefficients.
COVRATIO is an index of how individual scores affect the overall precision of estimation (the joint confidence region for the set of coefficients) of the regression coefficients.
Observations that decrease the precision of estimation have COVRATIOs < 1.0.
Belsley et al. (1980) proposed flagging cases with |COVRATIOi − 1| > 3P/N.
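COVRATIO is also available in base R (stats package) as covratio(); a sketch of the flagging rule on the built-in anscombe data:

```r
# covratio() from base R's stats package.
m <- lm(y1 ~ x1, data = anscombe)
cr <- covratio(m)
P <- length(coef(m)); N <- nobs(m)
# Cases with COVRATIO < 1 reduce the precision of estimation;
# flag cases far from 1 using the Belsley et al. cutoff:
which(abs(cr - 1) > 3 * P / N)
```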

  26. Impact on SEs (Real Data) modelCaseAnalysis(m1, Type='covratio')

  27. Enter the Real World • So what do you do????

  28. Overall Impact of Problem Scores: Real Data
modelSummary(m1)
Coefficients  Estimate    SE        t-statistic  Pr(>|t|)
(Intercept)   21.85097    7.38361   2.959        0.00392 **
BAC           -196.07232  83.21315  -2.356       0.02058 *
TA            0.14553     0.03119   4.666        1.04e-05 ***
Sex           -20.01198   6.57956   -3.042       0.00307 **
---
Sum of squared errors (SSE): 94386.3, Error df: 92
R-squared: 0.2950

d2 = lm.removeCases(d1, c('0125'))
m2 = lm(SP ~ BAC + TA + Sex, data=d2)
modelSummary(m2)
Coefficients  Estimate   SE       t-statistic  Pr(>|t|)
(Intercept)   26.4196    6.8223   3.873        0.000203 ***
BAC           -243.1829  76.7423  -3.169       0.002085 **
TA            0.1415     0.0285   4.964        3.2e-06 ***
Sex           -17.6754   6.0319   -2.930       0.004281 **
---
Sum of squared errors (SSE): 77856.2, Error df: 91
R-squared: 0.3330
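The same remove-and-refit logic can be illustrated with the built-in anscombe data, where dataset 3 contains a single regression outlier (row 3); plain-R subsetting stands in for the course's lm.removeCases() helper:

```r
# Dropping dataset 3's one outlier changes the fit dramatically.
m_full <- lm(y3 ~ x3, data = anscombe)
m_drop <- lm(y3 ~ x3, data = anscombe[-3, ])   # row 3 is the outlier
round(coef(m_full), 3)     # slope ~0.500
round(coef(m_drop), 3)     # slope ~0.345, intercept ~4.006
summary(m_drop)$r.squared  # near 1: the remaining points are nearly collinear
```

Whether a case should actually be removed is a judgment call; the point of the comparison is to quantify how much a single score drives the coefficients, SEs, and R².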

  29. Four Examples with Fake Data
