Basics of Data Cleaning

Why Examine Your Data? • Basic understanding of the data set • Ensure statistical and theoretical underpinnings of a given multivariate (m.v.) technique are met • Concerns about the data • Departures from distribution assumptions (e.g., normality) • Outliers • Missing data

Testing Assumptions • Multivariate (MV) normality assumption • Solution is better when the assumption holds • Violations of MV normality • Skewness (symmetry) • Kurtosis (peakedness) • Heteroscedasticity • Non-linearity

Negative Skew

Positive Skew

Kurtosis • Mesokurtic (normal peakedness) • Leptokurtic (more peaked than normal) • Platykurtic (flatter than normal)

Skewness & Kurtosis
SPSS Syntax:
    FREQUENCIES VARIABLES=age
      /STATISTICS=SKEWNESS SESKEW KURTOSIS SEKURT
      /ORDER=ANALYSIS.
z value = Statistic / Std Error
• Skewness: .354 / .205 = 1.73
• Kurtosis: -.266 / .407 = -.654
Critical values for z score:
• .05: +/- 1.96
• .01: +/- 2.58
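The z values on this slide are plain ratios, so the arithmetic can be checked outside SPSS. The short Python sketch below reuses the slide's statistics and standard errors; the function names `z_ratio` and `exceeds` are illustrative only and are not part of SPSS.

```python
def z_ratio(statistic, std_error):
    """Return statistic / standard error, the z value used on the slide."""
    return statistic / std_error


def exceeds(z, critical=1.96):
    """True when |z| is beyond the two-tailed critical value (1.96 for .05)."""
    return abs(z) > critical


# Values reported by FREQUENCIES for the variable age on the slide.
skew_z = z_ratio(0.354, 0.205)
kurt_z = z_ratio(-0.266, 0.407)

print(round(skew_z, 2), exceeds(skew_z))   # skewness z and whether it passes 1.96
print(round(kurt_z, 3), exceeds(kurt_z))   # kurtosis z and whether it passes 1.96
```

Neither ratio exceeds +/- 1.96 in absolute value, so by this criterion age does not depart significantly from normality at the .05 level.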

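Homoscedasticity • $\sigma_1^2 = \sigma_2^2 = \sigma_3^2 = \sigma_4^2 = \sigma_e^2$ (figure: four group distributions with means $\mu_1, \ldots, \mu_4$ and variances $\sigma_1^2, \ldots, \sigma_4^2$) • When there are multiple groups, each group has a similar level of variance (similar standard deviation)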
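A quick descriptive check of this assumption is to compare the sample variances of the groups. The sketch below is only an informal screen, not a formal test such as Levene's; the group data and the name `variance_ratio` are made up for illustration.

```python
from statistics import variance


def variance_ratio(groups):
    """Return each group's sample variance and the largest/smallest ratio."""
    variances = {name: variance(values) for name, values in groups.items()}
    ratio = max(variances.values()) / min(variances.values())
    return variances, ratio


# Hypothetical scores for three groups; g3 is far more spread out.
groups = {
    "g1": [4, 5, 6, 7, 8],
    "g2": [14, 15, 16, 17, 18],
    "g3": [2, 6, 10, 14, 18],
}

variances, ratio = variance_ratio(groups)
print(variances)
print(ratio)  # a ratio far above 1 suggests unequal variances
```

Groups g1 and g2 differ in mean but share the same variance, which is what homoscedasticity requires; g3 breaks the pattern.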

Linearity

Testing the Assumptions of Absence of Correlated Errors • Correlated errors means there is an unmeasured variable affecting the analysis • Key is to identify the unmeasured variable and to include it in the analysis • How often do we meet this assumption?

Data Cleaning • Examine • Individual items/scales (i.e., reliability) • Bivariate relationships • Multivariate relationships • Techniques to use • Graphs → non-normality, heteroscedasticity • Frequencies → missing data, out-of-bounds values • Univariate outliers (+/- 3 SD from mean) • Mahalanobis Distance (.001)

Graphical Examination • Single Variable: Shape of Distribution • Histogram • Stem and leaf • Relationships between two+ variables • Scatterplot

Histogram

Scatterplot

Frequencies

Outliers • Where do outliers come from? • Inclusion of subjects not part of the population (e.g., ESL response to vocabulary test) • Legitimate data points* • Extreme values of random error (X = t + e) • Error in observation • Error in data preparation

Univariate Outliers • Criteria: Mean +/- 3 SD • Example: Age • Mean = 34.68 • SD = 10.05 • Out of range values > 64.83 or < 4.53
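The cutoffs follow directly from mean +/- 3 SD. The Python sketch below reproduces them from the slide's mean and SD and then flags values in a made-up list of ages; `sd_cutoffs`, `flag_outliers`, and the `ages` list are illustrative assumptions.

```python
def sd_cutoffs(mean, sd, k=3.0):
    """Return the (lower, upper) bounds at mean -/+ k standard deviations."""
    return mean - k * sd, mean + k * sd


def flag_outliers(values, mean, sd, k=3.0):
    """Return the values that fall outside mean +/- k * sd."""
    low, high = sd_cutoffs(mean, sd, k)
    return [v for v in values if v < low or v > high]


# Mean and SD of age as given on the slide.
low, high = sd_cutoffs(34.68, 10.05)
print(round(low, 2), round(high, 2))

# Hypothetical ages, used only to show the flagging step.
ages = [19, 34, 52, 64, 71, 3]
print(flag_outliers(ages, 34.68, 10.05))
```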

Univariate Outliers

Multivariate Outliers • Mahalanobis Distance
SPSS Syntax:
    REGRESSION VAR = case VAR1 VAR2
      /STATISTICS COLLIN
      /DEPENDENT = case
      /ENTER
      /RESIDUALS = OUTLIERS(MAHAL).
Critical values at the .001 level (a case with D > c.v. is a multivariate outlier):
• two variables: 13.82
• three variables: 16.27
• four variables: 18.46
• five variables: 20.52
• six variables: 22.46
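To make the idea concrete, the Python sketch below computes squared Mahalanobis distances directly for the two-variable case and compares them with the two-variable critical value. It is not the SPSS procedure: it builds the 2 x 2 sample covariance matrix by hand, and the data set (30 regular cases plus one discrepant case) is invented. For two variables the .001 chi-square critical value has the closed form -2 ln(.001), which agrees with the 13.82 on the slide.

```python
import math
from statistics import mean


def mahalanobis_sq_2d(xs, ys):
    """Squared Mahalanobis distance of each (x, y) case from the centroid."""
    n = len(xs)
    mx, my = mean(xs), mean(ys)
    # Sample variances and covariance (n - 1 in the denominator).
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    det = sxx * syy - sxy ** 2
    distances = []
    for x, y in zip(xs, ys):
        dx, dy = x - mx, y - my
        # Quadratic form d' S^-1 d written out for a 2 x 2 matrix.
        distances.append((syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det)
    return distances


# Hypothetical data: y tracks x closely, except for the last case (5, 40).
xs = list(range(1, 31)) + [5]
ys = [i + (1 if i % 2 == 0 else -1) for i in range(1, 31)] + [40]

critical = -2 * math.log(0.001)  # chi-square critical value, 2 df, p = .001
d2 = mahalanobis_sq_2d(xs, ys)
flagged = [i for i, d in enumerate(d2) if d > critical]
print(round(critical, 2), flagged)
```

Neither coordinate of the last case is extreme on its own (x = 5 is well inside the range of x); it is the combination that is unusual, which is why a multivariate criterion is needed in addition to the univariate one.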

Approaches to Outliers • Leave them alone • Delete entire case (listwise) • Delete only relevant variables (pairwise) • Trim – highest legitimate value • Mean substitution • Imputation

Effects of Outliers (figure: scatterplots with r = .50 and r = .32)

Effects of Outliers
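The sensitivity of a correlation to a single case can be shown with a small made-up data set. The Python sketch below computes Pearson's r with and without one discrepant point; the numbers are hypothetical and are not the data behind the slide's figures.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation computed from sums of squares and cross-products."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical scores with a strong positive relationship.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2, 1, 4, 3, 6, 5, 8, 7]

r_without = pearson_r(x, y)
# Add one case that is high on x but very low on y.
r_with = pearson_r(x + [9], y + [0])

print(round(r_without, 3), round(r_with, 3))
```

One added case out of nine is enough to cut the correlation to roughly a third of its original size in this example.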

Major Problems: Missing Data • Generalizability issues • Reduces power (sample size) • Impacts accuracy of results • Accuracy = dispersion around true score (can be under- or over-estimation) • Varies with the missing data technique (MDT) used

Dealing with Missing Data • Listwise deletion • Pairwise deletion • Mean substitution • Regression imputation • Hot-deck imputation • Multiple imputation

Dealing with Missing Data In Order of Accuracy: • Pairwise deletion • Listwise deletion • Regression imputation • Mean substitution • Hot-deck imputation
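The first three techniques differ mainly in which cases they keep. The Python sketch below applies listwise deletion, pairwise deletion, and mean substitution to a tiny made-up data set, with `None` marking a missing value; the variable names and helper functions are illustrative assumptions, and regression, hot-deck, and multiple imputation are not shown.

```python
from statistics import mean

# Hypothetical data set; None marks a missing value.
data = {
    "age":    [25, 31, None, 47, 52],
    "income": [30, None, 41, 58, 60],
    "score":  [3, 4, 5, None, 2],
}


def listwise(data):
    """Keep only the cases that are complete on every variable."""
    names = list(data)
    rows = zip(*(data[name] for name in names))
    kept = [row for row in rows if None not in row]
    return {name: [row[i] for row in kept] for i, name in enumerate(names)}


def pairwise(data, a, b):
    """Keep the cases that are complete on the two variables being analyzed."""
    pairs = [(x, y) for x, y in zip(data[a], data[b])
             if x is not None and y is not None]
    return [p[0] for p in pairs], [p[1] for p in pairs]


def mean_substitute(values):
    """Replace each missing value with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    m = mean(observed)
    return [m if v is None else v for v in values]


complete = listwise(data)
age_pair, income_pair = pairwise(data, "age", "income")
age_filled = mean_substitute(data["age"])

print(len(complete["age"]), len(age_pair))  # cases kept: listwise vs. pairwise
print(age_filled)
```

Listwise deletion keeps fewer cases than pairwise deletion here, which illustrates the loss of power noted earlier; mean substitution keeps every case but shrinks the variable's variance.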

Dealing with Missing Data

Transformations
Distribution → Best transformation to try
• Moderate deviation from normality → Square root
• Substantial deviation from normality → Log
• Severe deviation, esp. J-shape → Inverse
• Negative skew → "Reflect" (mirror image), then transform
• Interpretation of transformed variables?
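The transformations listed above can be sketched in a few lines of Python. The sketch assumes all scores are positive (the log and inverse are undefined at zero), uses base-10 logs, and reflects by subtracting each score from the maximum plus one, which is a common convention rather than something stated on the slide; the scores themselves are made up.

```python
import math


def sqrt_transform(xs):
    """Square root: for moderate deviation from normality."""
    return [math.sqrt(x) for x in xs]


def log_transform(xs):
    """Base-10 log: for substantial deviation from normality."""
    return [math.log10(x) for x in xs]


def inverse_transform(xs):
    """Inverse (1/x): for severe deviation, especially J-shaped."""
    return [1 / x for x in xs]


def reflect(xs):
    """Mirror the scores so that negative skew becomes positive skew."""
    k = max(xs) + 1  # assumed constant: largest score plus one
    return [k - x for x in xs]


# Hypothetical negatively skewed scores: most values are high, a few are low.
scores = [1, 7, 8, 9, 9, 10, 10, 10]
reflected = reflect(scores)
print(reflected)
print(sqrt_transform(reflected))
```

After reflection a high transformed value corresponds to a low original score, so the direction of any relationship must be reversed when interpreting results. The inverse transformation also reverses the ordering of scores.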
