
Data Preprocessing




  1. Data Preprocessing CSC 576: Data Science

  2. Today • Data Preprocessing • Handling Missing Values • Handling Outliers • Covariance and Correlation • Normalization • Binning, Discretization • Sampling • Aggregation

  3. Data Preprocessing • Data Exploration phase results in finding data quality issues • Outliers, missing values, … • Data Preprocessing usually delayed until the Modeling phase • Different predictive models require different preprocessing

  4. Handling Missing Data • Motivation: We will frequently encounter missing values, especially in big data. • Lots of fields, lots of observations • Question: how to handle the missing data?

  5. How much data is missing? • Suppose… • Dataset of 30 variables • 5% of data is missing • Missing values are spread evenly throughout data • Then roughly 80% of records would have at least one missing value (the chance that a record is fully complete is 0.95^30 ≈ 0.21) • Bad solution: Deleting records with any missing data
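
A quick sanity check of that figure (a minimal Python sketch; the 30-variable, 5%-missing setup is just the slide's hypothetical):

    # Chance that a record with 30 fields, each missing independently with
    # probability 0.05, is fully complete
    p_complete = 0.95 ** 30          # about 0.21
    p_any_missing = 1 - p_complete   # about 0.79, i.e. roughly 80% of records
    print(round(p_any_missing, 2))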

  6. Handling Missing Values (Approaches) • Drop any features that have missing values. • Might lead to massive loss of data • Drop any instance/record that has a missing value. • Might lead to bias • Derive a missing indicator feature from features with missing values. • Replace with a binary feature, whether the value was missing or not. • Ignore missing values • If data mining method is robust • Replace missing value with some constant (Specified by analyst) • Replace missing value with mean, median, or mode • Replace missing value with value generated at random from the observed variable distribution • Replace missing value with imputed values, based on other characteristics of the record
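
A minimal pandas sketch of a few of these approaches, assuming a hypothetical DataFrame df with a numeric "income" column that contains missing values (the names are illustrative only):

    import pandas as pd

    # Drop any record that has a missing value (risks data loss and bias)
    df_complete = df.dropna()

    # Derive a binary missing-indicator feature
    df["income_missing"] = df["income"].isna().astype(int)

    # Replace missing values with a constant specified by the analyst
    df["income"] = df["income"].fillna(0)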

  7. Handling Missing Values • Imputation: replaces missing feature values with a plausible estimated value • Common approach: replace missing values for a feature with a measure of the central tendency of that feature • Mean, median (continuous) • Mode (categorical) • Be careful using imputation on features missing in excess of 30% of their values.
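
A minimal imputation sketch, assuming the same hypothetical df with a continuous "income" feature and a categorical "occupation" feature:

    # Continuous feature: impute with the median (more robust to outliers than the mean)
    df["income"] = df["income"].fillna(df["income"].median())

    # Categorical feature: impute with the mode (most frequent level)
    df["occupation"] = df["occupation"].fillna(df["occupation"].mode()[0])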

  8. Reclassifying Categorical Variables • Sometimes a categorical variable will contain too many factors to be easily analyzable • Example: state field could contain 50 different values • Solution #1: reclassify state into its region: {NorthEast, NorthWest, Central, West, …} • Solution #2: reclassify state by its economic level: {WealthyStates, MiddleStates, PoorestStates} • Up to the analyst to appropriately reclassify
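
A minimal sketch of reclassifying a hypothetical "state" feature into regions using a lookup dictionary (the mapping shown is illustrative and incomplete):

    region_map = {"ME": "NorthEast", "NH": "NorthEast",
                  "WA": "NorthWest", "OR": "NorthWest",
                  "KS": "Central", "CA": "West"}

    # Unmapped states become NaN, which makes gaps in the mapping easy to spot
    df["region"] = df["state"].map(region_map)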

  9. Handling Outliers: Clamp Transformation • Clamps all values above an upper threshold and below a lower threshold to remove outliers • Upper and lower thresholds can be set manually based on domain knowledge • Or: • Lower = 1st quartile − 1.5 × inter-quartile range (IQR) • Upper = 3rd quartile + 1.5 × inter-quartile range • Lower and upper are the specific thresholds.
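
A minimal sketch of the IQR-based clamp transformation for a hypothetical numeric "claim_amount" column:

    q1 = df["claim_amount"].quantile(0.25)
    q3 = df["claim_amount"].quantile(0.75)
    iqr = q3 - q1

    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr

    # clip() raises values below `lower` up to lower and cuts values above `upper` down to upper
    df["claim_amount"] = df["claim_amount"].clip(lower=lower, upper=upper)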

  10. Case Study • What handling strategies would you recommend for the data quality issues found in the motor insurance fraud dataset (previous slide)? • Num Soft Tissue – imputation (using the median) on the 2% of missing values • Claim Amount – clamp transformation on outliers (manually set) • Amount Received – clamp transformation on outliers (manually set)

  11. Covariance and Correlation • Visual preliminary exploration, comparing two variables, using scatter plots • Quantitative preliminary exploration using covariance and correlation measures • Covariance (sample covariance between features a and b): cov(a, b) = (1 / (n − 1)) × Σ (a_i − mean(a)) × (b_i − mean(b))

  12. Covariance • values fall into the range [−∞, ∞] : • negative values indicate a negative relationship • positive values indicate a positive relationship • values near zero indicate that there is little or no relationship between the features

  13. Example • Calculating covariance between the HEIGHT feature and the WEIGHT and AGE features from the basketball players dataset.
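
A minimal sketch of that calculation, assuming a hypothetical players DataFrame with HEIGHT, WEIGHT and AGE columns:

    # Series.cov uses the sample (n - 1) denominator, matching the formula above
    cov_height_weight = players["HEIGHT"].cov(players["WEIGHT"])
    cov_height_age = players["HEIGHT"].cov(players["AGE"])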

  14. Correlation • normalized form of covariance • Values range between −1 and +1 • Correlation: corr(a, b) = cov(a, b) / (sd(a) × sd(b)), where sd(·) is the standard deviation

  15. Correlation • values fall into the range [−1, 1] • values close to −1 indicate a very strong negative correlation • values close to 1 indicate a very strong positive correlation • values around 0 indicate little or no correlation • Features that have no correlation are often described as independent, although zero correlation does not by itself guarantee independence.

  16. Example • Calculating correlation between the HEIGHT feature and the WEIGHT and AGE features from the basketball players dataset.

  17. Covariance and Correlation Matrix • There are usually multiple continuous features in a dataset to explore. • A covariance or correlation matrix displays the value for every pair of features
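
A minimal sketch for computing both matrices over every pair of continuous features, assuming the same hypothetical players DataFrame:

    continuous = players[["HEIGHT", "WEIGHT", "AGE"]]
    cov_matrix = continuous.cov()
    corr_matrix = continuous.corr()   # Pearson correlation by default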

  18. Example

  19. Including Correlations in a Scatter Plot Matrix

  20. Correlation does not Imply Linear Relationship • Anscombe’s Quartet: • Each has a correlation value of 0.816 between x and y • Correlation is a good measure of the relationship between two continuous features, • but it is not by any means perfect. • Still need visual analysis.

  21. Correlation does not Imply Causation • Causation can be mistakenly assumed: • Mistaking the order of a causal relationship • Example: Spinning windmills cause wind. • Example: Playing basketball causes people to be tall. • Inferring causation between two features while ignoring a third (hidden) feature. • From Nature, 1999. • “causal relationship between young children sleeping with a night-light turned on and these children developing short-sightedness in later life” • Short-sighted parents, because of poor night vision, tend to favor the use of night-lights; short-sighted parents are more likely to have short-sighted children.

  22. Spurious Correlations • http://tylervigen.com/discover

  23. Data Preparation • Changing the way data is represented just to make it more compatible with certain machine learning algorithms: • Normalization • Binning • Sampling

  24. Normalization • change a continuous feature to fall within a specified range while maintaining the relative differences between the values for the feature • Typical ranges used for normalizing feature values are [0, 1] and [−1, 1] • Example: • Customer ages in a dataset: [16, 96] • Customer salaries: [10000, 100000] • Range Normalization: convert a feature value into the range [low, high]: a'_i = ((a_i − min(a)) / (max(a) − min(a))) × (high − low) + low • Sensitive to the presence of outliers in a dataset.
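
A minimal range-normalization sketch implementing the formula above for a hypothetical "salary" column, targeting the range [0, 1]:

    low, high = 0.0, 1.0
    a = df["salary"]
    df["salary_norm"] = (a - a.min()) / (a.max() - a.min()) * (high - low) + low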

  25. Standardization • measures how many standard deviations a feature value is from the mean for that feature: a'_i = (a_i − mean(a)) / sd(a) • standardized values have mean = 0, standard deviation = 1 • “Standard Scores” • The majority of standardized feature values will fall in the range [−1, 1] • Standardization assumes the feature values are normally distributed. • If not, standardization may introduce some distortions.
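
A minimal standardization sketch for the same hypothetical "salary" column:

    a = df["salary"]
    # z-score: how many standard deviations each value lies from the mean
    df["salary_std"] = (a - a.mean()) / a.std()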

  26. Example

  27. Binning • converting a continuous feature into a categorical feature • define a series of ranges (called bins) for the continuous feature that correspond to the levels of the new categorical feature • Approaches: • equal-width binning • equal-frequency binning

  28. Choosing the # of Bins • Need to “manually” decide the number of bins: • Choosing a low number may lose a lot of information • Choosing a very high number might result in very few instances in each bin, or empty bins

  29. Equal-Width Binning • Splits the range of the feature values into b bins, each of size (max(a) − min(a)) / b • Usually works well • Some near-empty bins when data follows a normal distribution

  30. Equal-Frequency Binning • Algorithm: • Sorts the continuous feature values into ascending order • Then places an equal number of instances into each bin, starting with bin 1 • Number of instances placed in each bin = n / b (for n instances and b bins) • More accurately models the heavily populated areas of the continuous feature, compared to equal-width. • Slightly less intuitive because bins are of varying sizes.
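
A minimal sketch of both binning approaches on a hypothetical "age" column (the choice of b = 4 bins is arbitrary):

    import pandas as pd

    b = 4

    # Equal-width binning: each bin spans (max(a) - min(a)) / b units
    df["age_eq_width"] = pd.cut(df["age"], bins=b)

    # Equal-frequency binning: each bin receives roughly n / b instances
    df["age_eq_freq"] = pd.qcut(df["age"], q=b)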

  31. Data Preparation: Sampling • Sometimes we have too much data! • instead sample a smaller percentage from the larger dataset • Care required when sampling: • Try to ensure that the resulting dataset is still representative of the original data and that no unintended bias is introduced during this process. • If not, any modeling on the sample will not be relevant to the overall dataset.

  32. Sampling: Top Sampling • Select the top s% of instances from a dataset to create a sample. • Top sampling runs a serious risk of introducing bias • the sample will be affected by any ordering of the original dataset • Usually avoided

  33. Sampling: Random Sampling • randomly selects a proportion of s% of the instances from a large dataset to create a smaller set. • good choice in most cases as the random nature of the selection of instances should avoid introducing bias
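
A minimal random-sampling sketch, taking s = 10% as an example:

    # Sample 10% of instances without replacement; fixing random_state makes it reproducible
    sample = df.sample(frac=0.10, random_state=42)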

  34. Other Sampling Forms • Stratified Sampling • Ensures that the relative frequencies of the levels of a specific feature are maintained • Usage: if there are one or more levels of a categorical feature that have only a very small proportion of instances (chance they will be omitted by random sampling) • Under-Sampling or Over-Sampling • Sample containing different relative frequencies • Usage: if we want a particular categorical feature to be represented equally in the sample, even if that was not the distribution in the original dataset
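
A minimal stratified-sampling sketch that preserves the relative frequencies of a hypothetical categorical "claim_type" feature (one of several ways to do this in pandas):

    stratified = (df.groupby("claim_type", group_keys=False)
                    .apply(lambda g: g.sample(frac=0.10, random_state=42)))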

  35. Aggregation • Combining two or more attributes into a single attribute • Or – combining two or more objects into a single object • Purpose: • Data reduction: reduce # of attributes/objects • Change of scale: high-level vs. low-level • Motivations: • Less memory and processing time
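
A minimal aggregation sketch that reduces a hypothetical transaction-level DataFrame to one row per customer:

    customer_level = (df.groupby("customer_id")
                        .agg(total_spend=("amount", "sum"),
                             num_transactions=("amount", "count")))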

  36. Aggregation – Data Reduction

  37. Aggregation – Change in Variability • Less variability at “higher-level” view

  38. When to Remove Variables • Variables that will not help the analysis should be removed • Unary variables: take on a single value • Example: gender variable for students at an all-girls school • Variables that are nearly unary • Example: gender of football athletes at elementary school • 99.95% of the players are male • Some data mining algorithms may treat the variable as unary • Not enough data to investigate the female players anyway…

  39. When to Remove Variables • Think carefully before removing variables because of: • 90% of the values are missing • Strong correlation between two variables

  40. When to Remove Variables • 90% of the values are missing • Are the values that are present representative or not? • If the present values are representative, then either (1) remove the variable or (2) impute the values. • If the present values are non-representative, their presence adds value. • Scenario: donation_dollars field in a self-reported survey • Assumption: those who donate a lot are more inclined to report their donation • Could also binarize the variable: donation_flag
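
A minimal sketch of that binarization, using the slide's hypothetical donation_dollars field:

    # 1 if a donation amount was reported, 0 if the value is missing
    df["donation_flag"] = df["donation_dollars"].notna().astype(int)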

  41. When to Remove Variables • Strong correlation between two variables • Inclusion of correlated variables may “double-count” a particular aspect of the analysis, depending on the machine learning technique used. • Example: precipitation and people on a beach • Strategy #1: remove one of the two correlated variables • Strategy #2: use PCA to transform the variables (beyond scope of course)
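
A minimal sketch of Strategy #1: find strongly correlated feature pairs and drop one from each pair (the 0.9 threshold is an arbitrary choice, and this keeps whichever feature appears first):

    import numpy as np

    corr = df.corr(numeric_only=True).abs()
    # Keep only the upper triangle so each pair is examined once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
    df_reduced = df.drop(columns=to_drop)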

  42. Id Fields • Id fields have a different value for each record • Won’t be helpful in predictive analysis • If they are, the relationship is usually spurious • Recommended Approach: • Don’t include Id field in modeling • But keep it in the dataset to differentiate between records

  43. References • Fundamentals of Machine Learning for Predictive Data Analytics, Kelleher et al., First Edition
