Data Quality / Data Exploration
CSC 576: Data Mining

Presentation Transcript


  1. Data Quality / Data Exploration. CSC 576: Data Mining

  2. Today • Data Quality • Data Exploration

  3. Data Quality Report • A data quality report includes tabular reports that describe the characteristics of each feature in a dataset, using standard statistical measures of central tendency and variation. • In the KNA textbook, ABT refers to “Analytics Base Table” • The tabular reports are accompanied by data visualizations: • histogram for each continuous feature • bar plot for each categorical feature • also generally used for continuous features with cardinality < 10

  4. Tabular Structure in a Data Quality Report • Card = Cardinality: measures the number of distinct values present for a feature • Note the differences between each table.
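A minimal sketch of how the two kinds of tabular report could be computed with pandas; the toy DataFrame and its column names are illustrative assumptions, not the case-study ABT.

```python
import numpy as np
import pandas as pd

# Toy stand-in for an ABT; the columns are illustrative only.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "income": rng.normal(50_000, 12_000, 200),
    "num_claims": rng.poisson(1.2, 200),
    "marital_status": rng.choice(["single", "married", None], 200),
})

def continuous_report(s: pd.Series) -> pd.Series:
    """One row of the continuous-feature table."""
    return pd.Series({
        "count": s.count(),
        "% miss": 100 * s.isna().mean(),
        "card": s.nunique(),                   # number of distinct values
        "min": s.min(), "1st qrt": s.quantile(0.25),
        "mean": s.mean(), "median": s.median(),
        "3rd qrt": s.quantile(0.75), "max": s.max(),
        "std dev": s.std(),
    })

def categorical_report(s: pd.Series) -> pd.Series:
    """One row of the categorical-feature table."""
    counts = s.value_counts()                  # most frequent level first
    return pd.Series({
        "count": s.count(),
        "% miss": 100 * s.isna().mean(),
        "card": s.nunique(),
        "mode": counts.index[0], "mode %": 100 * counts.iloc[0] / s.count(),
        "2nd mode": counts.index[1], "2nd mode %": 100 * counts.iloc[1] / s.count(),
    })

print(df[["income", "num_claims"]].apply(continuous_report).T)
print(categorical_report(df["marital_status"]))
```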

  5. Case Study: ABT for Motor Insurance Claims Fraud Detection

  6. Data Exploration: Getting to Know the Data • For categorical features: • Examine the mode, 2nd mode, mode %, and 2nd mode % • These represent the most common levels within these features • They will identify whether any levels dominate the dataset • For continuous features: • Examine the mean and standard deviation of each feature • Get a sense of the central tendency and variation of the values • Examine the minimum and maximum values to understand the range that is possible for each feature • Histograms of continuous features will resemble the following well-understood shapes (probability distributions) • Recognizing the distribution of values for a feature will be useful when applying machine learning models
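Much of this first pass is available directly from pandas; a quick sketch on made-up values:

```python
import pandas as pd

df = pd.DataFrame({
    "income": [42_000, 51_000, 38_000, 60_500, 47_000],
    "injury_type": ["soft tissue", "broken limb", "soft tissue",
                    "serious", "soft tissue"],
})

# Continuous features: count, mean, std, min, quartiles, max in one call.
print(df.describe())

# Categorical features: count, unique (cardinality), top (the mode), freq.
print(df.describe(include="object"))
```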

  7. Uniform Distribution • A uniform distribution indicates that a feature is equally likely to take a value in any of the ranges present. • Sometimes indicative of a feature such as an ID, rather than something more interesting.

  8. Normal Distribution • Features following a normal distribution are characterized by a strong tendency towards a central value and symmetrical variation to either side of this. • Unimodal: single peak around the central tendency • Naturally occurring phenomena (e.g. the heights or weights of a randomly selected group of men or women) tend to follow a normal distribution.

  9. Skewed Distributions • Skew is simply a tendency towards very high (right skew) or very low (left skew) values.

  10. Exponential Distribution • In a feature following an exponential distribution, the likelihood of low values occurring is very high but diminishes sharply as values increase. • Examples: number of times a person has been married; number of times a person has made an insurance claim

  11. Multimodal Distribution • A feature characterized by a multimodal distribution has two or more very commonly occurring ranges of values that are clearly separated. • Bi-modal distribution: two clear peaks • “two normal distributions pushed together” • Tends to occur when a feature contains a measurement made across two distinct groups • Example: measurements of the heights of a randomly selected group of Irish men and women
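One way to build intuition for these shapes is to sample from each distribution and histogram the results; a sketch with numpy and matplotlib, where every parameter is an arbitrary choice:

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(42)
n = 10_000

samples = {
    "uniform": rng.uniform(0, 1, n),
    "normal": rng.normal(0, 1, n),
    "right skew": rng.lognormal(0, 0.5, n),    # long tail of high values
    "exponential": rng.exponential(1.0, n),    # many low values, sharp decay
    # Two normal distributions "pushed together", e.g. heights measured
    # across two distinct groups such as men and women.
    "bi-modal": np.concatenate([rng.normal(162, 6, n // 2),
                                rng.normal(176, 6, n // 2)]),
}

fig, axes = plt.subplots(1, 5, figsize=(18, 3))
for ax, (name, x) in zip(axes, samples.items()):
    ax.hist(x, bins=50)
    ax.set_title(name)
plt.tight_layout()
plt.show()
```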

  12. Normal Distribution • The probability density function for the normal distribution (or Gaussian distribution) is N(x, μ, σ) = (1 / (σ √(2π))) · exp(−(x − μ)² / (2σ²)) • x is any value • μ (mu) and σ (sigma) are parameters that define the shape of the distribution: the population mean and population standard deviation

  13. Standard normal distribution: μ = 0 and σ = 1.
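The density is a one-liner in code; a sketch that checks a hand-rolled version against scipy's implementation:

```python
import numpy as np
from scipy.stats import norm

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Gaussian probability density function."""
    return np.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

# Standard normal: mu = 0 and sigma = 1; should agree with scipy.stats.norm.
x = np.linspace(-3, 3, 7)
print(np.allclose(normal_pdf(x), norm.pdf(x)))   # True
```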

  14. 68-95-99.7 Rule • The 68-95-99.7 rule is a useful characteristic of the normal distribution. • The rule states that approximately: • 68% of the observations will be within one σ of μ • 95% of observations will be within two σ of μ • 99.7% of observations will be within three σ of μ. Very low probability of observations occurring that differ from the mean by more than two standard deviations.
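A quick empirical check of the rule on simulated data; μ = 10 and σ = 2 are arbitrary assumed parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=10.0, scale=2.0, size=1_000_000)   # mu = 10, sigma = 2
mu, sigma = x.mean(), x.std()

for k in (1, 2, 3):
    frac = np.mean(np.abs(x - mu) <= k * sigma)
    print(f"within {k} sigma: {frac:.4f}")   # ~0.6827, ~0.9545, ~0.9973
```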

  15. Case Study • Become familiar with the central tendency and variation of each feature using the data quality report. • Note bar graphs and histograms (earlier slides). • Note number of levels and frequency of Injury Type. • What is the type of probability distribution for each histogram? • Exponential Distribution: all except Income and Fraud Flag • Normal Distribution: Income (except for the 0 bar) • Fraud Flag: not a typical continuous feature

  16. Identifying Data Quality Issues • A data quality issue is loosely defined as anything unusual about the data in an ABT. • The most common data quality issues are: • missing values • Rule of thumb: remove a feature if more than 60% of its data is missing • irregular cardinality • Cardinality of 1: everything has the same value; no useful predictive information • Continuous features will usually have a cardinality value close to the number of instances • Investigate further if cardinality seems much lower or higher than expected • outliers (invalid vs. valid) • Investigate using domain knowledge • Compare the gap between the 3rd quartile and the maximum vs. the gap between the median and the 3rd quartile
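A sketch of how these checks might be automated; the 60% rule comes from the slide, while the 3x multiplier on the quartile gap and the toy columns are assumptions for illustration:

```python
import pandas as pd

def flag_issues(df: pd.DataFrame, miss_threshold: float = 0.60) -> None:
    """Print columns with heavy missingness, degenerate cardinality,
    or a large gap between the 3rd quartile and the maximum."""
    for col in df.columns:
        s = df[col]
        if s.isna().mean() > miss_threshold:        # rule of thumb: remove
            print(f"{col}: more than {miss_threshold:.0%} missing")
        if s.nunique(dropna=True) == 1:             # no predictive information
            print(f"{col}: cardinality 1")
        if pd.api.types.is_numeric_dtype(s):
            q3, med = s.quantile(0.75), s.median()
            # An upper-tail gap much wider than the median-to-q3 gap
            # suggests possible high outliers worth inspecting.
            if q3 > med and (s.max() - q3) > 3 * (q3 - med):
                print(f"{col}: max {s.max()} far above 3rd quartile {q3}")

toy = pd.DataFrame({"insurance_type": ["CI"] * 5,
                    "claim_amount": [1_000, 1_200, 900, 1_100, 60_000]})
flag_issues(toy)   # flags both columns
```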

  17. Case Study (refer to earlier tables and graphs) • Missing Values • Remove Marital Status feature • Note Income feature • Irregular Cardinality • No predictive information in Insurance Type • Fraud Flag should be a categorical feature • Other valid features with very low cardinality • Outliers • Unusual minimum value in instance #3 • Claim Amount, Total Claimed, Num Claims, Amount Received seem to have high maximum values compared to the 3rd quartile and median • Locate the instance in the dataset that leads to the high maximum values (instance #460) • Judge if it is a valid or invalid outlier

  18. Identifying Data Quality Issues • Data quality issues may be due to invalid data. • These need to be corrected! (e.g. calculation errors, data entry errors, …) • Data quality issues may also be due to valid data. • Sometimes this is okay, sometimes not. (It depends on the machine learning model.) • (e.g. missing data)

  19. Data Quality • It is unrealistic to expect that data will be perfect • Some data mining algorithms are more susceptible to data quality issues than others • We want to avoid “garbage in, garbage out” • A data cleaning phase for the detection and correction of data issues is often necessary during preprocessing

  20. Measurement and Data Collection Errors • Measurement error: any problem resulting from the measurement process; value recorded differs from true value to some extent • Data collection error: • data objects are omitted • attribute values are missing for some objects • inappropriately including a data object

  21. Outliers • Data objects that have characteristics that differ from most other data objects • In fraud detection, the goal is identifying these outliers • The value of an attribute may be very unusual with respect to the typical value • Do we have a data error, or is some individual really eight feet tall? • There are various statistical definitions of what an outlier is. • Outliers can be legitimate data objects or values (and may be of interest).
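One common statistical definition uses Tukey's fences on the interquartile range; a sketch with made-up heights, where 244 cm stands in for the eight-feet-tall individual:

```python
import pandas as pd

def iqr_outliers(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Tukey's fences: values more than k * IQR outside the quartiles."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s[(s < q1 - k * iqr) | (s > q3 + k * iqr)]

heights_cm = pd.Series([168, 172, 165, 180, 175, 244, 171])
print(iqr_outliers(heights_cm))   # flags 244 cm (roughly eight feet)
```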

  22. Missing Values • Often, values for some attributes are missing for some objects in data sets • Example: individuals who decline to provide their weight in a survey • What to do?

  23. Strategies for Dealing with Missing Data • Eliminate data objects that have missing values • Eliminate data attributes if any objects are missing that value • Estimate missing values • Data set may contain similar data points • Ignore missing values • If data mining method is robust
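A sketch of the first three strategies with pandas, on a toy survey-style table:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"weight": [70.5, np.nan, 82.0, np.nan, 64.3],
                   "age": [34.0, 29.0, np.nan, 51.0, 45.0]})

rows_dropped = df.dropna()                        # eliminate objects with missing values
cols_dropped = df.dropna(axis=1)                  # eliminate attributes with missing values
imputed = df.fillna(df.mean(numeric_only=True))   # estimate: impute the column mean
print(imputed)
# The fourth strategy, ignoring missing values, needs no code here:
# it relies on the downstream method handling NaNs natively.
```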

  24. Inconsistent Values • Example: • Data object with address, city, zip code in three separate fields • But address / city is in a different zip code • Some inconsistencies are easy to detect (and fix) automatically; others are not.
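A sketch of such an automatic check, assuming a hypothetical ZIP_TO_CITY lookup table; a real table would come from a postal dataset:

```python
import pandas as pd

# Hypothetical lookup table mapping zip codes to cities.
ZIP_TO_CITY = {"19085": "Villanova", "19104": "Philadelphia"}

df = pd.DataFrame({"city": ["Villanova", "Philadelphia", "Villanova"],
                   "zip": ["19085", "19104", "19104"]})

expected_city = df["zip"].map(ZIP_TO_CITY)
inconsistent = df[expected_city != df["city"]]   # city and zip disagree
print(inconsistent)                              # flags the last row
```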

  25. Duplicate Data • Example: • many people receive duplicate mailings because they are in a database multiple times under slightly different names
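A sketch of deduplication after light normalization; the names and the normalization steps (case and whitespace) are illustrative, and real mailing lists usually need fuzzier matching:

```python
import pandas as pd

df = pd.DataFrame({"name": ["John Smith", "john smith ", "Jane Doe"],
                   "city": ["Boston", "Boston", "Chicago"]})

# Normalize case and whitespace so near-identical entries compare equal.
key = df["name"].str.strip().str.lower() + "|" + df["city"].str.lower()
deduped = df[~key.duplicated()]
print(deduped)   # the "john smith " variant is dropped
```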

  26. Other Issues • Timeliness • Data starts to age as soon as it has been collected • Example: the general population of users interacts with Facebook differently than they did two years ago • Relevance • Sampling bias: occurs when a sample is not representative of the overall population • Example: survey data describes only those who responded to the survey

  27. Other Issues • The data set needs to contain attributes that are relevant to the overall problem • Example: Constructing an accurate model that predicts the accident rate for drivers might be fruitless without features such as: • age, previous accident history, number of speeding tickets, etc.

  28. Knowledge about the Data • Ideally, data sets are accompanied by documentation that describes different aspects of the data • Read it! • Example: documentation noting that missing values for a particular field are coded as -9999 • Documentation should also note the type of each feature (nominal, etc.) and its measurement scale (meters or feet, etc.)
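With pandas, such a sentinel can be declared at load time; a sketch assuming the -9999 coding from the example:

```python
import io

import pandas as pd

csv = io.StringIO("id,income\n1,52000\n2,-9999\n3,48500\n")

# Declaring the sentinel up front turns it into NaN, so -9999 is not
# mistaken for a valid (very negative) income.
df = pd.read_csv(csv, na_values=[-9999])
print(df["income"].isna().sum())   # 1
```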

  29. References • Kelleher, J. D., Mac Namee, B., and D'Arcy, A. Fundamentals of Machine Learning for Predictive Data Analytics, First Edition.
