1 / 162

Data Quality and Data Cleaning: An Overview

Data Quality and Data Cleaning: An Overview. Tamraparni Dasu Theodore Johnson {tamr,johnsont}@research.att.com AT&T Labs - Research. Acknowledgements.

finn
Télécharger la présentation

Data Quality and Data Cleaning: An Overview

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Data Quality and Data Cleaning: An Overview Tamraparni Dasu Theodore Johnson {tamr,johnsont}@research.att.com AT&T Labs - Research

  2. Acknowledgements • We would like to thank the following people who contributed to this tutorial and to our book,Exploratory Data Mining and Data Quality (Wiley) Deepak Agarwal, Dave Belanger, Bob Bell, Simon Byers, Corinna Cortes, Ken Church, Christos Faloutsos, Mary Fernandez, Joel Gottlieb, Andrew Hume, Nick Koudas, Elefteris Koutsofios, Bala Krishnamurthy, Ken Lyons, David Poole, Daryl Pregibon, Matthew Roughan, Gregg Vesonder, and Jon Wright.

  3. Learning Objectives • Data Quality: The nature of the beast • DQ is pervasive and expensive • The problems are messy and unstructured • Cannot be pigeonholed into clearly defined problems or disciplines • As a consequence, there is very little research and few systematic solutions. • Purpose of the tutorial • Define a framework for the problem of data quality, place the existing work within this structure • Discuss methods for detecting, defining, measuring and resolving data quality issues • Identify challenges and opportunities • Make research directions clear.

  4. Overview • Data Quality • Motivation • The meaning of data quality (1) • The data quality continuum • The meaning of data quality (2) • Data quality metrics • Data quality process • Where do problems come from • How can they be resolved • Technical Tools • Process management • Database • Metadata • Statistics • Case Study • Research directions and references

  5. Why DQ? • Data quality problems are expensive and pervasive • DQ problems cost hundreds of billions of $$$ each year. • Lost revenues, credibility, customer retention • Resolving data quality problems is often the biggest effort in a data mining study. • 50%-80% of time in data mining projects spent on DQ • Interest in streamlining business operations databases to increase operational efficiency (e.g. cycle times), reduce costs, conform to legal requirements

  6. The Meaning of Data Quality (1)

  7. Meaning of Data Quality (1) • Data do not conform to expectations, opaque • Data do not match up to specifications • violations of data types, schema • glitches e.g. missing, inconsistent and incomplete data, typos • The specs are inaccessible : complex, lack of metadata, hidden rules, no documentation, word of mouth, changing. • Many sources and manifestations • Manual errors, HW/SW constraints, shortcuts in data processing.

  8. Example T.Das |55536o8327 |24.95|Y|- |0.0|1000 Ted J.|555-360-8779|2000 |N|M|NY|1000 • Can we interpret the data? • What do the fields mean? • Units of measurement? Field three is Revenue. In dollars or cents? • Data glitches • Typos, multiple formats, missing / default values • Metadata and domain expertise • Field seven is Usage. Is it censored? • Field 4 is a censored flag. How to handle censored data?

  9. Missing Data & Defaults 1818|5|NA|NA|NA|NA|NA|NA|NA|NA|0.000000|0.000000|0.000000|0.000000|NA|NA|NA|NA|NA|NA|NA|NA|NA|NA|NA|NA|0.000000 …… 8560|4448|47041.000000|46969.000000|472.000000|472.000000|46636.000000|46564.000000|466.000000|466.000000 ……. 195|1081|7829073.000000|7814692.000000|25849.000000|25849.000000|10418657.000000|10405704.000000|27897.000000 ……

  10. Common Data Glitches • Systemic changes to data which are external to the recorded process. • Changes in data layout / data types • Integer becomes string, fields swap positions, etc. • Changes in scale / format • Dollars vs. euros • Temporary reversion to defaults • Failure of a processing step • Missing and default values • “0 represents both missing and default values” • Gaps in time series • Especially when records represent incremental changes.

  11. Conventional Definition of Data Quality • Accuracy • The data was recorded correctly. • Completeness • All relevant data was recorded. • Uniqueness • Entities are recorded once. • Timeliness • The data is kept up to date. • Consistency • The data agrees with itself.

  12. Problems • Not measurable • Accuracy and completeness are difficult, perhaps impossible to measure. • Rigid and static • Context independent • No accounting for what is important. • Aggregates can tolerate inaccuracies but signatures cannot • Incomplete: interpretability, accessibility, metadata, relevance to analysis, etc.? • Vague • The conventional definitions provide no guidance towards practical improvements of the data.

  13. Finding a modern definition • We need a definition of data quality which • Reflects the use of the data • Leads to improvements in processes • Is measurable (we can define metrics) • First, we need a better understanding of how and where data quality problems occur • The data quality continuum

  14. The Data Quality Continuum • Data/information is not static, it flows in a data collection and usage process • Data gathering • Data delivery • Data storage • Data integration • Data retrieval • Data mining/analysis • Problems can and do arise at all of these stages • End-to-end, continuous monitoring needed

  15. Data Gathering • How does the data enter the system? • Sources of problems: • Manual entry • No uniform standards for content and formats • Parallel data entry (duplicates) • Approximations, surrogates – SW/HW constraints • Measurement errors.

  16. Solutions • Potential Solutions: • Preemptive: • Process architecture (build in integrity checks) • Process management (reward accurate data entry, data sharing, documentation, data stewards) • Retrospective: • Cleaning focus (duplicate removal, merge/purge, name & address matching, field value standardization) • Diagnostic focus (automated detection of glitches).

  17. Data Delivery • Data sent from source to destination • hops • Destroying or mutilating information by inappropriate pre-processing • Inappropriate aggregation • Nulls converted to default values • Loss of data: • Buffer overflows • Transmission problems • No checks • Did all the files arrive in their entirety?

  18. Solutions • Build reliable transmission protocols • Use a relay server • Verification • Checksums, verification parser • Do the uploaded files fit an expected pattern? • Relationships • Are there dependencies between data streams and processing steps • Interface agreements • Data quality commitment from the data stream supplier.

  19. Data Storage • Problems in physical storage • Can be an issue, but terabytes are cheap. • Problems in logical storage • Poor metadata. • Data feeds are often derived from application programs or legacy data sources. • Inappropriate data models. • Missing timestamps, incorrect normalization, etc. • Ad-hoc modifications. • Structure the data to fit the GUI. • Hardware / software constraints. • Data transmission via Excel spreadsheets, Y2K

  20. Solutions • Metadata • Document and publish data specifications in real time. • Planning • Provide for worst case scenarios. • Size the resources to the data feeds. • Data exploration • Use data browsing and data mining tools to examine the data. • Does it meet the specifications? • Has something changed? • Departure from expected values and process?

  21. Data Integration • Combine data sets (acquisitions, across departments). • Common source of problems • Heterogeneous data : no common key, different field formats • Approximate matching (e.g., names and addresses) • Different definitions • What is a customer: an account, an individual, a contract … • Time synchronization • Does the data relate to the same time periods? Are the time windows compatible? • Legacy data • IMS, spreadsheets, ad-hoc structures • Sociological factors • Reluctance to share – loss of power.

  22. Solutions • Commercial Tools • Significant body of research in data integration • Many tools for address matching, schema mapping are available. • Data browsing and integration • Many hidden problems and meanings : extract metadata. • View before and after results : anomalies in integration? • Manage people and processes • End-to-end accountability: data steward? • Reward data sharing and data maintenance

  23. Data Retrieval • Exported data sets are often an extract of the actual data. Problems occur because: • Source data not properly understood. • Need for derived data not understood. • Simple mistakes. • Inner join vs. outer join • Understanding NULL values • Computational constraints • E.g., too expensive to give a full history, instead use a snapshot. • Incompatibility • EBCDIC?

  24. Solutions • Tools – use appropriate ETL, EDM and XML tools • Testing – Test queries to make sure that the result matches with what is expected • Plan ahead – Make sure that the retrieved data can be stored, delivered as per specifications

  25. Data Mining and Analysis • Data collected for analysis and mining • Problems in the analysis. • Scale and performance • Confidence bounds? • Black boxes and dart boards • “Don’t need no analysts” • Attachment to models • Insufficient domain expertise • Casual empiricism

  26. Solutions • Data exploration • Determine which models and techniques are appropriate, find data bugs, develop domain expertise. • Continuous analysis • Are the results stable? How and why do they change? • Accountability • Validate analysis • Make the analysis part of the feedback loop to improve data processes

  27. The Meaning of Data Quality (2)

  28. Meaning of Data Quality Revisited • Conventional definitions: completeness, uniqueness, consistency, accuracy etc. – measurable? • Modernize definition of DQ in the context of the DQ continuum • Depends on data paradigms (data gathering, storage) • Federated data • High dimensional data • Descriptive data • Longitudinal data • Streaming data • Web (scraped) data • Numeric vs. categorical vs. text data

  29. DQ Meaning Revisited • Depends on applications (delivery, integration, analysis) • Business operations • Aggregate analysis, prediction • Customer relations … • Data Interpretation • Know all the rules used to generate the data • Data Suitability • Use of proxy data • Relevant data is missing Increased DQ  Increased reliability and usability (directionally correct)

  30. Data Quality Metrics

  31. Data Quality Constraints • Many data quality problems can be captured by static constraints based on the schema. • Nulls not allowed, field domains, foreign key constraints, etc. • Many others are due to problems in workflow, and can be captured by dynamic constraints • E.g., orders above $200 are processed by Biller 2 • The constraints follow an 80-20 rule • A few constraints capture most cases, thousands of constraints to capture the last few cases. • Constraints are measurable. Data Quality Metrics?

  32. Data Quality Metrics • Need to quantify data quality • DQ is complex, no set of numbers will be perfect • Context and domain dependent • Should indicate what is wrong and how to improve • Should be measurable, practical and implementable • Types of metrics • Static vs. dynamic constraints • Operational vs. diagnostic • Metrics should be directionally correct with an improvement in use of the data. • A very large number of metrics are possible • Choose the most important ones depending on context.

  33. Examples of Data Quality Metrics • Usability and reliability of the data • Conformance to schema (static) • Evaluate constraints on a snapshot. • Conformance to business rules (dynamic) • Evaluate constraints on changes in the database • Across time or across databases. • Accuracy • Perform inventory (expensive), or use proxy (track complaints). • Accessibility, Interpretability • Glitches in analysis • Successful completion of end-to-end process • Increase in automation, others …

  34. Technical Tools

  35. Technical Approaches • Need a multi-disciplinary approach • No single approach solves all problems • Process management • Pertains to data process and flows • Checks and controls, audits • Database • Storage, access, manipulation and retrieval • Metadata / domain expertise • Interpretation and understanding • Statistics • Analysis, diagnosis, model fitting, prediction, decision making …

  36. Process Management • Business processes which encourage DQ • Assign dollars to quality problems • Standardize content and formats • Enter data once, enter it correctly (incentives for sales, customer care) • Automation • Assign responsibility: data stewards • Continuous end-to-end data audits and reviews • Transitions between organizations. • Data Monitoring • Data Publishing • Feedback loops

  37. Data Monitoring • Data processing systems are often thought of as open-loop systems. • Process and forget … • Computers don’t make mistakes, do they? • Analogy to control systems : feedback loops. • Monitor the system to detect difference between actual and intended • Feedback loop to correct the behavior of earlier components

  38. Example • Sales, provisioning, and billing for telecommunications service • Many stages involving handoffs between organizations and databases • Simplified picture • Transition between organizational boundaries is a common cause of problems. • Natural feedback loops • Customer complains if the bill is too high • Missing feedback loops • No complaints if we undercharge.

  39. Example Sales Order Customer Customer Care Customer Account Information Billing Provisioning Existing Data Flow Missing Data Flow

  40. Data Monitoring • Use data monitoring to add missing feedback loops. • Methods: • Data tracking / auditing • Follow a sample of transactions through the workflow. • Reconciliation of incrementally updated databases with original sources. • Mandated consistency with a Database of Record (DBOR). • Data Publishing

  41. Data Publishing • Make the contents of a database available in a readily accessible and digestible way • Web interface (universal client). • Data Squashing : Publish aggregates, cubes, samples, parametric representations. • Publish the metadata. • Close feedback loops • Many people look at data, use different sections for different purposes in different ways, “test” the data • Surprisingly difficult sometimes. • Organizational boundaries, loss of control interpreted as loss of power, desire to hide problems.

  42. Databases Technology • Why use databases? • Statisticians spend a lot of time on EDA, sanity checks and summarization. • Powerful data analysis and query tools. • Extensive data import/export/access facilities. • Data validation. • Integration of data from multiple sources. • Most data lives in databases

  43. Relational Databases • A database is a collection of tables. • Each table is a collection of records. • Each record contains values of named fields. • The values can be NULL, • Meaning either “don’t know” or “not applicable”. • A key is a field (or set of fields) whose value is unique in every record of a table • Identifies the thing described by the record • Data from different tables is associated by matching on field values (join). • Foreign key join: all values of one field are contained in the set of values of another field, which is a key.

  44. Name ID BaseSalary Commission Joe 1101 10,000 14% Ted 1113 20,000 10% Mary 1211 15,000 12% Sunita 1514 8,000 15% ID SalesForceID SaleDate DeliveryDate 22122 1101 8/3/2002 8/9/2002 22124 1514 8/8/2002 8/23/2002 22325 1211 8/15/2002 NULL 23001 1514 8/24/2002 9/4/2002 SalesForce Orders Foreign Key

  45. SQL(Structured Query Language) How many sales are pending delivery, by salesperson? Specify attributes Specify data sources Select S.Name, count(*) From SalesForce S, Orders O Where S.ID = O.SalesForceID And O.DeliveryDate IS NULL Group By S.Name Integrate DeliveryDate is NULL means that it is pending Count by salesperson name

  46. Database Tools • Most DBMS’s provide many data consistency tools • Data types • String, date, float, integer • Domains (restricted set of field values) • Restriction on field values e.g. telephone number • Constraints • Column Constraints • Not Null, Unique, Restriction of values • Table constraints • Primary and foreign key constraints • Powerful query language • Triggers • Timestamps, temporal DBMS

  47. Then why is every DB dirty? • Consistency constraints are often not used • Cost of enforcing the constraint • E.g., foreign key constraints, triggers. • Loss of flexibility • Constraints not understood • E.g., large, complex databases with rapidly changing requirements • DBA does not know / does not care. • Complex, heterogeneous, poorly understood data • Merged, federated, web-scraped DBs. • Undetectable problems • Incorrect values, missing data • Metadata not maintained • Database is too complex to understand

  48. Semantic Complexity • Different ideas about exactly what the data represents leads to errors. • Example: • HR uses the SalesForce table to record the current status of the sales staff. When a person leaves employment, their record is deleted. • The CFO uses the SalesForce and Orders tables to compute the volume of sales each quarter. • Strangely, their numbers are always too low … • Enforcing foreign key join by deletion will drop records from Orders table which is even worse. • CFO needs historical view of Salesforce table to get accurate answers • DQ problems arise when information is not fully communicated to the users

More Related