1 / 36

Data Quality

Data Quality. Class 2 David Loshin. Goals. Overview of Databases Cost of low data quality The information chain Use of Mini Tools. Overview of Databases. Data Model Tables and Attributes Relational Databases. Data Model. A representation of an object

babell
Télécharger la présentation

Data Quality

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Data Quality Class 2 David Loshin

  2. Goals • Overview of Databases • Cost of low data quality • The information chain • Use of Mini Tools

  3. Overview of Databases • Data Model • Tables and Attributes • Relational Databases

  4. Data Model • A representation of an object • Describes the properties associated with a modeled entity • Also describes how those properties are represented

  5. Tables and Attributes • A collection of object instances reside in a single table • Each instance is represented as a row in the table • Each property is manifested as a column in the table

  6. Relational Databases • We attempt to manage data in “normal form” • Data in different tables are related via “keys” • A key is a set of one or more attributes that uniquely identify an entity within a table • Tables are related to each other via foreign keys

  7. Relational Databases con’t. • Referential Integrity Constraint – If a value is used as a foreign key to a different table, a record must exist in that table that has that value as its primary key • Functional Dependence – An attribute B is functionally dependent on attribute A if for any distinct value j of A, there is a corresponding value k of B, then in all instances where A contains j B will contain k

  8. Databases and Data Quality • Data quality implies validation of all integral constraints associated with a database • Existence of a primary key • Referential Integrity • Null value constraints • Functional Dependence • Data quality will also encompass higher level content-oriented rules

  9. Cost of Low Data Quality • Data quality is measured using anecdotes • “Hazy” feeling of wrongness • Desire to gauge the true cost of poor data quality

  10. Evidence of Economic Impact • Frequent service interruptions and system failures • Drop in productivity vs. volume • High employee turnover • High new business/continued business ratio • Increased customer service requirements • Customer Attrition

  11. The Data Quality Scorecard • Use scorecard as a tool to manage the corporate information asset • Precise methods to measure level of data quality • Evaluate the costs and impacts associated with low data quality • Build a ROI model Knowledge Integrity Incorporated

  12. Building and Using the Data Quality Scorecard • Map the flow of information • Find the critical points of pain • Locate the origin of the problems • Identify the impacts • Calculate the cost • Identify targets for improvement Knowledge Integrity Incorporated

  13. Map the Flow of Information • Data processing can be likened to an “information factory” • Data sets from multiple sources are used as “raw input” • Final products are created in the form of business processes, information products, strategic reports, etc. Knowledge Integrity Incorporated

  14. Stages in the Information Map • Data Supply • Data Acquisition • Data Creation • Data Processing • Data Packaging • Decision Making • Decision Implementation • Data Delivery • Data Consumption Knowledge Integrity Incorporated

  15. Directed Information Channels • Indicates the flow of information from one processing stage to another • Example: supplier data is delivered to an acquisition stage through an information channel • Directed indicates the direction in which data flows • This effectively maps all points at which a data fault or nonconformance may appear Knowledge Integrity Incorporated

  16. Find the Critical Points of Pain • Look for evidence of impact • Frequent system failures • Drops in productivity • High employee turnover • Increased customer service requirements • Inability to scale • Decreased margins • Customer attrition Knowledge Integrity Incorporated

  17. Customer Interviews • Ask about potential for data errors associated with: • Billing • Customer service • Attrition • Recommendations • Ask about customer perception of the organization Knowledge Integrity Incorporated

  18. Employee Interviews • Look for instances where low data quality affects smooth operation • Seek out scrap and rework: • Where do data problems affect ability to do job? • How often must processes be rerun due to data problems? • How much time is spent fixing data problems? • How does error correction scale within organization? • What keeps employees from being able to get their job done successfully and on time? Knowledge Integrity Incorporated

  19. Preliminary Expectations • Early in process for formal definition of rules • Not too early for “gross-level” statement of expectations • Example: All addresses must contain street name, city, state, and ZIP code Knowledge Integrity Incorporated

  20. Initial Assessment • Simple tests for non-conformance to expectations • Highest level assessment may be done using sampling • Get a gross-level score for conformance • Define rules • Test • Measure • Score Knowledge Integrity Incorporated

  21. Isolate Flawed Data • Where are data problems recognized? • Who finds the problem data? • Customers? • Call center personnel? • Internal Knowledge Workers? • What are the workers’ rewards for finding bad data? Knowledge Integrity Incorporated

  22. Trace Back to Origin of Fault • Follow path of information from its insertion into information flow through its exit points • Trace backward through the information chain to find the point at which the information became flawed. Knowledge Integrity Incorporated

  23. Identify the Impacts • The quality of information at any consumption stage can affect any of these variables: • Increase in Profit • Decrease in Profit • Cost increase • Cost decrease • Delay • Speedup • Increased satisfaction • Decreased satisfaction Knowledge Integrity Incorporated

  24. Soft Impacts • Difficulty in decision-making • Time delays in operation • Organizational mistrust • Lowered ability to compete • Data ownership conflicts • Lowered customer satisfaction • Lowered employee satisfaction Knowledge Integrity Incorporated

  25. Hard Impacts • Customer attrition • Scrap and rework • Error prevention • Increased customer service costs • Costs associated with fixing customer problems • Spin control • Loss of equity value • Enterprisewide data inconsistency Knowledge Integrity Incorporated

  26. Calculate the Costs • Some impacts are easily tied to hard $$$ • Some impacts are hard to characterize exactly, but are clearly felt • Assign some cost to each problem Knowledge Integrity Incorporated

  27. Cost Categories • Detection • Correction • Rollback • Rework • Prevention • Warranty • Reduction • Attrition • Blockading Knowledge Integrity Incorporated

  28. Information Chains and Data Flows • Multiple impacts may be attributed to the same data problem • Tracing problems back through the information chain provides insight into overall impact of poor data quality Knowledge Integrity Incorporated

  29. The Assessment Matrix • Axis 1: • Log each data quality problem • Axis 2: • Specify activities associated with each problem • Axis 3: • Impact areas for each activity • Each cell contains the estimated cost associated with the impact Knowledge Integrity Incorporated

  30. Aggregate Costs, Build the Model • Superimpose matrix onto spreadsheet • Tally and summarize across the model • Use the spreadsheet as a simulation model Knowledge Integrity Incorporated

  31. Putting it Together • Map the information chain • Conduct interviews to locate data quality problems • Annotate information chain with location of data quality problems • Identify impact domains for each problem • Characterize economic impact (=cost!) • Aggregate totals

  32. The Information Chain • Represented as a directed graph • Vertices are processing stages • Edges are directed channels • At each intersection point, we manage a collection of objects representing data objects passing through that intersection

  33. Information Chain, con’t. • Each intersection data object contains: • The model of the data passing through the intersection • A set of rules describing validity for those data • The named reference objects that are related to the data passing through • At each point in the information chain, we can measure conformance to our data quality validation criteria

  34. Information Chain, 3 • We can model the information chain in a database • Table for vertices • Table for edges • Table for data objects • Table for rules • Table for reference objects • See the book for details

  35. Data Quality: Using the Tools • Example data: Hierarchical Department name data • Source: 2 data sources • Goal: If we wanted to go with 1 data source, how would it impact the other?

  36. Example, con’t. • Goals: • Determine overlap between source A and source B • Determine what is not is intersection between 2 data sources • Look for duplicates that are and are not exact matches

More Related