Data Quality Class 2 David Loshin
Goals • Overview of Databases • Cost of low data quality • The information chain • Use of Mini Tools
Overview of Databases • Data Model • Tables and Attributes • Relational Databases
Data Model • A representation of an object • Describes the properties associated with a modeled entity • Also describes how those properties are represented
Tables and Attributes • A collection of object instances reside in a single table • Each instance is represented as a row in the table • Each property is manifested as a column in the table
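As a quick illustration (not from the slides), the sketch below renders a data model as a Python dataclass, with each property and its representation declared, and a table as a list of instances; the Customer entity and its attributes are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# A data model: each property of the modeled entity (a hypothetical customer)
# is declared together with how it is represented.
@dataclass
class Customer:
    customer_id: int          # unique identifier
    name: str                 # free-text name
    zip_code: Optional[str]   # may be missing, hence Optional

# A table: a collection of object instances, one row per instance,
# one column (attribute) per property.
customers = [
    Customer(1, "Acme Corp", "10001"),
    Customer(2, "Globex", None),
]
```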
Relational Databases • We attempt to manage data in “normal form” • Data in different tables are related via “keys” • A key is a set of one or more attributes that uniquely identify an entity within a table • Tables are related to each other via foreign keys
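A minimal sketch of keys and foreign keys, assuming hypothetical customers and orders tables held as plain Python dicts; the column names are invented for illustration.

```python
# Hypothetical tables: each row is a dict; column names are invented here.
customers = [
    {"customer_id": 1, "name": "Acme Corp"},
    {"customer_id": 2, "name": "Globex"},
]
orders = [
    {"order_id": 100, "customer_id": 1, "amount": 250.0},
    {"order_id": 101, "customer_id": 2, "amount": 75.0},
]

# "customer_id" is the key of customers: it uniquely identifies each row.
assert len({row["customer_id"] for row in customers}) == len(customers)

# In orders, "customer_id" is a foreign key relating each order to a customer.
by_key = {row["customer_id"]: row for row in customers}
for order in orders:
    print(order["order_id"], "placed by", by_key[order["customer_id"]]["name"])
```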
Relational Databases con’t. • Referential Integrity Constraint – If a value is used as a foreign key into another table, a record must exist in that table with that value as its primary key • Functional Dependence – An attribute B is functionally dependent on attribute A if, for each distinct value j of A, there is a single corresponding value k of B; in every record where A contains j, B contains k
Databases and Data Quality • Data quality implies validation of all integral constraints associated with a database • Existence of a primary key • Referential Integrity • Null value constraints • Functional Dependence • Data quality will also encompass higher level content-oriented rules
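One hedged way to picture these integrity checks in code; the function names and table layout (lists of dict rows) are illustrative, not from the slides or any particular tool.

```python
def primary_key_ok(table, key):
    """Primary key exists: no nulls and no duplicate key values."""
    values = [row.get(key) for row in table]
    return None not in values and len(set(values)) == len(values)

def referential_integrity_ok(child, fk, parent, pk):
    """Every foreign-key value appears as a primary-key value in the parent table."""
    parent_keys = {row[pk] for row in parent}
    return all(row[fk] in parent_keys for row in child)

def nulls_ok(table, required):
    """Null value constraints: required attributes must not be None."""
    return all(row.get(col) is not None for row in table for col in required)

def functionally_dependent(table, a, b):
    """B is functionally dependent on A: each value of A maps to exactly one value of B."""
    seen = {}
    for row in table:
        if row[a] in seen and seen[row[a]] != row[b]:
            return False
        seen[row[a]] = row[b]
    return True
```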
Cost of Low Data Quality • Data quality is measured using anecdotes • “Hazy” feeling of wrongness • Desire to gauge the true cost of poor data quality
Evidence of Economic Impact • Frequent service interruptions and system failures • Drop in productivity vs. volume • High employee turnover • High new business/continued business ratio • Increased customer service requirements • Customer Attrition
The Data Quality Scorecard • Use the scorecard as a tool to manage the corporate information asset • Precise methods to measure the level of data quality • Evaluate the costs and impacts associated with low data quality • Build an ROI model
Building and Using the Data Quality Scorecard • Map the flow of information • Find the critical points of pain • Locate the origin of the problems • Identify the impacts • Calculate the cost • Identify targets for improvement
Map the Flow of Information • Data processing can be likened to an “information factory” • Data sets from multiple sources are used as “raw input” • Final products are created in the form of business processes, information products, strategic reports, etc.
Stages in the Information Map • Data Supply • Data Acquisition • Data Creation • Data Processing • Data Packaging • Decision Making • Decision Implementation • Data Delivery • Data Consumption
Directed Information Channels • Indicates the flow of information from one processing stage to another • Example: supplier data is delivered to an acquisition stage through an information channel • Directed indicates the direction in which data flows • This effectively maps all points at which a data fault or nonconformance may appear
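A minimal sketch of an information chain as a directed graph, using the stage names from the slide above; the specific channels drawn between stages are assumptions for illustration.

```python
# Each key is a processing stage; each value lists the stages its
# information channels flow into (directed edges).
information_chain = {
    "data supply":             ["data acquisition"],
    "data acquisition":        ["data creation", "data processing"],
    "data creation":           ["data processing"],
    "data processing":         ["data packaging"],
    "data packaging":          ["decision making", "data delivery"],
    "decision making":         ["decision implementation"],
    "data delivery":           ["data consumption"],
    "decision implementation": [],
    "data consumption":        [],
}

# Every directed channel is a point at which a fault or nonconformance may appear.
channels = [(src, dst) for src, dsts in information_chain.items() for dst in dsts]
print(len(channels), "channels to monitor")
```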
Find the Critical Points of Pain • Look for evidence of impact • Frequent system failures • Drops in productivity • High employee turnover • Increased customer service requirements • Inability to scale • Decreased margins • Customer attrition
Customer Interviews • Ask about potential for data errors associated with: • Billing • Customer service • Attrition • Recommendations • Ask about customer perception of the organization
Employee Interviews • Look for instances where low data quality affects smooth operation • Seek out scrap and rework: • Where do data problems affect ability to do job? • How often must processes be rerun due to data problems? • How much time is spent fixing data problems? • How does error correction scale within organization? • What keeps employees from being able to get their job done successfully and on time?
Preliminary Expectations • It is early in the process for a formal definition of rules • But not too early for a “gross-level” statement of expectations • Example: All addresses must contain a street name, city, state, and ZIP code
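That gross-level expectation can be written as a simple predicate; a sketch with invented field names:

```python
REQUIRED_ADDRESS_FIELDS = ("street", "city", "state", "zip")

def address_conforms(record):
    """Gross-level expectation: all address components are present and non-empty."""
    return all(record.get(field) for field in REQUIRED_ADDRESS_FIELDS)

print(address_conforms(
    {"street": "1 Main St", "city": "Boston", "state": "MA", "zip": "02101"}))  # True
print(address_conforms(
    {"street": "1 Main St", "city": "", "state": "MA", "zip": "02101"}))        # False
```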
Initial Assessment • Simple tests for non-conformance to expectations • Highest level assessment may be done using sampling • Get a gross-level score for conformance • Define rules • Test • Measure • Score
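A sketch of a sampling-based initial assessment, assuming each rule is a predicate over a record; the sample size and rule set are placeholders.

```python
import random

def conformance_score(records, rules, sample_size=100):
    """Define rules, test a random sample, measure failures, and score conformance."""
    sample = random.sample(records, min(sample_size, len(records)))
    if not sample:
        return 1.0  # nothing to test
    passed = sum(all(rule(rec) for rule in rules) for rec in sample)
    return passed / len(sample)  # fraction of sampled records meeting all rules

# Hypothetical usage with the address rule from the previous sketch:
# score = conformance_score(customer_records, [address_conforms])
```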
Isolate Flawed Data • Where are data problems recognized? • Who finds the problem data? • Customers? • Call center personnel? • Internal Knowledge Workers? • What are the workers’ rewards for finding bad data?
Trace Back to Origin of Fault • Follow path of information from its insertion into information flow through its exit points • Trace backward through the information chain to find the point at which the information became flawed.
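Tracing backward can be pictured as walking the chain's channels in reverse from the stage where the flaw was detected; a sketch reusing the hypothetical information_chain dict from the earlier sketch.

```python
def upstream_stages(chain, detected_at):
    """Collect every stage from which flawed data could have flowed into detected_at."""
    reverse = {}
    for src, dsts in chain.items():
        for dst in dsts:
            reverse.setdefault(dst, []).append(src)
    seen, frontier = set(), [detected_at]
    while frontier:
        stage = frontier.pop()
        for src in reverse.get(stage, []):
            if src not in seen:
                seen.add(src)
                frontier.append(src)
    return seen  # candidate origin points of the fault

# e.g. upstream_stages(information_chain, "data consumption")
```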
Identify the Impacts • The quality of information at any consumption stage can affect any of these variables: • Increase in Profit • Decrease in Profit • Cost increase • Cost decrease • Delay • Speedup • Increased satisfaction • Decreased satisfaction
Soft Impacts • Difficulty in decision-making • Time delays in operation • Organizational mistrust • Lowered ability to compete • Data ownership conflicts • Lowered customer satisfaction • Lowered employee satisfaction
Hard Impacts • Customer attrition • Scrap and rework • Error prevention • Increased customer service costs • Costs associated with fixing customer problems • Spin control • Loss of equity value • Enterprisewide data inconsistency
Calculate the Costs • Some impacts are easily tied to hard $$$ • Some impacts are hard to characterize exactly, but are clearly felt • Assign some cost to each problem
Cost Categories • Detection • Correction • Rollback • Rework • Prevention • Warranty • Reduction • Attrition • Blockading
Information Chains and Data Flows • Multiple impacts may be attributed to the same data problem • Tracing problems back through the information chain provides insight into overall impact of poor data quality
The Assessment Matrix • Axis 1: • Log each data quality problem • Axis 2: • Specify activities associated with each problem • Axis 3: • Impact areas for each activity • Each cell contains the estimated cost associated with the impact
Aggregate Costs, Build the Model • Superimpose matrix onto spreadsheet • Tally and summarize across the model • Use the spreadsheet as a simulation model
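A minimal sketch of the matrix as (problem, activity, impact area) cells tallied per problem; the sample entries and costs are invented placeholders, and in practice this lives in the spreadsheet the slide describes.

```python
from collections import defaultdict

# Each cell of the assessment matrix: (problem, activity, impact area) with an estimated cost.
cells = [
    ("duplicate customer records", "billing",          "rework",                   12000.0),
    ("duplicate customer records", "customer service", "increased service costs",   8000.0),
    ("missing ZIP codes",          "mailing",          "scrap and rework",          5000.0),
]

totals = defaultdict(float)
for problem, activity, impact, cost in cells:
    totals[problem] += cost            # tally per data quality problem

for problem, cost in totals.items():
    print(f"{problem}: ${cost:,.0f}")
```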
Putting it Together • Map the information chain • Conduct interviews to locate data quality problems • Annotate information chain with location of data quality problems • Identify impact domains for each problem • Characterize economic impact (=cost!) • Aggregate totals
The Information Chain • Represented as a directed graph • Vertices are processing stages • Edges are directed channels • At each intersection point, we manage a collection of objects representing data objects passing through that intersection
Information Chain, con’t. • Each intersection data object contains: • The model of the data passing through the intersection • A set of rules describing validity for those data • The named reference objects that are related to the data passing through • At each point in the information chain, we can measure conformance to our data quality validation criteria
Information Chain, 3 • We can model the information chain in a database • Table for vertices • Table for edges • Table for data objects • Table for rules • Table for reference objects • See the book for details
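One possible rendering of those tables, sketched with SQLite from Python's standard library; the column names are assumptions and the book's actual schema may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE vertex      (vertex_id INTEGER PRIMARY KEY, stage_name TEXT);
CREATE TABLE edge        (edge_id INTEGER PRIMARY KEY,
                          from_vertex INTEGER REFERENCES vertex(vertex_id),
                          to_vertex   INTEGER REFERENCES vertex(vertex_id));
CREATE TABLE data_object (object_id INTEGER PRIMARY KEY,
                          edge_id INTEGER REFERENCES edge(edge_id),
                          model TEXT);
CREATE TABLE rule        (rule_id INTEGER PRIMARY KEY,
                          object_id INTEGER REFERENCES data_object(object_id),
                          expression TEXT);
CREATE TABLE reference_object (ref_id INTEGER PRIMARY KEY,
                               object_id INTEGER REFERENCES data_object(object_id),
                               name TEXT);
""")
```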
Data Quality: Using the Tools • Example data: hierarchical department name data • Source: two data sources • Goal: if we wanted to go with only one data source, what would the impact be on the other?
Example, con’t. • Goals: • Determine the overlap between source A and source B • Determine what is not in the intersection of the two data sources • Look for duplicates that are and are not exact matches
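A hedged sketch of those three goals using Python sets and difflib; the department names and the 0.7 similarity threshold are placeholders.

```python
from difflib import SequenceMatcher

source_a = {"Human Resources", "Finance", "Information Technology"}
source_b = {"Human Resources", "Finance Dept", "Marketing"}

overlap   = source_a & source_b    # exact matches present in both sources
only_in_a = source_a - source_b    # not in the intersection
only_in_b = source_b - source_a

# Duplicates that are not exact matches: high string similarity across sources.
near_duplicates = [
    (a, b)
    for a in only_in_a
    for b in only_in_b
    if SequenceMatcher(None, a.lower(), b.lower()).ratio() > 0.7
]
print(overlap, only_in_a, only_in_b, near_duplicates)
```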