Data Quality Class 2 David Loshin
Goals • Overview of Databases • Cost of low data quality • The information chain • Use of Mini Tools
Overview of Databases • Data Model • Tables and Attributes • Relational Databases
Data Model • A representation of an object • Describes the properties associated with a modeled entity • Also describes how those properties are represented
Tables and Attributes • A collection of object instances resides in a single table • Each instance is represented as a row in the table • Each property is manifested as a column in the table
Relational Databases • We attempt to manage data in “normal form” • Data in different tables are related via “keys” • A key is a set of one or more attributes that uniquely identify an entity within a table • Tables are related to each other via foreign keys
Relational Databases con’t. • Referential Integrity Constraint – If a value is used as a foreign key into another table, a record must exist in that table with that value as its primary key • Functional Dependence – An attribute B is functionally dependent on attribute A if, for each distinct value j of A, there is a single corresponding value k of B, so that in every record where A contains j, B contains k
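Both constraints can be tested directly against the data. A minimal sketch in Python, assuming simple in-memory rows; the column names and the ZIP-to-city example are illustrative, not from the course material:

```python
from collections import defaultdict

def referential_integrity_violations(child_rows, parent_rows, fk, pk):
    """Rows whose foreign-key value has no matching primary key in the parent
    table (violations of the referential integrity constraint)."""
    parent_keys = {row[pk] for row in parent_rows}
    return [row for row in child_rows if row[fk] not in parent_keys]

def functional_dependence_violations(rows, a, b):
    """B is functionally dependent on A when each distinct value of A maps to
    exactly one value of B; return the A values that map to more than one."""
    seen = defaultdict(set)
    for row in rows:
        seen[row[a]].add(row[b])
    return {j: ks for j, ks in seen.items() if len(ks) > 1}

# Illustrative check: ZIP code -> city is expected to be a functional dependence.
rows = [
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "NYC"},        # breaks the dependence
    {"zip": "60601", "city": "Chicago"},
]
print(functional_dependence_violations(rows, "zip", "city"))
# {'10001': {'New York', 'NYC'}}
```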
Databases and Data Quality • Data quality implies validation of all integral constraints associated with a database • Existence of a primary key • Referential Integrity • Null value constraints • Functional Dependence • Data quality will also encompass higher level content-oriented rules
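The remaining integrity constraints listed above (primary key existence/uniqueness and null value constraints) can be checked the same way; a brief sketch with hypothetical column names:

```python
def primary_key_violations(rows, key_cols):
    """A primary key must exist (no nulls) and be unique across the table."""
    seen, violations = set(), []
    for row in rows:
        key = tuple(row.get(col) for col in key_cols)
        if any(v is None for v in key) or key in seen:
            violations.append(row)
        seen.add(key)
    return violations

def null_value_violations(rows, col):
    """Null-value constraint: the named column may never be missing or empty."""
    return [row for row in rows if row.get(col) in (None, "")]

customers = [
    {"cust_id": 1, "name": "Acme"},
    {"cust_id": 1, "name": "Apex"},   # duplicate key
    {"cust_id": None, "name": ""},    # missing key, empty name
]
print(primary_key_violations(customers, ["cust_id"]))
print(null_value_violations(customers, "name"))
```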
Cost of Low Data Quality • Data quality is often measured only through anecdotes • A “hazy” feeling that something is wrong • We want to gauge the true cost of poor data quality
Evidence of Economic Impact • Frequent service interruptions and system failures • Drop in productivity vs. volume • High employee turnover • High new business/continued business ratio • Increased customer service requirements • Customer Attrition
The Data Quality Scorecard • Use scorecard as a tool to manage the corporate information asset • Precise methods to measure level of data quality • Evaluate the costs and impacts associated with low data quality • Build a ROI model
Building and Using the Data Quality Scorecard • Map the flow of information • Find the critical points of pain • Locate the origin of the problems • Identify the impacts • Calculate the cost • Identify targets for improvement
Map the Flow of Information • Data processing can be likened to an “information factory” • Data sets from multiple sources are used as “raw input” • Final products are created in the form of business processes, information products, strategic reports, etc.
Stages in the Information Map • Data Supply • Data Acquisition • Data Creation • Data Processing • Data Packaging • Decision Making • Decision Implementation • Data Delivery • Data Consumption
Directed Information Channels • Indicates the flow of information from one processing stage to another • Example: supplier data is delivered to an acquisition stage through an information channel • Directed indicates the direction in which data flows • This effectively maps all points at which a data fault or nonconformance may appear
Find the Critical Points of Pain • Look for evidence of impact • Frequent system failures • Drops in productivity • High employee turnover • Increased customer service requirements • Inability to scale • Decreased margins • Customer attrition
Customer Interviews • Ask about potential for data errors associated with: • Billing • Customer service • Attrition • Recommendations • Ask about customer perception of the organization
Employee Interviews • Look for instances where low data quality affects smooth operation • Seek out scrap and rework: • Where do data problems affect ability to do job? • How often must processes be rerun due to data problems? • How much time is spent fixing data problems? • How does error correction scale within organization? • What keeps employees from being able to get their job done successfully and on time?
Preliminary Expectations • Too early in the process for formal definition of rules • Not too early for a “gross-level” statement of expectations • Example: All addresses must contain street name, city, state, and ZIP code
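A gross-level expectation like the address rule can be written as a simple predicate long before formal rules are defined. A sketch in Python; the field names and ZIP pattern are assumptions:

```python
import re

REQUIRED_FIELDS = ("street", "city", "state", "zip")   # assumed field names

def address_conforms(record):
    """Gross-level expectation: an address carries a street name, city, state,
    and a 5-digit (optionally ZIP+4) ZIP code."""
    if any(not record.get(field) for field in REQUIRED_FIELDS):
        return False
    return re.fullmatch(r"\d{5}(-\d{4})?", record["zip"]) is not None

print(address_conforms({"street": "123 Main St", "city": "Springfield",
                        "state": "IL", "zip": "62701"}))    # True
print(address_conforms({"street": "456 Oak Ave", "city": "Dayton",
                        "state": "OH", "zip": ""}))         # False
```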
Initial Assessment • Simple tests for non-conformance to expectations • Highest level assessment may be done using sampling • Get a gross-level score for conformance • Define rules • Test • Measure • Score
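One way to turn the define/test/measure/score loop into a number is to draw a sample and report, per rule, the fraction of records that conform. A sketch; the rules, field names, and sample size are illustrative:

```python
import random

def conformance_score(records, rules, sample_size=1000, seed=0):
    """Draw a sample and report, per rule, the fraction of sampled records
    that conform -- a gross-level data quality score."""
    random.seed(seed)
    sample = random.sample(records, min(sample_size, len(records)))
    return {name: sum(1 for r in sample if rule(r)) / len(sample)
            for name, rule in rules.items()}

# Illustrative gross-level rules; field names are assumptions.
rules = {
    "has ZIP code": lambda r: bool(r.get("zip")),
    "state is two letters": lambda r: len(r.get("state", "")) == 2,
}
records = [{"state": "IL", "zip": "62701"}, {"state": "Ohio", "zip": ""}]
print(conformance_score(records, rules))
# {'has ZIP code': 0.5, 'state is two letters': 0.5}
```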
Isolate Flawed Data • Where are data problems recognized? • Who finds the problem data? • Customers? • Call center personnel? • Internal knowledge workers? • What are the workers’ rewards for finding bad data?
Trace Back to Origin of Fault • Follow the path of information from its insertion into the information flow through to its exit points • Trace backward through the information chain to find the point at which the information became flawed
Identify the Impacts • The quality of information at any consumption stage can affect any of these variables: • Increase in Profit • Decrease in Profit • Cost increase • Cost decrease • Delay • Speedup • Increased satisfaction • Decreased satisfaction
Soft Impacts • Difficulty in decision-making • Time delays in operation • Organizational mistrust • Lowered ability to compete • Data ownership conflicts • Lowered customer satisfaction • Lowered employee satisfaction
Hard Impacts • Customer attrition • Scrap and rework • Error prevention • Increased customer service costs • Costs associated with fixing customer problems • Spin control • Loss of equity value • Enterprisewide data inconsistency
Calculate the Costs • Some impacts are easily tied to hard $$$ • Some impacts are hard to characterize exactly, but are clearly felt • Assign some cost to each problem
Cost Categories • Detection • Correction • Rollback • Rework • Prevention • Warranty • Reduction • Attrition • Blockading
Information Chains and Data Flows • Multiple impacts may be attributed to the same data problem • Tracing problems back through the information chain provides insight into overall impact of poor data quality
The Assessment Matrix • Axis 1: • Log each data quality problem • Axis 2: • Specify activities associated with each problem • Axis 3: • Impact areas for each activity • Each cell contains the estimated cost associated with the impact
Aggregate Costs, Build the Model • Superimpose matrix onto spreadsheet • Tally and summarize across the model • Use the spreadsheet as a simulation model
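A minimal sketch of the matrix and the tally as a nested mapping in Python; the problems, activities, impact categories, and dollar figures are all hypothetical:

```python
from collections import defaultdict

# Hypothetical assessment matrix: problem -> activity -> impact area -> estimated cost ($).
matrix = {
    "duplicate customer records": {
        "billing": {"rework": 12_000, "customer service": 8_000},
        "marketing mailings": {"scrap": 5_000},
    },
    "invalid ZIP codes": {
        "shipping": {"rework": 9_500, "detection": 3_000},
    },
}

def tally(matrix):
    """Aggregate estimated costs per data quality problem and per impact area."""
    by_problem, by_impact = defaultdict(float), defaultdict(float)
    for problem, activities in matrix.items():
        for impacts in activities.values():
            for impact, cost in impacts.items():
                by_problem[problem] += cost
                by_impact[impact] += cost
    return dict(by_problem), dict(by_impact)

by_problem, by_impact = tally(matrix)
print(by_problem)   # total estimated cost attributed to each problem
print(by_impact)    # total estimated cost in each impact/cost category
```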
Putting it Together • Map the information chain • Conduct interviews to locate data quality problems • Annotate information chain with location of data quality problems • Identify impact domains for each problem • Characterize economic impact (=cost!) • Aggregate totals
The Information Chain • Represented as a directed graph • Vertices are processing stages • Edges are directed channels • At each intersection point, we manage a collection of objects representing data objects passing through that intersection
Information Chain, con’t. • Each intersection data object contains: • The model of the data passing through the intersection • A set of rules describing validity for those data • The named reference objects that are related to the data passing through • At each point in the information chain, we can measure conformance to our data quality validation criteria
Information Chain, 3 • We can model the information chain in a database • Table for vertices • Table for edges • Table for data objects • Table for rules • Table for reference objects • See the book for details
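One possible rendering of that schema using sqlite3; the column choices are assumptions, and the book's version may differ:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE vertices (            -- processing stages in the information chain
    vertex_id  INTEGER PRIMARY KEY,
    stage_name TEXT NOT NULL                   -- e.g. 'Data Acquisition'
);
CREATE TABLE edges (               -- directed information channels
    edge_id    INTEGER PRIMARY KEY,
    source_id  INTEGER NOT NULL REFERENCES vertices(vertex_id),
    target_id  INTEGER NOT NULL REFERENCES vertices(vertex_id)
);
CREATE TABLE data_objects (        -- data passing through an intersection point
    object_id  INTEGER PRIMARY KEY,
    edge_id    INTEGER NOT NULL REFERENCES edges(edge_id),
    data_model TEXT                            -- model of the data at this point
);
CREATE TABLE rules (               -- validity rules for those data objects
    rule_id    INTEGER PRIMARY KEY,
    object_id  INTEGER NOT NULL REFERENCES data_objects(object_id),
    expression TEXT NOT NULL
);
CREATE TABLE reference_objects (   -- named reference objects related to the data
    ref_id     INTEGER PRIMARY KEY,
    object_id  INTEGER NOT NULL REFERENCES data_objects(object_id),
    name       TEXT NOT NULL
);
""")
```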
Data Quality: Using the Tools • Example data: hierarchical department-name data • Sources: two data sources • Goal: if we wanted to keep only one data source, how would that impact the other?
Example, con’t. • Goals: • Determine the overlap between source A and source B • Determine what is not in the intersection of the two data sources • Look for duplicates that are and are not exact matches
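A sketch of those three checks over two small sets of department names; the normalization rule is an assumption, and real duplicate matching would likely use fuzzier comparison:

```python
import re

ABBREVIATIONS = {"dept": "department"}          # assumed expansion rule

def normalize(name):
    """Crude canonical form so near-duplicates compare equal."""
    tokens = re.sub(r"[^\w\s]", "", name.lower()).split()
    return " ".join(ABBREVIATIONS.get(t, t) for t in tokens)

source_a = {"Dept. of Finance", "Human Resources", "Information Technology"}
source_b = {"DEPARTMENT OF FINANCE", "Human Resources", "Legal"}

exact_overlap = source_a & source_b                         # exact matches only
norm_a = {normalize(n): n for n in source_a}
norm_b = {normalize(n): n for n in source_b}
approx_overlap = {(norm_a[k], norm_b[k]) for k in norm_a.keys() & norm_b.keys()}
only_in_a = {n for n in source_a if normalize(n) not in norm_b}
only_in_b = {n for n in source_b if normalize(n) not in norm_a}

print(exact_overlap)          # {'Human Resources'}
print(approx_overlap)         # includes the non-exact 'Finance' match
print(only_in_a, only_in_b)   # what is not in the intersection
```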