This document outlines essential aspects of data quality and governance, emphasizing the internal data project review and the role of database analysts. It covers various types of data, including transactional and external data sources, as well as criteria for good data quality such as accuracy, consistency, and accessibility. The document also details ETL activities, data cleansing processes, and the responsibilities of high-level committees in data governance. Additionally, it questions the necessity of ETL in the era of big data and discusses how data management practices adapt to emerging technologies.
Agenda 03/27/2014
• Review first test.
• Discuss internal data project.
• Review characteristics of data quality.
• Types of data.
• Data quality.
• Data governance.
• Define ETL activities.
• Discuss database analyst/programmer responsibilities for data evaluation.
Discussed in prior classes...
• Lots of data.
• Traditional transaction processing systems.
• Non-traditional data
  • Call center; click-stream; loyalty card; warranty cards/product registration information; email, Twitter, Facebook.
• External data from government and commercial entities.
• General classification of data
  • Transaction data
  • Referential data/master data
  • Metadata
Data quality
What is good quality data?
• Correct
• Accurate
• Consistent
• Complete
• Available
• Accessible
• Timely
• Relevant
How does data “go bad”? Does all “bad” data have to be fixed?
Data governance
• Policies, processes and procedures aimed at managing the data in an organization.
• Usually high-level, cross-department committees that oversee data management across the organization.
• Responsible for defining what data is necessary to gather.
• Responsible for defining the sources and stores of data.
• Responsible for security policies, processes and procedures.
• Responsible for creating the policies, processes and procedures.
• Responsible for assigning accountability for data problems.
• Responsible for enforcing policies.
Data quality in data warehouses
• Is it more important than data quality in source transaction and reference data?
• How is better quality data achieved?
  • Automated ETL processes to populate the data warehouse.
  • Programmatic spot checking.
Populating the data warehouse
• Extract
  • Take data from source systems.
  • May require middleware to gather all necessary data.
• Transformation
  • Put data into a consistent format and content.
  • Validate data – check for accuracy and consistency using pre-defined, agreed-upon business rules.
  • Convert data as necessary.
• Load
  • Use a batch (bulk) update operation that keeps track of what is loaded, where, when and how.
  • Keep a detailed load log to audit updates to the data warehouse.
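The extract, transformation and load steps above can be sketched in a few lines of Python. This is a minimal illustration only: the field names, the `VALID_STATES` business rule, and the in-memory "warehouse" are assumptions made for the example, not part of the slides.

```python
# Minimal ETL sketch. VALID_STATES is a hypothetical agreed-upon business rule.
import csv
import datetime

VALID_STATES = {"NY", "CA", "TX"}  # illustrative validation rule

def extract(source):
    """Extract: pull rows from a source system (any CSV file object)."""
    return list(csv.DictReader(source))

def transform(rows):
    """Transform: enforce a consistent format, then validate each row."""
    clean, rejected = [], []
    for row in rows:
        row["state"] = row["state"].strip().upper()   # consistent format
        (clean if row["state"] in VALID_STATES else rejected).append(row)
    return clean, rejected

def load(rows, warehouse, load_log):
    """Load: bulk-append rows and record what was loaded and when."""
    warehouse.extend(rows)
    load_log.append({"when": datetime.datetime.now().isoformat(),
                     "rows_loaded": len(rows)})
```

Keeping the rejected rows and the load log separate is what makes the warehouse load auditable, as the slide's "detailed load log" bullet suggests.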
Data Cleansing
• Source systems contain "dirty data" that must be cleansed.
• ETL software offers data cleansing capabilities ranging from rudimentary to very sophisticated.
• Industry-specific data cleansing software is often used; it is especially important for performing name and address correction.
• Leading data cleansing vendors include general hardware/software vendors such as IBM, Oracle, SAP and Microsoft, and specialty vendors such as Informatica, Information Builders (DataMigrator), Harte-Hanks (Trillium), CloverETL, Talend, and BusinessObjects (SAP AG).
Steps in data cleansing
• Parsing
• Correcting
• Standardizing
• Matching
• Consolidating
Parsing
• Parsing locates and identifies individual data elements in the source files and then isolates these data elements in the target files.
• Examples include parsing the first, middle, and last name; street number and street name; and city and state.
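The parsing step can be sketched as below. The split and regex rules here are simplifying assumptions; commercial parsers handle far more name and address variants.

```python
# Parsing sketch: isolate individual data elements from free-text fields.
import re

def parse_name(full_name):
    """Isolate first, middle and last name from a free-text name field."""
    parts = full_name.split()
    return {"first": parts[0],
            "middle": " ".join(parts[1:-1]),
            "last": parts[-1]}

def parse_street(address):
    """Isolate street number and street name from an address line."""
    m = re.match(r"\s*(\d+)\s+(.+?)\s*$", address)
    return {"number": m.group(1), "name": m.group(2)} if m else None
```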
Correcting
• Corrects parsed individual data elements using sophisticated data algorithms and secondary data sources.
Standardizing
• Standardizing applies conversion routines to transform data into its preferred (and consistent) format using both standard and custom business rules.
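A conversion routine of the kind described above can be as simple as a lookup table plus a fallback rule. The mappings below are illustrative assumptions, not an official standardization list.

```python
# Standardizing sketch: convert values to a preferred, consistent format.
STATE_RULES = {"new york": "NY", "ny": "NY", "calif": "CA", "california": "CA"}

def standardize_state(value):
    """Apply custom rules first, then a standard uppercase fallback."""
    key = value.strip().lower().replace(".", "")
    return STATE_RULES.get(key, value.strip().upper())
```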
Matching
• Matching searches for and matches records within and across the parsed, corrected and standardized data, based on predefined business rules, to eliminate duplicates.
Consolidating
• Analyzing and identifying relationships between matched records and consolidating/merging them into ONE representation.
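The matching and consolidating steps can be sketched together. The match rule used here (same normalized name plus account number) is a hypothetical business rule chosen for illustration, and "most complete value wins" is one of several possible merge policies.

```python
# Matching + consolidating sketch: group records by a business-rule key,
# then merge each group into ONE representation.
from collections import defaultdict

def match_key(record):
    """Predefined business rule: match on normalized name + account number."""
    return (record["name"].strip().lower(), record["account"])

def consolidate(records):
    """Merge matched records, keeping the first non-empty value per field."""
    groups = defaultdict(list)
    for rec in records:
        groups[match_key(rec)].append(rec)
    merged = []
    for recs in groups.values():
        combined = {}
        for rec in recs:
            for field, value in rec.items():
                if value and not combined.get(field):
                    combined[field] = value
        merged.append(combined)
    return merged
```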
Source system view – three apparently separate clients, each known by a different identifier: Account# 1238891, Policy No. ME309451-2, Transaction B498/97.
The reality – ONE client behind all three identifiers.
Consolidating whole groups – e.g., William Lewis, Beth Parker, Karen Parker-Lewis, and William Parker-Lewis, Jr.
ETL Products
• SQL Server 2012 Integration Services from Microsoft
• PowerMart/PowerCenter/PowerExchange from Informatica
• Warehouse Builder from Oracle
• Teradata Warehouse Builder from Teradata
• DataMigrator from Information Builders
• SAS System from SAS Institute
• Connectivity Solutions from OpenText
• Ab Initio
ETL Goal: Data is complete, accurate, consistent, and in conformance with the business rules of the organization.
Questions:
• Is ETL really necessary?
• Has the advent of big data changed our need for ETL?
• ETL vs. ELT?
• Does the use of Hadoop eliminate the need for ETL software?
• Does it matter if the data is stored in the "cloud"?