Data Warehousing 1. Introduction and Overview
Reading • Ponniah (2001), Data Warehousing Fundamentals. • Simsion (2001), Data Modelling Essentials • Date (2004) Introduction to Database Systems
The compelling need for DW • The need for strategic information • Information crisis in organisations • Operational vs informational systems • Why past attempts at strategic information failed • Data Warehousing as the viable solution
Brief history • 1960s – focus on business operations • Inventory, billing, ledger, payroll, stock, accounts, insurance claims... Etc. • These systems became commodified • Hundreds of applications needed to support business operations • Characteristics – they gather, store and process data for daily operations
Enter the 90s • Business grew more complex and distributed • Globalization • Competition fierce • The age of Strategy • Executives wanted different information
Informational requirements • Retail • Customer loyalty • Market planning • Market segmentation • Financial • Risk management • Fraud detection • Airlines • Route profitability
Data warehouse as a paradigm • Who needs strategic information? • What do we mean by strategic information? • Can businesses maintain a strategic advantage? • How can information support the business objectives?
Business objectives !(different to business rules) • High quality data required to help execs formulate objectives. Eg. • Retain present customer base • Increase cust base by 15% over next 5 years • Gain market share by 10% over next 3 years • Organisations have lots of data across various functions within the business • New web based environments, older legacy environments
So, what is a DW? • A database designed to meet strategic, decision support requirements • Subject oriented (eg. Sales, Markets) • Integrated (various different RDBMS sources) • Time-dependant (historical, going back many years) • Non-volatile (no updates or deletes, DW’s are “loaded”) • Will elaborate on these more later.
Failures of the past • Management typically makes a request (strategic data) • Source of information for the request is spread across various platforms and systems • Gathering this information is ad-hoc, expensive, time consuming and tricky. • Reports are not “exactly” what was required, more ad hoc mods to improve the data • Rough attempts to standardise the process • Requirements not fully understood
Historical approaches • Ad hoc report based approach • Specialised extraction programs • Small applications • DSS • Executive information systems • Data stores (get the operational data out of the operational environment) • IT Dept overloaded, requests too numerous, users get into a merry go round asking for more and more supplementary reports, users depend on IT for support, information not entirely suitable for strategic decision making.
Data Warehousing – the solution • Database designed specifically for analytical tasks • Data from multiple environments • Easy to use, conducive to long interactive sessions by users • Read intensive data usage • Direct interaction with system without IT dept involvement • Content updated periodically and stable • Content current and historical • Users run their own queries (point and click) • User initiate their own reports
Processing requirements • Simple queries and reports against current and historical data – for eg. Total sales, sales across the regions for financial periods etc. • “What if” analysis • Drill down query analysis • Trends and future predictive analysis
An environment – not a product • Data warehouse is a collection of tools and techniques • Data extraction, processing, cleansing, scrubbing, transformation and loading • Warehouse DB a specialised DBMS installation • OLAP, OnLine Analytical Processing / DSS – front end that is 100% user driven • Provides ability to creatively discover new and interesting information to complex and unpredictable questions • A blend of many different technologies
The building blocks • Subject oriented • Integrated • Time dependant • Nonvolatile
Subject Oriented • Data stored by subject rather than business function. • This means data could be sourced from various different “cross-functional” boundaries throughout the organisation • For example “CLAIMS” could be a critical subject area for the Insurance industry. • “SALES” for the retail industry • The DB is not tied to the application as is usually the case in Operational or transactional databases.
Intergrated • This refers to pulling together data from various areas within the organisation. • Part of the challenge is standardising data that may be represented (duplicated) across systems but stored in subtly different ways. • Big challenge to remove these inconsistencies (we talk about this in a couple of weeks – data quality). • Naming conventions • Data type standardisation • Attributes • Measurements
Time variance • Operational systems typically focus on current or near current data • Data warehouses typically store data going back many many years – even decades. • Data is stored as a snapshot over past and more recent periods. • This aspect of DW is significant in the design and implementation phases • Allows for analysis of the past, relates to the present and enables forecasting for the future
Nonvolatile data • In operational and transactional systems data is volatile – update, insert and deletes are occurring all the time • Data from operational systems is moved into the data warehouse at specific intervals (weekly, daily, monthly etc.) this load cycle is determined during the analysis phase. • Business transactions do not update the warehouse real time. • As such data in data warehouses is not updated or deleted.
Granularity • Operational systems store data at the lowest level of granularity • Individual sales, individual items. • If summary data is required, we add up the sales, items etc. • DW keeps summarised data at different levels. • City, region, individual store • Decisions (at analysis) phase on what level of granularity to store – lowest level requires a lot of storage space and increasing processing requirements
Data Mart • Sometimes seen as a mini DW • Usually departmental rather than enterprise wide • More specific subject matter • Quicker (cheaper) to develop • Sometimes used as a proof of concept • Many Data Marts can be combined to produce a single DW.
Approaches • Top Down • Corporate/Enterprise view • Architected • Centralised rules and control • Single centralised storage of data • Can take longer to build • High exposure / risk • High level cross functional skills required • Expensive
Approaches • Bottom up • Proof of concept (Data Mart) • Faster, more manageable • ROI clearer • Less risk • Incremental • Narrower view of data • Redundancy across multiple D-marts • Associated issues with above...
A Practical Approach • Plan and define requirements at the corporate level • Create a surrounding architecture • Conform and standardise the data content • Implement warehouse as a series of supermarts, one at a time. • Supermart – architected datamart with the view of becoming part of the corporate warehouse (enterprise focus)
Overview of Components • Data • Staging • Storage • Information Delivery • Metadata • Management and Control
Data • Source Data (generally four categories) • Production Data • From txn / operational RDBMS • Problem is disparity across prod data – standardising, transforming , converting and integrating is a significant job • Internal Data • Could be in spreadsheets, documents, profiles etc. generally “private” ie used by individuals for specific jobs • Archived (historical) • Different methods for archiving data exist and these change over the years, could be on tapes, flat file format (CSV), microfilm, could be off site. Newer data could be online or external drives • External • 3rd party – for example mapping and cadastral data - often won’t conform to org’s data format or standards
Data Staging • Consists of three processes • Data extraction • Have to deal with multiple sources as we have seen • Some legacy system data, perhaps flat files, hierarchical, spreadsheets and third party data sets • Data is extracted and moved to a staging area where it can be processed further • Data Transformation • Cleaning up data from each data source, getting rid of data not required • Combining some data depending on “granularity” required and standards that have been created • When complete, you have a collection of integrated data that is standardised and summarised • Data Loading • Initial load of the warehouse • Then incremental loads based on cycle (daily, weekly etc.)
Storage • The Data Warehouse is a totally separate repository • Usually an RDBMS environment • Warehouse holds very large volumes of historical data for analysis • Structures are subtly different to relational structures in that they are suitable for intensive analysis • Read only • Many warehouse’s employ MDDBMs (multidimensional) data is aggregated across many dimensions and summary (derived) data is also held
Comparisons Transactional DBs • Current, real time data • Atomic detail • No summarisation • Normalised • Update and Insert txns • High volume update, speed critical • Query activity usually low and simple • Low – med data volumes Data warehouse • Historic, time oriented • Summarised • Aggregation / granularity • Not always normalised, sometimes multidimensional • Query oriented – no updates (data is loaded periodically) • Retrieval critical • Query activity high, complexity very high • Very high volumes of data
Information Delivery • Interfaces to provide • Ad hoc reporting • Preset queries and pre-summarised popular data sets • Ability for very complex queries • MD analysis • Statistical • EIS and Data Mining/Knowledge discovery functions • Sometimes called OLAP (online analytical processing)
Metadata • Similar to data dictionary / catalog • Operational metadata • Extract and transform metadata • End-user metadata • Provides information about content and structure of the warehouse • Opens the door for users regarding the content making it useable and recognizable
Management and Control • DBA central ! • All the control activities and software that the specialised warehouse DBA uses to monitor and coordinate activities related to the warehouse, extract, transform and load, manage metadata, summary tables, the MDDBMs etc.
A reminder : why operational systems are not suitable for providing strategic information • Complex ad-hoc query often required • Users wish to interact (mess about with) the data based on previous requests • This compromises the purpose and function of operational database environments. • Logical structure of the operational environment not conducive to complex analysis required • Data requirements are not complete (pools of information problem)
Summary... • DW is not a product. An environment • Data from multiple application environments • Provides and supports business intelligence • Total integrated view of the enterprise • Does not hinder operational environments • Ensures quality consistency in organisations data/information
Technology Summary • Data from many and varied operational environments – the basic business processes • Extraction, cleansing, aggregation and loading – Data Transformation (ETL) • Key business measurements are determined, these are the “Business Dimensions” (locations, products, time/date, customer) • These are modelled against “Facts” (sales) • Data is loaded periodically via ETL into the warehouse.
Supporting Technologies • OLAP, Online analytical processing • Multidimensional (cube) data modelling technologies • Data mining • Metadata management • Report and query facilities • Information delivery / web based presentation toolsets
Trends • Data warehouse is becoming mainstream • Driven by fierce competition and deregulation • Goal is to get some strategic advantage through the use of knowledge • Imperative for customised marketing • The vendors have largely got their act together (SAP, SAS, Oracle, Business Objects, IBM, Peoplesoft, Baan etc._ • Support for multiple data types
Trends cont. • Spatial data integration • Data visualisation • Parallel processing and cluster technologies • Support for very large data volumes (VLDB) • Syndicated data and business intelligence (data enrichment) • System and supply chain integration software • ERP, SCM, CRM
ERP • Integrates activities such as finance, accounting; administration; operations planning,execution procurement, human resources, product development, inventory management, research & development. • Serves the needs of everyone in the organization from finance to the (physical) warehouse. • Runs from a centralized database • Overcomes the “silos of information” effect. • Supports DW
SCM • Connecting Demand with Supply • Deals with the management of information related to the flow of material along the entire production cycle. • Cycle begins with procurement of raw material > production for finished good > ends with the delivery of the finished goods • Typical activities in SCM include: logistics; distribution planning; demand planning, forecasting; and warehouse/inventory management
CRM • Designed to increase customer intimacy and loyalty. • Multiple avenues are provided for feedback from the customers including: phone, fax, email, and web-forms. All such information is stored in a central repository that can easily be accessed by everybody in the company using simple web applications. • Typical functions performed in this area include: sales channel management; marketing automation; customer data management; and service personalization.