400 likes | 412 Vues
This lecture provides an overview of data warehousing concepts and architectures, including the data warehouse, data marts, and OLAP technology. Topics covered include DW architecture, DSS, OLTP, OLAP, MDM, ROLAP, MOLAP, and data mining.
E N D
Lecture 1 - Introduction to DW Reading Requirements: [Elmasri&Navathe1999] chapter26 or [Connolly&Begg2002] chapter30 or [Connolly&Begg1998] chapter25 [Chaudhuri&Dayal] ”An Overview of Data Warehousing and OLAP Technology” [Kimball&et.al98] Introduction, chapter1 Keywords: DW architecture, DSS, OLTP, OLAP, MDM, ROLAP, MOLAP, Data Mart • Erik Perjons • DSV, KTH/SU • 08-16 49 47 • 2002-03-11
The data warehousing architechture The back room The front room Analysis/OLAP Data warehouse External sources Extract Transform Load Refresh Serve Query/Reporting Operational DBs Data marts Data mining Falö aöldf flaöd aklöd falö alksdf Front end tools End user applications Back end tools Data staging area ”The data warehouse” Presentation servers Data sources Legacy systems
Why a data warehouse? DBMS DB End user application A DBMS DB End user application B DBMS DB End user application C DBMS DB
A data warehouse definition William Inmon: A data warehouse ”is a subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management decision-making process”.
“Subject-oriented” Operational Systems Informational Systems Sales System Customer Data Vendor Data Payroll System Purchasing System Product Data
“Integrated” Operational Systems Informational Systems Marketing System Customer Data Order System Billing System
“Time-variant” Operational Systems Informational Systems Customer Data Order System 60-90 days 5-10 years
“Non-volatile” Operational Systems Create Informational Systems Update Delete Order System Load Access Customer Data Insert
Another definition Chaudhiri & Dayal: ”Data warehousing is a collection of decision support technologies, aimed at enabling the knowledge worker (executive, manager, analyst) to make better and faster decisions.” A data warehouse is a decision support system (DSS) according to Chaudhri&Dayal and Kimball
Typical DSS queries • How much is the total sale by month for each office? • Which of two new medications will result in the best best outcome: higher recovery rate or shorter hospitality rate? • How much is the percentage change in the total monthly sales for each product? • Which customer groups are most profitable?
Goals of the data warehouse (Kimball,p. 10) • Foundation for decision making - the right data in it to support decision making Understandable - content are correctly labeled and obvious • Navigable - recognising destination on screen and getting there in one click • Fast performance - means zero wait time • Consistent information - high quality data • Designed for continuous change - handle changes • Secure bastion that protect our information asset ”Anything else is a compromise and therefore something we must improve” Kimball
The data warehousing architechture Analysis/OLAP Data warehouse External sources Extract Transform Load Refresh Serve Query/Reporting Operational DBs Data marts Data mining Falö aöldf flaöd aklöd falö alksdf End user applications Data staging area Presentation servers Source systems
Source systems characteristics Operational DBs • the source data often in OLTP (Online Transaction Processing) systems • high level of transaction throughput • already occupied by the normal operations of the organisation • a OLTP system may be reliable and consistent, but there are often inconsistencies between different OLTP systems • different types of data format and data structures in different OLTP systems OLTP vs. DSS (Decision Support Systems)
The data staging area Extract Transform Load Refresh Often the most complex part in the architecture, and involves... • Extraction • Transformation • Load/refresh and indexing Implement script for extraction, transformation, load and refresh
The data staging area Extract Transform Load Refresh Transformation involves… • data conversion • data cleaning • data scrubbing (use domain-specific knowledge (e.g postal adresses) to check the data) • data migration (specify transformation rules, “replace the string gender by sex”) • data auditing (discover suspicious pattern, discover violation of stated rules) • data aggregation • data enrichment
What is OLAP? • Acronym for On-line analytical processing • A decision support system (DSS) which goal is to support ad-hoc querying for the business analyst • The idea is to allow the users to easy and quickly manipulate and visualise the data through multidimensional views. • Multidimensional view of data is the foundation for OLAP
A data cube: office 130 office 130 2 300 quarter 2 300 5 024 200 product 5 024 quarter product 200 product “Multidimensional” view of the data - a popular conceptual model that influenced front-end tools, database design, and the query engine for OLAP - numeric measures/facts (e.g. number of, sum, total sales) depends on a set of dimensions Spreadsheets:
office office office quarter quarter quarter product product product “Multidimensional” view of the data promotion campaign office quarter product customer group
“Multidimensional” view of the data Promotion campaign Quarter Measures/facts Customer group Promotion campaign Office
Database schema integration Platform independent logical star join schema Platform independent information model for the organisation customer name customer nr Platform independent logical ER schema customer Platform specific logical star join schema created in SQL server tool or use SQL DDL direct Platform specific logical ER schema created in SQL server tool or use SQL DDL direct customer customer A data warehouse system Source system DB2 OLAP Integrated Server DB2 OLAP Server (MOLAP) SQL DB Server Meta data repository - logical Starschema för DB 2 OLAP (PSM) Meta data repository - logical ER-schema for SQL Server Nisse Nisse DB2 OLAP Fysisk lagring i arrayer Data source customer Data source customer
Dimensional modelling - Star-join schema Service used Time Telephone calls Sales Dimension Customer - date - month - quarter - year - service name - service group - sum ($) - number of calls - customer name - address - region - income group - seller name - office
Dimensional modelling - Star-join schemas Service Dimension Time Dimension Fact table - Transactions Number Sum of calls C210 S1 F11 991011 25:00 3 C210 S3 F11 991011 05:00 1 C212 S2 F13 991011 89:00 1 C213 S1 F13 991011 12:00 1 C214 S4 F13 991012 08:00 1 Customer Dimension Sales Dimension
Service Dimension Time Dimension Fact table - Transactions Number Sum of calls C210 S1 F11 991011 25:00 3 C210 S3 F11 991011 05:00 1 C212 S2 F13 991011 89:00 1 C213 S1 F13 991011 12:00 1 C214 S4 F13 991012 08:00 1 Customer Dimension Sales Dimension Dimensional modelling - Star-join schemas Query: For how much did customers in Sthlm use service “Local call” in october 1999? S=37:00
The data warehouse bus Orders Production Dimensions Time Sales Rep Customer Promotion Product Plant Distr. Center
The data mart “A logical subset of a complete data warehouse” “A data warehouse is made up of the union of all its data marts” “A data mart must be built from comformed dimensions and conformed facts” “…as the restriction of the dw to a single business process or to a group of related business processes...” (Kimball, p. 18) • departemental subsets (e.g. marketing data mart) • far smaller data volumes, fewer data sources • easier data cleaning process, faster roll-out • allows a “piecemeal” approach to some of the enormous integration problems involved in creating an enterprise wide data model, but complex integration in the long term
Snow-flake schema Service used Time Telephone calls Customer Service group Income group Region Year Month - service name - date Quarter - sum ($) - number of calls Sales Dimension - customer name - address - seller name Office
Hierarchies Year Service group Quarter Service Month Date
Some important terminology • dimensions - a business perspective from which data is looked upon, dimensions has often hierarchical structured attribute (day-month-quarter) • hierarchies - impose structure on some dimensions (the attributes address and region in the sales dimension) • de-normalisation - the abandonment of the rules of normalisation and allowance of repetitive storage of data • granularity - the level of detail of data contained in the data warehouse • aggregation - a summation of detailed data to a less detailed level
The presentation servers Data warehouse OLAP servers • standard relational database management system (RDBMS) with specialised SQL servers • extended relational database management system (RDBMS),called Relational OLAP (ROLAP) servers • multidimensional database management system (MDBMS), called Multidimensional OLAP (MOLAP) servers • hybrid of ROLAP and MOLAP, called Hybrid OLAP (HOLAP) Data marts
Alternative implementations of a DW OLAP servers • Extended Relational DBMS (ROLAP servers) • ROLAP server sit between relational back end server • data stored in RDB • star-join schemas • support SQL extensions • index structures • Multidimensional DBMS (MOLAP servers) • data stored in arrays (n-dimensional array) • direct access to array data structure • excellent indexing properties • poor storage utilisation, especially when the data is sparse. Data warehouse Data marts
More about presentation servers • Index structures (bit map indices, join indices) • SQL extensions (operators like Cube, Crossjoin) • Materialised views
The data warehousing architechture End user applications Data staging area Presentation servers Source systems Monitoring & Administration Metadata repository OLAP servers Analysis Data warehouse External sources Extract Transform Load Refresh Query/Reporting Serve Operational DBs Data mining Falö aöldf flaöd aklöd falö alksdf Data marts
Metadata Data about data • Administrative metadata (includes all information necessary for setting up and using a DW, e.g. Information about source databases, dw schemas, dimensions, hierachies, predefined queries, physical organisation, rules and script for extraction, transformation and load, back-end and front end tools) • Business metadata (business terms and definitions, ownership of data) • Operational metadata (information collected during the operations of the DW, e. g. usage statistics, error reports)
End user applications Analysis • OLAP tools • Query/Reporting tools • Data mining Query/Reporting Data mining Falö aöldf flaöd aklöd falö alksdf
Spreadsheet output of OLAP tool mounth quarter product product group office region
Functionalities of OLAP tools • Drill-down - decreasing the level of aggregation • Drill-up/Roll-up - increasing the level of aggregation • Drill-across - move between dimensions • Slicing and dicing - selection of rows and projection of columns • Pivoting - e.g. columns to rows, rows to columns • Ranking - sorting “Think of an OLAP data structure as a Rubik´s Cube of data that users can twist and twirl in different ways to work through what-if an what-happend scenarios” Lee Thé
What is data mining? Data Mining is data analysis in order to discover hidden correlations (pattern, rules) in huge data sets “Data Mining is the process of extracting previously unknown, valid and actionable information from large databases and then using the information to make crucial business decisions” Cabena, Hadjinian, Stadler, Verhees, Zanasi
Next lecture [Kimball98] chapter 5, 6, 14 Why dimensional modelling?, p.147-153 Kimball´s definition of a data warehouse, p. 19