
Data Warehousing and OLAP



  1. Data Warehousing and OLAP ENGI3675 Database Systems

  2. Introduction • Cow book chapter 25 • Data mining book chapter 4 • Useful to analyze historical data • Discover trends • Make more informed decisions • Decision support • Requires comprehensive databases • Data from several separate sources • Historical data • Data warehouse • Different from a regular SQL database • Almost no update operations • Lots of statistical operations (many not supported by standard SQL)

  3. Data Warehousing • A data warehouse is a massive database • Multidimensional • Millions of records, terabytes of data • Consolidates data from multiple smaller databases and sources • Contains historical data for long periods of time • But not necessarily immediately up-to-date data • Useful for statistical analysis and trend detection • Not used for operational day-to-day data management

  4. Data Warehousing • [Architecture diagram: External Sources and Operational databases feed Data Cleaning and Preprocessing, which loads the Data Warehouse; the warehouse serves OLAP (Online Analytical Processing), Data Mining, and Data Analysis, which in turn support Decision Support and User Services]

  5. Data Warehousing • Top-down creation • Design the data warehouse then collect data • Pro: minimizes integration problems • Con: resulting warehouse lacks flexibility • Bottom-up creation • Populate warehouse from existing sources • Pro: use all existing data • Con: requires data extraction, cleaning and transformation steps

  6. Metadata Repository • Databases have a catalog • An additional table that contains data about the other tables • Table name, number of entries, attribute names and order, integrity constraints, indexes, etc. • Useful for query optimization • Data warehouses have a metadata repository • Larger and more complex than a catalog • Needs additional information: original source of data, load date into the warehouse, cleaning algorithms used, etc.

  7. Data Cubes • Data in a warehouse is best modelled as a data cube • Multidimensional data model • A dimension is a “perspective” (attribute or entity) by which you can sort the data • Data is n-dimensional (not just 3D… the word “cube” is misleading!) • A normal database table is a 2D data cube

  8. Data Cubes (example) • We have a table storing sales values for categories of items in each quarter of the year (2D) • For one location and one year

  9. Data Cubes (example) • Now say we have multiple locations

  10. Data Cubes (example) • It’s a 3D data cube • [Figure: 3D cube of sales values over dimensions Quarter (1–4), Category (F, T, C), and Location (NY, TO, TB)]

  11. Data Cubes (example) • Add in years and it’s a 4D cube • [Figure: two 3D cubes, one per Year (2012, 2013), each over Quarter, Category, and Location]

  12. Data Cubes • We can generate cubes (or cuboids) for any subset of dimensions of the data • Taking a big-picture view, we have a lattice of cubes detailing some attributes and summarizing others • The cube displaying all n dimensions is the base cuboid • Lowest level of summarization • The cube displaying no dimensions is the apex cuboid • Highest level of summarization

  13. Data Cubes (example) • The lattice of cuboids, from apex to base:

    Apex cuboid (0-D): all
    1-D: quarter; category; location; year
    2-D: quarter, category; quarter, location; quarter, year; category, location; category, year; location, year
    3-D: quarter, category, location; quarter, category, year; quarter, location, year; category, location, year
    Base cuboid (4-D): quarter, category, location, year
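One SQL:1999 statement can materialize this whole lattice. A minimal sketch, assuming the cube is stored as a hypothetical fact table Sales(quarter, category, location, year, sale); in the output, a NULL in a column marks a dimension that has been summarized away:

    SELECT quarter, category, location, year, SUM(sale) AS total
    FROM Sales
    GROUP BY CUBE (quarter, category, location, year);  -- all 16 cuboids at once

GROUP BY ROLLUP or GROUPING SETS can be used instead to compute only a chosen path or subset of the lattice.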

  14. Data Model Schema • ER model appropriate for relational database • Data warehouse requires model that illustrates data topics and levels • Star schema • Snowflake schema • Fact constellation schema

  15. Data Model Schema • Star schema • Simplest, most common • One central fact table with the data • n dimension tables detailing the n dimensions • Only one table per dimension! • Introduces redundancy • Multiple tuples with the same attribute value duplicate that attribute • Normalization would move the repeated attribute to a separate table, but the one-table-per-dimension rule forbids this • Also prevents the creation of attribute value hierarchies

  16. Data Model Schema (example) • Star schema
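Since the diagram itself is not reproduced here, a minimal DDL sketch of the same idea, reusing the Sales/Products/Location names that appear on later slides (all column details are assumptions):

    -- One denormalized table per dimension; Products and Times are analogous
    CREATE TABLE Location (
        locid    INTEGER PRIMARY KEY,
        city     VARCHAR(40),
        province VARCHAR(40),  -- repeated for every city in a province:
        country  VARCHAR(40)   -- the redundancy noted on the previous slide
    );

    -- Central fact table: one row per sale, one foreign key per dimension
    CREATE TABLE Sales (
        pid    INTEGER,                            -- would reference Products(pid)
        locid  INTEGER REFERENCES Location(locid),
        timeid INTEGER,                            -- would reference Times(timeid)
        sale   DECIMAL(10,2)
    );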

  17. Data Model Schema • Snowflake schema • Star schema that allows normalization of dimension tables • Can be more than one table per dimension • Reduces redundancy • But at the cost of computation time for join • Therefore not popular

  18. Data Model Schema (example) • Snowflake schema

  19. Data Model Schema • Fact constellation schema • Larger data warehouses can have multiple independent fact tables (stars) that share a common dimension • Stars become connected in constellations • Also called “galaxy schema”

  20. Data Model Schema (example) • Fact constellation schema

  21. Data Model Schema (exercise) • A company uses an RFID system to track tools throughout several company buildings • Sensor in each room of each building reads and reports all RFID tags within range every second • They need a data warehouse to keep a history of the information • It is also necessary to keep track of departments • Each room is assigned to a department • Each department has a specialization • Each tool is assigned a specialization

  22. Data Model Schema (exercise)

  23. Concept Hierarchy • Attributes can be structured in concept hierarchies • Useful for building an order into the values • Useful for data mining algorithms • [Figure: location hierarchy: all → Canada, USA; Canada → Québec (Montreal, Victoriaville), Ontario (Thunder Bay, Toronto); USA → Illinois (Chicago, Urbana), New York (NYC)]

  24. Concept Hierarchy • Some attributes have a total order • Only one possible ordering of the levels from top to bottom • Creates a clean hierarchy, e.g. location: Country → Province → City → Street • Some attributes have only a partial order • Some levels are not comparable to each other • Creates a lattice, e.g. time: Year → Quarter → Month → Day, but also Year → Week → Day, since weeks do not align with months

  25. OLAP Operations • OnLine Analytical Processing • Set of tools for analysis of multidimensional data at various levels of granularity • Building blocks for data mining and decision support algorithms • Operations on data cubes & concept hierarchies • [Figure: the 3D sales cube (Quarter × Category × Location) from slide 10]

  26. OLAP Operations • Roll-up • Aggregating a measure over a dimension • Given the sales per city, we roll up into sales per province • Drill-down • Decomposing an aggregated measure over a dimension • Given sales per province, we drill down to sales per city
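In SQL, roll-up and drill-down are just re-aggregation at different levels of a concept hierarchy. A sketch, assuming the Location dimension table carries city and province columns:

    -- Sales per city (fine granularity)
    SELECT L.city, SUM(S.sale)
    FROM Sales S, Location L
    WHERE S.locid = L.locid
    GROUP BY L.city;

    -- Rolled up to sales per province; drilling down is the reverse direction
    SELECT L.province, SUM(S.sale)
    FROM Sales S, Location L
    WHERE S.locid = L.locid
    GROUP BY L.province;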

  27. OLAP Operations • Slice • An equality query on one or more dimensions • You’re taking a slice of the cube • We slice off the sales data from Thunder Bay • Dice • A range query on one or more dimensions • You’re dicing off a smaller cube • We make a dice of sales in TBay and Toronto for toys and food in Quarters 2, 3, 4
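As SQL sketches (the Times and Products dimension tables, and their quarter and category columns, are assumptions):

    -- Slice: equality predicate on one dimension
    SELECT S.sale
    FROM Sales S, Location L
    WHERE S.locid = L.locid AND L.city = 'Thunder Bay';

    -- Dice: range/membership predicates on several dimensions
    SELECT S.sale
    FROM Sales S, Location L, Times T, Products P
    WHERE S.locid = L.locid AND S.timeid = T.timeid AND S.pid = P.pid
      AND L.city IN ('Thunder Bay', 'Toronto')
      AND P.category IN ('Toys', 'Food')
      AND T.quarter BETWEEN 2 AND 4;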

  28. OLAP Operations • Pivot (rotate) • Changing the visualisation angle of the data • 2D: transpose of the data • 3D: rotating cube, or splitting it into a set of 2D tables

  29. OLAP Operations (example) • [Figure: the five operations applied to the 3D sales cube. Roll-up aggregates Location from cities (NY, TO, TB) to countries (US, CA); Drill-down expands each Quarter into its months (J–D); Slice extracts the Thunder Bay plane; Dice keeps locations TB and TO, categories T and F, Quarters 2–4; Pivot rotates the axes of the result]

  30. Efficiency Considerations • Data warehouses contain massive amounts of data • TB in size, millions of tuples • But queries must be answered quickly • Seconds at the most • Efficiency is an issue • Biggest cost comes from aggregation operations, e.g. computing the different cuboids in the lattice (slide 13)

  31. Efficiency Considerations: Precomputing Cuboids • Solution: precompute all cuboids in the lattice? How many could there be? • n dimensions = 2^n cuboids • But wait! Some dimensions have hierarchies of attributes associated with them • If dimension i has L_i levels, the total becomes (L_1 + 1) × (L_2 + 1) × … × (L_n + 1) cuboids • Clearly not a good solution • Although we can still precompute some cuboids, if we know which ones are most likely to be used

  32. Efficiency Considerations: Refreshing Cuboids • When the data in the warehouse changes (e.g. when we load in data from an operational database), precomputed cuboids don’t change • We need to refresh them • We could simply delete the cuboid and recompute it from zero • Simple, requires no new algorithms • Can work well if the data is in another system • Can work well on smaller tables • But becomes expensive on larger relations • Or we can refresh the cuboid incrementally • Only update the tuples that changed • Cost proportional to the change

  33. Efficiency Considerations: Refreshing Cuboids • Projection cubes • The cuboid is a projection of attributes from a single relation • Slice, dice and pivot operations • Incremental insert • Compute the projection of the new tuples, and add to the cuboid • Incremental delete • Compute the projection of the tuples to remove, and subtract from the cuboid
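A sketch of the idea, assuming a hypothetical delta relation delta_sales holding the newly inserted tuples and a duplicate counter on the cuboid so that deletes know when a projected row disappears (PostgreSQL upsert syntax; requires a unique key on (locid, category)):

    INSERT INTO proj_cuboid (locid, category, dup_cnt)
    SELECT locid, category, COUNT(*)
    FROM delta_sales
    GROUP BY locid, category
    ON CONFLICT (locid, category)
    DO UPDATE SET dup_cnt = proj_cuboid.dup_cnt + EXCLUDED.dup_cnt;

The symmetric delete subtracts the counts and removes rows whose counter reaches zero.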

  34. Efficiency Considerations: Refreshing Cuboids • Join cubes • The cuboid is a join of two relations • Roll-up and drill-down with additional tables for hierarchies • Incremental insert • Compute the join of the new tuples in one relation with the other relation, and add to the cube • Incremental delete • Compute the join of the tuples to remove in one relation with the other relation, and subtract from the cube
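A sketch with the same hypothetical delta relation: only the delta needs to be joined with the other relation, not the full fact table:

    -- Incremental insert for a join cuboid
    INSERT INTO join_cuboid
    SELECT D.pid, D.timeid, D.sale, L.city, L.province
    FROM delta_sales D, Location L
    WHERE D.locid = L.locid;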

  35. Efficiency Considerations: Refreshing Cuboids • Aggregation cubes • Cuboid performs an aggregate function on some attribute • Roll-up operation • We need to maintain detailed information on the role of each row in the aggregated value of the group • COUNT: maintain a counter of the number of each instance counted, remove when counter = 0 • SUM: maintain a counter of the number of tuples summed; remove when counter = 0 (not sum = 0) • AVG: maintain a sum and counter, average = sum/counter • MAX/MIN: maintain the value and counter of instances of that value; easy to insert, but when counter = 0 we must scan the whole group to update
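A sketch of the bookkeeping for SUM and AVG, using a hypothetical summary table with a tuple counter per group (PostgreSQL upsert syntax):

    CREATE TABLE sales_by_region (
        region VARCHAR(40) PRIMARY KEY,
        cnt    BIGINT,          -- number of tuples aggregated into the group
        total  DECIMAL(14,2)    -- running SUM; AVG = total / cnt
    );

    -- Incremental insert of one new fact
    INSERT INTO sales_by_region VALUES ('Alberta', 1, 99.50)
    ON CONFLICT (region)
    DO UPDATE SET cnt   = sales_by_region.cnt + 1,
                  total = sales_by_region.total + EXCLUDED.total;

    -- Incremental delete: subtract, then drop groups whose counter hits zero
    UPDATE sales_by_region
    SET cnt = cnt - 1, total = total - 99.50
    WHERE region = 'Alberta';
    DELETE FROM sales_by_region WHERE cnt = 0;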

  36. Efficiency Considerations: Refreshing Cuboids • Immediate maintenance • Refresh the cuboid whenever the tables are updated • Cuboid always up-to-date • Slows down the update operations • Deferred maintenance • Lazy update: update a cuboid when queried • Periodic update: update a cuboid at a set time • Forced update: update a cuboid after a given number of table updates

  37. Efficiency Considerations: Indexing • Indexing data allows faster search and retrieval • Data warehouses make use of two new indexes • Bitmap index • Join index

  38. Efficiency Considerations: Indexing • Consider an attribute with few possible values; each value can then be represented as a (sparse) bit vector, with one bit per row • Example: [Table: employees with gender (M/F) and rating (1–5), each attribute value encoded as one bit column] • A column in this bit table is a bitmap index

  39. Efficiency Considerations: Indexing • Advantages: • More compact and compressible than B+ tree & hash indexes • More efficient: can make use of bit operations • Example: which male employees have a rating of 3? AND the “male” bit vector with the “rating = 3” bit vector
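The bit operation itself, shown with PostgreSQL bit-string literals (the vectors are made up; in a real index they would be read from disk, one bit per row):

    SELECT B'01101' & B'00111';  -- "male" AND "rating = 3"
    -- result: 00101, i.e. rows 3 and 5 satisfy both predicates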

  40. Efficiency Considerations: Indexing • Join index to speed up join queries • Indexes results of join operations • Allows us to skip the join altogether • Particularly useful with star/constellation schema • Example: • Sales per location per category is a join query

  41. Efficiency Considerations: Indexing • Example (continued): • The join index will store the results of the join • Discard all unnecessary results (e.g. categories not sold at certain locations) • Subsequent queries on sales per location will be more efficient (no unnecessary I/O)
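A true join index records the matching record ids inside the index structure, but its effect can be approximated with a materialized view that computes the join once and serves it to later queries (PostgreSQL syntax; the Products/category column is an assumption):

    CREATE MATERIALIZED VIEW sales_by_loc_cat AS
    SELECT L.locid, P.category, SUM(S.sale) AS total
    FROM Sales S, Location L, Products P
    WHERE S.locid = L.locid AND S.pid = P.pid
    GROUP BY L.locid, P.category;

    REFRESH MATERIALIZED VIEW sales_by_loc_cat;  -- refresh when the data changes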

  42. Finding Answers Quickly • A recent trend, fueled mostly by the Internet, is an emphasis on queries for which a user only wants the first few, or the ‘best’ few, answers quickly • A related trend is that, for complex queries, users would like to see an approximate answer quickly and then have it be continually refined, rather than wait until the exact answer is available • Especially important when dealing with data warehouses, where taking all the data into account takes an enormous amount of time

  43. Finding Answers Quickly • For queries with a lot of results, sometimes users only want the top-N results, rather than complete results • For long queries, sometimes users want preliminary approximate results right away, rather than wait for the exact results

  44. Top-N Results

    SELECT P.pid, P.pname, S.sale
    FROM Sales S, Products P
    WHERE S.pid = P.pid AND S.locid = 1 AND S.timeid = 3
    ORDER BY S.sale DESC
    OPTIMIZE FOR 10 ROWS

• Without the OPTIMIZE command, the query would compute all sales – wasteful • The OPTIMIZE command is not standard SQL • What to do without it?
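One answer in newer systems is the SQL:2008 fetch-first clause (spelled LIMIT 10 in PostgreSQL and MySQL), which both truncates the result and lets the optimizer plan for early termination:

    SELECT P.pid, P.pname, S.sale
    FROM Sales S, Products P
    WHERE S.pid = P.pid AND S.locid = 1 AND S.timeid = 3
    ORDER BY S.sale DESC
    FETCH FIRST 10 ROWS ONLY;

The next slide shows the approach available in plain older SQL.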

  45. Top-N Results

    SELECT P.pid, P.pname, S.sale
    FROM Sales S, Products P
    WHERE S.pid = P.pid AND S.locid = 1 AND S.timeid = 3
      AND S.sale > c
    ORDER BY S.sale DESC

• If you know the approximate value c of the top-N selling products, add it to the query • Issues • How to discover c? • What if we get more than N results? • What if we get fewer than N results?

  46. Online Aggregation

    SELECT L.region, AVG(S.sale)
    FROM Sales S, Location L
    WHERE S.locid = L.locid
    GROUP BY L.region

• Assume Sales and Location are massive tables • Or distributed over several computers on a network • Takes a long time to completely compute the results • Online aggregation • Compute and display approximate results, and update them as the computation goes on • But the DBMS now needs to give confidence values for the results
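True online aggregation needs non-blocking operators and confidence intervals (next slide), but a rough approximate-first answer can be sketched with sampling; here, a 1% Bernoulli sample in PostgreSQL:

    SELECT L.region, AVG(S.sale)
    FROM Sales S TABLESAMPLE BERNOULLI (1), Location L
    WHERE S.locid = L.locid
    GROUP BY L.region;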

  47. Online Aggregation • Probability of 93% that sales in Alberta are $2,832.50 ± $132.20, based on 40% of the computations • The user does not prioritize that region • The DBMS needs • Statistical algorithms for the confidence intervals • Non-blocking algorithms

  48. Summary • A data warehouse is a massive multidimensional heterogeneous database of historical and statistical data • Requires different ways of seeing data • Data cubes and lattice • Database schemas: star, snowflake, constellation • Requires new operations • OLAP: Roll-up, drill-down, slice, dice, pivot • Has more efficiency constraints • New indexes • Top results • Online operations

  49. Exercises • Cow Book: 25.1, 25.2, 25.3, 25.6, 25.9 (use cuboids instead of views), 25.10 (use cuboids instead of views) • Data Mining Book: 4.2, 4.3, 4.4, 4.5, 4.6, 4.13
