On-Line Application Processing

On-Line Application Processing Warehousing Data Cubes (Data Mining) (slides borrowed from Stanford)

Overview • Traditional database systems are tuned to many, small, simple queries. • Some new applications use fewer, more time-consuming, complex queries. • New architectures have been developed to handle complex “analytic” queries efficiently.

The Data Warehouse • The most common form of data integration. • Copy sources into a single DB (warehouse) and try to keep it up-to-date. • Usual method: periodic reconstruction of the warehouse, perhaps overnight. • Frequently essential for analytic queries.

OLTP • Most database operations involve On-Line Transaction Processing (OTLP). • Short, simple, frequent queries and/or modifications, each involving a small number of tuples. • Examples: Answering queries from a Web interface, sales at cash registers, selling airline tickets.

OLAP • Of increasing importance are On-Line Application Processing (OLAP) queries. • Few, but complex queries --- may run for hours. • Queries do not depend on having an absolutely up-to-date database.

OLAP Examples • Amazon analyzes purchases by its customers to come up with an individual screen with products of likely interest to the customer. • Analysts at Wal-Mart look for items with increasing sales in some region.

Common Architecture • Databases at store branches handle OLTP. • Local store databases copied to a central warehouse overnight. • Analysts use the warehouse for OLAP.

Loading the Data Warehouse Data is periodically extracted Data is cleansed and transformed Users query the data warehouse Data Staging Area Data Warehouse Source Systems (OLTP)

Terminology: ETL • ETL = Extraction, Transformation, & Load • Extraction: Get the data out of the source systems • Transformation: Convert the data into a useful format for analysis • Load: Get the data into the data warehouse (…and build indexes, materialized views, etc.)

Data Integration is Hard • Data warehouses combine data from multiple sources • Data must be translated into a consistent format • Data integration represents ~80% of effort for a typical data warehouse project! • Some reasons why it’s hard: • Metadata is often poor or non-existent • Data quality is often bad • Missing or default values • Multiple spellings of the same thing (Cal vs. UC Berkeley vs. University of California) • Inconsistent semantics • What is an airline passenger?

Rewritten Queries Query Extraction Query Answer Answer Federated Databases • An alternative to data warehouses • Data warehouse • Create a copy of all the data • Execute queries against the copy • Federated database • Pull data from source systems as needed to answer queries • “lazy” vs. “eager” data integration Mediator Warehouse Source Systems Source Systems Data Warehouse Federated Database

Star Schemas • A star schema is a common organization for data at a warehouse. It consists of: • Fact table : a very large accumulation of facts such as sales. • Often “insert-only.” • Dimension tables : smaller, generally static information about the entities involved in the facts.

Example: Star Schema • Suppose we want to record in a warehouse information about every beer sale: the bar, the brand of beer, the drinker who bought the beer, the day, the time, and the price charged. • The fact table is a relation: Sales(bar, beer, drinker, day, time, price)

Example, Continued • The dimension tables include information about the bar, beer, and drinker “dimensions”: Bars(bar, addr, license) Beers(beer, manf) Drinkers(drinker, addr, phone)

Visualization – Star Schema Dimension Table (Bars) Dimension Table (Drinkers) Dimension Attrs. Dependent Attrs. Fact Table - Sales Dimension Table (Beers) Dimension Table (etc.)

Dimensions and Dependent Attributes • Two classes of fact-table attributes: • Dimension attributes : the key of a dimension table. • Dependent attributes : a value determined by the dimension attributes of the tuple.

Example: Dependent Attribute • price is the dependent attribute of our example Sales relation. • It is determined by the combination of dimension attributes: bar, beer, drinker, and the time (combination of day and time-of-day attributes).

Narrow Big (many rows) Numeric Growing over time Wide Small (few rows) Descriptive Static Comparing Facts and Dimensions Facts Dimensions Facts contain numbers, dimensions contain labels

The table above is an example of a cross-tabulation (cross-tab), also referred to as a pivot-table. A cross-tab is a table where values for one of the dimension attributes form the row headers, values for another dimension attribute form the column headers Values in individual cells are (aggregates of)the values of the dimension attributes that specify the cell. Cross Tabulation of sales by item-name and color

Marginals • The data cube also includes aggregation (typically SUM) along the margins of the cube. • The marginals include aggregations over one dimension, two dimensions,…

Visualization - Data Cube w/ Aggregation beer SUM over all Drinkers price bar drinker

Example: Marginals • Our 4-dimensional Sales cube includes the sum of price over each bar, each beer, each drinker, and each time unit (perhaps days). • It would also have the sum of price over all bar-beer pairs, all bar-drinker-day triples,…

Structure of the Cube • Think of each dimension as having an additional value *. • A point with one or more *’s in its coordinates aggregates over the dimensions with the *’s. • Example: Sales(“Joe’s Bar”, “Bud”, *, *) holds the sum over all drinkers and all time of the Bud consumed at Joe’s.

Relational Representation • Crosstabs can be represented as relations • The value all is used to represent aggregates • The SQL:1999 standard actually uses null values in place of all

Three-Dimensional Data Cube • A data cube is a multidimensional generalization of a crosstab • Cannot view a three-dimensional object in its entirety • but crosstabs can be used as views on a data cube

Data Cube • Axes of the cube represent attributes of the data records • e.g. color, month, state • Called dimensions • Cells hold aggregated measurements • e.g. total $ sales, number of autos sold • Called facts • Real data cubes have >> 3 dimensions Auto Sales Red Blue WA OR Gray CA Jul Aug Sep

WA Blue OR Total Blue Jul Aug Sep CA Jul Aug Sep Slicing and Dicing Red Red Blue Blue WA WA OR OR Gray Gray CA CA Jul Aug Sep Jul Aug Sep

Querying the Data Cube Number of Autos Sold • Cross-tabulation • “Cross-tab” for short • Report data grouped by 2 dimensions • Aggregate across other dimensions • Include subtotals • Operations on a cross-tab • Roll up (further aggregation) • Drill down (less aggregation)

Drill downby Color Number of Autos Sold Roll Up and Drill Down Number of Autos Sold Number of Autos Sold Roll upby Month

Full Data Cube with Subtotals • Pre-computation of aggregates → fast answers to OLAP queries • Ideally, pre-compute all 2n types of subtotals • Otherwise, perform aggregation as needed • Coarser-grained totals can be computed from finer-grained totals • But not the other way around

Data Cube Lattice State, Month, Color State, Month State,Color Month,Color RollUp DrillDown State Month Color Total

MOLAP vs. ROLAP • MOLAP = Multidimensional OLAP • Store data cube as multidimensional array • (Usually) pre-compute all aggregates • Advantages: • Very efficient data access → fast answers • Disadvantages: • Doesn’t scale to large numbers of dimensions • Requires special-purpose data store

Sparsity • Imagine a data warehouse for Safeway. • Suppose dimensions are: Customer, Product, Store, Day • If there are 100,000 customers, 10,000 products, 1,000 stores, and 1,000 days… • …data cube has 1,000,000,000,000,000 cells! • Fortunately, most cells are empty. • A given store doesn’t sell every product on every day. • A given customer has never visited most of the stores. • A given customer has never purchased most products. • Multi-dimensional arrays are not an efficient way to store sparse data.

MOLAP vs. ROLAP • ROLAP = Relational OLAP • Store data cube in relational database • Express queries in SQL • Advantages: • Scales well to high dimensionality • Scales well to large data sets • Sparsity is not a problem • Uses well-known, mature technology • Disadvantages: • Query performance is slower than MOLAP • Need to construct explicit indexes

Creating a Cross-tab with SQL Grouping Attributes Measurements SELECT state, month, SUM(quantity) FROM sales GROUP BY state, month WHERE color = 'Red' Filters

SQL aggregation query with GROUP BY does not produce subtotals, totals Our cross-tab report is incomplete. What about the totals? StateMonthSUM CA Jul 45 CA Aug 50 CA Sep 38 OR Jul 33 OR Aug 36 OR Sep 31 WA Jul 30 WA Aug 42 WA Sep 40 Number of Autos Sold

One solution: a big UNION ALL SELECT state, month, SUM(quantity) FROM sales GROUP BY state, month WHERE color = 'Red‘ UNION ALL SELECT state, "ALL", SUM(quantity) FROM sales GROUP BY state WHERE color = 'Red' UNION ALL SELECT "ALL", month, SUM(quantity) FROM sales GROUP BY month WHERE color = 'Red‘ UNION ALL SELECT "ALL", "ALL", SUM(quantity) FROM sales WHERE color = 'Red' Original Query State Subtotals Month Subtotals Overall Total

A better solution • “UNION ALL” solution gets cumbersome with more than 2 grouping attributes • n grouping attributes → 2n parts in the union • OLAP extensions added to SQL 99 are more convenient • CUBE, ROLLUP SELECT state, month, SUM(quantity) FROM sales GROUP BY CUBE(state, month) WHERE color = 'Red'

Results of the CUBE query StateMonthSUM(quantity) CA Jul 45 CA Aug 50 CA Sep 38 CA NULL 133 OR Jul 33 OR Aug 36 OR Sep 31 OR NULL 100 WA Jul 30 WA Aug 42 WA Sep 40 WA NULL 112 NULL Jul 108 NULL Aug 128 NULL Sep 109 NULL NULL 345 Notice the use of NULL for totals Subtotals at all levels

ROLLUP vs. CUBE • CUBE computes entire lattice • ROLLUP computes one path through lattice • Order of GROUP BY list matters • Groups by all prefixes of the GROUP BY list • GROUP BY ROLLUP(A,B,C) • A,B,C • (A,B) subtotals • (A) subtotals • Total • GROUP BY CUBE(A,B,C) • A,B,C • Subtotals for the following:(A,B), (A,C), (B,C), (A), (B), (C) • Total

ROLLUP example SELECT color, month, state, SUM(quantity) FROM sales GROUP BY ROLLUP(color,month,state) State, Month, Color State, Month State,Color Month,Color State Month Color Total

On-Line Application Processing