Efficient Data Cube Computation Methods for Big Data Analytics

Implementação do DW

O problema e as soluções Grandes quantidades de dados => Métodos de acesso e processamento de interrogações eficientes • Materialização de vistas para informação agregada: Identificar, explorar e actualizar de forma eficiente • Índices: Index intersection, bit map indices, join indices H. Galhardas

Escolha dos agregados Objectivo: Minimizar o tempo de resposta Pergunta: Quais os agregados a materializar? Factores: • Espaço de armazenamento – custo baixo! • Custo de actualização – custo alto! Compromisso: Tempo de resposta vs custo de actualização H. Galhardas

Cube Operation • Transform it into a SQL-like language (with a new operator cube by, introduced by Gray et al.’96) SELECT item, city, year, SUM (amount) FROM SALES CUBE BY item, city, year • Need compute the following Group-Bys (date, product, customer), (date,product),(date, customer), (product, customer), (date), (product), (customer) () 2^n cuboids () (city) (item) (year) (city, item) (city, year) (item, year) H. Galhardas (city, item, year)

Efficient Data Cube Computation • Goal: efficient computation of aggregations across many sets of dimensions • Take into account amount of main memory available and computation time • Data cube can be viewed as a lattice of cuboids • Materialization of data cube • Materialize every (cuboid) (full materialization), none (no materialization), or some (partial materialization) H. Galhardas

Partial materialization of cuboids Three factors must be considered: • Identify the subset of cuboids to materialize • Take into account queries in the workload, their frequencies and accessing costs; cost of incremental updates, total storage requirements • Popular approaches: use heuristics or materialize cuboids based on other frequently referenced cuboids • Exploit the materialized cuboids during query processing • How to use indexes and how to transform OLAP operations onto the selected cuboids • Efficiently update the materialized cuboids during load and refresh • Explore parallelism and incremental update H. Galhardas

Classification of agg. facts wrt the agg. functions Distributive: if the result derived by applying the function to n aggregate values is the same as that derived by applying the function on all the data without partitioning. Ex: count(), sum(), min(), max(). Algebraic:if it can be computed by an algebraic function with M arguments (where M is a constant), each of which is obtained by applying a distributive aggregate function. Ex: avg(), min_N(), standard_deviation(). Holistic:if there is no algebraic function with M arguments that characteriezes the computation. Ex: median(), mode(), rank(). H. Galhardas

Arquitecturas de servidor OLAP Relational OLAP (ROLAP) • Usa SGBDs relacionais ou relacional extendido para armazenar e gerir os dados do datawarehouse e usa middleware OLAP para suportar funcinalidades específicas do OLAP. • Inclui optimização suportada pelo SGBDR, implementa lógica de navegação de agregação e serviços/ferramentas adicionais • Maior escalabilidade Multidimensional OLAP (MOLAP) • Motor de armazenamento multidimensional baseado em arrays (sparse matrix techniques) • Indexação rápida de dados sumarizados pré-calculados Hybrid OLAP (HOLAP) • Flexibilidade: baixo nível: relacional, alto nível: array Specialized SQL servers • Suporte especializado para interrogações SQL sobre esquemas em estrela e floco de neve H. Galhardas

Some efficient cube computation methods • ROLAP-based cubing algorithms (Agarwal et al’96) • Array-based cubing algorithm (Zhao et al’97) • Bottom-up computation method (Bayer & Ramarkrishnan’99) • And others... H. Galhardas

ROLAP-Based Cubing Algorithm • Value-based addressing: dimension values accessed by key-based addressing strategies • Sorting, hashing, and grouping operations are applied to the dimension attributes in order to reorder and cluster related tuples • Grouping is performed on some subaggregates as a “partial grouping step” • Aggregates may be computed from previously computed aggregates, rather than from the base fact table H. Galhardas

Iceberg Cube • Computing only the cuboid cells whose count or other aggregates satisfying the condition like HAVING COUNT(*) >= minsup • Motivation • Only a small portion of cube cells may be “above the water’’ in a sparse cube • Only calculate “interesting” cells—data above certain threshold • Avoid explosive growth of the cube • Suppose 100 dimensions, only 1 base cell. How many aggregate cells if count >= 1? What about count >= 2? H. Galhardas

C c3 61 62 63 64 c2 45 46 47 48 c1 29 30 31 32 c 0 B 60 13 14 15 16 b3 44 28 56 9 b2 B 40 24 52 5 b1 36 20 1 2 3 4 b0 a0 a1 a2 a3 A Multi-way Array Aggregation for Cube Computation • Direct array addressing: dimension values accessed via the index of their corresponding array locations • Partition arrays into chunks (a small subcube which fits in memory). • Compressed sparse array addressing: (chunk_id, offset) • Compute aggregates in “multiway” by visiting cube cells in the order which minimizes the # of times to visit each cell, thereby reducing memory access and storage cost. What is the best traversing order to do multi-way aggregation? H. Galhardas

Cuboids • ABC – base cuboid – already computed • AB, AC, BC – 2-D cuboids • A, B, C – 1-D cuboid • All – 0-D cuboid • Many possible orderings w/ which chunks can be read into memory H. Galhardas

C c3 61 62 63 64 c2 45 46 47 48 c1 29 30 31 32 c 0 B 60 13 14 15 16 b3 44 28 56 9 b2 40 24 52 5 b1 36 20 1 2 3 4 b0 a0 a1 a2 a3 A Example (1) B H. Galhardas

Example (2) C c3 61 62 63 64 c2 45 46 47 48 c1 29 30 31 32 c 0 B 60 13 14 15 16 b3 44 28 B 56 9 b2 40 24 52 5 b1 36 20 1 2 3 4 b0 a0 a1 a2 a3 A H. Galhardas

Multi-Way Array Aggregation for Cube Computation (Cont.) Method: the planes should be sorted and computed according to their size in ascending order. • See the details of Example 2.12 (pp. 75-78) • Idea: keep the smallest plane in the main memory, fetch and compute only one chunk at a time for the largest plane Limitation of the method: computing well only for a small number of dimensions • If there are a large number of dimensions, “bottom-up computation” and iceberg cube computation methods can be explored H. Galhardas

Indexing OLAP Data: Bitmap Index • Index on a particular column • Each value in the column has a bit vector: bit-op is fast • The length of the bit vector: # of records in the base table • The i-th bit is set if the i-th row of the base table has the value for the indexed column • Not suitable for high cardinality domains Base table Index on Region Index on Type Cust Region Type C1 Asia Retail C2 Europe Dealer C3 Asia Dealer C4 America Retail C5 Europe Dealer H. Galhardas

Indexing OLAP Data: Join Indices • Join index: JI(R-id, S-id) where R (R-id, …)  S (S-id, …) • Traditional indices map the values to a list of record ids • It materializes relational join in JI file and speeds up relational join — a rather costly operation • In data warehouses, join index relates the values of the dimensions of a star schema to rows in the fact table. Ex: fact table: Sales and two dimensions city and product A join index on city maintains for each distinct city a list of R-IDs of the tuples recording the Sales in the city H. Galhardas

Example of join indexes H. Galhardas

Efficient Processing of OLAP Queries • Determine which operations should be performed on the available cuboids: • transform drill, roll, etc. into corresponding SQL and/or OLAP operations, e.g, dice = selection + projection • Determine to which materialized cuboid(s) the relevant operations should be applied. • Exploring indexing structures and compressed vs. dense array structures in MOLAP H. Galhardas

Bibliografia • (Livro) Data Mining: Concepts and Techniques, J. Han & M. Kamber, Morgan Kaufmann, 2001 (Capítulo 2 no livro e Capítulo 3 nos drafts) • (Artigo) Data Cube: A relational aggregation operator generalizing group-by, cross-tab, and sub-totals, J. Gray et al, DMKD 1(1), 1997 • (Artigo) On the computation of multidimensional aggregates, Agarwal et al, VLDB, 1996 H. Galhardas

Efficient Data Cube Computation Methods for Big Data Analytics

Efficient Data Cube Computation Methods for Big Data Analytics

Presentation Transcript

Status and Effectiveness of Ignition Interlock Laws

Gregory (Greg) Maltby, PMP, BSCS October 11, 2010 EECS 710