MDL Summarization with Holes
Shaofeng Bu, Laks V.S. Lakshmanan, Raymond T. Ng
University of British Columbia, Canada

Introduction
• Multi-dimensional OLAP queries typically produce data-intensive answers.
• The question is often how to express the large answer set of cells that satisfy the OLAP query conditions:
  • Simple enumeration: accurate, but not necessarily the most intuitive;
  • Summaries: not necessarily 100% accurate, but can be more intuitive and informative.
• Summarized answers can be more easily understood.

OLAP Data Cube Example
• Each dimension is associated with a hierarchical tree.
[Figure: 2-D data cube. Clothes axis-tree with nodes clothes, women’s, men’s, women’s jeans, men’s jeans, dress pants, formal wear, dress skirts, jackets, blouses, tops, skirts, ties. Location axis-tree with nodes location, northwest, midwest, northeast, Vancouver, Edmonton, San Jose, San Francisco, Chicago, Minneapolis, Boston, Summit, Albany, New York.]

OLAP Data Cube Example
• Data Cell: (c1, c2), where c1 and c2 are leaf nodes in the axis-trees, e.g. (Vancouver, ties)
• Data Region: (x1, y1) describes all data cells covered by the given nodes x1 and y1 in the axis-trees, e.g.:
  • (Vancouver, ties)
  • (Vancouver, women’s)
  • (northwest, women’s)
[Figure: data cube with the clothes and location axis-trees.]

OLAP Data Cube Example
• Blue cells: the cells that satisfy the query conditions;
• How do we find a summary of the blue cells in a data cube?
[Figure: data cube with the clothes and location axis-trees; the cells satisfying the query are shown in blue.]

MDL Summarization
• MDL: Minimum Description Length
• Use regions to cover the blue cells;
• The length of an MDL description is the number of regions and cells it includes;
• The goal of MDL summarization is to find the description with the minimum length.

An Example of MDL Summarization
[Figure: data cube with the clothes and location axis-trees; the blue cells are covered by regions R1–R9.]

A Motivating Example: A New Case
• Some cells are not blue any more; the affected regions are marked ?R1, ?R3, and ?R9.
• MDL summarization: 10 regions + 8 single blue cells; total length = 18
[Figure: data cube with the clothes and location axis-trees, showing regions R1–R13.]

Can we do better?
• Yes!
• We present a new compression approach, MDL with Holes:
  • Identify regions with blue cells, even if they contain non-blue cells;
  • Express the included blue cells by using regions with the exception of the covered non-blue cells;
  • Non-blue cells are called holes.

A Motivating Example: MDL with Holes
• MDL with Holes:
  • R1 + R3 – (Vancouver, skirts): length 3
  • R9 – (Boston, ties) – (New York, dress skirts): length 3
  • Plus 6 other regions
  • Length = 6 + 3 + 3 = 12
• MDL approach: length is 18
[Figure: data cube with the clothes and location axis-trees, showing regions R1–R9; R1, R3, and R9 each contain holes.]

Problem Statement
• MDL with Holes (MDLH): find a description with holes that has the minimum length, i.e. the maximum benefit.
• In practice, we can drill down on regions to get additional details.

Definitions: Length & Benefit
• Given a set B of data cells (blue cells), an MDLH description for B is D = S – H, where
  • S is a set of data regions,
  • H is a set of data cells, also called ‘holes’,
  • D covers exactly the data cells in B.
• Length: total number of regions and cells included in the description. |D| = |S| + |H|
• Benefit: how much shorter the MDLH summary is than the enumeration of B. Benefit(D) = |B| – |D|
• Example 1: B1 = {a, b, c}
  • D1 = s – d
  • |D1| = 2
  • Benefit(D1) = |B1| – |D1| = 1
• Example 2: B2 = {e, g}
  • D2 = t – f – h
  • |D2| = 3
  • Benefit(D2) = |B2| – |D2| = –1
[Figure: 1-D hierarchy with root x, internal nodes s and t, and leaves a–h.]

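The two definitions can be checked mechanically. The following is a minimal Python sketch, not code from the talk: the names `REGIONS`, `covered`, `length`, and `benefit` are illustrative, and the region contents (s covers a–d, t covers e–h) are inferred from the two examples on the slide.

```python
# Leaf cells covered by each region; inferred from the slide's examples
# (an assumption, since the figure itself is not available).
REGIONS = {"s": {"a", "b", "c", "d"}, "t": {"e", "f", "g", "h"}}


def covered(regions, holes):
    """Cells described by D = S - H. A name that is not in REGIONS is a leaf cell."""
    cells = set()
    for r in regions:
        cells |= REGIONS.get(r, {r})
    return cells - set(holes)


def length(regions, holes):
    """|D| = |S| + |H|."""
    return len(regions) + len(holes)


def benefit(blue, regions, holes):
    """Benefit(D) = |B| - |D|; D must cover exactly the blue cells B."""
    assert covered(regions, holes) == set(blue), "D must cover exactly B"
    return len(blue) - length(regions, holes)


b1 = benefit({"a", "b", "c"}, ["s"], ["d"])       # D1 = s - d
b2 = benefit({"e", "g"}, ["t"], ["f", "h"])       # D2 = t - f - h
print(b1, b2)
```

A negative benefit, as for D2, means plain enumeration of the blue cells is shorter than the description with holes.
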
Related Work
• The Generalized MDL Approach for Summarization, Laks V.S. Lakshmanan, Raymond T. Ng et al., VLDB 2002
  • Reduces description length by allowing non-blue cells to be covered in the regions;
  • The regions are not pure.
• Concise Descriptions of Subsets of Structured Sets, Alberto O. Mendelzon & Ken Q. Pu, PODS 2003
  • Allows Cartesian products to be formed;
  • Not purely hierarchical: the NP-completeness result is less surprising;
  • What about the purely hierarchical case?
• Intelligent Rollups in Multidimensional OLAP Data, Gayatri Sathe and Sunita Sarawagi, VLDB 2001
  • Only reports consistent generalizations: a tuple can be generalized along a set of dimensions only if it can be generalized along all subsets of those dimensions.

Outline
• Introduction to MDL with Holes
  • A motivating example
• 1-D Case: MDLH is Tractable
• 2-D Case: MDLH is NP-Hard
• Heuristics
  • A Greedy Heuristic
  • Dynamic Programming
  • Quadratic Programming
• Experimental Results
• Summarization on Holes: An Extension
• Conclusions & Contributions

1-D Case: MDLH is Tractable
• MDLH is tractable: in the 1-D case, the optimal MDLH description, i.e. the one with the maximum benefit, can be generated in polynomial time.
• Node ‘x’:
  • D1 = x – d – f – j
  • Benefit(D1) = 7 – 4 = 3
  • D2 = (s – d) + e + (u – j)
  • Benefit(D2) = 7 – 5 = 2
• Node ‘y’:
  • D3 = y – m – p – q – r
  • Benefit(D3) = 4 – 5 = –1
  • D4 = (v – m) + o
  • Benefit(D4) = 4 – 3 = 1
• Node ‘z’:
  • D5 = z – d – f – j – m – p – q – r
  • Benefit(D5) = 11 – 8 = 3
  • D6 = (x – d – f – j) + (v – m) + o
  • Benefit(D6) = 11 – 7 = 4
[Figure: 1-D hierarchy with root z, internal nodes x, y, s, t, u, v, w, and leaves a–r.]

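One simple bottom-up recurrence reproduces the numbers above: at each node, either describe the children separately, or use the node as a single region and list every non-blue leaf below it as a hole. The Python sketch below is an illustration under assumptions and is not taken from the paper: the tree shape (`CHILDREN`) and the hole set (`HOLES`) are reconstructed from the worked examples, and `solve` is a hypothetical name.

```python
# Hierarchy and holes reconstructed from the slide's worked examples (assumed).
CHILDREN = {
    "z": ["x", "y"],
    "x": ["s", "t", "u"],
    "y": ["v", "w"],
    "s": list("abcd"), "t": list("ef"), "u": list("ghij"),
    "v": list("klmn"), "w": list("opqr"),
}
HOLES = set("dfjmpqr")  # non-blue leaves; every other leaf is blue


def solve(node):
    """Return (blue_count, hole_count, best_length) for the subtree at node."""
    kids = CHILDREN.get(node)
    if kids is None:  # leaf: a blue cell costs 1, a non-blue cell costs nothing
        return (0, 1, 0) if node in HOLES else (1, 0, 1)
    blue = holes = split = 0
    for kid in kids:
        b, h, best = solve(kid)
        blue, holes, split = blue + b, holes + h, split + best
    whole = 1 + holes  # this node as one region, minus all holes below it
    return blue, holes, min(split, whole)


for name in ("x", "y", "z"):
    blue, _, best = solve(name)
    print(name, "length", best, "benefit", blue - best)
```

Each node is visited once, so the running time is linear in the size of the tree, which is consistent with the tractability claim for the 1-D case.
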
2-D Case: Optimality is not Preserved Any More
[Figure: 2-D grid with row labels a–g (parent i) and column labels 1–7 (parent 8).]

rows                  length  benefit
(f,8),(g,8)           3       2
(c,8),(d,8),(e,8)     4       0
(a,8),(b,8)           5       –2

columns               length  benefit
(i,1)                 3       2
(i,2),(i,3),(i,4)     4       0
(i,5)                 5       –2
(i,6),(i,7)

• Optimal Solution:
  {(c,8) + (d,8) + (e,8) + (i,2) + (i,3) + (i,4)}
  – {(c,2) + (c,3) + (c,4) + (d,2) + (d,3) + (d,4) + (e,2) + (e,3) + (e,4)}
  + (f,1) + (g,1) + (f,6) + (g,7)
• Length = 19; Benefit = 28 – 19 = 9

MDLH is NP-Hard in 2-D Case
• It is NP-hard to find the optimal MDLH description in a 2-D data cube;
• The proof is not trivial: details are in the paper;
• Reduction strategy: Clique → Maximum Induced Subgraph in Complete Edge-Weighted (CEW) Bipartite Graph → MDL with Holes

Heuristics for MDLH
• Greedy
  • At each step, choose the row/column with the largest benefit
• Dynamic Programming
  • A bottom-up method that builds the description of a region from the descriptions of its child regions
• Quadratic Programming
  • Uses a quadratic function to represent the benefit of a 2-D data cube

Example for Comparison with Heuristics
[Figure: 2-D grid with row labels a–e and column labels 1–12.]
• The optimal description for this example:
  (e,1) – (a,1) + (e,2) – (b,2) + (e,3) – (b,3) + (d,4) + (b,5) + (e,6) + (e,8) + (a,11) – (a,8)
• Length = 12; Benefit = 8

Heuristics: A Greedy Heuristic
[Figure: 2-D grid with row labels a–e and column labels 1–12.]

region   length  benefit  holes
(e,6)    1       3        -
(d,10)   2       2        (d,5)
(e,1)    2       1        (a,1)
(e,2)    2       1        (b,2)
(e,3)    2       1        (b,3)
(e,8)    2       1        (a,8)
(a,11)   2       1        (a,8)
(c,10)   3       0        (c,4), (c,5)

• Description by Greedy:
  (e,6) + (a,11) + (e,8) – (a,8) + (d,10) – (d,5) + (a,2) + (a,3) + (b,1) + (b,5) + (c,1) + (c,2) + (c,3)
• The length is 13
• The benefit is 20 – 13 = 7

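The greedy idea can be sketched on a simplified setting. The Python code below is an assumption-laden illustration rather than the algorithm from the paper: it handles only a flat grid whose candidate regions are whole rows and whole columns, it measures the gain of a candidate as newly covered blue cells minus one (for the region) minus newly introduced holes, and it runs on hypothetical toy data, not the slide's example. The name `greedy_mdlh` is made up for this sketch.

```python
def greedy_mdlh(rows, cols, blue):
    """Greedy row/column selection on a flat grid (no deeper hierarchy)."""
    # Candidate regions: every full row and every full column.
    candidates = {("row", r): {(r, c) for c in cols} for r in rows}
    candidates.update({("col", c): {(r, c) for r in rows} for c in cols})

    chosen, covered = [], set()
    while True:
        best_gain, best_name = 0, None
        for name, cells in candidates.items():
            if name in chosen:
                continue
            new = cells - covered
            new_blue = len(new & blue)
            # Replacing new_blue single cells by 1 region plus the new holes.
            gain = new_blue - 1 - (len(new) - new_blue)
            if gain > best_gain:  # strict: ties keep the earlier candidate
                best_gain, best_name = gain, name
        if best_name is None:  # no candidate shortens the description
            break
        chosen.append(best_name)
        covered |= candidates[best_name]

    holes = covered - blue      # non-blue cells inside chosen regions
    singles = blue - covered    # blue cells that must be listed one by one
    return chosen, holes, singles, len(chosen) + len(holes) + len(singles)


# Hypothetical toy input: row "a" is blue except (a,3); column 1 is all blue.
rows, cols = ["a", "b", "c"], [1, 2, 3, 4, 5]
blue = {("a", 1), ("a", 2), ("a", 4), ("a", 5), ("b", 1), ("c", 1)}
chosen, holes, singles, total = greedy_mdlh(rows, cols, blue)
print(chosen, holes, singles, total)
```

On this toy input the sketch selects row a and column 1 with the single hole (a,3), for a length of 3 against 6 enumerated cells.
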
Greedy: Why is it not optimal?
• Selecting one row/column can reduce the benefit available from other rows/columns, which lowers the total benefit.
[Figure: two 2-D grids side by side, labeled "Optimal Description" and "Description from Greedy".]

Heuristics: Dynamic Programming
• L: the length of a region
• S: selection of rows & columns
• (a,10): (a,2) + (a,3)
  • L(a,10) = 2, S(a,10) = ‘t2’
• (e,4): (d,4)
  • L(e,4) = 1, S(e,4) = ‘t1’
• (d,10): (d,10) – (d,5)
  • L(d,10) = 2, S(d,10) = ‘g’
[Figure: 2-D grid with row labels a–e and column labels 1–12; the two axes are labeled t1 and t2.]

Heuristics: Dynamic Programming (2)
• D(x1,x2): description for region (x1,x2)
• S(e,12) = ‘t2’, S(e,10) = ‘t2’, S(e,11) = ‘t2’
• D(e,12) = D(e,10) + D(e,11)
  • D(e,10) = D(e,1) + D(e,2) + D(e,3) + D(e,4) + D(e,5)
  • D(e,11) = D(e,6) + D(e,7) + D(e,8) + D(e,9)
• Descriptions of the individual columns: (e,1) – (a,1); (e,2) – (b,2); (e,3) – (b,3); (d,4); (b,5); (e,6); (a,7); (e,8) – (a,8); (a,9)
• Generated description:
  (e,1) – (a,1) + (e,2) – (b,2) + (e,3) – (b,3) + (d,4) + (b,5) + (e,6) + (a,7) + (e,8) – (a,8) + (a,9)
• The length is 13 and the benefit is 20 – 13 = 7
[Figure: 2-D grid with row labels a–e and column labels 1–12; the two axes are labeled t1 and t2.]

Dynamic Programming: Why is it not optimal?
• It misses combinations of rows and columns.
[Figure: two 2-D grids side by side, labeled "Description by Dynamic Programming" and "Optimal Description".]

Heuristics: Quadratic Programming
• Use variables to represent rows/columns; for a variable v:
  • v = 1: the corresponding row/column is selected;
  • v = 0: the corresponding row/column is not selected;
• f = – Benefit(D)
• Maximizing the benefit is equivalent to minimizing the value of f;
• For the previous example, quadratic programming generates the optimal description;
• In general, however, optimality is not guaranteed.

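The slide does not spell out the quadratic function. As an illustration only, here is one way such a function could look for a flat grid in which the candidate regions are whole rows and whole columns; this formulation is an assumption of this note, not the paper's. Let r_i and c_j be the 0/1 variables for row i and column j, and let B be the set of blue cells. A cell (i,j) is covered exactly when r_i + c_j – r_i c_j = 1, so covered non-blue cells become holes and uncovered blue cells must be listed individually:

```latex
% Assumed formulation for a flat grid; r_i, c_j \in \{0, 1\}.
\begin{aligned}
|D| &= \sum_i r_i + \sum_j c_j
      + \sum_{(i,j) \notin B} \bigl(r_i + c_j - r_i c_j\bigr)
      + \sum_{(i,j) \in B} (1 - r_i)(1 - c_j), \\
f(r, c) &= -\,\mathrm{Benefit}(D) = |D| - |B|.
\end{aligned}
```

Every term is at most a product of two variables, so f is quadratic in the selection variables, which matches the slide's description of the approach.
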
Experiments
• We ran a set of experiments on the TPC-H benchmark data set;
• We compared the three MDLH heuristics with MDL and GMDL.

Experimental Results: Comparison of All Methods
• Compression Ratio:
  • MDLH-Quadratic generates the most concise descriptions: a yardstick of quality;
  • MDLH-Dynamic is a very close second.

Experimental Results: Compression Ratio
• The more children per parent node, the greater the benefit.

Experimental Results: Summary
• Running time & Scalability:
  • MDLH-Greedy is the fastest;
  • MDLH-Dynamic runs slower than MDLH-Greedy, but it is still scalable w.r.t. the number of cells.

Extension: Summarization on Holes
• As the blue density becomes high, a large part of the MDLH description is made up of holes.
• Can we further reduce the total length by summarizing the holes?
• MDLH description:
  (a,11) – {(a,6) + (a,8) + (a,9)} + (d,11) – {(d,6) + (d,7) + (d,8)} + (b,6) + (c,8)
  • Total length is 10.
• Summarization on holes:
  • (a,6) + (a,8) + (a,9) = (a,10) – (a,7)
  • (d,6) + (d,7) + (d,8) = (d,10) – (d,9)
• After summarization on holes:
  (a,11) – {(a,10) – (a,7)} + (d,11) – {(d,10) – (d,9)} + (b,6) + (c,8)
  • Total length is 8.
[Figure: 2-D grid with row labels a–e and column labels 1–11.]

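The two totals can be verified by counting terms recursively, where the holes of a region are themselves a description that may contain regions with their own exceptions. The following Python sketch is illustrative only; the nested-list encoding and the names `length`, `cell`, `before`, and `after` are assumptions made for this example.

```python
def length(description):
    """Length of a nested description.

    A description is a list of (region, holes) terms, where holes is itself
    a description that is subtracted from the region.
    """
    return sum(1 + length(holes) for _region, holes in description)


def cell(name):
    """A single cell or region with no exceptions."""
    return (name, [])


# MDLH description from the slide: holes are listed cell by cell.
before = [
    ("(a,11)", [cell("(a,6)"), cell("(a,8)"), cell("(a,9)")]),
    ("(d,11)", [cell("(d,6)"), cell("(d,7)"), cell("(d,8)")]),
    cell("(b,6)"),
    cell("(c,8)"),
]

# Same description after summarizing each group of holes as a region with a hole.
after = [
    ("(a,11)", [("(a,10)", [cell("(a,7)")])]),
    ("(d,11)", [("(d,10)", [cell("(d,9)")])]),
    cell("(b,6)"),
    cell("(c,8)"),
]

print(length(before), length(after))
```
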
Conclusions & Contributions
• We presented a new method, MDLH, to compress the answers of OLAP queries;
• We presented a bottom-up algorithm for 1-D cubes;
• We proved the NP-hardness of the MDLH problem in the 2-D case;
• We provided three heuristics for MDLH: greedy, dynamic programming, and quadratic programming;
• We extended MDLH with summarization on holes to further reduce the total length;
• We ran a set of experiments on the TPC-H benchmark data to compare the heuristics.

Ongoing Work
• Based on the summarization on blue cells and the summarization on holes, build a visualization tool with MDLH summarization:
  • Return summarized answers to users’ queries;
  • Provide a drill-down operation for users:
    • Browse details on blue cells
    • Browse details on holes
• Design a k-approximation algorithm for MDLH:
  • What is the best quality we can guarantee?