Sunita Sarawagi, IIT Bombay, http://www.it.iitb.ernet.in/~sunita
I3: Intelligent, Interactive Investigation of multidimensional data
Multidimensional OLAP databases Fast, interactive answers to large aggregate queries. • Multidimensional model: dimensions with hierarchies • Dim 1: Bank location: branch --> city --> state • Dim 2: Customer: sub-profession --> profession • Dim 3: Time: month --> quarter --> year • Measures: loan amount, #transactions, balance
OLAP • Navigational operators: pivot, drill-down, roll-up, select. • Hypothesis-driven search, e.g., factors affecting defaulters: • view defaulting rate by age, aggregated over the other dimensions • for a particular age segment, detail along profession • Need interactive response to aggregate queries.
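A minimal sketch of roll-up and drill-down over a toy fact table; the schema and values are made up for illustration and are not the talk's dataset:

```python
from collections import defaultdict

# Toy fact table: (state, profession, year, loan_amount).
facts = [
    ("CA", "doctor",   1997, 120),
    ("CA", "engineer", 1997,  80),
    ("NY", "doctor",   1998, 200),
    ("NY", "lawyer",   1998,  50),
]

def roll_up(rows, dims):
    """Aggregate the measure (last column) over the chosen dimension columns."""
    out = defaultdict(int)
    for row in rows:
        key = tuple(row[d] for d in dims)
        out[key] += row[3]
    return dict(out)

# Roll-up: total loan amount per state, aggregating over the rest.
by_state = roll_up(facts, [0])
# Drill-down: detail the NY total along profession.
ny_detail = {k: v for k, v in roll_up(facts, [0, 1]).items() if k[0] == "NY"}
```

Each navigational operator is thus just a different choice of grouping columns plus a selection.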
Motivation • OLAP products provide a minimal set of tools for analysis: • simple aggregates • selects/drill-downs/roll-ups on the multidimensional structure • Heavy reliance on manual operations for analysis • tedious on large data with multiple dimensions and levels of hierarchy • GOAL: automate through complex, mining-like operations integrated with OLAP.
State of the art in mining/OLAP integration • Decision trees [Information Discovery, Cognos] • find factors influencing high profits • Clustering [Pilot Software] • segment customers to define a hierarchy on that dimension • Time-series analysis [Seagate's Holos] • query for various shapes along time: spikes, outliers, etc. • Multi-level associations [Han et al.] • find associations between members of dimensions
Unravel aggregate data Total sales dropped 30% in N. America. Why? What is the most compact answer that the user can quickly assimilate?
Solution • A new DIFF operator added to OLAP systems that provides the answer • in a single step • is easy to assimilate • and compact, configurable by the user. • Obviates the lengthy, manual search for reasons in large multidimensional data.
Model for summarization • The two aggregated values correspond to two subcubes (cube-A and cube-B) in the detailed data.
Detailed answers explain only 15% of the total difference, as against 90% with the compact answer.
MDL model for summarization • Given N, find the best N-row answer such that: • if the user knows cube-A and the answer, • the number of bits needed to send cube-B is minimized.
Transmission cost: MDL-based • Each answer entry has a ratio: the sum of measure values in cube-B over that in cube-A, excluding cells covered by a more detailed entry in the answer. • For each cell of cube-B not in the answer: • r: ratio of the closest parent in the answer • a (b): measure value in cube A (B) • Expected value of b = a·r • #bits = -log(prob(b, a·r)), where prob(x, u) is the probability of value x under a distribution with mean u. • We use a Poisson distribution when the x are counts, and a normal distribution otherwise.
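As an illustration of the bit count under the Poisson case, a sketch of the coding cost for a count b given the expected mean a·r (the helper name is ours, not from the talk):

```python
import math

def poisson_bits(b, mean):
    """Bits to encode count b under a Poisson model with the given mean:
    -log2 P(b; mean) = -log2( mean^b * e^-mean / b! )."""
    log2_prob = (b * math.log2(mean)
                 - mean * math.log2(math.e)
                 - math.log2(math.factorial(b)))
    return -log2_prob

# A cell whose value matches the expectation a*r is cheap to transmit...
cheap = poisson_bits(10, mean=10.0)
# ...while a surprising cell costs many more bits, so it is worth
# spending one of the N answer rows on it instead.
surprising = poisson_bits(40, mean=10.0)
```

The answer rows are exactly the cells whose explicit inclusion saves the most bits relative to this default coding.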
Algorithm • Challenges: • circular dependence on the parent's ratio • bounded size of the answer • greedy methods do not work • Bottom-up dynamic programming algorithm
Figure: bottom-up dynamic programming. Tuples in the detailed data are grouped by common parent; for each group, partial solutions for answer sizes N = 0, 1, 2 are combined (taking the minimum when adding tuple i) and a new group is formed at the next level of the hierarchy.
Integration • Single pass over the data, with all indexing/sorting in the DBMS: interactive. • Low memory usage, independent of the number of tuples: O(NL). • Easy to package as a stored procedure on the data-server side. • When the detailed subcube is too large: work off aggregated data.
Performance • Setup: 333 MHz Pentium, 128 MB memory, data on DB2 UDB, NT 4.0. • OLAP benchmark: 1.36 million tuples, 4 dimensions. • 80% of the time is spent in data access. • A quarter million records processed in 10 seconds.
Problem formulation Inputs • A specific tuple Ts • An upper bound N on the answer size • Error functions • R(Ts,T) measures the error of including a tuple T in a generalization around Ts • S(Ts,T) measures the error of excluding T from the generalization Goal To find all possible consistent and maximal generalizations around Ts
Algorithm Considerations • Need to exploit the capabilities of the OLAP data source • Need to reduce the amount of data fetched into the application 2-stage approach • Finding generalizations • Getting exceptions
Finding generalizations n = number of dimensions Li = number of levels in the hierarchy of dimension Di Dij = jth level in the ith dimension hierarchy
candidate_set ← {D11, D21, …, Dn1} // all single-dimension candidate generalizations
k ← 1
while (candidate_set ≠ ∅)
  for each g ∈ candidate_set
    if (Σ_{T∈g} S(Ts,T) > Σ_{T∈g} R(Ts,T))
      Gk ← Gk ∪ {g}
  // generate candidates for pass (k+1) from generalizations of pass k
  candidate_set ← generateCandidates(Gk) // Apriori style
  // if a generalization is possible at level j of dimension Di, add its parent level to the candidate set
  candidate_set ← candidate_set ∪ {Di(j+1) | Dij ∈ Gk and j < Li}
  k ← k + 1
return ∪i Gi
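The level-wise loop above can be sketched in Python. Here `S_err`, `R_err`, and `parent_of` are hypothetical stand-ins for the sums of S(Ts,T) and R(Ts,T) over a candidate's tuples and for the dimension hierarchies; the Apriori-style multi-dimension candidate generation is omitted for brevity:

```python
def find_generalizations(single_level_candidates, parent_of, S_err, R_err):
    """Level-wise search: keep a candidate generalization if the cost of
    excluding its tuples (S) outweighs the cost of including them (R),
    then promote kept candidates to their parent hierarchy levels."""
    candidates = set(single_level_candidates)
    kept = set()
    while candidates:
        good = {g for g in candidates if S_err(g) > R_err(g)}
        kept |= good
        # Next pass: the coarser (parent) level of each good candidate.
        candidates = {parent_of[g] for g in good if g in parent_of}
    return kept

# Toy hierarchy city -> state with illustrative error sums.
S = {"city": 5, "state": 3}
R = {"city": 2, "state": 4}
result = find_generalizations(["city"], {"city": "state"}, S.get, R.get)
```

In the toy run, "city" qualifies (5 > 2) but its parent "state" does not (3 < 4), so the search stops there.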
Finding Summarized Exceptions Goal Find exceptions to each maximal generalization, compacted to within N rows and yielding the minimum total error Challenges • No absolute criterion for determining whether a tuple is an exception, for all possible R functions • The worth of including a child tuple depends circularly on its parent tuple • Bounded size of the answer Solution Bottom-up dynamic programming algorithm
Single dimension with multiple levels of hierarchy Optimal solution for finite-domain R functions soln(l,n,v): the best solution for subtree l, for all n between 0 and N and all possible values v of the default representative. soln(l,n,v,c): the intermediate value of soln(l,n,v) after the 1st through cth children of l are scanned.
Err(soln(l,n,v,c+1)) = min_{0≤k≤n} ( Err(soln(l,k,v,c)) + Err(soln(c+1, n−k, v)) )
Err(soln(l,n,v)) = min( Err(soln(l,n,v,*)), min_{v'≠v} ( Err(soln(l, n−1, v', *)) + rep(v') ) )
Figure: example run of the dynamic program on a single-dimension hierarchy (node 1 with children 1.1–1.4, each holding +/- detail tuples), with N = 3. Tables show the candidate solutions soln(1.1, n, *) through soln(1.4, n, *) and their errors for n = 0..3, which are combined into soln(1, n, *).
User-cognizant data exploration: overview • Monitor to find regions of data user has visited • Model user’s expectation of unseen values • Report most informative unseen values • How to • Model expected values? • Define information content?
Modeling expected values • Figure: views seen by the user vs. the database hidden from the user.
The Maximum Entropy Principle • Choose the most uniform distribution while adhering to all the constraints. • E. T. Jaynes [1990]: "it agrees with everything that is known but carefully avoids assuming anything that is not known. It is transcription into mathematics of an ancient principle of wisdom…" • Characterizing uniformity: entropy is maximized when all the pi are equal. • Solve the constrained optimization problem: maximize H(p) subject to the k constraints.
Modeling expected values • Figure: the visited views and the underlying database.
Finding expected values • Solve the constrained optimization problem: maximize H(p) subject to the k constraints. • Each constraint is of the form: a sum over an arbitrary set of values. • Expected values can be expressed as a product of k coefficients, one from each of the k constraints.
Iterative scaling algorithm Initially all p values are the same. While convergence is not reached: For each constraint Ci in turn: Scale the p values included in Ci so that their sum matches the constraint (i.e., multiply by the ratio of the constrained sum to their current sum). Converges to the optimal solution when all constraints are consistent.
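A sketch of the loop above as iterative proportional fitting on a flattened cube; the function name and toy constraints are ours:

```python
def iterative_scaling(n_cells, constraints, iters=100):
    """Max-entropy fit: start uniform, then repeatedly rescale the cells
    of each constraint so their sum matches the constraint's target.
    constraints: list of (cell_indices, target_sum)."""
    p = [1.0 / n_cells] * n_cells
    for _ in range(iters):
        for cells, target in constraints:
            cur = sum(p[i] for i in cells)
            if cur > 0:
                factor = target / cur
                for i in cells:
                    p[i] *= factor
    return p

# Two overlapping marginal constraints on a 2x2 cube flattened to 4 cells:
# cells 0+1 (a row the user saw) sum to 0.7, cells 0+2 (a column) to 0.4.
p = iterative_scaling(4, [([0, 1], 0.7), ([0, 2], 0.4)])
```

Because the constraints overlap in cell 0, each rescaling slightly disturbs the other constraint, which is why the loop must iterate to convergence.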
Information content of an unvisited cell • Defined as how much adding it as a constraint will reduce distance between actual and expected values • Distance between actual and expected: • Information content of (k+1)th constraint Ck+1: • Can be approximated as:
Adapting for OLAP data: Optimization 1: Expand the expected cube on demand • A single entry for all cells with the same expected value • Initially everything is aggregated, but early constraints touch a lot of data • Later constraints touch a limited amount of data
Optimization 2: Reduce overlap • The number of iterations depends on the overlap between constraints • Remove subsumed constraints from their parents to reduce overlap
Finding the N most informative cells • In general, the most informative cells can come from any level of aggregation. • A single-pass algorithm finds the best differences between actual and expected values [VLDB-99]
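One way to sketch such a single pass, with the information measure replaced by a simple absolute deviation for illustration (the real measure is the constraint-based one above):

```python
import heapq

def top_informative(cells, N):
    """One pass over (cell_id, actual, expected) triples, keeping the N
    cells that deviate most from the model in a bounded min-heap."""
    heap = []  # min-heap of (score, cell_id); its root is the worst kept cell
    for cell_id, actual, expected in cells:
        score = abs(actual - expected)  # stand-in for the info measure
        if len(heap) < N:
            heapq.heappush(heap, (score, cell_id))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, cell_id))
    return sorted(heap, reverse=True)

result = top_informative(
    [("a", 10, 9), ("b", 50, 10), ("c", 5, 5), ("d", 0, 20)], N=2)
```

Memory stays O(N) regardless of cube size, matching the talk's emphasis on interactive, single-pass processing.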
Illustration from student enrollment data 35% of the information in the data is captured in 12 out of 4560 cells: 0.25% of the data
Top few surprising values 80% of the information in the data is captured in 50 out of 4560 cells: 1% of the data
Summary • Our goal: enhance OLAP with a suite of operations that are • richer than simple OLAP and SQL queries • more interactive than conventional mining ...and thus reduce the need for manual analysis • Proposed three new operators: Diff, Generalize, Surprise • Formulations with theoretical basis • Efficient algorithms for online answering • Integrates smoothly with existing systems. • Future work: More operators.