Prediction Cubes: Exploratory Analysis of Data Mining Choices
Discover interesting subsets of data through prediction cubes, representing decision behavior. Understand the entire space of data mining choices using this innovative approach.
Presentation Transcript
Prediction Cubes Bee-Chung Chen, Lei Chen, Yi Lin and Raghu Ramakrishnan University of Wisconsin - Madison
Subset Mining • We want to find interesting subsets of the dataset • Interestingness: Defined by the “model” built on a subset • Cube space: A combination of dimension attribute values defines a candidate subset (just like regular OLAP) • We want the measures to represent decision/prediction behavior • Summarize a subset using the “model” built on it • Big change from regular OLAP!
The Idea • Build OLAP data cubes in which cell values represent decision/prediction behavior • In effect, build a tree for each cell/region in the cube—observe that this is not the same as a collection of trees used in an ensemble method! • The idea is simple, but it leads to promising data mining tools • Ultimate objective: Exploratory analysis of the entire space of “data mining choices” • Choice of algorithms, data conditioning parameters …
Example (1/7): Regular OLAP • Z: Dimensions (Location, Time) • Y: Measure • Goal: Look for patterns of unusually high numbers of applications
Example (2/7): Regular OLAP • Z: Dimensions • Y: Measure • Cell value: Number of loan applications • Goal: Look for patterns of unusually high numbers of applications • Roll up to coarser regions, e.g., country by year: CA = 100 (04), 90 (03); USA = 80 (04), 90 (03) • Drill down to finer regions, e.g., from country by month (CA, USA) to individual provinces/states by month (AB, …, YT within CA; AL, …, WY within USA)
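The roll-up on this slide can be illustrated with a minimal sketch. The cell coordinates and counts below are hypothetical (they are not the figures from the slide), and `roll_up` is an illustrative helper, not part of any OLAP library.

```python
from collections import defaultdict

# Hypothetical base cells: (country, state, year, month) -> number of loan applications.
base_cells = {
    ("USA", "AL", 2004, "Jan"): 55,
    ("USA", "WY", 2004, "Jan"): 10,
    ("USA", "WY", 2004, "Dec"): 5,
    ("CA", "AB", 2004, "Jan"): 20,
    ("CA", "YT", 2004, "Jan"): 5,
}

def roll_up(cells, key):
    """Aggregate cells to a coarser level; `key` maps a base coordinate to a coarser one."""
    out = defaultdict(int)
    for coord, count in cells.items():
        out[key(coord)] += count
    return dict(out)

# Coarser regions: country by year, and country by month.
by_country_year = roll_up(base_cells, lambda c: (c[0], c[2]))
by_country_month = roll_up(base_cells, lambda c: (c[0], c[2], c[3]))
```

Drilling down is the reverse navigation: the analyst moves from `by_country_year` back to the finer cells that were summed into it.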
Example (3/7): Decision Analysis • Goal: Analyze a bank’s loan decision process w.r.t. two dimensions: Location and Time • Fact table D: Z: Dimensions (Location, Time), X: Predictors (Race, Sex, …), Y: Class (Approval) • Sample rows: (AL, USA; Dec, 04; White; M; …; Yes), …, (WY, USA; Dec, 04; Black; F; …; No) • For each cube subset Z(D), build a model h(X, Z(D)), e.g., a decision tree
Example (3/7): Decision Analysis • Are there branches (and time windows) where approvals were closely tied to sensitive attributes (e.g., race)? • Suppose you partitioned the training data by location and time, chose the partition for a given branch and time window, and built a classifier. You could then ask, “Are the predictions of this classifier closely correlated with race?” • Are there branches and times with decision making reminiscent of 1950s Alabama? • Requires comparison of classifiers trained using different subsets of data.
Example (4/7): Prediction Cubes • Data: [USA, Dec 04](D), e.g., rows (AL, USA; Dec, 04; White; M; …; Y), …, (WY, USA; Dec, 04; Black; F; …; N) • Model: h(X, [USA, Dec 04](D)), e.g., a decision tree • Build a model using data from USA in Dec. 2004 • Evaluate that model • Measure in a cell: • Accuracy of the model • Predictiveness of Race, measured based on that model • Similarity between that model and a given model
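A minimal sketch of how one measure (accuracy) could fill a prediction cube, one model per cell. Everything here is an assumption made for illustration: the fact table is invented, the labeled test set is invented, and `train` is a toy majority-label lookup standing in for a real learner such as a decision tree.

```python
from collections import Counter, defaultdict

def train(rows):
    """Toy stand-in for a decision tree: majority label per predictor tuple,
    falling back to the subset-wide majority label for unseen tuples."""
    by_x, overall = defaultdict(Counter), Counter()
    for x, y in rows:
        by_x[x][y] += 1
        overall[y] += 1
    default = overall.most_common(1)[0][0]
    table = {x: c.most_common(1)[0][0] for x, c in by_x.items()}
    return lambda x: table.get(x, default)

def accuracy(model, labeled_test):
    return sum(model(x) == y for x, y in labeled_test) / len(labeled_test)

# Hypothetical fact table D: (location, time, predictors X, class Y).
D = [
    ("USA", "Dec04", ("White", "M"), "Yes"),
    ("USA", "Dec04", ("White", "F"), "Yes"),
    ("USA", "Dec04", ("Black", "M"), "No"),
    ("USA", "Dec04", ("Black", "F"), "No"),
    ("CA", "Dec04", ("White", "M"), "Yes"),
    ("CA", "Dec04", ("Black", "F"), "Yes"),
]

def cell_subset(D, location, time):
    """The cube subset for one cell: training examples (x, y) from that location and time."""
    return [(x, y) for loc, t, x, y in D if loc == location and t == time]

# Hypothetical labeled test set used to evaluate every cell's model.
test_set = [(("White", "M"), "Yes"), (("Black", "F"), "No")]

# One model per cell; the cell value is that model's accuracy.
cube = {}
for loc, t in {(loc, t) for loc, t, _, _ in D}:
    model = train(cell_subset(D, loc, t))
    cube[(loc, t)] = accuracy(model, test_set)
```

Swapping `accuracy` for another evaluation function yields the other cell measures named on the slide.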
Example (5/7): Model-Similarity • Given: data table D, target model h0(X), test set w/o labels • Level: [Country, Month] • Build a model on each cell's data and compare its predictions on the test set with those of h0(X) • Cell value: Similarity (sample rows, months of 2004 then 2003: CA 0.4 0.2 0.3 0.6 0.5 …; USA 0.2 0.3 0.9 …) • Finding: The loan decision process in USA during Dec 04 was similar to a discriminatory decision model
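The slide does not spell out the similarity measure. One plausible reading, assumed here, is the fraction of unlabeled test examples on which the cell's model and the target model h0 make the same prediction. Both models below are hypothetical hand-written rules.

```python
def similarity(h, h0, test_xs):
    """Fraction of test examples on which the two models agree (no labels needed)."""
    return sum(h(x) == h0(x) for x in test_xs) / len(test_xs)

def h0(x):
    """Hypothetical target model: a discriminatory rule that decides on race alone."""
    race, _sex = x
    return "Yes" if race == "White" else "No"

def h_cell(x):
    """Hypothetical model learned from one cell's data."""
    return "No" if x == ("Black", "M") else "Yes"

test_xs = [("White", "F"), ("White", "M"), ("Black", "M"), ("Black", "F")]
cell_value = similarity(h_cell, h0, test_xs)
```

Because only predictions are compared, the test set needs no labels, which matches the "test set w/o labels" input on the slide.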
Example (6/7): Predictiveness • Given: data table D, attributes V, test set w/o labels • Level: [Country, Month] • Build models h(X) and h(X − V) on each cell's data and compare their predictions on the test set • Cell value: Predictiveness of V • Finding: Race was an important predictor of loan approval decision in USA during Dec 04
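As a rough sketch of predictiveness, assume it is measured as the fraction of test examples on which a model trained with the attributes V disagrees with a model trained without them. That measure, the toy learner `train`, and the data are all assumptions for illustration.

```python
from collections import Counter, defaultdict

def train(rows, keep):
    """Majority-label lookup over the predictor positions in `keep`
    (a toy stand-in for a real learner)."""
    by_x, overall = defaultdict(Counter), Counter()
    for x, y in rows:
        by_x[tuple(x[i] for i in keep)][y] += 1
        overall[y] += 1
    default = overall.most_common(1)[0][0]
    table = {k: c.most_common(1)[0][0] for k, c in by_x.items()}
    return lambda x: table.get(tuple(x[i] for i in keep), default)

def predictiveness(rows, test_xs, all_attrs, v_attrs):
    """Fraction of test examples whose prediction changes when V is withheld."""
    h_full = train(rows, all_attrs)
    h_without_v = train(rows, [i for i in all_attrs if i not in v_attrs])
    return sum(h_full(x) != h_without_v(x) for x in test_xs) / len(test_xs)

# Hypothetical training data for one cell; predictors are (race, sex).
rows = [
    (("White", "M"), "Yes"), (("White", "M"), "Yes"),
    (("White", "F"), "Yes"), (("White", "F"), "Yes"),
    (("Black", "M"), "No"), (("Black", "F"), "No"),
]
test_xs = [("White", "M"), ("White", "F"), ("Black", "M"), ("Black", "F")]

race_score = predictiveness(rows, test_xs, [0, 1], [0])  # V = {Race}
sex_score = predictiveness(rows, test_xs, [0, 1], [1])   # V = {Sex}
```

In this toy cell, withholding race changes half of the predictions while withholding sex changes none, so race is the predictive attribute.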
Example (7/7): Prediction Cube • Cell value: Predictiveness of Race • Roll up, e.g., country by year: CA = 0.3 (04), 0.2 (03); USA = 0.2 (04), 0.3 (03) • Drill down to individual provinces/states by month (AB, …, YT within CA; AL, …, WY within USA), e.g., WY = 0.9, 0.7, 0.8, …
Efficient Computation • Reduce prediction cube computation to data cube computation • Represent a data-mining model as a distributive or algebraic (bottom-up computable) aggregate function, so that data-cube techniques can be directly applied
Bottom-Up Data Cube Computation • Cell values: Numbers of loan applications
Functions on Sets • Bottom-up computable functions: Functions that can be computed using only summary information • Distributive function: f(X) = F({f(X1), …, f(Xn)}) • X = X1 ∪ … ∪ Xn and Xi ∩ Xj = ∅ for i ≠ j • E.g., Count(X) = Sum({Count(X1), …, Count(Xn)}) • Algebraic function: f(X) = F({G(X1), …, G(Xn)}) • G(Xi) returns a fixed-length vector of values • E.g., Avg(X) = F({G(X1), …, G(Xn)}) • G(Xi) = [Sum(Xi), Count(Xi)] • F({[s1, c1], …, [sn, cn]}) = Sum({si}) / Sum({ci})
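The two examples on this slide, Count as a distributive function and Avg as an algebraic one, can be checked with a short sketch. The function names and the partition below are illustrative choices.

```python
def count_from_parts(part_counts):
    """Distributive: Count(X) = Sum of the Count(Xi) over disjoint parts Xi."""
    return sum(part_counts)

def g(xs):
    """G(Xi): a fixed-length summary [Sum(Xi), Count(Xi)]."""
    return (sum(xs), len(xs))

def avg_from_summaries(summaries):
    """Algebraic: Avg(X) = Sum({si}) / Sum({ci}), computed from the summaries only."""
    total = sum(s for s, _ in summaries)
    n = sum(c for _, c in summaries)
    return total / n

# A hypothetical disjoint partition X1, X2, X3 of X.
parts = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]
whole = [x for p in parts for x in p]

count_x = count_from_parts([len(p) for p in parts])
avg_x = avg_from_summaries([g(p) for p in parts])
```

Neither result touches the raw values of X once the per-part summaries exist, which is what makes the function bottom-up computable.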
Scoring Function • Represent a model as a function of sets • Conceptually, a machine-learning model h(X; Z(D)) is a scoring function Score(y, x; Z(D)) that gives each class y a score on test example x • h(x; Z(D)) = argmax_y Score(y, x; Z(D)) • Score(y, x; Z(D)) ≈ p(y | x, Z(D)) • Z(D): The set of training examples (a cube subset of D)
Bottom-up Score Computation • Key observations: • Observation 1: Score(y, x; Z(D)) is a function of cube subset Z(D); if it is distributive or algebraic, the data cube bottom-up technique can be directly applied • Observation 2: Having the scores for all the test examples and all the cells is sufficient to compute a prediction cube • Scores → predictions → cell values • Details depend on what each cell means (i.e., the type of prediction cube), but are straightforward
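A small sketch of the pipeline from scores to predictions to cell values, assuming a distributive score whose combining function is a plain sum over child cells. The base-cell scores, the target model's predictions, and the choice of agreement as the cell measure are hypothetical.

```python
# base_scores[cell][i][y]: score of class y on test example i, computed from
# that base cell's data only (hypothetical numbers).
base_scores = {
    ("WA", "Jan85"): [{"Yes": 2.0, "No": 1.0}, {"Yes": 0.5, "No": 1.5}],
    ("WA", "Feb85"): [{"Yes": 1.0, "No": 3.0}, {"Yes": 0.5, "No": 2.0}],
}

def combine(children):
    """Assumed distributive F: element-wise sum of the children's scores."""
    n_examples = len(children[0])
    return [
        {y: sum(child[i][y] for child in children) for y in children[0][i]}
        for i in range(n_examples)
    ]

def predict(scores):
    """Scores -> predictions: pick the class with the highest score."""
    return [max(s, key=s.get) for s in scores]

# Scores for the parent cell [WA, 85] without revisiting the training data.
wa_85_scores = combine(list(base_scores.values()))
wa_85_predictions = predict(wa_85_scores)

# Predictions -> cell value, here agreement with a hypothetical target model.
h0_predictions = ["No", "Yes"]
cell_value = sum(
    p == q for p, q in zip(wa_85_predictions, h0_predictions)
) / len(h0_predictions)
```

Only the last step depends on the type of prediction cube; the score roll-up is the same for accuracy, similarity, or predictiveness.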
Machine-Learning Models • Naïve Bayes: • Scoring function: algebraic • Kernel-density-based classifier: • Scoring function: distributive • Decision tree, random forest: • Neither distributive, nor algebraic • PBE: Probability-based ensemble (new) • To make any machine-learning model distributive • Approximation
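To see why a Naïve Bayes scoring function can be treated as algebraic, here is a sketch under simple assumptions: categorical attributes, no smoothing, and a summary G made of class counts and per-attribute value counts. The helper names and data are hypothetical.

```python
from collections import Counter

def nb_summary(rows):
    """G(Xi): class counts and (attribute index, value, class) counts for one subset.
    With fixed attribute domains this is a fixed-length vector of counts."""
    class_counts, feat_counts = Counter(), Counter()
    for x, y in rows:
        class_counts[y] += 1
        for i, v in enumerate(x):
            feat_counts[(i, v, y)] += 1
    return class_counts, feat_counts

def merge(summaries):
    """Combine child summaries by adding counts; no raw data is needed."""
    class_counts, feat_counts = Counter(), Counter()
    for cc, fc in summaries:
        class_counts.update(cc)
        feat_counts.update(fc)
    return class_counts, feat_counts

def nb_score(y, x, summary):
    """Unsmoothed Naive Bayes score, proportional to p(y) * prod_i p(x_i | y)."""
    class_counts, feat_counts = summary
    n_y = class_counts[y]
    if n_y == 0:
        return 0.0
    score = n_y / sum(class_counts.values())
    for i, v in enumerate(x):
        score *= feat_counts[(i, v, y)] / n_y
    return score

# Two hypothetical base cells and their union.
jan = [(("White", "M"), "Yes"), (("Black", "M"), "No")]
feb = [(("White", "F"), "Yes"), (("Black", "F"), "No"), (("White", "M"), "No")]

merged = merge([nb_summary(jan), nb_summary(feb)])
direct = nb_summary(jan + feb)
score_yes = nb_score("Yes", ("White", "M"), merged)
```

The merged summary equals the one computed directly on the union, so the score for any coarser cell follows from its children's counts.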
Probability-Based Ensemble • Decision tree on [WA, 85] vs. PBE version of decision tree on [WA, 85] • The PBE version uses decision trees built on the lowest-level cells
Probability-Based Ensemble • Scoring function: • h(y | x; bi(D)): Model h’s estimation of p(y | x, bi(D)) • g(bi | x): A model that predicts the probability that x belongs to base subset bi(D)
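The scoring formula itself did not survive in this transcript. One reading consistent with the two definitions on the slide, assumed here and not confirmed by the source, is a weighted sum over the base subsets in a cell: each base model's estimate h(y | x; bi(D)) weighted by g(bi | x). The base models and weights below are hypothetical constants.

```python
def pbe_score(y, x, base_ids, h, g):
    """Assumed PBE score: sum over base subsets of h(y | x; b_i(D)) * g(b_i | x)."""
    return sum(h[i](y, x) * g(i, x) for i in base_ids)

# Hypothetical base models: class-probability estimates from two lowest-level cells.
h = {
    "Jan85": lambda y, x: 0.9 if y == "Yes" else 0.1,
    "Feb85": lambda y, x: 0.2 if y == "Yes" else 0.8,
}

def g(base_id, x):
    """Hypothetical membership model: probability that x belongs to each base subset."""
    return {"Jan85": 0.25, "Feb85": 0.75}[base_id]

x = ("White", "M")
score_yes = pbe_score("Yes", x, ["Jan85", "Feb85"], h, g)
score_no = pbe_score("No", x, ["Jan85", "Feb85"], h, g)
prediction = "Yes" if score_yes > score_no else "No"

# Under this assumption the score is distributive: the score of a union of
# base subsets is the sum of the scores of its parts.
parts_sum = pbe_score("Yes", x, ["Jan85"], h, g) + pbe_score("Yes", x, ["Feb85"], h, g)
```

A sum of this form is what would let a non-distributive learner such as a decision tree be rolled up from its lowest-level cells.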
Outline • Motivating example • Definition of prediction cubes • Efficient prediction cube materialization • Experimental results • Conclusion
Experiments • Quality of PBE on 8 UCI datasets • The quality of the PBE version of a model is slightly worse (by 0 to 6%) than the quality of the model trained directly on the whole training data • Efficiency of the bottom-up score computation technique • Case study on demographic data
Efficiency of Bottom-up Score Computation • Machine-learning models: • J48: J48 decision tree • RF: Random forest • NB: Naïve Bayes • KDC: Kernel-density-based classifier • Bottom-up method (PBE-J48, PBE-RF, NB, KDC) vs. exhaustive method (J48ex, RFex, NBex, KDCex)
Synthetic Dataset • Dimensions: Z1, Z2 and Z3. • Decision rule: Z1 and Z2 Z3
Efficiency Comparison • Plot: execution time (sec) vs. # of records, using the exhaustive method and using bottom-up score computation
Related Work: Building models on OLAP Results • Multi-dimensional regression [Chen, VLDB 02] • Goal: Detect changes of trends • Build linear regression models for cube cells • Step-by-step regression in stream cubes [Liu, PAKDD 03] • Loglinear-based quasi cubes [Barbara, J. IIS 01] • Use loglinear model to approximately compress dense regions of a data cube • NetCube [Margaritis, VLDB 01] • Build Bayes Net on the entire dataset to approximately answer count queries
Related Work (Contd.) • Cubegrades [Imielinski, J. DMKD 02] • Extend cubes with ideas from association rules • How does the measure change when we rollup or drill down? • Constrained gradients [Dong, VLDB 01] • Find pairs of similar cell characteristics associated with big changes in measure • User-cognizant multidimensional analysis [Sarawagi, VLDBJ 01] • Help users find the most informative unvisited regions in a data cube using max entropy principle • Multi-Structural DBs [Fagin et al., PODS 05, VLDB 05]
Take-Home Messages • Promising exploratory data analysis paradigm: • Can use models to identify interesting subsets • Concentrate only on subsets in cube space • Those are meaningful subsets, tractable • Precompute results and provide the users with an interactive tool • A simple way to plug “something” into cube-style analysis: • Try to describe/approximate “something” by a distributive or algebraic function
Big Picture • Why stop with decision behavior? Can apply to other kinds of analyses too • Why stop at browsing? Can mine prediction cubes in their own right • Exploratory analysis of mining space: • Dimension attributes can be parameters related to algorithm, data conditioning, etc. • Tractable evaluation is a challenge: • Large number of “dimensions”, real-valued dimension attributes, difficulties in compositional evaluation • Active learning for experiment design, extending compositional methods
Community Information Management (CIM) • Anhai Doan, University of Illinois at Urbana-Champaign • Raghu Ramakrishnan, University of Wisconsin-Madison
Structured Web-Queries • Example Queries: • How many alumni are top-10 faculty members? • Wisconsin does very well, by the way • Find trends in publications • By topic, by conference, by alumni of schools • Change tracking • Alert me if my co-authors publish new papers or move to new jobs • Information is extracted from text sources on the web, then queried
Key Ideas • Communities are ideally scoped chunks of the web for which to build enhanced portals • Relative uniformity in content, interests • Can exploit “people power” via mass collaboration, to augment extraction • CIM platform: Facilitate collaborative creation and maintenance of community portals • Extraction management • Uncertainty, provenance, maintenance, compositional inference for refining extracted information • Mass collaboration for extraction and integration • Watch for new DBWorld!
Challenges • User Interaction • Declarative specification of background knowledge and user feedback • Intelligent prompting for user input • Explanation of results
Challenges • Extraction and Query Plans • Starting from user input (ER schema, hints) and background knowledge (e.g., standard types, look-up tables), compile a query into an execution plan • Must cover extraction, storage and indexing, and relational processing • And maintenance! • Algebra to represent such plans? Query optimizer? • Handling uncertainty, constraints, conflicts, multiple related sources, ranking, modular architecture
Challenges • Managing extracted data • Mapping between extracted metadata and source data • Uncertainty of mapping • Conflicts (in user input, background knowledge, or from multiple sources) • Evolution over time