
Bellwether Analysis Predicting Global Aggregates from Local Regions


Presentation Transcript


  1. Bellwether Analysis: Predicting Global Aggregates from Local Regions. Raghu Ramakrishnan (Yahoo! Research and University of Wisconsin—Madison); Bee-Chung Chen, Jude Shavlik, Pradeep Tamma (University of Wisconsin—Madison)

  2. Motivating Example
  • A company wants to predict the first-year worldwide profit of a new item (e.g., a new movie) by using its historical database
  • By looking at the features and profits of previous (similar) movies, we want to predict the expected total profit (total US sales at the end of the release year) for the new movie
  • Wait a year and write a query! If you can’t wait, read this paper
  • The most predictive “features” may be based on sales data gathered by releasing the new movie in many “regions” (different locations over different time periods)
  • Example “region-based” features: 1st-week sales in Peoria, week-to-week sales growth in Wisconsin, etc.
  • Gathering this data has a cost (e.g., marketing expenses, waiting time)
  • Problem statement: Find the most predictive region features that can be obtained within a given “cost budget”

  3. Key Ideas
  • Large datasets are rarely labeled with the targets that we wish to learn to predict
  • But for the tasks we address, we can readily use OLAP queries to generate features (e.g., 1st-week sales in Peoria) and even targets (e.g., profit) for mining
  • We use data-mining models as building blocks in the mining process, rather than thinking of them as the end result
  • The central problem is to find data subsets (“bellwether regions”) that lead to predictive features which can be gathered at low cost for a new case

  4. Outline
  • Motivating example
  • Basic bellwether analysis
  • Subset bellwether analysis: bellwether trees, bellwether cubes
  • Experimental results
  • Conclusion

  5. Motivating Example
  • A company wants to predict the first year’s worldwide profit for a new item, by using its historical database
  • Database schema: [schema diagram omitted from the transcript]
  • The combination of the underlined attributes forms a key

  6. A Straightforward Approach
  • Build a regression model to predict item profit
  • By joining and aggregating tables in the historical database, we can create a training set of item-table features plus the target
  • An example regression model (see the sketch below): Profit = β0 + β1·Laptop + β2·Desktop + β3·RdExpense
  • There is much room for accuracy improvement!
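The model above can be fit with ordinary least squares once the join/aggregate step has produced the training table. A minimal Python sketch; the tiny table is invented for illustration (Laptop and Desktop as category indicators, RdExpense and the profit target in arbitrary units):

```python
# A toy stand-in for the training set produced by joining and aggregating
# the historical database; all numbers here are illustrative.
import numpy as np

# Columns: Laptop, Desktop, RdExpense; target: first-year profit.
X = np.array([[1, 0, 2.5],
              [0, 1, 1.0],
              [1, 0, 4.0],
              [0, 1, 0.5]])
y = np.array([12.0, 5.0, 20.0, 3.0])

# Prepend an intercept column and solve for [b0, b1, b2, b3].
A = np.hstack([np.ones((len(X), 1)), X])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
print(beta)   # b0, b1 (Laptop), b2 (Desktop), b3 (RdExpense)
```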

  7. Using Regional Features
  • Example region: [1st week, Korea]
  • Regional features:
  • Regional Profit: the 1st-week profit in Korea
  • Regional Ad Expense: the 1st-week ad expense in Korea
  • A possibly more accurate model: Profit[1yr, All] = β0 + β1·Laptop + β2·Desktop + β3·RdExpense + β4·Profit[1wk, KR] + β5·AdExpense[1wk, KR]
  • Problem: Which region should we use?
  • The smallest region that improves the accuracy the most
  • We give each candidate region a cost
  • The most “cost-effective” region is the bellwether region

  8. Basic Bellwether Problem

  9. Basic Bellwether Problem (over a location domain hierarchy)
  • Historical database: DB
  • Training item set: I
  • Candidate region set: R
  • E.g., { [1-n week, Location] }
  • Target generation query: τ_i(DB) returns the target value of item i ∈ I
  • E.g., Sum(Profit) for item i over region [1-52, All] of the profit table
  • Feature generation query: φ_{i,r}(DB), for i ∈ I_r and r ∈ R
  • I_r: the set of items in region r
  • E.g., [ Category_i, RdExpense_i, Profit_{i,[1-n, Loc]}, AdExpense_{i,[1-n, Loc]} ]
  • Cost query: κ_r(DB), r ∈ R, the cost of collecting data from region r
  • Predictive model: h_r(x), r ∈ R, trained on { (φ_{i,r}(DB), τ_i(DB)) : i ∈ I_r }
  • E.g., a linear regression model

  10. Basic Bellwether Problem
  • [Figure omitted: features φ_{i,r}(DB) are aggregated over the data records in region r = [1-2, USA]; the target τ_i(DB) is the total profit in [1-52, All]]
  • For each region r, build a predictive model h_r(x); then choose as the bellwether region the r such that:
  • Coverage(r), the fraction of all items in region r, is ≥ the minimum coverage support
  • Cost(r, DB) is ≤ the cost threshold
  • Error(h_r) is minimized
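Putting slides 9 and 10 together, the basic search is a loop over candidate regions with two feasibility checks and an error comparison. A sketch, with the queries passed in as plain callables; the names (feature_q, target_q, cost_q, r.contains) are illustrative, not the authors' API:

```python
def find_bellwether(db, items, regions, budget, min_cov,
                    feature_q, target_q, cost_q, train, error):
    """Return the feasible region r whose model h_r has the lowest error."""
    best_r, best_err = None, float("inf")
    for r in regions:
        if cost_q(db, r) > budget:                     # Cost(r, DB) <= budget
            continue
        items_r = [i for i in items if r.contains(i)]  # I_r
        if len(items_r) < min_cov * len(items):        # Coverage(r) >= support
            continue
        data = [(feature_q(db, i, r), target_q(db, i)) for i in items_r]
        h_r = train(data)                              # e.g., linear regression
        err = error(h_r, data)                         # e.g., held-out RMSE
        if err < best_err:
            best_r, best_err = r, err
    return best_r, best_err
```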

  11. Experiment on a Mail Order Dataset: Error-vs-Budget Plot
  • Bel Err: the error of the bellwether region found using a given budget
  • Avg Err: the average error of all the cube regions with costs under a given budget
  • Smp Err: the error of a set of randomly sampled (non-cube) regions with costs under a given budget
  • [Plot omitted; errors are RMSE (root mean square error), and the bellwether region found is [1-8 month, MD]]

  12. Experiment on a Mail Order Dataset: Uniqueness Plot
  • Y-axis: the fraction of regions that are as good as the bellwether region, i.e., the fraction of regions that satisfy the constraints and have errors within the 99% confidence interval of the error of the bellwether region
  • We have 99% confidence that [1-8 month, MD] is a quite unique bellwether region

  13. Basic Bellwether Computation
  • OLAP-style bellwether analysis
  • Candidate regions: regions in a data cube
  • Queries: OLAP-style aggregate queries, e.g., Sum(Profit) over a region
  • Efficient computation:
  • Use iceberg-cube techniques to prune infeasible regions (Beyer-Ramakrishnan, ICDE 99; Han-Pei-Dong-Wang, SIGMOD 01); infeasible regions are those with cost > B or coverage < C (see the sketch below)
  • Share computation by generating the features and target values for all the feasible regions together
  • Exploit distributive and algebraic aggregate functions
  • Simultaneously generating all the features and target values reduces DB scans and repeated aggregate computation
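A hedged sketch of the pruning step, assuming coverage is anti-monotone under drill-down (a sub-region never covers more items than its parent), so one failed check discards a whole subtree; item_count and children() are illustrative stand-ins for cube metadata:

```python
def feasible_regions(root, min_count, children):
    """Walk the region lattice, pruning whole subtrees below the threshold."""
    stack, seen, feasible = [root], set(), []
    while stack:
        r = stack.pop()
        if r in seen:                  # a lattice cell can be reached twice
            continue
        seen.add(r)
        if r.item_count < min_count:   # coverage too low here, and coverage
            continue                   # only shrinks on drill-down: prune all
        feasible.append(r)
        stack.extend(children(r))      # drill down one level
    return feasible
```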

  14. Subset Bellwether Problem

  15. Subset-Based Bellwether Prediction
  • Motivation: different subsets of items may have different bellwether regions
  • E.g., the bellwether region for laptops may be different from the bellwether region for clothes
  • Two approaches: bellwether cubes and bellwether trees

  16. Bellwether Tree
  • How to build a bellwether tree (see the sketch below):
  • Similar to regression-tree construction
  • Starting from the root node, recursively split the current leaf node using the “best split criterion”; a split criterion partitions a set of items into disjoint subsets
  • Pick the split that reduces the error the most
  • Stop splitting when the number of items in the current leaf node falls under a threshold value
  • Prune the tree to avoid overfitting
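A minimal recursive sketch of the construction just described, under stated assumptions: basic_bellwether(items) runs the basic search on a subset and returns (region, error), and best_split(items) returns a function that partitions the items, or None when no split reduces the error; the post-pruning step is left out for brevity.

```python
MIN_ITEMS = 30   # illustrative leaf-size threshold

def build_bellwether_tree(items, basic_bellwether, best_split):
    region, err = basic_bellwether(items)     # bellwether for this subset
    node = {"region": region, "error": err, "children": []}
    if len(items) < MIN_ITEMS:                # too few items: stop splitting
        return node
    split = best_split(items)                 # e.g., "Category in {Laptop}?"
    if split is None:                         # no criterion reduces the error
        return node
    for subset in split(items):               # disjoint subsets of the items
        node["children"].append(
            build_bellwether_tree(subset, basic_bellwether, best_split))
    return node
```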

  17. Problem of Naïve Tree Construction
  • A naïve bellwether-tree construction algorithm will scan the dataset n·m times, where n is the number of nodes and m is the number of candidate split criteria
  • For each node, it tries all candidate split criteria to find the best one, which needs m scans of the dataset
  • Idea: extend the RainForest framework [Gehrke et al., 98] so that all candidate splits at a node can be evaluated with a single scan (see the sketch below)
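In that spirit, a sketch of the one-scan idea: a single pass over a node's items accumulates, for every candidate split attribute, per-value target statistics (count, sum, sum of squares), which are enough to score each split's SSE reduction afterwards without touching the data again. Names and the dict-per-row format are illustrative.

```python
from collections import defaultdict

def one_scan_split_stats(rows, split_attrs):
    """One pass: per-attribute, per-value (count, sum, sum-of-squares)."""
    stats = {a: defaultdict(lambda: [0, 0.0, 0.0]) for a in split_attrs}
    for row in rows:                   # the single scan of the data
        y = row["target"]
        for a in split_attrs:
            s = stats[a][row[a]]
            s[0] += 1                  # count
            s[1] += y                  # sum of targets
            s[2] += y * y              # sum of squared targets
    return stats

def partition_sse(count, total, total_sq):
    """SSE of predicting the within-partition mean, from the statistics."""
    return total_sq - total * total / count
```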

  18. Bellwether Cube
  • [Figure omitted: a cube over the R&D Expenses and Category dimensions, navigated by rollup and drilldown]
  • The number in a cell is the error of the bellwether region for that subset of items

  19. Problem of Naïve Cube Construction
  • A naïve bellwether-cube construction algorithm will conduct a basic bellwether search for the subset of items in each cell
  • A basic bellwether search involves building a model for each candidate region, so the naïve method builds one model per (cell, candidate region) pair

  20. Efficient Cube Construction
  • Idea: transform model construction into the computation of distributive or algebraic aggregate functions (see the sketch below)
  • Let S1, …, Sn partition S: S = S1 ∪ … ∪ Sn and Si ∩ Sj = ∅ for i ≠ j
  • Distributive function: α(S) = F({α(S1), …, α(Sn)})
  • E.g., Count(S) = Sum({Count(S1), …, Count(Sn)})
  • Algebraic function: α(S) = F({G(S1), …, G(Sn)}), where G(Si) returns a fixed-length vector of values
  • E.g., Avg(S) = F({G(S1), …, G(Sn)}) with G(Si) = [Sum(Si), Count(Si)] and F({[a1, b1], …, [an, bn]}) = Sum({ai}) / Sum({bi})
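The Avg example is small enough to run as-is. A tiny sketch with G producing the fixed-length summary [Sum(Si), Count(Si)] per partition and F combining the summaries:

```python
def G(part):                       # per-partition summary: [Sum(Si), Count(Si)]
    return (sum(part), len(part))

def F(summaries):                  # combine: Sum({ai}) / Sum({bi})
    total = sum(a for a, _ in summaries)
    count = sum(b for _, b in summaries)
    return total / count

S1, S2 = [1, 2, 3], [4, 5]         # S1, S2 partition S
assert F([G(S1), G(S2)]) == sum(S1 + S2) / len(S1 + S2)   # == Avg(S)
```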

  21. Efficient Cube Construction
  • For each finest-grained cell, build models to find the bellwether region
  • For each higher-level cell, use data-cube computation techniques to compute the aggregate functions and find the bellwether region

  22. Efficient Cube Construction
  • Classification models: use the prediction cube [Chen et al., 05] execution framework
  • Regression models (weighted linear regression; builds on work in Chen-Dong-Han-Wah-Wang, VLDB 02):
  • Having the sum of squared errors (SSE) for each candidate region is sufficient to find the bellwether region
  • SSE(S) is an algebraic function, where S is a set of items and S1, …, Sn partition S: SSE(S) = q({ g(Sk) : k = 1, …, n })
  • g(Sk) = ⟨ YkᵀWkYk, XkᵀWkXk, XkᵀWkYk ⟩
  • q({⟨Ak, Bk, Ck⟩ : k = 1, …, n}) = Σk Ak − (Σk Ck)ᵀ (Σk Bk)⁻¹ (Σk Ck)
  • Here Yk is the vector of target values for set Sk of items, Xk is the matrix of features for Sk, and Wk is the weight matrix for Sk
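A numpy sketch of this decomposition, assuming diagonal weights: each partition contributes g(Sk) = (YkᵀWkYk, XkᵀWkXk, XkᵀWkYk), and q sums the pieces and evaluates Σk Ak − (Σk Ck)ᵀ(Σk Bk)⁻¹(Σk Ck), which equals the SSE of the weighted regression fit on all of S; the random data is illustrative.

```python
import numpy as np

def g(X, y, w):
    """Per-partition sufficient statistics (Ak, Bk, Ck)."""
    W = np.diag(w)
    return y @ W @ y, X.T @ W @ X, X.T @ W @ y

def sse_from_parts(parts):
    """q: combine per-partition statistics into the SSE over all of S."""
    A = sum(p[0] for p in parts)            # sum_k Ak
    B = sum(p[1] for p in parts)            # sum_k Bk
    C = sum(p[2] for p in parts)            # sum_k Ck
    return A - C @ np.linalg.solve(B, C)    # A - C' B^{-1} C

rng = np.random.default_rng(0)
X, y, w = rng.normal(size=(8, 3)), rng.normal(size=8), np.ones(8)
parts = [g(X[:5], y[:5], w[:5]), g(X[5:], y[5:], w[5:])]
print(sse_from_parts(parts))   # matches the SSE of one fit on all the rows
```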

  23. Experimental Results

  24. Experimental Results: Summary
  • We have shown the existence of bellwether regions on a real mail-order dataset
  • We characterize the behavior of bellwether trees and bellwether cubes using synthetic datasets
  • We show that our computation techniques improve efficiency by orders of magnitude
  • We show that our computation techniques scale linearly in the size of the dataset

  25. Characteristics of Bellwether Trees & Cubes
  • Results:
  • Bellwether trees & cubes have better accuracy than basic bellwether search
  • Increasing noise → increasing error; increasing concept complexity → increasing error
  • Dataset generation: use a random tree to generate different bellwether regions for different subsets of items
  • Parameters: noise level; concept complexity (the number of tree nodes)
  • [Plots omitted; the settings shown are 15 tree nodes and noise level 0.5]

  26. Efficiency Comparison
  • [Plots omitted, comparing naïve computation methods against our computation techniques]

  27. Scalability

  28. Conclusion
  • [Diagram omitted: from the database, through subset selection and multi-dimensional view aggregation, to data mining]
  • A promising data-mining paradigm:
  • Use OLAP queries to generate features and even targets for mining
  • Use data-mining models as building blocks in the mining process, rather than thinking of them as the end result
  • Exploit the nested structure of OLAP queries to achieve efficient computation
