
Bellwether Analysis Predicting Global Aggregates from Local Regions


Presentation Transcript


  1. Bellwether Analysis: Predicting Global Aggregates from Local Regions. Raghu Ramakrishnan (Yahoo! Research and University of Wisconsin—Madison); Bee-Chung Chen, Jude Shavlik, Pradeep Tamma (University of Wisconsin—Madison)

  2. Motivating Example
  • A company wants to predict the first-year worldwide profit of a new item (e.g., a new movie) by using its historical database
  • By looking at the features and profits of previous (similar) movies, we want to predict the expected total profit (total US sales at the end of the release year) for the new movie
  • Wait a year and write a query! If you can’t wait, read this paper
  • The most predictive “features” may be based on sales data gathered by releasing the new movie in many “regions” (different locations over different time periods)
  • Example “region-based” features: 1st-week sales in Peoria, week-to-week sales growth in Wisconsin, etc.
  • Gathering this data has a cost (e.g., marketing expenses, waiting time)
  • Problem statement: Find the most predictive region features that can be obtained within a given “cost budget”

  3. Key Ideas
  • Large datasets are rarely labeled with the targets that we wish to learn to predict
  • But for the tasks we address, we can readily use OLAP queries to generate features (e.g., 1st-week sales in Peoria) and even targets (e.g., profit) for mining
  • We use data-mining models as building blocks in the mining process, rather than thinking of them as the end result
  • The central problem is to find data subsets (“bellwether regions”) that lead to predictive features which can be gathered at low cost for a new case

  4. Outline
  • Motivating example
  • Basic bellwether analysis
  • Subset bellwether analysis: bellwether trees, bellwether cubes
  • Experimental results
  • Conclusion

  5. Motivating Example
  • A company wants to predict the first year’s worldwide profit for a new item, by using its historical database
  • Database schema: [schema diagram omitted from the transcript]
  • The combination of the underlined attributes forms a key

  6. A Straightforward Approach
  • Build a regression model to predict item profit
  • By joining and aggregating tables in the historical database, we can create a training set of item-table features plus the target
  • An example regression model (see the sketch below): Profit = β0 + β1·Laptop + β2·Desktop + β3·RdExpense
  • There is much room for accuracy improvement!
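The model above can be fit with ordinary least squares once the join/aggregate step has produced the training table. A minimal Python sketch; the tiny table is invented for illustration (Laptop and Desktop as category indicators, RdExpense and the profit target in arbitrary units):

```python
# A toy stand-in for the training set produced by joining and aggregating
# the historical database; all numbers here are illustrative.
import numpy as np

# Columns: Laptop, Desktop, RdExpense; target: first-year profit.
X = np.array([[1, 0, 2.5],
              [0, 1, 1.0],
              [1, 0, 4.0],
              [0, 1, 0.5]])
y = np.array([12.0, 5.0, 20.0, 3.0])

# Prepend an intercept column and solve for [b0, b1, b2, b3].
A = np.hstack([np.ones((len(X), 1)), X])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
print(beta)   # b0, b1 (Laptop), b2 (Desktop), b3 (RdExpense)
```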

  7. Using Regional Features
  • Example region: [1st week, Korea]
  • Regional features:
  • Regional Profit: the 1st-week profit in Korea
  • Regional Ad Expense: the 1st-week ad expense in Korea
  • A possibly more accurate model: Profit[1yr, All] = β0 + β1·Laptop + β2·Desktop + β3·RdExpense + β4·Profit[1wk, KR] + β5·AdExpense[1wk, KR]
  • Problem: Which region should we use?
  • The smallest region that improves the accuracy the most
  • We give each candidate region a cost
  • The most “cost-effective” region is the bellwether region

  8. Basic Bellwether Problem

  9. Basic Bellwether Problem (over a location domain hierarchy)
  • Historical database: DB
  • Training item set: I
  • Candidate region set: R
  • E.g., { [1-n week, Location] }
  • Target generation query: τ_i(DB) returns the target value of item i ∈ I
  • E.g., Sum(Profit) for item i over region [1-52, All] of the profit table
  • Feature generation query: φ_{i,r}(DB), for i ∈ I_r and r ∈ R
  • I_r: the set of items in region r
  • E.g., [ Category_i, RdExpense_i, Profit_{i,[1-n, Loc]}, AdExpense_{i,[1-n, Loc]} ]
  • Cost query: κ_r(DB), r ∈ R, the cost of collecting data from region r
  • Predictive model: h_r(x), r ∈ R, trained on { (φ_{i,r}(DB), τ_i(DB)) : i ∈ I_r }
  • E.g., a linear regression model

  10. Basic Bellwether Problem
  • [Figure omitted: features φ_{i,r}(DB) are aggregated over the data records in region r = [1-2, USA]; the target τ_i(DB) is the total profit in [1-52, All]]
  • For each region r, build a predictive model h_r(x); then choose as the bellwether region the r such that:
  • Coverage(r), the fraction of all items in region r, is ≥ the minimum coverage support
  • Cost(r, DB) is ≤ the cost threshold
  • Error(h_r) is minimized
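Putting slides 9 and 10 together, the basic search is a loop over candidate regions with two feasibility checks and an error comparison. A sketch, with the queries passed in as plain callables; the names (feature_q, target_q, cost_q, r.contains) are illustrative, not the authors' API:

```python
def find_bellwether(db, items, regions, budget, min_cov,
                    feature_q, target_q, cost_q, train, error):
    """Return the feasible region r whose model h_r has the lowest error."""
    best_r, best_err = None, float("inf")
    for r in regions:
        if cost_q(db, r) > budget:                     # Cost(r, DB) <= budget
            continue
        items_r = [i for i in items if r.contains(i)]  # I_r
        if len(items_r) < min_cov * len(items):        # Coverage(r) >= support
            continue
        data = [(feature_q(db, i, r), target_q(db, i)) for i in items_r]
        h_r = train(data)                              # e.g., linear regression
        err = error(h_r, data)                         # e.g., held-out RMSE
        if err < best_err:
            best_r, best_err = r, err
    return best_r, best_err
```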

  11. Experiment on a Mail Order Dataset: Error-vs-Budget Plot
  • Bel Err: the error of the bellwether region found using a given budget
  • Avg Err: the average error of all the cube regions with costs under a given budget
  • Smp Err: the error of a set of randomly sampled (non-cube) regions with costs under a given budget
  • [Plot omitted; errors are RMSE (root mean square error), and the bellwether region found is [1-8 month, MD]]

  12. Experiment on a Mail Order Dataset: Uniqueness Plot
  • Y-axis: the fraction of regions that are as good as the bellwether region, i.e., the fraction of regions that satisfy the constraints and have errors within the 99% confidence interval of the error of the bellwether region
  • We have 99% confidence that [1-8 month, MD] is a quite unique bellwether region

  13. Basic Bellwether Computation
  • OLAP-style bellwether analysis
  • Candidate regions: regions in a data cube
  • Queries: OLAP-style aggregate queries, e.g., Sum(Profit) over a region
  • Efficient computation:
  • Use iceberg-cube techniques to prune infeasible regions (Beyer-Ramakrishnan, ICDE 99; Han-Pei-Dong-Wang, SIGMOD 01); infeasible regions are those with cost > B or coverage < C (see the sketch below)
  • Share computation by generating the features and target values for all the feasible regions together
  • Exploit distributive and algebraic aggregate functions
  • Simultaneously generating all the features and target values reduces DB scans and repeated aggregate computation
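A hedged sketch of the pruning step, assuming coverage is anti-monotone under drill-down (a sub-region never covers more items than its parent), so one failed check discards a whole subtree; item_count and children() are illustrative stand-ins for cube metadata:

```python
def feasible_regions(root, min_count, children):
    """Walk the region lattice, pruning whole subtrees below the threshold."""
    stack, seen, feasible = [root], set(), []
    while stack:
        r = stack.pop()
        if r in seen:                  # a lattice cell can be reached twice
            continue
        seen.add(r)
        if r.item_count < min_count:   # coverage too low here, and coverage
            continue                   # only shrinks on drill-down: prune all
        feasible.append(r)
        stack.extend(children(r))      # drill down one level
    return feasible
```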

  14. Subset Bellwether Problem

  15. Subset-Based Bellwether Prediction
  • Motivation: different subsets of items may have different bellwether regions
  • E.g., the bellwether region for laptops may be different from the bellwether region for clothes
  • Two approaches: bellwether cubes and bellwether trees

  16. Bellwether Tree
  • How to build a bellwether tree (see the sketch below):
  • Similar to regression-tree construction
  • Starting from the root node, recursively split the current leaf node using the “best split criterion”; a split criterion partitions a set of items into disjoint subsets
  • Pick the split that reduces the error the most
  • Stop splitting when the number of items in the current leaf node falls under a threshold value
  • Prune the tree to avoid overfitting
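A minimal recursive sketch of the construction just described, under stated assumptions: basic_bellwether(items) runs the basic search on a subset and returns (region, error), and best_split(items) returns a function that partitions the items, or None when no split reduces the error; the post-pruning step is left out for brevity.

```python
MIN_ITEMS = 30   # illustrative leaf-size threshold

def build_bellwether_tree(items, basic_bellwether, best_split):
    region, err = basic_bellwether(items)     # bellwether for this subset
    node = {"region": region, "error": err, "children": []}
    if len(items) < MIN_ITEMS:                # too few items: stop splitting
        return node
    split = best_split(items)                 # e.g., "Category in {Laptop}?"
    if split is None:                         # no criterion reduces the error
        return node
    for subset in split(items):               # disjoint subsets of the items
        node["children"].append(
            build_bellwether_tree(subset, basic_bellwether, best_split))
    return node
```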

  17. Problem of Naïve Tree Construction
  • A naïve bellwether-tree construction algorithm will scan the dataset n·m times, where n is the number of nodes and m is the number of candidate split criteria
  • For each node, it tries all candidate split criteria to find the best one, which needs m scans of the dataset
  • Idea: extend the RainForest framework [Gehrke et al., 98] so that all candidate splits at a node can be evaluated with a single scan (see the sketch below)
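In that spirit, a sketch of the one-scan idea: a single pass over a node's items accumulates, for every candidate split attribute, per-value target statistics (count, sum, sum of squares), which are enough to score each split's SSE reduction afterwards without touching the data again. Names and the dict-per-row format are illustrative.

```python
from collections import defaultdict

def one_scan_split_stats(rows, split_attrs):
    """One pass: per-attribute, per-value (count, sum, sum-of-squares)."""
    stats = {a: defaultdict(lambda: [0, 0.0, 0.0]) for a in split_attrs}
    for row in rows:                   # the single scan of the data
        y = row["target"]
        for a in split_attrs:
            s = stats[a][row[a]]
            s[0] += 1                  # count
            s[1] += y                  # sum of targets
            s[2] += y * y              # sum of squared targets
    return stats

def partition_sse(count, total, total_sq):
    """SSE of predicting the within-partition mean, from the statistics."""
    return total_sq - total * total / count
```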

  18. Bellwether Cube
  • [Figure omitted: a cube over the R&D Expenses and Category dimensions, navigated by rollup and drilldown]
  • The number in a cell is the error of the bellwether region for that subset of items

  19. Problem of Naïve Cube Construction
  • A naïve bellwether-cube construction algorithm will conduct a basic bellwether search for the subset of items in each cell
  • A basic bellwether search involves building a model for each candidate region, so the naïve method builds one model per (cell, candidate region) pair

  20. Efficient Cube Construction
  • Idea: transform model construction into the computation of distributive or algebraic aggregate functions (see the sketch below)
  • Let S1, …, Sn partition S: S = S1 ∪ … ∪ Sn and Si ∩ Sj = ∅ for i ≠ j
  • Distributive function: α(S) = F({α(S1), …, α(Sn)})
  • E.g., Count(S) = Sum({Count(S1), …, Count(Sn)})
  • Algebraic function: α(S) = F({G(S1), …, G(Sn)}), where G(Si) returns a fixed-length vector of values
  • E.g., Avg(S) = F({G(S1), …, G(Sn)}) with G(Si) = [Sum(Si), Count(Si)] and F({[a1, b1], …, [an, bn]}) = Sum({ai}) / Sum({bi})
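The Avg example is small enough to run as-is. A tiny sketch with G producing the fixed-length summary [Sum(Si), Count(Si)] per partition and F combining the summaries:

```python
def G(part):                       # per-partition summary: [Sum(Si), Count(Si)]
    return (sum(part), len(part))

def F(summaries):                  # combine: Sum({ai}) / Sum({bi})
    total = sum(a for a, _ in summaries)
    count = sum(b for _, b in summaries)
    return total / count

S1, S2 = [1, 2, 3], [4, 5]         # S1, S2 partition S
assert F([G(S1), G(S2)]) == sum(S1 + S2) / len(S1 + S2)   # == Avg(S)
```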

  21. Efficient Cube Construction
  • For each finest-grained cell, build models to find the bellwether region
  • For each higher-level cell, use data-cube computation techniques to compute the aggregate functions and find the bellwether region

  22. Efficient Cube Construction
  • Classification models: use the prediction cube [Chen et al., 05] execution framework
  • Regression models (weighted linear regression; builds on work in Chen-Dong-Han-Wah-Wang, VLDB 02):
  • Having the sum of squared errors (SSE) for each candidate region is sufficient to find the bellwether region
  • SSE(S) is an algebraic function, where S is a set of items and S1, …, Sn partition S: SSE(S) = q({ g(Sk) : k = 1, …, n })
  • g(Sk) = ⟨ YkᵀWkYk, XkᵀWkXk, XkᵀWkYk ⟩
  • q({⟨Ak, Bk, Ck⟩ : k = 1, …, n}) = Σk Ak − (Σk Ck)ᵀ (Σk Bk)⁻¹ (Σk Ck)
  • Here Yk is the vector of target values for set Sk of items, Xk is the matrix of features for Sk, and Wk is the weight matrix for Sk
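A numpy sketch of this decomposition, assuming diagonal weights: each partition contributes g(Sk) = (YkᵀWkYk, XkᵀWkXk, XkᵀWkYk), and q sums the pieces and evaluates Σk Ak − (Σk Ck)ᵀ(Σk Bk)⁻¹(Σk Ck), which equals the SSE of the weighted regression fit on all of S; the random data is illustrative.

```python
import numpy as np

def g(X, y, w):
    """Per-partition sufficient statistics (Ak, Bk, Ck)."""
    W = np.diag(w)
    return y @ W @ y, X.T @ W @ X, X.T @ W @ y

def sse_from_parts(parts):
    """q: combine per-partition statistics into the SSE over all of S."""
    A = sum(p[0] for p in parts)            # sum_k Ak
    B = sum(p[1] for p in parts)            # sum_k Bk
    C = sum(p[2] for p in parts)            # sum_k Ck
    return A - C @ np.linalg.solve(B, C)    # A - C' B^{-1} C

rng = np.random.default_rng(0)
X, y, w = rng.normal(size=(8, 3)), rng.normal(size=8), np.ones(8)
parts = [g(X[:5], y[:5], w[:5]), g(X[5:], y[5:], w[5:])]
print(sse_from_parts(parts))   # matches the SSE of one fit on all the rows
```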

  23. Experimental Results

  24. Experimental Results: Summary
  • We have shown the existence of bellwether regions on a real mail-order dataset
  • We characterize the behavior of bellwether trees and bellwether cubes using synthetic datasets
  • We show that our computation techniques improve efficiency by orders of magnitude
  • We show that our computation techniques scale linearly in the size of the dataset

  25. Characteristics of Bellwether Trees & Cubes
  • Results:
  • Bellwether trees & cubes have better accuracy than basic bellwether search
  • Increasing noise → increasing error; increasing concept complexity → increasing error
  • Dataset generation: use a random tree to generate different bellwether regions for different subsets of items
  • Parameters: noise level; concept complexity (the number of tree nodes)
  • [Plots omitted; the settings shown are 15 tree nodes and noise level 0.5]

  26. Efficiency Comparison
  • [Plots omitted, comparing naïve computation methods against our computation techniques]

  27. Scalability

  28. Conclusion
  • [Diagram omitted: from the database, through subset selection and multi-dimensional view aggregation, to data mining]
  • A promising data-mining paradigm:
  • Use OLAP queries to generate features and even targets for mining
  • Use data-mining models as building blocks in the mining process, rather than thinking of them as the end result
  • Exploit the nested structure of OLAP queries to achieve efficient computation
