Efficient Allocation Algorithms For OLAP Over Imprecise Data

Efficient Allocation Algorithms For OLAP Over Imprecise Data. Doug Burdick University of Wisconsin – Madison. Prasad Deshpande IBM India Research Lab, SIRC. T.S. Jayram IBM Almaden Research Center. Raghu Ramakrishnan Yahoo! Research. Shivakumar Vaithyanathan IBM Almaden Research Center.

Presentation Transcript


  1. Efficient Allocation Algorithms For OLAP Over Imprecise Data • Doug Burdick (University of Wisconsin – Madison) • Prasad Deshpande (IBM India Research Lab, SIRC) • T.S. Jayram (IBM Almaden Research Center) • Raghu Ramakrishnan (Yahoo! Research) • Shivakumar Vaithyanathan (IBM Almaden Research Center)

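  2. Imprecise Data • Multidimensional Data [Figure: two dimension hierarchies. AUTOMOBILE: level 1 Model (Civic, Camry, F150, Sierra), level 2 Category (Sedan, Truck), level 3 ALL. LOCATION: level 1 State (MA, NY, TX, CA), level 2 Region (East, West), level 3 ALL. Facts p1 to p5 are plotted in the East region (MA and NY).] • [BDJ+05] Burdick et al. OLAP Over Uncertain and Imprecise Data. In VLDB 2005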

  3. Sources of Imprecision • Dimensions extracted from free text • Assume given extractor for Auto dimension values

  5. Sources of Imprecision • Dimensions extracted from free text • Assume given extractor for Auto dimension values • More details for dimensions extracted from text in [BDJ+06] Burdick et al. OLAP Over Uncertain and Imprecise Data. To appear in VLDB Journal

  9. Sources of Imprecision • Data Integration • Fact table constructed by integrating multiple data sources • Different sources record same dimension attribute at different granularities [Figure: AUTOMOBILE hierarchy (level 3 ALL; level 2 Category: Truck, Sedan; level 1 Model: Civic, Camry, F150, Sierra) annotated with two sources, Mailing List and Call Center.]

  10. Imprecision In Real Data • Obtained real-world dataset from auto manufacturer • Fact table entries from several source relations • Integrated fact table contained 798,570 facts • Real data has many imprecise facts

  11. Querying Imprecise Facts • Query: Auto = F150, Loc = MA, SUM(Repair) = ??? [Figure: grid of Truck models (F150, Sierra) by East states (MA, NY) holding facts p1 to p5.]

  12. Solution: Allocation • Intuitively: Replace each imprecise fact r with set of precise facts, one for each possible completion of r • Each completion is assigned an allocation weight • Refer to the resulting fact table as the Extended Database (EDB) • Queries operate over this Extended Database
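
The replacement described on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the `CHILDREN` hierarchy, the `repair` measure values, and the `uniform_policy` function are assumptions made for the example, and the special value ALL is not handled.

```python
from itertools import product

# Hypothetical hierarchies named after the slides' running example.
CHILDREN = {
    "Truck": ["F150", "Sierra"],
    "Sedan": ["Civic", "Camry"],
    "East": ["MA", "NY"],
    "West": ["TX", "CA"],
}


def leaves(value):
    """Return the leaf-level values covered by a dimension value."""
    if value not in CHILDREN:
        return [value]
    out = []
    for child in CHILDREN[value]:
        out.extend(leaves(child))
    return out


def completions(fact):
    """All precise (location, auto) cells an imprecise fact could stand for."""
    return list(product(leaves(fact["loc"]), leaves(fact["auto"])))


def uniform_policy(fact, cells):
    """Assumed placeholder policy: split the fact evenly over its completions."""
    return {cell: 1.0 / len(cells) for cell in cells}


def extend(facts, policy):
    """Build the Extended Database: one weighted row per possible completion."""
    edb = []
    for fact in facts:
        cells = completions(fact)
        weights = policy(fact, cells)
        for cell in cells:
            edb.append({"id": fact["id"], "cell": cell,
                        "measure": fact["repair"], "weight": weights[cell]})
    return edb


# Illustrative facts; the repair amounts are made up.
facts = [
    {"id": "p3", "loc": "MA", "auto": "F150", "repair": 100},
    {"id": "p5", "loc": "MA", "auto": "Truck", "repair": 100},
]
edb = extend(facts, uniform_policy)
```

A precise fact such as p3 has a single completion and keeps weight 1, while the imprecise fact p5 becomes two rows whose weights sum to 1. Any other policy with the same signature could be passed in place of `uniform_policy`.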

  13. Handle Imprecision With Allocation [Figure: the same grid of Truck models (F150, Sierra) by East states (MA, NY), with fact p5 now appearing twice in the MA row.]

  14. Querying The Extended Database • Query: Auto = F150, Loc = MA, SUM(Repair) = ??? [Figure: the grid after allocation, with p5 appearing twice in the MA row.]

  15. Querying The Extended Database • Query: Auto = F150, Loc = MA, SUM(Repair) = 150 • Procedure for assigning allocation weights is referred to as an allocation policy [Figure: the grid after allocation, with p5 appearing twice in the MA row.]
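
As a rough illustration of how such a query might be answered, the sketch below assumes that SUM over the Extended Database scales each measure by its allocation weight. The rows, measure values, and weights are invented for the example and do not reproduce the figure on the slide.

```python
def weighted_sum(edb, loc, auto):
    """SUM(measure) over one cell, scaling each row by its allocation weight."""
    return sum(row["measure"] * row["weight"]
               for row in edb
               if row["cell"] == (loc, auto))


# Hypothetical Extended Database rows: p3 is precise, p5 was split in two.
edb = [
    {"id": "p3", "cell": ("MA", "F150"), "measure": 100.0, "weight": 1.0},
    {"id": "p5", "cell": ("MA", "F150"), "measure": 80.0, "weight": 0.5},
    {"id": "p5", "cell": ("MA", "Sierra"), "measure": 80.0, "weight": 0.5},
]

result = weighted_sum(edb, "MA", "F150")  # 100.0 * 1.0 + 80.0 * 0.5
```

With these made-up rows, p3 contributes its full measure and p5 contributes half of its measure to the (MA, F150) cell.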

  16. Contributions • Propose generalized template for allocation policies presented in [BDJ+05] • Present operational framework for allocation • Allocation graph formalism • Used to derive Independent, Block, Transitive Algorithms • Propose Extended Database Maintenance Algorithm • Update EDB to reflect changes to given fact table • Experimental Evaluation

  17. Allocation Policy Template [Figure: imprecise fact r and cells c1, c2 shown on the grid of Truck models (F150, Sierra) by East states (MA, NY).]

  18. Interactions between overlapping facts • Allocation weights for imprecise fact p6 depend on allocation weights for fact p7 (and vice-versa) • Would like assigned weights to capture these interactions • Idea: Repeatedly allocate p6 and p7 until allocation weights converge [Figure: grid of Truck models (F150, Sierra) by East states (MA, NY) with facts p1, p2, p4, p5 and overlapping imprecise facts p6, p7.]

  19. Iterative Allocation Policies
      1) Initialize each Q0(c) in cell c (using precise facts)
      2) For each iteration t, until all Qt(c) have converged:
         • For each imprecise fact r
         • For each cell c
            • For each imprecise fact r overlapping c
      3) For each imprecise fact r
         • For each cell c in region(r)
      [The update formula attached to each loop appeared as an image on the slide and is not preserved.]
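
The loop structure above can be sketched as follows. Because the slide's formulas are not preserved, the update rule here is an assumption: each cell quantity is reset to its precise count plus the share it receives from every overlapping imprecise fact, and shares are proportional to the previous iteration's quantities. The names `precise_counts`, `regions`, `max_iters`, and `tol` are hypothetical.

```python
def iterative_allocation(precise_counts, regions, max_iters=100, tol=1e-9):
    """EM-style allocation sketch.

    precise_counts: cell -> number of precise facts in that cell (Q0).
    regions: imprecise fact id -> list of cells the fact overlaps.
    Returns fact id -> {cell: allocation weight}.
    """
    # Step 1: initialise every cell quantity from the precise facts.
    q0 = {c: float(n) for c, n in precise_counts.items()}
    for cells in regions.values():
        for c in cells:
            q0.setdefault(c, 0.0)
    q = dict(q0)

    # Step 2: iterate until the cell quantities stop changing.
    for _ in range(max_iters):
        # Per-fact normaliser: total quantity over the fact's region.
        gamma = {r: sum(q[c] for c in cells) for r, cells in regions.items()}
        new_q = dict(q0)
        for r, cells in regions.items():
            for c in cells:
                if gamma[r] > 0:
                    new_q[c] += q[c] / gamma[r]
                else:
                    # Assumed fallback: no evidence, spread the fact evenly.
                    new_q[c] += 1.0 / len(cells)
        change = max(abs(new_q[c] - q[c]) for c in q)
        q = new_q
        if change < tol:
            break

    # Step 3: turn the converged quantities into allocation weights.
    weights = {}
    for r, cells in regions.items():
        total = sum(q[c] for c in cells)
        weights[r] = {c: (q[c] / total if total > 0 else 1.0 / len(cells))
                      for c in cells}
    return weights


weights = iterative_allocation(
    {("MA", "F150"): 2, ("MA", "Sierra"): 1},
    {"p5": [("MA", "F150"), ("MA", "Sierra")]},
)
```

With two precise facts in (MA, F150) and one in (MA, Sierra), the single imprecise fact p5 ends up with weights close to 2/3 and 1/3, and the weights of every fact sum to 1.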

  20. Benefits of Iterative Allocation • Imprecise facts can be allocated in any order and same allocation weights are obtained • Leverage this idea to obtain scalable allocation algorithms • Leads to Expectation Maximization (EM) framework for allocation • Final allocation weights have pleasing mathematical properties • See [BDJ+05] for details

  21. Allocation Graph [Figure: the fact grid (Truck models F150, Sierra by East states MA, NY) shown next to its allocation graph. Precise Cells: Cell(NY,F150), Cell(NY,Sierra), Cell(MA,F150), Cell(MA,Sierra). Imprecise Facts: <MA,Truck>.]

  22. Processing With Allocation Graph • Initialize each Q0(c) in cell c [Figure: the same allocation graph annotated with values: 2 and 2/3 at Cell(MA,F150), 1 and 1/3 at Cell(MA,Sierra), and 3 at the imprecise fact <MA,Truck>.]
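
A small sketch of how an allocation graph could be built and initialised, under stated assumptions: the `PARENT` hierarchy map and the helper names are hypothetical, and the NY counts are made up. Only the MA counts (2 and 1) are taken from the values visible on the slide.

```python
from fractions import Fraction

# Hypothetical child -> parent map for both dimensions.
PARENT = {
    "F150": "Truck", "Sierra": "Truck", "Civic": "Sedan", "Camry": "Sedan",
    "MA": "East", "NY": "East", "TX": "West", "CA": "West",
}


def covers(value, leaf):
    """True if `value` is ALL, equals `leaf`, or is an ancestor of `leaf`."""
    if value == "ALL":
        return True
    while leaf is not None:
        if leaf == value:
            return True
        leaf = PARENT.get(leaf)
    return False


def build_graph(cells, imprecise):
    """Bipartite graph: imprecise fact id -> list of precise cells it overlaps."""
    return {fid: [c for c in cells if covers(loc, c[0]) and covers(auto, c[1])]
            for fid, (loc, auto) in imprecise.items()}


def initial_weights(graph, counts):
    """One normalisation pass: weight of a cell = its count / total over the fact."""
    out = {}
    for fid, cells in graph.items():
        total = sum(counts[c] for c in cells)
        if total == 0:
            raise ValueError("no precise facts overlap " + fid)
        out[fid] = {c: Fraction(counts[c], total) for c in cells}
    return out


# Precise-fact counts per cell; the NY values are illustrative only.
counts = {("NY", "F150"): 1, ("NY", "Sierra"): 1,
          ("MA", "F150"): 2, ("MA", "Sierra"): 1}
graph = build_graph(list(counts), {"<MA,Truck>": ("MA", "Truck")})
weights = initial_weights(graph, counts)
```

The fact <MA,Truck> is linked only to the two MA cells, and the counts 2 and 1 yield the weights 2/3 and 1/3 after normalising by their total of 3.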

  23. Efficient Allocation Algorithms • Independent Algorithm • Requires multiple sorts of precise cells for each iteration • Optimizations based on re-using each sort as much as possible • Block Algorithm • Reduces the number of required sorts for precise cells to 1 • Optimizations based on increasing buffer utilization

  24. [Figure: facts grouped by the dimension levels at which they are recorded] • S1: <State,Category>: p6 <MA,Sedan>, p7 <MA,Truck> • S2: <State,ALL>: p8 <CA,ALL> • S3: <Region,Category>: p9 <East,Truck>, p10 <West,Sedan> • S4: <ALL,Model>: p11 <ALL,Civic>, p12 <ALL,Sierra> • S5: <Region,Model>: p13 <West,Civic>, p14 <West,Sierra> • Precise facts: p1 <MA,Civic>, p2 <MA,Sierra>, p3 <NY,F150>, p4 <CA,Civic>, p5 <CA,Sierra>

  25. Iteration aware allocation • Optimizations for Independent and Block reduce work for single iteration • Problem: Each iteration of allocation is still expensive • Involves multiple scans of entire fact table • Not feasible for real data warehouses! • Can we do better?

  26. Required Data For Allocating A Fact [Figure: allocation graph linking imprecise facts p6 <MA,Sedan>, p7 <MA,Truck>, p8 <CA,ALL>, p9 <East,Truck>, p10 <West,Sedan>, p11 <ALL,Civic>, p12 <ALL,Sierra>, p13 <West,Civic>, p14 <West,Sierra> to cells c1 <MA,Civic>, c2 <MA,Sierra>, c3 <NY,F150>, c4 <CA,Civic>, c5 <CA,Sierra>.]

  27. Required Data For Allocating A Fact • Connected components in allocation graph can be processed independently [Figure: the same allocation graph over imprecise facts p6 to p14 and cells c1 to c5, with its nodes rearranged to show connected components.]

  28. Transitive Algorithm • Transitive Algorithm has two steps: • 1) Connected component identification step • 2) Process each connected component • Read component into memory • Perform all iterations of allocation for facts in component • If each component fits into memory, then the number of I/O operations required by Transitive is independent of the number of iterations! • Components larger than the buffer are processed using the Block algorithm • In real datasets, all components were memory resident • Use concepts from Transitive Algorithm to develop EDB Maintenance Algorithm
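
The component identification step could look like the sketch below, which uses an in-memory union-find structure. That choice is an assumption made for illustration; the slides do not say how components are identified, and a disk-based variant would be needed when the graph exceeds memory. The example graph is hypothetical.

```python
def connected_components(edges):
    """Group facts and cells of an allocation graph into connected components.

    edges: imprecise fact id -> list of cell ids it overlaps.
    Returns a list of components, each a list of ("fact", id) / ("cell", id) nodes.
    """
    parent = {}

    def find(x):
        # Locate the representative of x, halving paths along the way.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    for fact, cells in edges.items():
        find(("fact", fact))  # register facts that overlap no cell
        for cell in cells:
            union(("fact", fact), ("cell", cell))

    groups = {}
    for node in list(parent):
        groups.setdefault(find(node), []).append(node)
    return list(groups.values())


# Hypothetical graph: r1 and r2 share cell c2, r3 is isolated from them.
components = connected_components({
    "r1": ["c1", "c2"],
    "r2": ["c2"],
    "r3": ["c3"],
})
```

Each returned component could then be read into memory once and iterated to convergence on its own, which is why the I/O cost stops depending on the number of iterations when every component fits in the buffer.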

  29. Experimental Setup • Algorithms evaluated on several datasets • Real-world dataset: 798K facts, 4 dimensions • Used several synthetic datasets • Vary level of imprecision in the data • Percentage of imprecise facts • Severity of imprecision • Scalability (up to 5 million tuples) • Important parameter: Ratio of input table size to available memory • Memory limited to restricted buffer pool

  30. Experiment 1a: Memory Resident Real Dataset

  31. Experiment: Memory Resident (2) Synthetic Dataset (more imprecision)

  32. Experiment: Algorithm Scalability

  33. Experiment 1b: Algorithm Scalability

  34. Conclusions • Imprecision is a compelling real-world problem • Propose allocation as a solution • Allocation graph formalism • Basis for 3 scalable allocation algorithms • Independent, Block, Transitive • Transitive algorithm is quite intriguing • Performance is stable as the number of iterations increases • Connected components the algorithm identifies can be used in the proposed EDB maintenance algorithm
