Efficient Allocation Algorithms For OLAP Over Imprecise Data

Doug Burdick (University of Wisconsin – Madison) • Prasad Deshpande (IBM India Research Lab, SIRC) • T.S. Jayram (IBM Almaden Research Center) • Raghu Ramakrishnan (Yahoo! Research) • Shivakumar Vaithyanathan (IBM Almaden Research Center)

Imprecise Data • Multidimensional Data [Figure: grid of facts p1 to p5 over two hierarchical dimensions. AUTOMOBILE: Model (Civic, Camry, F150, Sierra), Category (Truck, Sedan), ALL. LOCATION: State (MA, NY, TX, CA), Region (East, West), ALL.] • [BDJ+05] Burdick et al. OLAP Over Uncertain and Imprecise Data. In VLDB 2005

Sources of Imprecision • Dimensions extracted from free text • Assume given extractor for Auto dimension values • More details for dimensions extracted from text in [BDJ+06] Burdick et al. OLAP Over Uncertain and Imprecise Data. To appear in VLDB Journal

Sources of Imprecision • Data Integration • Fact table constructed by integrating multiple data sources • Different sources record same dimension attribute at different granularities [Figure: AUTOMOBILE hierarchy (Model: Civic, Camry, F150, Sierra; Category: Truck, Sedan; ALL) with example sources Mailing List and Call Center]

Imprecision In Real Data • Obtained real-world dataset from auto manufacturer • Fact table entries from several source relations • Integrated fact table contained 798,570 facts • Real data has many imprecise facts

Querying Imprecise Facts • Query: Auto = F150, Loc = MA • SUM(Repair) = ??? [Figure: facts p1 to p5 over Truck models F150 and Sierra and East states MA and NY; imprecise fact p5 is recorded as <MA,Truck>]

Solution: Allocation • Intuitively: Replace each imprecise fact r with set of precise facts, one for each possible completion of r • Each completion is assigned an allocation weight • Refer to the resulting fact table as the Extended Database (EDB) • Queries operate over this Extended Database
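The expansion described on this slide can be sketched in a few lines of Python. This is only an illustration: the `Fact` type, the `CATEGORY_TO_MODELS` mapping, the repair values, and the uniform weighting are assumptions made for the example, not the allocation policy from the talk.

```python
from dataclasses import dataclass

# Assumed hierarchy for the Automobile dimension, read off the slide figures.
CATEGORY_TO_MODELS = {"Truck": ["F150", "Sierra"], "Sedan": ["Civic", "Camry"]}


@dataclass(frozen=True)
class Fact:
    fact_id: str
    state: str
    auto: str      # a Model (precise) or a Category (imprecise)
    repair: float  # hypothetical measure value


def completions(fact):
    """Return the precise (state, model) cells the fact could stand for."""
    models = CATEGORY_TO_MODELS.get(fact.auto, [fact.auto])
    return [(fact.state, model) for model in models]


def build_edb(facts):
    """Expand each fact into (fact_id, cell, weight, repair) rows.

    Uniform weights are an assumption for illustration; any policy whose
    weights sum to 1 per fact could be plugged in here.
    """
    edb = []
    for fact in facts:
        cells = completions(fact)
        for cell in cells:
            edb.append((fact.fact_id, cell, 1.0 / len(cells), fact.repair))
    return edb


facts = [Fact("p1", "NY", "F150", 100.0), Fact("p5", "MA", "Truck", 100.0)]
edb = build_edb(facts)
for row in edb:
    print(row)
```

A precise fact keeps a single row with weight 1, while the imprecise fact contributes one row per completion.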

Handle Imprecision With Allocation [Figure: the grid after allocation; p5 now appears in both MA cells, (MA, F150) and (MA, Sierra), alongside p1 to p4]

Querying The Extended Database • Query: Auto = F150, Loc = MA • SUM(Repair) = 150 • Procedure for assigning allocation weights is referred to as an allocation policy [Figure: grid with p5 allocated to both MA cells, alongside p1 to p4]
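A query over the Extended Database can be read as a weighted aggregate, where each row contributes its measure scaled by its allocation weight. The sketch below assumes that reading. The row values are hypothetical, chosen only so the arithmetic lands on the 150 shown on the slide; the actual measures are not in this transcript.

```python
# Hypothetical EDB rows: (fact_id, state, model, allocation_weight, repair).
EDB = [
    ("p3", "MA", "F150", 1.0, 100.0),    # precise fact, weight 1
    ("p4", "MA", "Sierra", 1.0, 175.0),  # precise fact, weight 1
    ("p5", "MA", "F150", 0.5, 100.0),    # imprecise fact, first completion
    ("p5", "MA", "Sierra", 0.5, 100.0),  # imprecise fact, second completion
]


def weighted_sum(edb, state, model):
    """SUM(Repair) for one cell, scaling each row by its allocation weight."""
    return sum(
        weight * repair
        for _, row_state, row_model, weight, repair in edb
        if row_state == state and row_model == model
    )


result = weighted_sum(EDB, "MA", "F150")
print(result)  # 1.0 * 100.0 + 0.5 * 100.0
```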

Contributions • Propose generalized template for allocation policies presented in [BDJ+05] • Present operational framework for allocation • Allocation graph formalism • Used to derive Independent, Block, Transitive Algorithms • Propose Extended Database Maintenance Algorithm • Update EDB to reflect changes to given fact table • Experimental Evaluation

Allocation Policy Template [Figure: imprecise fact r in the MA row spanning Truck models F150 and Sierra, with candidate cells c1 and c2]

Interactions between overlapping facts • Allocation weights for imprecise fact p6 depend on allocation weights for fact p7 (and vice-versa) • Would like assigned weights to capture these interactions • Idea: Repeatedly allocate p6 and p7 until allocation weights converge [Figure: grid with overlapping imprecise facts p6 and p7 among facts p1, p2, p4, p5]

Iterative Allocation Policies
1) Initialize each Q0(c) in cell c (using precise facts)
2) For each iteration t, until all Qt(c) have converged:
   • For each imprecise fact r: [equation not captured in transcript]
   • For each cell c: for each imprecise fact r overlapping c: [equation not captured in transcript]
3) For each imprecise fact r: for each cell c in region(r): [equation not captured in transcript]
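The loop structure above can be sketched as follows. The update rules are not in the transcript, so the ones used here are an assumption: a count-based, EM-style update in which Q0(c) is the number of precise facts in cell c and every imprecise fact adds its current fractional share to each cell it overlaps. The function name and parameters are hypothetical.

```python
def iterative_allocation(precise_counts, regions, max_iters=100, tol=1e-9):
    """Return {(fact_id, cell): weight} after iterating to convergence.

    precise_counts: cell -> number of precise facts in that cell.
    regions: imprecise fact id -> list of cells the fact overlaps.
    """
    cells = set(precise_counts)
    for region in regions.values():
        cells.update(region)
    base = {c: float(precise_counts.get(c, 0)) for c in cells}

    q = dict(base)  # step 1: Q0(c) from the precise facts
    for _ in range(max_iters):  # step 2: iterate until Qt(c) stops changing
        # Normalizer for each imprecise fact r over the cells in region(r).
        gamma = {r: sum(q[c] for c in region) for r, region in regions.items()}
        new_q = dict(base)
        for r, region in regions.items():
            if gamma[r] == 0:
                continue  # no evidence yet; handled by the fallback below
            for c in region:
                new_q[c] += q[c] / gamma[r]
        change = max((abs(new_q[c] - q[c]) for c in cells), default=0.0)
        q = new_q
        if change < tol:
            break

    weights = {}  # step 3: allocation weight of each fact r to each cell c
    for r, region in regions.items():
        total = sum(q[c] for c in region)
        for c in region:
            # Assumed fallback: uniform weights when the region has no mass.
            weights[(r, c)] = q[c] / total if total > 0 else 1.0 / len(region)
    return weights


weights = iterative_allocation(
    {("MA", "F150"): 2, ("MA", "Sierra"): 1},
    {"p5": [("MA", "F150"), ("MA", "Sierra")]},
)
print(weights)
```

With two precise facts in (MA, F150) and one in (MA, Sierra), the single imprecise fact settles at weights close to 2/3 and 1/3.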

Benefits of Iterative Allocation • Imprecise facts can be allocated in any order and the same allocation weights are obtained • Leverage this idea to obtain scalable allocation algorithms • Leads to Expectation Maximization (EM) framework for allocation • Final allocation weights have pleasing mathematical properties • See [BDJ+05] for details

Allocation Graph [Figure: the fact grid shown next to its allocation graph. Precise Cells: Cell(NY,F150), Cell(NY,Sierra), Cell(MA,F150), Cell(MA,Sierra). Imprecise Facts: <MA,Truck>]

Processing With Allocation Graph • Initialize each Q0(c) in cell c [Figure: allocation graph with Precise Cells Cell(NY,F150), Cell(NY,Sierra), Cell(MA,F150), Cell(MA,Sierra) and Imprecise Fact <MA,Truck>. Values shown: 2 at Cell(MA,F150), 1 at Cell(MA,Sierra), 3 at <MA,Truck>; edge weights 2/3 and 1/3]
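The numbers on this slide are consistent with a count-based first pass over the allocation graph: each imprecise fact splits its weight in proportion to the number of precise facts in the cells it touches. The sketch below assumes that reading, and also assumes that edges go only to cells that hold at least one precise fact. Function names, the hierarchy mapping, and the sample facts are hypothetical.

```python
from collections import defaultdict
from fractions import Fraction

# Assumed hierarchy for the Automobile dimension, read off the slide figures.
CATEGORY_TO_MODELS = {"Truck": ["F150", "Sierra"], "Sedan": ["Civic", "Camry"]}


def build_allocation_graph(precise_facts, imprecise_facts):
    """Return (cell_counts, edges) for a bipartite allocation graph.

    precise_facts: list of (state, model) cells, one entry per fact.
    imprecise_facts: fact id -> (state, category).
    """
    cell_counts = defaultdict(int)
    for cell in precise_facts:
        cell_counts[cell] += 1
    edges = {}
    for fact_id, (state, category) in imprecise_facts.items():
        region = [(state, model) for model in CATEGORY_TO_MODELS[category]]
        # Assumption: keep only edges to cells that contain precise facts.
        edges[fact_id] = [cell for cell in region if cell in cell_counts]
    return dict(cell_counts), edges


def initial_weights(cell_counts, edges):
    """Split each imprecise fact across its cells in proportion to counts."""
    weights = {}
    for fact_id, cells in edges.items():
        total = sum(cell_counts[c] for c in cells)
        for c in cells:
            weights[(fact_id, c)] = Fraction(cell_counts[c], total)
    return weights


precise = [("MA", "F150"), ("MA", "F150"), ("MA", "Sierra"),
           ("NY", "F150"), ("NY", "Sierra")]
cell_counts, edges = build_allocation_graph(precise, {"p5": ("MA", "Truck")})
weights = initial_weights(cell_counts, edges)
print(weights)
```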

Efficient Allocation Algorithms • Independent Algorithm • Requires multiple sorts of precise cells for each iteration • Optimizations based on re-using each sort as much as possible • Block Algorithm • Reduces the number of required sorts for precise cells to 1 • Optimizations based on increasing buffer utilization

[Figure: precise and imprecise facts, with the imprecise facts grouped by the level of each dimension]
Precise facts: p1 <MA,Civic>, p2 <MA,Sierra>, p3 <NY,F150>, p4 <CA,Civic>, p5 <CA,Sierra>
S1 <State,Category>: p6 <MA,Sedan>, p7 <MA,Truck>
S2 <State,ALL>: p8 <CA,ALL>
S3 <Region,Category>: p9 <East,Truck>, p10 <West,Sedan>
S4 <ALL,Model>: p11 <ALL,Civic>, p12 <ALL,Sierra>
S5 <Region,Model>: p13 <West,Civic>, p14 <West,Sierra>

Iteration aware allocation • Optimizations for Independent and Block reduce work for single iteration • Problem: Each iteration of allocation is still expensive • Involves multiple scans of entire fact table • Not feasible for real data warehouses! • Can we do better?

Required Data For Allocating A Fact [Figure: allocation graph linking precise cells c1 <MA,Civic>, c2 <MA,Sierra>, c3 <NY,F150>, c4 <CA,Civic>, c5 <CA,Sierra> to imprecise facts p6 <MA,Sedan>, p7 <MA,Truck>, p8 <CA,ALL>, p9 <East,Truck>, p10 <West,Sedan>, p11 <ALL,Civic>, p12 <ALL,Sierra>, p13 <West,Civic>, p14 <West,Sierra>]

Required Data For Allocating A Fact • Connected components in allocation graph can be processed independently [Figure: the allocation graph over cells c1 to c5 and imprecise facts p6 to p14, rearranged to show its connected components]

Transitive Algorithm • Transitive Algorithm has two steps: • 1) Connected component identification step • 2) Process each connected component • Read component into memory • Perform all iterations of allocation for facts in component • If each component fits into memory, then the number of I/O operations required by Transitive is independent of the number of iterations! • Components larger than buffer processed using Block algorithm • In real datasets, all components were memory resident • Use concepts from Transitive Algorithm to develop EDB Maintenance Algorithm
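The component identification step can be sketched with a union-find over the allocation graph. This is an in-memory illustration only, not the external-memory procedure from the talk, and the function name is hypothetical. The sample edges use just three of the imprecise facts from the earlier example (p7, p9, p10), so the components here are not those of the full graph.

```python
def connected_components(edges):
    """Group imprecise facts and cells into connected components.

    edges: imprecise fact id -> list of cells it overlaps.
    Returns a sorted list of (fact_ids, cells) pairs, one per component.
    """
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Tag nodes so a fact id can never collide with a cell.
    for fact_id, cells in edges.items():
        find(("fact", fact_id))
        for cell in cells:
            union(("fact", fact_id), ("cell", cell))

    groups = {}
    for kind, value in list(parent):
        facts, cells = groups.setdefault(find((kind, value)), ([], []))
        (facts if kind == "fact" else cells).append(value)
    return sorted((sorted(f), sorted(c)) for f, c in groups.values())


edges = {
    "p7": [("MA", "Sierra")],                  # <MA,Truck>
    "p9": [("MA", "Sierra"), ("NY", "F150")],  # <East,Truck>
    "p10": [("CA", "Civic")],                  # <West,Sedan>
}
components = connected_components(edges)
for fact_ids, cells in components:
    print(fact_ids, cells)
```

Each component can then be loaded and iterated to convergence on its own, which is why the number of iterations does not add I/O when components fit in memory.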

Experimental Setup • Algorithms evaluated on several datasets • Real-world dataset: 798K facts, 4 dimensions • Used several synthetic datasets • Vary level of imprecision in the data • Percentage of imprecise facts • Severity of imprecision • Scalability (up to 5 million tuples) • Important parameter: Ratio of input table size to available memory • Memory limited to restricted buffer pool

Experiment 1a: Memory Resident Real Dataset

Experiment: Memory Resident (2) Synthetic Dataset (more imprecision)

Experiment: Algorithm Scalability

Experiment 1b: Algorithm Scalability

Conclusions • Imprecision is a compelling real-world problem • Propose allocation as a solution • Allocation graph formalism • Basis for 3 scalable allocation algorithms • Independent, Block, Transitive • Transitive algorithm is quite intriguing • Performance is stable as the number of iterations increases • Connected components the algorithm identifies can be used in the proposed EDB maintenance algorithm
