Aggregate Query Answering under Uncertain Schema Mappings

Avigdor Gal, Maria Vanina Martinez, Gerardo I. Simari, VS Subrahmanian
Presented by Stephen Lynn

Overview
• Aggregate Queries
• Probabilistic Schema Mapping
• Goals/Objectives
• Aggregate Processing (3 proposals)
• By-Table Algorithm
• By-Tuple Algorithm
• Evaluation
• Analysis

Aggregate Queries
• Operators: COUNT, MIN, MAX, SUM, AVG
• Each has a simple PTIME algorithm

Probabilistic Schema Mappings

By-Table vs By-Tuple
• By-Tuple: each tuple may follow any of the possible mappings, so all mappings are considered per tuple
• By-Table: a single mapping applies to the entire table
• Example mapping probabilities
  • P(date→postedDate) = 0.7
  • P(date→reducedDate) = 0.3
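The difference between the two semantics can be sketched by enumerating possible worlds for a tiny table. This is an illustration only: the mapping probabilities come from the slide, while the rows, the threshold, and the function names are made-up assumptions.

```python
from itertools import product

# Probabilities from the slide; the rows and threshold are hypothetical.
MAPPINGS = {"postedDate": 0.7, "reducedDate": 0.3}
ROWS = [
    {"postedDate": 3, "reducedDate": 9},
    {"postedDate": 8, "reducedDate": 4},
]
THRESHOLD = 5  # hypothetical query: COUNT(*) WHERE date < 5


def count_matching(rows, columns):
    """Count rows whose mapped column value is below THRESHOLD."""
    return sum(1 for row, col in zip(rows, columns) if row[col] < THRESHOLD)


def by_table_worlds(rows):
    """One mapping for the whole table: one world per mapping."""
    return [(count_matching(rows, [col] * len(rows)), p)
            for col, p in MAPPINGS.items()]


def by_tuple_worlds(rows):
    """An independent mapping per tuple: len(MAPPINGS) ** len(rows) worlds."""
    worlds = []
    for columns in product(MAPPINGS, repeat=len(rows)):
        prob = 1.0
        for col in columns:
            prob *= MAPPINGS[col]
        worlds.append((count_matching(rows, columns), prob))
    return worlds


print(by_table_worlds(ROWS))
print(by_tuple_worlds(ROWS))
```

With these rows, by-table yields the same count under both mappings, whereas by-tuple also produces worlds with counts 0 and 2. The by-tuple enumeration grows exponentially in the number of tuples, which is why dedicated algorithms are needed.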

Goals/Objectives
• Impact Analysis of Probabilistic Schemas on Aggregate Queries
• Aggregate Query Algorithms
• Time Complexity Analysis
• Evaluation

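Aggregation Methods
• Range: the interval between the smallest and largest possible answer
• Distribution: the probability of each possible answer
• Expected Value: the probability-weighted mean of the possible answers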

Method Relationships
• Distribution
  • Most time consuming to compute
  • Carries the most information
• Range
  • Can be computed directly from the distribution
• Expected Value
  • Can be computed directly from the distribution
• Range and expected value also have more efficient ways to compute
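The relationship between the three methods can be sketched as follows, assuming the distribution is represented as a dictionary from answer value to probability (a representation chosen here for illustration, not taken from the slides).

```python
def range_from_distribution(dist):
    """Smallest and largest answers that have non-zero probability."""
    values = [value for value, prob in dist.items() if prob > 0]
    return min(values), max(values)


def expected_from_distribution(dist):
    """Probability-weighted mean of the possible answers."""
    return sum(value * prob for value, prob in dist.items())


# Hypothetical COUNT distribution over three possible answers.
example = {0: 0.21, 1: 0.58, 2: 0.21}
print(range_from_distribution(example))
print(expected_from_distribution(example))
```

Going the other way is not possible: neither the range nor the expected value determines the distribution, which is the sense in which the distribution carries the most information.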

By-Table Algorithm
• All aggregates are PTIME computable
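A minimal sketch of the by-table idea, assuming it amounts to running the aggregate query once per candidate mapping and weighting each answer by that mapping's probability. The table, the `{date}` placeholder convention, and the function name are hypothetical; SQLite is used only because it ships with Python.

```python
import sqlite3
from collections import defaultdict


def by_table_distribution(conn, query_template, mappings):
    """Run the query once per mapping and accumulate answer probabilities."""
    dist = defaultdict(float)
    for column, prob in mappings.items():
        # Substitute the mapped column for the uncertain attribute.
        (value,) = conn.execute(query_template.format(date=column)).fetchone()
        dist[value] += prob
    return dict(dist)


# Hypothetical data: three items with two candidate date columns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (postedDate INTEGER, reducedDate INTEGER)")
conn.executemany("INSERT INTO items VALUES (?, ?)", [(3, 9), (8, 4), (2, 7)])

dist = by_table_distribution(
    conn,
    "SELECT COUNT(*) FROM items WHERE {date} < 5",
    {"postedDate": 0.7, "reducedDate": 0.3},
)
print(dist)
```

Because each mapping costs one ordinary query, the total work is the number of mappings times the cost of a standard aggregate, and the database engine's own optimizations apply to every run.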

By-Tuple Algorithm (COUNT)
• Time complexity: O(n * m)
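One way to reach an O(n * m) bound for COUNT is sketched below, assuming n is the number of tuples and m the number of candidate mappings (the slide does not define them). This is an illustration, not the paper's pseudocode: the range and the expected value come from a single pass over tuples and mappings, and the distribution is built with an extra dynamic-programming step that costs O(n^2) on top. Function and variable names are hypothetical.

```python
def by_tuple_count(rows, mappings, predicate):
    """Return ((low, high), expected, dist) for COUNT under by-tuple semantics.

    dist[k] is the probability that exactly k tuples satisfy the predicate,
    assuming each tuple picks its mapping independently.
    """
    low = high = 0
    expected = 0.0
    dist = [1.0]  # before any tuple is seen, the count is 0 with certainty
    for row in rows:
        # Probabilities of the mappings under which this tuple qualifies.
        hits = [prob for col, prob in mappings.items() if predicate(row[col])]
        p = sum(hits)
        if len(hits) == len(mappings):
            low += 1   # qualifies under every mapping
        if hits:
            high += 1  # qualifies under at least one mapping
        expected += p
        # Extend the distribution by one tuple that counts with probability p.
        updated = [0.0] * (len(dist) + 1)
        for k, q in enumerate(dist):
            updated[k] += q * (1.0 - p)
            updated[k + 1] += q * p
        dist = updated
    return (low, high), expected, dist


# Hypothetical rows; mapping probabilities as in the earlier slide.
rows = [{"postedDate": 3, "reducedDate": 9}, {"postedDate": 8, "reducedDate": 4}]
mappings = {"postedDate": 0.7, "reducedDate": 0.3}
print(by_tuple_count(rows, mappings, lambda value: value < 5))
```

The key observation is that, for COUNT, each tuple contributes independently, so there is no need to enumerate the m^n mapping sequences.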

Example By-Tuple (COUNT)

Time Complexity

Evaluation
• Empirical Evaluation
  • Real-world dataset (eBay)
  • Synthetic dataset
• Evaluate Time Complexity
  • Vary the number of tuples
  • Vary the number of attribute mappings

Evaluation Results

Analysis
• Strengths
  • Shows the effect of probabilistic schemas on aggregates
  • Nice PTIME algorithms
• Weaknesses
  • Evaluation results were predictable
  • By-Table results biased by database optimizations
• Future Work
  • Improve algorithms
  • Extend to sub-queries
  • Heuristics
