This paper explores the problem of answering imprecise queries in web databases and proposes a feasible solution. It discusses the challenges and motivations of the problem, as well as the AIMQ approach that relies on query-tuple similarity and attribute importance estimation.
Answering Imprecise Queries over Web Databases
Ullas Nambiar and Subbarao Kambhampati
Department of CS & Engineering, Arizona State University
VLDB 2005, Aug 30 – Sep 02, 2005, Trondheim, Norway
Why Imprecise Queries?
A feasible query, Make = "Toyota", Model = "Camry", Price ≤ $7000, returns only exact matches:
• Toyota, Camry, $7000, 1999
• Toyota, Camry, $7000, 2001
• Toyota, Camry, $6700, 2000
• Toyota, Camry, $6500, 1998
• ………
But a user who really wants "a sedan priced around $7000" is left asking: What about the price of a Honda Accord? Is there a Camry for $7100?
Solution: support imprecise queries.
The Imprecise Query Answering Problem
Problem Statement: Given a conjunctive query Q over a relation R, find a ranked set of tuples of R whose similarity to Q exceeds a threshold Tsim:
Ans(Q) = { x | x ∈ R, Similarity(Q, x) > Tsim }
Constraints — the database is autonomous:
• Data accessible only by querying
• Data model, operators, etc. not modifiable
• Supports only the boolean (exact-match) relational model
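The answer-set definition above can be sketched directly in code. This is a minimal illustration, not AIMQ's implementation: the similarity function is a stand-in supplied by the caller, and the tuple layout is hypothetical.

```python
# Sketch of Ans(Q) = { x | x in R, Similarity(Q, x) > Tsim }:
# score every tuple, keep those above the threshold, rank by similarity.
def answer(query, relation, similarity, t_sim=0.5):
    """Return tuples of `relation` with similarity(query, t) > t_sim,
    ranked by decreasing similarity."""
    scored = [(similarity(query, t), t) for t in relation]
    return [t for s, t in sorted(scored, key=lambda p: -p[0]) if s > t_sim]

# Toy usage: fraction of matching attribute values as the similarity.
cars = [("Toyota", "Camry"), ("Honda", "Accord"), ("Toyota", "Corolla")]
def sim(q, t):
    return sum(a == b for a, b in zip(q, t)) / len(q)

ranked = answer(("Toyota", "Camry"), cars, sim, t_sim=0.4)
# ranked == [("Toyota", "Camry"), ("Toyota", "Corolla")]
```

Note that the exact matches an ordinary boolean query would return still rank first; the threshold merely admits similar tuples behind them.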
Existing Approaches
• Similarity search over vector space: data must be stored as vectors of text (WHIRL, W. Cohen, 1998)
• Enhanced database model: add a 'similar-to' operator to SQL, with distances provided by an expert/system designer (VAGUE, A. Motro, 1988); support similarity search and query refinement over abstract data types (Binderberger et al., 2003)
• User guidance: users provide information about the objects required and their possible neighborhood (Proximity Search, Goldman et al., 1998)
Limitations:
• A user/expert must provide the similarity measures
• New operators are needed to use the distance measures
• Not applicable over autonomous databases
Motivation & Challenges
Motivation:
• Mimic the relevance-based ranked-retrieval paradigm of IR systems
• Can we learn relevance statistics from the database itself?
• Use the estimated relevance model to improve users' querying experience
Challenges:
• Estimating query-tuple similarity: a weighted summation of attribute similarities; syntactic similarity is inadequate, semantic similarity must be estimated, yet few ontologies are available
• Measuring attribute importance: not all attributes are equally important, and users cannot quantify importance
Objectives:
• Minimal burden on the end user
• No changes to the existing database
• Domain independence
The AIMQ Architecture
[Architecture diagram] Wrappers over the web data sources (DataSource 1 … n) feed a Data Collector, which probes the sources using random sample queries to build a sample dataset. From the sample, a Dependency Miner mines AFDs and approximate keys, yielding weighted dependencies, while a Similarity Miner extracts concepts and estimates value similarities, yielding a similarity matrix. Given an imprecise query, the Query Engine maps it to a precise query, identifies and executes similar queries, and returns ranked tuples.
Query-Tuple Similarity
• Tuples in the extended set show different levels of relevance
• Ranked according to their similarity to the corresponding tuples in the base set, using
Similarity(Q, t) = Σi=1..n Wimp(Ai) × Sim(Q.Ai, t.Ai)
where n = Count(Attributes(R)) and Wimp(Ai) is the importance weight of attribute Ai
• Euclidean distance gives the similarity for numerical attributes, e.g. Price, Year
• VSim, the semantic value similarity estimated by AIMQ, gives the similarity for categorical attributes, e.g. Make, Model
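The weighted-sum similarity can be sketched as below. This is a hedged illustration: the importance weights and the VSim entries are invented for the example (AIMQ learns both from the data), and the numeric measure is a simple normalized-distance stand-in for the Euclidean-distance-based similarity.

```python
# Sketch of Similarity(Q, t) = sum_i Wimp(A_i) * Sim(Q.A_i, t.A_i):
# distance-based similarity for numeric attributes, a learned value-
# similarity table (VSim) for categorical ones.
def attr_sim(q_val, t_val, vsim):
    if isinstance(q_val, (int, float)):      # numeric: closeness in value
        return 1.0 - abs(q_val - t_val) / max(abs(q_val), abs(t_val), 1)
    if q_val == t_val:                       # categorical: exact match
        return 1.0
    return vsim.get((q_val, t_val), 0.0)     # categorical: learned VSim

def tuple_similarity(query, tup, w_imp, vsim):
    """Weighted sum of per-attribute similarities over the query's bound attributes."""
    return sum(w_imp[a] * attr_sim(q, tup[a], vsim) for a, q in query.items())

# Hypothetical learned weights and value similarities:
vsim = {("Camry", "Accord"): 0.4}
w = {"Make": 0.3, "Model": 0.4, "Price": 0.3}
q = {"Make": "Toyota", "Model": "Camry", "Price": 7000}
t = {"Make": "Honda", "Model": "Accord", "Price": 6500}
score = tuple_similarity(q, t, w, vsim)      # roughly 0.44 here
```

A tuple matching the query exactly would score Σ Wimp(Ai) = 1.0 under these weights, so scores are directly comparable across tuples.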
Deciding Attribute Order
• Mine AFDs and approximate keys
• Create a dependence graph using the AFDs; the graph is strongly connected, hence a topological sort is not possible
• Instead, use the approximate key with highest support to partition the attributes into a deciding set and a dependent set
• Sort the two subsets using the dependence and influence weights, and measure attribute importance Wimp from these mined weights
• The attribute relaxation order is then all non-key attributes first, followed by key attributes (greedy multi-attribute relaxation)
Example over CarDB(Make, Model, Year, Price):
• Decides: {Make, Year}; Depends: {Model, Price}
• Order: Price, Model, Year, Make
• 1-attribute relaxations: {Price, Model, Year, Make}; 2-attribute relaxations: {(Price, Model), (Price, Year), (Price, Make), …}
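The partition-and-sort step above can be sketched as follows. The dependence weights here are hypothetical placeholders; AIMQ mines them from the AFDs, and the example merely reproduces the CarDB ordering from the slide.

```python
# Sketch of the relaxation ordering: split attributes by the chosen
# approximate key into dependent (non-key) and deciding (key) sets,
# sort each by its mined dependence weight, and relax dependent
# attributes before deciding ones.
def relaxation_order(attributes, approx_key, weight):
    deciding  = [a for a in attributes if a in approx_key]
    dependent = [a for a in attributes if a not in approx_key]
    return (sorted(dependent, key=weight, reverse=True)
            + sorted(deciding, key=weight, reverse=True))

# Hypothetical mined weights chosen to reproduce the slide's example:
w = {"Price": 0.9, "Model": 0.8, "Year": 0.7, "Make": 0.6}
order = relaxation_order(["Make", "Model", "Year", "Price"],
                         {"Make", "Year"}, w.get)
# order == ["Price", "Model", "Year", "Make"]
```

Relaxing low-importance, non-key attributes first keeps early relaxed queries close to the user's intent, since the key attributes carry the most identifying information.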
Empirical Evaluation
Goal:
• Test the robustness of the learned dependencies
• Evaluate the effectiveness of query relaxation and similarity estimation
Database:
• Used-car database CarDB(Make, Model, Year, Price, Mileage, Location, Color), based on Yahoo Autos
• Populated with 100k tuples from Yahoo Autos
Algorithms:
• AIMQ, with two relaxation strategies: RandomRelax, which randomly picks the attribute to relax, and GuidedRelax, which uses the relaxation order determined from approximate keys and AFDs
• ROCK: RObust Clustering using linKs (Guha et al., ICDE 1999) — computes neighbours (tuples similar to each other) and links (the number of common neighbours between two tuples) for every pair of tuples, then clusters tuples having common neighbours
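The ROCK baseline's neighbour/link computation can be sketched as below. The similarity function is an assumption for illustration (fraction of matching attribute values); ROCK itself accepts any similarity measure and threshold θ.

```python
# Sketch of ROCK's link computation: two tuples are neighbours if their
# similarity meets a threshold theta; the link count of a pair is the
# number of neighbours they share. Tuples with many links end up in the
# same cluster.
def neighbours(tuples, theta=0.5):
    def sim(a, b):                       # illustrative similarity measure
        return sum(x == y for x, y in zip(a, b)) / len(a)
    return {t: {u for u in tuples if u != t and sim(t, u) >= theta}
            for t in tuples}

def links(nbrs, a, b):
    """Number of common neighbours between tuples a and b."""
    return len(nbrs[a] & nbrs[b])

cars = [("Toyota", "Camry", 2000), ("Toyota", "Camry", 2001),
        ("Toyota", "Corolla", 2000), ("Honda", "Accord", 1998)]
nbrs = neighbours(cars, theta=0.5)
# The two Camrys and the Corolla are linked through their shared
# neighbour; the Accord has no neighbours at all.
```

This quadratic all-pairs computation is one reason ROCK is an expensive baseline on a 100k-tuple database.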
Robustness of Dependencies: the attribute dependence order and key quality are unaffected by sampling.
Efficiency of Relaxation
Guided relaxation:
• On average 4 tuples extracted per relevant tuple for Є = 0.5, rising to 12 tuples for Є = 0.7
• Resilient to changes in Є
Random relaxation:
• On average 8 tuples extracted per relevant tuple for Є = 0.5, rising to 120 tuples for Є = 0.7
• Not resilient to changes in Є
Accuracy over CarDB
• 14 queries over 100k tuples; similarity learned using a 25k sample
• Mean Reciprocal Rank (MRR) estimated as the average over queries of the reciprocal rank of the first relevant answer:
MRR = (1/|Q|) Σi=1..|Q| 1/ranki
• The overall high MRR shows the high relevance of the suggested answers
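The MRR computation is a one-liner; this sketch uses the standard definition stated above, with the rank lists invented for the example.

```python
# Mean Reciprocal Rank: average over queries of 1/rank of the first
# relevant answer (rank is 1-based), so MRR = 1.0 means the top-ranked
# answer was always relevant.
def mean_reciprocal_rank(first_relevant_ranks):
    """first_relevant_ranks: 1-based rank of the first relevant answer per query."""
    return sum(1.0 / r for r in first_relevant_ranks) / len(first_relevant_ranks)

# Hypothetical results for three queries: relevant answer at ranks 1, 2, 4.
mrr = mean_reciprocal_rank([1, 2, 4])   # (1 + 0.5 + 0.25) / 3
```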
AIMQ - Summary
• An approach for answering imprecise queries over Web databases
• Mines and uses AFDs to determine the attribute relaxation order
• Domain-independent semantic similarity estimation technique
• Automatically computes attribute importance scores
• Empirical evaluation shows: efficiency and robustness of the algorithms, better performance than current approaches, high relevance of the suggested answers, and domain independence