This paper explores the problem of answering imprecise queries in web databases and proposes a feasible solution. It discusses the challenges and motivations of the problem, as well as the AIMQ approach that relies on query-tuple similarity and attribute importance estimation.
Answering Imprecise Queries over Web Databases
Ullas Nambiar and Subbarao Kambhampati
Department of CS & Engineering, Arizona State University
VLDB 2005, Aug 30 – Sep 02, 2005, Trondheim, Norway
Why Imprecise Queries?
A feasible query, Make = "Toyota", Model = "Camry", Price ≤ $7000, returns only exact matches:
• Toyota, Camry, $7000, 1999
• Toyota, Camry, $7000, 2001
• Toyota, Camry, $6700, 2000
• Toyota, Camry, $6500, 1998
• ………
But a user who really wants "a sedan priced around $7000" is left asking: What about the price of a Honda Accord? Is there a Camry for $7100?
Solution: support imprecise queries.
The Imprecise Query Answering Problem
Problem Statement: Given a conjunctive query Q over a relation R, find a ranked set of tuples of R whose similarity to Q exceeds a threshold Tsim:
Ans(Q) = { x | x ∈ R, Similarity(Q, x) > Tsim }
Constraints — the database is autonomous:
• Data accessible only by querying
• Data model, operators, etc. not modifiable
• Supports only the boolean (exact-match) relational model
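The answer-set definition above can be sketched directly in code. This is a minimal illustration, not AIMQ's implementation: the similarity function is a stand-in supplied by the caller, and the tuple layout is hypothetical.

```python
# Sketch of Ans(Q) = { x | x in R, Similarity(Q, x) > Tsim }:
# score every tuple, keep those above the threshold, rank by similarity.
def answer(query, relation, similarity, t_sim=0.5):
    """Return tuples of `relation` with similarity(query, t) > t_sim,
    ranked by decreasing similarity."""
    scored = [(similarity(query, t), t) for t in relation]
    return [t for s, t in sorted(scored, key=lambda p: -p[0]) if s > t_sim]

# Toy usage: fraction of matching attribute values as the similarity.
cars = [("Toyota", "Camry"), ("Honda", "Accord"), ("Toyota", "Corolla")]
def sim(q, t):
    return sum(a == b for a, b in zip(q, t)) / len(q)

ranked = answer(("Toyota", "Camry"), cars, sim, t_sim=0.4)
# ranked == [("Toyota", "Camry"), ("Toyota", "Corolla")]
```

Note that the exact matches an ordinary boolean query would return still rank first; the threshold merely admits similar tuples behind them.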
Existing Approaches
• Similarity search over vector space: data must be stored as vectors of text (WHIRL, W. Cohen, 1998)
• Enhanced database model: add a 'similar-to' operator to SQL, with distances provided by an expert/system designer (VAGUE, A. Motro, 1988); support similarity search and query refinement over abstract data types (Binderberger et al., 2003)
• User guidance: users provide information about the objects required and their possible neighborhood (Proximity Search, Goldman et al., 1998)
Limitations:
• A user/expert must provide the similarity measures
• New operators are needed to use the distance measures
• Not applicable over autonomous databases
Motivation & Challenges
Motivation:
• Mimic the relevance-based ranked-retrieval paradigm of IR systems
• Can we learn relevance statistics from the database itself?
• Use the estimated relevance model to improve users' querying experience
Challenges:
• Estimating query-tuple similarity: a weighted summation of attribute similarities; syntactic similarity is inadequate, semantic similarity must be estimated, yet few ontologies are available
• Measuring attribute importance: not all attributes are equally important, and users cannot quantify importance
Objectives:
• Minimal burden on the end user
• No changes to the existing database
• Domain independence
The AIMQ Architecture
[Architecture diagram] Wrappers over the web data sources (DataSource 1 … n) feed a Data Collector, which probes the sources using random sample queries to build a sample dataset. From the sample, a Dependency Miner mines AFDs and approximate keys, yielding weighted dependencies, while a Similarity Miner extracts concepts and estimates value similarities, yielding a similarity matrix. Given an imprecise query, the Query Engine maps it to a precise query, identifies and executes similar queries, and returns ranked tuples.
Query-Tuple Similarity
• Tuples in the extended set show different levels of relevance
• Ranked according to their similarity to the corresponding tuples in the base set, using
Similarity(Q, t) = Σi=1..n Wimp(Ai) × Sim(Q.Ai, t.Ai)
where n = Count(Attributes(R)) and Wimp(Ai) is the importance weight of attribute Ai
• Euclidean distance gives the similarity for numerical attributes, e.g. Price, Year
• VSim, the semantic value similarity estimated by AIMQ, gives the similarity for categorical attributes, e.g. Make, Model
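The weighted-sum similarity can be sketched as below. This is a hedged illustration: the importance weights and the VSim entries are invented for the example (AIMQ learns both from the data), and the numeric measure is a simple normalized-distance stand-in for the Euclidean-distance-based similarity.

```python
# Sketch of Similarity(Q, t) = sum_i Wimp(A_i) * Sim(Q.A_i, t.A_i):
# distance-based similarity for numeric attributes, a learned value-
# similarity table (VSim) for categorical ones.
def attr_sim(q_val, t_val, vsim):
    if isinstance(q_val, (int, float)):      # numeric: closeness in value
        return 1.0 - abs(q_val - t_val) / max(abs(q_val), abs(t_val), 1)
    if q_val == t_val:                       # categorical: exact match
        return 1.0
    return vsim.get((q_val, t_val), 0.0)     # categorical: learned VSim

def tuple_similarity(query, tup, w_imp, vsim):
    """Weighted sum of per-attribute similarities over the query's bound attributes."""
    return sum(w_imp[a] * attr_sim(q, tup[a], vsim) for a, q in query.items())

# Hypothetical learned weights and value similarities:
vsim = {("Camry", "Accord"): 0.4}
w = {"Make": 0.3, "Model": 0.4, "Price": 0.3}
q = {"Make": "Toyota", "Model": "Camry", "Price": 7000}
t = {"Make": "Honda", "Model": "Accord", "Price": 6500}
score = tuple_similarity(q, t, w, vsim)      # roughly 0.44 here
```

A tuple matching the query exactly would score Σ Wimp(Ai) = 1.0 under these weights, so scores are directly comparable across tuples.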
Deciding Attribute Order
• Mine AFDs and approximate keys
• Create a dependence graph using the AFDs; the graph is strongly connected, hence a topological sort is not possible
• Instead, use the approximate key with highest support to partition the attributes into a deciding set and a dependent set
• Sort the two subsets using the dependence and influence weights, and measure attribute importance Wimp from these mined weights
• The attribute relaxation order is then all non-key attributes first, followed by key attributes (greedy multi-attribute relaxation)
Example over CarDB(Make, Model, Year, Price):
• Decides: {Make, Year}; Depends: {Model, Price}
• Order: Price, Model, Year, Make
• 1-attribute relaxations: {Price, Model, Year, Make}; 2-attribute relaxations: {(Price, Model), (Price, Year), (Price, Make), …}
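The partition-and-sort step above can be sketched as follows. The dependence weights here are hypothetical placeholders; AIMQ mines them from the AFDs, and the example merely reproduces the CarDB ordering from the slide.

```python
# Sketch of the relaxation ordering: split attributes by the chosen
# approximate key into dependent (non-key) and deciding (key) sets,
# sort each by its mined dependence weight, and relax dependent
# attributes before deciding ones.
def relaxation_order(attributes, approx_key, weight):
    deciding  = [a for a in attributes if a in approx_key]
    dependent = [a for a in attributes if a not in approx_key]
    return (sorted(dependent, key=weight, reverse=True)
            + sorted(deciding, key=weight, reverse=True))

# Hypothetical mined weights chosen to reproduce the slide's example:
w = {"Price": 0.9, "Model": 0.8, "Year": 0.7, "Make": 0.6}
order = relaxation_order(["Make", "Model", "Year", "Price"],
                         {"Make", "Year"}, w.get)
# order == ["Price", "Model", "Year", "Make"]
```

Relaxing low-importance, non-key attributes first keeps early relaxed queries close to the user's intent, since the key attributes carry the most identifying information.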
Empirical Evaluation
Goal:
• Test the robustness of the learned dependencies
• Evaluate the effectiveness of query relaxation and similarity estimation
Database:
• Used-car database CarDB(Make, Model, Year, Price, Mileage, Location, Color), based on Yahoo Autos
• Populated with 100k tuples from Yahoo Autos
Algorithms:
• AIMQ, with two relaxation strategies: RandomRelax, which randomly picks the attribute to relax, and GuidedRelax, which uses the relaxation order determined from approximate keys and AFDs
• ROCK: RObust Clustering using linKs (Guha et al., ICDE 1999) — computes neighbours (tuples similar to each other) and links (the number of common neighbours between two tuples) for every pair of tuples, then clusters tuples having common neighbours
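The ROCK baseline's neighbour/link computation can be sketched as below. The similarity function is an assumption for illustration (fraction of matching attribute values); ROCK itself accepts any similarity measure and threshold θ.

```python
# Sketch of ROCK's link computation: two tuples are neighbours if their
# similarity meets a threshold theta; the link count of a pair is the
# number of neighbours they share. Tuples with many links end up in the
# same cluster.
def neighbours(tuples, theta=0.5):
    def sim(a, b):                       # illustrative similarity measure
        return sum(x == y for x, y in zip(a, b)) / len(a)
    return {t: {u for u in tuples if u != t and sim(t, u) >= theta}
            for t in tuples}

def links(nbrs, a, b):
    """Number of common neighbours between tuples a and b."""
    return len(nbrs[a] & nbrs[b])

cars = [("Toyota", "Camry", 2000), ("Toyota", "Camry", 2001),
        ("Toyota", "Corolla", 2000), ("Honda", "Accord", 1998)]
nbrs = neighbours(cars, theta=0.5)
# The two Camrys and the Corolla are linked through their shared
# neighbour; the Accord has no neighbours at all.
```

This quadratic all-pairs computation is one reason ROCK is an expensive baseline on a 100k-tuple database.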
Robustness of Dependencies: the attribute dependence order and key quality are unaffected by sampling.
Efficiency of Relaxation
Guided relaxation:
• On average 4 tuples extracted per relevant tuple for Є = 0.5, rising to 12 tuples for Є = 0.7
• Resilient to changes in Є
Random relaxation:
• On average 8 tuples extracted per relevant tuple for Є = 0.5, rising to 120 tuples for Є = 0.7
• Not resilient to changes in Є
Accuracy over CarDB
• 14 queries over 100k tuples; similarity learned using a 25k sample
• Mean Reciprocal Rank (MRR) estimated as the average over queries of the reciprocal rank of the first relevant answer:
MRR = (1/|Q|) Σi=1..|Q| 1/ranki
• The overall high MRR shows the high relevance of the suggested answers
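The MRR computation is a one-liner; this sketch uses the standard definition stated above, with the rank lists invented for the example.

```python
# Mean Reciprocal Rank: average over queries of 1/rank of the first
# relevant answer (rank is 1-based), so MRR = 1.0 means the top-ranked
# answer was always relevant.
def mean_reciprocal_rank(first_relevant_ranks):
    """first_relevant_ranks: 1-based rank of the first relevant answer per query."""
    return sum(1.0 / r for r in first_relevant_ranks) / len(first_relevant_ranks)

# Hypothetical results for three queries: relevant answer at ranks 1, 2, 4.
mrr = mean_reciprocal_rank([1, 2, 4])   # (1 + 0.5 + 0.25) / 3
```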
AIMQ - Summary
• An approach for answering imprecise queries over Web databases
• Mines and uses AFDs to determine the attribute relaxation order
• Domain-independent semantic similarity estimation technique
• Automatically computes attribute importance scores
• Empirical evaluation shows: efficiency and robustness of the algorithms, better performance than current approaches, high relevance of the suggested answers, and domain independence