Query Processing over Incomplete Autonomous Databases
Presented by Garrett Wolf, Hemal Khatri, Bhaumik Chokshi, Jianchun Fan, Yi Chen, Subbarao Kambhampati
Arizona State University
2008-02-04
Summarized by Sungchan Park
Introduction
• More and more data is becoming accessible via web servers that are backed by autonomous databases
• E.g., Cars.com, Realtor.com, Google Base, etc.
[Figure: a Mediator and three Autonomous Database nodes]
Center for E-Business Technology
Web DBs are Incomplete!
• Incomplete Entry
• Inaccurate Extraction
• Heterogeneous Schemas
• User-Defined Schemas
Problem
• Current autonomous database systems return only certain answers, namely those that exactly satisfy all of the user's query constraints
• Prior work on handling incompleteness in databases has mostly focused on single databases over which the query processor has complete control
  • Such approaches modify the database directly by replacing null values with likely values
  • This is not applicable to autonomous databases
Possible Naïve Approaches
Query Q: (Body Style = Convt)
• CERTAINONLY
  • Return only certain answers
  • Low recall
• ALLRETURNED
  • Return all answers having Body Style = Convt or Body Style = Null
  • Low precision, infeasible
• ALLRANKED
  • Return all answers having Body Style = Convt; additionally, take all answers having Body Style = Null, predict their missing values, and return them to the user in ranked order
  • Costly, infeasible
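The three baselines can be sketched over a small in-memory table. This is only an illustration, not the authors' code: the rows, the function names, and the `toy_predict` scorer are all hypothetical, and `None` stands in for a null value.

```python
# Toy relation (made-up rows); None marks a missing (null) value.
cars = [
    {"id": 1, "model": "Z4", "body_style": "Convt"},
    {"id": 2, "model": "Civic", "body_style": None},
    {"id": 3, "model": "Z4", "body_style": None},
    {"id": 4, "model": "Accord", "body_style": "Sedan"},
]


def certain_only(rows, attr, value):
    """CERTAINONLY: keep only tuples that exactly satisfy the constraint."""
    return [r for r in rows if r[attr] == value]


def all_returned(rows, attr, value):
    """ALLRETURNED: certain answers plus every tuple with a null on attr."""
    return [r for r in rows if r[attr] == value or r[attr] is None]


def all_ranked(rows, attr, value, predict):
    """ALLRANKED: certain answers first, then all null tuples ordered by the
    predicted probability that the missing value equals `value`."""
    certain = certain_only(rows, attr, value)
    uncertain = [r for r in rows if r[attr] is None]
    uncertain.sort(key=lambda r: predict(r, value), reverse=True)
    return certain + uncertain


def toy_predict(row, value):
    """Hypothetical scorer: pretends a Z4 is very likely a convertible."""
    return 0.9 if row["model"] == "Z4" else 0.1


ranked_ids = [r["id"] for r in all_ranked(cars, "body_style", "Convt", toy_predict)]
```

Note that both ALLRETURNED and ALLRANKED need to fetch every tuple with a null on the constrained attribute, which is the step the slide marks as infeasible for autonomous sources.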
QPIAD
• Solves the problem by generating rewritten queries according to a set of mined attribute correlation rules
  • Approximate Functional Dependency (AFD)
  • Naïve Bayesian Classifier
QPIAD Solution
QPIAD Architecture
Overall Process
• Learn
• Rewrite
• Rank
• Explain
#1. Learn - AFD
• Learn attribute correlations
  • Approximate Functional Dependencies (AFDs)
  • Approximate Keys (AKeys), used for pruning
• Learned with the TANE algorithm
  • Y. Huhtala, et al. Efficient discovery of functional and approximate dependencies using partitions. 1998.
• Pruning example
  • AFD {A1, A2} ~> A3
  • AKey {A1}
  • The AFD is pruned because its determining set contains an approximate key
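To make "approximate" concrete, the sketch below scores an AFD and an AKey on a toy table. It assumes the g3-style error that TANE works with (the minimum fraction of tuples that must be removed for the dependency to hold exactly) and reports confidence as one minus that error. The function names and rows are hypothetical, and this is not the TANE search itself, only the scoring of one candidate.

```python
from collections import Counter, defaultdict

# Made-up rows for illustration only.
rows = [
    {"model": "Accord", "body": "Sedan"},
    {"model": "Accord", "body": "Sedan"},
    {"model": "Accord", "body": "Coupe"},
    {"model": "Civic", "body": "Sedan"},
    {"model": "Z4", "body": "Convt"},
]


def afd_confidence(rows, lhs, rhs):
    """Confidence of lhs ~> rhs: fraction of tuples kept when, within each
    group of equal lhs values, only the most common rhs value is retained."""
    groups = defaultdict(Counter)
    for r in rows:
        groups[tuple(r[a] for a in lhs)][r[rhs]] += 1
    kept = sum(c.most_common(1)[0][1] for c in groups.values())
    return kept / len(rows)


def akey_confidence(rows, attrs):
    """Confidence that attrs is a key: fraction of tuples kept when one
    tuple is retained per distinct combination of attrs values."""
    distinct = {tuple(r[a] for a in attrs) for r in rows}
    return len(distinct) / len(rows)


model_to_body = afd_confidence(rows, ["model"], "body")
model_as_key = akey_confidence(rows, ["model"])
```

A determining set that is close to a key trivially "determines" everything, which is why a high AKey confidence is a reason to discard an AFD rather than to trust it.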
#1. Learn - Naïve Bayesian Classifier
• Learn value distributions with a Naïve Bayesian Classifier (NBC)
• Use the determining set of a mined AFD as the selected features
• E.g.
  • AFD {Make, Body} ~> Model
  • P(Model = Accord | Make = Honda, Body = Coupe) = ?
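The question in the example above can be answered with a few lines of counting. The sketch below is a minimal categorical Naïve Bayes restricted to the AFD's determining set {Make, Body}. The training rows, the function names, and the smoothing constant `m` are assumptions made for illustration; the source does not say how (or whether) the classifier is smoothed.

```python
from collections import Counter, defaultdict

# Made-up complete tuples used as training data.
train = [
    {"make": "Honda", "body": "Coupe", "model": "Accord"},
    {"make": "Honda", "body": "Coupe", "model": "Accord"},
    {"make": "Honda", "body": "Sedan", "model": "Accord"},
    {"make": "Honda", "body": "Sedan", "model": "Civic"},
    {"make": "Honda", "body": "Sedan", "model": "Civic"},
    {"make": "Honda", "body": "Coupe", "model": "Civic"},
    {"make": "BMW", "body": "Convt", "model": "Z4"},
]


def train_nbc(rows, features, target):
    """Collect class counts and per-class feature-value counts."""
    class_counts = Counter(r[target] for r in rows)
    feat_counts = defaultdict(Counter)  # (feature, class) -> value counts
    domains = defaultdict(set)          # feature -> observed values
    for r in rows:
        for f in features:
            feat_counts[(f, r[target])][r[f]] += 1
            domains[f].add(r[f])
    return class_counts, feat_counts, domains


def predict_dist(nbc, features, evidence, m=1.0):
    """Return P(target = c | evidence) for every class c.
    m is an assumed additive-smoothing constant."""
    class_counts, feat_counts, domains = nbc
    total = sum(class_counts.values())
    scores = {}
    for c, n_c in class_counts.items():
        p = n_c / total  # class prior
        for f in features:
            seen = feat_counts[(f, c)][evidence[f]]
            p *= (seen + m) / (n_c + m * len(domains[f]))
        scores[c] = p
    z = sum(scores.values())
    return {c: s / z for c, s in scores.items()}


features = ["make", "body"]  # determining set of AFD {Make, Body} ~> Model
nbc = train_nbc(train, features, "model")
dist = predict_dist(nbc, features, {"make": "Honda", "body": "Coupe"})
```

Here `dist["Accord"]` plays the role of P(Model = Accord | Make = Honda, Body = Coupe); restricting the features to the AFD's determining set is the feature-selection step the slide describes.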
#1. Learn - Selectivity
• Estimated selectivity = SmplSel(Q) * SmplRatio(R) * PerInc(R)
  • SmplSel(Q) = selectivity of the rewritten query Q when issued on the sample
  • SmplRatio(R) = ratio of the original database size to the sample size
  • PerInc(R) = percentage of incomplete tuples, observed while creating the sample
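A short worked example of the product above. The numbers are invented, and the sketch assumes SmplSel(Q) is a count of matching sample tuples and PerInc(R) is expressed as a fraction, so that the result reads as an expected number of incomplete tuples.

```python
def estimated_selectivity(smpl_sel, smpl_ratio, per_inc):
    """SmplSel(Q) * SmplRatio(R) * PerInc(R), as on the slide.

    smpl_sel   -- tuples matched by the rewritten query on the sample
                  (assumed here to be a count)
    smpl_ratio -- original database size divided by sample size
    per_inc    -- fraction of tuples that are incomplete
    """
    return smpl_sel * smpl_ratio * per_inc


# Hypothetical figures: a 1,000-tuple sample of a 50,000-tuple source,
# 20 sample tuples match the rewritten query, 10% of tuples are incomplete.
est = estimated_selectivity(smpl_sel=20, smpl_ratio=50_000 / 1_000, per_inc=0.10)
```

Under these assumed figures the rewritten query would be expected to bring back on the order of 100 incomplete tuples from the full source.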
#2. Rewrite
• Get the base result (certain answers)
• Generate rewritten queries from the base result and the learned AFDs
[Figure: Rewritten Queries]
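One way to picture the rewriting step, as a sketch under stated assumptions: for the query Body Style = Convt and a learned AFD {Model} ~> Body Style, each distinct Model value seen in the base result yields one rewritten query that asks for tuples with that Model and a null Body Style. The dictionary-based query representation and the function name are hypothetical.

```python
def rewrite_queries(base_result, determining_set, constrained_attr):
    """Build one rewritten query per distinct determining-set value
    combination found in the certain answers.

    Each query is a dict of attribute -> required value, where None on the
    constrained attribute means "value is missing"."""
    combos = []
    for row in base_result:
        combo = tuple(row[a] for a in determining_set)
        if None in combo or combo in combos:
            continue  # skip incomplete or already-seen combinations
        combos.append(combo)
    return [
        {**dict(zip(determining_set, combo)), constrained_attr: None}
        for combo in combos
    ]


# Made-up certain answers for Q: (Body Style = Convt).
base = [
    {"model": "Z4", "body_style": "Convt"},
    {"model": "A4", "body_style": "Convt"},
    {"model": "Z4", "body_style": "Convt"},
]

rewritten = rewrite_queries(base, ["model"], "body_style")
```

Because the rewritten queries constrain only attributes that are present, they avoid asking the source for "all tuples with a null Body Style", which the earlier slides describe as infeasible.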
#3. Rank
• Select the top-k rewritten queries based on F-measure
• Reorder the selected queries by P
• Retrieve tuples
P = learned probability (used as precision), R = selectivity (used as recall)
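The selection and reordering steps can be sketched as follows. The slide does not give the F-measure formula, so the weighted form F = (1 + alpha) * P * R / (alpha * P + R) is an assumption here, as are the query names, the scores, and the treatment of R as a recall estimate already scaled to [0, 1].

```python
def f_measure(p, r, alpha=1.0):
    """Assumed weighted F-measure; alpha = 0 reduces to precision alone,
    alpha = 1 is the harmonic mean of p and r."""
    denom = alpha * p + r
    return (1 + alpha) * p * r / denom if denom else 0.0


def select_and_order(queries, k, alpha=1.0):
    """Pick the k rewritten queries with the highest F-measure, then issue
    them in decreasing order of estimated precision."""
    top = sorted(queries, key=lambda q: f_measure(q["p"], q["r"], alpha),
                 reverse=True)[:k]
    return sorted(top, key=lambda q: q["p"], reverse=True)


# Hypothetical rewritten queries with estimated precision p and recall r.
candidates = [
    {"name": "model=Z4", "p": 0.9, "r": 0.1},
    {"name": "model=A4", "p": 0.6, "r": 0.5},
    {"name": "model=Civic", "p": 0.3, "r": 0.9},
]

balanced = [q["name"] for q in select_and_order(candidates, k=2, alpha=1.0)]
precision_only = [q["name"] for q in select_and_order(candidates, k=2, alpha=0.0)]
```

With these invented scores, raising alpha swaps the high-precision, low-recall query for one that is expected to return more tuples, which is the precision/recall tension the ranking is meant to manage.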
#4. Explain
Other Issues: Correlated Source
Other Issues: Handling Aggregation
Empirical Evaluation: Quality
• QPIAD vs. ALLRETURNED
  • ALLRETURNED has low precision because not all tuples with missing values on the constrained attributes are relevant to the query
  • QPIAD has much higher precision than ALLRETURNED because it aims to retrieve only those tuples with missing values on the constrained attributes that are very likely to be relevant to the query
Empirical Evaluation: Efficiency
• QPIAD vs. ALLRANKED
  • The ALLRANKED approach is often infeasible because direct retrieval of null values is often not allowed
  • QPIAD achieves the same level of recall as ALLRANKED while requiring far fewer tuples to be retrieved
Empirical Evaluation: Robustness
• Robustness w.r.t. sample size
  • QPIAD is robust even when faced with a relatively small data sample
Empirical Evaluation: Extensions
• Aggregates
  • Prediction of missing values increases the fraction of queries that achieve higher levels of accuracy
  • Approximately 20% more queries achieve 100% accuracy when prediction is used
• Join
  • As alpha is increased, we obtain a higher recall without sacrificing much precision
Related Work
• Querying Incomplete Databases
  • Possible-world approaches: track the completions of incomplete tuples (Codd Tables, V-Tables, Conditional Tables)
  • Probabilistic approaches: quantify the distribution over completions to distinguish between the likelihoods of various possible answers
• Probabilistic Databases
  • Tuples are associated with an attribute describing the probability of their existence
  • However, in our work, the mediator does not have the capability to modify the underlying autonomous databases
• Query Reformulation / Relaxation
  • Aims to return similar or approximate answers to the user after returning, or in the absence of, exact answers
  • Our focus is on retrieving tuples with missing values on constrained attributes
• Learning Missing Values
  • Common imputation approaches replace missing values with the mean, the most common value, or a default value, or use kNN, association rules, etc.
  • Our work requires schema-level dependencies between attributes as well as distribution information over missing values
Contribution
• Efficiently retrieve relevant uncertain answers from autonomous sources given only limited query access patterns
• Query Rewriting
  • Retrieves answers with missing values on constrained attributes without modifying the underlying databases
• AFD-Enhanced Classifiers
• Rewriting & ranking consider the natural tension between precision and recall
  • F-Measure based ranking
• AFDs play a major role in:
  • Query Rewriting
  • Feature Selection
  • Explanations