320 likes | 391 Vues
Probabilistic Ranking of Database Query Results. Surajit Chaudhuri, Microsoft Research Gautam Das, Microsoft Research Vagelis Hristidis, Florida International University Gerhard Weikum, MPI Informatik. Presented by Weimin He CSE@UTA. Outline. Motivation Problem Definition
E N D
Probabilistic Ranking of Database Query Results Surajit Chaudhuri, Microsoft Research Gautam Das, Microsoft Research Vagelis Hristidis, Florida International University Gerhard Weikum, MPI Informatik Presented by Weimin He CSE@UTA
Outline Motivation Problem Definition System Architecture Construction of Ranking Function Implementation Experiments Conclusion and open problems Weimin He CSE@UTA
Motivating example Realtor DB: Table D=(TID, Price , City, Bedrooms, Bathrooms, LivingArea, SchoolDistrict, View, Pool, Garage, BoatDock) SQL query: Select * From D Where City=Seattle AND View=Waterfront Weimin He CSE@UTA
Motivation • Many-answers problem • Two alternative solutions: Query reformulation Automatic ranking • Apply probabilistic model in IR to DB tuple ranking Weimin He CSE@UTA
Problem Definition Given a database table D with n tuples {t1, …, tn} over a set of m categorical attributes A = {A1, …, Am} and a query Q: SELECT * FROM D WHERE X1=x1 AND … AND Xs=xs where each Xi is an attribute from A and xi is a value in its domain. The set of attributes X ={X1, …, Xs} is known as the set of attributes specified by the query, while the set Y = A – X is known as the set of unspecified attributes Let be the answer set of Q How to rank tuples in S and return top-k tuples to the user ? Weimin He CSE@UTA
System Architecture Weimin He CSE@UTA
Intuition for Ranking Function • Select * From D Where City=“Seattle” And View=“Waterfront” Score of a Result Tuple t depends on • Global Score: Global Importance of Unspecified Attribute Values • E.g., Homes with good school districts are globally desirable • Conditional Score: Correlations between Specified and Unspecified Attribute Values • E.g., Waterfront BoatDock Weimin He CSE@UTA
Probabilistic Model in IR • Bayes’ Rule • Product Rule • Document t, Query QR: Relevant document setR = D - R: Irrelevant document set Weimin He CSE@UTA
Adaptation of PIR to DB • Tuple t is considered as a document • Partition t into t(X) and t(Y) • t(X) and t(Y) are written as X and Y • Derive from initial scoring function until final ranking function is obtained Weimin He CSE@UTA
Preliminary Derivation Weimin He CSE@UTA
Limited Independence Assumptions • Given a query Q and a tuple t, the X (and Y) values within themselves are assumed to be independent, though dependencies between the X and Y values are allowed Weimin He CSE@UTA
Continuing Derivation Weimin He CSE@UTA
Workload-based Estimation of Assume a collection of “past” queries existed in system Workload W is represented as a set of “tuples” Given query Q and specified attribute set X, approximate R as all query “tuples” in W that also request for X All properties of the set of relevant tuple set R can be obtained by only examining the subset of the workload that caontains queries that also request for X Weimin He CSE@UTA
Final Ranking Function Weimin He CSE@UTA
Pre-computing Atomic Probabilities in Ranking Function Relative frequency in W Relative frequency in D (#of tuples in W that conatains x, y)/total # of tuples in W (#of tuples in D that conatains x, y)/total # of tuples in D Weimin He CSE@UTA
Example for Computing Atomic Probabilities • Select * From D Where City=“Seattle” And View=“Waterfront” • Y={SchoolDistrict, BoatDock, …} • D=10,000 W=1000 • W{excellent}=10 • W{waterfront &yes}=5 • p(excellent|W)=10/1000=0.1 • p(excellent|D)=10/10,000=0.01 • p(waterfront|yes,W)=5/1000=0.005 • p(waterfront|yes,D)=5/10,000=0.0005 Weimin He CSE@UTA
Indexing Atomic Probabilities {AttName, AttVal, Prob} B+ tree index on (AttName, AttVal) {AttName, AttVal, Prob} B+ tree index on (AttName, AttVal) {AttNameLeft, AttValLeft, AttNameRight, AttValRight, Prob} B+ tree index on (AttNameLeft, AttValLeft, AttNameRight, AttValRight) {AttNameLeft, AttValLeft, AttNameRight, AttValRight, Prob} B+ tree index on (AttNameLeft, AttValLeft, AttNameRight, AttValRight) Weimin He CSE@UTA
Scan Algorithm Preprocessing - Atomic Probabilities Module • Computes and Indexes the Quantities P(y | W), P(y | D), P(x | y, W), and P(x | y, D) for All Distinct Values x and y Execution • Select Tuples that Satisfy the Query • Scan and Compute Score for Each Result-Tuple • Return Top-K Tuples Weimin He CSE@UTA
Beyond Scan Algorithm • Scan algorithm is Inefficient Many tuples in the answer set • Another extreme Pre-compute top-K tuples for all possible queries Still infeasible in practice • Trade-off solution Pre-compute ranked lists of tuples for all possible atomic queries At query time, merge ranked lists to get top-K tuples Weimin He CSE@UTA
Two kinds of Ranked List • CondList Cx {AttName, AttVal, TID, CondScore} B+ tree index on (AttName, AttVal, CondScore) • GlobList Gx {AttName, AttVal, TID, GlobScore} B+ tree index on (AttName, AttVal, GlobScore) Weimin He CSE@UTA
Index Module Weimin He CSE@UTA
List Merge Algorithm Weimin He CSE@UTA
Experimental Setup • Datasets: • MSR HomeAdvisor Seattle (http://houseandhome.msn.com/) • Internet Movie Database (http://www.imdb.com) • Software and Hardware: • Microsoft SQL Server2000 RDBMS • P4 2.8-GHz PC, 1 GB RAM • C#, Connected to RDBMS through DAO Weimin He CSE@UTA
Quality Experiments • Conducted on Seattle Homes and Movies tables • Collect a workload from users • Compare Conditional Ranking Method in the paper with the Global Method [CIDR03] Weimin He CSE@UTA
Quality Experiment-Average Precision • For each query Qi , generate a set Hi of 30 tuples likely to contain a good mix of relevant and irrelevant tuples • Let each user mark 10 tuples in Hi as most relevant to Qi • Measure how closely the 10 tuples marked by the user match the 10 tuples returned by each algorithm Weimin He CSE@UTA
Quality Experiment- Fraction of Users Preferring Each Algorithm • 5 new queries • Users were given the top-5 results Weimin He CSE@UTA
Performance Experiments • Datasets • Compare 2 Algorithms: • Scan algorithm • List Merge algorithm Weimin He CSE@UTA
Performance Experiments – Pre-computation Time Weimin He CSE@UTA
Performance Experiments – Execution Time Weimin He CSE@UTA
Performance Experiments – Execution Time Weimin He CSE@UTA
Performance Experiments – Execution Time Weimin He CSE@UTA
Conclusion and Open Problems • Automatic ranking for many-answers • Adaptation of PIR to DB • Mutiple-table query • Non-categorical attributes Weimin He CSE@UTA