300 likes | 384 Vues
Learn about optimizing Boolean and ranking queries for database searches, including goal functions, evaluation mechanisms, and search strategies. Discover how to combine Boolean and ranking queries effectively to find top results. Dive into the nature of Boolean and ranking as K-constrained optimization and explore techniques to improve query evaluation methods.
E N D
Boolean + Ranking: Querying a Database by K-Constrained Optimization Zhen Zhang Joint work with: Seung-won Hwang, Kevin C. Chang, Min Wang, Christian A. Lang, Yuan-chi Chang
Information retrieval Traditional databases Ranking query: Top 5 ranked by gpa Boolean query: dept = CS and year = 2 Many queries naturally combine Boolean and ranking Find top answers + B: dept = CS and year = 2 Qualifying constraint R: gpa Quantifying function Database applications on Web
Motivating scenarios • Data retrieval: • Find houses in certain price range with good price/sqrft ratio • Data analysis: • Find products with highest sale increase in consecutive years Selecth.addressfrom House h Whereh.price ≤ 200k ν h.price ≥ 400k Order byh.size/|h.price-300k|Limit1 Selecth.addressfrom House h, CrimeRate c Whereh.price ≤ 200k ν h.price ≥ 400k andh.zipcode = c.zipcode Order byh.size/|h.price-300k| *c.crimerate-1Limit10 Selectitemidfrom Sales s1, Sales s2 Wheres1.itemid = s2.itemid and s2.year – s1.year = 1 Order by s2.sale – s1.sale Limit10
Boolean + Ranking form a coherent goal function • Boolean B + Ranking R = Goal function G For a tuple t R(t) if B(t) is true 0if B(t) isfalse G(t) = B(t)*R(t) = (ie, lowest score)
G D The nature of Boolean + Ranking is K-constrained optimization query • Optimize goal function G over database D h.size/|h.price-300k| [h.price ≤ 200k ν h.price ≥ 400k ] Goal function G Database D
What is the query evaluation mechanism? Boolean query Ranking query + How to answer?
… … … D D B R Goal function G Current techniques lack of global search mechanism • If evaluated as separate operators • If search by an overall goal function G as a ranking function Boolean query B Ranking query R Boolean query B Ranking query R • Current techniques optimize only condition-by-condition • Current techniques restrict G to be monotonic
G OPT* D D Our thesis:Evaluate query as its nature suggests! Function optimization of G Optimize G over D Discrete state search over D
b1 0-250 250-600 b2 b3 0-100 100-250 250-350 350-600 b6 b7 ……… 2 5 1 a1 0-3000 3000-4500 a2 a3 0-1500 1500-3000 3000-4000 4000-6000 a6 a7 ……… 5 1 We view compound index as discrete space Price (k) 600 1 350 2 5 250 4 3 100 6 size 1500 3000 4000 4500
We view compound index as discrete space Price (k) Mij =(ai, bj) b1 0-250 250-600 600 b2 b3 M11 1 350 0-100 100-250 250-350 350-600 b6 2 b7 5 M22 M32 M23 M33 ……… 250 2 5 1 4 3 … 100 … M55 M75 M56 M66 M77 M67 M76 6 size 1500 3000 4000 4500 4 2 5 1 a1 0-3000 3000-4500 a2 a3 0-1500 1500-3000 3000-4000 4000-6000 a6 a7 ……… 5 1
M22 M32 M23 M33 We view compound index as discrete space conceptually, combined space Price (k) Mij =(ai, bj) b1 0-250 250-600 600 b2 b3 M11 1 350 0-100 100-250 250-350 350-600 b6 2 b7 5 ……… 250 2 5 1 4 3 100 … M55 M75 M56 M66 M77 M67 M76 6 size 1500 3000 4000 4500 4 2 5 1 a1 0-3000 3000-4500 a2 a3 0-1500 1500-3000 3000-4000 4000-6000 a6 a7 ……… 5 1
How to perform the search in the space? • What is the search mechanism? • How to conceptually view the index space of D for search • How to guide the search? • How to use function G to focus the search
We encode as A* because it’s optimal • What A* is: Finding the shortest path • Why we choose: Completeness and optimality with proper heuristics • Complete: guarantee to find shortest path • Optimal: visit least number of nodes origin 3 5 1 5 2 1 7 9 6 destination
Encoding our problem into shortest path is challenging • How to encode: • a tuple a path? • score of tuple distance of path?
Therefore, we encode K-constrained opt. as: • How to encode a tuple to a path? • Adding a virtual target t* only reachable through tuples • How to encode maximal tuple with minimal path? • Quality of path depends solely on the tuple it passes by • For tuple state t D(t, t*) = - G(t) • For two states r, u D(r, u) = 0 M11 0 0 M22 M32 M23 M33 0 0 … M66 M67 M76 M75 M56 M77 M55 0 0 4 2 5 1 - G(1) - G(4) t*
We use function opt. to sketch the landscape of G • Function optimization measures quality of states • Function optimization enables: • 1. How to define heuristics? • 2. How to configure space? • 3. Where to start the search?
1. Define admissible heuristics: Measure tightest upper bound • To guarantee completeness • A* requires admissible heuristics, ie, estimate optimistically • To ensure admissible heuristics • Function optimization gives tightest upper bound • Analytical approaches • Numeric analysis package H(region) = OPTMAX(G, region) ie, maximal value of G in the region
M22 M32 M23 M33 2. Configure descending space: disconnect uphills • To guarantee optimality • A* requires descending heuristics • To ensure descending heuristics • Remove uphill links M11 … M55 M75 M56 M66 M77 M67 M76 4 2 5 1
M22 M32 M23 M33 Find right start point: Start from local optima • To guarantee correctness • Every tuple state must be reachable from start states • Taking only downhills requires start with high points • To ensure reachability • Initial states should contain all local optima M11 … M55 M75 M56 M66 M77 M67 M76 4 2 5 1
Putting together: Executing A* on the configured space top-down M11 M22 M32 M23 M33 … M67 M76 M57 M55 M75 M56 M66 M77 4 2 5 1 • Search is implemented as priority queue driven traversal
Putting together: Executing A* on the configured space top-down M11 • Bottom-up approach is always better than top-down M22 M32 M23 M33 … M57 M67 M76 M55 M75 M56 M66 M77 4 2 5 1 bottom-up M11 M22 M32 M23 M33 … M57 M67 M76 M55 M75 M56 M66 M77 2 5 1 4
Experiments • Comparison vs. • Boolean then ranking • Ranking then boolean • Metrics: node accessed = Nl + Nt • Settings: • Benchmark queries over real dataset • Controlled queries over synthetic dataset
BR_clustered BR_unclustered OPT* Benchmark queries • Datasets: • 19,706 real estate listing crawled online • Queries • Q1: size * bedrms/| price-450k| : [40k<=price<=50k] • Q2: size * ebedrms / |price-350k| : [price<400k^size>4000] • Q3: size/price : [bedrms=3 ν bedrms=4] Q1 Q2 Q3
Controlled queries • Datasets • Three randomly generated datasets of 100k points • Uniform, gaussian, logvariatenormal • Queries • Linear average queries: (eg, 0.4*a + 0.6*b) • Nearest neighbor queries: (eg, (x-3)^2 + (y-4)^2) • Join queries: (0.4*R.a + 0.6*S.b: R.c=R.d)
Conclusion • Problem • Study K-constrained optimization queries as boolean + ranking • Abstraction • Encode K-constrained optimization into shortest path problem • Framework • Develop OPT* to process K-constrained optimization
Thank you! Questions?
How to implement function optimization? • How do we compare with RankSQL? • If bottom-up is always better, why consider top-down • Computing upper bound for each region is costly • Random vs. sequential I/O • Assuming indices on every attribute? • Materialize state space for every query? • Exponential number of states when attribute grows • Not every attribute has index on it • Selective choose the right index (attribute) to use • We do perform experiment to study how the system scale with #attr • Your algorithm is not optimal because you change the space