Query Specific Ranking CSE 6392 02/27/2006 Database Exploration
Content
• Comparison of the FA and TA algorithms
• Representing the ranking problem as a geometric problem
• Query Specific Ranking
Comparison between FA and TA algorithm
• TA is faster than FA: it never needs to read deeper into the sorted lists than FA before stopping.
• TA stops as soon as the score of the hypothetical tuple is less than the score of the k-th (lowest-scoring) tuple in the top-k buffer.
• TA is a bounded-buffer algorithm: it maintains only a top-k buffer.
• FA maintains a set of candidates containing all the tuples read so far, and keeps reading until ‘k’ objects are common to the candidate sets of all the lists.
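Comparison between FA and TA
• TA is eager: as soon as it reads a tuple, it immediately looks up the tuple's remaining attribute values in order to compute its full score.
• FA has 2 phases for calculating score:
  - sort phase: read the sorted lists until ‘k’ common objects are found
  - scan phase: look up the missing attribute values of every candidate and compute its score
• Both TA and FA require the scoring function to be monotonic.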
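The two phases of FA can be sketched as follows. This is only a minimal illustration under several assumptions that are not in the slides: each attribute is stored as a dict from tuple id to value, higher values are better, the score is the sum of the attribute values, and the name `fagin_algorithm` and the sample data are hypothetical.

```python
def fagin_algorithm(lists, k, score=sum):
    """lists: one dict per attribute, mapping tuple id -> value (higher is better)."""
    # Simulated sorted access: one list per attribute, best value first.
    sorted_lists = [sorted(d.items(), key=lambda kv: -kv[1]) for d in lists]
    seen_in = [set() for _ in lists]  # candidates seen in each list
    depth = 0

    # Phase 1 (sort phase): read all lists in parallel until
    # k tuples have been seen in every list.
    for row in zip(*sorted_lists):
        depth += 1
        for i, (tid, _) in enumerate(row):
            seen_in[i].add(tid)
        if len(set.intersection(*seen_in)) >= k:
            break

    # Phase 2 (scan phase): look up every attribute of every candidate
    # and compute its full score.
    candidates = set.union(*seen_in)
    scored = [(score([d[tid] for d in lists]), tid) for tid in candidates]
    return sorted(scored, reverse=True)[:k], depth


# Hypothetical sample data: two attributes, four tuples.
a1 = {"t1": 10, "t3": 3, "t4": 2, "t2": 1}
a2 = {"t2": 6, "t4": 5, "t3": 4, "t1": 3}
fa_result, fa_depth = fagin_algorithm([a1, a2], k=1)
```

On this sample data FA has to read three rows of each list before one tuple appears in both candidate sets, even though the best tuple is at the top of the first list.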
Why does TA work?
• Stopping condition for TA is:
  Score (hypothetical tuple) < score (k-th tuple in top-k buffer)
• The hypothetical tuple is made up of the last value read from each sorted list.
• The idea is that, by the monotonic property, the score of any unseen tuple cannot exceed the score of the hypothetical tuple.
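The stopping condition can be sketched in code as below. This is a minimal sketch, not a definitive implementation, and it assumes things the slides do not state: each attribute is stored as a dict from tuple id to value, higher values are better, the score is the sum of the attribute values, and the name `threshold_algorithm` and the sample data are hypothetical.

```python
import heapq


def threshold_algorithm(lists, k, score=sum):
    """lists: one dict per attribute, mapping tuple id -> value (higher is better)."""
    # Simulated sorted access: one list per attribute, best value first.
    sorted_lists = [sorted(d.items(), key=lambda kv: -kv[1]) for d in lists]
    top_k = []    # min-heap of (score, tid), never holds more than k entries
    seen = set()
    depth = 0

    for row in zip(*sorted_lists):
        depth += 1
        for tid, _ in row:
            if tid in seen:
                continue
            seen.add(tid)
            # Eager scoring: look up all attributes of the tuple right away.
            s = score([d[tid] for d in lists])
            if len(top_k) < k:
                heapq.heappush(top_k, (s, tid))
            elif s > top_k[0][0]:
                heapq.heapreplace(top_k, (s, tid))

        # Hypothetical tuple: the last value read from each list.
        threshold = score([value for _, value in row])
        # Stop when the hypothetical tuple scores below the k-th buffered tuple.
        if len(top_k) == k and threshold < top_k[0][0]:
            break

    return sorted(top_k, reverse=True), depth


# Hypothetical sample data: two attributes, four tuples.
a1 = {"t1": 10, "t3": 3, "t4": 2, "t2": 1}
a2 = {"t2": 6, "t4": 5, "t3": 4, "t1": 3}
ta_result, ta_depth = threshold_algorithm([a1, a2], k=1)
```

After the second row the hypothetical tuple scores 3 + 5 = 8, which is below the buffered score of 13 for t1, so the loop stops without reading the rest of either list.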
Closing points on TA and FA
• FA stops only when it gets ‘k’ common objects (intersections) in the sets of candidates.
• TA stops based on a bound on the unseen tuples: the score of the hypothetical tuple.
• Therefore, there is no way FA can stop earlier than TA.
• Hence, TA is instance optimal.
Query Specific Ranking
• The ranking functions we have discussed so far depend on the assumption of a total ordering of attribute values.
• E.g. total ordering of price:
  - high price is bad
  - low price is good
• In reality, this is not always true.
Query Specific Ranking
• Different people will have a different ideal price in mind.
• E.g. for one person, an ideal restaurant will be: price = $20 and capacity = 100.
• In this case, the ranking function can be:
  Score(<P, C>) = 5*|20-P| + 10*|100-C|
• A lower score is better here: the score is the weighted distance from the ideal restaurant.
Query Specific Ranking
• The above ranking function is more realistic than a ranking function based on total ordering.
• But the above ranking function is not monotonic.
• How can we find the top-k restaurants in this case without looking at the whole data set?
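A quick check makes the loss of monotonicity concrete. The sketch below evaluates the restaurant score at three prices with capacity held at 100; the function name `score` and the chosen prices are illustrative assumptions.

```python
def score(price, capacity):
    # Ranking function from the restaurant example (ideal: $20, capacity 100).
    return 5 * abs(20 - price) + 10 * abs(100 - capacity)


# Hold capacity at the ideal value and raise the price step by step.
prices = [10, 20, 30]
scores = [score(p, 100) for p in prices]
```

As the price rises, the score first falls and then rises again, so the score is neither non-decreasing nor non-increasing in price.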
Solution
• Assume the data set is sorted on all the attributes of interest.
• First, create transformed attributes from the original attributes involved in the ranking function, such that the ranking function is monotonic in the transformed attributes.
• Second, simulate sorted access on the transformed attributes.
Transformed attributes
• Consider the restaurant example where: Score(<P, C>) = 5*|20-P| + 10*|100-C|
• Transformed attributes are:
  - ∆p = |20-P|, the absolute difference of the price from the ideal price
  - ∆c = |100-C|, the absolute difference of the capacity from the ideal capacity
• Suppose tid1 = <$30, 120> then <∆p, ∆c> = <10, 20>
  tid2 = <$15, 85> then <∆p, ∆c> = <5, 15>
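The transformation can be written out as a short sketch. The names `transform` and `score_transformed` and the constants are hypothetical; the numbers are the two tuples from the slide.

```python
IDEAL_PRICE = 20
IDEAL_CAPACITY = 100


def transform(price, capacity):
    # Map the original attributes to their distances from the ideal values.
    return abs(price - IDEAL_PRICE), abs(capacity - IDEAL_CAPACITY)


def score_transformed(dp, dc):
    # Same score as 5*|20-P| + 10*|100-C|, now a weighted sum that can
    # only grow when dp or dc grows, i.e. it is monotonic in (dp, dc).
    return 5 * dp + 10 * dc


tid1 = transform(30, 120)
tid2 = transform(15, 85)
```

With these values tid2 is closer to the ideal restaurant on both transformed attributes, so it also gets the lower (better) score.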
Simulating sorted access
• Achieving monotonicity is just part of the problem. We also need sorted access on the transformed attributes (∆p and ∆c).
• Suppose the data is presorted on the ‘price’ attribute.
• Without re-sorting the whole dataset, we can go directly to the ‘sweet spot’ (i.e. price = $20 and capacity = 100) using a B+ tree index.
• From this point, do 2 walks in opposite directions and merge them to obtain ∆p and ∆c in sorted order.
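The two-way walk can be sketched for a single attribute as follows. This is an illustration only: a binary search over an in-memory sorted list (`bisect_left`) stands in for the B+ tree lookup, and the name `sorted_by_distance` and the sample prices are assumptions.

```python
from bisect import bisect_left


def sorted_by_distance(sorted_values, target):
    """Yield (distance, value) pairs in ascending distance from target."""
    # Jump to the sweet spot: index of the first value >= target.
    right = bisect_left(sorted_values, target)
    left = right - 1
    n = len(sorted_values)

    # Walk left and right from the sweet spot, always emitting the nearer value.
    while left >= 0 or right < n:
        d_left = target - sorted_values[left] if left >= 0 else None
        d_right = sorted_values[right] - target if right < n else None
        if d_right is None or (d_left is not None and d_left <= d_right):
            yield d_left, sorted_values[left]
            left -= 1
        else:
            yield d_right, sorted_values[right]
            right += 1


# Hypothetical prices, already sorted; the ideal price is $20.
prices = [5, 15, 18, 30, 45]
walk = list(sorted_by_distance(prices, 20))
```

Because the walk is a generator, a caller such as TA can pull only as many ∆p values as it needs and never touch the rest of the list.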
Adding Selection
• This explains how hard conditions are added to a ranking function and handled.
• E.g. look for restaurants in Arlington
• location = “Arlington” is a hard condition
Handling hard conditions
• The query will look like this:
  Select top[10]
  From restaurants
  Where location = “Arlington”
  Order by 5*abs(120 - price)
• How to solve this query?
Handling hard conditions
• The first method is to do selection first, then do ranking.
• This method is not the best method for the following reasons:
  - If selection produces a big result, it defeats the purpose of doing ranking.
  - If selection produces a small result, then doing ranking on it is overkill.
  - The raw data is presorted, and doing a selection first on this raw data will destroy the order of tuples. TA requires data to be presorted.
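Handling hard conditions
• The second method is to integrate selection as part of ranking.
• Score (<L,P,C>) = If L = “Arlington” then 5*|20-P| + 10*|100-C| else +∞
• Because a lower score is better for this ranking function, a tuple that fails the hard condition must get the worst possible score so that it never enters the top-k.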
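A combined scoring function of this kind might look like the sketch below. It assumes that lower scores rank first, so tuples that fail the hard condition get an infinite score and sort last; the names `score` and `top_k` and the sample restaurants are hypothetical. The sketch scans every tuple, so it only illustrates the scoring function, not early termination.

```python
import heapq
import math


def score(location, price, capacity, wanted="Arlington"):
    # Hard condition folded into the ranking function (assumption:
    # lower is better, so a failing tuple gets the worst possible score).
    if location != wanted:
        return math.inf
    return 5 * abs(20 - price) + 10 * abs(100 - capacity)


def top_k(restaurants, k):
    # restaurants: (name, location, price, capacity) tuples.
    return heapq.nsmallest(k, restaurants, key=lambda r: score(*r[1:]))


# Hypothetical sample data.
restaurants = [
    ("A", "Arlington", 30, 120),
    ("B", "Dallas", 20, 100),
    ("C", "Arlington", 15, 85),
]
best = [name for name, *_ in top_k(restaurants, 2)]
```

Restaurant B matches the ideal price and capacity exactly, yet it is excluded from the top 2 because it fails the location condition.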
Handling hard conditions
• Now we are no longer dealing with numeric values alone.
• Because of the condition location = “Arlington”, the ranking function is no longer defined on numeric data only but also on categorical data.
• How do we deal with ranking functions that involve categorical data?