Answering Top-k Queries Using Views
By Gautam Das, Dimitrios Gunopulos, Nick Koudas, Dimitris Tsirogiannis
Presented by Raju Buchi and Poornima Ancha
Agenda
• Introduction
• Views
• Related Work
• Preliminaries
• Problems Discussed
• Algorithm LPTA
• View Selection Problem
• Experimental Results
Introduction
• Answering top-k queries
  • An active research topic
  • Quickly retrieve the k highest-ranking tuples under monotone ranking functions defined on attributes of the underlying relations
• Algorithms
  • Threshold Algorithm (TA) by Fagin et al.
  • Independently by Guntzer et al.
  • Nepal et al.
Views
• Materialized views
  • A database table that contains the results of a previously asked query, actually constructed and stored.
• Problem discussed
  • Find efficient methods of answering a query using a set of previously defined materialized views over the database.
• Why views?
  • Relevance to a variety of data management problems.
  • Promise of increased performance.
  • Views are materialized (incurring a space overhead) in the hope of gaining performance for some queries.
Views
• Views do not specify any selection conditions on the attributes they aim to rank.
• Example (top-k views over relation R):
  • View1 (V1): a top-5 query ranked by f1 = 2x1 + 5x2
  • View2 (V2): a top-3 query ranked by f2 = x2 + 2x3
Views: Example (contd.)
• Given a top-2 query defined using the function f3 = 3x1 + 10x2 + 5x3, we can apply a standard top-k algorithm (e.g., TA) to the data in R and obtain the answer to the query.
• Using views instead?
  • Feasibility
  • Guarantee of an answer
  • Speed of using R directly vs. using views
Related Work
• Multimedia context: uses ordered lists
• Threshold Algorithm (TA):
  • Requires the scoring function to be monotonic, i.e., for tuples t and u, if t[i] ≤ u[i] for all 1 ≤ i ≤ m, then ScoreQ(t) ≤ ScoreQ(u).
  • Requires that each attribute has an index mechanism that allows all tids to be accessed in sorted order.
  • A single random access is required to resolve all attributes of a tid.
• The paper focuses on additive (monotonic) scoring functions, where ScoreQ(t) = w1·t[1] + w2·t[2] + … + wm·t[m]
Related Work
• Variants:
  • TA-Sorted: lists are always accessed sequentially and no random accesses are performed.
• PREFER [Hristidis et al.]:
  • Stores multiple copies of R.
  • Answers a new query using only the one stored copy that is closest to that query.
Ranking Queries
• Consider a relation R with m numeric attributes (X1, X2, …, Xm).
• Dom_i = [lb_i, ub_i] is the domain of the i-th attribute.
• A tuple t is viewed as a numeric vector t = (t[1], t[2], …, t[m]).
• Top-k ranking queries in SQL-like syntax:
  • SELECT TOP[k] FROM R WHERE RangeQ ORDER BY ScoreQ
• Expressed as a triple Q = (ScoreQ, k, RangeQ):
  • ScoreQ: a function that assigns a numeric score to any tuple t.
  • RangeQ: a Boolean function that defines a selection condition on the tuples of R.
• The semantics require that the system retrieve the k tuples with the top scores among those satisfying the selection condition.
Ranking Views
• Materialized ranking view (V):
  • The materialized result of a previously executed top-k query Q, with its tuples ordered according to the scoring function ScoreQ.
• For a query Q' = (ScoreQ', k', RangeQ'):
  • The corresponding materialized ranking view V' is a set of k' pairs (tid, ScoreQ'(tid)), ordered by decreasing value of ScoreQ'(tid).
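A minimal sketch of how a ranking view could be materialized from the query triple. The names `RankingQuery` and `materialize` are illustrative, not from the paper. The three rows of R are the only ones that can be read off the deck's LPTA example, so the views come out shorter than a true top-5 or top-3.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence, Tuple

Row = Sequence[float]


@dataclass
class RankingQuery:
    """Q = (ScoreQ, k, RangeQ), following the triple notation of the slides."""
    score: Callable[[Row], float]
    k: int
    in_range: Callable[[Row], bool] = lambda t: True  # '*' = no range condition


def materialize(relation: Dict[int, Row], q: RankingQuery) -> List[Tuple[int, float]]:
    """Run q by a full scan and return its (tid, score) pairs, best first."""
    scored = [(tid, q.score(t)) for tid, t in relation.items() if q.in_range(t)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:q.k]


# Assumption: only tuples 4, 6 and 7 of R are recoverable from the slides.
R = {4: (80, 22, 90), 6: (12, 55, 82), 7: (16, 99, 42)}
V1 = materialize(R, RankingQuery(lambda t: 2 * t[0] + 5 * t[1], k=5))  # f1
V2 = materialize(R, RankingQuery(lambda t: t[1] + 2 * t[2], k=3))      # f2
print(V1)  # [(7, 527), (6, 299), (4, 270)]
print(V2)  # [(6, 219), (4, 202), (7, 183)]
```

Each view stores only tids and scores, so recovering the attribute values of a tid still needs a random access on R.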
Problems Discussed
• Problem 1: Top-k query answering using views
  • Given a set U of views and a query Q, obtain an answer to Q by combining all the information conveyed by the views in U.
  • Solution: an algorithm named LPTA.
• Problem 2: View selection
  • Given a collection of views V = {V1, V2, …, Vr} that includes the base views (thus r ≥ m) and a query Q, determine the most efficient subset U ⊆ V on which to execute Q.
  • Such a subset U is provided as input to LPTA.
  • The goal is to identify a set of views that can provide an answer to the query and, if possible, provide it faster than running TA on the base views.
LPTA: Linear Programming Adaptation of the Threshold Algorithm
• An adaptation of TA that answers top-k queries using multiple ranking views
• Requires the scoring functions of the query and the views to be linear and additive
• Sorted access on pairs (tid, scoreQ(tid))
• Views and queries are of the form V' = (ScoreV', n, *) and Q = (ScoreQ, k, *), respectively
• Outline:
  • Pseudo code
  • Example
  • General approach
LPTA: Linear Programming Adaptation of the Threshold Algorithm
• Pseudo code (for the running example with views V1, V2 and query function f3):
  • Initialize the top-k buffer to empty.
  • Retrieve tids from views V1 and V2 in lock-step, in order of decreasing score.
  • Retrieve the corresponding tuple by random access on R.
  • Compute its score according to f3 and update the top-k buffer so that it holds the largest scores seen.
  • Check the stopping condition.
  • Once the stopping condition is satisfied, the top-k buffer holds the result.
LPTA: Linear Programming Adaptation of the Threshold Algorithm
• Stopping condition:
  • After the d-th iteration, let the pair read from V1 be (tid_1^d, s_1^d) and the pair read from V2 be (tid_2^d, s_2^d), and let the minimum score in the top-k buffer be top-kmin.
  • At this point the unseen tuples have to satisfy the following inequalities (domain of each attribute of R = [1, 100]):
    • 0 ≤ x1, x2, x3 ≤ 100
    • 2x1 + 5x2 ≤ s_1^d
    • x2 + 2x3 ≤ s_2^d
  • These inequalities define a convex region in 3-d space.
  • unseenmax is the solution of the linear program that maximizes f3 = 3x1 + 10x2 + 5x3 over this region; LPTA stops when unseenmax ≤ top-kmin.
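Written out, the linear program for the running example looks as follows. The upper bound on the second line is an observation added here, not taken from the slides: it holds because f3 happens to be a non-negative combination of the two view functions.

```latex
\[
\begin{aligned}
\text{unseen}_{\max} \;=\; \max_{x_1, x_2, x_3}\quad & 3x_1 + 10x_2 + 5x_3 \\
\text{subject to}\quad & 2x_1 + 5x_2 \le s_1^d, \\
                       & x_2 + 2x_3 \le s_2^d, \\
                       & 0 \le x_1, x_2, x_3 \le 100 .
\end{aligned}
\]
\[
3x_1 + 10x_2 + 5x_3 \;=\; 1.5\,(2x_1 + 5x_2) + 2.5\,(x_2 + 2x_3)
\;\le\; 1.5\, s_1^d + 2.5\, s_2^d .
\]
```

The bound is attained whenever some point inside the box makes both view constraints tight. With s_1^d = 299 and s_2^d = 202 it evaluates to 448.5 + 505 = 953.5, the value reported in the deck's example.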
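LPTA: Linear Programming Adaptation of the Threshold Algorithm
• Example (top-k query answering using views), first iteration
  • Query = (f3, k, *) with f3 = 3x1 + 10x2 + 5x3
  • View1 (V1), a top-5 query on f1 = 2x1 + 5x2, as (tid, score): (7, 527), (6, 299), …
  • View2 (V2), a top-3 query on f2 = x2 + 2x3, as (tid, score): (6, 219), (4, 202), …
  • Sorted access returns (7, 527) from V1 and (6, 219) from V2
  • Random access on R: tid 6 = (12, 55, 82), tid 7 = (16, 99, 42)
  • {(tid, f3 score)} = {(7, 1248), (6, 996)}
  • The linear program with s_1^d = 527 and s_2^d = 219 gives unseenmax = 1338, which exceeds the lowest buffered score (996 for the top-2 query), so LPTA continues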
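LPTA: Linear Programming Adaptation of the Threshold Algorithm
• Example (top-k query answering using views), second iteration
  • Query = (f3, k, *) with f3 = 3x1 + 10x2 + 5x3
  • View1 (V1), a top-5 query on f1 = 2x1 + 5x2, as (tid, score): (7, 527), (6, 299), …
  • View2 (V2), a top-3 query on f2 = x2 + 2x3, as (tid, score): (6, 219), (4, 202), …
  • Sorted access returns (6, 299) from V1 and (4, 202) from V2
  • Random access on R: tid 4 = (80, 22, 90), tid 6 = (12, 55, 82)
  • {(tid, f3 score)} = {(6, 996), (4, 910)}
  • The linear program with s_1^d = 299 and s_2^d = 202 gives unseenmax = 953.5 ≤ top-kmin, so LPTA stops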
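The two iterations of the example can be replayed with a small sketch of LPTA. Several things here are assumptions rather than the authors' implementation: R is cut down to the three tuples visible on the slides, k = 2 is taken from the earlier top-2 query, and the linear program is solved by brute-force vertex enumeration (workable only for tiny problems like this one) instead of a real LP solver. All function names are hypothetical.

```python
from itertools import combinations


def solve_square(a, b):
    """Solve a small n x n linear system by Gauss-Jordan; None if singular."""
    n = len(b)
    m = [list(map(float, row)) + [float(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            return None
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def lp_max(c, a, b):
    """Maximize c.x subject to a.x <= b by checking every vertex of the polytope."""
    best = float("-inf")
    for rows in combinations(range(len(a)), len(c)):
        x = solve_square([a[i] for i in rows], [b[i] for i in rows])
        if x is None:
            continue
        feasible = all(
            sum(ai * xi for ai, xi in zip(a[i], x)) <= b[i] + 1e-9
            for i in range(len(a))
        )
        if feasible:
            best = max(best, sum(ci * xi for ci, xi in zip(c, x)))
    return best


def unseen_max(query_w, view_w, last_scores, lo=0.0, hi=100.0):
    """Upper bound on the query score of any tuple not yet read from the views."""
    a, b = [list(w) for w in view_w], list(last_scores)
    for i in range(len(query_w)):
        unit = [1.0 if j == i else 0.0 for j in range(len(query_w))]
        a += [unit, [-u for u in unit]]  # x_i <= hi and -x_i <= -lo
        b += [hi, -lo]
    return lp_max(query_w, a, b)


def lpta(relation, views, view_w, query_w, k):
    """Return (top-k list, number of lock-step iterations used)."""
    seen, top = {}, []
    depth = min(len(v) for v in views)
    for d in range(depth):
        last_scores = []
        for view in views:
            tid, s = view[d]                      # sorted access on the view
            last_scores.append(s)
            if tid not in seen:                   # random access on R
                seen[tid] = sum(w * x for w, x in zip(query_w, relation[tid]))
        top = sorted(seen.items(), key=lambda p: p[1], reverse=True)[:k]
        if len(top) == k and unseen_max(query_w, view_w, last_scores) <= top[-1][1]:
            return top, d + 1
    return top, depth  # views exhausted before the stopping condition held


# Assumption: only tuples 4, 6 and 7 of R are recoverable from the slides.
R = {4: (80, 22, 90), 6: (12, 55, 82), 7: (16, 99, 42)}
V1 = [(7, 527), (6, 299), (4, 270)]   # ordered by f1 = 2x1 + 5x2
V2 = [(6, 219), (4, 202), (7, 183)]   # ordered by f2 = x2 + 2x3
VIEW_W, QUERY_W = [[2, 5, 0], [0, 1, 2]], [3, 10, 5]

answer, iterations = lpta(R, [V1, V2], VIEW_W, QUERY_W, k=2)
print(answer, iterations)  # [(7, 1248), (6, 996)] 2
```

If the views run out before the stopping condition holds, the returned buffer is only guaranteed to be correct when the views cover every tuple of R, as they do in this cut-down example.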
LPTA: Linear Programming Adaptation of the Threshold Algorithm
[Figure: general approach for a top-1 query Q over R(X1, X2). Unit square with corners O=(0,0), P=(1,0), R=(1,1), T=(0,1); views V1 and V2; the first tuples tid_1^1 and tid_2^1 read from each view; the stopping condition.]
LPTA: Linear Programming Adaptation of the Threshold Algorithm
[Figure: the d-th iteration. Query Q with fQ = 3x1 + 10x2 + 5x3; View1 (V1) with fV1 = 2x1 + 5x2; View2 (V2) with fV2 = x2 + 2x3. Unseen tuples satisfy 0 ≤ x1, x2, x3 ≤ 100, 2x1 + 5x2 ≤ s_1^d and x2 + 2x3 ≤ s_2^d. LPTA stops when unseenmax ≤ top-kmin.]
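LPTA: Linear Programming Adaptation of the Threshold Algorithm
[Figure: general approach for a top-1 query Q over R(X1, X2), continued. Unit square with corners O=(0,0), P=(1,0), R=(1,1), T=(0,1); views V1 and V2; tuples tid_1^1, tid_1^2 read from V1 and tid_2^1, tid_2^2 read from V2; the stopping condition.]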
TA vs. LPTA
• LPTA essentially becomes TA when the set of views U is equal to the set of base views.
• In terms of execution cost, both perform sequential as well as random accesses.
• Execution efficiency: I/O operations play the significant role; they overshadow the cost of CPU operations such as updating the top-k buffer and testing the stopping condition.
• The two kinds of access are highly correlated: every sequential access incurs a random access.
• Determining factor:
  • If d is the number of lock-step iterations and r is the number of views, the running cost is O(dr).
View Selection: Conceptual Discussion
• Given a collection of views Ѵ = {V1, V2, …, Vr} that includes the base views, determine the most efficient subset U ⊆ Ѵ on which to execute the query Q.
• Conceptual discussion:
  • View selection in two dimensions
  • View selection in higher dimensions
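View Selection: Conceptual Discussion (2D)
[Figure: view selection in two dimensions. Unit square with corners O=(0,0), P=(1,0), R=(1,1), T=(0,1) and axes X and Y; query vector Q; view vectors V1 and V2; the min top-k tuple; labelled points M, A, A1, A'1, A'2, B, B2, B'1, B'2.]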
View Selection: Conceptual Discussion (higher dimensions)
• Let Ѵ = {V1, V2, …, Vr} be a set of views for an m-dimensional dataset and let Q be a query. The optimal execution of LPTA requires the use of a subset of views U ⊆ Ѵ such that |U| ≤ m.
View Selection Problem: Cost Estimation
• Compute histograms representing the distribution of scores along each view in U.
• Estimate top-kmin from the histogram Hq by determining the bucket that contains the k-th highest tuple.
• "Walk down" these histograms until the stopping condition is reached.
• Check the stopping condition by linear programming.
• When unseenmax ≤ top-kmin, perform a logarithmic search within the last bucket.
• Number of sorted accesses: ((d-1)n/b + n')r'.
• Running time of the algorithm: O((d-1) + log n').
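The second step above, locating the bucket that holds the k-th highest score, can be sketched as follows. The bucket layout `(low, high, count)` and the sample histogram are hypothetical. The slides do not say which point inside the bucket is used as the estimate, so the sketch returns the whole bucket.

```python
from typing import List, Tuple

Bucket = Tuple[float, float, int]  # (lowest score, highest score, tuple count)


def topk_min_bucket(histogram: List[Bucket], k: int) -> Bucket:
    """Walk the buckets from the highest scores down until k tuples are covered."""
    remaining = k
    for bucket in sorted(histogram, key=lambda b: b[1], reverse=True):
        if bucket[2] >= remaining:
            return bucket          # the k-th highest score falls in this bucket
        remaining -= bucket[2]
    raise ValueError("histogram holds fewer than k tuples")


# Hypothetical histogram of query scores; not data from the paper.
hq = [(0, 500, 40), (500, 1000, 7), (1000, 1500, 2)]
print(topk_min_bucket(hq, 5))  # (500, 1000, 7)
```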
SelectViews(Q, Ѵ)
• Initialize MinCost = ∞ and U = { }; at the start of each round set MinCurCost = ∞.
• For each view V ∈ Ѵ − U, compare the estimated cost of using V together with U against MinCurCost.
• If EstimateCost < MinCurCost, record V as MinV and set MinCurCost to that EstimateCost.
• Once every such V has been tried, if MinCurCost < MinCost, add MinV to U and set MinCost to MinCurCost.
• Repeat for up to m rounds, where m is the number of attributes.
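A minimal sketch of this greedy loop, assuming the cost estimator is passed in as a function of the candidate subset. The estimator, the view names, and the cost table are hypothetical. Stopping early when a round brings no improvement is a simplification made here; it matches running all m rounds because U would no longer change.

```python
import math
from typing import Callable, FrozenSet, Iterable, Set, Tuple


def select_views(
    views: Iterable[str],
    estimate_cost: Callable[[FrozenSet[str]], float],
    m: int,
) -> Tuple[Set[str], float]:
    """Greedily grow U, adding per round the view that lowers the estimate most."""
    views = list(views)
    chosen: Set[str] = set()
    min_cost = math.inf
    for _ in range(m):                       # at most one view per attribute
        min_v, min_cur_cost = None, math.inf
        for v in views:
            if v in chosen:
                continue
            cost = estimate_cost(frozenset(chosen | {v}))
            if cost < min_cur_cost:
                min_v, min_cur_cost = v, cost
        if min_v is None or min_cur_cost >= min_cost:
            break                            # no candidate improves the estimate
        chosen.add(min_v)
        min_cost = min_cur_cost
    return chosen, min_cost


# Hypothetical cost table standing in for the histogram-based estimator.
costs = {
    frozenset({"V1"}): 40, frozenset({"V2"}): 55, frozenset({"X2"}): 70,
    frozenset({"V1", "V2"}): 12, frozenset({"V1", "X2"}): 30,
    frozenset({"V1", "V2", "X2"}): 15,
}
chosen, cost = select_views(["V1", "V2", "X2"], lambda s: costs.get(s, math.inf), m=3)
print(sorted(chosen), cost)  # ['V1', 'V2'] 12
```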
View Selection Algorithms
• SelectViews(Q, Ѵ) / Exhaustive: estimates the cost of all possible (r choose p) subsets of Ѵ and selects the one with minimum cost.
• Simple greedy heuristic: iterates over the set of views and selects the one that reduces the total cost by the greatest amount.
View Selection Algorithms
• SelectViewsSpherical(Q, Ѵ): has to solve the linear program just once and is very effective for highly restrictive data sets.
• SelectViewsByAngle: sorts the view vectors by increasing angle with the query vector and returns the top-m views.
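The angle-based selection is simple enough to sketch directly. Taking the weight vectors of f1, f2 and f3 from the running example as the view and query vectors, and modelling each base view as a unit vector along its attribute, are assumptions made for illustration. The function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def angle(u: Sequence[float], v: Sequence[float]) -> float:
    """Angle in radians between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return math.acos(max(-1.0, min(1.0, dot / norm)))  # clamp rounding error


def select_views_by_angle(
    query: Sequence[float], views: Dict[str, Sequence[float]], m: int
) -> List[str]:
    """Return the m views whose weight vectors are closest in angle to the query."""
    return sorted(views, key=lambda name: angle(views[name], query))[:m]


views = {
    "V1": (2, 5, 0),   # f1 = 2x1 + 5x2
    "V2": (0, 1, 2),   # f2 = x2 + 2x3
    "X1": (1, 0, 0),   # base views, assumed to rank by a single attribute
    "X2": (0, 1, 0),
    "X3": (0, 0, 1),
}
print(select_views_by_angle((3, 10, 5), views, m=3))  # ['V1', 'X2', 'V2']
```

This heuristic needs no cost estimates or linear programs, only one dot product per view.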
More General Queries & Views
• Views that only materialize their top-k tuples
  • Truncate the histograms.
• Accommodating range conditions
  • Select the views that cover the range conditions.
  • Truncate each attribute's histogram.
Performance Evaluation: Experimental Results
[Figure: real data, performance comparison of PREFER, LPTA, and TA; panels for 2d and 3d.]
References
• Answering Top-k Queries Using Views: Gautam Das, Dimitrios Gunopulos, Nick Koudas
• aitrc.kaist.ac.kr/~vldb06/slides/R13-1.ppt
THANK YOU Questions???