
Top-k Query Processing


Presentation Transcript


  1. Top-k Query Processing
  Optimal aggregation algorithms for middleware
  Ronald Fagin, Amnon Lotem, and Moni Naor + Sushruth P. + Arjun Dasgupta

  2. Why top-k query processing
  • Multimedia brings fuzzy data: attribute values are graded, typically in [0,1]
  • There is no clear boundary between "answer" and "no answer"
  • A query in a multimedia database means combining graded attributes
  • Attributes are combined by an aggregation function, which gives each object an overall grade
  • Return the k objects with the highest overall grade (example on the following slides)

  3. Top-k query processing
  Top-k query processing = finding the k objects that have the highest overall grades
  • How? Which algorithms?
  • Fagin's Algorithm (FA)
  • Threshold Algorithm (TA)
  • Which is the best algorithm?
  • Keep in mind: the database system serves as middleware
  • Multimedia objects may be kept in different subsystems, e.g. a photo DB, a video DB, a search engine
  • Take the limitations of these subsystems into account

  4. Example
  • Simple database model
  • Simple query
  • Explaining Fagin's Algorithm (FA)
  • Finding the top k with FA
  • Explaining the Threshold Algorithm (TA)
  • Finding the top k with TA

  5. Example – Simple database model
  The database has N objects, each graded on M attributes (here M = 2):

     Object ID   Attribute 1   Attribute 2
     a           0.9           0.85
     b           0.8           0.7
     c           0.72          0.2
     ...         ...           ...
     d           0.6           0.9

  Sorted L1: (a, 0.9), (b, 0.8), (c, 0.72), ..., (d, 0.6)
  Sorted L2: (d, 0.9), (a, 0.85), (b, 0.7), ..., (c, 0.2)

  6. Example – Simple query
  Find the top 2 (k = 2) objects for the following 'query' executed on the middleware:
  A1 & A2 (e.g. color = red & shape = round)
  A1 & A2 as a 'query' to the middleware results in the middleware combining the grades of A1 and A2 by min(A1, A2)
  • Aggregation function: a function that gives objects an overall grade based on their attribute grades
  • Examples: the min and max functions
  • Monotonicity!
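
To make the running example concrete, here is a minimal Python sketch of the data and helpers used in the following slides; the names L1, L2, aggregate and random_access are mine, and the "..." objects of the full database are omitted.

    # Running example from slide 5: two lists of (object_id, grade) pairs,
    # each sorted by grade; the "..." objects of the full database are omitted.
    L1 = [("a", 0.9), ("b", 0.8), ("c", 0.72), ("d", 0.6)]   # sorted on attribute A1
    L2 = [("d", 0.9), ("a", 0.85), ("b", 0.7), ("c", 0.2)]   # sorted on attribute A2

    def aggregate(grades):
        """Aggregation function of the example: min over the attribute grades (monotone)."""
        return min(grades)

    def random_access(lists, obj):
        """Return all attribute grades of obj, one random access (dict lookup) per list."""
        return [dict(lst)[obj] for lst in lists]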

  7. Example – Fagin's Algorithm
  • STEP 1: Read attributes from every sorted list (in parallel)
  • Stop when k objects have been seen in common in all lists
  After 3 sorted accesses per list, k = 2 objects (a and b) have been seen in both:

     ID   A1     A2     Min(A1, A2)
     a    0.9    0.85
     d           0.9
     b    0.8    0.7
     c    0.72

  8. Example – Fagin's Algorithm
  • STEP 2: Random access to find the missing grades

     ID   A1     A2     Min(A1, A2)
     a    0.9    0.85
     d    0.6    0.9
     b    0.8    0.7
     c    0.72   0.2

  9. Example – Fagin's Algorithm
  • STEP 3: Compute the grades of the seen objects and return the k highest-graded objects

     ID   A1     A2     Min(A1, A2)
     a    0.9    0.85   0.85
     d    0.6    0.9    0.6
     b    0.8    0.7    0.7
     c    0.72   0.2    0.2

  Top 2: a (0.85) and b (0.7)
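
A compact Python sketch of the three FA steps just shown, reusing the example lists and helpers from slide 6; the function name and structure are illustrative, not the paper's pseudocode, and for brevity it re-fetches all grades by random access rather than only the missing ones.

    def fagins_algorithm(lists, k):
        """Fagin's Algorithm:
        STEP 1: parallel sorted access until k objects have been seen in every list;
        STEP 2: random access for the remaining grades of every seen object;
        STEP 3: compute overall grades and return the k best objects."""
        seen = [set() for _ in lists]                 # objects seen under sorted access, per list
        depth = 0
        while len(set.intersection(*seen)) < k:       # STEP 1
            for s, lst in zip(seen, lists):
                s.add(lst[depth][0])
            depth += 1
        candidates = set.union(*seen)
        # STEP 2 + 3 (simplification: re-fetch all grades; FA proper fetches only missing ones)
        grades = {obj: aggregate(random_access(lists, obj)) for obj in candidates}
        return sorted(grades.items(), key=lambda kv: kv[1], reverse=True)[:k]

    print(fagins_algorithm([L1, L2], k=2))            # -> [('a', 0.85), ('b', 0.7)]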

  10. New idea!!! Threshold Algorithm (TA)
  • Read all grades of an object as soon as it is seen under sorted access
  • No need to wait until the lists give k common objects
  • Do sorted access (and the corresponding random accesses) until you have seen the top k answers
  • How do we know that the grades of seen objects are higher than the grades of unseen objects?
  • Predict the maximum possible grade of unseen objects: the threshold value
  Example: after 3 sorted accesses the seen prefixes are L1: a (0.9), b (0.8), c (0.72) and L2: d (0.9), a (0.85), b (0.7); any possibly unseen object (e.g. an object f further down in both lists) can have an overall grade of at most T = min(0.72, 0.7) = 0.7

  11. Example – Threshold Algorithm
  Step 1: Parallel sorted access to each list. For each object seen: get all its grades by random access, determine Min(A1, A2), and keep it in the buffer if it is among the 2 highest seen so far.
  After the first sorted access (a from L1, d from L2):

     ID   A1     A2     Min(A1, A2)
     a    0.9    0.85   0.85
     d    0.6    0.9    0.6

  12. Example – Threshold Algorithm
  Step 2: Determine the threshold value from the grades last seen under sorted access: T = min(last grade seen in L1, last grade seen in L2). If 2 objects have an overall grade ≥ T, stop; else go to the next entry position in the sorted lists and repeat step 1.
  Here T = min(0.9, 0.9) = 0.9; the buffered grades (a: 0.85, d: 0.6) are below T, so continue.

  13. Example – Threshold Algorithm
  Step 1 (again): Parallel sorted access to the next position in each list. For each newly seen object: get all its grades by random access, determine Min(A1, A2), and keep it in the buffer if it is among the 2 highest seen so far.
  After the second sorted access (b from L1, a from L2):

     ID   A1     A2     Min(A1, A2)
     a    0.9    0.85   0.85
     d    0.6    0.9    0.6
     b    0.8    0.7    0.7

  14. Example – Threshold Algorithm
  Step 2 (again): Determine the threshold value from the grades last seen under sorted access: T = min(0.8, 0.85) = 0.8. The 2 best buffered objects (a: 0.85, b: 0.7) do not both reach T, so go to the next entry position and repeat step 1.

  15. Example – Threshold Algorithm
  Situation at the stopping condition, after the third sorted access (c from L1, b from L2): T = min(0.72, 0.7) = 0.7, and the 2 buffered objects a (0.85) and b (0.7) both have an overall grade ≥ T, so TA stops and returns a and b.
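
The same walk-through as a Python sketch of TA, again reusing the example lists and helpers from slide 6; the bounded buffer is modeled as a dict holding at most k objects, and equally long lists are assumed.

    def threshold_algorithm(lists, k):
        """Threshold Algorithm: round-robin sorted access, full random access for
        every newly seen object, bounded buffer of k objects, stop once all k
        buffered grades reach the threshold T."""
        buffer = {}                                            # object -> overall grade
        for depth in range(min(len(lst) for lst in lists)):    # assumes equally long lists
            last_seen = []
            for lst in lists:
                obj, grade = lst[depth]                        # sorted access
                last_seen.append(grade)
                if obj not in buffer:                          # new object: random accesses
                    buffer[obj] = aggregate(random_access(lists, obj))
            # Keep only the k highest-graded objects seen so far (bounded buffer).
            buffer = dict(sorted(buffer.items(), key=lambda kv: kv[1], reverse=True)[:k])
            threshold = aggregate(last_seen)                   # max possible grade of unseen objects
            if len(buffer) == k and all(g >= threshold for g in buffer.values()):
                break                                          # stopping condition of slide 15
        return sorted(buffer.items(), key=lambda kv: kv[1], reverse=True)

    print(threshold_algorithm([L1, L2], k=2))                  # -> [('a', 0.85), ('b', 0.7)]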

  16. Comparison of Fagin's Algorithm and the Threshold Algorithm
  • TA sees fewer objects than FA
  • TA stops at least as early as FA
  • When FA has seen k objects in common, their grades are greater than or equal to the threshold in TA
  • TA may perform more random accesses than FA
  • In TA, (m-1) random accesses for each object seen under sorted access
  • In FA, random accesses are done at the end, only for the missing grades
  • TA requires only bounded buffer space (k objects), at the expense of more random accesses
  • FA uses unbounded buffers

  17. The best algorithm
  • Which algorithm is the best: TA or FA?
  • Define "best": middleware cost, and the concept of instance optimality
  • Consider:
  • wild guesses
  • characteristics of aggregation functions: monotone, strictly monotone, strict
  • database restrictions: the distinctness property

  18. The best algorithm: concept of optimality
  • A = a class of algorithms; A ∈ A represents an algorithm
  • D = the class of legal inputs to the algorithms (databases); D ∈ D represents a database
  • Middleware cost = cost for processing data in the subsystems = s·cS + r·cR (s sorted accesses at cost cS each, r random accesses at cost cR each)
  • Cost(A, D) = middleware cost when running algorithm A over database D
  Algorithm B is instance optimal over A and D if B ∈ A and Cost(B, D) = O(Cost(A, D)) for every A ∈ A and D ∈ D.
  This means there are constants c and c' such that Cost(B, D) ≤ c·Cost(A, D) + c' for every A ∈ A and D ∈ D; the constant c is the optimality ratio.
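
As a hypothetical illustration of this cost model: the access counts below are read off the FA and TA walk-throughs above, and the cost values cS = 1 and cR = 5 are made-up numbers.

    def middleware_cost(sorted_accesses, random_accesses, c_s, c_r):
        """Middleware cost of a run = s*cS + r*cR."""
        return sorted_accesses * c_s + random_accesses * c_r

    # FA run above: 6 sorted accesses + 2 random accesses (missing grades of c and d)
    print(middleware_cost(6, 2, c_s=1, c_r=5))   # 16
    # TA run above: 6 sorted accesses + 4 random accesses (m-1 = 1 per newly seen object)
    print(middleware_cost(6, 4, c_s=1, c_r=5))   # 26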

  19. The best algorithm: instance optimality & wild guesses
  • Intuitively: B instance optimal = always the best algorithm in A = always optimal
  • In reality: "always" is not quite always; we exclude algorithms that make wild guesses
  • Wild guess = random access on an object not previously encountered under sorted access
  • In practice not possible: the database needs to know the object ID to do a random access
  • If wild guesses are allowed in A, then no algorithm can be instance optimal
  • A lucky sequence of wild guesses can find the top-k objects with only k·m random accesses (k = number of objects returned, m = number of lists)

  20. The best algorithm: aggregation functions
  • Aggregation function t combines an object's attribute grades into its overall grade: x1,…,xm ↦ t(x1,…,xm)
  • Monotone: t(x1,…,xm) ≤ t(x'1,…,x'm) if xi ≤ x'i for every i
  • Strictly monotone: t(x1,…,xm) < t(x'1,…,x'm) if xi < x'i for every i
  • Strict: t(x1,…,xm) = 1 precisely when xi = 1 for every i
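
A quick illustration of these definitions (my own example, not from the slides): min is monotone, strictly monotone and strict, while max is monotone and strictly monotone but not strict, since it reaches 1 as soon as a single grade is 1.

    # min is strict: it equals 1 only when every grade is 1.
    assert min(0.5, 1.0) < 1.0
    # max is NOT strict: it reaches 1 although not every grade is 1.
    assert max(0.5, 1.0) == 1.0
    # Both are strictly monotone: raising every grade raises the overall grade.
    assert min(0.2, 0.3) < min(0.4, 0.5) and max(0.2, 0.3) < max(0.4, 0.5)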

  21. The best algorithm: database restrictions
  • Distinctness property: a database has no (sorted) attribute list in which two objects have the same grade

  22. The best algorithm: Fagin's Algorithm
  • Database with N objects, each with m attributes
  • The orderings of the lists are independent
  • FA finds the top k with middleware cost O(N^((m-1)/m) · k^(1/m))
  • FA is optimal with high probability in the worst case for strict monotone aggregation functions

  23. The best algorithm: Threshold Algorithm
  • TA is instance optimal (always optimal) for every monotone aggregation function, over every database (wild guesses excluded)
  • This is optimal in a much stronger sense than Fagin's Algorithm
  • For strict monotone aggregation functions:
  • Optimality ratio = m + m(m-1)·cR/cS = best possible (m = number of attributes)
  • If random access is not possible (cR = 0), the optimality ratio is m
  • If sorted access is not possible (cS = 0), the optimality ratio is infinite, so TA is not instance optimal
  • TA is instance optimal (always optimal) for every strictly monotone aggregation function, over every database (wild guesses included) that satisfies the distinctness property
  • Optimality ratio = c·m^2 with c = max {cR/cS, cS/cR}

  24. Extending TA
  • What if sorted access is restricted? (e.g. a distance database) → TAz
  • What if random access is not possible? (e.g. a web search engine) → No Random Access Algorithm (NRA)
  • What if we want only the approximate top k objects? → TAθ
  • What if we consider the relative costs of random and sorted access? → Combined Algorithm (CA, between TA and NRA)

  25. NRA (No Random Access Algorithm)
  • What if we also want the scores?

  26. Combined Algorithm (CA)
  • CA is instance optimal

  27. Approximation
  • A θ-approximation (θ > 1) to the top k answers for the aggregation function t is a collection of k objects (each along with its grade) such that for each y among these k objects and each z not among them, θ·t(y) ≥ t(z)
  • TAθ: as soon as at least k objects have been seen whose grade is at least threshold/θ, halt
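
A sketch of TAθ, obtained by relaxing the stopping condition of the threshold_algorithm sketch shown after slide 15; the parameter name theta is mine, and theta = 1 gives back plain TA.

    def threshold_algorithm_approx(lists, k, theta=1.0):
        """TA-theta: identical to TA except that it may halt earlier, as soon as
        the k buffered grades reach threshold / theta instead of the full threshold."""
        buffer = {}
        for depth in range(min(len(lst) for lst in lists)):
            last_seen = []
            for lst in lists:
                obj, grade = lst[depth]
                last_seen.append(grade)
                if obj not in buffer:
                    buffer[obj] = aggregate(random_access(lists, obj))
            buffer = dict(sorted(buffer.items(), key=lambda kv: kv[1], reverse=True)[:k])
            threshold = aggregate(last_seen)
            if len(buffer) == k and all(g >= threshold / theta for g in buffer.values()):
                break                        # relaxed stopping condition of TA-theta
        return sorted(buffer.items(), key=lambda kv: kv[1], reverse=True)

    # Stops after a single round on the example and returns a valid 1.5-approximation:
    print(threshold_algorithm_approx([L1, L2], k=2, theta=1.5))   # -> [('a', 0.85), ('d', 0.6)]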

  28. ?
