
CS246



Presentation Transcript


  1. CS246 Ranked Queries

  2. Traditional Database Query • (Dept = “CS”) & (GPA > 3.5) • Boolean semantics • Clear boundary between “answers” and “non-answers” • Goal: Find all “matching” tuples • Optionally ordered by a certain field • [Figure: answer set A inside the set of all tuples T, with a clear boundary] Junghoo "John" Cho (UCLA Computer Science)

  3. Ranked Queries • Find “cheap” houses “close” to UCLA • Cheap(x) & NearUCLA(x) • Non-Boolean semantics • No clear boundary between “answers” and “non-answers” • Answers inherently ranked • Goal: Find top ranked tuples • [Figure: answer set A inside the set of all tuples T, with no clear boundary]

  4. Issues? • How to rank? • Distance 3 miles: proximity? • Similarity: looks like “Tom Cruise”? • How to combine rankings? • Price = 0.8, Distance = 0.2. Overall grade? • Weighting? • Price is twice as “important” as distance? • Query processing? • Traditional query processing is based on Boolean semantics

  5. Fagin’s paper • Previously all four issues were a “black art” • No disciplined way to address the problems • Fagin’s paper studied the last three issues in a more “disciplined” way • Combination of ranks • Weighting • Query processing • Find general “properties” and derive a formula satisfying the properties

  6. Topics • Combining multiple grades • Weighting • Efficient query processing

  7. Rank Combination • Cheap(x) & NearUCLA(x) • Cheap(x) = 0.3 • NearUCLA(x) = 0.8 • Overall ranking? • How would you approach the problem?

  8. General Query • (Cheap(x) & (NearUCLA(x) | NearBeach(x))) & RedRoof(x) • How to compute the overall grade? • [Figure: query tree with operators &, &, | and leaf grades Cheap = 0.3, NearUCLA = 0.2, NearBeach = 0.8, RedRoof = 0.6]

  9. Main Idea • Atomic scoring function A(x): given by application • Cheap(x) = 0.3, NearUCLA(x) = 0.2, … • Query: recursive application of AND and OR • (Cheap & (NearUCLA | NearBeach)) & RedRoof • Combination of two grades for “AND” and “OR” • Binary function $t: [0, 1]^2 \to [0, 1]$ • Example: min(a, b) for “AND”? (Cheap & NearUCLA)(x) = min(0.3, 0.2) = 0.2 • Properties of AND/OR scoring function?
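The recursive evaluation described on this slide can be sketched as follows. This is a minimal illustration, assuming min for “AND” and max for “OR”; the tuple-based query encoding and the names `ATOMIC`, `QUERY`, and `grade` are hypothetical, and the leaf grades are the ones shown in the query-tree figure of the previous slide.

```python
# Leaf grades as shown on the slides; the encoding below is an assumption.
ATOMIC = {"Cheap": 0.3, "NearUCLA": 0.2, "NearBeach": 0.8, "RedRoof": 0.6}


def grade(query, atomic=ATOMIC):
    """Recursively score a query: min for AND ('&'), max for OR ('|')."""
    if isinstance(query, str):          # atomic query: look up its grade
        return atomic[query]
    op, left, right = query
    a, b = grade(left, atomic), grade(right, atomic)
    if op == "&":
        return min(a, b)
    if op == "|":
        return max(a, b)
    raise ValueError(f"unknown operator: {op!r}")


# (Cheap & (NearUCLA | NearBeach)) & RedRoof
QUERY = ("&", ("&", "Cheap", ("|", "NearUCLA", "NearBeach")), "RedRoof")

print(grade(("&", "Cheap", "NearUCLA")))  # 0.2
print(grade(QUERY))                        # 0.3
```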

  10. Properties of Scoring Function • Logical equivalence • The same overall score for logically equivalent queries • (A & (B | C))(x) = ((A & B) | (A & C))(x) • Monotonicity • If A(x1) < A(x2) and B(x1) < B(x2), then (A & B)(x1) < (A & B)(x2) • $t(x_1, x_2) < t(x'_1, x'_2)$ if $x_i < x'_i$ for all $i$
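The logical-equivalence property can be checked numerically on a grid of grades. The sketch below is only an illustration: the comparison pair (product for “AND”, probabilistic sum for “OR”) is not from the slides and is introduced here purely as a contrasting example; the names `equivalent`, `product`, and `prob_sum` are hypothetical.

```python
import itertools


def equivalent(t_and, t_or, grid):
    """Check A&(B|C) == (A&B)|(A&C) for every grade triple in the grid."""
    return all(
        abs(t_and(a, t_or(b, c)) - t_or(t_and(a, b), t_and(a, c))) < 1e-12
        for a, b, c in itertools.product(grid, repeat=3)
    )


def product(a, b):
    """Alternative AND (assumption, not from the slides)."""
    return a * b


def prob_sum(a, b):
    """Alternative OR (assumption, not from the slides)."""
    return a + b - a * b


GRID = [i / 10 for i in range(11)]

print(equivalent(min, max, GRID))           # True
print(equivalent(product, prob_sum, GRID))  # False
```

For example, with a = 0.5 and b = c = 1 the product pair gives 0.5 on the left side and 0.75 on the right side, so the two logically equivalent queries receive different grades.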

  11. Uniqueness Theorem • min() and max() are the only scoring functions with the two properties • min() for “AND” and max() for “OR” • Quite surprising and interesting result • More discussion later • Is logical equivalence really true?

  12. Question on Logical Equivalence? • Query: Homepage of “John Grisham” • PageRank & John & Grisham • Logically equivalent, but are they the same? • Does logical equivalence hold for non-Boolean queries? • [Figure: two query trees, each combining PR, John, and Grisham with two & operators]

  13. Summary of Scoring Function • Question: how to combine rankings • Scoring function: combine grades • Results from fuzzy logic • Logical equivalence • Monotonicity • Uniqueness theorem • min() for “AND” and max() for “OR” • Logical equivalence may not be valid for graded Boolean expressions

  14. Topics • Combining multiple grades • Weighting • Efficient query processing

  15. Weighting of Grades • Cheap(x) & NearUCLA(x) • What if proximity is “more important” than price? • Assign weights to each atomic query • Cheap(x) = 0.2, weight = 1 • NearUCLA(x) = 0.8, weight = 10 • Proximity is 10 times more important than price • Overall grade?

  16. Formalization • m atomic queries • $\Theta = (\theta_1, \dots, \theta_m)$: weight of each atomic query • $X = (x_1, \dots, x_m)$: grades from each atomic query • $f(x_1, \dots, x_m)$: unweighted scoring function • $f_\Theta(x_1, \dots, x_m)$: new weighted scoring function • What should $f_\Theta(x_1, \dots, x_m)$ be given $\Theta$? Properties of $f_\Theta(x_1, \dots, x_m)$?

  17. Properties • P1: When all weights are equal, $f_{(1/m, \dots, 1/m)}(x_1, \dots, x_m) = f(x_1, \dots, x_m)$ • P2: If an argument has zero weight, we can safely drop the argument: $f_{(\theta_1, \dots, \theta_{m-1}, 0)}(x_1, \dots, x_m) = f_{(\theta_1, \dots, \theta_{m-1})}(x_1, \dots, x_{m-1})$ • P3: $f_\Theta(X)$ should be locally linear: $f_{a\Theta + (1-a)\Theta'}(x_1, \dots, x_m) = a f_\Theta(x_1, \dots, x_m) + (1-a) f_{\Theta'}(x_1, \dots, x_m)$

  18. Local Linearity Example • $\Theta_1 = (1/2, 1/2)$, $f_{\Theta_1}(X) = 0.2$; $\Theta_2 = (1/4, 3/4)$, $f_{\Theta_2}(X) = 0.4$ • If $\Theta_3 = (3/8, 5/8) = \tfrac{1}{2}\Theta_1 + \tfrac{1}{2}\Theta_2$, then $f_{\Theta_3}(X) = \tfrac{1}{2} f_{\Theta_1}(X) + \tfrac{1}{2} f_{\Theta_2}(X) = 0.3$ • Q: m atomic queries. How many independent weight assignments? • A: m. Only m degrees of freedom • Very strong assumption • Not too unreasonable, but no rationale

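  19. Theorem • With the weights ordered so that $\theta_1 \ge \theta_2 \ge \dots \ge \theta_m$, $f_\Theta(x_1, \dots, x_m) = 1 \cdot (\theta_1 - \theta_2) f(x_1) + 2 \cdot (\theta_2 - \theta_3) f(x_1, x_2) + 3 \cdot (\theta_3 - \theta_4) f(x_1, x_2, x_3) + \dots + m \cdot \theta_m \cdot f(x_1, \dots, x_m)$ is the only function that satisfies these properties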

  20. Examples • $\Theta = (1/3, 1/3, 1/3)$: $1 \cdot (1/3 - 1/3) f(x_1) + 2 \cdot (1/3 - 1/3) f(x_1, x_2) + 3 \cdot (1/3) f(x_1, x_2, x_3) = f(x_1, x_2, x_3)$ • $\Theta = (1/2, 1/4, 1/4)$: $1 \cdot (1/2 - 1/4) f(x_1) + 2 \cdot (1/4 - 1/4) f(x_1, x_2) + 3 \cdot (1/4) f(x_1, x_2, x_3) = \tfrac{1}{4} f(x_1) + \tfrac{3}{4} f(x_1, x_2, x_3)$ • $\Theta = (1/2, 1/3, 1/6)$: $1 \cdot (1/2 - 1/3) f(x_1) + 2 \cdot (1/3 - 1/6) f(x_1, x_2) + 3 \cdot (1/6) f(x_1, x_2, x_3) = \tfrac{1}{6} f(x_1) + \tfrac{2}{6} f(x_1, x_2) + \tfrac{3}{6} f(x_1, x_2, x_3)$
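The three weight assignments above can be checked numerically. The following is a small sketch of the weighted formula, assuming the weights sum to 1 and that arguments are processed in order of decreasing weight; the function name `weighted_score` and the sample grades are hypothetical, and exact fractions are used so the results are not blurred by rounding.

```python
from fractions import Fraction as F


def weighted_score(f, weights, grades):
    """Weighted version of an unweighted scoring function f.

    f takes a list of grades. Arguments are sorted by decreasing
    weight (assumption of the formula); theta_{m+1} is taken as 0.
    """
    pairs = sorted(zip(weights, grades), key=lambda p: p[0], reverse=True)
    thetas = [w for w, _ in pairs] + [0]
    xs = [x for _, x in pairs]
    return sum(
        i * (thetas[i - 1] - thetas[i]) * f(xs[:i])
        for i in range(1, len(xs) + 1)
    )


x = [F(9, 10), F(1, 2), F(3, 10)]   # hypothetical grades x1, x2, x3

print(weighted_score(min, [F(1, 3), F(1, 3), F(1, 3)], x))  # 3/10, equals min(x)
print(weighted_score(min, [F(1, 2), F(1, 4), F(1, 4)], x))  # 9/20
print(weighted_score(min, [F(1, 2), F(1, 3), F(1, 6)], x))  # 7/15
```

With equal weights the result collapses to the unweighted f, as property P1 requires; with weights (1/2, 1/4, 1/4) the result is 1/4 · 9/10 + 3/4 · 3/10 = 9/20.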

  21. Summary of Weighting • Question: different “importance” of grades • $\Theta = (\theta_1, \dots, \theta_m)$: weight assignment • Uniqueness theorem • Local linearity and two other reasonable assumptions • $1 \cdot (\theta_1 - \theta_2) f(x_1) + 2 \cdot (\theta_2 - \theta_3) f(x_1, x_2) + \dots + m \cdot \theta_m \cdot f(x_1, \dots, x_m)$ • Linearity assumption questionable

  22. Application? • Web page ranking • PageRank & (Keyword1 & Keyword2 & …) • Should we use min()? • min(keyword1, keyword2, keyword3, …) • Would it be better than the cosine measure? • If PageRank is 10 times more important, should we use Fagin’s formula? • 9/11 PR + 2/11 min(PR, min(keywords)) • Would it be better than other ranking functions? • Is Fagin’s formula practical?

  23. Topics • Combining multiple grades • Weighting • Efficient query processing

  24. Question • How can we process ranked queries efficiently? • Top k answers for “Cheap(x) & NearUCLA(x)” • Assume we have good scoring functions • How do we process traditional Boolean query? • GPA > 3.5 & Dept = “CS” • What’s the difference? • What is difficult compared to Boolean query?

  25. Naïve Solution • Cheap(x) & NearUCLA(x) • Read prices of all houses • Compute distances of all houses • Compute combined grades of all houses • Return the k objects with the highest grades • Clearly very expensive when database is large
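The naïve plan can be sketched in a few lines. This is only an illustration: the grades for the four houses are hypothetical, min is assumed as the combining function, and the name `naive_top_k` is made up for this example.

```python
import heapq


def naive_top_k(x1, x2, k, f=min):
    """Score every object, then keep the k highest combined grades."""
    combined = {obj: f(x1[obj], x2[obj]) for obj in x1}
    return heapq.nlargest(k, combined.items(), key=lambda item: item[1])


# Hypothetical grades for four houses.
cheap = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.6}
near_ucla = {"a": 0.85, "b": 0.5, "c": 0.2, "d": 0.9}

print(naive_top_k(cheap, near_ucla, k=2))  # [('a', 0.85), ('d', 0.6)]
```

Every object is touched once per atomic query, which is what makes this approach expensive on a large database.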

  26. Main Idea • We don’t have to check all objects/tuples • Most tuples have low grades and will not be returned • Basic algorithm • Check top objects from each atomic query and find the best objects • Question: How many objects should we see from each “atomic query”?

  27. Architecture • Sorted access • Random access • f(x1, x2, x3): any monotonic function • How many to check? How to minimize it? • [Figure: three sorted streams feeding f: (d: 0.9, a: 0.85, b: 0.78, …), (b: 0.9, d: 0.9, a: 0.75, …), (a: 0.9, b: 0.8, c: 0.7, …)]

  28. Three Papers • Fuzzy queries • Optimal aggregation • Minimal probing

  29. Fagin’s Model • [Figure: three streams read by sorted access and combined by f(x1, x2, x3): (d: 0.9, a: 0.85, b: 0.78, …), (b: 0.9, d: 0.9, a: 0.75, …), (a: 0.9, b: 0.8, c: 0.7, …)]

  30. Fagin’s Model • Sorted access on all streams • Cost model: # objects accessed by sorted/random accesses, $c_s \cdot s + c_r \cdot r$ • Ignore the cost for “sorting” • Reasonable when objects have been sorted already • Sorted index • Inappropriate when objects have not been sorted • We have to compute grades for all objects • Sorting can be costly

  31. Main Question • How many objects to access? When can we stop? • A: When we know that we have seen at least k objects whose scores are higher than that of any unseen object

  32. Fagin’s First Algorithm • Read objects from each stream in parallel • Stop when k objects have been seen in common from all streams • Top answers should be in the union of the objects that we have seen • Why? • [Figure: k objects in common across three sorted streams feeding f(x1, x2, x3): (d: 0.9, a: 0.85, b: 0.78, …), (b: 0.9, d: 0.9, a: 0.75, …), (a: 0.9, b: 0.8, c: 0.7, …)]

  33. Stopping Condition • Reason • The grades of the k objects in the intersection are at least as high as that of any unseen object • Proof • x: object in the intersection, y: unseen object • $y_1 \le x_1$. Similarly $y_i \le x_i$ for all $i$ • $f(y_1, \dots, y_m) \le f(x_1, \dots, x_m)$ due to monotonicity

  34. Fagin’s First Algorithm • Get objects from each stream in parallel until we have seen k objects in common from all streams • For all objects that we have seen so far • If its complete grade is not known, obtain unknown grades by random access • Find the k objects with the highest grades

  35. Example (k = 2) • Stream x1 (sorted): a: 0.9, b: 0.8, c: 0.7, … • Stream x2 (sorted): d: 0.9, a: 0.85, b: 0.5, … • Grades not reached by sorted access: c: 0.2 (x2), d: 0.6 (x1) • Grades (x1, x2, min(x1, x2)): a (0.9, 0.85, 0.85), d (0.6, 0.9, 0.6), c (0.7, 0.2, 0.2), b (0.8, 0.5, 0.5) • Top 2: a: 0.85, d: 0.6
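Fagin’s first algorithm can be traced on the example grades above. The code below is a sketch under stated assumptions: each stream is an in-memory list holding every object, random access is simulated with a dictionary lookup, and the names `fagin_fa`, `X1`, and `X2` are hypothetical.

```python
def fagin_fa(streams, k, f=min):
    """Sketch of Fagin's first algorithm over m sorted streams.

    streams: list of lists of (object, grade), each sorted by grade in
    descending order. f takes an iterable of grades.
    Returns the top-k (object, grade) pairs and the sorted-access depth.
    """
    m, n = len(streams), len(streams[0])
    lookup = [dict(s) for s in streams]   # stands in for random access
    seen_in = {}                          # object -> set of stream indexes
    depth = 0
    # Phase 1: sorted access in parallel until k objects are seen in all streams.
    while depth < n and sum(len(ids) == m for ids in seen_in.values()) < k:
        for i, stream in enumerate(streams):
            obj, _ = stream[depth]
            seen_in.setdefault(obj, set()).add(i)
        depth += 1
    # Phase 2: complete the grades of every seen object, then rank.
    scores = {o: f(lookup[i][o] for i in range(m)) for o in seen_in}
    top = sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]
    return top, depth


X1 = [("a", 0.9), ("b", 0.8), ("c", 0.7), ("d", 0.6)]
X2 = [("d", 0.9), ("a", 0.85), ("b", 0.5), ("c", 0.2)]

print(fagin_fa([X1, X2], k=2))  # ([('a', 0.85), ('d', 0.6)], 3)
```

After three rounds of sorted access, a and b have been seen in both streams, so the first phase stops; c and d still need one random access each before the final ranking.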

  36. Performance • We only look at a subset of objects • Ignoring high cost for random access, clearly better than the naïve solution • Total number of accesses • $O(N^{(m-1)/m} k^{1/m})$ assuming independent and random object order for each atomic query • E.g., $O(N^{1/2} k^{1/2})$ if m = 2

  37. Summary of Fagin’s Algorithm • Sorted access on all streams • Stopping condition • k common objects from all streams

  38. Problem of Fagin’s Algorithm • Performance depends heavily on object orders in the streams • k = 1, min(x1, x2) • We need to read all objects • Sorted access until the 3rd object and random access for all the remainder • Can we avoid this pathological scenario? • [Figure: stream x1: b: 1, a: 1, c: 1, d: 0, e: 0; stream x2: e: 1, d: 1, b: 1, c: 0, a: 0]

  39. New Idea • Let us read all grades of an object once we see it from a sorted access • Do not need to wait until the streams give k common objects • Less dependent on the object order • When can we stop? • Until we have seen k common objects from sorted accesses?

  40. When Can We Stop? • If we are sure that we have seen at least k objects whose grades are higher than those of unseen objects • How do we know the grades of unseen objects? • Can we predict the maximum grade of unseen objects?

  41. Maximum Grade of Unseen Objects • Assuming min(x1, x2), what will be the maximum grade of unseen objects? • [Figure: stream x1: a: 1, b: 0.9, c: 0.8, d: 0.7, e: 0.6; stream x2: e: 1, d: 0.8, b: 0.7, c: 0.7, a: 0.2] • If the last grades read are 0.8 from x1 and 0.7 from x2, then x1 ≤ 0.8 and x2 ≤ 0.7 for any unseen object, so its grade is at most min(0.8, 0.7) = 0.7 • Generalization?

  42. Generalization • $\underline{x}_i$: the minimum grade seen so far from stream i by sorted access • $f(\underline{x}_1, \dots, \underline{x}_m)$ is the maximum grade of unseen objects • $x_i \le \underline{x}_i$ for all unseen objects • $f(x_1, \dots, x_m)$: monotonic

  43. Basic Idea of TA • We can stop when the top k seen object grades are higher than the maximum grade of unseen objects • Maximum grade of unseen objects: $f(\underline{x}_1, \dots, \underline{x}_m)$

  44. Threshold Algorithm • Read one object from each stream by sorted access • For each object O that we just read • Get all grades for O by random access • If f(O) is in the top k, store it in a buffer • If the lowest grade of the top k objects is larger than the threshold $f(\underline{x}_1, \dots, \underline{x}_m)$, stop

  45. Example (k = 2) • Stream x1 (sorted): a: 0.9, b: 0.8, c: 0.7, … • Stream x2 (sorted): d: 0.9, a: 0.85, b: 0.5, … • Grades not reached by sorted access: c: 0.2 (x2), d: 0.6 (x1) • Thresholds: f(1, 1) = 1 initially, then f(0.9, 0.9) = 0.9, f(0.8, 0.85) = 0.8, f(0.7, 0.5) = 0.5 • Grades (x1, x2, min(x1, x2)): a (0.9, 0.85, 0.85), d (0.6, 0.9, 0.6), c (0.7, 0.2, 0.2), b (0.8, 0.5, 0.5) • Top 2: a: 0.85, d: 0.6
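The threshold algorithm can be traced on the same example grades. This is a sketch, not a definitive implementation: random access is simulated with dictionary lookups, all computed scores are kept in a dictionary instead of a bounded buffer of size k, and the loop stops as soon as the k-th best grade is at least the threshold (a slight relaxation of “larger than” that is still safe, since no unseen object can exceed the threshold). The names `threshold_algorithm`, `X1`, and `X2` are hypothetical.

```python
import heapq


def threshold_algorithm(streams, k, f=min):
    """Sketch of TA. Returns the top-k pairs and the threshold per round."""
    m = len(streams)
    lookup = [dict(s) for s in streams]   # stands in for random access
    scores = {}                           # object -> complete grade
    thresholds = []
    for depth in range(len(streams[0])):
        last = []                         # last grade read from each stream
        for stream in streams:
            obj, g = stream[depth]
            last.append(g)
            if obj not in scores:         # fetch the other grades right away
                scores[obj] = f(lookup[i][obj] for i in range(m))
        thresholds.append(f(last))        # best possible grade of unseen objects
        top = heapq.nlargest(k, scores.items(), key=lambda item: item[1])
        if len(top) == k and top[-1][1] >= thresholds[-1]:
            return top, thresholds
    top = heapq.nlargest(k, scores.items(), key=lambda item: item[1])
    return top, thresholds


X1 = [("a", 0.9), ("b", 0.8), ("c", 0.7), ("d", 0.6)]
X2 = [("d", 0.9), ("a", 0.85), ("b", 0.5), ("c", 0.2)]

print(threshold_algorithm([X1, X2], k=2))
# ([('a', 0.85), ('d', 0.6)], [0.9, 0.8, 0.5])
```

The thresholds 0.9, 0.8, and 0.5 match the values of f shown in the example; the run stops in the third round because the second-best grade, 0.6, is no longer below the threshold.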

  46. Comparison of FA and TA? • TA sees fewer objects than FA • TA stops no later than FA • When we have seen k objects in common, their grades are at least as high as the threshold • TA may perform more random accesses than FA • In TA, (m-1) random accesses for each object • In FA, random accesses are done at the end, only for missing grades • TA requires bounded buffer space (k) • At the expense of more random seeks

  47. Comparison of FA and TA • TA can be better in general, but it may perform more random seeks • What if random seek is very expensive or impossible? • Algorithm with no random seek possible?

  48. Algorithm NRA • An algorithm with no random seek • Isn’t random seek essential? • How can we know the grade of an object when some of its grades are missing?

  49. Basic Idea • We may still compute the lower bound of an object, even if we miss some of its grades • E.g., max(0.6, x) ≥ 0.6 • We may also compute the upper bound of an object, even if we miss some of its grades • E.g., max(0.6, x) ≤ 0.8 if x ≤ 0.8 • If the lower bound of O1 is higher than the upper bound of other objects, we can return O1

  50. Generalization • $(\underline{x}_1, \dots, \underline{x}_m)$: the minimum grades seen so far from sorted access • Lower bound of object: 0 for missing grades • When $x_3, x_4$ are missing, $f(x_1, x_2, 0, 0)$ • From monotonicity • Upper bound of object: $\underline{x}_i$ for missing grades • When $x_3, x_4$ are missing, $f(x_1, x_2, \underline{x}_3, \underline{x}_4)$ • $x_3 \le \underline{x}_3$, $x_4 \le \underline{x}_4$, thus $f(x_1, x_2, x_3, x_4) \le f(x_1, x_2, \underline{x}_3, \underline{x}_4)$
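The two bounds can be computed without any random access. The helper below is a minimal sketch, assuming grades in [0, 1] and a monotonic f that takes an iterable; the name `nra_bounds` and its argument layout are hypothetical.

```python
def nra_bounds(known, bottoms, f=min):
    """Lower and upper bounds on an object's grade with missing grades.

    known:   one entry per stream, a grade or None if not yet seen.
    bottoms: lowest grade read so far from each stream by sorted access.
    """
    # Lower bound: substitute 0 for every missing grade.
    lower = f(0.0 if g is None else g for g in known)
    # Upper bound: substitute the stream's current bottom grade.
    upper = f(b if g is None else g for g, b in zip(known, bottoms))
    return lower, upper


# max(0.6, x) with x unknown but at most 0.8, as in the earlier example.
print(nra_bounds([0.6, None], [0.6, 0.8], f=max))                  # (0.6, 0.8)

# Hypothetical object with x3 and x4 missing, combined with min.
print(nra_bounds([0.9, 0.7, None, None], [0.8, 0.6, 0.5, 0.4]))    # (0.0, 0.4)
```

An object can be returned once its lower bound is higher than the upper bound of every other object, including the bound f applied to the bottom grades alone, which covers objects not yet seen in any stream.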
