CS246

CS246 Ranked Queries

Traditional Database Query • (Dept = “CS”) & (GPA > 3.5) • Boolean semantics • Clear boundary between “answers” and “non-answers” • Goal: Find all “matching” tuples • Optionally ordered by a certain field • [Figure: answer set A inside the set T of all tuples, with a clear boundary] Junghoo "John" Cho (UCLA Computer Science)

Ranked Queries • Find “cheap” houses “close” to UCLA • Cheap(x) & NearUCLA(x) • Non-Boolean semantics • No clear boundary between “answers” and “non-answers” • Answers inherently ranked • Goal: Find top ranked tuples • [Figure: answer set A inside the set T of all tuples, with no clear boundary]

Issues? • How to rank? • Distance 3 miles: proximity? • Similarity: looks like “Tom Cruise”? • How to combine rankings? • Price = 0.8, Distance = 0.2. Overall grade? • Weighting? • Price is twice as “important” as distance? • Query processing? • Traditional query processing is based on Boolean semantics

Fagin’s paper • Previously all of the 4 issues were a “black art” • No disciplined way to address the problems • Fagin’s paper studied the last 3 issues in a more “disciplined” way • Combination of ranks • Weighting • Query processing • Find general “properties” and derive a formula satisfying the properties

Topics • Combining multiple grades • Weighting • Efficient query processing

Rank Combination • Cheap(x) & NearUCLA(x) • Cheap(x) = 0.3 • NearUCLA(x) = 0.8 • Overall ranking? • How would you approach the problem?

General Query • (Cheap(x) & (NearUCLA(x) | NearBeach(x))) & RedRoof(x) • How to compute the overall grade? • [Figure: query tree with & at the root, an inner & and |, and leaf grades Cheap = 0.3, NearUCLA = 0.2, NearBeach = 0.8, RedRoof = 0.6]

Main Idea • Atomic scoring function A(x): given by application • Cheap(x) = 0.3, NearUCLA(x) = 0.2 … • Query: recursive application of AND and OR • (Cheap & (NearUCLA | NearBeach)) & RedRoof • Combination of two grades for “AND” and “OR” • Binary function $t: [0, 1]^2 \to [0, 1]$ • Example: min(a, b) for “AND”? (Cheap & NearUCLA)(x) = min(0.3, 0.2) = 0.2 • Properties of AND/OR scoring function?

Properties of Scoring Function • Logical equivalence • The same overall score for logically equivalent queries • (A & (B | C))(x) = ((A & B) | (A & C))(x) • Monotonicity • If A(x1) < A(x2) and B(x1) < B(x2), then (A & B)(x1) < (A & B)(x2) • $t(x_1, x_2) < t(x'_1, x'_2)$ if $x_i < x'_i$ for all $i$

Uniqueness Theorem • min() and max() are the only scoring functions with the two properties • min() for “AND” and max() for “OR” • Quite surprising and interesting result • More discussion later • Is logical equivalence really true?
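
The recursive evaluation described on the preceding slides can be sketched in Python as follows. This is only an illustration: the tuple encoding of queries and the name `score` are assumptions made here, and the leaf grades are read off the General Query slide in the order listed there.

```python
from typing import Dict, Tuple, Union

# A query is an atomic predicate name or a tuple (op, left, right).
# This encoding is hypothetical, chosen only for the sketch.
Query = Union[str, Tuple[str, "Query", "Query"]]


def score(query: Query, grades: Dict[str, float]) -> float:
    """Grade a query recursively: min for AND ("&"), max for OR ("|")."""
    if isinstance(query, str):
        return grades[query]  # atomic grade supplied by the application
    op, left, right = query
    a, b = score(left, grades), score(right, grades)
    if op == "&":
        return min(a, b)
    if op == "|":
        return max(a, b)
    raise ValueError(f"unknown operator: {op}")


grades = {"Cheap": 0.3, "NearUCLA": 0.2, "NearBeach": 0.8, "RedRoof": 0.6}

# (Cheap & (NearUCLA | NearBeach)) & RedRoof
query = ("&", ("&", "Cheap", ("|", "NearUCLA", "NearBeach")), "RedRoof")
overall = score(query, grades)  # min(min(0.3, max(0.2, 0.8)), 0.6)

# Logical equivalence: A & (B | C) versus (A & B) | (A & C)
lhs = score(("&", "Cheap", ("|", "NearUCLA", "NearBeach")), grades)
rhs = score(("|", ("&", "Cheap", "NearUCLA"), ("&", "Cheap", "NearBeach")), grades)
```

With min and max, `lhs` and `rhs` agree for these grades, which is the logical-equivalence property in action.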

Question on Logical Equivalence? • Query: Homepage of “John Grisham” • PageRank & John & Grisham • Logically equivalent, but are they the same? • Does logical equivalence hold for non-Boolean queries? • [Figure: two query trees over the leaves PR, John, Grisham, each built from two & nodes]

Summary of Scoring Function • Question: how to combine rankings • Scoring function: combine grades • Results from fuzzy logic • Logical equivalence • Monotonicity • Uniqueness theorem • min() for “AND” and max() for “OR” • Logical equivalence may not be valid for graded Boolean expressions

Weighting of Grades • Cheap(x) & NearUCLA(x) • What if proximity is “more important” than price? • Assign weights to each atomic query • Cheap(x) = 0.2, weight = 1 • NearUCLA(x) = 0.8, weight = 10 • Proximity is 10 times more important than price • Overall grade?

Formalization • m atomic queries • $\Theta = (\theta_1, \dots, \theta_m)$: weight of each atomic query • $X = (x_1, \dots, x_m)$: grades from each atomic query • $f(x_1, \dots, x_m)$: unweighted scoring function • $f_\Theta(x_1, \dots, x_m)$: new weighted scoring function • What should $f_\Theta(x_1, \dots, x_m)$ be, given $f$? Properties of $f_\Theta(x_1, \dots, x_m)$?

Properties • P1: When all weights are equal, $f_{(1/m, \dots, 1/m)}(x_1, \dots, x_m) = f(x_1, \dots, x_m)$ • P2: If an argument has zero weight, we can safely drop the argument: $f_{(\theta_1, \dots, \theta_{m-1}, 0)}(x_1, \dots, x_m) = f_{(\theta_1, \dots, \theta_{m-1})}(x_1, \dots, x_{m-1})$ • P3: $f_\Theta(X)$ should be locally linear: $f_{a\Theta + (1-a)\Theta'}(x_1, \dots, x_m) = a\, f_\Theta(x_1, \dots, x_m) + (1-a)\, f_{\Theta'}(x_1, \dots, x_m)$

Local Linearity Example • $\Theta_1 = (1/2, 1/2)$, $f_{\Theta_1}(X) = 0.2$ • $\Theta_2 = (1/4, 3/4)$, $f_{\Theta_2}(X) = 0.4$ • If $\Theta_3 = (3/8, 5/8) = \tfrac{1}{2}\Theta_1 + \tfrac{1}{2}\Theta_2$, then $f_{\Theta_3}(X) = \tfrac{1}{2} f_{\Theta_1}(X) + \tfrac{1}{2} f_{\Theta_2}(X) = 0.3$ • Q: m atomic queries. How many independent weight assignments? • A: m. Only m degrees of freedom • Very strong assumption • Not too unreasonable, but no rationale

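Theorem • With the weights ordered so that $\theta_1 \ge \theta_2 \ge \dots \ge \theta_m$, $f_\Theta(x_1, \dots, x_m) = 1\cdot(\theta_1 - \theta_2)\, f(x_1) + 2\cdot(\theta_2 - \theta_3)\, f(x_1, x_2) + 3\cdot(\theta_3 - \theta_4)\, f(x_1, x_2, x_3) + \dots + m\cdot\theta_m\cdot f(x_1, \dots, x_m)$ is the only function that satisfies such properties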

Examples • $\Theta = (1/3, 1/3, 1/3)$ • $1\cdot(1/3 - 1/3)\, f(x_1) + 2\cdot(1/3 - 1/3)\, f(x_1, x_2) + 3\cdot(1/3)\, f(x_1, x_2, x_3) = f(x_1, x_2, x_3)$ • $\Theta = (1/2, 1/4, 1/4)$ • $1\cdot(1/2 - 1/4)\, f(x_1) + 2\cdot(1/4 - 1/4)\, f(x_1, x_2) + 3\cdot(1/4)\, f(x_1, x_2, x_3) = \tfrac{1}{4} f(x_1) + \tfrac{3}{4} f(x_1, x_2, x_3)$ • $\Theta = (1/2, 1/3, 1/6)$ • $1\cdot(1/2 - 1/3)\, f(x_1) + 2\cdot(1/3 - 1/6)\, f(x_1, x_2) + 3\cdot(1/6)\, f(x_1, x_2, x_3) = \tfrac{1}{6} f(x_1) + \tfrac{2}{6} f(x_1, x_2) + \tfrac{3}{6} f(x_1, x_2, x_3)$
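
The weighting formula can be checked numerically with a short Python sketch. The function name `weighted_score` and the sample grades are assumptions made for illustration. The sketch also assumes the weights sum to 1 and sorts the arguments by descending weight before applying the formula.

```python
from fractions import Fraction
from typing import Callable, Sequence


def weighted_score(f: Callable[[Sequence], object],
                   weights: Sequence, grades: Sequence):
    """Sum of i * (theta_i - theta_{i+1}) * f(x_1..x_i), with theta_{m+1} = 0."""
    if len(weights) != len(grades):
        raise ValueError("weights and grades must have the same length")
    # Order arguments by descending weight, as the formula expects.
    pairs = sorted(zip(weights, grades), key=lambda p: p[0], reverse=True)
    thetas = [w for w, _ in pairs] + [0]  # trailing 0 gives the m * theta_m term
    xs = [x for _, x in pairs]
    total = 0
    for i in range(1, len(xs) + 1):
        coeff = i * (thetas[i - 1] - thetas[i])
        if coeff != 0:
            total += coeff * f(xs[:i])
    return total


F = Fraction
grades = [F(9, 10), F(1, 2), F(7, 10)]  # hypothetical grades x1, x2, x3

# Theta = (1/2, 1/4, 1/4): 1/4 f(x1) + 3/4 f(x1, x2, x3)
skewed = weighted_score(min, [F(1, 2), F(1, 4), F(1, 4)], grades)

# Equal weights reduce to the unweighted function (property P1).
equal = weighted_score(min, [F(1, 3)] * 3, grades)
```

Exact fractions are used so that the results can be compared with the hand-computed coefficients without rounding noise.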

Summary of Weighting • Question: different “importance” of grades • $\Theta = (\theta_1, \dots, \theta_m)$: weight assignment • Uniqueness theorem • Local linearity and two other reasonable assumptions • $1\cdot(\theta_1 - \theta_2)\, f(x_1) + 2\cdot(\theta_2 - \theta_3)\, f(x_1, x_2) + \dots + m\cdot\theta_m\cdot f(x_1, \dots, x_m)$ • Linearity assumption questionable

Application? • Web page ranking • PageRank & (Keyword1 & Keyword2 & …) • Should we use min()? • min(keyword1, keyword2, keyword3, …) • Would it be better than the cosine measure? • If PageRank is 10 times more important, should we use Fagin’s formula? • 9/11 PR + 2/11 min(PR, min(keywords)) • Would it be better than other ranking functions? • Is Fagin’s formula practical?
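
The coefficients 9/11 and 2/11 follow from the weighting formula under two assumptions that the slide does not state explicitly: the 10:1 importance is normalized to $\Theta = (10/11, 1/11)$, and min applied to a single grade returns that grade. Writing $K$ for min(keywords):

```latex
\begin{align*}
f_\Theta(\mathrm{PR}, K)
  &= 1\cdot(\theta_1 - \theta_2)\, f(\mathrm{PR}) + 2\cdot\theta_2\, f(\mathrm{PR}, K) \\
  &= \left(\tfrac{10}{11} - \tfrac{1}{11}\right)\mathrm{PR}
     + 2\cdot\tfrac{1}{11}\,\min(\mathrm{PR}, K) \\
  &= \tfrac{9}{11}\,\mathrm{PR} + \tfrac{2}{11}\,\min(\mathrm{PR}, K)
\end{align*}
```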

Question • How can we process ranked queries efficiently? • Top k answers for “Cheap(x) & NearUCLA(x)” • Assume we have good scoring functions • How do we process a traditional Boolean query? • GPA > 3.5 & Dept = “CS” • What’s the difference? • What is difficult compared to a Boolean query?

Naïve Solution • Cheap(x) & NearUCLA(x) • Read prices of all houses • Compute distances of all houses • Compute combined grades of all houses • Return the k objects with the highest grades • Clearly very expensive when the database is large
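
As a baseline, the naïve solution might look like the sketch below. The function name `naive_top_k` and the four sample objects are illustrative assumptions. Every object is graded, which is what makes the approach expensive.

```python
import heapq
from typing import Callable, Dict, List, Sequence, Tuple


def naive_top_k(grades_by_object: Dict[str, Sequence[float]], k: int,
                f: Callable[[Sequence[float]], float] = min
                ) -> List[Tuple[float, str]]:
    """Grade every object with f, then keep the k highest (grade, object) pairs."""
    scored = ((f(gs), obj) for obj, gs in grades_by_object.items())
    return heapq.nlargest(k, scored)


# Hypothetical (x1, x2) grades for four objects.
houses = {
    "a": (0.9, 0.85),
    "b": (0.8, 0.5),
    "c": (0.7, 0.2),
    "d": (0.6, 0.9),
}
top2 = naive_top_k(houses, k=2)
```

The cost is one grade computation per object per atomic query, regardless of k.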

Main Idea • We don’t have to check all objects/tuples • Most tuples have low grades and will not be returned • Basic algorithm • Check top objects from each atomic query and find the best objects • Question: How many objects should we see from each “atomic query”?

Architecture • Sorted access • Random access • f (x1, x2, x3): any monotonic function • How many to check? How to minimize it? • [Figure: three graded streams (d: 0.9, a: 0.85, b: 0.78, …), (b: 0.9, d: 0.9, a: 0.75, …), (a: 0.9, b: 0.8, c: 0.7, …) combined by f (x1, x2, x3), with the grades b: 0.78 and a: 0.75 shown as example accesses]

Three Papers • Fuzzy queries • Optimal aggregation • Minimal probing

Fagin’s Model • [Figure: three streams read by sorted access, (d: 0.9, a: 0.85, b: 0.78, …), (b: 0.9, d: 0.9, a: 0.75, …), (a: 0.9, b: 0.8, c: 0.7, …), feeding f (x1, x2, x3)]

Fagin’s Model • Sorted access on all streams • Cost model: number of objects accessed by sorted and random accesses, $c_s s + c_r r$ (s sorted accesses at unit cost $c_s$, r random accesses at unit cost $c_r$) • Ignore the cost for “sorting” • Reasonable when objects have been sorted already • Sorted index • Inappropriate when objects have not been sorted • We have to compute grades for all objects • Sorting can be costly

Main Question • How many objects to access? When can we stop? • A: When we know that we have seen at least k objects whose scores are at least as high as that of any unseen object

Fagin’s First Algorithm • Read objects from each stream in parallel • Stop when k objects have been seen in common from all streams • Top answers should be in the union of the objects that we have seen • Why? • [Figure: three sorted streams (d: 0.9, a: 0.85, b: 0.78, …), (b: 0.9, d: 0.9, a: 0.75, …), (a: 0.9, b: 0.8, c: 0.7, …) feeding f (x1, x2, x3), with the k common objects marked]

Stopping Condition • Reason • The grades of the k objects in the intersection are at least as high as that of any unseen object • Proof • x: object in the intersection, y: unseen object • x has appeared in stream 1 by sorted access but y has not, so $y_1 \le x_1$. Similarly $y_i \le x_i$ for all i • $f(y_1, \dots, y_m) \le f(x_1, \dots, x_m)$ due to monotonicity

Fagin’s First Algorithm • Get objects from each stream in parallel until we have seen k objects in common from all streams • For all objects that we have seen so far • If its complete grade is not known, obtain unknown grades by random access • Find the k objects with the highest grades

Example (k = 2) • Stream x1: a: 0.9, b: 0.8, c: 0.7, … • Stream x2: d: 0.9, a: 0.85, b: 0.5, … • Grades obtained by random access: x2(c) = 0.2, x1(d) = 0.6 • Objects (x1, x2, min(x1, x2)): a (0.9, 0.85, 0.85), d (0.6, 0.9, 0.6), c (0.7, 0.2, 0.2), b (0.8, 0.5, 0.5) • Top 2: a: 0.85, d: 0.6
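
Fagin's first algorithm can be sketched in Python using the grades of the k = 2 example. The name `fagin_fa`, the list-of-pairs stream encoding, and the use of in-memory dictionaries to stand in for random access are all assumptions of this sketch. It also assumes that every object appears in every stream.

```python
from typing import Callable, List, Sequence, Tuple

Stream = List[Tuple[str, float]]  # (object, grade) pairs, highest grade first


def fagin_fa(streams: Sequence[Stream], k: int,
             f: Callable[[Sequence[float]], float] = min):
    """Return (top-k list of (grade, object), sorted-access depth reached)."""
    lookup = [dict(s) for s in streams]  # stands in for random access
    seen = [set() for _ in streams]      # objects seen per stream
    depth = 0
    limit = min(len(s) for s in streams)
    # Phase 1: parallel sorted access until k objects are common to all streams.
    while depth < limit:
        for i, s in enumerate(streams):
            seen[i].add(s[depth][0])
        depth += 1
        if len(set.intersection(*seen)) >= k:
            break
    # Phase 2: complete the grades of every object seen in any stream.
    candidates = set.union(*seen)
    scored = [(f([table[obj] for table in lookup]), obj) for obj in candidates]
    scored.sort(key=lambda p: (-p[0], p[1]))
    return scored[:k], depth


x1 = [("a", 0.9), ("b", 0.8), ("c", 0.7), ("d", 0.6)]
x2 = [("d", 0.9), ("a", 0.85), ("b", 0.5), ("c", 0.2)]
top, depth = fagin_fa([x1, x2], k=2)
```

In this run the sorted phase stops at depth 3, once a and b have appeared in both streams, yet the second answer d comes from the union of seen objects rather than from the intersection.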

Performance • We only look at a subset of objects • Ignoring the high cost of random access, clearly better than the naïve solution • Total number of accesses • $O(N^{(m-1)/m} k^{1/m})$ assuming independent and random object order for each atomic query • E.g., $O(N^{1/2} k^{1/2})$ if m = 2

Summary of Fagin’s Algorithm • Sorted access on all streams • Stopping condition • k common objects from all streams

Problem of Fagin’s Algorithm • Performance depends heavily on object orders in the streams • Example: k = 1, min(x1, x2) • Stream 1: b: 1, a: 1, c: 1, d: 0, e: 0 • Stream 2: e: 1, d: 1, b: 1, c: 0, a: 0 • We need to read all objects • Sorted access until the 3rd object, then random access for all the remaining grades • Can we avoid this pathological scenario?

New Idea • Let us read all grades of an object once we see it from a sorted access • Do not need to wait until the streams give k common objects • Less dependent on the object order • When can we stop? • Until we have seen k common objects from sorted accesses?

When Can We Stop? • If we are sure that we have seen at least k objects whose grades are higher than those of unseen objects • How do we know the grades of unseen objects? • Can we predict the maximum grade of unseen objects?

Maximum Grade of Unseen Objects • Assuming min(x1, x2), what will be the maximum grade of unseen objects? • Stream 1: a: 1, b: 0.9, c: 0.8, d: 0.7, e: 0.6 • Stream 2: e: 1, d: 0.8, b: 0.7, c: 0.7, a: 0.2 • Having read down to 0.8 in stream 1 and 0.7 in stream 2, any unseen object has $x_1 \le 0.8$ and $x_2 \le 0.7$, so its grade is at most min(0.8, 0.7) = 0.7 • Generalization?

Generalization • $\underline{x}_i$: the minimum grade seen from stream i by sorted access • $f(\underline{x}_1, \dots, \underline{x}_m)$ is the maximum possible grade of unseen objects • $x_i \le \underline{x}_i$ for all unseen objects • $f(x_1, \dots, x_m)$: monotonic

Basic Idea of TA • We can stop when the top k seen object grades are at least the maximum possible grade of unseen objects • Maximum grade of unseen objects: $f(\underline{x}_1, \dots, \underline{x}_m)$, where $\underline{x}_i$ is the lowest grade seen so far from stream i

Threshold Algorithm • Read one object from each stream by sorted access • For each object O that we just read • Get all grades for O by random access • If f (O) is in the top k, store it in a buffer • If the lowest grade of the top k objects is at least the threshold $f(\underline{x}_1, \dots, \underline{x}_m)$, where $\underline{x}_i$ is the lowest grade seen so far from stream i, stop

Example (k = 2) • Stream x1: a: 0.9, b: 0.8, c: 0.7, … • Stream x2: d: 0.9, a: 0.85, b: 0.5, … • Grades obtained by random access include x1(d) = 0.6 and x2(c) = 0.2 • Threshold: f (1, 1) = 1 initially, then f (0.9, 0.9) = 0.9, f (0.8, 0.85) = 0.8, f (0.7, 0.5) = 0.5 • Objects (x1, x2, min(x1, x2)): a (0.9, 0.85, 0.85), d (0.6, 0.9, 0.6), b (0.8, 0.5, 0.5), c (0.7, 0.2, 0.2) • Top 2: a: 0.85, d: 0.6
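
The threshold algorithm can be sketched along the same lines. The name `threshold_algorithm` and the stream encoding are assumptions of this sketch. Random access is simulated with in-memory dictionaries, every object is assumed to appear in every stream, and k is assumed to be at least 1. The stopping test uses "at least the threshold".

```python
from typing import Callable, Dict, List, Sequence, Tuple

Stream = List[Tuple[str, float]]  # (object, grade) pairs, highest grade first


def threshold_algorithm(streams: Sequence[Stream], k: int,
                        f: Callable[[Sequence[float]], float] = min):
    """Return (top-k list of (grade, object), number of sorted-access rounds)."""
    lookup = [dict(s) for s in streams]  # stands in for random access
    buffer: Dict[str, float] = {}        # at most k best objects so far
    rounds = 0
    limit = min(len(s) for s in streams)
    while rounds < limit:
        last_seen = []                   # lowest grade read from each stream
        for s in streams:
            obj, grade = s[rounds]
            last_seen.append(grade)
            if obj in buffer:
                continue
            total = f([table[obj] for table in lookup])  # m - 1 random accesses
            if len(buffer) < k:
                buffer[obj] = total
            else:
                worst = min(buffer, key=buffer.get)
                if total > buffer[worst]:
                    del buffer[worst]
                    buffer[obj] = total
        rounds += 1
        threshold = f(last_seen)         # best grade any unseen object can have
        if len(buffer) == k and min(buffer.values()) >= threshold:
            break
    top = sorted(((g, o) for o, g in buffer.items()), key=lambda p: (-p[0], p[1]))
    return top, rounds


x1 = [("a", 0.9), ("b", 0.8), ("c", 0.7), ("d", 0.6)]
x2 = [("d", 0.9), ("a", 0.85), ("b", 0.5), ("c", 0.2)]
top, rounds = threshold_algorithm([x1, x2], k=2)

# Streams from the "Problem of Fagin's Algorithm" slide, k = 1.
p1 = [("b", 1), ("a", 1), ("c", 1), ("d", 0), ("e", 0)]
p2 = [("e", 1), ("d", 1), ("b", 1), ("c", 0), ("a", 0)]
p_top, p_rounds = threshold_algorithm([p1, p2], k=1)
```

On the example streams the thresholds are 0.9, 0.8, and 0.5, so the loop stops in the third round. On the pathological streams it stops after the first round, because b already has grade 1, which equals the threshold.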

Comparison of FA and TA? • TA sees no more objects than FA • TA never stops later than FA • When we have seen k objects in common, their grades are at least the threshold • TA may perform more random accesses than FA • In TA, (m-1) random accesses for each object • In FA, random accesses are done at the end, only for missing grades • TA requires only bounded buffer space (k) • At the expense of more random seeks

Comparison of FA and TA • TA can be better in general, but it may perform more random seeks • What if random seek is very expensive or impossible? • Algorithm with no random seek possible?

Algorithm NRA • An algorithm with no random seek • Isn’t random seek essential? • How can we know the grade of an object when some of its grades are missing?

Basic Idea • We may still compute the lower bound of an object, even if we miss some of its grades • E.g., $\max(0.6, x) \ge 0.6$ • We may also compute the upper bound of an object, even if we miss some of its grades • E.g., $\max(0.6, x) \le 0.8$ if $x \le 0.8$ • If the lower bound of O1 is higher than the upper bound of other objects, we can return O1

Generalization • $(\underline{x}_1, \dots, \underline{x}_m)$: the minimum grades seen from sorted access • Lower bound of object: 0 for missing grades • When $x_3, x_4$ are missing, $f(x_1, x_2, 0, 0)$ • From monotonicity • Upper bound of object: $\underline{x}_i$ for missing grades • When $x_3, x_4$ are missing, $f(x_1, x_2, \underline{x}_3, \underline{x}_4)$ • $x_3 \le \underline{x}_3$, $x_4 \le \underline{x}_4$, thus $f(x_1, x_2, x_3, x_4) \le f(x_1, x_2, \underline{x}_3, \underline{x}_4)$
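
The two bounds can be sketched in Python as below. The function name `grade_bounds` and the convention of marking a missing grade with `None` are assumptions made for illustration. This covers only the bound computation, not the full NRA algorithm.

```python
from typing import Callable, Optional, Sequence, Tuple


def grade_bounds(known: Sequence[Optional[float]],
                 last_seen: Sequence[float],
                 f: Callable[[Sequence[float]], float] = min
                 ) -> Tuple[float, float]:
    """Bound an object's grade when some of its grades are missing (None).

    Lower bound: substitute 0 for each missing grade.
    Upper bound: substitute the lowest grade seen so far in that stream.
    Both bounds rely on f being monotonic.
    """
    lower = f([0.0 if g is None else g for g in known])
    upper = f([ls if g is None else g for g, ls in zip(known, last_seen)])
    return lower, upper


# The slide's example: max(0.6, x) with x missing and x <= 0.8.
# The first entry of last_seen (0.5) is a made-up value; it is not used
# because the object's first grade is already known.
lo_max, hi_max = grade_bounds([0.6, None], last_seen=[0.5, 0.8], f=max)

# The same object under min: the grade lies between 0 and 0.6.
lo_min, hi_min = grade_bounds([0.6, None], last_seen=[0.5, 0.8], f=min)
```

An object can be returned without random access once its lower bound is at least the upper bound of every other candidate.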
