
Query Processing in Information Retrieval Systems

Learn about the formulation, document representation, and classic models like Boolean and Vector Space Model. Dive into TF-IDF scoring, VSM advantages, and retrieval evaluation techniques.

joewilliams

Presentation Transcript


  1. INF 2914: Information Retrieval and Web Search • Lecture 10: Query Processing • These slides are adapted from Stanford’s class CS276 / LING 286, Information Retrieval and Web Mining

  2. Algorithms for Large Data Sets Ziv Bar-Yossef http://www.ee.technion.ac.il/courses/049011

  3. Abstract Formulation • Ingredients: • D: document collection • Q: query space • $f: D \times Q \to \mathbb{R}$: relevance scoring function • For every q in Q, f induces a ranking (partial order) $\preceq_q$ on D • Functions of an IR system: • Preprocess D and create an index I • Given q in Q, use I to produce a permutation $\pi$ on D

  4. Document Representation • $T = \{t_1, \dots, t_k\}$: a “token space” (a.k.a. “feature space” or “term space”) • Ex: all words in English • Ex: phrases, URLs, … • A document: a real vector $d \in \mathbb{R}^k$ • $d_i$: “weight” of token $t_i$ in d • Ex: $d_i$ = normalized # of occurrences of $t_i$ in d

  5. Classic IR (Relevance) Models • The Boolean model • The Vector Space Model (VSM)

  6. The Boolean Model • A document: a boolean vector $d \in \{0,1\}^k$ • $d_i = 1$ iff $t_i$ belongs to d • A query: a boolean formula q over tokens • $q: \{0,1\}^k \to \{0,1\}$ • Ex: “Michael Jordan” AND (NOT basketball) • Ex: +“Michael Jordan” –basketball • Relevance scoring function: $f(d,q) = q(d)$
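
The Boolean scoring function can be sketched in a few lines. This is a minimal illustration, not part of the original slides: the three-document collection and the helper names `query` and `boolean_retrieve` are hypothetical, and each document is stored as a token set, which is equivalent to a 0/1 vector over the token space.

```python
from typing import Callable, List, Set

# Hypothetical toy collection: each document is the set of tokens it contains.
docs = {
    "d1": {"michael jordan", "basketball", "nba"},
    "d2": {"michael jordan", "machine learning"},
    "d3": {"basketball"},
}

def query(doc: Set[str]) -> int:
    """q(d) for the slide's example: "Michael Jordan" AND (NOT basketball)."""
    return int("michael jordan" in doc and "basketball" not in doc)

def boolean_retrieve(q: Callable[[Set[str]], int]) -> List[str]:
    """Return the IDs of all documents with f(d, q) = q(d) = 1."""
    return sorted(doc_id for doc_id, tokens in docs.items() if q(tokens) == 1)

print(boolean_retrieve(query))  # ['d2']
```

Every document scores either 0 or 1, so the result is an unordered match set rather than a ranking.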

  7. The Boolean Model: Pros & Cons • Advantages: • Simplicity for users • Disadvantages: • Relevance scoring is too coarse (f(d,q) is only 0 or 1, so matching documents cannot be ranked against each other)

  8. The Vector Space Model (VSM) • A document: a real vector $d \in \mathbb{R}^k$ • $d_i$ = weight of $t_i$ in d (usually TF-IDF score) • A query: a real vector $q \in \mathbb{R}^k$ • $q_i$ = weight of $t_i$ in q • Relevance scoring function: $f(d,q) = \mathrm{sim}(d,q)$ • “similarity” between d and q

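  9. Popular Similarity Measures • L1 or L2 distance $\|d - q\|$ • d, q are first normalized to have unit norm • Cosine similarity: $\cos\theta = \frac{\langle d, q \rangle}{\|d\| \, \|q\|}$, where $\theta$ is the angle between d and q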
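
A minimal sketch of the two measures, assuming dense, non-zero vectors stored as Python lists. The function names and the sample vectors are illustrative assumptions. For unit-norm vectors the squared L2 distance equals 2 − 2·cos θ, so both measures produce the same ranking.

```python
import math

def norm(v):
    """Euclidean (L2) norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))

def cosine_similarity(d, q):
    """Cosine of the angle between d and q (both assumed non-zero)."""
    dot = sum(a * b for a, b in zip(d, q))
    return dot / (norm(d) * norm(q))

def l2_distance_normalized(d, q):
    """L2 distance after scaling d and q to unit norm."""
    nd, nq = norm(d), norm(q)
    return math.sqrt(sum((a / nd - b / nq) ** 2 for a, b in zip(d, q)))

d = [3.0, 0.0, 4.0]  # hypothetical document weights
q = [1.0, 0.0, 0.0]  # hypothetical query weights
cos = cosine_similarity(d, q)
dist = l2_distance_normalized(d, q)
print(cos)                      # cosine similarity
print(dist ** 2, 2 - 2 * cos)   # the two values agree up to rounding
```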

  10. TF-IDF Score: Motivation • Motivating principle: • A term $t_i$ is relevant to a document d if: • $t_i$ occurs many times in d relative to other terms that occur in d • $t_i$ occurs many times in d relative to its number of occurrences in other documents • Examples: • 10 out of 100 terms in d are “java” • 10 out of 10,000 terms in d are “java” • 10 out of 100 terms in d are “the”

  11. TF-IDF Score: Definition • $n(d,t_i)$ = # of occurrences of $t_i$ in d • $N = \sum_i n(d,t_i)$ (# of tokens in d) • $D_i$ = # of documents containing $t_i$ • D = # of documents in the collection • $\mathrm{TF}(d,t_i)$: “Term Frequency” • Ex: $\mathrm{TF}(d,t_i) = n(d,t_i) / N$ • Ex: $\mathrm{TF}(d,t_i) = n(d,t_i) / \max_j n(d,t_j)$ • $\mathrm{IDF}(t_i)$: “Inverse Document Frequency” • Ex: $\mathrm{IDF}(t_i) = \log(D / D_i)$ • $\mathrm{TFIDF}(d,t_i) = \mathrm{TF}(d,t_i) \times \mathrm{IDF}(t_i)$
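
The definition can be tried on a tiny collection. The sketch below uses the first TF variant (occurrences divided by document length) and the log(D/D_i) form of IDF; the three-document `corpus` and the function names are hypothetical, and every queried term is assumed to occur in at least one document.

```python
import math
from collections import Counter

# Hypothetical tokenized collection.
corpus = {
    "d1": ["java", "coffee", "java", "island"],
    "d2": ["java", "code"],
    "d3": ["the", "coffee"],
}

def tf(doc_tokens, term):
    """TF(d, t) = n(d, t) / N, with N the number of tokens in d."""
    return Counter(doc_tokens)[term] / len(doc_tokens)

def idf(term):
    """IDF(t) = log(D / D_t); assumes the term occurs in at least one document."""
    containing = sum(1 for tokens in corpus.values() if term in tokens)
    return math.log(len(corpus) / containing)

def tfidf(doc_id, term):
    """TFIDF(d, t) = TF(d, t) * IDF(t)."""
    return tf(corpus[doc_id], term) * idf(term)

print(tfidf("d1", "java"))    # frequent in d1, but also common in the collection
print(tfidf("d1", "island"))  # rarer in d1, but unique to it
```

In this toy collection “island” outscores “java” in d1 despite occurring half as often, because it appears in no other document.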

  12. VSM: Pros & Cons • Advantages: • Better granularity in relevance scoring • Good performance in practice • Efficient implementations • Disadvantages: • Assumes term independence

  13. Retrieval Evaluation • Notations: • D: document collection • $D_q$: documents in D that are “relevant” to query q • Ex: f(d,q) is above some threshold • $L_q$: list of results on query q • Recall: $|L_q \cap D_q| / |D_q|$ • Precision: $|L_q \cap D_q| / |L_q|$

  14. Precision & Recall: Example • Relevant docs: d123, d56, d9, d25, d3 • List A: d123, d84, d56, d6, d8, d9, d511, d129, d187, d25 • Recall(A) = 80% • Precision(A) = 40% • List B: d81, d74, d56, d123, d511, d25, d9, d129, d3, d5 • Recall(B) = 100% • Precision(B) = 50%
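
The four percentages can be checked directly. This is a small sketch that takes recall as the fraction of relevant documents that appear in the list and precision as the fraction of listed documents that are relevant; the function names are illustrative.

```python
relevant = {"d123", "d56", "d9", "d25", "d3"}
list_a = ["d123", "d84", "d56", "d6", "d8", "d9", "d511", "d129", "d187", "d25"]
list_b = ["d81", "d74", "d56", "d123", "d511", "d25", "d9", "d129", "d3", "d5"]

def recall(results, relevant_docs):
    """Fraction of the relevant documents that appear in the result list."""
    return len(set(results) & relevant_docs) / len(relevant_docs)

def precision(results, relevant_docs):
    """Fraction of the result list that is relevant."""
    return len(set(results) & relevant_docs) / len(results)

print(recall(list_a, relevant), precision(list_a, relevant))  # 0.8 0.4
print(recall(list_b, relevant), precision(list_b, relevant))  # 1.0 0.5
```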

  15. Precision@k and Recall@k • Notations: • $D_q$: documents in D that are “relevant” to q • $L_{q,k}$: top k results on the list • Recall@k: $|L_{q,k} \cap D_q| / |D_q|$ • Precision@k: $|L_{q,k} \cap D_q| / k$

  16. Precision@k: Example • List A: d123, d84, d56, d6, d8, d9, d511, d129, d187, d25 • List B: d81, d74, d56, d123, d511, d25, d9, d129, d3, d5

  17. Recall@k: Example • List A: d123, d84, d56, d6, d8, d9, d511, d129, d187, d25 • List B: d81, d74, d56, d123, d511, d25, d9, d129, d3, d5
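
A sketch for tabulating both measures on the two lists, taking precision@k as the fraction of the top k results that are relevant and recall@k as the fraction of relevant documents found in the top k. The relevant set is the one from the earlier precision and recall example; the function names are illustrative.

```python
relevant = {"d123", "d56", "d9", "d25", "d3"}
list_a = ["d123", "d84", "d56", "d6", "d8", "d9", "d511", "d129", "d187", "d25"]
list_b = ["d81", "d74", "d56", "d123", "d511", "d25", "d9", "d129", "d3", "d5"]

def precision_at_k(results, relevant_docs, k):
    """Fraction of the top-k results that are relevant."""
    return len(set(results[:k]) & relevant_docs) / k

def recall_at_k(results, relevant_docs, k):
    """Fraction of the relevant documents found among the top-k results."""
    return len(set(results[:k]) & relevant_docs) / len(relevant_docs)

for k in (1, 3, 5, 10):
    print(k,
          round(precision_at_k(list_a, relevant, k), 3),
          round(recall_at_k(list_a, relevant, k), 3),
          round(precision_at_k(list_b, relevant, k), 3),
          round(recall_at_k(list_b, relevant, k), 3))
```

At k = 10 the values coincide with the whole-list precision and recall (40% and 80% for List A, 50% and 100% for List B), while small k favours List A, which places a relevant document first.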

  18. “Interpolated” Precision • Notations: • $D_q$: documents in D that are “relevant” to q • r: a recall level (e.g., 20%) • k(r): first k so that recall@k ≥ r • Interpolated precision at recall level r = max { precision@k : k ≥ k(r) }

  19. Precision vs. Recall: Example • List A: d123, d84, d56, d6, d8, d9, d511, d129, d187, d25 • List B: d81, d74, d56, d123, d511, d25, d9, d129, d3, d5
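
The precision-versus-recall points for the two lists can be computed from the interpolated-precision definition. The sketch below is an assumption-laden illustration: the helper names are hypothetical, the relevant set is the one from the earlier example, and a recall level that the list never reaches is reported as `None`.

```python
relevant = {"d123", "d56", "d9", "d25", "d3"}
list_a = ["d123", "d84", "d56", "d6", "d8", "d9", "d511", "d129", "d187", "d25"]
list_b = ["d81", "d74", "d56", "d123", "d511", "d25", "d9", "d129", "d3", "d5"]

def precision_at_k(results, relevant_docs, k):
    return len(set(results[:k]) & relevant_docs) / k

def recall_at_k(results, relevant_docs, k):
    return len(set(results[:k]) & relevant_docs) / len(relevant_docs)

def interpolated_precision(results, relevant_docs, r):
    """max{precision@k : k >= k(r)}, where k(r) is the first k with recall@k >= r."""
    n = len(results)
    k_r = next((k for k in range(1, n + 1)
                if recall_at_k(results, relevant_docs, k) >= r), None)
    if k_r is None:
        return None  # the list never reaches recall level r
    return max(precision_at_k(results, relevant_docs, k) for k in range(k_r, n + 1))

for r in (0.2, 0.4, 0.6, 0.8, 1.0):
    print(r, interpolated_precision(list_a, relevant, r),
          interpolated_precision(list_b, relevant, r))
```

List A starts at interpolated precision 1.0 for r = 20% but never reaches 100% recall; List B reaches every recall level but with lower precision at the low levels.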

  20. Top-k Query Processing • Optimal aggregation algorithms for middleware, by Ronald Fagin, Amnon Lotem, and Moni Naor • Based on the presentation of Wesley Sebrechts and Joost Voordouw • Modified by Vagelis Hristidis

  21. Why top-k query processing • Multimedia brings fuzzy data • attribute values are graded, typically in [0,1] • No clear boundary between “answer” / “no answer” • A query in a multimedia database means combining graded attributes • Combine attributes by aggregation function • Aggregation function gives overall grade of object • Return k objects with highest overall grade

  22. Top-k query processing • Top-k query processing = finding the k objects that have the highest overall grades • How? Which algorithms? • Fagin’s Algorithm (FA) • Threshold Algorithm (TA) • Which is the best algorithm? • Keep in mind: the database system serves as middleware • Multimedia (objects) may be kept in different subsystems • e.g. photoDB, videoDB, search engine • Take into account the limitations of these subsystems

  23. Example • Simple database model • Simple query • Explaining Fagin’s Algorithm (FA) • Finding top-k with FA • Explaining Threshold Algorithm (TA) • Finding top-k with TA

  24. Example – Simple Database model • N objects, each with M attributes • Grades (Object ID: Attribute 1, Attribute 2): a: 0.9, 0.85 • b: 0.8, 0.7 • c: 0.72, 0.2 • … • d: 0.6, 0.9 • Sorted L1 (by Attribute 1): (a, 0.9), (b, 0.8), (c, 0.72), …, (d, 0.6) • Sorted L2 (by Attribute 2): (d, 0.9), (a, 0.85), (b, 0.7), …, (c, 0.2)

  25. Example – Simple Query • Find the top 2 (k = 2) objects for the following ‘query’ executed on the middleware: A1 & A2 (e.g. color=red & shape=round) • A1 & A2 as a ‘query’ to the middleware results in the middleware combining the grades of A1 and A2 by min(A1, A2) • Aggregation function: • function that gives objects an overall grade based on attribute grades • examples: min, max functions • Monotonicity!

  26. Example – Fagin’s Algorithm • STEP 1 • Read attributes from every sorted list • Stop when k objects have been seen in common from all lists • L1: (a, 0.9), (b, 0.8), (c, 0.72), …, (d, 0.6) • L2: (d, 0.9), (a, 0.85), (b, 0.7), …, (c, 0.2) • Seen so far (ID: A1, A2): a: 0.9, 0.85 • d: not yet known, 0.9 • b: 0.8, 0.7 • c: 0.72, not yet known

  27. Example – Fagin’s Algorithm • STEP 2 • Random access to find missing grades • L1: (a, 0.9), (b, 0.8), (c, 0.72), …, (d, 0.6) • L2: (d, 0.9), (a, 0.85), (b, 0.7), …, (c, 0.2) • Grades (ID: A1, A2): a: 0.9, 0.85 • d: 0.6, 0.9 • b: 0.8, 0.7 • c: 0.72, 0.2

  28. Example – Fagin’s Algorithm • STEP 3 • Compute the grades of the seen objects. • Return the k highest graded objects. • L1: (a, 0.9), (b, 0.8), (c, 0.72), …, (d, 0.6) • L2: (d, 0.9), (a, 0.85), (b, 0.7), …, (c, 0.2) • Grades (ID: A1, A2, Min(A1,A2)): a: 0.9, 0.85, 0.85 • d: 0.6, 0.9, 0.6 • b: 0.8, 0.7, 0.7 • c: 0.72, 0.2, 0.2 • Top 2: a (0.85) and b (0.7)
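
The three steps can be put together in a short sketch. It is not the authors' code: the function name is hypothetical, the elided rows of the example are dropped so that each list holds exactly the four objects a, b, c, d, every list is assumed to contain the same objects, and dictionary lookups stand in for random access.

```python
def fagins_algorithm(sorted_lists, k, agg=min):
    """Sketch of FA; assumes every list holds the same objects, best grade first."""
    m = len(sorted_lists)
    # Random-access tables: grades[i][obj] is the grade of obj in list i.
    grades = [dict(lst) for lst in sorted_lists]
    seen_in = {}  # obj -> set of list indices where obj appeared under sorted access
    depth = 0
    # Step 1: parallel sorted access until k objects have been seen in all m lists.
    while depth < len(sorted_lists[0]):
        for i, lst in enumerate(sorted_lists):
            obj, _grade = lst[depth]
            seen_in.setdefault(obj, set()).add(i)
        depth += 1
        if sum(1 for lists in seen_in.values() if len(lists) == m) >= k:
            break
    # Steps 2 and 3: look up all grades of every seen object (this stands in for
    # random access to the missing ones), then aggregate and rank.
    overall = {obj: agg(g[obj] for g in grades) for obj in seen_in}
    top_k = sorted(overall.items(), key=lambda item: item[1], reverse=True)[:k]
    return top_k, depth

L1 = [("a", 0.9), ("b", 0.8), ("c", 0.72), ("d", 0.6)]
L2 = [("d", 0.9), ("a", 0.85), ("b", 0.7), ("c", 0.2)]
print(fagins_algorithm([L1, L2], k=2))  # ([('a', 0.85), ('b', 0.7)], 3)
```

The second return value is the number of rows read from each list before the stopping condition held.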

  29. New Idea!!! Threshold Algorithm (TA) • Read all grades of an object once it is seen under sorted access • No need to wait until the lists give k common objects • Do sorted access (and corresponding random accesses) until you have seen the top k answers • How do we know that grades of seen objects are higher than the grades of unseen objects? • Predict the maximum possible grade of unseen objects (threshold value): • L1: a: 0.9, b: 0.8, c: 0.72 (seen), …, d: 0.6 • L2: d: 0.9, a: 0.85, b: 0.7 (seen), …, c: 0.2 • Threshold value T = min(0.72, 0.7) = 0.7 • A possibly unseen object f (shown with grades 0.6 and 0.65) cannot have an overall grade above T

  30. Example – Threshold Algorithm • Step 1: • parallel sorted access to each list • For each object seen: • get all grades by random access • determine Min(A1,A2) • amongst 2 highest seen? keep in buffer • L1: (a, 0.9), (b, 0.8), (c, 0.72), …, (d, 0.6) • L2: (d, 0.9), (a, 0.85), (b, 0.7), …, (c, 0.2) • Buffer (ID: A1, A2, Min(A1,A2)): a: 0.9, 0.85, 0.85 • d: 0.6, 0.9, 0.6

  31. Example – Threshold Algorithm • Step 2: • Determine threshold value based on objects currently seen under sorted access: T = min(L1, L2) • 2 objects with overall grade ≥ threshold value? stop • else go to next entry position in sorted list and repeat step 1 • L1: a: 0.9, b: 0.8, c: 0.72, …, d: 0.6 • L2: d: 0.9, a: 0.85, b: 0.7, …, c: 0.2 • Buffer (ID: A1, A2, Min(A1,A2)): a: 0.9, 0.85, 0.85 • d: 0.6, 0.9, 0.6 • T = min(0.9, 0.9) = 0.9

  32. Example – Threshold Algorithm • Step 1 (Again): • parallel sorted access to each list • For each object seen: • get all grades by random access • determine Min(A1,A2) • amongst 2 highest seen? keep in buffer • L1: (a, 0.9), (b, 0.8), (c, 0.72), …, (d, 0.6) • L2: (d, 0.9), (a, 0.85), (b, 0.7), …, (c, 0.2) • Seen (ID: A1, A2, Min(A1,A2)): a: 0.9, 0.85, 0.85 • d: 0.6, 0.9, 0.6 • b: 0.8, 0.7, 0.7

  33. Example – Threshold Algorithm • Step 2 (Again): • Determine threshold value based on objects currently seen: T = min(L1, L2) • 2 objects with overall grade ≥ threshold value? stop • else go to next entry position in sorted list and repeat step 1 • L1: a: 0.9, b: 0.8, c: 0.72, …, d: 0.6 • L2: d: 0.9, a: 0.85, b: 0.7, …, c: 0.2 • Buffer (ID: A1, A2, Min(A1,A2)): a: 0.9, 0.85, 0.85 • b: 0.8, 0.7, 0.7 • T = min(0.8, 0.85) = 0.8

  34. Example – Threshold Algorithm • Situation at stopping condition • L1: a: 0.9, b: 0.8, c: 0.72, …, d: 0.6 • L2: d: 0.9, a: 0.85, b: 0.7, …, c: 0.2 • Buffer (ID: A1, A2, Min(A1,A2)): a: 0.9, 0.85, 0.85 • b: 0.8, 0.7, 0.7 • T = min(0.72, 0.7) = 0.7, so both buffered objects have an overall grade ≥ T
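
The walkthrough above can be reproduced with a short sketch. It is an illustration under stated assumptions rather than the authors' implementation: the function name is hypothetical, the elided rows of the example are dropped so each list holds only a, b, c, d, all lists are assumed to contain the same objects, and dictionary lookups stand in for random access. The buffer never holds more than k objects.

```python
def threshold_algorithm(sorted_lists, k, agg=min):
    """Sketch of TA; assumes lists over the same objects, best grade first."""
    grades = [dict(lst) for lst in sorted_lists]  # stands in for random access
    buffer = {}      # at most k best objects seen so far: obj -> overall grade
    thresholds = []  # threshold value after each round of sorted access
    for depth in range(len(sorted_lists[0])):
        last_seen = []
        for lst in sorted_lists:
            obj, grade = lst[depth]          # sorted access
            last_seen.append(grade)
            buffer[obj] = agg(g[obj] for g in grades)  # all grades of obj
            if len(buffer) > k:
                # Keep only the k highest overall grades.
                del buffer[min(buffer, key=buffer.get)]
        # Threshold: aggregate of the last grade seen in each list.
        thresholds.append(agg(last_seen))
        if len(buffer) == k and min(buffer.values()) >= thresholds[-1]:
            break
    top_k = sorted(buffer.items(), key=lambda item: item[1], reverse=True)
    return top_k, thresholds

L1 = [("a", 0.9), ("b", 0.8), ("c", 0.72), ("d", 0.6)]
L2 = [("d", 0.9), ("a", 0.85), ("b", 0.7), ("c", 0.2)]
print(threshold_algorithm([L1, L2], k=2))
# ([('a', 0.85), ('b', 0.7)], [0.9, 0.8, 0.7])
```

The returned thresholds 0.9, 0.8, 0.7 match the three rounds of the example, and the run stops as soon as both buffered grades reach the threshold.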

  35. Comparison of Fagin’s and Threshold Algorithm • TA sees fewer objects than FA • TA stops at least as early as FA • When we have seen k objects in common in FA, their grades are greater than or equal to the threshold in TA. • TA may perform more random accesses than FA • In TA, (m-1) random accesses for each object • In FA, random accesses are done at the end, only for missing grades • TA requires only bounded buffer space (k) • At the expense of more random seeks • FA makes use of unbounded buffers
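
The trade-off can be made concrete on the running example by pricing each run as (number of sorted accesses × cost per sorted access) + (number of random accesses × cost per random access), the middleware cost used in the slides that follow. The access counts below are read off the example, where both algorithms stop after three rows of two lists; TA is charged (m − 1) = 1 random access for every sorted access, with no caching. The unit costs are hypothetical.

```python
def middleware_cost(sorted_accesses, random_accesses, c_s, c_r):
    """Middleware cost = s * c_S + r * c_R."""
    return sorted_accesses * c_s + random_accesses * c_r

# Running example: two lists, three rows read from each by both algorithms.
fa_sorted, fa_random = 6, 2  # FA: random access only for c's A2 and d's A1
ta_sorted, ta_random = 6, 6  # TA: one random access per sorted access

c_s, c_r = 1.0, 10.0         # hypothetical unit costs (random access 10x dearer)
print(middleware_cost(fa_sorted, fa_random, c_s, c_r))  # 26.0
print(middleware_cost(ta_sorted, ta_random, c_s, c_r))  # 66.0
```

On this small instance TA gains nothing from stopping early, so its extra random accesses dominate; its advantage appears on inputs where FA has to read much deeper before k common objects show up.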

  36. The best algorithm • Which algorithm is the best? • Define “best” • middleware cost • concept of instance optimality • Consider: • wild guesses • aggregation functions characteristics • Monotone, strictly monotone, strict • database restrictions • distinctness property

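  37. The best algorithm: concept of optimality • $\mathbf{A}$ = class of algorithms, $A \in \mathbf{A}$ represents an algorithm • $\mathbf{D}$ = legal inputs to algorithms (databases), $D \in \mathbf{D}$ represents a database • middleware cost = cost for processing data subsystems = $s \cdot c_S + r \cdot c_R$ (s sorted accesses at cost $c_S$ each, r random accesses at cost $c_R$ each) • Cost(A, D) = middleware cost when running algorithm A over database D • Algorithm B is instance optimal over $\mathbf{A}$ and $\mathbf{D}$ if: $B \in \mathbf{A}$ and $\mathrm{Cost}(B, D) = O(\mathrm{Cost}(A, D))$ for all $A \in \mathbf{A}$, $D \in \mathbf{D}$ • Which means that: $\mathrm{Cost}(B, D) \le c \cdot \mathrm{Cost}(A, D) + c'$ for all $A \in \mathbf{A}$, $D \in \mathbf{D}$ • c = optimality ratio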

  38. The best algorithm: instance optimality & wild guesses • Intuitively: B instance optimal = always the best algorithm in $\mathbf{A}$ = always optimal • In reality: “always” is qualified → we will exclude algorithms that make wild guesses • Wild guess = random access on an object not previously encountered by sorted access • In practice not possible • Database needs to know the ID to do random access • If wild guesses are allowed in $\mathbf{A}$ then no algorithm can be instance optimal • Wild guesses can find top-k objects by k·m random accesses (k = # objects, m = # lists)

  39. The best algorithm: aggregation functions • Aggregation function t combines object grades into object’s overall grade: $(x_1, \dots, x_m) \mapsto t(x_1, \dots, x_m)$ • Monotone: $t(x_1, \dots, x_m) \le t(x'_1, \dots, x'_m)$ if $x_i \le x'_i$ for every i • Strictly monotone: $t(x_1, \dots, x_m) < t(x'_1, \dots, x'_m)$ if $x_i < x'_i$ for every i • Strict: $t(x_1, \dots, x_m) = 1$ precisely when $x_i = 1$ for every i
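
The three properties can be sanity-checked by brute force for m = 2. The sketch below tests each definition on a small grid of grades; the grid, the function names, and the sample aggregation functions are assumptions made for illustration, and passing on a finite grid is a quick check, not a proof.

```python
from itertools import product

GRID = [0.0, 0.5, 1.0]
POINTS = list(product(GRID, repeat=2))  # all grade pairs (x1, x2) on the grid

def is_monotone(t):
    """t(x) <= t(y) whenever x_i <= y_i for every i."""
    return all(t(*x) <= t(*y) for x in POINTS for y in POINTS
               if all(a <= b for a, b in zip(x, y)))

def is_strictly_monotone(t):
    """t(x) < t(y) whenever x_i < y_i for every i."""
    return all(t(*x) < t(*y) for x in POINTS for y in POINTS
               if all(a < b for a, b in zip(x, y)))

def is_strict(t):
    """t(x) = 1 precisely when x_i = 1 for every i."""
    return all((t(*x) == 1.0) == all(a == 1.0 for a in x) for x in POINTS)

def average(a, b):
    return (a + b) / 2

for name, t in [("min", min), ("max", max), ("average", average)]:
    print(name, is_monotone(t), is_strictly_monotone(t), is_strict(t))
```

On this grid min and average satisfy all three properties, while max is monotone and strictly monotone but not strict, since max(1, 0) = 1.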

  40. The best algorithm: database restrictions Distinctness property: A database has no (sorted) attribute list in which two objects have the same grade

  41. Fagin’s Algorithm • Database with N objects, each with m attributes • Orderings of lists are independent • FA finds top-k with middleware cost $O(N^{(m-1)/m} k^{1/m})$ • FA = optimal with high probability in the worst case for strict monotone aggregation functions

  42. Threshold Algorithm • TA = instance optimal (always optimal) for every monotone aggregation function, over every database (excluding wild guesses) • = optimal in a much stronger sense than Fagin’s Algorithm • If strict monotone aggregation function: • Optimality ratio = $m + m(m-1)\,c_R/c_S$ = best possible (m = # attributes) • If random access not possible ($c_R = 0$) → optimality ratio = m • If sorted access not possible ($c_S = 0$) → optimality ratio = infinite → TA not instance optimal • TA = instance optimal (always optimal) for every strictly monotone aggregation function, over every database (including wild guesses) that satisfies the distinctness property • Optimality ratio = $c\,m^2$ with $c = \max\{c_R/c_S,\ c_S/c_R\}$
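
A short worked evaluation of the first optimality ratio, m + m(m−1)·c_R/c_S, may help to read it. The parameter values are hypothetical and chosen only to show how quickly the ratio grows with the number of lists and with the relative cost of random access:

```latex
\begin{align*}
m = 2,\ c_R/c_S = 1:  &\quad m + m(m-1)\,c_R/c_S = 2 + 2 \cdot 1 \cdot 1 = 4 \\
m = 3,\ c_R/c_S = 10: &\quad m + m(m-1)\,c_R/c_S = 3 + 3 \cdot 2 \cdot 10 = 63 \\
c_R = 0:              &\quad m + m(m-1) \cdot 0 = m
\end{align*}
```

The last line recovers the slide's special case in which the optimality ratio is m.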

  43. Optimized Query Execution in Large Search Engines with Global Page Ordering Xiaohui Long Torsten Suel CIS Department Polytechnic University Brooklyn, NY 11201

  44. The Problem: “how to optimize query throughput in large search engines, when the ranking function is a combination of term-based ranking and a global ordering such as Pagerank” Talk Outline: • intro: query processing in search engines • related work: query execution and pruning techniques • algorithmic techniques • experimental evaluation: single and multiple nodes • concluding remarks

  45. Query Processing in Parallel Search Engines • low-cost cluster architecture (usually with additional replication) • [Figure: cluster with global index organization; nodes on a LAN, each holding pages and an index; the query integrator broadcasts each query and combines the results] • local index: every node stores and indexes subset of pages • every query broadcast to all nodes by query integrator (QI) • every node supplies top-10, and QI computes global top-10 • note: we don’t really need top-10 from all, maybe only top-2
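
The combining step at the query integrator can be sketched as a merge of the per-node result lists. Everything here is a hypothetical illustration: the function name, the `(score, page_id)` tuple layout, and the sample scores are assumptions, and the slides do not say how the integrator is implemented.

```python
import heapq

def global_top_k(node_results, k=10):
    """Combine per-node results, given as lists of (score, page_id), into a global top-k."""
    all_candidates = (hit for hits in node_results for hit in hits)
    return heapq.nlargest(k, all_candidates)

# Hypothetical local results from three nodes (each already the node's own top-2).
node_results = [
    [(0.9, "p1"), (0.4, "p2")],
    [(0.8, "p3"), (0.7, "p4")],
    [(0.3, "p5"), (0.2, "p6")],
]
print(global_top_k(node_results, k=2))  # [(0.9, 'p1'), (0.8, 'p3')]
```

In this example the third node contributes nothing to the global result, which is the situation behind the slide's remark that a full top-10 from every node is often unnecessary.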

  46. Related Work on top-k Queries • IR: optimized evaluation of cosine measures (since 1980s) • DB: top-k queries for multimedia databases (Fagin 1996) • does not consider combinations of term-based and global scores • Brin/Page 1998: fancy lists in Google • Related Work (IR): • basic idea: “presort entries in each inverted list by contribution to cosine” • also process inverted lists from shortest to longest list • various schemes, either reliable or probabilistic • most closely related: • Persin/Zobel/Sacks-Davis 1993/96 • Anh/Moffat 1998, Anh/deKretzer/Moffat 2001 • typical assumptions: many keywords/query, OR semantics

  47. Related Work (DB) (Fagin 1996 and others) • motivation: searching multimedia objects by several criteria • typical assumptions: few attributes, OR semantics, random access • FA (Fagin’s algorithm), TA (Threshold algorithm), others • formal bounds: for k lists if lists independent • term-based ranking: presort each list by contribution to cosine

  48. Related Work (Google) (Brin/Page 1998) • “fancy lists” optimization in Google • create extra shorter inverted list for “fancy matches” (matches that occur in URL, anchor text, title, bold face, etc.) • note: fancy matches can be modeled by higher weights in the term-based vector space model • no details given or numbers published • [Figure: inverted lists for “chair” and “table”, each consisting of a fancy list followed by the rest of the list with other matches]

  49. Results of our Paper • pruning techniques for query execution in large search engines • focus on a combination of a term-based and a global score (such as Pagerank) • techniques combine previous approaches such as fancy lists and presorting of lists by term scores • experimental evaluation on 120 million pages • very significant savings with almost no impact on results • it’s good to have a global ordering!

  50. Algorithms: • exhaustive algorithm: “no pruning, traverse entire list” • first-m: “a naïve algorithm with lists sorted by Pagerank; stop after m elements in intersection found” • fancy first-m: “use fancy and non-fancy lists, each sorted by Pagerank, and stop after m elements found” • reliable pruning: “stop when top-k results found” • fancy last-m: “stop when at most m elements unresolved” • single-node and parallel case with optimization
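
The first-m idea can be sketched for two query terms. This is a hedged illustration only: it assumes document IDs are assigned in decreasing Pagerank order, so that an inverted list sorted by ID is also sorted by Pagerank, and the function name and sample postings are hypothetical. The slides do not give the authors' implementation.

```python
def first_m(postings_a, postings_b, m):
    """Intersect two ID-sorted inverted lists, stopping after m common documents.

    Assumes smaller IDs mean higher Pagerank, so the first m hits are the
    m highest-ranked documents that contain both terms.
    """
    i = j = 0
    hits = []
    while i < len(postings_a) and j < len(postings_b) and len(hits) < m:
        if postings_a[i] == postings_b[j]:
            hits.append(postings_a[i])
            i += 1
            j += 1
        elif postings_a[i] < postings_b[j]:
            i += 1
        else:
            j += 1
    return hits

chair = [1, 3, 4, 7, 9, 12]  # hypothetical postings for "chair"
table = [2, 3, 7, 8, 9, 12]  # hypothetical postings for "table"
print(first_m(chair, table, m=2))  # [3, 7]
```

The scan stops as soon as m documents are found, so the tails of both lists are never read. That is the source of the savings, and also why the result is only approximate once term-based scores are taken into account.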
