
Information Retrieval and Search Engines





Presentation Transcript


  1. Information Retrieval and Search Engines Lecture 5: Computing Scores in a Complete Search System Prof. Michael R. Lyu

  2. Motivation of This Lecture • How does Google conduct empirical experiments to find out whether rank is important in IR? • The probability of relevance is not proportional to the probability of retrieval; why does this happen with cosine normalization?

  3. Motivation of This Lecture • How to improve the efficiency of ranking? • How to design the structure of the index to improve performance? • What is the overall picture of a search engine system?

  4. Outline (Ch. 7 of IR Book) • Recap • Why Rank? • More on Cosine • Implementation of Ranking • The Complete Search System

  5. Outline (Ch. 7 of IR Book) • Recap • Why Rank? • More on Cosine • Implementation of Ranking • The Complete Search System

  6. Term Frequency Weight • The log frequency weight of term t in d is defined as follows
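
  Written out (following the standard definition in Ch. 6.2 of the IR book):

    w_{t,d} = \begin{cases} 1 + \log_{10} \mathrm{tf}_{t,d} & \text{if } \mathrm{tf}_{t,d} > 0 \\ 0 & \text{otherwise} \end{cases}

  For example, a term occurring 10 times in d gets weight 2, and a term occurring 1000 times gets weight 4.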

  7. Idf Weight • The document frequency df_t is defined as the number of documents that t occurs in • We define the idf weight of term t as follows: • Idf is a measure of the informativeness of the term
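
  In symbols, with N the number of documents in the collection, the standard idf weight from the IR book is

    \mathrm{idf}_t = \log_{10} \frac{N}{\mathrm{df}_t}

  Rare terms (small df_t) receive high idf; very frequent terms receive idf close to 0.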

  8. Tf-idf Weight • The tf-idf weight of a term is the product of its tf weight and its idf weight
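
  Combining the two definitions above:

    w_{t,d} = \left(1 + \log_{10} \mathrm{tf}_{t,d}\right) \cdot \log_{10} \frac{N}{\mathrm{df}_t}

  The weight increases with the number of occurrences of t within d and with the rarity of t in the collection.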

  9. Cosine Similarity between Query and Document • q_i is the tf-idf weight of term i in the query • d_i is the tf-idf weight of term i in the document • |q| and |d| are the lengths of q and d • q/|q| and d/|d| are length-normalized vectors
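
  Written out, with q and d the tf-idf weight vectors of the query and the document:

    \cos(\vec{q},\vec{d}) = \frac{\vec{q}\cdot\vec{d}}{|\vec{q}|\,|\vec{d}|} = \frac{\sum_{i=1}^{|V|} q_i d_i}{\sqrt{\sum_{i=1}^{|V|} q_i^2}\,\sqrt{\sum_{i=1}^{|V|} d_i^2}}

  For length-normalized vectors, the cosine is simply the dot product \vec{q}\cdot\vec{d}.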

  10. Cosine Similarity Illustrated

  11. Take-away Today • The importance of ranking: user studies at Google • Length normalization: pivot normalization • Implementation of ranking • The complete search system

  12. Outline (Ch. 7 of IR Book) • Recap • Why Rank? • More on Cosine • Implementation of Ranking • The Complete Search System

  13. Why is Ranking so Important? • Last lecture: Problems with unranked retrieval • Users want to look at a few results – not thousands • It’s very hard to write queries that produce a few results • → Ranking is important because it effectively reduces a large set of results to a very small one • Next: More data on “users only look at a few results” • Actually, in the vast majority of cases they only examine 1, 2, or 3 results

  14. Empirical Investigation of the Effect of Ranking • How can we measure how important ranking is? • Observe what searchers do when they are searching in a controlled setting • Videotape them • Ask them to “think aloud” • Interview them • Eye-track them • Time them • Record and count their clicks • The following slides are from Dan Russell’s JCDL talk, 2007 • Dan Russell is the “Über Tech Lead for Search Quality & User Happiness” at Google

  15. How do Users Behave in Search • Experiment conducted at Cornell [Gay, Granka, et al., 2004] • Users • Searched freely with any queries • Script removed all ad content • 5 informational and 5 navigational tasks given to participants • Subjects • 36 undergraduate students • Familiar with Google

  16. How do Users Behave in Search • How many results are viewed before clicking? • Do users select the first relevant-looking result they see? • How much time is spent viewing the results page?

  17. Why is there a dip?

  18. Strong Implicit Behavior • Users strongly believe that the search engine rank order matters

  19. SERP: Search Engine Result Page http://en.wikipedia.org/wiki/Search_engine_results_page

  20. Importance of Ranking: Summary • Viewing abstracts: Users are a lot more likely to read the abstracts of the top-ranked pages (1, 2, 3, 4) than the abstracts of the lower-ranked pages (7, 8, 9, 10) • Clicking: Distribution is even more skewed for clicking • In 1 out of 2 cases, users click on the top-ranked page • Even if the top-ranked page is not relevant, 30% of users will click on it • Getting the ranking right is very important • Getting the top-ranked page right is most important

  21. Outline (Ch. 7 of IR Book) • Recap • Why Rank? • More on Cosine (Ch. 6.4.4) • Implementation of Ranking • The Complete Search System

  22. Cosine Normalization Problem • Likelihood of relevance/retrieval experiment • Singhal, Buckley and Mitra, SIGIR 1996 • Setting • 50 TREC queries, 741,856 TREC documents • Sort documents in order of increasing length, divide the list into bins of one thousand documents each, yielding 742 bins; select the median document length to represent each bin • Take 9,805 <query, relevant-document> pairs for the 50 queries and compute P(D belongs to Bi | D is relevant): the ratio of the number of pairs whose documents come from the ith bin to the total number of pairs • Retrieve the top one thousand documents for each query (50,000 <query, retrieved-document> pairs) and compute P(D belongs to Bi | D is retrieved)
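
  A rough Python sketch of the binning procedure described above; the input names (doc_lengths, pairs) are illustrative, not taken from the original experiment code.

```python
from collections import Counter

def bin_probabilities(doc_lengths, pairs, bin_size=1000):
    """P(D belongs to bin i | D is in `pairs`), with documents binned by length.

    doc_lengths: dict docID -> document length.
    pairs: list of (query, docID) pairs (relevant or retrieved).
    """
    # Sort documents by increasing length and assign them to bins of `bin_size`.
    docs_by_length = sorted(doc_lengths, key=doc_lengths.get)
    bin_of = {d: i // bin_size for i, d in enumerate(docs_by_length)}
    # Fraction of pairs whose document falls into each bin.
    counts = Counter(bin_of[d] for _, d in pairs)
    total = sum(counts.values())
    return {b: c / total for b, c in counts.items()}
```

  Running this once on the <query, relevant-document> pairs and once on the <query, retrieved-document> pairs gives the two curves compared on the next slide.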

  23. Predicted and True Probability of Relevance • Source: Lillian Lee

  24. Recap: Distance Is a Bad Idea • The Euclidean distance of q and d2 is large, but the distributions of terms in query q and document d2 are similar. That’s why we do length normalization or, equivalently, use cosine to compute query-document matching scores.

  25. Exercise: A Problem for Cosine Normalization • Query q: “anti-doping rules Beijing 2008 olympics” • Compare three documents • d1: a short document on anti-doping rules at the 2008 Olympics • d2: a long document that consists of a copy of d1 and 5 other news stories, all on topics different from Olympics/anti-doping • d3: a short document on anti-doping rules at the 2004 Athens Olympics • What ranking do we expect in the vector space model? • What can we do about this?

  26. Pivot Normalization • Cosine normalization produces weights that are too large for short documents and too small for long documents (on average) • Adjust cosine normalization by a linear adjustment: “turning” the average normalization on the pivot • Effect: Similarities of short documents with the query decrease; similarities of long documents with the query increase • This removes the unfair advantage that short documents have
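
  In formula form (Singhal, Buckley and Mitra, SIGIR 1996; IR book Ch. 6.4.4): instead of dividing weights by the cosine normalization |V(d)|, divide by the pivoted normalization

    \text{pivoted norm}(d) = a\,|V(d)| + (1 - a)\,\mathit{piv}, \qquad 0 < a < 1

  where piv is the normalization value at the pivot, the document length at which the relevance and retrieval curves cross. Documents shorter than the pivot get a larger denominator than before (scores drop), documents longer than the pivot a smaller one (scores rise).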

  27. Pivot Normalization • Source: Lillian Lee

  28. Outline (Ch. 7 of IR Book) • Recap • Why Rank? • More on Cosine • Implementation of Ranking • The Complete Search System

  29. Term Frequencies in the Index • Term frequencies • We also need positions. Not shown here

  30. Term Frequencies in the Inverted Index • In each posting, store tf_t,d in addition to docID d • As an integer frequency, not as a (log-)weighted real number (Why?) • Real numbers are difficult to compress • Overall, additional space requirements are small • Less than a byte per posting with bitwise compression • A byte per posting with variable byte code
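
  A minimal sketch of the variable byte code mentioned in the last bullet, following the scheme in Ch. 5.3 of the IR book: each byte carries 7 payload bits and the high bit marks the final byte, so small integers such as term frequencies fit in a single byte.

```python
def vb_encode(n):
    """Variable byte encoding of a non-negative integer (e.g. a tf or a docID gap)."""
    out = []
    while True:
        out.insert(0, n % 128)   # low 7 bits first; prepending keeps the most significant byte first
        if n < 128:
            break
        n //= 128
    out[-1] += 128               # set the high bit on the last byte as a terminator
    return bytes(out)

# vb_encode(5)   -> b'\x85'      (one byte)
# vb_encode(300) -> b'\x02\xac'  (two bytes)
```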

  31. How do we Compute the Top K in Ranking? • In many applications, we don’t need a complete ranking • We just need the top k for a small k (e.g., k = 100) • If we don’t need a complete ranking, is there an efficient way of computing just the top k? • Naive: • Compute scores for all N documents • Sort • Return the top k • What’s bad about this? • Alternative?

  32. Use Min Heap for Selecting Top k Out of N • Use a binary min heap • A binary min heap is a binary tree in which each node’s value is less than the values of its children • Takes O(N log k) operations to construct (where N is the number of documents) • Read off k winners in O(k log k) steps

  33. Binary Min Heap

  34. Selecting Top k Scoring Documents in O(N log k) • Goal: Keep the top k documents seen so far • Use a binary min heap • To process a new document d′ with score s′: • Get current minimum h_m of heap (O(1)) • If s′ < h_m, skip to next document • If s′ > h_m, heap-delete-root (O(log k)) • Heap-add d′/s′ (O(log k))
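
  A minimal Python sketch of this procedure using the standard library heapq module; scored_docs stands in for whatever produces (docID, score) pairs.

```python
import heapq

def top_k(scored_docs, k):
    """Return the k highest-scoring (score, docID) pairs, highest first."""
    heap = []                                             # min-heap; root is the smallest retained score
    for doc_id, score in scored_docs:
        if len(heap) < k:
            heapq.heappush(heap, (score, doc_id))         # grow the heap, O(log k)
        elif score > heap[0][0]:                          # compare with current minimum, O(1)
            heapq.heapreplace(heap, (score, doc_id))      # delete root and add, O(log k)
        # else: score is no better than the current minimum, skip to next document
    return sorted(heap, reverse=True)                     # read off the k winners, O(k log k)
```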

  35. More Efficient Computation of Top K: Heuristics • Idea 1: Reorder postings lists • Instead of ordering according to docID • Order according to some measure of “expected relevance” • Idea 2: Heuristics to prune the search space • Not guaranteed to be correct, but fails rarely • In practice, close to constant time • For this, we’ll need the concepts of document-at-a-time processing and term-at-a-time processing

  36. Non-DocID Ordering of Postings Lists • So far: postings lists have been ordered according to docID • Alternative: a query-independent measure of “goodness” of a page • Example: PageRank g(d) of page d, a measure of how many “good” pages hyperlink to d (Chapter 21) • Order documents in postings lists according to PageRank: g(d1) > g(d2) > g(d3) > . . . • Define composite score of a document • net-score(q, d) = g(d) + cos(q, d) • This scheme supports early termination: We do not have to process postings lists in their entirety to find top k

  37. Non-DocID Ordering of Postings Lists (2) • Order documents in postings lists according to PageRank g(di): g(d1) > g(d2) > g(d3) > . . . • Define composite score of a document: net-score(q, d) = g(d) + cos(q, d) • Suppose: (i) g → [0, 1]; (ii) g(d) < 0.1 for the document d we’re currently processing; (iii) the smallest top k score we’ve found so far is 1.2 • Then all subsequent scores will be < 1.1 • So we’ve already found the top k and can stop processing the remainder of the postings lists
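
  A sketch of this early-termination idea, assuming a single stream of candidate documents already ordered by decreasing g(d) and a cosine(q, d) function bounded by 1; the names are illustrative, not from the lecture code.

```python
import heapq

def net_score_top_k(docs_by_goodness, cosine, q, k):
    """Top k by net-score(q, d) = g(d) + cos(q, d), stopping early.

    docs_by_goodness: iterable of (doc_id, g) with g in [0, 1], sorted by g descending.
    """
    heap = []                                        # min-heap of (net_score, doc_id)
    for doc_id, g in docs_by_goodness:
        # Every later document scores at most g + 1, since cosine <= 1.
        if len(heap) == k and g + 1.0 < heap[0][0]:
            break                                    # the current top k can no longer be displaced
        score = g + cosine(q, doc_id)
        if len(heap) < k:
            heapq.heappush(heap, (score, doc_id))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, doc_id))
    return sorted(heap, reverse=True)
```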

  38. In Class Practice 1 • Link

  39. Document-at-a-Time Processing • Both docID-ordering and PageRank-ordering impose a consistent ordering on documents in postings lists • Computing cosines in this scheme is document-at-a-time • We complete computation of the query-document similarity score of document d_i before starting to compute the query-document similarity score of d_{i+1}
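
  A sketch of document-at-a-time scoring over docID-ordered postings lists; it computes a plain dot product (length normalization omitted), and the data layout is an assumption for illustration, not the lecture’s own code.

```python
import heapq

def document_at_a_time(postings, query_weights, k):
    """Score documents one at a time by merging docID-ordered postings lists.

    postings: dict term -> list of (doc_id, weight), sorted by doc_id.
    query_weights: dict term -> query term weight.
    """
    cursors = {t: 0 for t in query_weights if t in postings}    # one cursor per postings list
    heap = []                                                    # min-heap of (score, doc_id)
    while cursors:
        # The next document to finish is the smallest docID under any cursor.
        current = min(postings[t][cursors[t]][0] for t in cursors)
        score = 0.0
        for t in list(cursors):
            i = cursors[t]
            if postings[t][i][0] == current:
                score += query_weights[t] * postings[t][i][1]    # this term's contribution
                cursors[t] += 1
                if cursors[t] == len(postings[t]):
                    del cursors[t]                               # postings list exhausted
        # The score of `current` is now complete; keep it if it is in the current top k.
        if len(heap) < k:
            heapq.heappush(heap, (score, current))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, current))
    return sorted(heap, reverse=True)
```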

  40. Term-at-a-Time Processing • Simplest case • Completely process the postings list of the first query term • Create an accumulator for each docID you encounter • Then completely process the postings list of the second query term, and so forth
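
  The same computation term-at-a-time, as a short sketch: one pass per query term, with a dictionary of accumulators created lazily, so only documents that appear in some postings list ever get an accumulator (the point made on slide 42). The data layout is assumed as in the previous sketch.

```python
import heapq

def term_at_a_time(postings, query_weights, k):
    """Term-at-a-time scoring with lazily created accumulators.

    postings: dict term -> list of (doc_id, weight).
    query_weights: dict term -> query term weight.
    """
    accumulators = {}                                            # doc_id -> partial score
    for term, w_q in query_weights.items():
        for doc_id, w_d in postings.get(term, []):               # process one postings list completely
            accumulators[doc_id] = accumulators.get(doc_id, 0.0) + w_q * w_d
    # Pick the k largest accumulators (heapq.nlargest maintains a size-k heap internally).
    return heapq.nlargest(k, ((s, d) for d, s in accumulators.items()))
```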

  41. Term-at-a-Time Processing (shown in last lecture) => w_t,q = 1 for fast cosine score

  42. Computing Cosine Scores • For the web (20 billion documents), an array of accumulators A in memory is infeasible • Thus: Only create accumulators for docs occurring in postings lists • Do not create accumulators for docs with zero scores (i.e., docs that do not contain any of the query terms)

  43. Accumulators: Example • For query: [Brutus Caesar]: • Only need accumulators for 1, 5, 7, 13, 17, 83, 87 • Don’t need accumulators for 8, 40, 97

  44. Removing Bottlenecks • Use heap / priority queue as discussed earlier • Can further limit to docs with non-zero cosines on rare (high idf) words • Or enforce conjunctive search (Google) • Non-zero cosines on all words in query

  45. Removing Bottlenecks • Just one accumulator for [Brutus Caesar] in the example • Because only d1 contains both words
