Information Retrieval and Search Engines Lecture 6: Evaluation, Relevance Feedback and Query Expansion Prof. Michael R. Lyu
Outline (Ch. 8, 9 of IR Book) • Recap • Motivation • Evaluation in Information Retrieval • Relevance Feedback • Query Expansion
Importance of Ranking: Summary • Viewing abstracts: Users are a lot more likely to read the abstracts of the top-ranked pages (1, 2, 3, 4) than the abstracts of the lower-ranked pages (7, 8, 9, 10) • Clicking: The distribution is even more skewed for clicking • In 1 out of 2 cases, users click on the top-ranked page • Even if the top-ranked page is not relevant, 30% of users will click on it • Getting the ranking right is very important • Getting the top-ranked page right is most important
Predicted and True Probability of Relevance • Source: Lillian Lee
Pivot Normalization • Cosine normalization produces weights that are too large for short documents and too small for long documents (on average) • Adjust cosine normalization by a linear adjustment: “turning” the average normalization on the pivot • Effect: Similarities of short documents with the query decrease; similarities of long documents with the query increase • This removes the unfair advantage that short documents have
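A minimal sketch of this idea in code, assuming NumPy arrays of per-document normalizers and a `slope` parameter chosen by the implementer (0.75 below is only an illustrative default, not a value from the slides):

```python
import numpy as np

def pivoted_normalizer(doc_norms, slope=0.75):
    """Pivoted length normalization (sketch).

    doc_norms: NumPy array of per-document normalizers, e.g. cosine norms |d|.
    slope:     how much of the original normalizer is kept; slope = 1
               recovers plain cosine normalization (0.75 is illustrative).
    """
    pivot = doc_norms.mean()  # the "pivot" = average normalizer over the collection
    # "Turn" the normalization line around the pivot:
    # short docs (norm < pivot) get a LARGER normalizer -> lower similarity,
    # long docs (norm > pivot) get a SMALLER normalizer -> higher similarity.
    return (1.0 - slope) * pivot + slope * doc_norms
```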
Sec. 7.1.6 Cluster Pruning: Query Processing • Process a query as follows: • Given query Q, find its nearest leader L • Seek K nearest docs from among L’s followers
Sec. 7.1.6 Visualization (figure): the query, the leaders, and their followers
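A rough sketch of cluster pruning (preprocessing plus the query processing above), assuming unit-length NumPy document vectors so that dot products give cosine similarities; the function names and defaults are illustrative:

```python
import numpy as np

def build_clusters(doc_vectors, seed=0):
    """Preprocessing: pick ~sqrt(N) random docs as leaders and attach every
    document to its nearest leader (its "followers")."""
    rng = np.random.default_rng(seed)
    n = doc_vectors.shape[0]
    leaders = rng.choice(n, size=max(1, int(np.sqrt(n))), replace=False)
    followers = {int(l): [] for l in leaders}
    for d in range(n):
        nearest = max(leaders, key=lambda l: doc_vectors[d] @ doc_vectors[l])
        followers[int(nearest)].append(d)
    return leaders, followers

def cluster_pruning_search(q, doc_vectors, leaders, followers, K=10):
    """Query processing as on the slide: find the query's nearest leader L,
    then rank only L's followers and return the top K."""
    L = int(max(leaders, key=lambda l: q @ doc_vectors[l]))
    candidates = followers[L]
    return sorted(candidates, key=lambda d: q @ doc_vectors[d], reverse=True)[:K]
```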
Tiered Indexes • Basic idea: • Create several tiers of indexes, corresponding to the importance of indexing terms • During query processing, start with the highest-tier index • If the highest-tier index returns at least k (e.g., k = 100) results: stop and return results to the user • If we’ve only found < k hits: repeat for the next index in the tier cascade
Tiered Indexes • Example • Two-tier system • Tier 1: Index of all titles • Tier 2: Index of the rest of documents • Pages containing the search words in the title are better hits than pages containing the search words in the body of the text
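A sketch of the tier-cascade query processing described above, assuming hypothetical index objects that expose a `search(query)` method returning ranked doc IDs (not a specific library's API):

```python
def tiered_search(query, tiers, k=100):
    """Tier-cascade query processing (sketch).

    tiers: indexes ordered from highest to lowest tier; each is assumed to
           expose a search(query) method returning a ranked list of doc IDs.
    """
    results, seen = [], set()
    for index in tiers:                      # start with the highest tier
        for doc_id in index.search(query):
            if doc_id not in seen:
                seen.add(doc_id)
                results.append(doc_id)
        if len(results) >= k:                # enough hits: stop and return
            break                            # otherwise fall through to the next tier
    return results[:k]
```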
Outline (Ch. 8, 9 of IR Book) • Recap • Motivation • Evaluation in Information Retrieval • Relevance Feedback • Query Expansion
Evaluation • How do we measure which search engine returns a better result? • How do we measure users’ happiness?
How Can We Improve Recall in Search? • Two ways of improving recall: relevance feedback and query expansion • As an example, consider query q: [aircraft] and a document d containing “plane”, but not containing “aircraft” • A simple IR system will not return d for q • Even if d is the most relevant document for q! • We want to change this: • Return relevant documents even if there is no term match with the (original) query
Options for Improving Recall • Local: Do a “local”, on-demand analysis for a user query • Main local method: relevance feedback • Global: Do a global analysis once (e.g., of the collection) to produce a thesaurus • Use the thesaurus for query expansion
Outline (Ch. 8, 9 of IR Book) • Recap • Motivation • Evaluation in Information Retrieval • Relevance Feedback • Query Expansion
Measures for a Search Engine • How fast does it index? • E.g., number of bytes per hour • How fast does it search? • E.g., latency as a function of queries per second • What is the cost per query? • In dollars
Measures for a Search Engine • All of the preceding criteria are measurable: we can quantify speed/size/money • However, the key measure for a search engine is user happiness • What is user happiness? • Factors include: • Speed of response • Size of index • Uncluttered UI • Note that none of these is sufficient: blindingly fast, but useless answers won’t make a user happy • Most important: relevance • How can we quantify user happiness?
Who Is the User? • Who is the user we are trying to make happy? • Web search engine • Searcher • Success: searcher finds what he/she was looking for • Measure: rate of return to this search engine • Advertiser • Success: searcher clicks on ad • Measure: clickthrough rate • Clickthrough rate (CTR) = # of clicks on an ad / # of times the ad is shown
Who Is the User? • Enterprise • CEO • Success: employees are more productive (because of effective search) • Measure: profit of the company
Who Is the User? • Ecommerce • Buyer • Success: buyer buys something • Measures: time to purchase, fraction of “conversions” of searchers to buyers • Seller • Success: seller sells something • Measure: profit per item sold
Most Common Definition of User Happiness: Relevance • User happiness is equated with the relevance of search results to the query • But how do you measure relevance? • Standard methodology in information retrieval consists of three elements • A benchmark document collection • A benchmark suite of queries • An assessment of the relevance of each query-document pair
Relevance: Query vs. Information Need • Relevance to what? • First take: relevance to the query • “Relevance to the query” is very problematic • Information need i: “I am looking for information on whether drinking red wine is more effective at reducing your risk of heart attacks than white wine.” • Query q: [red wine white wine heart attack] • Consider document d′: “At heart of his speech was an attack on the wine industry lobby for downplaying the role of red and white wine in drunk driving.”
Relevance: Query vs. Information Need • Relevance to what? • First take: relevance to the query • “Relevance to the query” is very problematic • Information need i: “I am looking for information on whether drinking red wine is more effective at reducing your risk of heart attacks than white wine.” • Query q: [red wine white wine heart attack] • Consider document d′: “At heart of his speech was an attack on the wine industry lobby for downplaying the role of red and white wine in drunk driving.” • d′ is an excellent match for query q • d′ is not relevant to the information need i
Relevance: Query vs. Information Need • User happiness can only be measured by relevance to an information need, not by relevance to queries • We talk about query-document relevance judgments even though we mean information-need-document relevance judgments
Precision and Recall • Precision (P) is the fraction of retrieved documents that are relevant
Precision and Recall • Recall (R) is the fraction of relevant documents that are retrieved
Precision at K • Precision at K documents (P@K) is the fraction of relevant results among the first K results
Precision and Recall • P = TP / (TP + FP) • R = TP / (TP + FN)
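The definitions above written out as small helper functions; the P@K helper assumes a ranked list of relevance booleans, and the example numbers are taken from Example 1 below (TP = 20, FP = 40, FN = 60):

```python
def precision(tp, fp):
    """P = TP / (TP + FP): fraction of retrieved docs that are relevant."""
    return tp / (tp + fp)

def recall(tp, fn):
    """R = TP / (TP + FN): fraction of relevant docs that are retrieved."""
    return tp / (tp + fn)

def precision_at_k(ranked_relevance, k):
    """P@K: fraction of the top-K ranked results that are relevant.
    ranked_relevance: list of booleans, highest-ranked result first."""
    return sum(ranked_relevance[:k]) / k

print(precision(20, 40))  # 0.333...
print(recall(20, 60))     # 0.25
```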
Precision/Recall Tradeoff • You can increase recall by returning more docs • Recall is a non-decreasing function of the number of docs retrieved • A system that returns all docs has 100% recall! • The converse is also true (usually): it’s easy to get high precision for very low recall • Suppose the document with the largest score is relevant. How can we maximize precision?
A Combined Measure: F • F allows us to trade off precision against recall: F = 1 / (α(1/P) + (1 − α)(1/R)) = (β² + 1)PR / (β²P + R), where β² = (1 − α)/α • α ∈ [0, 1] and thus β² ∈ [0, ∞] • Most frequently used: balanced F (F1) with β = 1 or α = 0.5 • This is the harmonic mean of P and R: F1 = 2PR / (P + R) • Values of β < 1 emphasize precision • Values of β > 1 emphasize recall
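The F formula as a small function, using the β form above (β = 1 gives the balanced F1); the example values come from Example 1 below:

```python
def f_measure(p, r, beta=1.0):
    """F_beta = (beta^2 + 1) * P * R / (beta^2 * P + R).
    beta = 1 gives the balanced F1 (harmonic mean of P and R);
    beta < 1 emphasizes precision, beta > 1 emphasizes recall."""
    if p == 0 and r == 0:
        return 0.0
    return (beta ** 2 + 1) * p * r / (beta ** 2 * p + r)

# Balanced F1 for Example 1 (P = 1/3, R = 1/4):
print(f_measure(1 / 3, 1 / 4))  # 0.2857... = 2/7
```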
Example 1 • A system retrieves 60 documents: 20 are relevant and 40 are not; another 60 relevant documents are not retrieved (TP = 20, FP = 40, FN = 60) • What are Precision, Recall and F1?
Answer Example 1 • P = 20/(20 + 40) = 1/3 • R = 20/(20 + 60) = 1/4 • F1 = 2PR/(P + R) = 2 · (1/3)(1/4) / (1/3 + 1/4) = 2/7 ≈ 0.29
Example 2 • An IR system returns 8 relevant documents, and 10 non-relevant documents. There are a total of 20 relevant documents in the collection. What is the precision of the system on this search, and what is its recall?
Answer Example 2 • An IR system returns 8 relevant documents, and 10 non-relevant documents. There are a total of 20 relevant documents in the collection. What is the precision of the system on this search, and what is its recall? • Answer: P = 8/(8 + 10) = 8/18 ≈ 0.44; R = 8/20 = 0.40
Accuracy • Why do we use complex measures like precision, recall, and F? • Why not something simple like accuracy? • Accuracy is the fraction of decisions (relevant/non-relevant) that are correct • In terms of the contingency table: • accuracy = (TP + TN)/(TP + FP + FN + TN) • Why is accuracy not a useful measure for Web information retrieval?
Exercise • Compute precision, recall and F1 for this result set: • The snoogle search engine below always returns 0 results (“0 matching results found”), regardless of the query. Why does snoogle demonstrate that accuracy is not a useful measure in IR?
Why Accuracy Is a Useless Measure in IR • Simple trick to maximize accuracy in IR: always say no and return nothing • You then get 99.99% accuracy on most queries • Searchers on the Web (and in IR in general) want to find something and have a certain tolerance for junk • It’s better to return some bad hits as long as you return something • → We use precision, recall, and F for evaluation, not accuracy
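A quick numerical check of the “return nothing” trick, using an illustrative collection of 1,000,000 documents of which 100 are relevant (numbers chosen only to reproduce the 99.99% figure, not taken from the slides):

```python
def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)

# "Always say no": TP = FP = 0, so FN = 100 and TN = 999,900.
print(accuracy(tp=0, fp=0, fn=100, tn=999_900))  # 0.9999 -> 99.99% accuracy
# ...while recall is 0 and precision is undefined: accuracy hides total failure.
```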
F: Why Harmonic Mean? • Why don’t we use a different mean of P and R as a measure? • E.g., the arithmetic mean • The simple (arithmetic) mean is 50% for the “return-everything” search engine, which is too high • We want to punish really bad performance on either precision or recall • Taking the minimum achieves this • But the minimum is not smooth and is hard to weight • F (harmonic mean) is a kind of smooth minimum
(Figure: the maximum, arithmetic mean, geometric mean, harmonic mean, and minimum of precision and recall)
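A small numerical check of why the harmonic mean behaves like a smooth minimum, using an illustrative “return everything” engine with tiny precision and perfect recall:

```python
def arithmetic_mean(p, r):
    return (p + r) / 2

def harmonic_mean(p, r):          # this is exactly F1
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0

p, r = 0.0001, 1.0                # "return everything": tiny precision, perfect recall
print(arithmetic_mean(p, r))      # ~0.5     -> looks misleadingly good
print(harmonic_mean(p, r))        # ~0.0002  -> stays close to min(p, r)
print(min(p, r))                  # 0.0001
```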
Outline (Ch. 8, 9 of IR Book) • Recap • Motivation • Evaluation in Information Retrieval • Relevance Feedback • Query Expansion
Relevance Feedback: Basic Idea • Basic idea • It may be difficult to formulate a good query when you don’t know the collection well, or cannot express your need, but you can judge the relevance of a result. So iterate … • The user issues a (short, simple) query • The search engine returns a set of documents • The user marks some docs as relevant, some as non-relevant • The search engine computes a new representation of the information need. Hope: better than the initial query • The search engine runs the new query and returns new results • The new results have (hopefully) better recall
Relevance Feedback • We can iterate this: several rounds of relevance feedback • We will use the term ad hoc retrieval to refer to regular retrieval without relevance feedback • We will now look at three different examples of relevance feedback that highlight different aspects of the process
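One standard way to “compute a new representation of the information need” from the marked documents is a Rocchio-style centroid update (the textbook method for this step, not spelled out on these slides). The sketch below assumes NumPy term-weight vectors; the weights α, β, γ are illustrative defaults:

```python
import numpy as np

def rocchio_update(q0, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Compute a new query vector from user feedback (Rocchio-style sketch).

    q0:          original query as a term-weight vector (NumPy array)
    relevant:    2-D array of vectors of docs the user marked relevant
    nonrelevant: 2-D array of vectors of docs the user marked non-relevant
    alpha/beta/gamma: illustrative weights, not values from the slides
    """
    q = alpha * q0
    if len(relevant):
        q = q + beta * relevant.mean(axis=0)       # move toward relevant docs
    if len(nonrelevant):
        q = q - gamma * nonrelevant.mean(axis=0)   # move away from non-relevant docs
    return np.maximum(q, 0.0)  # negative term weights are usually clipped to zero
```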
User Feedback: Select What Is Relevant
Vector Space Example: Query “canine” (1) • Source: Fernando Díaz
Similarity of Docs to Query “canine” • Source: Fernando Díaz
User Feedback: Select Relevant Documents • Source: Fernando Díaz
Results after Relevance Feedback • Source: Fernando Díaz