Chapter 8: Evaluating Search Engines
Evaluation
• Evaluation is key to building effective and efficient search engines
• Measurement usually carried out in controlled laboratory experiments
• Online testing can also be done
• Effectiveness, efficiency, and cost are related
  • e.g., if we want a particular level of effectiveness and efficiency, this will determine the cost of the system configuration
  • Efficiency and cost targets may impact effectiveness, and vice versa
Evaluation Corpus
• Test collections consisting of documents, queries, and relevance judgments, e.g., the collections summarized below
[Table: example test collections, ranging from bibliographic records to full-text documents, with queries based on TREC topics]
TREC Topic Example
[Figure: a TREC topic; the title field serves as a short query, the description field as a long query, and the narrative field gives the criteria for relevance]
Relevance Judgments
• TREC judgments
  • Depend on the task being evaluated, e.g., topical relevance
  • Generally binary, i.e., relevant vs. non-relevant
  • Agreement is good because of the “narrative”
  • Emphasis is on high recall
• Obtaining relevance judgments is an expensive, time-consuming process that requires manual effort
  • Who does it?
  • What are the instructions?
  • What is the level of agreement?
Effectiveness Measures
• A is the set of relevant documents; B is the set of retrieved documents

                    Relevant      Non-Relevant
  Retrieved         A ∩ B         A̅ ∩ B
  Not Retrieved     A ∩ B̅         A̅ ∩ B̅

• Precision = |A ∩ B| / |B| (positive predictive value)
• Recall = |A ∩ B| / |A| (true positive rate)
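These set-based definitions translate directly into code. A minimal Python sketch (the document identifiers are invented for illustration):

```python
def precision_recall(relevant, retrieved):
    """Set-based precision and recall: A = relevant, B = retrieved."""
    hits = len(relevant & retrieved)   # |A ∩ B|
    precision = hits / len(retrieved)  # |A ∩ B| / |B|
    recall = hits / len(relevant)      # |A ∩ B| / |A|
    return precision, recall

# Hypothetical example: 4 of 6 relevant documents appear in a retrieved set of 8
A = {"d1", "d2", "d3", "d4", "d5", "d6"}
B = {"d1", "d3", "d5", "d6", "d7", "d8", "d9", "d10"}
print(precision_recall(A, B))  # (0.5, 0.666...)
```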
Classification Errors
• Precision is used when the probability that a positive result is correct is important
• False Positive (Type I Error)
  • Non-relevant documents retrieved: |A̅ ∩ B|
  • False positive (false alarm) rate: |A̅ ∩ B| / |A̅|
• False Negative (Type II Error)
  • Relevant documents not retrieved: |A ∩ B̅|
  • FN ratio: |A ∩ B̅| / |A| = 1 − Recall
F Measure
• Harmonic mean of recall and precision, a single measure:
  F = 2 · Recall · Precision / (Recall + Precision)
• Harmonic mean emphasizes the importance of small values, whereas the arithmetic mean is affected more by outliers that are unusually large
• More general form:
  F_β = (β² + 1) · Recall · Precision / (Recall + β² · Precision)
• β is a parameter that determines the relative importance of recall and precision
• What if β = 1? The general form reduces to the harmonic mean above
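A short Python sketch of the general form (the recall and precision values are arbitrary, chosen only for illustration):

```python
def f_measure(recall, precision, beta=1.0):
    """General F measure; beta > 1 favors recall, beta < 1 favors precision."""
    if recall == 0 and precision == 0:
        return 0.0
    b2 = beta ** 2
    return (b2 + 1) * recall * precision / (recall + b2 * precision)

# With beta = 1 this is the harmonic mean of recall and precision
print(f_measure(0.5, 0.75))            # 0.6
print(f_measure(0.5, 0.75, beta=2.0))  # ~0.536, pulled toward the smaller recall
```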
Ranking Effectiveness
• Two rankings of ten documents, six of which are relevant (the sketch below reproduces these values):
• Ranking #1
  • Recall at each rank: 1/6, 1/6, 2/6, 3/6, 4/6, 5/6, 5/6, 5/6, 5/6, 6/6
  • Precision at each rank: 1/1, 1/2, 2/3, 3/4, 4/5, 5/6, 5/7, 5/8, 5/9, 6/10
• Ranking #2
  • Recall at each rank: 0/6, 1/6, 1/6, 1/6, 2/6, 3/6, 4/6, 4/6, 5/6, 6/6
  • Precision at each rank: 0/1, 1/2, 1/3, 1/4, 2/5, 3/6, 4/7, 4/8, 5/9, 6/10
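A minimal Python sketch that reproduces the rank-by-rank values; the 0/1 vectors encode which ranks hold relevant documents in the two example rankings:

```python
def recall_precision_at_each_rank(rels, num_relevant):
    """rels[i] = 1 if the document at rank i+1 is relevant, else 0."""
    out, hits = [], 0
    for rank, rel in enumerate(rels, start=1):
        hits += rel
        out.append((hits / num_relevant, hits / rank))  # (recall, precision)
    return out

ranking1 = [1, 0, 1, 1, 1, 1, 0, 0, 0, 1]
ranking2 = [0, 1, 0, 0, 1, 1, 1, 0, 1, 1]
for r, p in recall_precision_at_each_rank(ranking1, num_relevant=6):
    print(f"recall={r:.2f}  precision={p:.2f}")
```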
Summarizing a Ranking
• Calculating recall and precision at fixed rank positions
• Calculating precision at standard recall levels, from 0.0 to 1.0
  • Requires interpolation
• Averaging the precision values from the rank positions where a relevant document was retrieved
Average Precision
• For each query, average the precision values at the rank positions where a relevant document was retrieved
• For the two example rankings above, this gives average precision values of 0.78 and 0.52 (see the sketch below)
• Mean Average Precision: (0.78 + 0.52) / 2 = 0.65
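A minimal Python sketch reproducing these numbers from the binary relevance vectors of the two example rankings:

```python
def average_precision(rels):
    """Average of precision values at ranks where a relevant doc appears.
    Assumes every relevant document occurs in the ranking, as in these
    examples (6 of 6 retrieved); otherwise divide by the total number of
    relevant documents instead."""
    precisions, hits = [], 0
    for rank, rel in enumerate(rels, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions)

ap1 = average_precision([1, 0, 1, 1, 1, 1, 0, 0, 0, 1])  # ~0.78
ap2 = average_precision([0, 1, 0, 0, 1, 1, 1, 0, 1, 1])  # ~0.52
print((ap1 + ap2) / 2)  # Mean Average Precision ~0.65
```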
Averaging
• Mean Average Precision (MAP)
  • Summarize rankings from multiple queries by averaging average precision
  • Most commonly used measure in research papers
  • Assumes user is interested in finding many relevant documents for each query
  • Requires many relevance judgments in the text collection
• Recall-precision graphs are also useful summaries
Recall-Precision Graph
• Too much information could be lost using MAP
• Recall-precision graphs provide more detail on the effectiveness of the ranking at different recall levels
• Example. [Figure: recall-precision graphs for the two previous queries, Ranking of Query 1 and Ranking of Query 2]
Precision and Recall
• Precision versus Recall Curve
  • A visual display of the evaluation measure of a retrieval strategy such that documents are ranked accordingly
• Example.
  • Let q be a query in the text reference collection
  • Let Rq = { d3, d5, d9, d25, d39, d44, d56, d71, d89, d123 } be the set of 10 relevant documents for q
  • Assume the following ranking of retrieved documents in the answer set of q:
    1. d123  2. d84  3. d56  4. d6  5. d8  6. d9  7. d511  8. d129  9. d187  10. d25  11. d38  12. d48  13. d250  14. d113  15. d3
Precision Versus Recall Curve
• Example (continued). |Rq| = 10, the total number of relevant documents for q
• Given the ranking above, precision at each observed recall level:
  • 100% at 10% recall (d123 at rank 1)
  • 66% at 20% recall (d56 at rank 3)
  • 50% at 30% recall (d9 at rank 6)
  • 40% at 40% recall (d25 at rank 10)
  • 33% at 50% recall (d3 at rank 15)
• [Figure: precision versus recall curve for q; the 100% precision shown at 0% recall is an interpolated value]
Precision and Recall
• To evaluate the performance of a retrieval strategy over all test queries, compute the average of precision at each recall level:
  P̄(r) = (1 / Nq) · Σ_{i=1}^{Nq} P_i(r)
  where P̄(r) = average precision at the rth recall level, P_i(r) = precision at the rth recall level for the ith query, and Nq = number of queries used
Precision Versus Recall Curve
• Example. Rq = {d3, d56, d129} and |Rq| = 3, the number of relevant docs for q
• Given the same ranking of retrieved documents in the answer set of q:
  1. d123  2. d84  3. d56  4. d6  5. d8  6. d9  7. d511  8. d129  9. d187  10. d25  11. d38  12. d48  13. d250  14. d113  15. d3
• Precision at each observed recall level: 33% at 1/3 recall (d56 at rank 3), 25% at 2/3 recall (d129 at rank 8), 20% at 100% recall (d3 at rank 15)
• [Figure: interpolated precision at the 11 standard recall levels]
Interpolation of Precision at Various Recall Levels
• Let r_i (0 ≤ i < 10) be a reference to the ith standard recall level
• P(r_i) = max { P(r) : r_i ≤ r ≤ r_{i+1} }
• Interpolation of previous levels by using the next known level (a sketch follows)
• [Figure: interpolated precision at the 11 standard recall levels]
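A Python sketch of 11-point interpolated precision for the Rq = {d3, d56, d129} example. It uses the common variant that takes the maximum precision at any recall at or beyond each standard level; applied level by level this yields the same monotone non-increasing curve as the rule above:

```python
def interpolated_precision_at_levels(rels, num_relevant, num_levels=11):
    """Interpolated precision at the standard recall levels 0.0, 0.1, ..., 1.0."""
    # (recall, precision) at each rank where a relevant document appears
    points, hits = [], 0
    for rank, rel in enumerate(rels, start=1):
        if rel:
            hits += 1
            points.append((hits / num_relevant, hits / rank))
    levels = [i / (num_levels - 1) for i in range(num_levels)]
    # Precision at level r = max precision observed at any recall >= r
    return [max((p for rc, p in points if rc >= r), default=0.0) for r in levels]

# Relevant docs d56, d129, d3 appear at ranks 3, 8, and 15
rels = [0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1]
print(interpolated_precision_at_levels(rels, num_relevant=3))
# 33% at levels 0.0-0.3, 25% at 0.4-0.6, 20% at 0.7-1.0
```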
Precision Versus Recall Curve
• [Figure: average precision at each recall level for two distinct retrieval strategies]
• Compare the retrieval performance of distinct retrieval strategies using precision versus recall figures
• Average precision versus recall figures have become a standard evaluation strategy for IR systems
  • Simple and intuitive
  • Qualify the overall answer set
Focusing on Top Documents
• Users tend to look at only the top part of the ranked result list to find relevant documents
• Some search tasks have only one relevant doc in mind; the top-ranked doc is expected to be relevant (P@1)
  • e.g., in question-answering systems, navigational search, etc.
• Precision at rank R (R = 5, 10, or 20): proportion of the top R ranked docs that are relevant (see the sketch below)
  • Easy to compute, average, understand
  • Not sensitive to rank positions less than R, higher or lower
• Recall is not an appropriate measure in this case
  • Instead, need to measure how well the search engine does at retrieving relevant documents at very high ranks
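A one-function Python sketch of precision at rank R, reusing example ranking #1 from earlier:

```python
def precision_at(R, rels):
    """Fraction of the top R ranked documents that are relevant."""
    return sum(rels[:R]) / R

rels = [1, 0, 1, 1, 1, 1, 0, 0, 0, 1]  # example ranking #1
print(precision_at(5, rels))   # 0.8
print(precision_at(10, rels))  # 0.6
```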
Focusing on Top Documents
• Reciprocal Rank
  • Reciprocal of the rank at which the first relevant doc is retrieved
  • Very sensitive to rank position, e.g., dn, dr, dn, dn, dn gives RR = 1/2, whereas dn, dn, dn, dn, dr gives RR = 1/5
• Mean Reciprocal Rank (MRR) is the average of the reciprocal ranks over a set of queries, i.e.,
  MRR = (1 / |Q|) · Σ_{i=1}^{|Q|} 1 / rank_i
  where |Q|, the number of queries, is the normalization factor and rank_i is the ranking position of the first relevant document for the ith query
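A short Python sketch of RR and MRR; the two query rankings are the hypothetical dn/dr examples above encoded as 0/1 vectors:

```python
def reciprocal_rank(rels):
    """1 / rank of the first relevant document, or 0 if none is retrieved."""
    for rank, rel in enumerate(rels, start=1):
        if rel:
            return 1.0 / rank
    return 0.0

queries = [
    [0, 1, 0, 0, 0],  # first relevant at rank 2 -> RR = 1/2
    [0, 0, 0, 0, 1],  # first relevant at rank 5 -> RR = 1/5
]
mrr = sum(reciprocal_rank(q) for q in queries) / len(queries)
print(mrr)  # (0.5 + 0.2) / 2 = 0.35
```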
Discounted Cumulative Gain
• Popular measure for evaluating web search and related tasks
• Two assumptions:
  • Highly relevant documents are more useful than marginally relevant documents
  • The lower the ranked position of a relevant document, the less useful it is for the user, since it is less likely to be examined
Discounted Cumulative Gain
• Uses graded relevance (i.e., a value) as a measure of the usefulness, or gain, from examining a document
• Gain is accumulated starting at the top of the ranking and may be reduced, or discounted, at lower ranks
• Typical discount is 1 / log2(rank)
  • With base 2, the discount at rank 4 is 1/2, and at rank 8 it is 1/3
Discounted Cumulative Gain
• DCG is the total gain accumulated at a particular rank p:
  DCG_p = rel_1 + Σ_{i=2}^{p} rel_i / log2(i)
  where rel_i is the graded relevance of the document at rank i, e.g., ranging from “Bad” to “Perfect” with 0 ≤ rel_i ≤ 5, or binary (0 or 1); log2(i) is the discount/reduction factor applied to the gain
• Alternative formulation:
  DCG_p = Σ_{i=1}^{p} (2^rel_i − 1) / log2(1 + i)
  • Used by some web search companies
  • Emphasis on retrieving highly relevant documents
DCG Example
• Assume 10 ranked documents judged on a 0-3 (“not relevant” to “highly relevant”) relevance scale:
  3, 2, 3, 0, 0, 1, 2, 2, 3, 0
  which represent the gain at each rank
• The discounted gain is
  3, 2/1, 3/1.59, 0, 0, 1/2.59, 2/2.81, 2/3, 3/3.17, 0 = 3, 2, 1.89, 0, 0, 0.39, 0.71, 0.67, 0.95, 0
• DCG at each rank is formed by accumulating the numbers:
  3, 5, 6.89, 6.89, 6.89, 7.28, 7.99, 8.66, 9.61, 9.61
  (DCG at rank 5 is 6.89; at rank 10, 9.61)
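The example numbers can be checked with a few lines of Python:

```python
import math

def dcg_at_each_rank(gains):
    """DCG accumulated at each rank; no discount is applied at rank 1."""
    total, out = 0.0, []
    for i, g in enumerate(gains, start=1):
        total += g if i == 1 else g / math.log2(i)
        out.append(round(total, 2))
    return out

print(dcg_at_each_rank([3, 2, 3, 0, 0, 1, 2, 2, 3, 0]))
# [3.0, 5.0, 6.89, 6.89, 6.89, 7.28, 7.99, 8.66, 9.61, 9.61]
```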
Normalized DCG
• DCG numbers are averaged across a set of queries at specific rank values, similar to precision@p
  • e.g., DCG at rank 5 is 6.89 and at rank 10 is 9.61
• DCG values are often normalized by comparing the DCG at each rank with the DCG value for the perfect ranking:
  NDCG_p = DCG_p / IDCG_p
• Makes averaging easier for queries with different numbers of relevant documents
NDCG Example
• Perfect ranking for which each relevance level is on the 0-3 scale:
  3, 3, 3, 2, 2, 2, 1, 0, 0, 0
• Ideal DCG values:
  3, 6, 7.89, 8.89, 9.75, 10.52, 10.88, 10.88, 10.88, 10.88
  (IDCG at rank 5 is 9.75; at rank 10, 10.88)
• Given the DCG values 3, 5, 6.89, 6.89, 6.89, 7.28, 7.99, 8.66, 9.61, 9.61, the NDCG values are:
  1, 0.83, 0.87, 0.77, 0.71, 0.69, 0.73, 0.8, 0.88, 0.88
  (NDCG at rank 5 is 0.71; at rank 10, 0.88)
• NDCG ≤ 1 at any rank position
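Extending the earlier sketch, NDCG divides each DCG value by the ideal DCG at the same rank, where the ideal ranking is simply the gains sorted in decreasing order:

```python
import math

def dcg_at_each_rank(gains):
    """DCG accumulated at each rank; no discount at rank 1."""
    total, out = 0.0, []
    for i, g in enumerate(gains, start=1):
        total += g if i == 1 else g / math.log2(i)
        out.append(total)
    return out

def ndcg_at_each_rank(gains):
    """NDCG_p = DCG_p / IDCG_p; ideal ranking = gains sorted descending."""
    dcg = dcg_at_each_rank(gains)
    idcg = dcg_at_each_rank(sorted(gains, reverse=True))
    return [round(d / i, 2) for d, i in zip(dcg, idcg)]

print(ndcg_at_each_rank([3, 2, 3, 0, 0, 1, 2, 2, 3, 0]))
# [1.0, 0.83, 0.87, 0.78, 0.71, 0.69, 0.73, 0.8, 0.88, 0.88]
# (the slide shows 0.77 at rank 4 because it rounds DCG to 6.89 first)
```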
Significance Tests
• Given the results from a number of queries, how can we conclude that (a new) ranking algorithm B is better than (the baseline) algorithm A?
• A significance test allows us to quantify the evidence for rejecting the null hypothesis (no difference) in favor of the alternative hypothesis (there is a difference)
• The power of a test is the probability that the test will reject the null hypothesis correctly
• Increasing the number of queries in the experiment also increases the power (confidence in any judgment) of the test
• Type 1 Error: the null hypothesis is rejected when it is true
• Type 2 Error: the null hypothesis is accepted when it is false
Significance Tests
• Standard procedure for comparing two retrieval systems:
  1. Compute the effectiveness measure for every query for both rankings
  2. Compute a test statistic based on the per-query differences between the two
  3. Use the test statistic to compute a P-value: the probability, under the null hypothesis, of a result at least as extreme as the one observed
  4. Reject the null hypothesis if the P-value is at or below the chosen significance level, e.g., 0.05 or 0.01
Significance Test
• [Figure: probability distribution for test statistic values assuming the null hypothesis; the shaded area beyond the test statistic value x is the region of rejection for a one-sided test at P = 0.05]
t-Test
• Assumption is that the difference between the effectiveness data values is a sample from a normal distribution
• Null hypothesis is that the mean of the distribution of differences is zero
• Test statistic for the paired t-test is
  t = (D̄ / σ_D) · √N
  where D̄ is the mean of the per-query differences B − A, σ_D is their standard deviation, and N is the number of queries
• Example. Given N = 10 with A = <25, 43, 39, 75, 43, 15, 20, 52, 49, 50> and B = <35, 84, 15, 75, 68, 85, 80, 50, 58, 75>, the mean of the differences is 21.4; the test statistic and one-tailed P value can be checked with QuickCalcs (http://www.graphpad.com/quickcalcs/ttest2.cfm) or with the sketch below
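A minimal Python sketch of this test using scipy; ttest_rel performs exactly this paired t-test, and the one-sided P value is requested via the alternative argument (available in scipy 1.6+):

```python
from scipy import stats

A = [25, 43, 39, 75, 43, 15, 20, 52, 49, 50]
B = [35, 84, 15, 75, 68, 85, 80, 50, 58, 75]

# Paired t-test; alternative="less" tests whether mean(A - B) < 0,
# i.e., whether system B scores higher than A on average
t, p = stats.ttest_rel(A, B, alternative="less")
print(t, p)  # t ~ -2.33, one-tailed p ~ 0.02
```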
Wilcoxon Signed-Ranks Test
• Nonparametric test based on differences between effectiveness scores
• Test statistic:
  w = Σ_{i=1}^{n} R_i
  where R_i is a signed rank and n is the number of non-zero differences
• To compute the signed ranks, the differences are ordered by their absolute values (increasing), and then assigned rank values
• Rank values are given the sign of the original difference
• Sample website: http://scistatcalc.blogspot.com/2013/10/wilcoxon-signed-rank-test-calculator.html
Wilcoxon Test Example
• 9 non-zero differences are (in order of absolute value): 2, 9, 10, 24, 25, 25, 41, 60, 70
• Signed ranks: -1, +2, +3, -4, +5.5, +5.5, +7, +8, +9
• w (sum of the signed ranks) = 35, two-tailed p-value = 0.038
• The null hypothesis holds if the sum of the ‘+’ ranks equals the sum of the ‘-’ ranks
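A hedged Python sketch using scipy's wilcoxon. Note that scipy reports the smaller of the two rank sums as its statistic, not the signed-rank sum w above, and the ties among the differences force a normal approximation; the two-sided P value is nevertheless comparable:

```python
from scipy import stats

# Per-query differences whose absolute values and signs match the example:
# signed ranks -1, +2, +3, -4, +5.5, +5.5, +7, +8, +9
diffs = [-2, 9, 10, -24, 25, 25, 41, 60, 70]

stat, p = stats.wilcoxon(diffs)
print(stat, p)  # statistic = 5 (smaller rank sum), two-sided p ~ 0.038
```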