Learn about evaluating information retrieval systems by assessing expressiveness, efficiency, and effectiveness. Explore relevance, benchmarks, ranking measures, and precision-recall tradeoffs.
Information Retrieval
Lecture 3 (MSc Computer Science Programme)
Based on Introduction to Information Retrieval (Manning et al. 2007), Chapter 8
Dell Zhang, Birkbeck, University of London
Evaluating an IR System • Is this search engine good? Which search engine is better? • Expressiveness • The ability of the query language to express complex information needs • e.g., Boolean operators, wildcards, phrases, proximity, etc. • Efficiency • How fast does it index? How large is the index? How fast does it search? • Effectiveness – the key measure • How effectively does it find relevant documents?
Relevance • How do we quantify relevance? • A benchmark set of docs (corpus) • A benchmark set of queries • A binary assessment for each query-doc pair • either relevant or irrelevant
Relevance • Relevance should be evaluated according to the information need (which is translated into a query). • [information need] I'm looking for information on whether drinking red wine is more effective at reducing your risk of heart attacks than white wine. • [query] wine red white heart attack effective • We judge whether the document addresses the information need, not whether it has those words.
Benchmarks • Common Test Corpora • TREC • The National Institute of Standards and Technology (NIST) has run a large IR test bed for many years • Reuters • Reuters-21578 • RCV1 • 20 Newsgroups • …… Relevance judgements are given by human experts.
TREC • The TREC Ad Hoc tasks from the first 8 TRECs are standard IR tasks • 50 detailed information needs per year • Human evaluation of pooled results returned • More recently, other related tracks • QA, Web, Genomics, etc.
TREC • A Query from TREC5 • <top> • <num> Number: 225 • <desc> Description: • What is the main function of the Federal Emergency Management Agency (FEMA) and the funding level provided to meet emergencies? Also, what resources are available to FEMA such as people, equipment, facilities? • </top>
Ranking-Ignorant Measures • The IR system returns a certain number of documents. • The retrieved documents are regarded as a set. • This can actually be viewed as classification: each doc is classified/predicted as either 'relevant' or 'irrelevant'.
Contingency Table (p = positive, n = negative; t = true, f = false)
                Relevant               Irrelevant
Retrieved       tp (true positive)     fp (false positive)
Not retrieved   fn (false negative)    tn (true negative)
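As a small illustration (not from the original slides), the four counts can be computed for one query from the set of retrieved docs and the set of docs judged relevant; all doc ids below are hypothetical.

```python
# A minimal sketch, assuming binary relevance judgements for one query.
# The corpus, retrieved set, and relevant set are made up for illustration.

all_docs = {"d1", "d2", "d3", "d4", "d5", "d6"}   # hypothetical corpus
retrieved = {"d1", "d2", "d3"}                    # hypothetical system output
relevant = {"d1", "d3", "d5"}                     # hypothetical judgements

tp = len(retrieved & relevant)                # relevant docs that were retrieved
fp = len(retrieved - relevant)                # irrelevant docs that were retrieved
fn = len(relevant - retrieved)                # relevant docs that were missed
tn = len(all_docs - retrieved - relevant)     # irrelevant docs correctly not retrieved

print(tp, fp, fn, tn)                         # -> 2 1 1 2
```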
Accuracy • Accuracy = (tp+tn) / (tp+fp+tn+fn) • The fraction of correct classifications. • Not a very useful evaluation measure in IR. • Why? • Accuracy puts equal weights on relevant and irrelevant documents. • It is common that the number of relevant documents is very small compared to the total number of documents. • People doing information retrieval want to find something and have a certain tolerance for junk.
Accuracy • Snoogle.com: a Web search engine that returns "0 matching results found." for all queries. • How much time do you need to build it? 1 minute! • How much accuracy does it have? 99.9999%
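A quick numeric sketch of why accuracy is misleading here, using made-up numbers (one million docs, only one of them relevant, chosen to match the 99.9999% figure on the slide):

```python
# Hypothetical scenario: 1,000,000 docs, 1 relevant, and a system that
# retrieves nothing at all (like the "0 matching results" engine above).
total_docs = 1_000_000
n_relevant = 1                       # assumption chosen to match 99.9999%

tp, fp = 0, 0                        # nothing retrieved
fn = n_relevant                      # the one relevant doc is missed
tn = total_docs - n_relevant         # all other docs correctly not retrieved

accuracy = (tp + tn) / (tp + fp + tn + fn)
print(accuracy)                      # -> 0.999999, yet the system finds nothing
```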
Precision and Recall • Precision P = tp/(tp+fp) • The fraction of retrieved docs that are relevant. • Pr[relevant|retrieved] • Recall R = tp/(tp+fn) • The fraction of relevant docs that are retrieved. • Pr[retrieved|relevant] • Recall is a non-decreasing function of the number of docs retrieved. You can get a perfect recall (but low precision) by retrieving all docs for all queries!
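A minimal sketch of the two measures, reusing the hypothetical counts from the contingency-table example (tp = 2, fp = 1, fn = 1):

```python
# Precision and recall from hypothetical contingency counts.
tp, fp, fn = 2, 1, 1

precision = tp / (tp + fp)   # fraction of retrieved docs that are relevant
recall = tp / (tp + fn)      # fraction of relevant docs that are retrieved

print(precision, recall)     # -> 0.666..., 0.666...
```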
Precision and Recall • Precision/Recall Tradeoff • In a good IR system, • precision decreases as recall increases, • and vice versa.
F measure • F: the weighted harmonic mean of P and R • F = 1 / (α/P + (1-α)/R) = (β²+1)PR / (β²P + R), where β² = (1-α)/α • A combined measure that assesses the precision/recall tradeoff. • The harmonic mean is a conservative average.
F measure • F1: the balanced F measure (with β = 1, i.e., α = ½) • F1 = 2PR / (P+R) • The most popular IR evaluation measure
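A short sketch of the F measure in the beta parameterisation given above; the P and R values passed in are hypothetical:

```python
# F_beta = (beta^2 + 1) * P * R / (beta^2 * P + R); beta = 1 gives balanced F1.
def f_measure(p: float, r: float, beta: float = 1.0) -> float:
    if p == 0.0 and r == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * p * r / (b2 * p + r)

print(f_measure(0.6, 0.9))            # balanced F1 -> 0.72
print(f_measure(0.6, 0.9, beta=2.0))  # beta > 1 weights recall more heavily
```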
F measure – Exercise • [Figure] An IR result for query q over docs d1–d5, marking which docs are retrieved and which are relevant/irrelevant. • F1 = ?
Ranking-Aware Measures • The IR system ranks all docs in decreasing order of their relevance to the query. • Returning various numbers of the top-ranked docs leads to different recalls (and accordingly different precisions).
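A minimal sketch of this idea: walking down a (hypothetical) ranked list and recording precision and recall at every cutoff. These (recall, precision) pairs are the points plotted on the precision-recall curve discussed next.

```python
# Hypothetical ranked output and relevance judgements for one query.
ranking = ["d3", "d7", "d1", "d9", "d2", "d5"]
relevant = {"d1", "d3", "d5"}

points = []
hits = 0
for k, doc in enumerate(ranking, start=1):
    if doc in relevant:
        hits += 1
    precision = hits / k              # precision of the top k docs
    recall = hits / len(relevant)     # recall of the top k docs
    points.append((recall, precision))

print(points)                         # (recall, precision) at each cutoff k = 1..6
```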
Precision-Recall Curve • The interpolated precision at a recall level R • The highest precision found for any recall level R' ≥ R: p_interp(R) = max over R' ≥ R of p(R') • Removes the jiggles in the precision-recall curve.
11-Point Interpolated Average Precision • For each information need, the interpolated precision is measured at 11 recall levels • 0.0, 0.1, 0.2, …, 1.0 • The measured interpolated precisions are averaged (i.e., arithmetic mean) over the set of queries in the benchmark. • A composite precision-recall curve showing 11 points can be graphed.
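A minimal sketch of the 11-point computation for a single query, using hypothetical (recall, precision) points such as those produced by the cutoff loop above; averaging the same quantity over all queries in the benchmark gives the composite figure.

```python
# Hypothetical (recall, precision) points for one query.
points = [(1/3, 1.0), (2/3, 2/3), (1.0, 0.5)]

def interpolated_precision(r: float) -> float:
    # Highest precision at any recall level >= r (0 if none).
    candidates = [p for rec, p in points if rec >= r]
    return max(candidates) if candidates else 0.0

levels = [i / 10 for i in range(11)]               # 0.0, 0.1, ..., 1.0
eleven_point = sum(interpolated_precision(r) for r in levels) / len(levels)
print(eleven_point)                                # -> about 0.73 for these points
```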
11-Point Interpolated Average Precision • [Figure] The 11-point precision-recall curve of a representative (good) TREC system.
Mean Average Precision (MAP) • For one information need, the average precision is the average of the precision values obtained for the top k docs each time a relevant doc is retrieved. • No use of fixed recall levels. No interpolation. • When a relevant doc is not retrieved at all, its precision value is taken to be 0. • The MAP value for a test collection is then the arithmetic mean of the average precision values for the individual information needs. • Macro-averaging: each query counts equally.
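A small sketch of average precision for one query and MAP over several queries; rankings and judgements are hypothetical.

```python
def average_precision(ranking, relevant):
    # Average of the precision values at the rank of each relevant doc;
    # relevant docs that are never retrieved contribute 0.
    if not relevant:
        return 0.0
    hits, precisions = 0, []
    for k, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(relevant)

# MAP = arithmetic mean of average precision over the queries (macro-average).
runs = [
    (["d3", "d7", "d1"], {"d1", "d3"}),   # query 1 (hypothetical)
    (["d9", "d2", "d5"], {"d5"}),         # query 2 (hypothetical)
]
map_score = sum(average_precision(r, rel) for r, rel in runs) / len(runs)
print(map_score)                          # -> about 0.58
```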
Precision/Recall at k • Prec@k: Precision on the top k retrieved docs. • Appropriate for Web search engines • Most users scan only the first few (e.g., 10) hyperlinks that are presented. • Rec@k: Recall on the top k retrieved docs. • Appropriate for archival retrieval systems • What fraction of the total number of relevant docs did a user find after scanning the first (say 100) docs?
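A minimal sketch of both measures on a hypothetical ranked list:

```python
def precision_at_k(ranking, relevant, k):
    hits = sum(1 for doc in ranking[:k] if doc in relevant)
    return hits / k

def recall_at_k(ranking, relevant, k):
    hits = sum(1 for doc in ranking[:k] if doc in relevant)
    return hits / len(relevant) if relevant else 0.0

ranking = ["d3", "d7", "d1", "d9", "d2", "d5"]   # hypothetical
relevant = {"d1", "d3", "d5"}                    # hypothetical

print(precision_at_k(ranking, relevant, 3))      # Prec@3 -> 0.666...
print(recall_at_k(ranking, relevant, 3))         # Rec@3  -> 0.666...
```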
R-Precision • Precision on the top Rel retrieved docs • Rel is the size of the set of relevant documents (though perhaps incomplete). • A perfect IR system could score 1 on this metric for each query.
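A one-function sketch of R-precision on the same kind of hypothetical data:

```python
def r_precision(ranking, relevant):
    # Precision at cutoff Rel, where Rel is the number of relevant docs.
    rel = len(relevant)
    if rel == 0:
        return 0.0
    hits = sum(1 for doc in ranking[:rel] if doc in relevant)
    return hits / rel

print(r_precision(["d3", "d7", "d1", "d9"], {"d1", "d3", "d5"}))  # -> 0.666...
```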
PRBEP • Given a precision-recall curve, the Precision/Recall Break-Even Point (PRBEP) is the value at which precision equals recall. • It is obvious from the definitions of precision and recall that equality is achieved for contingency tables with tp+fp = tp+fn, i.e., when the number of retrieved docs equals the number of relevant docs.
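A sketch of where the break-even occurs on a hypothetical ranked list: precision equals recall exactly at the cutoff where the number of retrieved docs equals the number of relevant docs (so it coincides with R-precision here).

```python
ranking = ["d3", "d7", "d1", "d9", "d2", "d5"]   # hypothetical
relevant = {"d1", "d3", "d5"}                    # hypothetical

hits = 0
for k, doc in enumerate(ranking, start=1):
    if doc in relevant:
        hits += 1
    precision, recall = hits / k, hits / len(relevant)
    if k == len(relevant):               # tp + fp == tp + fn at this cutoff
        print("PRBEP =", precision)      # precision == recall here -> 0.666...
```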
ROC Curve • An ROC curve plots the true positive rate (sensitivity) against the false positive rate (1 - specificity). • true positive rate = sensitivity = recall = tp/(tp+fn) • false positive rate = fp/(fp+tn) = 1 - specificity • specificity = tn/(fp+tn) • The area under the ROC curve (AUC) summarizes the whole curve as a single number.
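A minimal sketch of the ROC points and the area under the curve (by the trapezoidal rule) for a hypothetical ranked list, treating the ranked docs as the whole judged collection:

```python
ranking = ["d3", "d7", "d1", "d9", "d2", "d5"]   # hypothetical ranked list
relevant = {"d1", "d3", "d5"}                    # hypothetical judgements
n_rel = len(relevant)
n_irr = len(ranking) - n_rel     # assumes every ranked doc has a judgement

points = [(0.0, 0.0)]            # (FPR, TPR), starting at the origin
tp = fp = 0
for doc in ranking:
    if doc in relevant:
        tp += 1
    else:
        fp += 1
    points.append((fp / n_irr, tp / n_rel))

# Area under the curve via the trapezoidal rule.
auc = sum((x2 - x1) * (y1 + y2) / 2
          for (x1, y1), (x2, y2) in zip(points, points[1:]))
print(auc)                       # -> about 0.56 for this ranking
```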
Variance in Performance • It is normally the case that the variance in performance of the same system across different queries is much greater than the variance in performance of different systems on the same query. • For a test collection, an IR system may perform terribly on some information needs (e.g., MAP = 0.1) but excellently on others (e.g., MAP = 0.7). • There are easy information needs and hard ones!
Take Home Messages • Evaluation of Effectiveness based on Relevance • Ranking-Ignorant Measures • Accuracy; Precision & Recall • F measure (especially F1) • Ranking-Aware Measures • Precision-Recall curve • 11-Point, MAP, P/R at k, R-Precision, PRBEP • ROC curve