
Evaluation



  1. Evaluation

  2. Types of Evaluation • Might evaluate several aspects • Evaluation generally comparative • System A vs. B • System A vs A´ • Most common evaluation - retrieval effectiveness • Assistance in formulating queries • Speed of retrieval • Resources required • Presentation of documents • Ability to find relevant documents

  3. The Concept of Relevance • Relevance of a document D to a query Q is subjective • Different users will have different judgments • The same user may judge differently at different times • The degree of relevance of different documents may vary

  4. The Concept of Relevance • In evaluating IR systems it is assumed that: • A subset of the documents of the database (DB) are relevant • A document is either relevant or not

  5. Relevance • In a small collection - the relevance of each document can be checked • With real collections, never know full set of relevant documents • Any retrieval model includes an implicit definition of relevance • Satisfiability of a FOL expression • Distance • P(Relevance|query,document) • P(query|document)

  6. Evaluation • Set of queries • Collection of documents (corpus) • Relevance judgments: which documents are correct and incorrect for each query • If small collection, can review all documents • Not practical for large collections • Any ideas about how we might approach collecting relevance judgments for very large collections? • [Figure: an example query on potato farming and the nutritional value of potatoes, where documents on growing potatoes and nutritional info for spuds are judged relevant, while documents on potato blight and Mr. Potato Head are judged non-relevant]

  7. Finding Relevant Documents • Pooling • Retrieve documents using several automatic techniques • Judge top n documents for each technique • Relevant set is union • Subset of true relevant set • Possible to estimate size of relevant set by sampling • When testing: • How should un-judged documents be treated? • How might this affect results?
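A minimal sketch of pooling in Python, not any particular tool's implementation: the judgment pool for one query is the union of the top-n documents from several automatic runs, and only pooled documents are judged. The run names and document IDs below are made up for illustration.

```python
def build_pool(runs, n=100):
    """runs: dict mapping run name -> ranked list of doc IDs for one query."""
    pool = set()
    for ranking in runs.values():
        pool.update(ranking[:n])   # only the top n of each run get judged
    return pool

runs = {
    "bm25":    ["d7", "d2", "d9", "d4"],
    "lm":      ["d2", "d5", "d7", "d8"],
    "boolean": ["d1", "d2", "d3", "d9"],
}
print(sorted(build_pool(runs, n=3)))   # ['d1', 'd2', 'd3', 'd5', 'd7', 'd9']
```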

  8. Test Collections • To compare the performance of two techniques: • each technique used to evaluate same queries • results (set or ranked list) compared using metric • most common measures - precision and recall • Usually use multiple measures to get different views of performance • Usually test with multiple collections – • performance is collection dependent

  9. Evaluation • [Figure: Venn diagram of the retrieved, relevant, and relevant-and-retrieved document sets] • Let retrieved = 100, relevant = 25, rel & ret = 10 • Recall = 10/25 = .40 • Ability to return ALL relevant items • Precision = 10/100 = .10 • Ability to return ONLY relevant items
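A minimal sketch of the set-based computation above, using the slide's counts:

```python
# Set-based recall and precision for the counts on this slide.
retrieved, relevant, rel_and_ret = 100, 25, 10

recall = rel_and_ret / relevant      # 10/25 = 0.40: how many of the relevant items were found
precision = rel_and_ret / retrieved  # 10/100 = 0.10: how much of the output is relevant
print(recall, precision)             # 0.4 0.1
```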

  10. Precision and Recall • Precision and recall well-defined for sets • For ranked retrieval • Compute value at fixed recall points (e.g. precision at 20% recall) • Compute a P/R point for each relevant document, interpolate • Compute value at fixed rank cutoffs (e.g. precision at rank 20)

  11. Average Precision for a Query • Often want a single-number effectiveness measure • Average precision is widely used in IR • Calculated by averaging the precision values at each point where recall increases (i.e., at each relevant document retrieved)
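A sketch of (non-interpolated) average precision, assuming a hypothetical ranked list and relevance set: precision is taken at each rank where a relevant document appears (i.e., where recall increases) and averaged over the total number of relevant documents, so relevant documents that are never retrieved contribute zero.

```python
def average_precision(ranking, relevant):
    """Average the precision at each rank where a relevant document
    is retrieved; divide by the total number of relevant documents."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

# Hypothetical example: relevant docs found at ranks 1, 3, and 6.
print(average_precision(["d1", "x", "d2", "x", "x", "d3"], {"d1", "d2", "d3"}))
# (1/1 + 2/3 + 3/6) / 3 ≈ 0.72
```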

  12. Averaging Across Queries • Hard to compare P/R graphs or tables for individual queries (too much data) • Need to average over many queries • Two main types of averaging • Micro-average - each relevant document is a point in the average (most common) • Macro-average - each query is a point in the average • Also done with average precision value • Average of many queries’ average precision values • Called mean average precision (MAP) • “Average average precision” sounds weird
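Mean average precision is then the macro-average of the per-query values. A sketch reusing the average_precision function from the previous example; `results` and `judgments` are hypothetical dicts keyed by query ID.

```python
def mean_average_precision(results, judgments):
    """Macro-average: each query's average precision is one point.
    results: query -> ranked doc list; judgments: query -> set of relevant docs."""
    aps = [average_precision(results[q], judgments[q]) for q in results]
    return sum(aps) / len(aps) if aps else 0.0
```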

  13. Averaging and Interpolation • Interpolation • actual recall levels of individual queries are seldom equal to standard levels • interpolation estimates the best possible performance value between two known values • e.g.) assume 3 relevant docs retrieved at ranks 4, 9, 20 • their precision at actual recall is .25, .22, and .15 • On average, as recall increases, precision decreases

  14. Averaging and Interpolation • Actual recall levels of individual queries are seldom equal to standard levels • Interpolated precision at the ith recall level, Ri, is the maximum precision at all points p such that Ri ≤ p ≤ Ri+1 • assume only 3 relevant docs retrieved at ranks 4, 9, 20 • their actual recall points are: .33, .67, and 1.0 • their precision is .25, .22, and .15 • what is interpolated precision at standard recall points?

  Recall level            Interpolated precision
  0.0, 0.1, 0.2, 0.3      0.25
  0.4, 0.5, 0.6           0.22
  0.7, 0.8, 0.9, 1.0      0.15
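The table above can be reproduced with a short sketch. Note this uses the common TREC-style rule (maximum precision at any recall point at or above the level), which yields the same values here as the between-levels definition given on the slide.

```python
def interpolated_precision(relevant_ranks, num_relevant, levels=None):
    """Interpolated precision at standard recall levels: the maximum
    precision at any recall point at or above each level."""
    if levels is None:
        levels = [i / 10 for i in range(11)]          # 0.0, 0.1, ..., 1.0
    # One precision/recall point for every relevant document retrieved.
    points = [((i + 1) / num_relevant, (i + 1) / rank)
              for i, rank in enumerate(sorted(relevant_ranks))]
    interp = {}
    for level in levels:
        candidates = [p for r, p in points if r >= level]
        interp[level] = max(candidates) if candidates else 0.0
    return interp

# Slide example: 3 relevant docs retrieved at ranks 4, 9, 20.
for level, p in interpolated_precision([4, 9, 20], 3).items():
    print(f"{level:.1f}: {p:.2f}")   # 0.25 up to recall 0.3, then 0.22, then 0.15
```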

  15. Interpolated Average Precision • Average precision at standard recall points • For a given query, compute P/R point for every relevant doc. • Interpolate precision at standard recall levels • 11-pt is usually 100%, 90, 80, …, 10, 0% (yes, 0% recall) • 3-pt is usually 75%, 50%, 25% • Average over all queries to get average precision at each recall level • Average interpolated recall levels to get single result • Called “interpolated average precision” • Not used much anymore; “mean average precision” more common • Values at specific interpolated points still commonly used

  16. Micro-averaging: 1 Qry • Let Rq = {d3, d5, d9, d25, d39, d44, d56, d71, d89, d123}; |Rq| = 10, the number of relevant docs for q • Ranking of retrieved docs in the answer set of q: [ranking not shown] • Find precision given the total number of docs retrieved at a given recall value • 10% Recall => .1 * 10 rel docs = 1 rel doc retrieved • One doc retrieved to get 1 rel doc: precision = 1/1 = 100%

  17. Micro-averaging: 1 Qry • Let Rq = {d3, d5, d9, d25, d39, d44, d56, d71, d89, d123}; |Rq| = 10, the number of relevant docs for q • Ranking of retrieved docs in the answer set of q: [ranking not shown] • Find precision given the total number of docs retrieved at a given recall value • 10% Recall => .1 * 10 rel docs = 1 rel doc retrieved • One doc retrieved to get 1 rel doc: precision = 1/1 = 100% • 20% Recall => .2 * 10 rel docs = 2 rel docs retrieved • 3 docs retrieved to get 2 rel docs: precision = 2/3 = 0.667

  18. Micro-averaging: 1 Qry • Let Rq = {d3, d5, d9, d25, d39, d44, d56, d71, d89, d123}; |Rq| = 10, the number of relevant docs for q • Ranking of retrieved docs in the answer set of q: [ranking not shown] • 10% Recall => .1 * 10 rel docs = 1 rel doc retrieved • One doc retrieved to get 1 rel doc: precision = 1/1 = 100% • 20% Recall => .2 * 10 rel docs = 2 rel docs retrieved • 3 docs retrieved to get 2 rel docs: precision = 2/3 = 0.667 • 30% Recall => .3 * 10 rel docs = 3 rel docs retrieved • 6 docs retrieved to get 3 rel docs: precision = 3/6 = 0.5 • What is precision at recall values from 40-100%?

  19. Recall/Precision Curve • |Rq| = 10, the number of relevant docs for q • Ranking of retrieved docs in the answer set of q: [recall/precision plot not shown]

  Recall   Precision
  0.1      1/1 = 100%
  0.2      2/3 = 67%
  0.3      3/6 = 50%
  0.4      4/10 = 40%
  0.5      5/15 = 33%
  0.6      0%
  …        …
  1.0      0%
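A sketch that reproduces this table, assuming the hypothetical ranking implied by the quoted fractions: relevant documents at ranks 1, 3, 6, 10, and 15 of a 15-document result list, with the other five relevant documents never retrieved.

```python
# Reproduces the precision values quoted for the Rq example, under the
# assumption stated above about where the relevant documents appear.
num_relevant = 10                     # |Rq| = 10
relevant_ranks = [1, 3, 6, 10, 15]    # only 5 of the 10 are ever retrieved

for i, rank in enumerate(relevant_ranks, start=1):
    recall = i / num_relevant
    print(f"recall {recall:.1f}: precision {i}/{rank} = {i / rank:.2f}")
# Recall levels 0.6-1.0 are never reached, so precision there is 0.
```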

  20. Averaging and Interpolation • Macro-average: each query is a point in the avg • Can be independent of any parameter • Average of precision values across several queries at standard recall levels • e.g.) assume 3 relevant docs retrieved at ranks 4, 9, 20 • their actual recall points are: .33, .67, and 1.0 (why?) • their precision is .25, .22, and .15 (why?) • Average over all relevant docs: (.25 + .22 + .15)/3 = 0.21 • Rewards systems that retrieve relevant docs at the top

  21. Recall-Precision Tables & Graphs

  22. Document Level Averages • Precision after a given number of docs retrieved • e.g.) 5, 10, 15, 20, 30, 100, 200, 500, & 1000 documents • Reflects the actual system performance as a user might see it • Each precision avg is computed by summing precisions at the specified doc cut-off and dividing by the number of queries • e.g. average precision for all queries at the point where n docs have been retrieved

  23. R-Precision • Precision after R documents are retrieved • R = number of relevant docs for the query • Average R-Precision • mean of the R-Precisions across all queries e.g.) Assume 2 qrys having 50 & 10 relevant docs; system retrieves 17 and 7 relevant docs in the top 50 and 10 documents retrieved, respectively
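Completing the slide's example in a short sketch: the two queries have R-Precisions of 17/50 = 0.34 and 7/10 = 0.70, so the average R-Precision is 0.52.

```python
def r_precision(relevant_retrieved_at_R, R):
    """Precision after exactly R documents are retrieved,
    where R is the number of relevant documents for the query."""
    return relevant_retrieved_at_R / R

# Slide example: two queries with 50 and 10 relevant documents;
# 17 and 7 relevant documents appear in the top 50 and top 10 respectively.
rp = [r_precision(17, 50), r_precision(7, 10)]   # 0.34 and 0.70
print(sum(rp) / len(rp))                         # average R-Precision = 0.52
```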

  24. Evaluation • Recall-Precision value pairs may co-vary in ways that are hard to understand • Would like to find composite measures • A single number measure of effectiveness • primarily ad hoc and not theoretically justifiable • Some attempt to invent measures that combine parts of the contingency table into a single number measure

  25. Contingency Table • [Contingency table figure not shown: documents cross-classified as relevant vs. non-relevant and retrieved vs. not retrieved] • Miss = C/(A+C)
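A sketch of measures derived from the contingency table, assuming the usual labeling (A = relevant and retrieved, B = non-relevant and retrieved, C = relevant and not retrieved, D = non-relevant and not retrieved), which is the labeling implied by Miss = C/(A+C). The counts reuse the earlier example (25 relevant, 100 retrieved, 10 both, in a collection of 1000).

```python
def contingency_measures(A, B, C, D):
    """A = relevant & retrieved, B = non-relevant & retrieved,
    C = relevant & not retrieved, D = non-relevant & not retrieved."""
    return {
        "recall":    A / (A + C),   # fraction of relevant docs retrieved
        "precision": A / (A + B),   # fraction of retrieved docs that are relevant
        "fallout":   B / (B + D),   # fraction of non-relevant docs retrieved
        "miss":      C / (A + C),   # fraction of relevant docs not retrieved
    }

print(contingency_measures(A=10, B=90, C=15, D=885))
# recall 0.40, precision 0.10, fallout ≈ 0.09, miss 0.60
```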

  26. Symmetric Difference • A is the retrieved set of documents • B is the relevant set of documents • A Δ B (the symmetric difference) is the shaded area of the Venn diagram [figure not shown]

  27. E measure (van Rijsbergen) • Used to emphasize precision or recall • E = 1 − 1/(α/P + (1−α)/R): like a weighted average of precision and recall • Large α increases importance of precision • Can transform via α = 1/(β² + 1), β = P/R • When α = 1/2, β = 1: precision and recall are equally important • E is the normalized symmetric difference of the retrieved and relevant sets: E(β=1) = |A Δ B| / (|A| + |B|) • F = 1 − E is typical (good results mean larger values of F)
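A sketch of E and F = 1 − E, assuming van Rijsbergen's formulation E = 1 − 1/(α/P + (1−α)/R) with α = 1/(β² + 1); at β = 1, F reduces to the harmonic mean 2PR/(P + R).

```python
def e_measure(precision, recall, beta=1.0):
    """van Rijsbergen's E: E = 1 - 1/(alpha/P + (1-alpha)/R),
    with alpha = 1/(beta^2 + 1). Larger alpha emphasizes precision;
    beta = 1 (alpha = 1/2) weights precision and recall equally."""
    alpha = 1.0 / (beta ** 2 + 1.0)
    return 1.0 - 1.0 / (alpha / precision + (1.0 - alpha) / recall)

def f_measure(precision, recall, beta=1.0):
    """F = 1 - E; larger values mean better results."""
    return 1.0 - e_measure(precision, recall, beta)

# With P = 0.10 and R = 0.40 (the earlier example), F at beta = 1 is 0.16.
print(f_measure(0.10, 0.40))
```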

  28. Expected Search Length • Evaluation is based on the type of information need, e.g.) • only one relevant document required • some arbitrary number n • all relevant documents • a given proportion of relevant documents … • Search strategy output assumed to be a weak ordering • A simple ordering never has two or more documents at the same level of the ordering • Search length in a simple ordering is the number of non-relevant documents a user must scan before the information need is satisfied • Expected search length is appropriate for a weak ordering
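A sketch of search length for a simple ordering, plus a Monte Carlo estimate of expected search length for a weak ordering, under the assumption that documents within a tied level are equally likely to appear in any order; the document IDs and levels are hypothetical.

```python
import random

def search_length(ranking, relevant, need=1):
    """Search length in a simple ordering: the number of non-relevant
    documents scanned before `need` relevant documents have been found."""
    found = scanned_nonrel = 0
    for doc in ranking:
        if doc in relevant:
            found += 1
            if found == need:
                return scanned_nonrel
        else:
            scanned_nonrel += 1
    return scanned_nonrel   # need never satisfied: all non-relevant docs scanned

def expected_search_length(levels, relevant, need=1, trials=10000):
    """For a weak ordering (a list of tied levels), estimate the expected
    search length by averaging over random orderings within each level."""
    total = 0
    for _ in range(trials):
        ranking = []
        for level in levels:
            level = list(level)
            random.shuffle(level)   # documents within a level are unordered
            ranking.extend(level)
        total += search_length(ranking, relevant, need)
    return total / trials

# Two tied levels; the user wants one relevant document. Expected value ≈ 0.5.
print(expected_search_length([["d1", "d2"], ["d3", "d4", "d5"]], {"d2", "d4"}, need=1))
```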

  29. Expected Search Length

  30. Other Single-Valued Measures • Breakeven point • point at which precision = recall • Swets model • use statistical decision theory to express recall, precision, and fallout in terms of conditional probabilities • Utility measures • assign costs to each cell in the contingency table • sum (or average) costs for all queries • Many others...
