This document explores various methodologies for assessing relevance in information retrieval, emphasizing user feedback and system performance. It discusses personal assessment of relevance, the extension of dialog with relevance feedback (RelFbk), and aggregated assessments of search engine performance. Key components include cognitive assumptions affecting user opinions, the use of nonmetric relevance scales, and effective measures for evaluating retrieval success such as precision and recall metrics. The insights aim to enhance the effectiveness of retrieval systems and user satisfaction.
Assessing The Retrieval A.I Lab 2007.01.20 박동훈
Contents • 4.1 Personal Assessment of Relevance • 4.2 Extending the Dialog with RelFbk • 4.3 Aggregated Assessment : Search Engine Performance • 4.4 RAVE : A Relevance Assessment Vehicle • 4.5 Summary
4.1 Personal Assessment of Relevance • 4.1.1 Cognitive Assumptions • Users trying to do ‘object recognition’ • Comparison with respect to prototypic document • Reliability of user opinions? • Relevance Scale • RelFbk is nonmetric
RelFbk is nonmetric • Users naturally provide only preference information • Not a (metric) measurement of how relevant a retrieved document is!
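To make the point concrete, here is a hypothetical sketch of how such feedback might be represented: not as numeric relevance scores, but as bare "di preferred to dj" pairs with no magnitude attached (the names preference_pairs, preferred, and not_preferred are assumptions, not the chapter's notation).

```python
from itertools import product

def preference_pairs(preferred, not_preferred):
    """RelFbk as nonmetric preferences: every liked doc outranks every disliked one."""
    return set(product(preferred, not_preferred))

print(preference_pairs({"d2", "d5"}, {"d1", "d3"}))
# e.g. {('d2', 'd1'), ('d2', 'd3'), ('d5', 'd1'), ('d5', 'd3')}
```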
4.2 Extending the Dialog with RelFbk • RelFbk Labeling of the Retr Set
4.2.2 Document Modifications due to RelFbk • Fig 4.7: Change the documents!? Make them match more/less the queries that did / did not successfully match them
4.3 Aggregated Assessment : Search Engine Performance • 4.3.1 Underlying Assumptions • RelFbk(q,di) assessments independent • Users’ opinions will all agree with single ‘omniscient’ expert’s
4.3.2 Consensual relevance • Consensually relevant
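One simple way to operationalize this, sketched under an assumed majority-vote rule (not necessarily the chapter's exact procedure): pool several users' binary RelFbk judgments and call a document consensually relevant when most assessors agree.

```python
def consensually_relevant(judgments, threshold=0.5):
    """judgments: doc id -> list of binary RelFbk votes from different users."""
    return {doc for doc, votes in judgments.items()
            if sum(votes) / len(votes) > threshold}

print(consensually_relevant({"d1": [1, 1, 0], "d2": [0, 1, 0], "d3": [1, 1, 1]}))
# {'d1', 'd3'} (set order may vary)
```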
4.3.4 Basic Measures • Relevant versus Retrieved Sets
Contingency table • NRet : the number of retrieved documents • NNRet : the number of documents not retrieved • NRel : the number of relevant documents • NNRel : the number of irrelevant documents • NDoc : the total number of documents
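As a concrete illustration of these counts, the sketch below (Python; the cell names a, b, c, d and the example document ids are assumptions) builds the contingency table from a retrieved set and a relevant set, then derives precision and recall from it.

```python
def contingency(retrieved, relevant, all_docs):
    """Contingency counts for one query, using set arithmetic."""
    a = len(retrieved & relevant)              # relevant and retrieved
    b = len(retrieved - relevant)              # irrelevant but retrieved
    c = len(relevant - retrieved)              # relevant but not retrieved
    d = len(all_docs - retrieved - relevant)   # irrelevant and not retrieved
    return a, b, c, d                          # a+b = NRet, a+c = NRel, a+b+c+d = NDoc

a, b, c, d = contingency({1, 2, 3, 4}, {2, 3, 4, 7, 9}, set(range(1, 11)))
precision = a / (a + b)   # 3/4 of what was retrieved is relevant
recall    = a / (a + c)   # 3/5 of what is relevant was retrieved
```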
4.3.5 Ordering the Retr set • Each document assigned hitlist rank Rank(di) • Descending Match(q,di) • Rank(di)<Rank(dj) ⇔ Match(q,di)>Match(q,dj) • Rank(di)<Rank(dj) ⇔ Pr(Rel(di))>Pr(Rel(dj)) • Coordination level : document’s rank in Retr • Number of keywords shared by doc and query • Goal: Probability Ranking Principle
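A minimal sketch of hitlist ordering (the names rank_hitlist and match_scores are assumed): documents are sorted by descending Match(q, d), so a lower rank corresponds to a higher match score, in line with the Probability Ranking Principle.

```python
def rank_hitlist(match_scores):
    """match_scores: doc id -> Match(q, d). Returns (rank, doc, score), rank 1 = best match."""
    ordered = sorted(match_scores.items(), key=lambda kv: kv[1], reverse=True)
    return [(rank, doc, score) for rank, (doc, score) in enumerate(ordered, start=1)]

print(rank_hitlist({"d1": 0.2, "d2": 0.9, "d3": 0.5}))
# [(1, 'd2', 0.9), (2, 'd3', 0.5), (3, 'd1', 0.2)]
```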
A tale of two retrievals (Query 1 vs. Query 2)
Recall/precision curve (Query 1)
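One way to produce such a curve, sketched with assumed names: walk down the ranked hitlist and record a (recall, precision) point each time a relevant document is encountered. Running it on two different orderings of the same relevant documents reproduces the "tale of two retrievals" contrast.

```python
def recall_precision_curve(ranked_docs, relevant):
    """Emit one (recall, precision) point per relevant document found in the hitlist."""
    points, hits = [], 0
    for i, doc in enumerate(ranked_docs, start=1):
        if doc in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / i))
    return points

rel = {"a", "b", "c"}
print(recall_precision_curve(["a", "b", "x", "c"], rel))            # precision stays high
print(recall_precision_curve(["x", "a", "y", "b", "z", "c"], rel))  # precision decays faster
```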
4.3.6 Normalized recall • Compares the actual ranking against the best and worst possible orderings • ri : hitlist rank of the i-th relevant document
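A sketch of normalized recall under its usual formulation (function and argument names are assumed): the sum of the relevant documents' hitlist ranks is compared with the best possible sum (ranks 1..NRel) and scaled by the worst-case spread, giving 1.0 for the best ordering and 0.0 for the worst.

```python
def normalized_recall(rel_ranks, n_docs):
    """rel_ranks: hitlist ranks r_i of the relevant documents (1-based).
    R_norm = 1 - (sum(r_i) - sum(i)) / (N_Rel * (N_Doc - N_Rel))."""
    n_rel = len(rel_ranks)
    best = n_rel * (n_rel + 1) // 2            # ranks 1..N_Rel
    return 1.0 - (sum(rel_ranks) - best) / (n_rel * (n_docs - n_rel))

print(normalized_recall([1, 2, 3], n_docs=10))   # 1.0, best case
print(normalized_recall([8, 9, 10], n_docs=10))  # 0.0, worst case
```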
4.3.8 One-Parameter Criteria • Combining recall and precision • Classification accuracy • Sliding ratio • Point alienation
Combining recall and precision • F-measure • [Jardine & van Rijsbergen, 1971] • [Lewis & Gale, 1994] • Effectiveness • [van Rijsbergen, 1979] • E = 1 - F, α = 1/(β² + 1) • α = 0.5 ⇒ harmonic mean of precision & recall
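A sketch of these combinations (function names assumed): F weights precision against recall via β, and van Rijsbergen's E is simply 1 - F; with β = 1 (α = 0.5), F reduces to the harmonic mean of precision and recall.

```python
def f_measure(precision, recall, beta=1.0):
    """F_beta = (beta^2 + 1) * P * R / (beta^2 * P + R); beta = 1 is the harmonic mean."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta ** 2
    return (b2 + 1) * precision * recall / (b2 * precision + recall)

def effectiveness(precision, recall, beta=1.0):
    """van Rijsbergen's E = 1 - F (lower is better)."""
    return 1.0 - f_measure(precision, recall, beta)

print(f_measure(0.75, 0.6))      # ~0.667
print(effectiveness(0.75, 0.6))  # ~0.333
```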
Classification accuracy • Accuracy: correct identification of both relevant and irrelevant documents
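In terms of the contingency counts sketched earlier (reusing the assumed cell names a, b, c, d): accuracy is the fraction of all documents handled correctly, i.e. relevant-and-retrieved plus irrelevant-and-not-retrieved over the whole collection.

```python
def accuracy(a, b, c, d):
    """(relevant & retrieved + irrelevant & not retrieved) / NDoc."""
    return (a + d) / (a + b + c + d)

print(accuracy(3, 1, 2, 4))  # 0.7 for the earlier example (NDoc = 10)
```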
Sliding ratio • Imagine a nonbinary, metric Rel(di) measure • Rank1, Rank2 computed by two separate systems
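A sketch of the sliding ratio under one common formulation (the names and the exact normalization are assumptions): at each hitlist cutoff, the relevance accumulated by one ranking is divided by the relevance accumulated by the other (typically the ideal ordering), so the ratio climbs toward 1.0 as the two orderings converge.

```python
def sliding_ratio(rel_scores, ranking_a, ranking_b):
    """rel_scores: doc id -> nonbinary, metric Rel(d).
    Ratio of cumulated relevance of ranking_a to ranking_b at each cutoff."""
    ratios, cum_a, cum_b = [], 0.0, 0.0
    for doc_a, doc_b in zip(ranking_a, ranking_b):
        cum_a += rel_scores[doc_a]
        cum_b += rel_scores[doc_b]
        ratios.append(cum_a / cum_b if cum_b else 0.0)
    return ratios

rel = {"d1": 3, "d2": 2, "d3": 0, "d4": 1}
ideal  = ["d1", "d2", "d4", "d3"]          # ordered by decreasing Rel(d)
system = ["d2", "d3", "d1", "d4"]
print(sliding_ratio(rel, system, ideal))   # [0.67, 0.4, 0.83, 1.0] (approx.)
```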
Point alienation • Developed to measure human preference data • Capturing fundamental nonmetric nature of RelFbk
4.3.9 Test corpora • More data required for “test corpus” • Standard test corpora • TREC: Text REtrieval Conference • TREC’s refined queries • TREC constantly expanding, refining tasks
More data required for “test corpus” • Documents • Queries • Relevance assessments Rel(q,d) • Perhaps other data too • Classification data (Reuters) • Hypertext graph structure (EB5)
TREC constantly expanding, refining tasks • Ad hoc query task • Routing/filtering task • Interactive task
Other Measures • Expected search length (ESL) • Length of “path” as the user walks down the HitList • ESL = number of irrelevant documents examined before each relevant document • ESL for random retrieval • ESL reduction factor
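A sketch of ESL in the spirit of Cooper's measure (the function name and the stopping rule "first n_wanted relevant documents" are assumptions): count the irrelevant documents the user must pass over while walking down the hitlist.

```python
def expected_search_length(ranked_docs, relevant, n_wanted=1):
    """Number of irrelevant documents examined before the n_wanted-th relevant one."""
    irrelevant_seen, found = 0, 0
    for doc in ranked_docs:
        if doc in relevant:
            found += 1
            if found == n_wanted:
                return irrelevant_seen
        else:
            irrelevant_seen += 1
    return irrelevant_seen   # hitlist exhausted before n_wanted relevant docs were found

print(expected_search_length(["x", "a", "y", "b"], {"a", "b"}, n_wanted=2))  # 2
```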
4.5 Summary • Discussed both metric and nonmetric relevance feedback • The difficulties in getting users to provide relevance judgments for documents in the retrieved set • Quantified several measures of system performance