
Information Retrieval: Problem Formulation & Evaluation






Presentation Transcript


  1. Information Retrieval: Problem Formulation & Evaluation ChengXiang Zhai Department of Computer Science University of Illinois, Urbana-Champaign

  2. Research Process • Identification of a research question/topic • Propose a possible solution/answer (formulate a hypothesis) • Implement the solution • Design experiments (measures, data, etc.) • Test the solution/hypothesis • Draw conclusions • Repeat the cycle of question-answering or hypothesis-formulation-and-testing if necessary (Today’s lecture)

  3. Part 1: IR Problem Formulation

  4. Basic Formulation of TR (traditional) • Vocabulary V = {w1, w2, …, wN} of a language • Query q = q1,…,qm, where qi ∈ V • Document di = di1,…,dimi, where dij ∈ V • Collection C = {d1, …, dk} • Set of relevant documents R(q) ⊆ C • Generally unknown and user-dependent • Query is a “hint” on which docs are in R(q) • Task = compute R’(q), an approximation of R(q) (i.e., decide which documents to return to a user)

  5. Computing R(q) • Strategy 1: Document selection • R(q) = {d ∈ C | f(d,q) = 1}, where f(d,q) ∈ {0,1} is an indicator function or classifier • System must decide if a doc is relevant or not (“absolute relevance”) • Strategy 2: Document ranking • R(q) = {d ∈ C | f(d,q) > θ}, where f(d,q) ∈ ℜ is a relevance measure function and θ is a cutoff • System must decide if one doc is more likely to be relevant than another (“relative relevance”)
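  A minimal sketch of the two strategies in Python (an illustration, not part of the slides), assuming a hypothetical real-valued scoring function score(d, q):

    def select(docs, query, score, theta):
        # Strategy 1: absolute relevance -- return the unordered set of
        # documents whose score clears the threshold theta.
        return {d for d in docs if score(d, query) >= theta}

    def rank(docs, query, score):
        # Strategy 2: relative relevance -- return all documents sorted by
        # decreasing score; the user decides where to stop reading.
        return sorted(docs, key=lambda d: score(d, query), reverse=True)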

  6. Document Selection vs. Ranking [figure: document selection applies a binary classifier f(d,q) ∈ {0,1} to produce an unordered set R’(q) that only approximates the true R(q) — it includes some non-relevant documents and misses some relevant ones; document ranking applies a scoring function f(d,q) to order the documents (e.g., d1 = 0.98 (+), d2 = 0.95 (+), d3 = 0.83 (−), d4 = 0.80 (+), …, d9 = 0.21 (−)), and the user sets the threshold]

  7. Problems of Doc Selection/Boolean model [Cooper 88] • The classifier is unlikely to be accurate • “Over-constrained” query (terms are too specific): no relevant documents are found • “Under-constrained” query (terms are too general): over-delivery • It is hard to find the right position between these two extremes (hard for users to specify the constraints) • Even if the classifier is accurate, not all relevant documents are equally relevant; prioritization is needed since a user can only examine one document at a time

  8. Ranking is often preferred • A user can stop browsing anywhere, so the boundary is controlled by the user • High recall users would view more items • High precision users would view only a few • Theoretical justification: Probability Ranking Principle [Robertson 77]

  9. Probability Ranking Principle [Robertson 77] • Seek a more fundamental justification • Why is ranking based on probability of relevance reasonable? • Is there a better way of ranking documents? • What is the optimal way of ranking documents? • Theoretical justification for ranking (Probability Ranking Principle): returning a ranked list of documents in descending order of the probability that a document is relevant to the query is the optimal strategy under the following two assumptions (do they hold?): • The utility of a document (to a user) is independent of the utility of any other document • A user browses the results sequentially

  10. Two Justifications of PRP • Optimization of traditional retrieval effectiveness measures • Given an expected level of recall, ranking based on the PRP maximizes precision • Given a fixed rank cutoff, ranking based on the PRP maximizes precision and recall • Optimal decision making • Regardless of the tradeoff (e.g., favoring high precision vs. high recall), ranking based on the PRP optimizes the expected utility of a binary (independent) retrieval decision (i.e., to retrieve or not to retrieve a document) • Intuition: if a user sequentially examines one doc at a time, we’d like the user to see the very best ones first

  11. According to the PRP, all we need is “a relevance measure function f” which satisfies: for all q, d1, d2, f(q,d1) > f(q,d2) iff p(Rel|q,d1) > p(Rel|q,d2). Most existing research on IR models so far has fallen into this line of thinking…. (Limitations?)

  12. Modeling Relevance: Roadmap for Retrieval Models [figure: a map of retrieval models organized by how they formalize relevance — similarity between representations Rep(q) and Rep(d): vector space model (Salton et al., 75), prob. distr. model (Wong & Yao, 89); probability of relevance P(r=1|q,d), r ∈ {0,1}: regression model (Fuhr, 89) and generative models, split into doc generation — classical prob. model (Robertson & Sparck Jones, 76) — and query generation — LM approach (Ponte & Croft, 98; Lafferty & Zhai, 01a); probabilistic inference P(d→q) or P(q→d): inference network model (Turtle & Croft, 91), prob. concept space model (Wong & Yao, 95); plus divergence from randomness (Amati & Rijsbergen, 02), learning to rank (Joachims, 02; Burges et al., 05), and relevance constraints (Fang et al., 04)]

  13. Part 2: IR Evaluation

  14. Evaluation: Two Different Reasons • Reason 1: So that we can assess how useful an IR system/technology would be (for an application) • Measures should reflect the utility to users in a real application • Usually done through user studies (interactive IR evaluation) • Reason 2: So that we can compare different systems and methods (to advance the state of the art) • Measures only need to be correlated with the utility to actual users, thus don’t have to accurately reflect the exact utility to users • Usually done through test collections (test set IR evaluation)

  15. What to Measure? • Effectiveness/Accuracy: how accurate are the search results? • Measures a system’s ability to rank relevant documents above non-relevant ones • Efficiency: how quickly can a user get the results? How much computing resources are needed to answer a query? • Measures space and time overhead • Usability: how useful is the system for real user tasks? • Assessed through user studies

  16. The Cranfield Evaluation Methodology • A methodology for laboratory testing of system components, developed in the 1960s • Idea: build reusable test collections & define measures • A sample collection of documents (simulates a real document collection) • A sample set of queries/topics (simulates user queries) • Relevance judgments (ideally made by the users who formulated the queries) → ideal ranked list • Measures to quantify how well a system’s result matches the ideal ranked list • A test collection can then be reused many times to compare different systems • This methodology is general and applicable to evaluating any empirical task

  17. Test Collection Evaluation [figure: a test collection consists of a document collection (D1, D2, D3, …, D48, …), a set of queries (Q1, Q2, Q3, …, Q50), and relevance judgments (Q1 D1 +, Q1 D2 +, Q1 D3 −, Q1 D4 −, Q1 D5 +, …, Q2 D1 −, Q2 D2 +, Q2 D3 +, Q2 D4 −, …, Q50 D1 −, Q50 D2 −, Q50 D3 +, …). For query Q1, System A returns D2 +, D1 +, D4 −, D5 + (Precision = 3/4, Recall = 3/3), while System B returns D1 +, D4 −, D3 −, D5 + (Precision = 2/4, Recall = 2/3)]

  18. Measures for evaluating a set of retrieved documents

                     Retrieved                   Not Retrieved
      Relevant       a (relevant retrieved)      b (relevant rejected)
      Not relevant   c (irrelevant retrieved)    d (irrelevant rejected)

  Precision = a / (a + c): the fraction of retrieved documents that are relevant. Recall = a / (a + b): the fraction of relevant documents that are retrieved. Ideal results: Precision = Recall = 1.0. In reality, high recall tends to be associated with low precision (why?)
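  As a quick illustration (not part of the slides), the two set-based measures can be computed directly from the contingency counts:

    def precision_recall(retrieved, relevant):
        # a = relevant retrieved; precision = a/(a+c); recall = a/(a+b)
        retrieved, relevant = set(retrieved), set(relevant)
        a = len(retrieved & relevant)
        precision = a / len(retrieved) if retrieved else 0.0
        recall = a / len(relevant) if relevant else 0.0
        return precision, recall

    # System A from the previous slide: returns D2, D1, D4, D5; D1, D2, D5 are relevant
    print(precision_recall({"D2", "D1", "D4", "D5"}, {"D1", "D2", "D5"}))  # (0.75, 1.0)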

  19. How to measure a ranking? • Compute the precision at every recall point • Plot a precision-recall (PR) curve [figure: two precision-recall curves, precision on the y-axis vs. recall on the x-axis] Which is better?

  20. Computing Precision-Recall Curve Total number of relevant documents in the collection: 10. Ranked list: D1 +, D2 +, D3 −, D4 −, D5 +, D6 −, D7 −, D8 +, D9 −, D10 −

      Rank  Doc  Rel?  Precision  Recall
      1     D1   +     1/1        1/10
      2     D2   +     2/2        2/10
      3     D3   −     2/3        2/10
      …
      5     D5   +     3/5        3/10
      …
      8     D8   +     4/8        4/10

  To plot the curve, take the precision at the recall levels 0.1, 0.2, 0.3, …, 1.0. What precision should we use at recall 10/10, given that six relevant documents were never retrieved?
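  A small sketch (illustration only) of how those precision/recall points are computed from a ranked list of binary relevance labels:

    def pr_points(ranked_rels, total_relevant):
        # ranked_rels: 1/0 relevance labels in ranked order;
        # total_relevant: number of relevant docs in the whole collection.
        points, hits = [], 0
        for k, rel in enumerate(ranked_rels, start=1):
            hits += rel
            points.append((hits / total_relevant, hits / k))  # (recall, precision)
        return points

    # Slide example: D1..D10 = + + - - + - - + - -, 10 relevant docs in total
    for recall, precision in pr_points([1, 1, 0, 0, 1, 0, 0, 1, 0, 0], 10):
        print(f"recall={recall:.1f}  precision={precision:.2f}")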

  21. How to summarize a ranking? Same example: 10 relevant documents in the collection, ranked list D1 +, D2 +, D3 −, D4 −, D5 +, D6 −, D7 −, D8 +, D9 −, D10 −, giving precision 1/1 at recall 1/10, 2/2 at 2/10, 3/5 at 3/10, and 4/8 at 4/10; the six relevant documents that were never retrieved are counted as precision 0 at recall 10/10. Average Precision = ?

  22. Summarize a Ranking: MAP • Given that n docs are retrieved • Compute the precision (at rank) where each (new) relevant document is retrieved => p(1),…,p(k), if we have k rel. docs • E.g., if the first rel. doc is at the 2nd rank, then p(1)=1/2. • If a relevant document never gets retrieved, we assume the precision corresponding to that rel. doc to be zero • Compute the average over all the relevant documents • Average precision = (p(1)+…+p(k))/k • This gives us the average precision, which captures both precision and recall and is sensitive to the rank of each relevant document • Mean Average Precision (MAP) • MAP = arithmetic mean of average precision over a set of topics • gMAP = geometric mean of average precision over a set of topics (more affected by difficult topics) • Which one should be used?
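  A sketch of these definitions in Python (illustrative only); dividing by the total number of relevant documents implements the convention that unretrieved relevant docs contribute precision 0:

    from statistics import mean, geometric_mean  # geometric_mean needs Python 3.8+

    def average_precision(ranked_rels, total_relevant):
        # Precision at each rank where a relevant doc appears, averaged over
        # ALL relevant docs (missing ones contribute 0 to the numerator).
        hits, precisions = 0, []
        for k, rel in enumerate(ranked_rels, start=1):
            if rel:
                hits += 1
                precisions.append(hits / k)
        return sum(precisions) / total_relevant if total_relevant else 0.0

    def map_score(aps):
        return mean(aps)             # MAP: arithmetic mean over topics

    def gmap_score(aps):
        return geometric_mean(aps)   # gMAP: dominated by the hardest topics (fails if any AP is 0)

    # Slide 21 example: (1/1 + 2/2 + 3/5 + 4/8) / 10 = 0.31
    print(average_precision([1, 1, 0, 0, 1, 0, 0, 1, 0, 0], 10))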

  23. What if we have multi-level relevance judgments? Relevance levels: r = 1 (non-relevant), 2 (marginally relevant), 3 (very relevant)

      Doc   Gain  Cumulative Gain  Discounted Cumulative Gain
      D1    3     3                3
      D2    2     3+2              3 + 2/log 2
      D3    1     3+2+1            3 + 2/log 2 + 1/log 3
      D4    1     3+2+1+1          …
      D5    3     …
      D6    1
      D7    1
      D8    2
      D9    1
      D10   1

  DCG@10 = 3 + 2/log 2 + 1/log 3 + … + 1/log 10
  IdealDCG@10 = 3 + 3/log 2 + 3/log 3 + … + 3/log 9 + 2/log 10 (assuming there are 9 documents rated “3” in total in the collection)
  Normalized DCG@10 = DCG@10 / IdealDCG@10 (see the next slide)

  24. Summarize a Ranking: NDCG • What if relevance judgments are on a scale of [1, r], with r > 2? • Cumulative Gain (CG) at rank n • Let the ratings of the n documents be r1, r2, …, rn (in ranked order) • CG = r1 + r2 + … + rn • Discounted Cumulative Gain (DCG) at rank n • DCG = r1 + r2/log2(2) + r3/log2(3) + … + rn/log2(n) • We may use any base b for the logarithm • For rank positions up to b, do not discount • Normalized Discounted Cumulative Gain (NDCG) at rank n • Normalize the DCG at rank n by the DCG value at rank n of the ideal ranking • The ideal ranking would first return the documents with the highest relevance level, then the next highest relevance level, etc.
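  A small sketch of DCG/NDCG with a base-2 logarithm, following the formula above (illustration only):

    import math

    def dcg(gains):
        # No discount at rank 1 (log2(1) = 0); rank i >= 2 is divided by log2(i).
        return sum(g if i == 1 else g / math.log2(i)
                   for i, g in enumerate(gains, start=1))

    def ndcg(gains, ideal_gains):
        # Normalize by the DCG of the ideal ranking, truncated to the same depth.
        return dcg(gains) / dcg(ideal_gains[:len(gains)])

    # Slide 23 example: observed gains for D1..D10; ideal list = nine 3's and one 2
    observed = [3, 2, 1, 1, 3, 1, 1, 2, 1, 1]
    ideal = [3] * 9 + [2]
    print(ndcg(observed, ideal))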

  25. Other Measures • Precision at k documents (e.g., prec@10docs): • easier to interpret than MAP (why?) • also called breakeven precision when k equals the number of relevant documents • Mean Reciprocal Rank (MRR): • Reciprocal Rank = 1/rank-of-the-relevant-doc • same as MAP when there is only 1 relevant document • F-Measure: the weighted harmonic mean of precision P and recall R, F_β = (1 + β²)·P·R / (β²·P + R), where β is a parameter (often set to 1, giving F1 = 2PR/(P + R))
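  These measures are equally easy to compute; a sketch (illustration only):

    def precision_at_k(ranked_rels, k=10):
        return sum(ranked_rels[:k]) / k

    def reciprocal_rank(ranked_rels):
        # 1/rank of the first relevant document (0 if none retrieved);
        # MRR averages this value over a set of queries.
        for k, rel in enumerate(ranked_rels, start=1):
            if rel:
                return 1.0 / k
        return 0.0

    def f_measure(precision, recall, beta=1.0):
        # F_beta = (1 + beta^2) P R / (beta^2 P + R); beta = 1 gives F1.
        if precision == 0 and recall == 0:
            return 0.0
        return (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)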

  26. Challenges in creating early test collections • Challenges in obtaining documents: • Salton had students manually transcribe Time magazine articles • Not a problem now! • Challenges in distributing a collection: • TREC started when CD-ROMs became available • Not a problem now! • Challenge of scale – limited by qrels (relevance judgments) • The idea of “pooling” (Sparck Jones & Rijsbergen 75)

  27. Larger collections created in the 1980s • Commercial systems by then routinely supported searching over millions of documents → pressure on researchers to use larger collections for evaluation

  28. The Ideal Test Collection Report [Sparck Jones & Rijsbergen 75] • Introduced the idea of pooling • Have assessors judge only a pool of top-ranked documents returned by various retrieval systems • Other recommendations (the vision was later implemented in TREC): “that an ideal test collection be set up to facilitate and promote research; • that the collection be of sufficient size to constitute an adequate test bed for experiments relevant to modern IR systems… • that the collection(s) be set up by a special purpose project carried out by an experienced worker, called the Builder; • that the collection(s) be maintained in a well-designed and documented machine form and distributed to users, by a Curator; • that the curating (sic) project be encouraged to promote research via the ideal collection(s), and also via the common use of other collection(s) acquired from independent projects.”

  29. TREC (Text REtrieval Conference) • 1990: DARPA funded NIST to build a large test collection • 1991: NIST proposed to distribute the data set through TREC (leader: Donna Harman) • Nov. 1992: First TREC meeting • Goals of TREC: • create test collections for a set of retrieval tasks; • promote as widely as possible research in those tasks; • organize a conference for participating researchers to meet and disseminate their research work using TREC collections.

  30. The “TREC Vision” (mass collaboration for creating a pool) “Harman and her colleagues appear to be the first to realize that if the documents and topics of a collection were distributed for little or no cost, a large number of groups would be willing to load that data into their search systems and submit runs back to TREC to form a pool, all for no costs to TREC. TREC would use assessors to judge the pool. The effectiveness of each run would then be measured and reported back to the groups. Finally, TREC could hold a conference where an overall ranking of runs would be published and participating groups would meet to present work and interact. It was hoped that a slight competitive element would emerge between groups to produce the best possible runs for the pool.” (Sanderson 10)

  31. The TREC Ad Hoc Retrieval Task & Pooling • Simulates an information analyst (high recall) • Multi-field topic description • News documents + government documents • Relevance criterion: “a document is judged relevant if any piece of it is relevant (regardless of how small the piece is in relation to the rest of the document)” • Each submitted run returns 1000 documents, which are evaluated with various measures • The top 100 documents from each run were taken to form a pool (see the sketch below) • All the documents in the pool were judged • The unjudged documents are often assumed to be non-relevant (problem?)
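  A minimal sketch of the pooling step (an assumed data layout, not TREC's actual tooling): each run is a ranked list of document IDs for one topic, and the judging pool is the union of the top-100 documents across all submitted runs.

    def build_pool(runs, depth=100):
        # runs: iterable of ranked doc-ID lists for the same topic.
        pool = set()
        for ranked_docs in runs:
            pool.update(ranked_docs[:depth])
        return pool  # only these documents receive relevance judgments

    # Documents retrieved by some later system but missing from the pool are
    # typically treated as non-relevant -- the "problem?" the slide points at.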

  32. An example TREC topic

  33. Typical TREC Evaluation Result [figure: a precision-recall curve annotated with the standard summary numbers — Recall = 3212/4728 (out of 4728 relevant docs, the system retrieved 3212), Precision@10docs (on average about 5.5 of the top 10 docs are relevant), Breakeven Precision (precision at the point where precision = recall), and Mean Average Precision (MAP)] Worked example: the system returns 6 docs, D1 +, D2 +, D3 −, D4 −, D5 +, D6 −, and the total number of relevant docs is 4. Average Precision = (1/1 + 2/2 + 3/5 + 0)/4 — the denominator is 4, not 3 (why? the fourth relevant document was never retrieved and contributes a precision of 0)

  34. What Query Averaging Hides Slide from Doug Oard’s presentation, originally from Ellen Voorhees’ presentation

  35. Statistical Significance Tests • How sure can you be that an observed difference doesn’t simply result from the particular queries you chose?

      Experiment 1                    Experiment 2
      Query  System A  System B       Query  System A  System B
      1      0.20      0.40           1      0.02      0.76
      2      0.21      0.41           2      0.39      0.07
      3      0.22      0.42           3      0.16      0.37
      4      0.19      0.39           4      0.58      0.21
      5      0.17      0.37           5      0.04      0.02
      6      0.20      0.40           6      0.09      0.91
      7      0.21      0.41           7      0.12      0.46
      Avg    0.20      0.40           Avg    0.20      0.40

  Slide from Doug Oard

  36. Statistical Significance Testing

      Query  System A  System B  Sign Test  Wilcoxon
      1      0.02      0.76      +          +0.74
      2      0.39      0.07      −          −0.32
      3      0.16      0.37      +          +0.21
      4      0.58      0.21      −          −0.37
      5      0.04      0.02      −          −0.02
      6      0.09      0.91      +          +0.82
      7      0.12      0.46      −          −0.38
      Avg    0.20      0.40

  Sign test: p = 1.0 • Wilcoxon signed-rank test: p = 0.9375 [figure: sketch of the null distribution of differences, with “95% of outcomes” marked around 0]
  Slide from Doug Oard. Try some out at: http://www.fon.hum.uva.nl/Service/CGI-Inline/HTML/Statistics.html
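  A quick way to run both paired tests (a sketch using the per-query scores transcribed above; exact p-values depend on those inputs; assumes SciPy is installed, version 1.7+ for binomtest):

    from scipy.stats import binomtest, wilcoxon

    system_a = [0.02, 0.39, 0.16, 0.58, 0.04, 0.09, 0.12]
    system_b = [0.76, 0.07, 0.37, 0.21, 0.02, 0.91, 0.46]

    # Sign test: count queries where B beats A and test against a fair coin.
    wins_b = sum(b > a for a, b in zip(system_a, system_b))
    n = sum(a != b for a, b in zip(system_a, system_b))
    print("sign test p =", binomtest(wins_b, n, p=0.5).pvalue)

    # Wilcoxon signed-rank test also uses the magnitudes of the differences.
    print("Wilcoxon p =", wilcoxon(system_a, system_b).pvalue)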

  37. Live Labs: Involve Real Users in Evaluation • Stuff I’ve Seen [Dumais et al. 03] • Real systems deployed with hypothesis testing in mind (different interfaces + logging capability) • Search logs can then be used to analyze hypotheses about user behavior • The “A-B Test” • Initial proposal by Cutting at a panel [Lest et al. 97] • First research work published by Joachims [Joachims 03] • Great potential, but only a few follow-up studies

  38. What You Should Know • Why is retrieval problem often framed as a ranking problem? • Two assumptions of PRP • What is Cranfield evaluation methodology? • How to compute the major evaluation measures (precision, recall, precision-recall curve, MAP, gMAP, nDCG, F1, MRR, breakeven precision) • How does “pooling” work? • Why is it necessary to do statistical significance test?

  39. Open Challenges in IR Evaluation • Almost all issues are still open for research! • What are the best measures for various search tasks (especially newer tasks such as subtopic retrieval)? • What’s the best way of doing statistical significance test? • What’s the best way to adopt the pooling strategy in practice? • How can we assess the quality of a test collection? Can we create representative test sets? • New paradigms for evaluation? Open IR system for A-B test?
