Evaluation of Evaluation in Information Retrieval - Tefko Saracevic. A Historical Approach to IR Evaluation.
Saracevic’s Definition of Evaluation: Evaluation is assessing the performance or value of a system, process, procedure, product, or policy.
Evaluation Requirements • A system or a prototype • A criterion or criteria (the objectives of the system) • Measures (e.g., recall and precision) • A measuring instrument (relevance judgments by analysts/users) • A methodology (procedures, e.g., those used for TREC)
Levels of Evaluation • Engineering level: hardware and software. • Input level: contents of the system (coverage). • Processing level: questions about how inputs are processed; assessment of algorithms, techniques, and approaches.
Levels of Evaluation (cont.) • Output level: interactions with the system and the output obtained. • Use and user level: application to given tasks. • Social level: effects on research, productivity, and decision making. Economic efficiency questions can be raised at each level of analysis.
Two more classes of evaluation • End-user performance and use (Meyer & Ruiz, 1990; others summarized in Dalrymple & Roderer, 1994). • Markets, products, and services from the information industry (Rapp et al., 1990). These evaluations appear regularly in trade magazines such as Online, Online Review, and Searcher.
Output and use and user level evaluations • Fenichel (1981) • Borgman (1989) • Saracevic, Kantor, Chamis & Trivison (1990) • Haynes et al. (1990) • Fidel (1991) • Spink (1995)
Processing level approaches: “Toy Collections” • Cranfield (Cleverdon, Mills & Keen, 1966) • SMART (Salton, 1971, 1989) • TREC (Harman, 1995)
Studies conducted at the social level evaluate the impact of area-specific IR systems, e.g., the impact of MEDLINE on clinical decision making (Lindberg et al., 1993).
Criteria in IR Evaluation • Relevance as the core criterion (Kent et al., 1955). • Other criteria, such as utility and search length, did not stick. • Cranfield, SMART, and TREC all revolved around the phenomenon of relevance. • By implying use, relevance keeps evaluation from staying at the engineering level. • Relevance is a complex human process, not binary in nature, and dependent on circumstances.
Output and use and user level evaluations employ a multiplicity of criteria related to utility, success, completeness, worth, satisfaction, value, efficiency, cost, etc., with more emphasis on interaction.
Market, Business, and Industry Evaluations Similar to the use and user level: TQM (Total Quality Management), cost-effectiveness. The debate over relevance remains isolated within IR.
Isolation of studies within their levels of origin • Algorithms • Users and uses • Market products and services • Social impacts
Processing level measures of evaluation • Precision: the ratio of relevant items retrieved to the total number of items retrieved, or the probability that a retrieved item is relevant. • Recall: the ratio of relevant items retrieved to all relevant items available in a particular file, or the probability that a relevant item is retrieved.
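To make the two ratios concrete, here is a minimal sketch in Python (the names precision_recall, retrieved, and relevant are illustrative, not from the slides) that computes both measures for a single query:

def precision_recall(retrieved, relevant):
    # retrieved: set of item ids returned by the system
    # relevant:  set of item ids judged relevant for the query
    hits = len(retrieved & relevant)               # relevant items that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Example: 4 of the 6 retrieved items are relevant; the file holds 10 relevant items.
p, r = precision_recall({1, 2, 3, 4, 5, 6}, {1, 2, 3, 4, 11, 12, 13, 14, 15, 16})
print(f"precision={p:.2f} recall={r:.2f}")         # precision=0.67 recall=0.40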
Measures: use and user level • Semantic differentials • Likert scales Which measures to use? How do the measures compare? How do they affect the results? See Su, 1992.
Measuring Instruments Mainly, people are the instruments that determine the relevance of retrieved items. Who are the judges? What affects their judgments? How do they affect the results?
Methodological issues surrounding notions of validity and reliability • Collection – How are items selected? • Requests – How are they generated? • Searching – How is it conducted? • Results – How are they obtained? • Analysis – What comparisons are made? • Interpretation/Generalization – What are the conclusions? Are they warranted on the basis of the results? How generalizable are the findings?
Evaluation outside of traditional IR (e.g., digital libraries and the Internet) is limited to the software and engineering levels; systems are evaluated only on their own level. Many applications are well received there; however, at most output and use and user levels these applications are found to be frustrating, unpredictable, wasteful, expensive, trivial, unreliable, and hard to use!
Don’t throw the baby out with the bath water! • Dervin and Nilan (1986) swung to the other end of the pendulum and called for a paradigmatic shift • From system-centered to user-centered evaluations • Both user- and system-centered approaches are needed
Keep it realistic! A possible solution: the integration of all levels of evaluation for a comprehensive, “real-life” analysis.