Ch-11 Relevance Feedback and Other Query Modification Techniques


CONTENTS • Abstract • Introduction • Research in relevance feedback and query modification • Recommendations for use of relevance feedback • Some thoughts on the efficient implementation

Abstract This chapter presents a survey of relevance feedback techniques, various query modification approaches, and some guidelines for the efficient design of the relevance feedback component.

Introduction • Information retrieval systems have limited recall; users may retrieve a few relevant documents but never all relevant documents. • Where high recall is critical, users seldom have many ways to retrieve more relevant documents. • As a first choice they can "expand" their search by broadening a narrow Boolean query or by looking further down a ranked list of retrieved documents.

• A broad Boolean search pulls in too many unrelated documents, and the tail of a ranked list contains documents that mostly match the less discriminating query terms. • The limits of providing increasingly better ranked results based solely on the initial query indicate a need to modify that query to further increase performance.

Two components of relevance feedback have evolved in research: • First, extensive work has been done on reweighting query terms based on the distribution of these terms in the relevant and non relevant documents retrieved in response to those queries. This work forms the basis of the probabilistic model for ranking. • A second component of relevance feedback or query modification is based on changing the actual terms in the query.

RESEARCH IN RELEVANCE FEEDBACK AND QUERY MODIFICATION Early Research Relevance feedback was the subject of much experimentation in the early SMART system. These experiments in query modification combined term reweighting and query expansion. Based on what is now known as the vector space model, the modified query is defined by:

$$Q_1 = Q_0 + \frac{1}{n_1}\sum_{i=1}^{n_1} R_i - \frac{1}{n_2}\sum_{i=1}^{n_2} S_i$$

where Q1 = the vector for the modified query, Q0 = the vector for the initial query, Ri = the vector for relevant document i, Si = the vector for non relevant document i, n1 = the number of relevant documents, and n2 = the number of non relevant documents.
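As a rough illustration of this kind of vector feedback, the sketch below assumes the usual form of the formula, Q1 = Q0 + (1/n1) Σ Ri − (1/n2) Σ Si, with vectors stored as term-to-weight dictionaries. The function name `modify_query`, the sparse representation, and the choice to drop non-positive weights are assumptions made for the example, not details taken from the SMART experiments.

```python
from collections import defaultdict


def modify_query(q0, relevant, nonrelevant):
    """Return Q1 = Q0 + mean(relevant vectors) - mean(non relevant vectors).

    Each vector is a dict mapping term -> weight (a sparse representation).
    """
    q1 = defaultdict(float)
    for term, weight in q0.items():
        q1[term] += weight

    # Add the averaged relevant vectors, subtract the averaged non relevant ones.
    for docs, sign in ((relevant, 1.0), (nonrelevant, -1.0)):
        if not docs:
            continue  # nothing judged in this group, so no contribution
        for doc in docs:
            for term, weight in doc.items():
                q1[term] += sign * weight / len(docs)

    # Assumption: terms whose weight ends up non-positive are dropped.
    return {term: w for term, w in q1.items() if w > 0}


q0 = {"a": 1.0}
relevant = [{"a": 1.0, "b": 2.0}, {"b": 2.0}]
nonrelevant = [{"c": 1.0}]
q1 = modify_query(q0, relevant, nonrelevant)
```

Note that a single vector addition both reweights the original term ("a") and expands the query with a new term ("b"), which is why this approach combines the two components of feedback.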

Evaluation of Relevance Feedback • Standard evaluation in information retrieval averages recall-precision figures over individual queries and compares this averaged performance across different retrieval techniques. • When this method is applied to relevance feedback, a significant part of the measured improvement results from the relevant documents used to reweight the query terms moving to higher ranks. • The probabilistic weighting schemes provide a useful method for relevance feedback, especially in the area of term reweighting.
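To make probabilistic term reweighting concrete, the sketch below computes a relevance weight in the style of Robertson and Sparck Jones, with 0.5 added to each count to avoid division by zero. The slides do not give a specific formula, so this particular form, the function name, and the parameter names are assumptions chosen for illustration.

```python
import math


def relevance_weight(r, R, n, N):
    """Log-odds style term weight from relevance judgments.

    r: number of known relevant documents containing the term
    R: number of known relevant documents
    n: number of documents in the collection containing the term
    N: number of documents in the collection
    """
    # Odds of the term appearing in a relevant document.
    odds_relevant = (r + 0.5) / (R - r + 0.5)
    # Odds of the term appearing in a non relevant document.
    odds_nonrelevant = (n - r + 0.5) / (N - n - R + r + 0.5)
    return math.log(odds_relevant / odds_nonrelevant)
```

A term that appears in most of the judged relevant documents but in few documents overall receives a high weight; a term that is common in the collection but absent from the relevant documents receives a negative one.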

Query Expansion without Term Reweighting • If a query has retrieved no relevant documents, it becomes critical to expand the query. • The early SMART experiments both expanded the query and reweighted the query terms by adding the vectors of the relevant and non relevant documents. • Query expansion should be done using a thesaurus that adds synonyms, broader terms, and other appropriate words. • Many attempts have been made to automatically create one. Most of these involve term-term associations or clustering techniques.

• Term-term clustering using the SMART system caused little overall improvement, and that improvement was only in the precision of the search rather than in recall. • Harman (1988) used a variation of term-term association, adding only the top connected terms and no high-frequency terms. • Improvements were only 8.7 percent using the automatically indexed Cranfield collection. By comparison, applying a similar user filtering process with a selection of terms from relevant documents provided an improvement of over 16 percent, almost twice that for term-term clustering techniques. • Dumais (1990) used an elaborate factor analysis method called latent semantic indexing to expand queries.
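The idea of expanding a query with the top connected terms while leaving out high-frequency terms can be sketched as follows. This is a simplified illustration, not the method used in any of the cited studies: the association measure (raw document co-occurrence counts), the function names, and the `top_k` and `max_df_ratio` parameters are all assumptions.

```python
from collections import Counter
from itertools import combinations


def cooccurrence(docs):
    """Count, for each pair of terms, how many documents contain both."""
    pairs = Counter()
    for terms in docs:
        for a, b in combinations(sorted(set(terms)), 2):
            pairs[(a, b)] += 1
    return pairs


def expand(query, docs, top_k=2, max_df_ratio=0.5):
    """Append the top_k terms most strongly associated with the query terms.

    Terms occurring in more than max_df_ratio of the documents are skipped
    as high-frequency terms.
    """
    doc_freq = Counter(t for terms in docs for t in set(terms))
    scores = Counter()
    for (a, b), count in cooccurrence(docs).items():
        for q, other in ((a, b), (b, a)):
            if q in query and other not in query:
                if doc_freq[other] / len(docs) <= max_df_ratio:
                    scores[other] += count
    # Sort by descending association, breaking ties alphabetically.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return list(query) + [term for term, _ in ranked[:top_k]]


docs = [
    ["wing", "flutter", "the"],
    ["wing", "flutter", "the"],
    ["wing", "drag", "the"],
    ["heat", "the"],
]
expanded = expand(["wing"], docs, top_k=1)
```

Because the added terms come from documents that already contain the query terms, an expansion of this kind tends to re-rank the same documents rather than find new ones, which is consistent with the limited recall gains reported above.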

Query Expansion with Term Reweighting • The early SMART experiments (see section 11.2.1) added term vectors to effectively reweight the query terms and to expand the query. • One investigation used all relevant document information for weighting and expanded the query by adding all terms directly connected to a query term by a maximum spanning tree (MST) technique of term-term clustering. • Using only 10 or 20 documents for feedback, the EMIM reweighting with expansion still showed significant improvements. • The Cranfield 1400 collection, with only titles for documents, was used as a test bed.

• In addition to the expansion using the MST, Harper tried expanding queries using a selection of terms from retrieved relevant documents. • He selected these terms by ranking a union of all terms in the retrieved relevant documents using the EMIM measure, and then selecting a given number from the top of this list. He found significant performance improvements. • Wu and Salton (1981) experimented with term relevance weighting (precision weighting), a method for reweighting using relevance feedback. They tried reweighting and expanding the query by all the terms from relevant documents.

• They found a 27 percent improvement in average precision for the small (424 document) Cranfield collection using reweighting alone, with an increase up to 32.7 percent when query expansion was added to reweighting.
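The selection step described above (rank the union of terms from the retrieved relevant documents, then take a given number from the top) can be sketched as below. The EMIM measure itself is not defined in these slides, so the sketch accepts any scoring function and, by default, uses a simple stand-in: the number of relevant documents that contain the term. The function name and that default score are assumptions.

```python
from collections import Counter


def select_expansion_terms(query, relevant_docs, k, score=None):
    """Rank all terms found in the relevant documents and return the top k
    that are not already in the query."""
    doc_counts = Counter(t for doc in relevant_docs for t in set(doc))
    if score is None:
        # Stand-in score (an assumption, not EMIM): how many relevant
        # documents contain the term.
        def score(term):
            return doc_counts[term]

    candidates = [t for t in doc_counts if t not in query]
    # Highest score first; ties broken alphabetically for a stable result.
    candidates.sort(key=lambda t: (-score(t), t))
    return candidates[:k]


relevant_docs = [
    ["boundary", "layer", "flow"],
    ["boundary", "layer", "heat"],
    ["layer", "shock"],
]
new_terms = select_expansion_terms({"flow"}, relevant_docs, k=2)
```

Passing a different `score` function is the intended way to plug in a measure such as EMIM or a probabilistic relevance weight.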

RECOMMENDATIONS FOR USE OF RELEVANCE FEEDBACK Relevance Feedback Methodologies • Often users are only interested in an "answer" to their question (such as a paragraph in an online manual), or in a single good bibliographic reference for introduction to a new area of interest. • Three basic retrieval systems are being addressed here: • Boolean-based systems • systems based on ranking using a vector space model • systems based on ranking using either an ad hoc combination of term-weighting schemes or the probabilistic indexing methods.

In terms of data characteristics, two characteristics have been found to be important experimentally: • the length of the documents (short or not short), and • the type of indexing (controlled or full text).

SOME THOUGHTS ON THE EFFICIENT IMPLEMENTATION OF RELEVANCE FEEDBACK OR QUERY MODIFICATION • Because relevance feedback is little used in operational systems, little work has been done on efficient implementations, for large data sets, of the recommended feedback algorithms discussed here. • The first part of the section lists the data and data structure needs for relevance feedback, with discussions of alternative methods of meeting these needs. • The second part of the section contains a proposal expanding the basic ranking system.

Data and Structure Requirements for Relevance Feedback and Query Modification • The major data needed by relevance feedback and other query modification techniques is a list of the terms contained in each retrieved document. • For small collections, lists of the terms within each document can be kept. • For larger data sets, an alternative method would be to parse the retrieved documents in the background while users are looking at the document titles. • If some type of automatic thesaurus method is to be used in query expansion, then additional storage may be needed.
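The two alternatives above (keeping per-document term lists, or parsing retrieved documents in the background) might look roughly like this. The class and function names, the whitespace tokenizer, and the use of a thread pool are assumptions made for this sketch; a real system would use its own parser and storage.

```python
from concurrent.futures import ThreadPoolExecutor


def tokenize(text):
    """Very simple parser (an assumption): lowercase, split on whitespace."""
    return sorted(set(text.lower().split()))


class TermListStore:
    """Small collections: keep the term list of every document."""

    def __init__(self):
        self._terms = {}

    def add(self, doc_id, text):
        self._terms[doc_id] = tokenize(text)

    def terms_for(self, doc_ids):
        """Return doc_id -> term list for the retrieved documents."""
        return {d: self._terms.get(d, []) for d in doc_ids}


def parse_in_background(executor, retrieved):
    """Larger data sets: parse only the retrieved documents, off the main
    thread, while the user reads titles.

    retrieved: dict mapping doc_id -> document text.
    Returns a Future that resolves to doc_id -> term list.
    """
    return executor.submit(
        lambda: {d: tokenize(text) for d, text in retrieved.items()}
    )


store = TermListStore()
store.add(1, "Wing flutter wing")
with ThreadPoolExecutor(max_workers=1) as pool:
    pending = parse_in_background(pool, {7: "heat transfer"})
    term_lists = pending.result()  # blocks only if parsing is unfinished
```

The stored lists trade disk or memory space for immediate availability, while background parsing trades some CPU time per search for no extra storage.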

A Proposal for an Efficient Implementation of Relevance Feedback This proposal is based on the implementation of the ranking system. Two further modifications are necessary to the basic searching routine to allow feedback. First, the weights stored in the postings must be only the normalized document frequency weights. The second modification is not strictly necessary but probably would ensure adequate response time. The basic retrieval algorithm is time-dependent on the number of query terms.
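One way to read the first modification: if the postings hold only a per-document normalized frequency weight, the query-side weight of each term can be supplied at search time and changed after feedback without rebuilding the index. The sketch below illustrates that reading with an accumulator-based ranker; the postings layout, the function name, and the example weights are assumptions rather than the proposal's actual design.

```python
from collections import defaultdict


def rank(postings, query_weights):
    """Score documents by summing query weight * stored document weight.

    postings: term -> list of (doc_id, normalized frequency weight)
    query_weights: term -> weight chosen at search time, for example an
        initial weight or one recomputed from relevance feedback
    """
    accumulators = defaultdict(float)
    # One pass over the postings of each query term, so the running time
    # grows with the number of query terms.
    for term, query_weight in query_weights.items():
        for doc_id, doc_weight in postings.get(term, []):
            accumulators[doc_id] += query_weight * doc_weight
    # Highest score first; ties broken by document id.
    return sorted(accumulators.items(), key=lambda kv: (-kv[1], kv[0]))


postings = {
    "wing": [(1, 0.5), (2, 0.25)],
    "flutter": [(2, 0.5)],
}
initial = rank(postings, {"wing": 1.0})
after_feedback = rank(postings, {"wing": 1.0, "flutter": 2.0})
```

Since each added query term costs another pass over a postings list, an expanded query is slower to run, which suggests why limiting or otherwise handling the extra terms matters for response time.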

