CS 430: Information Discovery

CS 430: Information Discovery Lecture 21 Interactive Retrieval

Course Administration Wireless laptop experiment During the semester, we have been logging URLs used via the nomad proxy server. Working with the HCI Group, we would like to analyze these URLs to study students' patterns of use of online information. The analysis will be completely anonymous. This requires your consent. If you have not signed a consent form, we have forms here for your signature. If you do not sign a consent form, the data will be discarded without being looked at.

The Human in the Loop [Figure: the user in the retrieval loop. Labels: Search index, Return hits, Browse repository, Return objects.]

Query Refinement [Flowchart. Box labels: Query formulation and search; Display number of hits; Reformulate query or display new query; Display retrieved information; Decide next step. Branch labels: no hits; reformulate query; new query.]

Reformulation of Query Manual • Add or remove search terms • Change Boolean operators • Change wild cards Automatic • Remove search terms • Change weighting of search terms • Add new search terms

Query Reformulation: Vocabulary Tools Feedback • Information about stop lists, stemming, etc. • Numbers of hits on each term or phrase Suggestions • Thesaurus • Browse lists of terms in the inverted index • Controlled vocabulary
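
The "numbers of hits on each term" feedback can be sketched as a lookup in an inverted index. This is a minimal illustration only: the toy documents, the whitespace tokenizer, and the function names `build_inverted_index` and `hits_per_term` are assumptions, and no stop list or stemming is applied.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Assumption: lower-casing and whitespace splitting stand in for real tokenization.
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def hits_per_term(index, query_terms):
    """Return the number of documents matching each query term."""
    return {term: len(index.get(term, ())) for term in query_terms}

# Hypothetical three-document collection.
docs = {
    1: "query refinement and relevance feedback",
    2: "relevance feedback with document vectors",
    3: "browsing a thesaurus",
}
index = build_inverted_index(docs)
counts = hits_per_term(index, ["relevance", "thesaurus", "tilebars"])
print(counts)
```

A term with zero hits is a signal to the user to drop it or replace it with a term suggested by a thesaurus or controlled vocabulary.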

Query Reformulation: Document Tools Feedback to user consists of document excerpts or surrogates • Shows the user how the system has interpreted the query Effective at suggesting how to restrict a search • Shows examples of false hits Less good at suggesting how to expand a search • No examples of missed items

Example: Tilebars The figure represents a set of hits from a text search. Each large rectangle represents a document or section of text. Each row represents a search term or subquery. The density of each small square indicates the frequency with which a term appears in a section of a document. Hearst 1995
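
The data behind such a display can be sketched as a grid of term counts, one row per search term and one column per section. This is an illustrative approximation, not Hearst's implementation: the function names, the whitespace tokenizer, and the text shading characters are all assumptions.

```python
def tilebar_grid(sections, terms):
    """Count occurrences of each term in each section.

    Returns one row per term; each row has one count per section.
    """
    tokenized = [section.lower().split() for section in sections]
    return [[tokens.count(term) for tokens in tokenized] for term in terms]

def render(grid, shades=" .:#"):
    """Draw the grid as text; denser characters mean higher counts."""
    lines = []
    for row in grid:
        # Counts above the darkest shade are clamped to the last character.
        lines.append("".join(shades[min(count, len(shades) - 1)] for count in row))
    return "\n".join(lines)

# Hypothetical document split into three sections.
sections = [
    "relevance feedback improves relevance",
    "query terms",
    "feedback on the query",
]
grid = tilebar_grid(sections, ["relevance", "query"])
print(render(grid))
```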

Document Vectors as Points on a Surface • Normalize all document vectors to be of length 1 • Then the ends of the vectors all lie on a surface with unit radius • For similar documents, we can represent parts of this surface as a flat region • Similar documents are represented as points that are close together on this surface From Lecture 9
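
The normalization step can be sketched as follows. Once vectors have length 1, their dot product equals the cosine of the angle between them, so values near 1 correspond to points that are close together on the surface. The three-term vectors and the helper names are assumptions made for illustration.

```python
import math

def normalize(vec):
    """Scale a term-weight vector to Euclidean length 1."""
    norm = math.sqrt(sum(w * w for w in vec))
    if norm == 0:
        return list(vec)  # an all-zero vector cannot be normalized
    return [w / norm for w in vec]

def dot(a, b):
    """Dot product; for unit vectors this is the cosine similarity."""
    return sum(x * y for x, y in zip(a, b))

# Hypothetical term-weight vectors over a three-term vocabulary.
d1 = normalize([3.0, 4.0, 0.0])
d2 = normalize([6.0, 8.0, 0.0])  # same direction as d1, different length
d3 = normalize([0.0, 0.0, 5.0])  # shares no terms with d1

print(dot(d1, d2))  # close to 1: the two points coincide on the surface
print(dot(d1, d3))  # 0: the vectors are orthogonal
```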

Theoretically Best Query  optimal query o x x x o o x x x x x x x o x x x o x o x x x x x non-relevant documents o relevant documents

Theoretically Best Query For a specific query, $Q$, let: $D_R$ be the set of all relevant documents; $D_{N-R}$ be the set of all non-relevant documents; $\mathrm{sim}(Q, D_R)$ be the mean similarity between query $Q$ and documents in $D_R$; $\mathrm{sim}(Q, D_{N-R})$ be the mean similarity between query $Q$ and documents in $D_{N-R}$. The theoretically best query would maximize: $F = \mathrm{sim}(Q, D_R) - \mathrm{sim}(Q, D_{N-R})$
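
As a rough sketch, F can be computed directly when the relevant and non-relevant sets are known. The slide does not fix a similarity measure; cosine similarity is assumed here, and the function names and two-term vectors are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors (assumed similarity measure)."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0 or nb == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)

def mean_similarity(query, docs):
    """Mean similarity between the query and a set of document vectors."""
    if not docs:
        return 0.0
    return sum(cosine(query, d) for d in docs) / len(docs)

def separation(query, relevant, non_relevant):
    """F = mean similarity to relevant docs minus mean similarity to non-relevant docs."""
    return mean_similarity(query, relevant) - mean_similarity(query, non_relevant)

# Hypothetical two-term collection with one document in each set.
relevant = [[1.0, 0.0]]
non_relevant = [[0.0, 1.0]]

print(separation([1.0, 0.0], relevant, non_relevant))  # query aligned with the relevant document
print(separation([1.0, 1.0], relevant, non_relevant))  # query equally close to both sets
```

The first query scores higher because it points toward the relevant document and away from the non-relevant one; the second cannot distinguish the two sets.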

Estimating the Best Query In practice, $D_R$ and $D_{N-R}$ are not known. (The objective is to find them.) However, the results of an initial query can be used to estimate $\mathrm{sim}(Q, D_R)$ and $\mathrm{sim}(Q, D_{N-R})$.

Relevance Feedback (concept) [Figure: hits from the original search plotted as points, with the original query and the reformulated query marked. x = documents identified as non-relevant, o = documents identified as relevant.] From Lecture 9

Rocchio's Modified Query Modified query vector = Original query vector + Mean of relevant documents found by original query - Mean of non-relevant documents found by original query

Query Modification $Q_1 = Q_0 + \frac{1}{n_1} \sum_{i=1}^{n_1} R_i - \frac{1}{n_2} \sum_{i=1}^{n_2} S_i$ where $Q_0$ = vector for the initial query, $Q_1$ = vector for the modified query, $R_i$ = vector for relevant document $i$, $S_i$ = vector for non-relevant document $i$, $n_1$ = number of relevant documents, $n_2$ = number of non-relevant documents. Rocchio 1971
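
The query modification can be sketched in a few lines: add the mean of the relevant vectors to the initial query and subtract the mean of the non-relevant vectors. The three-term vectors and the function names are assumptions made for illustration.

```python
def mean_vector(vectors, dim):
    """Component-wise mean of a list of vectors; zero vector if the list is empty."""
    if not vectors:
        return [0.0] * dim
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]

def rocchio(q0, relevant, non_relevant):
    """Modified query: q0 + mean(relevant) - mean(non_relevant)."""
    dim = len(q0)
    r_mean = mean_vector(relevant, dim)
    s_mean = mean_vector(non_relevant, dim)
    return [q0[k] + r_mean[k] - s_mean[k] for k in range(dim)]

# Hypothetical vectors over a three-term vocabulary.
q0 = [1.0, 0.0, 0.0]
relevant = [[1.0, 1.0, 0.0], [1.0, 0.0, 0.0]]  # n1 = 2
non_relevant = [[0.0, 0.0, 2.0]]               # n2 = 1

q1 = rocchio(q0, relevant, non_relevant)
print(q1)
```

In this example the second term gains weight because it appears in a relevant document, and the third term receives a negative weight because it appears only in the non-relevant document.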

Difficulties with Relevance Feedback [Figure: documents plotted as points, with the optimal query, the original query, and the reformulated query marked. Hits from the initial query are contained in the gray shaded area. x = non-relevant documents, o = relevant documents.]

Effectiveness of Relevance Feedback Best when: • Relevant documents are tightly clustered (similarities are large) • Similarities between relevant and non-relevant documents are small

Positive and Negative Feedback $Q_1 = \alpha Q_0 + \frac{\beta}{n_1} \sum_{i=1}^{n_1} R_i - \frac{\gamma}{n_2} \sum_{i=1}^{n_2} S_i$ where $\alpha$, $\beta$ and $\gamma$ are weights that adjust the importance of the three vectors. If $\gamma = 0$, the weights provide positive feedback, by emphasizing the relevant documents in the initial set. If $\beta = 0$, the weights provide negative feedback, by reducing the emphasis on the non-relevant documents in the initial set.
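
A sketch of the weighted form follows, with three weights named `alpha`, `beta` and `gamma` applied to the initial query, the mean of the relevant documents, and the mean of the non-relevant documents respectively. The default weights, the example vectors, and the function names are assumptions, not values from the lecture.

```python
def mean_vector(vectors, dim):
    """Component-wise mean of a list of vectors; zero vector if the list is empty."""
    if not vectors:
        return [0.0] * dim
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]

def weighted_rocchio(q0, relevant, non_relevant, alpha=1.0, beta=1.0, gamma=1.0):
    """alpha * q0 + beta * mean(relevant) - gamma * mean(non_relevant)."""
    dim = len(q0)
    r_mean = mean_vector(relevant, dim)
    s_mean = mean_vector(non_relevant, dim)
    return [alpha * q0[k] + beta * r_mean[k] - gamma * s_mean[k] for k in range(dim)]

# Hypothetical vectors over a three-term vocabulary.
q0 = [1.0, 0.0, 0.0]
relevant = [[1.0, 1.0, 0.0], [1.0, 0.0, 0.0]]
non_relevant = [[0.0, 0.0, 2.0]]

# Positive feedback only: the non-relevant documents are ignored.
positive = weighted_rocchio(q0, relevant, non_relevant, gamma=0.0)
# Negative feedback only: the relevant documents are ignored.
negative = weighted_rocchio(q0, relevant, non_relevant, beta=0.0)
# Illustrative mixed weighting (values chosen arbitrarily).
mixed = weighted_rocchio(q0, relevant, non_relevant, alpha=1.0, beta=0.5, gamma=0.25)

print(positive, negative, mixed)
```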

When to Use Relevance Feedback Relevance feedback is most important when the user wishes to increase recall, i.e., it is important to find all relevant documents. Under these circumstances, users can be expected to put effort into searching: • Formulate queries thoughtfully with many terms • Review results carefully to provide feedback • Iterate several times • Combine automatic query enhancement with studies of thesauruses and other manual enhancements
