The Information Retrieval Problem

• The IR problem is very hard
• Why? Many reasons, including:
  • Documents are not (very) structured
    • Database searches vs document base searches
  • Language is not (very) cooperative
    • DNA: microbiology or DEC's Digital Network Architecture?
    • Free rider: game theory or urban transportation systems?
    • Corporate memory or organizational memory?
  • Physical access vs logical access
    • Physical: relatively easy
    • Logical: terribly difficult

Information Retrieval

The Information Retrieval Problem
• Kinds of information searches
  • Framework from David Blair
  • Search exhaustivity makes it difficult to determine whether all relevant documents were retrieved
  • Database size as a framework for text retrieval systems (greater than 250,000 pages of text)
• Distinctions
  • Large vs small (document) databases
  • Exhaustive vs sample searches
  • Content vs context searches: "Blair and Maron 1985" vs "left-hand side of the page in the middle of a red book"

The Information Retrieval Problem: Basic IR Technology
• Your basic IR technology
  • Full text or keyword retrieval, with
  • Boolean combinations and
  • Location indicators
• Full text: has everything
  • Or does it?
• Keyword indexing
  • Requires work
• Boolean combination of words
  • Usual Boolean operators: AND, OR, NOT
  • This is a logically complete set
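
The Boolean combinations on this slide can be sketched with an inverted index, where each operator becomes a set operation. This is a minimal illustration, not any particular system's implementation: the three-document collection, the document ids, and the helper names (`build_index`, `term`, `AND`, `OR`, `NOT`) are all made up for the example.

```python
from collections import defaultdict

# Hypothetical toy collection; ids and texts are invented for illustration.
DOCS = {
    1: "dna sequencing in microbiology",
    2: "dec network architecture dna overview",
    3: "free rider problem in game theory",
}


def build_index(docs):
    """Map each word to the set of document ids whose full text contains it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


INDEX = build_index(DOCS)
ALL_IDS = set(DOCS)


def term(word):
    """Documents containing the word (empty set if the word is not indexed)."""
    return INDEX.get(word.lower(), set())


def AND(a, b):
    return a & b


def OR(a, b):
    return a | b


def NOT(a):
    # NOT is taken relative to the whole collection.
    return ALL_IDS - a


# "dna AND NOT network": keeps the microbiology document, drops the networking one.
dna_not_network = AND(term("dna"), NOT(term("network")))
print(sorted(dna_not_network))  # [1]
```

The query only separates the two senses of "DNA" because the searcher guessed that "network" appears in the unwanted document, which is the kind of anticipation the slides describe as hard.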

Web Search Engines - Indexing retrieval algorithms
• Manual indexing along common themes (www.yahoo.com)
• Weight each word numerically (eliminate common words such as of, that, and, etc.)
  • Some weight words in the <head> section or in the URL higher
  • Some weight the first word in the query higher than the second, and so on
• Retrieve all documents that match the query (typically a Boolean query)
• Count frequency of word occurrences (The Stroud Corporation example: publishers “game” the indexing algorithm)
• Add up word weights for each document, reflecting the word frequency
• Search engines do not index words in graphics (gif and jpg files)
• Infoseek, Lycos and Yahoo offer multilingual queries
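
The weighting steps above can be sketched roughly as follows. This is an assumption-laden toy, not the algorithm of any named search engine: the stop-word list, the 1, 1/2, 1/3 position weights, the page names, and the page texts are all invented for illustration.

```python
from collections import Counter

# Assumed stop-word list; real engines use their own.
STOP_WORDS = {"of", "that", "and", "the", "a", "in"}

# Hypothetical pages, invented for the example.
PAGES = {
    "page-a": "the game of chess and the game of go",
    "page-b": "chess openings that win the game",
    "page-c": "history of the board game",
}


def tokenize(text):
    """Lowercase, split on whitespace, and drop common words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def score(query, pages):
    """Rank pages that contain every query word by weighted word frequency."""
    terms = tokenize(query)
    # Assumed scheme: earlier query words weigh more (1, 1/2, 1/3, ...).
    weights = {}
    for position, word in enumerate(terms):
        weights.setdefault(word, 1.0 / (position + 1))
    results = {}
    for name, text in pages.items():
        counts = Counter(tokenize(text))
        # Boolean AND match: every query word must occur in the page.
        if all(counts[word] > 0 for word in terms):
            results[name] = sum(weights[w] * counts[w] for w in weights)
    return sorted(results.items(), key=lambda item: item[1], reverse=True)


ranking = score("game of chess", PAGES)
print(ranking)  # [('page-a', 2.5), ('page-b', 1.5)]
```

Because the score grows with raw word counts, a page that simply repeats a query word more often ranks higher, which is the weakness that lets publishers game a frequency-based index.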

Web Search Engines - Metasearches
Advantages
• Query is sent to multiple search engines simultaneously
• Results are grouped, aggregated, and sorted with duplicates removed
• Often adds new metatitles to help categorize the sites
Disadvantages
• Returns much less information about each site
• Omits unique sites only found by particular nuances of a particular query engine
• It is very difficult to formulate complex queries
Examples
• www.inference.com
• www.web-search.com/savvy.html/
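
The aggregation step (group, sort, remove duplicates) might look something like the sketch below. The engine names, the example.com URLs, the trailing-slash normalization, and the ranking rule (more engines first, then best rank) are all assumptions made for illustration; the slides do not say how any real metasearch engine merges results.

```python
def normalize(url):
    """Crude duplicate detection: treat URLs differing only by a trailing slash as equal."""
    return url.rstrip("/")


def merge_results(results_by_engine):
    """Combine ranked URL lists from several engines into one deduplicated list."""
    merged = {}
    for engine, urls in results_by_engine.items():
        for rank, url in enumerate(urls, start=1):
            key = normalize(url)
            entry = merged.setdefault(
                key, {"url": key, "engines": [], "best_rank": rank}
            )
            if engine not in entry["engines"]:
                entry["engines"].append(engine)
            entry["best_rank"] = min(entry["best_rank"], rank)
    # Assumed ordering: URLs found by more engines first, then by best rank.
    return sorted(
        merged.values(),
        key=lambda e: (-len(e["engines"]), e["best_rank"], e["url"]),
    )


# Hypothetical result lists from two made-up engines.
RESULTS = {
    "engine_a": ["http://example.com/ir", "http://example.com/boolean/"],
    "engine_b": ["http://example.com/boolean", "http://example.org/recall"],
}

merged = merge_results(RESULTS)
for entry in merged:
    print(entry["url"], entry["engines"])
```

Note that the merged entries carry only a URL and the engines that returned it, which mirrors the disadvantage listed above: less information about each site survives aggregation.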

The Information Retrieval Problem: Probability of Retrieving a Relevant Document
P(word1) = .6 probability searcher uses word1 in a query
P(word2) = .5 probability searcher uses word2 in a query
P(Doc_word1) = .7 probability word1 is in relevant document
P(Doc_word2) = .6 probability word2 is in relevant document
• The probability of the searcher using word1 in a query and word1 being in a relevant document is
  P(word1) x P(Doc_word1) = .6 x .7 = .42
• The probability of the searcher using word2 in a query and word2 being in a relevant document is
  P(word2) x P(Doc_word2) = .5 x .6 = .30
• The probability of the searcher using word1 and word2 in a query and both word1 and word2 being in a relevant document is
  P(word1) x P(Doc_word1) x P(word2) x P(Doc_word2) = .6 x .7 x .5 x .6 = .126
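
The arithmetic on this slide can be checked with a few lines of code. The variable names are invented for the example, and multiplying the probabilities, as the slide does, implicitly treats the events as independent.

```python
# Values taken from the slide.
p_word1 = 0.6       # searcher uses word1 in a query
p_word2 = 0.5       # searcher uses word2 in a query
p_doc_word1 = 0.7   # word1 is in a relevant document
p_doc_word2 = 0.6   # word2 is in a relevant document

# Single-word matches (independence assumed, as on the slide).
p_match1 = p_word1 * p_doc_word1
p_match2 = p_word2 * p_doc_word2

# Both words used in the query and both present in the relevant document.
p_match_both = p_match1 * p_match2

print(f"{p_match1:.3f}")      # 0.420
print(f"{p_match2:.3f}")      # 0.300
print(f"{p_match_both:.3f}")  # 0.126
```

Each added query word multiplies in another factor below 1, so requiring more words to match makes retrieving a given relevant document less likely.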

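The Information Retrieval Problem: Basic IR Technology
• Recall measures how well all relevant documents are retrieved ( x / n2 )
• Precision measures how well only relevant documents are retrieved ( x / n1 )
• Here x is the number of documents that are both relevant and retrieved, n1 is the number of documents retrieved, and n2 is the number of relevant documents in the collection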
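
A small sketch of the two measures, using sets of document ids. The function name `precision_recall` and the example ids are hypothetical; x, n1, and n2 follow the slide's notation, with n1 read as the number retrieved and n2 as the number relevant.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two sets of document ids."""
    x = len(retrieved & relevant)   # relevant and retrieved
    n1 = len(retrieved)             # everything the search returned
    n2 = len(relevant)              # everything that should have been found
    precision = x / n1 if n1 else 0.0
    recall = x / n2 if n2 else 0.0
    return precision, recall


# Made-up example: 5 documents retrieved, 8 relevant, 2 in common.
retrieved = {1, 2, 3, 4, 5}
relevant = {4, 5, 6, 7, 8, 9, 10, 11}

precision, recall = precision_recall(retrieved, relevant)
print(precision, recall)  # 0.4 0.25
```

In practice n2 is the hard quantity: the searcher sees what was retrieved but usually cannot see the relevant documents that were missed.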

[Figure: diagram of the document collection with regions labeled relevant, retrieved, relevant and retrieved, and not relevant and not retrieved]

The Information Retrieval Problem: Basic IR Technology
• When and where and how does the recall vs precision distinction matter?
• How well does full text retrieval work?

The Information Retrieval Problem: Summary of Blair and Maron Study
• Searchers perceived their searches as exhaustive (recall > 75%), but actual recall was 20%
• No significant difference between the searching ability of lawyers and paralegals
• Searchers were only able to anticipate a small number of the words and phrases that could be used to retrieve relevant documents and would not be in irrelevant documents
• Extraordinary and unpredictable variability in the words and phrases used to discuss the same topics (e.g., the accident in the litigation was referred to as "situation", "difficulty", "event", "what happened last week", and "we all know why we are here")
