Introduction to information retrieval

I. What is information retrieval? • What is an IR system? • History of IR
II. Basic models and concepts of IR • Classical models • Vocabulary, relevance
III. Evaluation of IR systems • Distributed IR • Green IR

I. What is information retrieval? There is a growing mountain of research... “The investigator is staggered by the findings and conclusions of thousands of other workers - conclusions which he cannot find time to grasp, much less remember. The summation of human experience is being expanded at a prodigious rate and the means we use for threading through the consequent maze to the momentarily important item is the same that was used in the days of the square rigged ships.” Bush, V. (1945). As we may think. Image: history.sandiego.edu/cdr2/WW2Pics3/58569.jpg

I. What is information retrieval? The memex: “Slanting translucent viewing screens magnifying supermicrofilm filed by code numbers. At left is a mechanism which automatically photographs longhand notes, pictures and letters, then files them in the desk for future reference.” www.boxesandarrows.com/archives/foreseeing_the_future_the_legacy_of_vannevar_bush.php Image: www.nitle.org/etcon/slide0002.htm

I. What is information retrieval? IR deals with the representation, storage, organization of and access to information items. The representation and organization of information should provide the user with easy access to the information in which [s]he is interested. Baeza-Yates, R. and Ribeiro-Neto, B. (1999). Modern Information Retrieval, Chapter 1: Introduction. Addison-Wesley-Longman Publishing Co. IR is a communication process … it is a means by which authors or creators of records communicate with readers, but indirectly and with possibly a long time lag. Meadow, Boyce, and Kraft, 2000. www.research.ibm.com/journal/rd/474/bradshaw.html

I. What is information retrieval? IR involves finding some desired information in a store of information or a database. Meadow, Boyce, and Kraft, 2000. IR deals with the representation, storage, and access to documents or representatives of documents (document surrogates). The input information is likely to include the natural language text of the documents or of document excerpts and abstracts. The output of an IR system in response to a search request consists of sets of references. Salton and McGill 1983

I. What is information retrieval? Logical view of a document Baeza-Yates and Ribeiro-Neto. www.ischool.berkeley.edu/~hearst/irbook/1/node3.html

I. What is information retrieval? What is an information retrieval system? An IR system is a device interposed between a potential user of information and the information collection itself. For a given information problem, the purpose of this system is to capture wanted items and to filter out unwanted items. Harter (1998). The task of an IR system is to retrieve documents or texts with information content relevant to a user’s information needs. Sparck Jones, K., & Willett, P. (1997). Readings in Information Retrieval. San Francisco, CA: Morgan Kaufmann.

I. What is information retrieval? IR systems involve document retrieval Indexing: how are documents represented in the system for storage and retrieval? How are queries represented for matching and retrieval? They also involve searching How are query terms related to items in the document? How is the document searched for these items? Most IR work has focused on indexing and searching Goal: to provide access to representations instead of documents

I. What is information retrieval? IR systems involve document retrieval. A big change is that systems now provide access to the full text. The challenge is the same: match the user’s query terms to terms in (or about) the document. A related challenge is to evaluate the search returns. Recall: relevant items retrieved/all relevant items in DB. Precision: relevant items retrieved/all items retrieved. Typical results show about 30% recall and 30% precision. In addition, the two measures tend to be inversely related: raising one usually lowers the other

I. What is information retrieval? A: Returned and relevant. B: All documents in DB. C: All relevant in DB. D: All returned. Recall: A/C. Precision: A/D
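The A/C and A/D ratios above can be computed with plain set operations. The following is a minimal sketch, not part of the original slides; the function name and the document IDs are hypothetical, chosen so that both measures come out at the 30% level mentioned earlier.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: IDs of all documents the system returned (set D)
    relevant:  IDs of all relevant documents in the database (set C)
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # set A: returned and relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical data: 10 relevant documents exist; the system returns 10, of which 3 are relevant
relevant = {f"d{i}" for i in range(1, 11)}
retrieved = {"d1", "d2", "d3"} | {f"x{i}" for i in range(1, 8)}
precision, recall = precision_recall(retrieved, relevant)
print(precision, recall)
```

Returning more documents can only keep recall the same or raise it, but it usually pulls in more non-relevant items and so lowers precision.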

I. What is information retrieval? Basic activities User task Initiation: Express information need as a query Knowing scope and contents of DB in relevant domain Retrieval task: translate information need into language of system System terms represent semantics of information need Evaluation task: making judgments about documents and representations returned Difficulty: goals might change during search

I. What is information retrieval? The significance of the Cranfield tests on index languages. Cleverdon describes the historical context of early indexing experiments that helped to popularize the term “IR”. His work provided evidence that indexers using four types of indexing systems performed equally well. ~Why was the finding of equivalence among indexing systems controversial? ~What did his experiments tell us about the relationship between recall and precision?

I. What is information retrieval? Post WWII, many US sci-tech reports were declassified The challenge is to provide access to the intellectual content of these documents With the number of documents, conventional indexes and systems (alpha. subject headings) didn’t work Cranfield1 tested Universal Decimal Classification, alphabetical subject index, faceted classification and Uniterm coordinate indexing 18K documents manually indexed four times using 1,200 test questions (with at least one document) Cleverdon, C.W. (1991). The significance of the Cranfield tests on index languages. Proceedings of the 14th annual international ACM SIGIR Conference on Research and Development in Information Retrieval.

I. What is information retrieval? Cleverdon used relevance judgments to measure precision and recall (note the trade-off between them) Search results Precision Recall www.sigir.org/awards/awards.html

I. What is information retrieval? Cranfield1 showed that the four indexing systems performed about the same (74-82% efficiency). Also, recall and precision could not be simultaneously improved in the same system. Cranfield2 measured IR effectiveness. It was intended to test indexing languages (as precision/recall devices) in isolation. Hypothesis: adding these devices would improve IR over use of single natural language terms. This wasn’t supported, but the methods for evaluating IR systems are widely used (Salton, Sparck-Jones)

I. What is information retrieval? The history of information retrieval research The authors provide a brief history of IR They point out significant developments in the transition from early work with managed collections to current challenges of web-based searching In doing so, they mention many of the key researchers and their contributions to IR research ~ What was the most surprising development? ~ What do you see as the future of IR?

I. What is information retrieval? The history of IR does not begin with the internet. In librarianship, books or papers were indexed using cataloguing schemes. 1918: Soper patented a device in which cards with holes, related to categories, were used to determine whether a collection held entries with a given combination of categories. If light could be seen through the arrangement of cards, a match was found. Mechanical devices that searched a catalogue for an entry were also devised. Sanderson, M. and Croft, B. (2012). The History of Information Retrieval Research. Proceedings of the IEEE

I. What is information retrieval? 1940s: The earliest computer-based searching systems 1948: Holmstrom described “the Univac,” capable of searching for text references associated with a subject code It could process “at the rate of 120 words per minute,” the first reference to a computer search for content 1958: Mooers used the term IR at International Conf. on Scientific Information in 1958 “The problem under discussion here is machine searching and retrieval of information from storage according to a specification by subject…” web.utk.edu/~alawren5/mooers.html

I. What is information retrieval? Technological change spurred change in IR in the 1950s. One system from GE searched over 30,000 document abstracts. IR research was starting to emerge with two challenges: how to index documents and how to retrieve them. Indexing: Cleverdon’s experimental results were found to be correct and, as a result, the use of words to index the documents of an IR system became established. Retrieval: Boolean: a query is a logical combination of terms which returns a set of documents exactly matching the query

I. What is information retrieval? Retrieval: ranked retrieval: each document in a collection is assigned a score indicating its relevance to a given query These scores were based on a probabilistic approach Retrieval: term frequency weighting: the frequency of word occurrence furnishes a useful measurement of word significance The capabilities of IR systems grew with increases in processor speed and storage capacity The number of bits of information packed into a square inch of hard drive surface grew from 2,000 bits in 1956 to 100 billion bits in 2005

I. What is information retrieval? Sparck-Jones: the first phase of IR was between 1955-75 Key ideas and techniques were being proposed, tested and validated Post-coordinate systems create separate entries for each concept in an item where an item can be retrieved with any combination of those concepts in any order Descriptors form an indexing vocabulary and, as a set, thesauri Are domain specific thesauri better than a universal system? Insight: automate the indexing process www.cis.upenn.edu/ghls/newGHLS.htm

I. What is information retrieval? 1960s: shift from asking if IR was possible on computers to determining means of improving IR systems. Salton: formalization of algorithms to rank documents relative to a query. Switzer: documents and queries were viewed as vectors within an N-dimensional space (N = number of unique terms in the collection being searched). Relevance feedback: supports iterative search, where items previously retrieved could be marked as relevant. Queries were automatically adjusted using information extracted from the relevant documents
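The slide does not say how queries were adjusted. One well-known formulation is Rocchio's, used here purely as an illustrative assumption. This minimal sketch keeps only the positive part (moving the query toward documents marked relevant) and omits the usual penalty for non-relevant documents; the weights `alpha` and `beta`, the three-term vocabulary and the vectors are all hypothetical.

```python
def rocchio(query, relevant_docs, alpha=1.0, beta=0.75):
    """Move the query vector toward the centroid of the documents marked relevant."""
    if not relevant_docs:
        return list(query)
    n = len(relevant_docs)
    centroid = [sum(doc[i] for doc in relevant_docs) / n for i in range(len(query))]
    return [alpha * q + beta * c for q, c in zip(query, centroid)]


# Hypothetical vocabulary order: ["cat", "lion", "dog"]
query = [1.0, 0.0, 0.0]                               # the user asked only about "cat"
marked_relevant = [[1.0, 1.0, 0.0], [1.0, 1.0, 0.0]]  # relevant items also mention "lion"
new_query = rocchio(query, marked_relevant)
print(new_query)  # [1.75, 0.75, 0.0]
```

The adjusted query now gives weight to "lion", a term the user never typed but which occurs in the documents judged relevant.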

I. What is information retrieval? Other IR enhancements Clustering of documents with similar content Statistical association of terms with similar semantic meaning Increasing the number of documents matching a query by expanding the query with lexical variations (stems) or with semantically associated words In this decade, commercial search companies emerged 1966: Dialog came from the creation of NASA IR system Low level of interaction between commercial and IR research communities

I. What is information retrieval? 1970s: term frequency weighting based on word occurrence in documents. Sparck-Jones: frequency of word occurrence in a collection is inversely proportional to its significance in retrieval. Less common words refer to more specific concepts, and are more important in retrieval. Salton's vector space model is still in use. 1980-90s: development of Latent Semantic Indexing, based on the principle that words that are used in the same contexts tend to have similar meaning
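Sparck-Jones's observation is commonly expressed as inverse document frequency, idf(t) = log(N / df(t)), where N is the number of documents and df(t) is the number of documents containing term t. The sketch below assumes that standard form (other variants exist); the function name and the toy collection are hypothetical.

```python
import math


def inverse_document_frequency(docs):
    """Map each term to log(N / df), where df counts the documents containing the term."""
    n_docs = len(docs)
    doc_freq = {}
    for doc in docs:
        for term in set(doc.lower().split()):  # count each term once per document
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


# Hypothetical three-document collection
docs = ["the cat sat", "the dog barked", "the lion roared"]
idf = inverse_document_frequency(docs)
print(idf["the"], idf["lion"])
```

A word that occurs in every document ("the") gets weight 0, while a word found in only one document ("lion") gets the highest weight.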

I. What is information retrieval? LSI extracts the conceptual content of text by finding associations between terms occurring in similar contexts. Donna Harman and colleagues formed TREC (Text REtrieval Conference). International research groups collaborate to build large test collections for experimentation. 1990s-present: the web raised new problems for IR. Two important developments: link analysis and searching of anchor text. Searching the content of a web page and the text of links pointing (anchoring) to that page (PageRank)

I. What is information retrieval? Also: mining query logs, social search, involving communities of users and informal information exchange Also: user tagging, conversation retrieval, filtering and recommendation, and collaborative search Starting to provide effective new tools for managing personal and social information Interaction between commercial and research oriented IR communities became stronger Lots of movement between search companies and academic researchers Increased Federal funding for IR research

II. Basic models and concepts of IR Classic IR models assume that documents can be described by representative keywords These index terms summarize the content (meaning) of the document Problem: not all index terms are equally useful Some are too vague for a single document Some are common and are in too many documents Terms have varying relevance when used to describe a document and must be weighted Challenge: to develop a reliable weighting algorithm

II. Basic models and concepts of IR IR terms Document surrogate: the part of the document that is the input to the IR system The complete document, one or more parts: title, abstract or table of contents, or set of keywords Document representation: a summary version of the original document represented in the index language Return set: the references presented to the user by the system Can include the bibliographic reference, parts of the representation, the surrogate or complete document

II. Basic models and concepts of IR IR is a classificatory activity Indexing sorts files into groupings to match future queries Searching sorts into matching and non-matching Categorization: assigning incoming files to appropriate headings Filtering/routing: sending files to appropriate end users Extraction: returning and displaying predetermined information about a document Summarizing: representing the file with an abstract

II. Basic models and concepts of IR The basic IR process Actually three processes Indexing: preparing the documents and representations Matching: comparing the query to the representation in the document index Retrieval: returning documents in response to a query 144.16.72.189/netlis/ wise/search/search.html

II. Basic models and concepts of IR Classical boolean models Based on set theory and “boolean algebra” Assumes an index term is present or absent [Venn diagram: sets A, B and C, with the region “A and B but not C” labeled]
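Because the Boolean model treats a term as simply present or absent, a query such as "A and B but not C" reduces to set intersection and difference over an inverted index. A minimal sketch follows; the helper name, document IDs and texts are hypothetical.

```python
def build_index(docs):
    """Map each term to the set of document IDs that contain it (presence only, no weights)."""
    index = {}
    for doc_id, text in docs.items():
        for term in set(text.lower().split()):
            index.setdefault(term, set()).add(doc_id)
    return index


# Hypothetical collection
docs = {
    1: "cat lion",
    2: "cat lion dog",
    3: "dog",
}
index = build_index(docs)
empty = set()

# "cat AND lion BUT NOT dog": intersection, then difference
result = (index.get("cat", empty) & index.get("lion", empty)) - index.get("dog", empty)
print(result)  # {1}
```

The result is an unranked set: a document either satisfies the expression exactly or is not returned at all.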

II. Basic models and concepts of IR Vector space model: documents/queries are represented as vectors in a vector space defined by index terms Uses variable term weighting schemes Does not rely on binary weightings (like boolean) This allows calculation of degree of similarity between document and query Involves clustering documents into sets that match queries and sets that don’t Allows partial matching and sorting in terms of decreasing similarity Should be more precise than Boolean searching

II. Basic models and concepts of IR Vector-based IR How well does a document match a query? [Figure labels: Cat, Lion, Dog; Doc 1, Doc 2, Doc 3] pi0959.kub.nl/Paai/Onderw/V-I/Content/history.html

II. Basic models and concepts of IR Each index word’s weight is calculated across the entire document set showing the word’s importance in the set Each index word’s weight in a given document is calculated for all documents in the set where it appears This shows how important the word is in a single document In a search, the query vector is compared to every one of the document vectors The results are ranked This shows which document comes closest to the query, and ranks the others by closeness of fit
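The steps above (a collection-wide weight, a per-document weight, then comparing the query vector with every document vector) can be sketched as follows. The sketch assumes tf × log(N/df) weighting and cosine similarity, which are common choices but are not specified on the slide; all function names and documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} vector per document, plus the collection-wide idf table."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log(n_docs / df) for term, df in doc_freq.items()}
    vectors = [{t: tf * idf[t] for t, tf in Counter(tokens).items()} for tokens in tokenized]
    return vectors, idf


def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank(query, docs):
    """Return (similarity, document position) pairs, best match first."""
    vectors, idf = tfidf_vectors(docs)
    q_vec = {t: tf * idf.get(t, 0.0) for t, tf in Counter(query.lower().split()).items()}
    scores = [(cosine(q_vec, vec), i) for i, vec in enumerate(vectors)]
    return sorted(scores, reverse=True)


# Hypothetical collection; positions are 0-based
docs = ["cat cat lion", "dog dog cat", "lion lion lion"]
ranking = rank("cat lion", docs)
for score, position in ranking:
    print(position, round(score, 3))
```

Unlike the Boolean model, every document receives a score, so a document containing only one of the two query terms is still returned, just lower in the list.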

II. Basic models and concepts of IR IR challenges What parts of documents should be analyzed and which semantic, syntactic or other features should be studied? The parts have to be translated into a consistent description or representation in the IR system. They must be accessible to that part of the system that accepts and translates the query of the user. What are the semantic “units of description” in the index language and what syntactic features should they have? What syntactic and semantic devices are in the index language to manipulate descriptions during search?

II. Basic models and concepts of IR • User-defined relevance criteria in web searching • Savolainen and Kari describe research that focuses on the criteria we use to determine which of the items returned in search engine result sets we will investigate • We make a relevance judgment about the link according to a set of eighteen criteria and then make a decision to follow a link or not. • ~What are the criteria that you use when you decide to follow a result from a web search? • ~What criteria do you use when you decide to reject a web page after clicking on a search engine result?

II. Basic models and concepts of IR • Focus: making relevance judgments during everyday life information seeking when searching the web • What are the criteria we use to define relevance? • Can criteria from online searching work in web searching? • How do we judge the quality of links and web pages they lead to? Used a talk-aloud protocol to gather data about decisions made when evaluating search engine results Finding: a set of user-defined criteria to judge relevance • Savolainen, R. and Kari, J. (2006). User-defined relevance criteria in web searching. Journal of Documentation, 62(6), 685-707

II. Basic models and concepts of IR The process begins with a task at hand or interest in an issue or topic Savolainen and Kari (2006; 686)

II. Basic models and concepts of IR The typical criterion is topicality Examples of other criteria Depth/scope/specificity Accuracy/validity Clarity Currency Topicality Accessibility Variety Familiarity Curiosity Availability of information/sources of information N=9 people conducting web searches of their own choosing 43% of all judgments were to accept, 57% to reject

II. Basic models and concepts of IR When deciding to view a web page the top criteria were Specificity, topicality, variety, familiarity, novelty When deciding to reject a page, the top criteria were Lack of specificity, insufficient accessibility, not able to understand, insufficient clarity Conclusion: many of the criteria used in traditional IR searching are applicable to the web environment Also: the significance of relevance criteria changes as the web search evolves

III. Evaluation of IR systems Sparck-Jones: it involves a set of compromises. A practical problem: how relevant is any document to any user need (which changes over time)? Also, the context of system use matters. One common method is systems based: comparative performance of indexing languages, different hardware configurations. Given a known document set, how well did the system perform: TREC. Goal is to understand how the system works to improve performance

III. Evaluation of IR systems Another is to measure user-based variables Satisfaction, relevance feedback, time on task, efficiency in retrieval (subjective relevance) Began with experimental evaluations in labs and leads to a tradeoff between control and realism Then effectiveness, efficiency and acceptability Recall/precision could measure effectiveness Then field studies of IR system use Now the web raises new problems for IR researchers Challenge: to combine measures of utility for end users with system effectiveness

III. Evaluation of IR systems Another is to measure variables in the search process itself and the tasks that bring people to the system Possibilities: measures of the user-system performance # of tasks completed # of query terms entered # of commands used # of cycles or query reformulations # of errors and time taken Subjective evaluation of satisfaction with system Amount of effort required

III. Evaluation of IR systems Standard evaluation criteria Effectiveness System based, can involve the user Often using a test collection Precision and recall Relevance judgments Efficiency Retrieval time, indexing time, size of index Usability Ease of use and of learning

III. Evaluation of IR systems People can judge the relevance of IR systems Johnson, Griffiths, and Hartley. (2003). Task dimensions of user evaluations of information retrieval systems. Information Research, 8(4) informationr.net/ir/8-4/paper157.html

III. Evaluation of IR systems To what problem is distributed information retrieval the solution? The author critically examines the assumption that searching based on distributed information retrieval is an advance over standard IR searching. Explaining how the brokered model of distributed IR works, he challenges claims of improved coverage, improved results and ease of use. ~Do you find his argument persuasive? Why? ~What advantages and disadvantages do you notice when using distributed IR search engines?

III. Evaluation of IR systems Distributed information retrieval: combines several independent search engines into a single interface A single broker runs parallel and simultaneous searches on multiple engines Coordinates retrieval from many independent search services Presents a single set of results Question: when is this type of IR useful? Is it an improvement over conventional systems? Thomas, P. (2012). To what problem is distributed information retrieval the solution? Journal of the American Society for Information Science and Technology, 63(7), 1471-1476

III. Evaluation of IR systems Tasks of distributed IR Discovering available sources Characterizing these by language, topic, or other attributes Routing a user’s query Translating between broker’s and sources’ query languages Parsing results from each engine and merging results into a single list and presenting that list Uncooperative model uses a standard interface
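The routing and merging tasks of a broker can be sketched as below. This is a rough illustration under stated assumptions: the two "engines" are hypothetical stand-in functions rather than real services, and min-max score normalization is just one simple way to make scores from different engines comparable before merging.

```python
from concurrent.futures import ThreadPoolExecutor


# Hypothetical stand-ins for independent search services; each returns (document, raw score) pairs
def engine_a(query):
    return [("a1", 12.0), ("a2", 6.0), ("a3", 3.0)]


def engine_b(query):
    return [("b1", 0.9), ("b2", 0.5)]


def normalize(results):
    """Rescale one engine's scores to [0, 1] so that lists from different engines can be compared."""
    if not results:
        return []
    scores = [score for _, score in results]
    low, high = min(scores), max(scores)
    span = high - low
    return [(doc, (score - low) / span if span else 1.0) for doc, score in results]


def broker_search(query, engines):
    """Send the query to every engine in parallel and merge the answers into one ranked list."""
    with ThreadPoolExecutor(max_workers=len(engines)) as pool:
        per_engine = list(pool.map(lambda engine: engine(query), engines))
    merged = [hit for results in per_engine for hit in normalize(results)]
    return sorted(merged, key=lambda hit: hit[1], reverse=True)


merged = broker_search("information retrieval", [engine_a, engine_b])
for doc, score in merged:
    print(doc, round(score, 3))
```

The sketch also shows why merging is hard: each engine's top hit ends up with the same normalized score, even though the broker has no evidence that the two are equally relevant.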
