CHAPTER 2 Information retrieval

DSCI 5240. Chapter 2: Information Retrieval. Instructor: Dr. Nick Evangelopoulos. Presented by: Qiuxia Wu. Introduction. Definition: Information retrieval is the activity of obtaining information resources relevant to an information need from a collection of information resources.

Presentation Transcript


  1. DSCI 5240, Chapter 2: Information Retrieval
     Instructor: Dr. Nick Evangelopoulos
     Presented by: Qiuxia Wu

  2. Introduction
     Definition: Information retrieval is the activity of obtaining information resources relevant to an information need from a collection of information resources.

  3. History of Modern IR
     For over 4000 years, humans have been designing tools to improve information storage and retrieval.
     • Vannevar Bush's 1945 paper: "As We May Think"
     The first automated information retrieval systems (1950s and 1960s)
     • SMART (the System for the Manipulation and Retrieval of Text)
     • conceived at Harvard University and flourished at Cornell University
     • under the leadership of Gerard Salton
     • the first practical implementation of an IR system
     • The basic theoretical foundations of SMART still play a major role in today's IR systems.

  4. Modern Information Retrieval
     • Document representation
       • Using keywords
       • Relative weight of keywords
     • Query representation
       • Keywords
       • Relative importance of keywords

  5. Retrieval Models
     Retrieval models match a query with documents to:
     • separate documents into relevant and non-relevant classes
     • rank the documents according to their relevance

  6. Retrieval Models
     • Boolean model
     • Vector space model
     • Probabilistic models

  7. Boolean Retrieval Model
     • One of the simplest and most efficient retrieval mechanisms
     • Based on set theory and Boolean algebra
     • Uses the conventional numeric representations of false as 0 and true as 1
     • The Boolean model is interested only in the presence or absence of a term in a document
     • In the term-document matrix, replace all the nonzero values with 1
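The binarization step on this slide can be sketched in a few lines of Python. The terms, counts, and the helper `docs_with` are hypothetical and chosen only for illustration; they are not taken from the presentation.

```python
# Hypothetical term-frequency matrix: rows are terms, columns are documents 0..3.
terms = ["retrieval", "boolean", "vector"]
tf_matrix = [
    [3, 0, 1, 0],  # "retrieval"
    [0, 2, 0, 0],  # "boolean"
    [1, 1, 0, 4],  # "vector"
]

# Boolean model: replace every nonzero count with 1 (presence/absence only).
incidence = [[1 if count > 0 else 0 for count in row] for row in tf_matrix]


def docs_with(term):
    """Return the set of document indices whose incidence entry for the term is 1."""
    row = incidence[terms.index(term)]
    return {doc for doc, present in enumerate(row) if present}


# Boolean operators map directly onto set operations.
both = docs_with("retrieval") & docs_with("vector")      # retrieval AND vector
either = docs_with("retrieval") | docs_with("boolean")   # retrieval OR boolean
not_vector = set(range(4)) - docs_with("vector")         # NOT vector
```

After binarization the counts 3, 1, and 4 are indistinguishable, which is exactly the information the Boolean model discards.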

  8. Boolean Model: Advantages
     • Simplicity and efficiency of implementation
     • Binary values can be stored using bits
       • reduced storage requirements
       • retrieval using bitwise operations is efficient
     • Boolean retrieval was adopted by many commercial bibliographic systems
     • Boolean queries are akin to database queries
     • Bibliographic systems:
       • database systems, instead of information retrieval systems
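As a rough illustration of the bit-storage point, each term's row of the incidence matrix can be packed into a single integer so that a query becomes one bitwise operation. This is a minimal sketch; the term names `t1` and `t2`, their postings, and the helper functions are assumptions made for the example.

```python
NUM_DOCS = 6
ALL_DOCS = (1 << NUM_DOCS) - 1  # bit mask with one bit set per document


def to_bits(doc_ids):
    """Pack document ids into an integer: bit d is set when document d contains the term."""
    bits = 0
    for d in doc_ids:
        bits |= 1 << d
    return bits


def to_doc_ids(bits):
    """Unpack an integer bit vector back into a sorted list of document ids."""
    return [d for d in range(bits.bit_length()) if (bits >> d) & 1]


# Hypothetical postings for two terms over six documents.
postings = {"t1": to_bits([0, 1, 3]), "t2": to_bits([1, 2, 4, 5])}

and_hits = to_doc_ids(postings["t1"] & postings["t2"])  # t1 AND t2
or_hits = to_doc_ids(postings["t1"] | postings["t2"])   # t1 OR t2
not_hits = to_doc_ids(ALL_DOCS & ~postings["t1"])       # NOT t1, masked to the collection
```

With this made-up data the AND query matches a single document while the OR query matches all six, the kind of too-few or too-many behavior that Boolean queries are prone to.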

  9. Boolean Model: Disadvantages
     • A document is either relevant or nonrelevant to the query
       • It is not possible to assign a degree of relevance
     • Complicated Boolean queries are difficult for users
     • Boolean queries retrieve too few or too many documents
       • K0 AND K4 retrieved only 1 out of 6 documents
       • K0 OR K4 retrieved 5 out of a possible 6 documents

  10. Vector Space Model
     • Both documents and queries are represented as vectors
     • Each term receives a weight based on its frequency in the document
     • More sophisticated weighting schemes will be studied later
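A minimal sketch of the representation described on this slide, assuming raw term counts as the weights. The vocabulary, the sample texts, and the function name `tf_vector` are hypothetical.

```python
from collections import Counter

# Hypothetical fixed vocabulary; each position in a vector corresponds to one term.
vocabulary = ["information", "retrieval", "model", "vector"]


def tf_vector(text):
    """Map a text to a vector of raw term frequencies over the vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]


doc_vec = tf_vector("vector model for information retrieval a vector space model")
query_vec = tf_vector("vector retrieval")
```

Documents and queries pass through the same function, so they end up in the same vector space and can be compared directly.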

  11. VSM versus Boolean Model
     • Queries are easier to express: users can attach relative weights to terms
     • A descriptive query can be transformed into a query vector similar to documents
     • Matching between a query and a document is not precise: each document is allocated a degree of similarity
     • Documents are ranked based on their similarity scores instead of relevant/nonrelevant classes
     • Users can go through the ranked list until their information needs are met
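The ranking idea can be sketched as follows. The transcript does not name a similarity measure, so cosine similarity is used here as an assumption (it is a common choice for the vector space model); the document vectors and the query weights are made up for illustration.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical term-weight vectors over a three-term vocabulary.
docs = {
    "d1": [2, 0, 1],
    "d2": [0, 3, 0],
    "d3": [1, 1, 1],
}
# The user weights the first term twice as heavily as the third and ignores the second.
query = [2, 0, 1]

# Rank documents by decreasing similarity to the query.
ranked = sorted(docs, key=lambda name: cosine(query, docs[name]), reverse=True)
```

Unlike a Boolean query, every document receives a score, so `d3` is ranked between the best match and the non-match instead of being forced into one of two classes.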

  12. Probabilistic Retrieval Model
     • Sparck Jones (1976): the classical probabilistic retrieval model, also known as the binary independence retrieval model
     • Formulates IR in a probabilistic framework
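The slide's formulas are not preserved in the transcript. As a hedged illustration, the sketch below uses the smoothed Robertson/Sparck Jones relevance weight that is commonly associated with the binary independence model; whether the presentation used this exact variant is an assumption, and the function names and sample statistics are hypothetical.

```python
import math


def rsj_weight(N, n, R, r):
    """Smoothed Robertson/Sparck Jones term weight.

    N: documents in the collection      n: documents containing the term
    R: documents known to be relevant   r: relevant documents containing the term
    The 0.5 terms are the usual smoothing that avoids division by zero.
    """
    return math.log(((r + 0.5) * (N - n - R + r + 0.5)) /
                    ((n - r + 0.5) * (R - r + 0.5)))


def score(doc_terms, query_terms, stats, N, R):
    """Sum the weights of query terms present in the document (binary presence only)."""
    return sum(rsj_weight(N, stats[t][0], R, stats[t][1])
               for t in query_terms if t in doc_terms)


# Hypothetical statistics: term -> (n, r) in a collection of N = 100 with R = 5 known relevant.
stats = {"retrieval": (10, 4), "the": (95, 5)}
doc_score = score({"retrieval", "the"}, ["retrieval", "the"], stats, N=100, R=5)
```

With no relevance information (R = 0, r = 0) the weight reduces to log((N - n + 0.5) / (n + 0.5)), an inverse-document-frequency-like quantity, which is why an initial ranking is possible before any relevance feedback is available.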

  13. Comments on Probabilistic Retrieval
     • The probabilistic independence model is not realistic
     • Two-stage retrieval is more complicated
     • Performance gain over VSM is debatable

  14. Evaluation of Retrieval Performance
     • Precision vs. recall
     • F-measure
     • Average precision

  15. Precision and Recall

  16. Precision and Recall
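The content of the two precision and recall slides did not survive in the transcript. Using the standard definitions (precision is the fraction of retrieved documents that are relevant; recall is the fraction of relevant documents that are retrieved), a minimal sketch looks like this. The document ids are made up.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a retrieved set against a relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were actually retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical run: 4 documents retrieved, 5 relevant in the collection, 2 in common.
p, r = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 5, 6, 7])
```

Here precision is 2/4 and recall is 2/5. Retrieving more documents can only keep recall the same or raise it, but it usually lowers precision, which is the trade-off behind "precision vs. recall".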

  17. F measure
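The slide body is missing from the transcript. The standard F-measure combines precision and recall into one number as a weighted harmonic mean; the sketch below uses the usual `beta` parameterization, and the sample values are hypothetical.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall.

    beta = 1 gives the balanced F1 score; beta > 1 weights recall more heavily.
    """
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical values: precision 0.5 and recall 0.4.
f1 = f_measure(0.5, 0.4)
f2 = f_measure(0.5, 0.4, beta=2.0)
```

Because it is a harmonic mean, the F-measure stays low unless both precision and recall are reasonably high, so a system cannot score well by maximizing only one of them.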

  18. Average Precision
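The slide body is missing here as well. A common definition of average precision takes the precision at each rank where a relevant document appears and divides the sum by the total number of relevant documents; some presentations divide by the number of relevant documents retrieved instead, so the denominator below is an assumption. The ranking and relevance judgments are made up.

```python
def average_precision(ranked, relevant):
    """Average of precision-at-k over the ranks k where a relevant document appears.

    Divides by the total number of relevant documents, so relevant documents
    that are never retrieved contribute zero.
    """
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for k, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / k  # precision over the top k results
    return precision_sum / len(relevant)


# Hypothetical ranked list; documents 1 and 5 are found at ranks 2 and 4, document 9 is missed.
ap = average_precision(ranked=[3, 1, 7, 5, 2], relevant=[1, 5, 9])
```

In this example the precision values at the two hits are 1/2 and 2/4, giving (0.5 + 0.5) / 3. Unlike set-based precision and recall, this measure rewards placing relevant documents near the top of the ranked list.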