
The Principle of Information Retrieval

Department of Information Management, School of Information Engineering, Nanjing University of Finance & Economics, 2011. Part II: Course content. Chapter 6: Query expansion and relevance feedback.


Presentation Transcript


  1. The Principle of Information Retrieval • Department of Information Management, School of Information Engineering, Nanjing University of Finance & Economics • 2011

  2. II Course content

  3. 6 Query expansion and relevance feedback

  4. Query refining • Query refining • Query expansion • Query reformulation • Why use query refining? • Synonymy • Personalization • …

  5. Two types • Global methods • Expanding or reformulating query terms independently of the query and the results returned for it • Local methods • Adjusting a query relative to the documents that initially appear to match the query

  6. The types of Global methods • Query expansion/reformulation with a thesaurus • WordNet • Automatic thesaurus generation • Techniques like spelling correction

  7. The types of Local methods • Relevance feedback • Pseudo-relevance feedback, also known as blind relevance feedback • (Global) indirect relevance feedback

  8. 6.1 Global methods for query reformulation

  9. 6.1.1 Query reformulation • Users give additional input on query words or phrases, possibly suggested by the IR system • The key is building a thesaurus for query reformulation • Use of a controlled vocabulary that is maintained by human editors • Library of Congress Subject Headings • The Dewey Decimal system • An automatically derived thesaurus based on word co-occurrence statistics • Query reformulations based on query log mining

  10. 6.1.2 Methods of query reformulation • Vocabulary tools • Automatic thesaurus generation

  11. 6.1.2.1 Vocabulary tools for query reformulation • By means of a thesaurus or a controlled vocabulary • The system can show the user information about • Words that were omitted from the query • What words were stemmed to • The number of hits on each term or phrase • Whether words were dynamically turned into phrases

  12. WordNet

  13. Sogou vocabulary

  14. The advantages • Does not require any user input • Some systems can do automatic query expansion with a thesaurus • In the PubMed system, neoplasm is added automatically to a search for cancer • Increases recall • Widely used in many science and engineering fields
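
Automatic thesaurus-based expansion can be sketched in a few lines of Python. This is only an illustration, not PubMed's actual mechanism: the `THESAURUS` dictionary and the `expand_query` function are hypothetical names, and the entries are a toy, hand-built stand-in for a real controlled vocabulary.

```python
# Hypothetical hand-built thesaurus: each term maps to a list of synonyms.
THESAURUS = {
    "cancer": ["neoplasm"],
    "laptop": ["notebook"],
}


def expand_query(terms, thesaurus=THESAURUS):
    """Return the query terms plus their thesaurus synonyms, without duplicates."""
    expanded = []
    for term in terms:
        # Keep the original term first, then add any synonyms found for it.
        for candidate in [term] + thesaurus.get(term.lower(), []):
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


print(expand_query(["cancer", "treatment"]))
# prints ['cancer', 'neoplasm', 'treatment']
```

Because the expanded query matches documents that use either word, recall goes up without the user doing anything, which is the advantage this slide lists.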

  15. 6.1.2.2 Automatic thesaurus generation

  16. Automatically generated thesaurus

  17. Methods • Exploit word co-occurrence, using text statistics to find the most similar words • Feasible and common • Use grammatical analysis of the text to exploit grammatical relations or grammatical dependencies • Advanced but complicated

  18. Computation of co-occurrence thesaurus • We begin with a term-document matrix $A$, where each cell $A_{t,d}$ is a weighted count $w_{t,d}$ for term $t$ and document $d$ • If we then calculate $C = AA^T$, then $C_{u,v}$ is a similarity score between terms $u$ and $v$, with a larger number being better
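
The computation on this slide can be tried out on a toy matrix. The sketch below is a minimal illustration under stated assumptions: the three terms and three documents are invented, raw 0/1 counts stand in for the weighted counts, no length normalization is applied, and `cooccurrence` and `most_similar` are hypothetical helper names.

```python
def cooccurrence(A):
    """Compute C = A A^T for a term-document matrix given as a list of rows."""
    n_terms = len(A)
    return [
        [sum(a * b for a, b in zip(A[u], A[v])) for v in range(n_terms)]
        for u in range(n_terms)
    ]


def most_similar(term, terms, C):
    """Return the other term with the highest similarity score to `term`."""
    u = terms.index(term)
    candidates = [(C[u][v], terms[v]) for v in range(len(terms)) if v != u]
    return max(candidates)[1]


# Toy data (assumed): rows are terms, columns are three documents.
terms = ["car", "automobile", "banana"]
A = [
    [1, 1, 0],  # car
    [1, 1, 0],  # automobile
    [0, 0, 1],  # banana
]

C = cooccurrence(A)
print(C)                              # prints [[2, 2, 0], [2, 2, 0], [0, 0, 1]]
print(most_similar("car", terms, C))  # prints automobile
```

Each entry of C is the dot product of two term rows, so terms that occur in the same documents get a high score. The cost grows with the square of the vocabulary size.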

  19. Computation of co-occurrence thesaurus

  20. Computation of co-occurrence thesaurus

  21. Computation of co-occurrence thesaurus

  22. Computation of co-occurrence thesaurus

  23. The disadvantages • Heavy computation • Requires dimensionality reduction via Latent Semantic Indexing • Requires a domain-specific thesaurus • The quality of the associations is a problem • Term ambiguity easily introduces irrelevant, statistically correlated terms • For example, a query for Apple computer may expand to Apple red fruit computer • Does not retrieve many additional documents • Since the terms in the automatic thesaurus are highly correlated in documents anyway

  24. 6.2 Relevance feedback • Relevance feedback is one of the most used and most successful local methods

  25. The basic idea • It may be difficult to formulate a good query when you don’t know the collection well • Seeing some documents may lead users to refine their understanding of the information they are seeking

  26. The approach of RF • The user issues a (short, simple) query • The system returns an initial set of retrieval results • The user marks some returned documents as relevant or not relevant • The system computes a better representation of the information need based on the user feedback • The system displays a revised set of retrieval results

  27. An example of RF • Image search provides a good example of relevance feedback: it is a domain where users can easily have difficulty formulating what they want in words, but can easily indicate relevant or nonrelevant images • http://nayana.ece.ucsb.edu/imsearch/imsearch.html

  28. Instructions • Browse: If the first page displayed doesn't include any interesting images, click browse to see the next page • Search: Once you find some initial images you are interested in, click on them to select them and press search • Iterate: After the search results are displayed, select/unselect more relevant images and click search • The system is based on relevance feedback, and it learns as you select more images and iterate

  29. 6.2.1 The Rocchio algorithm for RF • The classic algorithm • A model based on the vector space model (VSM) • Relevance feedback can improve both recall and precision • But, in practice, it has been shown to be most useful for increasing recall

  30. The underlying theory (1/5) • We want to find a query vector that maximizes similarity with relevant documents while minimizing similarity with nonrelevant documents

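  31. The underlying theory (2/5) • If $C_r$ is the set of relevant documents and $C_{nr}$ is the set of nonrelevant documents, then we wish to find: $\vec{q}_{opt} = \arg\max_{\vec{q}} [\mathrm{sim}(\vec{q}, C_r) - \mathrm{sim}(\vec{q}, C_{nr})]$, where sim measures the similarity between a query and a set of documents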

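  32. The underlying theory (3/5) • Under cosine similarity, the optimal query vector for separating the relevant and nonrelevant documents is: $\vec{q}_{opt} = \frac{1}{|C_r|} \sum_{\vec{d}_j \in C_r} \vec{d}_j - \frac{1}{|C_{nr}|} \sum_{\vec{d}_j \in C_{nr}} \vec{d}_j$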

  33. The underlying theory (4/5)

  34. The underlying theory (5/5) • The optimal query is the vector difference between the centroids of the relevant and nonrelevant documents • The key is getting the full set of relevant documents and the full set of nonrelevant documents based on users’ feedback

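  35. Rocchio algorithm • The modified query is $\vec{q}_m = \alpha \vec{q}_0 + \beta \frac{1}{|D_r|} \sum_{\vec{d}_j \in D_r} \vec{d}_j - \gamma \frac{1}{|D_{nr}|} \sum_{\vec{d}_j \in D_{nr}} \vec{d}_j$ • $\vec{q}_0$ is the original query vector • $D_r$ and $D_{nr}$ are the sets of known relevant and nonrelevant documents, respectively • α, β and γ are the weights attached to the three terms of the formula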

  36. Rocchio algorithm • Positive feedback turns out to be much more valuable than negative feedback, so most IR systems set γ < β • Reasonable values might be α = 1, β = 0.75, and γ = 0.15
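
A Rocchio update with these weights can be sketched in plain Python. The sketch assumes the standard form of the update, α·q0 + β·centroid(Dr) − γ·centroid(Dnr), with documents and queries as equal-length lists of term weights. The `centroid` and `rocchio` names and the toy vectors are illustrative, and clipping negative weights to zero is an added assumption that these slides do not state.

```python
def centroid(docs, dim):
    """Mean vector of a list of document vectors (zero vector if the list is empty)."""
    if not docs:
        return [0.0] * dim
    return [sum(d[i] for d in docs) / len(docs) for i in range(dim)]


def rocchio(q0, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Return alpha*q0 + beta*centroid(relevant) - gamma*centroid(nonrelevant)."""
    dim = len(q0)
    c_r = centroid(relevant, dim)
    c_nr = centroid(nonrelevant, dim)
    q_m = [alpha * q0[i] + beta * c_r[i] - gamma * c_nr[i] for i in range(dim)]
    # Assumption: negative term weights are clipped to zero.
    return [max(0.0, w) for w in q_m]


# Toy vectors over a three-term vocabulary (assumed data).
q0 = [1.0, 0.0, 0.0]
relevant = [[1.0, 1.0, 0.0], [1.0, 0.0, 0.0]]
nonrelevant = [[0.0, 0.0, 1.0]]

print(rocchio(q0, relevant, nonrelevant))
# prints [1.75, 0.375, 0.0]
```

The second term gains weight because it appears in a relevant document. The third term, which appears only in the nonrelevant document, would go negative and is clipped to zero.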

  37. Ide dec-hi • Another alternative is to use only the highest-ranked marked nonrelevant document as negative feedback • Some studies suggest this variant is the most effective, or at least the most consistent performer
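
The variant on this slide can be sketched as a small change to a Rocchio-style update: only one nonrelevant document, the one ranked highest by the system, is subtracted. This is a sketch under assumptions, not a definitive statement of Ide's formula. Here the relevant documents are averaged, as in the Rocchio form, while published formulations may sum them instead. The `ide_dec_hi` name, the `(rank, vector)` input format, and the default weights are all illustrative choices.

```python
def ide_dec_hi(q0, relevant, ranked_nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio-style update that subtracts only the top-ranked nonrelevant document.

    ranked_nonrelevant is a list of (rank, vector) pairs; rank 1 is the best rank.
    """
    dim = len(q0)
    if relevant:
        c_r = [sum(d[i] for d in relevant) / len(relevant) for i in range(dim)]
    else:
        c_r = [0.0] * dim
    if ranked_nonrelevant:
        # Pick the nonrelevant document the system ranked highest (smallest rank).
        _, top_nr = min(ranked_nonrelevant, key=lambda pair: pair[0])
    else:
        top_nr = [0.0] * dim
    return [alpha * q0[i] + beta * c_r[i] - gamma * top_nr[i] for i in range(dim)]


# Toy data (assumed): two nonrelevant documents, originally ranked 5th and 2nd.
q0 = [1.0, 0.0, 0.0]
relevant = [[1.0, 1.0, 0.0]]
ranked_nonrelevant = [(5, [0.0, 0.0, 1.0]), (2, [0.0, 1.0, 0.0])]

print(ide_dec_hi(q0, relevant, ranked_nonrelevant))
```

Only the document at rank 2 contributes negative feedback here. The one at rank 5 is ignored, so the third term weight is left untouched.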

  38. The first assumption of RF • The user has to have sufficient knowledge to be able to make an initial query that is at least somewhere close to the documents they desire • Cases where relevance feedback alone is not sufficient include: • Misspellings • Mismatch of searcher’s vocabulary versus collection vocabulary • Laptop vs. notebook computer • Cross-language information retrieval • Documents in the same language cluster more closely together

  39. The second assumption of RF (1/3) • The term distribution in all relevant documents will be similar to that in the documents marked by the users • The term distribution in all nonrelevant documents will be different from that in relevant documents

  40. The second assumption of RF (2/3) • This approach does not work well if the relevant documents are a multimodal class • Subsets of the documents using different vocabulary, such as Burma vs. Myanmar • A query for which the answer set is inherently disjunctive, such as Pop stars who once worked at Burger King • Instances of a general concept, which often appear as a disjunction of more specific concepts • For example, felines
