
Ranking models in IR


Presentation Transcript


  1. Ranking models in IR
  • Key idea: we wish to return, in order, the documents most likely to be useful to the searcher
  • To do this, we want to know which documents best satisfy a query
  • If a document talks about a topic more than another, then it is a better match
  • A query should then just specify terms that are relevant to the information need, without requiring that all of them be present
  • A document is relevant if it contains many of the query terms

  2. Binary term presence matrices
  • Record whether a document contains a word: each document is a binary vector in {0,1}^V
  • What we have mainly assumed so far
  • Idea: query satisfaction = overlap measure: overlap(Q, D) = |Q ∩ D|, i.e. the number of query terms that appear in the document
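Below is a minimal Python sketch of binary term-document vectors and the overlap measure; the toy documents, query, and helper names are invented for illustration.

```python
docs = {
    "d1": "john is quicker than mary",
    "d2": "mary likes fast cars",
}
query = "quicker than cars"

# Vocabulary of the toy collection; each document becomes a vector in {0,1}^V.
vocab = sorted({t for text in docs.values() for t in text.split()})

def binary_vector(text):
    """Map a text to a binary presence vector over the vocabulary."""
    terms = set(text.split())
    return [1 if t in terms else 0 for t in vocab]

def overlap(q_vec, d_vec):
    """Overlap measure: number of query terms present in the document."""
    return sum(q & d for q, d in zip(q_vec, d_vec))

q_vec = binary_vector(query)
for name, text in docs.items():
    print(name, overlap(q_vec, binary_vector(text)))  # d1 -> 2, d2 -> 1
```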

  3. Problems with the overlap measure?
  • It doesn't consider:
  • Term frequency (count) of the term in the document
  • Term scarcity in the collection (how many documents mention the term)
  • Length of documents
  • (And of queries: the score is not normalized)

  4. Digression: terminology
  • WARNING: in a lot of IR literature, "frequency" is used to mean "count"
  • Thus "term frequency" in the IR literature means the number of occurrences of a term in a doc
  • Not divided by document length (which would actually make it a frequency)
  • We will conform to this misnomer: in saying "term frequency" we mean the number of occurrences of a term in a document

  5. Overlap matching: normalization
  • One can normalize in various ways:
  • Jaccard coefficient: |Q ∩ D| / |Q ∪ D|
  • Cosine measure (for binary vectors): |Q ∩ D| / √(|Q| · |D|)
  • Questions:
  • What documents would score best using Jaccard against a typical query?
  • Does the cosine measure fix this problem?
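A minimal sketch, assuming the set-based forms above, contrasting how Jaccard and binary cosine treat a short versus a padded document (toy data invented for illustration):

```python
import math

def jaccard(q, d):
    """|Q ∩ D| / |Q ∪ D|: strongly penalizes documents with many terms."""
    return len(q & d) / len(q | d)

def cosine_binary(q, d):
    """|Q ∩ D| / sqrt(|Q| * |D|): a milder length penalty than Jaccard."""
    return len(q & d) / math.sqrt(len(q) * len(d))

q = {"quicker", "than", "cars"}
short_doc = {"quicker", "than"}
long_doc = {"quicker", "than"} | {f"filler{i}" for i in range(50)}

for name, d in [("short", short_doc), ("long", long_doc)]:
    print(name, round(jaccard(q, d), 3), round(cosine_binary(q, d), 3))
# Jaccard drops sharply for the long document; cosine drops less sharply.
```

Since Jaccard divides by |Q ∪ D|, short documents that share a few query terms score best; the cosine denominator grows only with the square root of document size, so it softens but does not remove the length effect.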

  6. Count term-document matrices
  • We haven't yet considered the frequency of a word
  • Count of a word in a document:
  • Bag of words model
  • Each document is a vector in ℕ^V
  • (The original slide shows a matrix of raw term counts)
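A minimal sketch of the bag-of-words count vectors, with invented toy documents standing in for the matrix the slide shows:

```python
from collections import Counter

docs = {
    "d1": "john is quicker than mary",
    "d2": "mary is quicker than john john",
}
vocab = sorted({t for text in docs.values() for t in text.split()})

def count_vector(text):
    """Document as a vector of raw term counts over the vocabulary."""
    counts = Counter(text.split())
    return [counts[t] for t in vocab]

for name, text in docs.items():
    print(name, dict(zip(vocab, count_vector(text))))
```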

  7. Bag of words view of a doc
  • Thus the doc "John is quicker than Mary." is indistinguishable from the doc "Mary is quicker than John."
  • Which of the indexes discussed so far distinguish these two docs?
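A quick illustrative check that the bag-of-words view cannot tell these two docs apart, since word order is discarded:

```python
from collections import Counter

def bag(text):
    """Bag of words: lowercase tokens, punctuation stripped, order discarded."""
    return Counter(text.lower().replace(".", "").split())

print(bag("John is quicker than Mary.") == bag("Mary is quicker than John."))  # True
```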

  8. Weighting term frequency: tf
  • What is the relative importance of
  • 0 vs. 1 occurrence of a term in a doc
  • 1 vs. 2 occurrences
  • 2 vs. 3 occurrences ...
  • It seems that more is better, but a lot isn't necessarily better than a few
  • Can just use the raw count
  • Another option commonly used in practice is sublinear scaling: wf(t,d) = 1 + log tf(t,d) if tf(t,d) > 0, and 0 otherwise
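A minimal sketch of this sublinear weighting; the slide doesn't fix a log base, so base 10 here is an assumption:

```python
import math

def wf(tf):
    """Sublinear tf scaling: 1 + log10(tf) for tf > 0, else 0 (base 10 assumed)."""
    return 1 + math.log10(tf) if tf > 0 else 0.0

for tf in [0, 1, 2, 3, 10, 100]:
    print(tf, round(wf(tf), 2))
# 0 -> 0.0, 1 -> 1.0, 2 -> 1.3, 3 -> 1.48, 10 -> 2.0, 100 -> 3.0:
# the jump from 0 to 1 occurrence matters most; further repeats add little.
```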

  9. Score computation
  • Score for a query q: score(q, d) = Σ_{t ∈ q ∩ d} tf(t, d)
  • [Note: the score is 0 if none of the query terms appear in the document]
  • Can use wf instead of tf in the above
  • Still doesn't consider term scarcity in the collection
  • Does not consider the length of the document yet
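A minimal sketch of this additive score, with toy texts invented for illustration:

```python
from collections import Counter

def score(query, doc_text):
    """Sum of tf over the query terms that appear in the document."""
    tf = Counter(doc_text.split())
    return sum(tf[t] for t in set(query.split()) if t in tf)

print(score("quicker cars", "john is quicker than mary"))   # 1
print(score("quicker cars", "fast cars and quicker cars"))  # 3
print(score("quicker cars", "mary likes insurance"))        # 0 (no terms match)
```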

  10. Weighting should depend on the term overall
  • Which of these tells you more about a doc?
  • 10 occurrences of hernia?
  • 10 occurrences of the?
  • Suggests looking at collection frequency (cf)
  • But document frequency (df) may be better:

      Word        cf      df
      ferrari     10422   17
      insurance   10440   3997

  • Both "ferrari" and "insurance" occur about the same number of times in the corpus, but "ferrari" occurs in far fewer documents!
  • Document frequency weighting is only possible if we know the (static) collection.

  11. tf x idf term weights
  • tf x idf measure combines:
  • term frequency (tf): a measure of the term's density in a doc
  • inverse document frequency (idf): a measure of the discriminating power of a term, i.e. its rarity across the whole corpus
  • idf could just be the reciprocal of the number of documents the term occurs in (idf_i = 1/df_i)
  • but by far the most commonly used version is: idf_i = log(N / df_i), where N is the total number of documents in the collection
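A worked example, plugging the df values from slide 10 into idf_i = log(N / df_i); the collection size N and the log base are assumptions, since the slides give neither:

```python
import math

N = 1_000_000  # assumed collection size (not given in the slides)
df = {"ferrari": 17, "insurance": 3997}

for term, d in df.items():
    print(term, round(math.log10(N / d), 2))
# ferrari   -> 4.77 (rare across documents: highly discriminating)
# insurance -> 2.40 (common across documents: far less discriminating)
```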

  12. Term weight := tf x idf (or tf.idf)
  • Assign a tf.idf weight to each term i in each document d: w_{i,d} = tf_{i,d} × log(N / df_i)
  • Increases with the number of occurrences within a doc
  • Increases with the rarity of the term across the whole corpus
  • What is the weight of a term that occurs in all of the docs?
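A minimal end-to-end sketch of the weight w_{i,d} = tf_{i,d} × log(N / df_i) over a toy collection (documents invented; base-10 log assumed):

```python
import math
from collections import Counter

docs = [
    "ferrari wins the race",
    "insurance for the car",
    "the race insurance claim",
]
N = len(docs)
tokenized = [d.split() for d in docs]

# df: in how many documents each term occurs
df = Counter(t for doc in tokenized for t in set(doc))

def tfidf(term, doc_tokens):
    """w = tf * log10(N / df) for terms present in the document, else 0."""
    tf = doc_tokens.count(term)
    return tf * math.log10(N / df[term]) if tf else 0.0

print(round(tfidf("ferrari", tokenized[0]), 3))  # 0.477: rare term, high weight
print(round(tfidf("the", tokenized[0]), 3))      # 0.0: occurs in every doc
```

This also answers the slide's closing question: a term occurring in all N documents has idf = log(N/N) = 0, so its tf.idf weight is 0 no matter how often it occurs.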

  13. Real-valued term-document matrices
  • Function (scaling) of the count of a word in a document:
  • Bag of words model
  • Each document is a vector in ℝ^V
  • (The original slide shows a matrix of log-scaled tf.idf values; note that entries can be > 1!)

  14. Review
  • We compute tf.idf values for each term-document pair
  • A doc can be viewed as a vector of tf.idf values
  • Dimensions = number of terms in the lexicon
  • Thus far, documents either match a query or do not
  • But how well does a document match a query?
  • How can we measure similarity between a query and a document?
  • This gives rise to ranking and scoring:
  • Vector space models
  • Relevance ranking
