Presentation Transcript


  1. CS533 Information Retrieval Dr. Michal Cutler Lecture #3 February 1, 2000

  2. This lecture • Lexis-Nexis demo • Recall and precision • Effect of query terms on recall/precision • Effect of indexing on recall and precision • Zipf’s law and its applications

  3. Searches the retrieved subset [diagram]

  4. The Concept of Relevance • The relevance of a document D to a query Q is subjective • Different users will make different judgements • The same user may judge differently at different times • The degree of relevance of different documents will vary

  5. The Concept of Relevance • In evaluating IR systems it is assumed that: • A subset of the documents in the database (DB) is relevant • A document is either relevant or it is not

  6. Finding the relevant set • In a small collection - the relevance of each document can be checked • In a large collection - a sample of the documents is checked and the number of relevant documents is estimated • On the WWW?
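
A minimal sketch of the sampling idea in Python. The collection size, the sample size, and the relevance test are all invented for illustration; a real study would use human judgments on the sampled documents.

```python
import random

def estimate_relevant(collection_size, sample, is_relevant):
    """Scale the fraction of relevant documents in the sample up to the whole DB."""
    judged_relevant = sum(1 for doc in sample if is_relevant(doc))
    return collection_size * judged_relevant / len(sample)

# Toy collection: 100,000 doc ids; pretend every 50th document is relevant.
collection_size = 100_000
sample = random.sample(range(collection_size), 1_000)
estimate = estimate_relevant(collection_size, sample, lambda d: d % 50 == 0)
print(f"estimated number of relevant documents: {estimate:.0f}")  # close to 2,000
```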

  7. Finding the relevant set in TREC • Each query is run by all competitors • The top N documents retrieved by each system are merged (pooled) and checked manually • The set of relevant documents found is considered the relevant set
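
A small sketch of TREC-style pooling as described above: each system’s top-N results are merged into one pool for manual assessment. The run names, rankings, and N are made up for this example.

```python
N = 3
runs = {
    "system_a": ["d7", "d2", "d9", "d4"],
    "system_b": ["d2", "d5", "d7", "d1"],
    "system_c": ["d8", "d7", "d2", "d6"],
}

pool = set()
for ranking in runs.values():
    pool.update(ranking[:N])  # take each system's top N documents

print(sorted(pool))  # this pool is what the human assessors judge
```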

  8. Recall and precision • The most commonly used measures for evaluating an IR system • Given a DB and a query Q

  9. Example 1 • Relevant documents in the DB: rl = 5 • Retrieved documents: rt = 8 • Both relevant and retrieved: rr = 2 • N - the whole database

  10. Recall • Recall R = rr / rl, where • rl - the number of relevant documents in the DB • rr - the number of relevant retrieved documents • The fraction of the relevant set which is retrieved • For example 1: R = 2/5 = 0.4

  11. Precision • rt - the number of documents retrieved for Q • Precision P = rr / rt • The fraction of the retrieved set which is relevant • For example 1: P = 2/8 = 0.25
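
Both measures are easy to compute from sets. A minimal sketch reproducing Example 1 (rl = 5, rt = 8, rr = 2); the document ids are invented.

```python
relevant = {"d1", "d2", "d3", "d4", "d5"}                        # rl = 5
retrieved = {"d1", "d2", "d6", "d7", "d8", "d9", "d10", "d11"}   # rt = 8

rr = len(relevant & retrieved)      # relevant AND retrieved: rr = 2
recall = rr / len(relevant)         # 2/5 = 0.4
precision = rr / len(retrieved)     # 2/8 = 0.25
print(f"recall = {recall}, precision = {precision}")
```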

  12. Recall and precision • Ideal retrieval results: 100% recall and 100% precision • All good documents are retrieved and • No bad document is retrieved

  13. Recall/precision graph [plot: precision (0 to 1) against recall (0 to 1); the ideal system sits at recall = 1, precision = 1]

  14. Choosing query terms • Subject: information retrieval • Initial query: information AND retrieval • Broader query: information OR retrieval • Narrower query: information ADJ retrieval (terms must be adjacent)

  15. Effect of query terms on results • Broad query - high recall but low precision • Narrow query - high precision but low recall
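
The effect can be seen with a toy inverted index: AND shrinks the retrieved set (narrow, precision-oriented) while OR enlarges it (broad, recall-oriented). The index below is invented; the ADJ operator would additionally need word positions, which this sketch does not model.

```python
index = {
    "information": {"d1", "d2", "d3", "d5"},
    "retrieval":   {"d2", "d4", "d5"},
}

narrow = index["information"] & index["retrieval"]  # AND: {"d2", "d5"}
broad  = index["information"] | index["retrieval"]  # OR: all five documents
print(f"narrow query retrieves {len(narrow)} docs, broad query retrieves {len(broad)}")
```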

  16. Indexing Effectiveness • Indexing exhaustivity • Term specificity

  17. Exhaustivity • An index is exhaustive when all of a document’s content is indexed • An index with very few terms loses information • and decreases recall

  18. Exhaustivity • An exhaustive index • Increases output • Increases recall • but decreases precision

  19. Specificity • Specificity - the breadth or narrowness of the index terms • Broad terms: a plus for recall, a minus for precision • Narrow terms: a plus for precision, a minus for recall

  20. The Trade Off [plot: precision vs. recall; narrow terms sit toward high precision and low recall, broad terms toward high recall and low precision, crossing near recall = precision = 0.5]

  21. Zipf’s law and its applications • Estimating the storage space saved by excluding stop words from the index • The 10 most frequently occurring words in English ≈ 25%-30% of the text
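
A back-of-the-envelope version of the storage estimate: if the 10 most frequent words account for 25%-30% of the running text, excluding them removes roughly that fraction of all postings. The token count and the exact share below are illustrative.

```python
total_words = 1_000_000   # word occurrences (postings) in the collection
stopword_share = 0.27     # the 10 most frequent words ~ 25%-30% of the text

postings_saved = total_words * stopword_share
print(f"postings avoided by excluding the top-10 stop words: {postings_saved:,.0f}")
```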

  22. Zipf’s law and its applications • Estimating the size of a term’s inverted index list • Given the rank r of the term in English, • N the number of words in the database, and • A the constant for the database • Size of the inverted index list: n ≈ A*N/r
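
A sketch of the estimate n ≈ A*N/r, taking A = 0.1 in line with the r*f(r)/N ≈ 0.1 observed on slide 26; N and the sample ranks are illustrative.

```python
A = 0.1            # Zipf constant observed for the collection
N = 1_000_000      # number of words in the database

def inverted_list_size(rank):
    """Estimated number of postings for the term of a given rank."""
    return A * N / rank

for r in (1, 10, 100, 10_000):
    print(f"rank {r:>6}: ~{inverted_list_size(r):,.0f} postings")
```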

  23. Zipf’s law and its applications • Estimating the number of words n(1) that occur once, n(2) that occur twice, etc. • Words that occur at most twice ≈ 2/3 of the distinct words in a text • Deleting very low frequency words from the index yields large savings

  24. Term frequency predictions • Rank the words by their frequency of occurrence in English • 1 - the most frequent word, and • t - the number of distinct terms (the last rank) • The table on the next slide, based on a text with 1 million words, shows the 10 most frequent words and their frequencies of occurrence

  25. Most frequent words (N ≈ 1,000,000)
      r   Word   f(r)     r*f(r)/N
      1   the    69,971   0.070
      2   of     36,411   0.073
      3   and    28,852   0.086
      4   to     26,149   0.104
      5   a      23,237   0.116
      6   in     21,341   0.128
      7   that   10,595   0.074
      8   is     10,049   0.081
      9   was     9,816   0.088
     10   he      9,543   0.095

  26. Observing the numbers • “the” and “of” ≈ 10% of the text • All 10 words ≈ 25% of the text • f(r)/N ≈ the probability of occurrence of the term with rank r in the text • Note that r*f(r)/N ≈ 0.1
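
Recomputing the last column from the slide-25 counts shows the product r*f(r)/N hovering near the constant A ≈ 0.1 that Zipf’s law predicts:

```python
N = 1_000_000
freqs = [69_971, 36_411, 28_852, 26_149, 23_237,
         21_341, 10_595, 10_049, 9_816, 9_543]   # f(r) for r = 1..10

for r, f in enumerate(freqs, start=1):
    print(f"r = {r:2d}   r*f(r)/N = {r * f / N:.3f}")
```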

  27. Zipf’s law • If we rank all terms in English by frequency of occurrence, we find that r*p(r) ≈ A, where • r denotes the rank of a word, • p(r) is the probability of occurrence of the word, and • A is a constant

  28. Zipf’s observations • The most frequent English words are short • The least frequent are the longest • The average length of distinct English words is 8.1 characters, but • The average length over all word occurrences is only about 4.7 characters

  29. Zipf’s law for a given text • Given a text of N words, • r ≈ A*N/f(r), where • A is a domain-specific constant, and • f(r) is the number of occurrences of the term with rank r

  30. Distinct words occurring j times • The text has t distinct words, ranked 1 to t • max - the maximum number of occurrences of any word • n(j) - the number of words occurring exactly j times • last(j) - the last (highest) rank among terms occurring j times

  31. Rank-frequency table
      Rank          Frequency   Number of words
      1             max
      ...           max
      last(max)     max         n(max) = last(max)
      ...           ...
      last(j+1)     j+1
      ...           j
      last(j)       j           n(j) = last(j) - last(j+1)
      last(2)       2
      ...           1
      last(1) = t   1           n(1) = last(1) - last(2)

  32. Distinct words occurring j times • last(j) - the highest rank among terms occurring at least j times • last(j+1) - the highest rank among terms occurring at least j+1 times • n(j) - the number of words occurring exactly j times: n(j) = last(j) - last(j+1)

  33. Distinct words occurring j times • Using Zipf’s law: last(j) ≈ A*N/j and last(j+1) ≈ A*N/(j+1), and last(1) = t ≈ A*N/1 = A*N

  34. Distinct words occurring j times • n(j) = last(j) - last(j+1) ≈ A*N/j - A*N/(j+1) = A*N/(j(j+1)) ≈ t/(j(j+1)) (since A*N ≈ t, see the previous slide)

  35. Distinct words occurring once, twice... • About half of the distinct words in the text occur exactly once: n(1)/t ≈ 1/(1*2) = 0.5 • About 2/3 of the distinct words occur at most twice: n(2)/t ≈ 1/(2*3) ≈ 0.167, so (n(1) + n(2))/t ≈ 0.667
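
The predictions are easy to tabulate. A short sketch of n(j)/t ≈ 1/(j(j+1)) and the cumulative fraction of distinct words occurring at most j times:

```python
cumulative = 0.0
for j in range(1, 6):
    share = 1 / (j * (j + 1))   # predicted fraction of distinct words occurring exactly j times
    cumulative += share
    print(f"j = {j}: n(j)/t ~ {share:.3f}, occurring at most {j} times ~ {cumulative:.3f}")
# j = 1 gives 0.500 and j = 2 brings the cumulative share to 0.667, matching the slide.
```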
