
Information Retrieval and Web Search


Presentation Transcript


    1. Information Retrieval and Web Search: Text Properties. Instructor: Rada Mihalcea. (Note: some of the slides in this set have been adapted from the course taught by Prof. James Allan at the University of Massachusetts, Amherst.)

    2. Statistical Properties of Text How is the frequency of different words distributed? How fast does vocabulary size grow with the size of a corpus? Such factors affect the performance of information retrieval and can be used to select appropriate term weights and other aspects of an IR system.

    3. Word Frequency A few words are very common. The 2 most frequent words (e.g. "the", "of") can account for about 10% of word occurrences. Most words are very rare. Half the words in a corpus appear only once; these are called hapax legomena (Greek for "read only once"). This is a heavy-tailed distribution, since most of the probability mass is in the tail.
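
These proportions are easy to check on any plain-text corpus. A minimal sketch in Python (the file name corpus.txt is hypothetical; any large text file will do):

```python
from collections import Counter
import re

# Hypothetical corpus file; any large plain-text file will do.
text = open("corpus.txt", encoding="utf-8").read().lower()
tokens = re.findall(r"[a-z]+", text)

freq = Counter(tokens)
total = len(tokens)

top2 = sum(f for _, f in freq.most_common(2))
hapax = sum(1 for f in freq.values() if f == 1)

print(f"top-2 words cover {top2 / total:.1%} of occurrences")
print(f"hapax legomena: {hapax / len(freq):.1%} of the vocabulary")
```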

    4. Sample Word Frequency Data (from B. Croft, UMass)

    5. Zipf's Law Rank (r): the numerical position of a word in a list sorted by decreasing frequency (f). Zipf (1949) discovered that f · r ≈ k (for constant k). If the probability of the word of rank r is p_r and N is the total number of word occurrences: p_r = f / N ≈ A / r, for a corpus-independent constant A ≈ 0.1.
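
As a quick sanity check, the prediction p_r ≈ A/r can be compared against observed rank-frequency data; a sketch, again assuming a hypothetical corpus.txt:

```python
from collections import Counter
import re

A = 0.1  # Zipf's corpus-independent constant
tokens = re.findall(r"[a-z]+", open("corpus.txt", encoding="utf-8").read().lower())
freq = Counter(tokens)

for rank, (word, f) in enumerate(freq.most_common(10), start=1):
    observed = f / len(tokens)   # p_r = f / N
    predicted = A / rank         # Zipf: p_r ~ A / r
    print(f"{rank:2d} {word:12s} observed={observed:.4f} predicted={predicted:.4f}")
```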

    6. Zipf and Term Weighting Luhn (1958) suggested that both extremely common and extremely uncommon words were not very useful for indexing.

    7. Predicting Occurrence Frequencies By Zipf, a word appearing n times has rank r_n = AN/n. Several words may occur n times; assume rank r_n applies to the last of these. Therefore, r_n words occur n or more times and r_{n+1} words occur n+1 or more times. So, the number of words appearing exactly n times is: I_n = r_n − r_{n+1} = AN/n − AN/(n+1) = AN / (n(n+1)).

    8. Predicting Word Frequencies (cont'd) Assume the highest-ranking term occurs once and therefore has rank D = AN/1 = AN. The fraction of words with frequency n is then: I_n / D = 1 / (n(n+1)). The fraction of words appearing only once is therefore 1 / (1·2) = 1/2.
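
The predicted fractions 1/(n(n+1)) are easy to tabulate against an observed frequency-of-frequencies count; a sketch under the same corpus.txt assumption:

```python
from collections import Counter
import re

tokens = re.findall(r"[a-z]+", open("corpus.txt", encoding="utf-8").read().lower())
freq = Counter(tokens)
freq_of_freq = Counter(freq.values())  # n -> number of distinct words occurring exactly n times

for n in range(1, 6):
    predicted = 1 / (n * (n + 1))           # Zipf-based prediction I_n / D
    observed = freq_of_freq[n] / len(freq)  # observed fraction of the vocabulary
    print(f"n={n}: predicted={predicted:.3f} observed={observed:.3f}")
```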

    9. Occurrence Frequency Data (from B. Croft, UMass)

    10. Does Real Data Fit Zipf's Law? A law of the form y = k·x^c is called a power law. Zipf's law is a power law with c = −1. On a log-log plot, power laws give a straight line with slope c. Zipf is quite accurate except at very high and very low ranks.
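
One way to see this is to plot rank against frequency on log-log axes and overlay a slope −1 reference line; a sketch using matplotlib (corpus.txt again hypothetical):

```python
from collections import Counter
import re
import matplotlib.pyplot as plt

tokens = re.findall(r"[a-z]+", open("corpus.txt", encoding="utf-8").read().lower())
freqs = sorted(Counter(tokens).values(), reverse=True)
ranks = range(1, len(freqs) + 1)

plt.loglog(ranks, freqs, ".", label="observed")
plt.loglog(ranks, [freqs[0] / r for r in ranks], label="Zipf (slope -1)")
plt.xlabel("log rank"); plt.ylabel("log frequency"); plt.legend()
plt.show()
```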

    11. Fit to Zipf for Brown Corpus

    12. Mandelbrot (1954) Correction The following more general form gives a somewhat better fit: f = P(r + ρ)^(−B), for constants P, B, ρ.
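
The constants P, B, and ρ can be fit numerically to rank-frequency data; a sketch using scipy's curve_fit (the initial guesses are ad hoc, and corpus.txt is hypothetical):

```python
import re
from collections import Counter
import numpy as np
from scipy.optimize import curve_fit

def mandelbrot(r, P, B, rho):
    return P * (r + rho) ** (-B)   # f = P(r + rho)^(-B)

tokens = re.findall(r"[a-z]+", open("corpus.txt", encoding="utf-8").read().lower())
f = np.array(sorted(Counter(tokens).values(), reverse=True), dtype=float)
r = np.arange(1, len(f) + 1, dtype=float)

params, _ = curve_fit(mandelbrot, r, f, p0=(f[0], 1.0, 1.0), maxfev=10_000)
print("P=%.1f  B=%.3f  rho=%.2f" % tuple(params))
```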

    13. Mandelbrot Fit

    14. Explanations for Zipf's Law Zipf's explanation was his "principle of least effort": a balance between the speaker's desire for a small vocabulary and the hearer's desire for a large one. Li (1992) shows that just random typing of letters, including a space character, will generate "words" with a Zipfian distribution (see the sketch below). http://linkage.rockefeller.edu/wli/zipf/
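
Li's random-typing result is easy to reproduce; a minimal simulation in which every letter and the space are equally likely:

```python
import random
from collections import Counter

random.seed(0)
alphabet = "abcdefghijklmnopqrstuvwxyz "  # the space character ends a "word"
text = "".join(random.choices(alphabet, k=1_000_000))
counts = Counter(text.split())

# Shorter "words" are exponentially more likely, which yields a
# roughly Zipfian rank-frequency curve.
for rank, (word, c) in enumerate(counts.most_common(5), start=1):
    print(f"rank {rank}: {word!r} occurs {c} times")
```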

    15. Zipf's Law Impact on IR Good news: stopwords account for a large fraction of the text, so eliminating them greatly reduces inverted-index storage costs. Bad news: for most words, gathering sufficient data for meaningful statistical analysis (e.g. correlation analysis for query expansion) is difficult, since those words are extremely rare.
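
To get a feel for the good news, one can measure how much of a corpus a handful of stopwords covers; a sketch with a small illustrative stopword list (real systems use larger curated lists, and corpus.txt is hypothetical):

```python
from collections import Counter
import re

# Illustrative stopword list; real systems use larger curated lists.
stopwords = {"the", "of", "and", "a", "to", "in", "is", "that", "it", "for"}

tokens = re.findall(r"[a-z]+", open("corpus.txt", encoding="utf-8").read().lower())
stop_count = sum(1 for t in tokens if t in stopwords)
print(f"stopwords cover {stop_count / len(tokens):.1%} of all token occurrences")
```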

    16. Exercise Assuming Zipf's Law with a corpus-independent constant A = 0.1, what is the fewest number of most common words that together account for more than 25% of word occurrences (i.e., the minimum value of m such that at least 25% of word occurrences are one of the m most common words)?
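
One way to check an answer is to accumulate the Zipf probabilities A/r until the running total passes 25%; a short sketch:

```python
# Smallest m with sum_{r=1..m} A/r > 0.25, under Zipf with A = 0.1.
A, target = 0.1, 0.25
cumulative, m = 0.0, 0
while cumulative <= target:
    m += 1
    cumulative += A / m
print(f"m = {m} (cumulative probability {cumulative:.3f})")
```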

    17. Vocabulary Growth How does the size of the overall vocabulary (number of unique words) grow with the size of the corpus? This determines how the size of the inverted index will scale with the size of the corpus. Vocabulary not really upper-bounded due to proper names, typos, etc.

    18. Heaps' Law If V is the size of the vocabulary and n is the length of the corpus in words, then: V = K·n^β, with constants K and 0 < β < 1. Typical constants: K ≈ 10-100, β ≈ 0.4-0.6 (approximately square-root).
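
A direct evaluation of V = K·n^β for a few constants in the typical ranges (the specific (K, β) pairs are illustrative):

```python
# Heaps' Law V = K * n^beta for a few illustrative constants.
def heaps(n, K, beta):
    return K * n ** beta

for K, beta in [(10, 0.6), (44, 0.5), (100, 0.4)]:
    print(f"K={K:3d}, beta={beta}: V(1,000,000) = {heaps(1_000_000, K, beta):,.0f}")
```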

    19. Heaps' Law Data

    20. Explanation for Heaps' Law Heaps' Law can be derived from Zipf's law by assuming documents are generated by randomly sampling words from a Zipfian distribution. Heaps' Law also holds for distributions of other data; our own experiments on the types of questions asked by users show similar behavior.

    21. Exercise We want to estimate the size of the vocabulary for a corpus of 1,000,000 words. However, we only know statistics computed on smaller corpora: for 100,000 words, there are 50,000 unique words; for 500,000 words, there are 150,000 unique words. Estimate the vocabulary size for the 1,000,000-word corpus. How about for a corpus of 1,000,000,000 words?
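
One way to work the exercise is to solve V = K·n^β from the two known points (two equations in the two unknowns K and β) and then extrapolate; a sketch:

```python
import math

# Two known (corpus size, vocabulary size) points.
n1, v1 = 100_000, 50_000
n2, v2 = 500_000, 150_000

# V = K * n^beta  =>  v2 / v1 = (n2 / n1)^beta
beta = math.log(v2 / v1) / math.log(n2 / n1)
K = v1 / n1 ** beta

for n in (1_000_000, 1_000_000_000):
    print(f"n = {n:,}: estimated V = {K * n ** beta:,.0f}")
```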
