
Statistical Natural Language Processing



Presentation Transcript


  1. Statistical Natural Language Processing Lecture 5 5/17/2011

  2. Recommended reading • Zipf’s Law • Manning & Schütze 1.4.2-1.4.3 • Smoothing • Jurafsky & Martin 4.5

  3. Outline • Counting features in corpora • Difficulties • Sparse data problem • Zipf’s Law • Dealing with sparse data • Unknown features • Smoothing

  4. Review: N-gram • An N-gram is a sequence of length N of some unit, such as words or POS tags • N = 1: unigram N = 3: trigram • N = 2: bigram N = 4: 4-gram • Example: The quick brown fox • 4 (word) unigrams: The, quick, brown, fox • 3 bigrams: The quick, quick brown, brown fox • 2 trigrams: The quick brown, quick brown fox • 1 4-gram: The quick brown fox
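A minimal sketch of N-gram extraction in Python (the helper name and example usage are illustrative, not from the slides):

```python
def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["The", "quick", "brown", "fox"]
print(ngrams(tokens, 1))  # 4 unigrams
print(ngrams(tokens, 2))  # 3 bigrams: ('The','quick'), ('quick','brown'), ('brown','fox')
print(ngrams(tokens, 3))  # 2 trigrams
print(ngrams(tokens, 4))  # 1 4-gram
```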

  5. Last time • Formulate NLP task as machine learning problem • Instances • Labels • Features • Annotated corpus for training classifier • For instances of interest, count occurrences of features • [Diagram: feature functions F mapping training instances X to labels Y]

  6. Training and testing sets • Get an annotated corpus • Split into training and testing sets • Proportions: in NLP, often 90% / 10% • Training set • Count occurrences of features for labeled instances • Train a classifier • Test set • Apply classifier to this data • Use annotations to evaluate performance • [Diagram: corpus divided into a training set and a testing set]
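A minimal sketch of the 90% / 10% split described above, assuming the annotated corpus is already a list of (instance, label) pairs; the function and variable names are illustrative:

```python
import random

def split_corpus(annotated, train_frac=0.9, seed=0):
    """Shuffle the annotated corpus and split it into training and testing sets."""
    data = list(annotated)
    random.Random(seed).shuffle(data)
    cut = int(len(data) * train_frac)
    return data[:cut], data[cut:]

# hypothetical usage: corpus is a list of (instance, label) pairs
# train_set, test_set = split_corpus(corpus)
```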

  7. Develop classifier (prediction model) • Training: • Read features for labeled instances • Train classifier: find a mapping from feature vectors to labels • Testing: • Read features for instances • Apply classifier to predict labels

  8. Let’s say we’re doing POS tagging of wi • Training data: … ti-3 ti-2 ti-1 ti ti+1 ti+2 ti+3 … … wi-3 wi-2 wi-1 wi wi+1 wi+2 wi+3 … • Label of wi is ti • Choose features to try to predict the tag of a word

  9. Features for POS tagging • Training data: … ti-3 ti-2 ti-1 ti ti+1 ti+2 ti+3 … … wi-3 wi-2 wi-1 wi wi+1 wi+2 wi+3 … • Possibly relevant features: • Current word, previous word, next word • Previous word bigram, next word bigram • Previous POS tag, next POS tag • Previous POS bigram, next POS bigram • Previous POS trigram, next POS trigram • Combinations of features

  10. Example: “combination of features” as a feature • Training data: … ti-3 ti-2 ti-1 ti ti+1 ti+2 ti+3 … … wi-3 wi-2 wi-1 wi wi+1 wi+2 wi+3 … • Example: the word and POS tag at a particular position • (wi-1, ti-1) • Not the same as the set of both individual features • wi-1 alone is a different feature • ti-1 alone is a different feature

  11. Features are useful for prediction • Example: • Guess the tag of the word brown • The/DT quick/JJ brown • A classifier can use an entire vector of features for an instance Xi to make its prediction • Features for above example: • prev_w=quick • prev_bigram=the_quick • prev_pos=JJ • prev_pos_bigram=DT_JJ
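A minimal sketch of computing that feature vector for one instance; the feature names mirror the slide, while the helper itself is an assumption about how extraction might be organized:

```python
def pos_features(words, tags, i):
    """Features for predicting the tag of words[i], using only the left context (assumes i >= 2)."""
    return {
        "prev_w": words[i - 1],
        "prev_bigram": "_".join(words[i - 2:i]),
        "prev_pos": tags[i - 1],
        "prev_pos_bigram": "_".join(tags[i - 2:i]),
    }

words = ["The", "quick", "brown"]
tags = ["DT", "JJ", None]   # the tag of "brown" is what we want to predict
print(pos_features(words, tags, 2))
# {'prev_w': 'quick', 'prev_bigram': 'The_quick', 'prev_pos': 'JJ', 'prev_pos_bigram': 'DT_JJ'}
```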

  12. Counting features in training data • A learning algorithm will use frequencies of features and tags from the training data in order to learn a predictive relationship between features and labels • Toy example:

  Feature            Label  Frequency
  prev_bigram=DT_JJ  NN     40
  prev_bigram=DT_JJ  NNS    15
  prev_bigram=DT_JJ  JJ     10
  prev_bigram=DT_JJ  IN     2
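A minimal sketch of producing such a table with collections.Counter, assuming the training data has already been featurized into (feature dict, label) pairs; the toy data below is illustrative:

```python
from collections import Counter

def feature_label_counts(featurized_data):
    """Count how often each (feature=value, label) pair occurs in the training data."""
    counts = Counter()
    for features, label in featurized_data:
        for name, value in features.items():
            counts[(f"{name}={value}", label)] += 1
    return counts

# hypothetical toy data: (feature dict, label) pairs
data = [({"prev_bigram": "DT_JJ"}, "NN"),
        ({"prev_bigram": "DT_JJ"}, "NN"),
        ({"prev_bigram": "DT_JJ"}, "NNS")]
for (feature, label), freq in feature_label_counts(data).most_common():
    print(feature, label, freq)
```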

  13. How do we pick features? • Use careful linguistic analysis: • Specify exactly the features that are relevant • Use very detailed features • Very similar to rule-based approach • Problem: very hard to think of all possible cases • “Knowledge-light” approach: • Use a template to construct a large number of features • Let the algorithm decide which are relevant, according to their frequencies with diff. labels in the training data

  14. Example • Careful approach: • Have a prev_bigram=x feature for all bigrams x that you think will actually matter for prediction • Knowledge-light: • Construct a feature prev_bigram=x for all bigrams in the corpus • Many of these will be linguistically irrelevant • Expect that relevant features will have a much stronger statistical association with particular labels in the training data, compared to irrelevant features
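A minimal sketch of the knowledge-light template: generate a prev_bigram=x feature for every bigram observed in the corpus rather than hand-picking them (the function and toy corpus are illustrative):

```python
def bigram_feature_vocabulary(sentences):
    """Construct the set of prev_bigram=x features for all bigrams in the corpus."""
    features = set()
    for words in sentences:
        for w1, w2 in zip(words, words[1:]):
            features.add(f"prev_bigram={w1}_{w2}")
    return features

corpus = [["The", "quick", "brown", "fox"], ["A", "quick", "decision"]]
print(sorted(bigram_feature_vocabulary(corpus)))
```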

  15. Outline • Counting features in corpora • Difficulties • Sparse data problem • Zipf’s Law • Dealing with sparse data • Unknown features • Smoothing

  16. Additional context is useful for prediction • Examples • Task: tag the current word • up { IN, RP } 0 words of context • look up { IN, RP } 1 word of context • to look up { IN } 2 words of context • Task: predict POS tag of the next word • to __ { DT, VB, RB } 1 word of context • want to __ { VB, RB } 2 words of context • never want to __ { VB? } 3 words of context • So, it seems that such “larger” features are better

  17. Maybe we could use a lot more features: • Current word, previous word, next word • Previous word bigram, next word bigram • Previous word trigram, next word trigram • Previous word 4-gram, next word 4-gram • Previous word 5-gram, next word 5-gram • Previous POS tag, next POS tag • Previous POS bigram, next POS bigram • Previous POS trigram, next POS trigram • Previous POS 4-gram, next POS 4-gram • Previous POS 5-gram, next POS 5-gram • …

  18. Uh-oh: sparse data problem • For many types of features, there will often be zero occurrences of logically possible feature values in a corpus • Example: quick brown beaver • Not found in the Brown corpus (1.2 million words) • Not found by Google either!

  19. Sparse data causes a problem when applying classifier • Suppose you have an instance Xi that you want to classify. Compute its features. • Suppose it has a feature f(Xi) that was unseen in the training data. • How can we use this feature to predict a label Yi? • We have no knowledge of the relationship between f(Xi) and any labels: P( Yk | unseen-feature ) = divide by zero! • Example: The quick brown beaver jumped • quick brown beaver occurs zero times in training data • Not possible to predict a tag for jumped using this feature
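A minimal sketch of why the relative-frequency estimate breaks down for an unseen feature: its denominator is a zero count, so there is nothing to divide by (the counts and names below are illustrative, reusing the toy table from slide 12):

```python
from collections import Counter

# counts gathered from training data (illustrative toy numbers, 67 = 40 + 15 + 10 + 2)
feature_counts = Counter({"prev_bigram=DT_JJ": 67})
feature_label_counts = Counter({("prev_bigram=DT_JJ", "NN"): 40})

def p_label_given_feature(label, feature):
    """Relative-frequency estimate of P(label | feature) from training counts."""
    if feature_counts[feature] == 0:
        raise ZeroDivisionError(f"{feature!r} was never seen in training data")
    return feature_label_counts[(feature, label)] / feature_counts[feature]

print(p_label_given_feature("NN", "prev_bigram=DT_JJ"))            # ≈ 0.597
# p_label_given_feature("VBD", "prev_trigram=quick_brown_beaver")  # unseen feature -> error
```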

  20. Sparse data results from combinatorics • As the structures to be counted in a corpus get more and more complicated, fewer occurrences are expected to be seen • Example: exponential growth of # of word N-grams • Brown corpus: ~50,000 word types • Number of logically possible word N-grams • N = 1: 50,000 different unigrams • N = 2: 50,000² = 2.5 billion different bigrams • N = 3: 50,000³ = 125 trillion different trigrams

  21. Expected frequency of an N-gram (Brown: 1.2 million tokens, 50k word types) • Suppose N-grams are uniformly distributed • Calculate expected frequencies of N-grams: • 50,000 word unigrams • 1.2 million / 50,000 = 24 occurrences • 2.5 billion word bigrams • 1.2 million / 2.5 billion ≈ 0.00048 occurrences • 125 trillion word trigrams: • 1.2 million / 125 trillion ≈ 0.0000000096 occurrences • Conclusion: for word N-grams, we probably shouldn’t use N > 1, otherwise it will lead to data sparsity
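The arithmetic on this slide as a short sketch, assuming N-grams were uniformly distributed over all logically possible combinations:

```python
tokens = 1_200_000      # word tokens in the Brown corpus
types  = 50_000         # word types

for n in (1, 2, 3):
    possible = types ** n          # logically possible word N-grams
    expected = tokens / possible   # expected frequency under a uniform distribution
    print(f"N={n}: {possible:,} possible N-grams, expected frequency {expected:.2g}")
```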

  22. Actual counts of different units from the Brown corpus (1.2 million tokens)

  23. What features should we choose? • Features that may be relevant to the task (using your linguistic knowledge), and don’t lead to data sparsity

  24. Outline • Counting features in corpora • Difficulties • Sparse data problem • Zipf’s Law • Dealing with sparse data • Unknown features • Smoothing

  25. Probability distributions in corpora • Previous example made the assumption that N-grams are uniformly distributed in a corpus • Leads to sparse data for sufficiently high N • However, in actual corpora, many distributions of natural language events appear to follow Zipf’s Law • Makes the sparse data problem even worse

  26. Frequencies of words in a corpus: types and tokens • Brown corpus: • 1,160,743 word tokens • 49,680 word types • Type: a distinct word • “with” • Token: an individual occurrence of a word • “with” occurs 7,270 times
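A minimal sketch of the type/token distinction (the toy sentence is illustrative; counting a real corpus works the same way):

```python
from collections import Counter

def type_token_stats(tokens):
    """Return the number of tokens, the number of types, and per-type counts."""
    counts = Counter(tokens)
    return len(tokens), len(counts), counts

tokens = "the quick brown fox jumped over the lazy dog with the fox".split()
n_tokens, n_types, counts = type_token_stats(tokens)
print(n_tokens, n_types)   # 12 tokens, 9 types
print(counts["the"])       # 3 tokens of the type "the"
```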

  27. Frequency and rank • Sort words by decreasing frequency • Rank = order in sorted list • Rank 1: most-frequent word • Rank 2: second most-frequent word • etc. • Plot word frequencies by rank
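A minimal sketch of building the rank/frequency list that the following plots are drawn from (the toy text is illustrative):

```python
from collections import Counter

def rank_frequency(tokens):
    """Return a list of (rank, word, frequency), with rank 1 = most frequent word."""
    counts = Counter(tokens)
    ranked = counts.most_common()   # sorted by decreasing frequency
    return [(r + 1, w, f) for r, (w, f) in enumerate(ranked)]

tokens = "to be or not to be that is the question to be".split()
for rank, word, freq in rank_frequency(tokens)[:3]:
    print(rank, word, freq)   # ranks 1 and 2: 'to' and 'be', 3 occurrences each
```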

  28. Plot of word frequencies, linear scale • [Figure axes: frequency (in 10,000s) vs. rank (in 10,000s)]

  29. Plot of word frequencies, zoom in

  30. Plot of word frequencies, log-log scale • log₁₀(10) = 1, log₁₀(100) = 2, log₁₀(1000) = 3

  31. Plot of word frequencies, log-log scale • [Figure axes: frequency (log scale) vs. rank (log scale)] • 10 most-frequent words: freq. > 10,000 • Next 90 words: 1,000 < freq. < 10,000 • Next 900 words: 100 < freq. < 1,000 • Next 9,000 words: 10 < freq. < 100 • 25,000 words: 1 <= freq. < 10

  32. Word frequency distributions in language • Exemplifies a power law distribution • For any corpus and any language: • There are a few very common words • A substantial number of medium freq. words • A huge number of low frequency words • Brown corpus • 1,160,743 tokens, 49,680 types • 21,919 types (44.1%) appear only once!
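A quick check of the hapax percentage quoted on this slide, using the reported counts:

```python
types_total = 49_680   # word types in the Brown corpus (from the slide)
types_once  = 21_919   # types that occur exactly once (hapax legomena)
print(f"{types_once / types_total:.1%}")   # 44.1%
```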

  33. Word frequencies follow Zipf’s law • Example of a power law distribution • Zipf’s Law: the frequency F of a word w is inversely proportional to the rank R of w: F ∝ 1 / R, i.e., F × R = k for some constant k • Example: the 50th most common word type should occur three times as frequently as the 150th most common word type • freq. at rank 50: ∝ 1 / 50 • freq. at rank 150: ∝ 1 / 150 • (1 / 50) / (1 / 150) = 3
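A minimal sketch of what Zipf’s Law predicts numerically: rank times frequency should stay roughly constant, and the rank-50 vs. rank-150 ratio comes out to 3 (the frequency list is an idealized illustration, not Brown counts):

```python
def zipf_products(freqs):
    """Given frequencies sorted in decreasing order, return rank * frequency per rank."""
    return [(rank + 1) * f for rank, f in enumerate(freqs)]

# idealized Zipfian frequencies with k = 30,000: F = k / R
freqs = [30_000 // r for r in range(1, 6)]
print(zipf_products(freqs))      # roughly constant: [30000, 30000, 30000, 30000, 30000]
print((1 / 50) / (1 / 150))      # 3.0, the rank-50 vs. rank-150 ratio from the slide
```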

  34. Zipf’s Law explains the linear-like relationship between freq. and rank on a log-log scale • [Figure annotation: red line = constant k in Zipf’s Law]

  35. Most-frequent words

  36. Some words with a frequency of 20 • replacement • repair • relating • rehabilitation • refund • receives • ranks • queen • quarrel • puts • pursue • purchased • punishment • promises • procurement • probabilities • precious • pitcher • pitch

  37. Some words with a frequency of 1 • gorgeously • gorge • gooshey • goooolick • goofed • gooey • goody • goodness' • goodies • good-will • government-controlled • gouverne • goutte • gourmets • gourmet's • gothic • gossiping • gossiped • gossamer • gosh

  38. Consequences of Zipf’s Law • Highly skewed distribution of linguistic forms • Words, bigrams, just about any construction • Makes sparse data problem worse • Previous section on sparse data problem assumed uniform distribution of words • Reality: expected frequencies are low, except for most common N-grams • For example, consider: • Expected freq. of two common words appearing as a bigram • Expected freq. of two rare words appearing as a bigram
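A rough sketch of the last two bullets under an independence assumption: the expected bigram count is the product of the two unigram probabilities times the corpus size, so a bigram of two rare words is essentially never expected (the word probabilities below are illustrative, reusing the “with” count from the earlier slide):

```python
def expected_bigram_count(p_w1, p_w2, n_tokens):
    """Expected count of the bigram 'w1 w2' if words occurred independently."""
    return p_w1 * p_w2 * n_tokens

n = 1_200_000               # Brown corpus tokens
common = 7_270 / n          # a word as frequent as "with" (count from slide 26)
rare   = 1 / n              # a word that occurs once (hapax legomenon)

print(round(expected_bigram_count(common, common, n), 1))   # ≈ 44.0 expected occurrences
print(expected_bigram_count(rare, rare, n))                 # ≈ 8.3e-07 expected occurrences
```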

  39. Another consequence of Zipf’s Law: unknown words • Many possible words will not appear in a corpus • Sparse data problem even for word unigrams • Some common words not in the Brown Corpus: combustible, parabola, preprocess, headquartering, deodorizer, deodorizers, usurps, usurping • Neologisms (newly-formed words) will not be in a corpus either
