
Statistical Natural Language Processing



Presentation Transcript


  1. Statistical Natural Language Processing Lecture 5 5/17/2011

  2. Recommended reading • Zipf’s Law • Manning & Schütze 1.4.2-1.4.3 • Smoothing • Jurafsky & Martin 4.5

  3. Outline • Counting features in corpora • Difficulties • Sparse data problem • Zipf’s Law • Dealing with sparse data • Unknown features • Smoothing

  4. Review: N-gram • An N-gram is a sequence of length N of some unit, such as words or POS tags • N = 1: unigram N = 3: trigram • N = 2: bigram N = 4: 4-gram • Example: The quick brown fox • 4 (word) unigrams: The, quick, brown, fox • 3 bigrams: The quick, quick brown, brown fox • 2 trigrams: The quick brown, quick brown fox • 1 4-gram: The quick brown fox
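A minimal sketch of N-gram extraction in Python (the helper name and example usage are illustrative, not from the slides):

```python
def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["The", "quick", "brown", "fox"]
print(ngrams(tokens, 1))  # 4 unigrams
print(ngrams(tokens, 2))  # 3 bigrams: ('The','quick'), ('quick','brown'), ('brown','fox')
print(ngrams(tokens, 3))  # 2 trigrams
print(ngrams(tokens, 4))  # 1 4-gram
```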

  5. Last time • Formulate NLP task as machine learning problem • Instances • Labels • Features • Annotated corpus for training classifier • For instances of interest, count occurrences of features • [Diagram: feature functions F mapping training instances X to labels Y]

  6. Training and testing sets • Get an annotated corpus • Split into training and testing sets • Proportions: in NLP, often 90% / 10% • Training set • Count occurrences of features for labeled instances • Train a classifier • Test set • Apply classifier to this data • Use annotations to evaluate performance • [Diagram: corpus divided into a training set and a testing set]
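A minimal sketch of the 90% / 10% split described above, assuming the annotated corpus is already a list of (instance, label) pairs; the function and variable names are illustrative:

```python
import random

def split_corpus(annotated, train_frac=0.9, seed=0):
    """Shuffle the annotated corpus and split it into training and testing sets."""
    data = list(annotated)
    random.Random(seed).shuffle(data)
    cut = int(len(data) * train_frac)
    return data[:cut], data[cut:]

# hypothetical usage: corpus is a list of (instance, label) pairs
# train_set, test_set = split_corpus(corpus)
```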

  7. Develop classifier (prediction model) • Training: • Read features for labeled instances • Train classifier: find a mapping from feature vectors to labels • Testing: • Read features for instances • Apply classifier to predict labels

  8. Let’s say we’re doing POS tagging of wi • Training data: … ti-3 ti-2 ti-1 ti ti+1 ti+2 ti+3 … … wi-3 wi-2 wi-1 wi wi+1 wi+2 wi+3 … • Label of wi is ti • Choose features to try to predict the tag of a word

  9. Features for POS tagging • Training data: … ti-3 ti-2 ti-1 ti ti+1 ti+2 ti+3 … … wi-3 wi-2 wi-1 wi wi+1 wi+2 wi+3 … • Possibly relevant features: • Current word, previous word, next word • Previous word bigram, next word bigram • Previous POS tag, next POS tag • Previous POS bigram, next POS bigram • Previous POS trigram, next POS trigram • Combinations of features

  10. Example: “combination of features” as a feature • Training data: … ti-3 ti-2 ti-1 ti ti+1 ti+2 ti+3 … … wi-3 wi-2 wi-1 wi wi+1 wi+2 wi+3 … • Example: the word and POS tag at a particular position • (wi-1, ti-1) • Not the same as the set of both individual features • wi-1 alone is a different feature • ti-1 alone is a different feature

  11. Features are useful for prediction • Example: • Guess the tag of the word brown • The/DT quick/JJ brown • A classifier can use an entire vector of features for an instance Xi to make its prediction • Features for above example: • prev_w=quick • prev_bigram=the_quick • prev_pos=JJ • prev_pos_bigram=DT_JJ
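A minimal sketch of computing that feature vector for one instance; the feature names mirror the slide, while the helper itself is an assumption about how extraction might be organized:

```python
def pos_features(words, tags, i):
    """Features for predicting the tag of words[i], using only the left context (assumes i >= 2)."""
    return {
        "prev_w": words[i - 1],
        "prev_bigram": "_".join(words[i - 2:i]),
        "prev_pos": tags[i - 1],
        "prev_pos_bigram": "_".join(tags[i - 2:i]),
    }

words = ["The", "quick", "brown"]
tags = ["DT", "JJ", None]   # the tag of "brown" is what we want to predict
print(pos_features(words, tags, 2))
# {'prev_w': 'quick', 'prev_bigram': 'The_quick', 'prev_pos': 'JJ', 'prev_pos_bigram': 'DT_JJ'}
```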

  12. Counting features in training data • A learning algorithm will use frequencies of features and tags from the training data in order to learn a predictive relationship between features and labels • Toy example:

  Feature            Label  Frequency
  prev_bigram=DT_JJ  NN     40
  prev_bigram=DT_JJ  NNS    15
  prev_bigram=DT_JJ  JJ     10
  prev_bigram=DT_JJ  IN     2
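A minimal sketch of producing such a table with collections.Counter, assuming the training data has already been featurized into (feature dict, label) pairs; the toy data below is illustrative:

```python
from collections import Counter

def feature_label_counts(featurized_data):
    """Count how often each (feature=value, label) pair occurs in the training data."""
    counts = Counter()
    for features, label in featurized_data:
        for name, value in features.items():
            counts[(f"{name}={value}", label)] += 1
    return counts

# hypothetical toy data: (feature dict, label) pairs
data = [({"prev_bigram": "DT_JJ"}, "NN"),
        ({"prev_bigram": "DT_JJ"}, "NN"),
        ({"prev_bigram": "DT_JJ"}, "NNS")]
for (feature, label), freq in feature_label_counts(data).most_common():
    print(feature, label, freq)
```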

  13. How do we pick features? • Use careful linguistic analysis: • Specify exactly the features that are relevant • Use very detailed features • Very similar to rule-based approach • Problem: very hard to think of all possible cases • “Knowledge-light” approach: • Use a template to construct a large number of features • Let the algorithm decide which are relevant, according to their frequencies with diff. labels in the training data

  14. Example • Careful approach: • Have a prev_bigram=x feature for all bigrams x that you think will actually matter for prediction • Knowledge-light: • Construct a feature prev_bigram=x for all bigrams in the corpus • Many of these will be linguistically irrelevant • Expect that relevant features will have a much stronger statistical association with particular labels in the training data, compared to irrelevant features
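A minimal sketch of the knowledge-light template: generate a prev_bigram=x feature for every bigram observed in the corpus rather than hand-picking them (the function and toy corpus are illustrative):

```python
def bigram_feature_vocabulary(sentences):
    """Construct the set of prev_bigram=x features for all bigrams in the corpus."""
    features = set()
    for words in sentences:
        for w1, w2 in zip(words, words[1:]):
            features.add(f"prev_bigram={w1}_{w2}")
    return features

corpus = [["The", "quick", "brown", "fox"], ["A", "quick", "decision"]]
print(sorted(bigram_feature_vocabulary(corpus)))
```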

  15. Outline • Counting features in corpora • Difficulties • Sparse data problem • Zipf’s Law • Dealing with sparse data • Unknown features • Smoothing

  16. Additional context is useful for prediction • Examples • Task: tag the current word • up { IN, RP } 0 words of context • look up { IN, RP } 1 word of context • to look up { IN } 2 words of context • Task: predict POS tag of the next word • to __ { DT, VB, RB } 1 word of context • want to __ { VB, RB } 2 words of context • never want to __ { VB? } 3 words of context • So, it seems that such “larger” features are better

  17. Maybe we could use a lot more features: • Current word, previous word, next word • Previous word bigram, next word bigram • Previous word trigram, next word trigram • Previous word 4-gram, next word 4-gram • Previous word 5-gram, next word 5-gram • Previous POS tag, next POS tag • Previous POS bigram, next POS bigram • Previous POS trigram, next POS trigram • Previous POS 4-gram, next POS 4-gram • Previous POS 5-gram, next POS 5-gram • …

  18. Uh-oh: sparse data problem • For many types of features, there will often be zero occurrences of logically possible feature values in a corpus • Example: quick brown beaver • Not found in the Brown corpus (1.2 million words) • Not found by Google either!

  19. Sparse data causes a problem when applying classifier • Suppose you have an instance Xi that you want to classify. Compute its features. • Suppose it has a feature f(Xi) that was unseen in the training data. • How can we use this feature to predict a label Yi? • We have no knowledge of the relationship between f(Xi) and any labels: P( Yk | unseen-feature ) = divide by zero! • Example: The quick brown beaver jumped • quick brown beaver occurs zero times in training data • Not possible to predict a tag for jumped using this feature
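A minimal sketch of why the relative-frequency estimate breaks down for an unseen feature: its denominator is a zero count, so there is nothing to divide by (the counts and names below are illustrative, reusing the toy table from slide 12):

```python
from collections import Counter

# counts gathered from training data (illustrative toy numbers, 67 = 40 + 15 + 10 + 2)
feature_counts = Counter({"prev_bigram=DT_JJ": 67})
feature_label_counts = Counter({("prev_bigram=DT_JJ", "NN"): 40})

def p_label_given_feature(label, feature):
    """Relative-frequency estimate of P(label | feature) from training counts."""
    if feature_counts[feature] == 0:
        raise ZeroDivisionError(f"{feature!r} was never seen in training data")
    return feature_label_counts[(feature, label)] / feature_counts[feature]

print(p_label_given_feature("NN", "prev_bigram=DT_JJ"))            # ≈ 0.597
# p_label_given_feature("VBD", "prev_trigram=quick_brown_beaver")  # unseen feature -> error
```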

  20. Sparse data results from combinatorics • As the structures to be counted in a corpus get more and more complicated, fewer occurrences are expected to be seen • Example: exponential growth of # of word N-grams • Brown corpus: ~50,000 word types • Number of logically possible word N-grams • N = 1: 50,000 different unigrams • N = 2: 50,000² = 2.5 billion different bigrams • N = 3: 50,000³ = 125 trillion different trigrams

  21. Expected frequency of an N-gram (Brown: 1.2 million tokens, 50k word types) • Suppose N-grams are uniformly distributed • Calculate expected frequencies of N-grams: • 50,000 word unigrams • 1.2 million / 50,000 = 24 occurrences • 2.5 billion word bigrams • 1.2 million / 2.5 billion ≈ 0.00048 occurrences • 125 trillion word trigrams: • 1.2 million / 125 trillion ≈ 0.0000000096 occurrences • Conclusion: for word N-grams, we probably shouldn’t use N > 1, otherwise it will lead to data sparsity
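The arithmetic on this slide as a short sketch, assuming N-grams were uniformly distributed over all logically possible combinations:

```python
tokens = 1_200_000      # word tokens in the Brown corpus
types  = 50_000         # word types

for n in (1, 2, 3):
    possible = types ** n          # logically possible word N-grams
    expected = tokens / possible   # expected frequency under a uniform distribution
    print(f"N={n}: {possible:,} possible N-grams, expected frequency {expected:.2g}")
```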

  22. Actual counts of different units from the Brown corpus (1.2 million tokens)

  23. What features should we choose? • Features that may be relevant to the task (using your linguistic knowledge), and don’t lead to data sparsity

  24. Outline • Counting features in corpora • Difficulties • Sparse data problem • Zipf’s Law • Dealing with sparse data • Unknown features • Smoothing

  25. Probability distributions in corpora • Previous example made the assumption that N-grams are uniformly distributed in a corpus • Leads to sparse data for sufficiently high N • However, in actual corpora, many distributions of natural language events appear to follow Zipf’s Law • Makes the sparse data problem even worse

  26. Frequencies of words in a corpus: types and tokens • Brown corpus: • 1,160,743 word tokens • 49,680 word types • Type: a distinct word • “with” • Token: an individual occurrence of a word • “with” occurs 7,270 times
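A minimal sketch of the type/token distinction (the toy sentence is illustrative; counting a real corpus works the same way):

```python
from collections import Counter

def type_token_stats(tokens):
    """Return the number of tokens, the number of types, and per-type counts."""
    counts = Counter(tokens)
    return len(tokens), len(counts), counts

tokens = "the quick brown fox jumped over the lazy dog with the fox".split()
n_tokens, n_types, counts = type_token_stats(tokens)
print(n_tokens, n_types)   # 12 tokens, 9 types
print(counts["the"])       # 3 tokens of the type "the"
```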

  27. Frequency and rank • Sort words by decreasing frequency • Rank = order in sorted list • Rank 1: most-frequent word • Rank 2: second most-frequent word • etc. • Plot word frequencies by rank
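A minimal sketch of building the rank/frequency list that the following plots are drawn from (the toy text is illustrative):

```python
from collections import Counter

def rank_frequency(tokens):
    """Return a list of (rank, word, frequency), with rank 1 = most frequent word."""
    counts = Counter(tokens)
    ranked = counts.most_common()   # sorted by decreasing frequency
    return [(r + 1, w, f) for r, (w, f) in enumerate(ranked)]

tokens = "to be or not to be that is the question to be".split()
for rank, word, freq in rank_frequency(tokens)[:3]:
    print(rank, word, freq)   # ranks 1 and 2: 'to' and 'be', 3 occurrences each
```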

  28. Plot of word frequencies, linear scale • [Figure axes: frequency (in 10,000s) vs. rank (in 10,000s)]

  29. Plot of word frequencies, zoom in

  30. Plot of word frequencies, log-log scale • log₁₀(10) = 1, log₁₀(100) = 2, log₁₀(1000) = 3

  31. Plot of word frequencies, log-log scale • [Figure axes: frequency (log scale) vs. rank (log scale)] • 10 most-frequent words: freq. > 10,000 • Next 90 words: 1,000 < freq. < 10,000 • Next 900 words: 100 < freq. < 1,000 • Next 9,000 words: 10 < freq. < 100 • 25,000 words: 1 <= freq. < 10

  32. Word frequency distributions in language • Exemplifies a power law distribution • For any corpus and any language: • There are a few very common words • A substantial number of medium freq. words • A huge number of low frequency words • Brown corpus • 1,160,743 tokens, 49,680 types • 21,919 types (44.1%) appear only once!
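A quick check of the hapax percentage quoted on this slide, using the reported counts:

```python
types_total = 49_680   # word types in the Brown corpus (from the slide)
types_once  = 21_919   # types that occur exactly once (hapax legomena)
print(f"{types_once / types_total:.1%}")   # 44.1%
```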

  33. Word frequencies follow Zipf’s law • Example of a power law distribution • Zipf’s Law: the frequency F of a word w is inversely proportional to the rank R of w: F ∝ 1 / R, i.e., F × R = k for some constant k • Example: the 50th most common word type should occur three times as frequently as the 150th most common word type • freq. at rank 50: ∝ 1 / 50 • freq. at rank 150: ∝ 1 / 150 • (1 / 50) / (1 / 150) = 3
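A minimal sketch of what Zipf’s Law predicts numerically: rank times frequency should stay roughly constant, and the rank-50 vs. rank-150 ratio comes out to 3 (the frequency list is an idealized illustration, not Brown counts):

```python
def zipf_products(freqs):
    """Given frequencies sorted in decreasing order, return rank * frequency per rank."""
    return [(rank + 1) * f for rank, f in enumerate(freqs)]

# idealized Zipfian frequencies with k = 30,000: F = k / R
freqs = [30_000 // r for r in range(1, 6)]
print(zipf_products(freqs))      # roughly constant: [30000, 30000, 30000, 30000, 30000]
print((1 / 50) / (1 / 150))      # 3.0, the rank-50 vs. rank-150 ratio from the slide
```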

  34. Zipf’s Law explains the linear-like relationship between freq. and rank on a log-log scale • [Figure annotation: red line = constant k in Zipf’s Law]

  35. Most-frequent words

  36. Some words with a frequency of 20 • replacement • repair • relating • rehabilitation • refund • receives • ranks • queen • quarrel • puts • pursue • purchased • punishment • promises • procurement • probabilities • precious • pitcher • pitch

  37. Some words with a frequency of 1 • gorgeously • gorge • gooshey • goooolick • goofed • gooey • goody • goodness' • goodies • good-will • government-controlled • gouverne • goutte • gourmets • gourmet's • gothic • gossiping • gossiped • gossamer • gosh

  38. Consequences of Zipf’s Law • Highly skewed distribution of linguistic forms • Words, bigrams, just about any construction • Makes sparse data problem worse • Previous section on sparse data problem assumed uniform distribution of words • Reality: expected frequencies are low, except for most common N-grams • For example, consider: • Expected freq. of two common words appearing as a bigram • Expected freq. of two rare words appearing as a bigram
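A rough sketch of the last two bullets under an independence assumption: the expected bigram count is the product of the two unigram probabilities times the corpus size, so a bigram of two rare words is essentially never expected (the word probabilities below are illustrative, reusing the “with” count from the earlier slide):

```python
def expected_bigram_count(p_w1, p_w2, n_tokens):
    """Expected count of the bigram 'w1 w2' if words occurred independently."""
    return p_w1 * p_w2 * n_tokens

n = 1_200_000               # Brown corpus tokens
common = 7_270 / n          # a word as frequent as "with" (count from slide 26)
rare   = 1 / n              # a word that occurs once (hapax legomenon)

print(round(expected_bigram_count(common, common, n), 1))   # ≈ 44.0 expected occurrences
print(expected_bigram_count(rare, rare, n))                 # ≈ 8.3e-07 expected occurrences
```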

  39. Another consequence of Zipf’s Law: unknown words • Many possible words will not appear in a corpus • Sparse data problem even for word unigrams • Some common words not in the Brown Corpus: combustible, parabola, preprocess, headquartering, deodorizer, deodorizers, usurps, usurping • Neologisms (newly-formed words) will not be in a corpus either
