Natural Language Processing
This lecture by Jim Martin dives into advanced language modeling techniques essential for Natural Language Processing (NLP). It covers topics such as N-grams, various smoothing methods like Laplace and Good-Turing smoothing, and challenges related to zero counts in language models. The lecture discusses how to estimate probabilities for unseen bigrams and the impact of zero counts on analysis and generation. Participants will gain insights into effective smoothing strategies, addressing zero frequencies, and understanding the implications of Zipf’s Law in language data.
Presentation Transcript
Natural Language Processing Lecture 7—9/19/2013 Jim Martin
Today • More Language modeling (N-grams) • Smoothing • Finish Good-Turing • Pretty good smoothing • Bayesian prior smoothing • Word classes • Part of speech tagging Speech and Language Processing - Jurafsky and Martin
Smoothing: Dealing with Zero Counts • Back to Shakespeare • Recall that Shakespeare produced 300,000 bigram types out of V^2 = 844 million possible bigrams... • So 99.96% of the possible bigrams were never seen (have zero entries in the table) • Does that mean that any sentence that contains one of those bigrams should have a probability of 0? • For generation (the Shannon game) it means we’ll never emit those bigrams • But for analysis it’s problematic, because if we run across a new bigram in the future we have no choice but to assign it a probability of zero.
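The 99.96% figure follows directly from the two numbers quoted on the slide. A minimal sketch of the arithmetic (the variable and function names are illustrative, not from the lecture):

```python
# Figures quoted on the slide: observed bigram types and V^2 possible bigrams.
observed_bigram_types = 300_000
possible_bigrams = 844_000_000  # V^2, as stated on the slide


def unseen_fraction(observed: int, possible: int) -> float:
    """Fraction of possible bigrams that never occur in the corpus."""
    return 1.0 - observed / possible


fraction = unseen_fraction(observed_bigram_types, possible_bigrams)
print(f"{fraction:.2%} of possible bigrams have a zero count")
```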
Zero Counts • Some of those zeros are really zeros... • Things that really aren’t ever going to happen • Fewer of these than you might think • On the other hand, some of them are just rare events. • If the training corpus had been a little bigger they would have had a count • What would that count be in all likelihood?
Zero Counts • Zipf’s Law (long tail phenomenon) • A small number of events occur with high frequency • A large number of events occur with low frequency • You can quickly collect statistics on the high frequency events • You might have to wait an arbitrarily long time to get good statistics on low frequency events • Result • Our estimates are necessarily sparse! We have no counts at all for the vast number of events we want to estimate. • Answer • Estimate the likelihood of unseen (zero count) N-grams!
Laplace Smoothing • Also called Add-One smoothing • Just add one to all the counts! • Very simple • MLE estimate: • Laplace estimate: • Reconstructed counts:
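The three equations on this slide did not survive transcription. The sketch below uses the standard bigram forms as an assumption about what was shown: MLE is c(prev, w) / c(prev), the Laplace estimate is (c(prev, w) + 1) / (c(prev) + V), and the reconstructed count is the Laplace probability multiplied back by c(prev). The toy sentence and all function names are made up for illustration.

```python
from collections import Counter

# Toy corpus, invented for illustration only.
tokens = "i want to eat i want chinese food".split()

vocab = sorted(set(tokens))
V = len(vocab)                               # vocabulary size
history = Counter(tokens[:-1])               # c(prev): times a word starts a bigram
bigrams = Counter(zip(tokens, tokens[1:]))   # c(prev, w)


def mle_prob(prev: str, word: str) -> float:
    """Maximum likelihood estimate: relative frequency of the bigram."""
    return bigrams[(prev, word)] / history[prev]


def laplace_prob(prev: str, word: str) -> float:
    """Add-one estimate: every bigram count is incremented by 1."""
    return (bigrams[(prev, word)] + 1) / (history[prev] + V)


def reconstructed_count(prev: str, word: str) -> float:
    """Count that would give the Laplace probability under plain MLE."""
    return laplace_prob(prev, word) * history[prev]


print(mle_prob("want", "to"), laplace_prob("want", "to"))
print(laplace_prob("want", "food"))   # unseen bigram now gets nonzero mass
print(reconstructed_count("want", "to"))
```

Note how the seen bigram "want to" loses half its probability in this tiny example, because the added ones are spread over every word in the vocabulary.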
BERP Bigram Counts
Laplace-Smoothed Bigram Counts
Laplace-Smoothed Bigram Probabilities
Reconstituted Counts
Reconstituted Counts (2)
Big Change to the Counts! • C(want to) went from 608 to 238! • P(to|want) from .66 to .26! • Discount d = c*/c • d for “chinese food” = .10!!! A 10x reduction • So in general, Laplace is a blunt instrument • Could use a more fine-grained method (add-k) • But Laplace smoothing is not generally used for N-grams, as we have much better methods • Despite its flaws, Laplace (add-k) is still used to smooth other probabilistic models in NLP, especially • For pilot studies • In document classification • In information retrieval • In domains where the number of zeros isn’t so huge.
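The numbers on this slide can be reproduced with a few lines of arithmetic. The BERP tables themselves are not in this transcript, so the counts below (c(want) = 927, c(chinese) = 158, c(chinese food) = 82, V = 1446) are taken from the Jurafsky and Martin textbook tables and should be read as assumptions; the function name is illustrative.

```python
# Counts assumed from the textbook's BERP tables (not reproduced in this transcript).
VOCAB_SIZE = 1446


def add_one_summary(bigram_count: int, history_count: int, vocab_size: int = VOCAB_SIZE):
    """Return (MLE prob, Laplace prob, reconstituted count c*, discount d = c*/c)."""
    p_mle = bigram_count / history_count
    p_laplace = (bigram_count + 1) / (history_count + vocab_size)
    c_star = p_laplace * history_count
    discount = c_star / bigram_count
    return p_mle, p_laplace, c_star, discount


want_to = add_one_summary(bigram_count=608, history_count=927)
chinese_food = add_one_summary(bigram_count=82, history_count=158)

print("want to:      P_mle=%.2f  P_laplace=%.2f  c*=%.0f  d=%.2f" % want_to)
print("chinese food: P_mle=%.2f  P_laplace=%.2f  c*=%.1f  d=%.2f" % chinese_food)
```

The discount is harshest for words with small history counts, because V = 1446 dominates the denominator.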
Fun with Unix • Thanks to Ken Church • Unix for Poets
Better Smoothing • Intuition used by many smoothing algorithms • Good-Turing • Kneser-Ney • Witten-Bell • The intuition: use the count of things we’ve seen once to help estimate the count of things we’ve never seen
One Fish Two Fish • Imagine you are fishing • There are 8 species: carp, perch, whitefish, trout, salmon, eel, catfish, bass • Not sure where this fishing hole is... • So far you have caught • 10 carp, 3 perch, 2 whitefish, 1 trout, 1 salmon, 1 eel = 18 fish • How likely is it that the next fish to be caught is an eel? • How likely is it that the next fish caught will be a member of a newly seen species? • Now how likely is it that the next fish caught will be an eel? Slide adapted from Josh Goodman
Good-Turing • Notation: N_x is the frequency of frequency x • So N_10 = 1 • Number of fish species seen 10 times is 1 (carp) • N_1 = 3 • Number of fish species seen once is 3 (trout, salmon, eel) • To estimate the probability of an unseen species • Use the number of species (words) we’ve seen once • c_0* = c_1 • p_0 = N_1/N = 3/18 • All other estimates are adjusted downward to account for unseen probabilities • c*(eel) = c*(1) = (1+1) × N_2/N_1 = 2 × 1/3 = 2/3 Slide from Josh Goodman
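The fishing example can be checked mechanically. This is a minimal sketch of unsmoothed Good-Turing, using the re-estimate c* = (c+1) × N_{c+1} / N_c and p_0 = N_1/N from the slide; the function and variable names are illustrative.

```python
from collections import Counter
from fractions import Fraction

# Catch so far, as listed on the fishing slide.
catch = {"carp": 10, "perch": 3, "whitefish": 2, "trout": 1, "salmon": 1, "eel": 1}

total = sum(catch.values())              # N = 18 fish
freq_of_freq = Counter(catch.values())   # N_c: number of species seen exactly c times


def unseen_mass() -> Fraction:
    """Probability reserved for all unseen species together: N_1 / N."""
    return Fraction(freq_of_freq[1], total)


def adjusted_count(c: int) -> Fraction:
    """Good-Turing re-estimated count c* = (c + 1) * N_{c+1} / N_c."""
    return Fraction((c + 1) * freq_of_freq[c + 1], freq_of_freq[c])


p_new_species = unseen_mass()
p_eel = adjusted_count(catch["eel"]) / total

print("P(new species) =", p_new_species)
print("c*(eel) =", adjusted_count(1), " P(eel) =", p_eel)
```

With counts this sparse the raw formula breaks down quickly: `adjusted_count(3)` is 0 because N_4 = 0, which is the problem the "GT Complications" slide addresses.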
Bigram Frequencies of Frequencies and GT Re-estimates
Bigram Frequencies of Frequencies and GT Re-estimates • 3* = (3+1) × N_4/N_3 = 4 × (381/642) = 4 × .593 = 2.37
GT Smoothed Bigram Probabilities
GT Complications • In practice, assume large counts (c > k for some k) are reliable: • Also, we need all the N_k to be non-zero, so we need to smooth (interpolate) the N_k counts before computing c* from them
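The equation that followed "are reliable:" is missing from the transcript. One commonly cited form, due to Katz and given in the Jurafsky and Martin textbook, is sketched below as an assumption about what the slide showed: counts above the threshold k are left alone, and counts from 1 to k are re-estimated with a correction so that the total probability mass still sums correctly.

```latex
c^* =
\begin{cases}
  c, & c > k \\[6pt]
  \dfrac{(c+1)\,\dfrac{N_{c+1}}{N_c} \;-\; c\,\dfrac{(k+1)\,N_{k+1}}{N_1}}
        {1 \;-\; \dfrac{(k+1)\,N_{k+1}}{N_1}}, & 1 \le c \le k
\end{cases}
```

A small threshold such as k = 5 is the value usually suggested for this scheme.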
Pretty Good Smoothing • Maximum Likelihood Estimation • Laplace Smoothing • Bayesian prior Smoothing
Pretty Good Smoothing • Bayesian prior smoothing • Why is there a 1 here?
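The formula on this slide is missing from the transcript. A plausible reading, stated here as an assumption, is the unigram-prior estimate P(w | prev) = (c(prev, w) + P(w)) / (c(prev) + 1): instead of adding 1 to every bigram count, add the word's unigram probability. The 1 in the denominator then appears because the unigram probabilities added across the whole vocabulary sum to 1. A minimal sketch on an invented toy corpus:

```python
from collections import Counter

# Toy corpus, invented for illustration only.
tokens = "i want to eat i want chinese food".split()

vocab = sorted(set(tokens))
unigrams = Counter(tokens)
history = Counter(tokens[:-1])               # c(prev)
bigrams = Counter(zip(tokens, tokens[1:]))   # c(prev, w)


def unigram_prob(word: str) -> float:
    """Prior P(w): relative frequency of the word in the corpus."""
    return unigrams[word] / len(tokens)


def prior_smoothed_prob(prev: str, word: str) -> float:
    """Assumed form: (c(prev, w) + P(w)) / (c(prev) + 1)."""
    return (bigrams[(prev, word)] + unigram_prob(word)) / (history[prev] + 1)


print(prior_smoothed_prob("want", "to"))     # seen bigram
print(prior_smoothed_prob("want", "food"))   # unseen bigram, backed by its unigram prior
print(sum(prior_smoothed_prob("want", w) for w in vocab))  # should be 1 up to rounding
```

Unlike add-one, the mass given to an unseen bigram here depends on how common the second word is.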
Toolkits • With FSAs/FSTs... • Openfst.org • For language modeling • SRILM • SRI Language Modeling Toolkit • All the bells and whistles you can imagine
Break • HW Questions?
Break • Quiz is Thursday Oct 3. • Chapters 1 to 6 • I’ll post specific readings (when enough people remind (nag) me)
Back to Some Linguistics
Word Classes: Parts of Speech • 8 (ish) traditional parts of speech • Noun, verb, adjective, preposition, adverb, article, interjection, pronoun, conjunction, etc. • Also known as • parts-of-speech, lexical categories, word classes, morphological classes, lexical tags... • Lots of debate within the linguistics and cognitive science communities about the number, nature, and universality of these • We’ll completely ignore this debate
POS examples • N (noun): chair, bandwidth, pacing • V (verb): study, debate, munch • ADJ (adjective): purple, tall, ridiculous • ADV (adverb): unfortunately, slowly • P (preposition): of, by, to • PRO (pronoun): I, me, mine • DET (determiner): the, a, that, those
POS Tagging • The process of assigning a part-of-speech marker to each word in some text. • Example (WORD/tag): the/DET koala/N put/V the/DET keys/N on/P the/DET table/N
Why POS Tagging is Useful • First step of a vast number of practical tasks • Speech synthesis • How to pronounce “lead”? • INsult inSULT • OBject obJECT • OVERflow overFLOW • DIScount disCOUNT • CONtent conTENT • Parsing • Helpful to know parts of speech before you start parsing • Analogy to lex/yacc (flex/bison) • Information extraction • Finding names, relations, etc. • Machine Translation
Open and Closed Classes • Closed class: a small(ish) fixed membership • Usually function words (short common words which play a role in grammar) • Open class: new ones can be created all the time • English has 4: Nouns, Verbs, Adjectives, Adverbs • Many languages have these 4, but not all! • Nouns are typically where the bulk of the action is with respect to new items
Open Class Words • Nouns • Proper nouns (Boulder, Granby, Beyoncé, Cairo) • English capitalizes these • Common nouns (the rest) • Count nouns and mass nouns • Count: have plurals, get counted: goat/goats, one goat, two goats • Mass: don’t get counted (snow, salt, communism) (*two snows) • Adverbs: tend to modify things • Unfortunately, John walked home extremely slowly yesterday • Directional/locative adverbs (here, home, downhill) • Degree adverbs (extremely, very, somewhat) • Manner adverbs (slowly, slinkily, delicately) • Verbs • In English, have morphological affixes (eat/eats/eaten) • With differing patterns of regularity
Closed Class Words Examples: • prepositions: on, under, over, … • particles: up, down, on, off, … • determiners: a, an, the, … • pronouns: she, who, I, … • conjunctions: and, but, or, … • auxiliary verbs: can, may, should, … • numerals: one, two, three, third, …
Prepositions from CELEX
English Particles
Conjunctions
POS Tagging: Choosing a Tagset • There are many potential distinctions we can draw, leading to potentially large tagsets • To do POS tagging, we need to choose a standard set of tags to work with • Could pick very coarse tagsets • N, V, Adj, Adv • A more commonly used set is the finer-grained “Penn TreeBank tagset”, with 45 tags • PRP$, WRB, WP$, VBG • Even more fine-grained tagsets exist
Penn TreeBank POS Tagset
POS Tagging • Words often have more than one POS: back • The back door = JJ • On my back = NN • Win the voters back = RB • Promised to back the bill = VB • The POS tagging problem is to determine the POS tag for a particular instance of a word.
How Hard is POS Tagging? Measuring Ambiguity
Two Methods for POS Tagging • Rule-based tagging • See the text • Stochastic • Probabilistic sequence models • HMM (Hidden Markov Model) tagging • MEMMs (Maximum Entropy Markov Models)
POS Tagging as Sequence Classification • We are given a sentence (an “observation” or “sequence of observations”) • Secretariat is expected to race tomorrow • What is the best sequence of tags that corresponds to this sequence of observations? • Probabilistic view: • Consider all possible sequences of tags • Out of this universe of sequences, choose the tag sequence which is most probable given the observation sequence of n words w1…wn.
Getting to HMMs • We want, out of all sequences of n tags t1…tn, the single tag sequence such that P(t1…tn|w1…wn) is highest. • Hat ^ means “our estimate of the best one” • argmax_x f(x) means “the x such that f(x) is maximized”
Getting to HMMs • This equation is guaranteed to give us the best tag sequence • But how to make it operational? How to compute this value? • Intuition of Bayesian inference • Use Bayes rule to transform this equation into a set of other probabilities that are easier to compute
Using Bayes Rule
Likelihood and Prior
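The equations on the "Using Bayes Rule" and "Likelihood and Prior" slides were images and are missing here. The standard derivation they refer to can be sketched as follows, assuming the usual HMM simplifications (each word depends only on its own tag, and each tag depends only on the previous tag):

```latex
\begin{align*}
\hat{t}_1^n
  &= \operatorname*{argmax}_{t_1^n} P(t_1^n \mid w_1^n) \\
  &= \operatorname*{argmax}_{t_1^n} \frac{P(w_1^n \mid t_1^n)\, P(t_1^n)}{P(w_1^n)}
     && \text{(Bayes rule)} \\
  &= \operatorname*{argmax}_{t_1^n}
     \underbrace{P(w_1^n \mid t_1^n)}_{\text{likelihood}}\;
     \underbrace{P(t_1^n)}_{\text{prior}}
     && \text{(denominator is the same for every tag sequence)} \\
  &\approx \operatorname*{argmax}_{t_1^n} \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1})
     && \text{(independence assumptions)}
\end{align*}
```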
Two Kinds of Probabilities • Tag transition probabilities P(t_i|t_{i-1}) • Determiners are likely to precede adjectives and nouns • That/DT flight/NN • The/DT yellow/JJ hat/NN • So we expect P(NN|DT) and P(JJ|DT) to be high • But P(DT|JJ) to be low • Compute P(NN|DT) by counting in a labeled corpus: P(NN|DT) = C(DT, NN) / C(DT)
Two Kinds of Probabilities • Word likelihood probabilities P(w_i|t_i) • VBZ (3sg Pres verb) likely to be “is” • Compute P(is|VBZ) by counting in a labeled corpus: P(is|VBZ) = C(VBZ, is) / C(VBZ)
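Both kinds of probability come from simple relative-frequency counts over a tagged corpus. A minimal sketch, using a three-sentence tagged corpus invented for illustration (the sentences, function names, and resulting numbers are not from the lecture):

```python
from collections import Counter

# Tiny hand-tagged corpus, invented for illustration only.
tagged_sentences = [
    [("the", "DT"), ("yellow", "JJ"), ("hat", "NN")],
    [("that", "DT"), ("flight", "NN")],
    [("the", "DT"), ("flight", "NN"), ("is", "VBZ"), ("late", "JJ")],
]

tag_counts = Counter()    # C(t)
transitions = Counter()   # C(t_prev, t)
emissions = Counter()     # C(t, w)

for sentence in tagged_sentences:
    tags = [tag for _, tag in sentence]
    tag_counts.update(tags)
    transitions.update(zip(tags, tags[1:]))
    emissions.update((tag, word) for word, tag in sentence)


def transition_prob(prev_tag: str, tag: str) -> float:
    """P(tag | prev_tag) = C(prev_tag, tag) / C(prev_tag)."""
    return transitions[(prev_tag, tag)] / tag_counts[prev_tag]


def emission_prob(word: str, tag: str) -> float:
    """P(word | tag) = C(tag, word) / C(tag)."""
    return emissions[(tag, word)] / tag_counts[tag]


print("P(NN|DT) =", transition_prob("DT", "NN"))
print("P(JJ|DT) =", transition_prob("DT", "JJ"))
print("P(DT|JJ) =", transition_prob("JJ", "DT"))
print("P(is|VBZ) =", emission_prob("is", "VBZ"))
```

On a corpus this small many estimates are exactly zero, so a real tagger would also need the smoothing ideas from the first half of the lecture.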
Example: The Verb “race” • Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NR • People/NNS continue/VB to/TO inquire/VB the/DT reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN • How do we pick the right tag?
Disambiguating “race”
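The figure for this slide is missing, but the comparison can be sketched numerically. Only the probabilities that differ between the two readings of "to race tomorrow" matter: the transition into the tag for "race", the word likelihood of "race" under that tag, and the transition from that tag to NR. The values below are the corpus estimates quoted in the Jurafsky and Martin textbook, not numbers from this transcript, so treat them as assumptions.

```python
# Probabilities assumed from the textbook's worked example (not in this transcript).
P_VB_GIVEN_TO = 0.83
P_NN_GIVEN_TO = 0.00047
P_RACE_GIVEN_VB = 0.00012
P_RACE_GIVEN_NN = 0.00057
P_NR_GIVEN_VB = 0.0027
P_NR_GIVEN_NN = 0.0012

# Score of each reading: P(tag | TO) * P(NR | tag) * P(race | tag).
vb_score = P_VB_GIVEN_TO * P_NR_GIVEN_VB * P_RACE_GIVEN_VB
nn_score = P_NN_GIVEN_TO * P_NR_GIVEN_NN * P_RACE_GIVEN_NN

best_tag = "VB" if vb_score > nn_score else "NN"

print(f"VB reading: {vb_score:.3g}")
print(f"NN reading: {nn_score:.3g}")
print("Chosen tag for 'race':", best_tag)
```

Even though "race" is more likely as a noun in isolation, the strong preference for a verb after TO makes the VB reading win by roughly three orders of magnitude.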