
CS 595-052 Machine Learning and Statistical Natural Language Processing

Presentation Transcript


  1. CS 595-052 Machine Learning and Statistical Natural Language Processing Prof. Shlomo Argamon, argamon@iit.edu Room: 237C Office Hours: Mon 3-4 PM Book: Foundations of Statistical Natural Language Processing, C. D. Manning and H. Schütze Requirements: • Several programming projects • Research Proposal

  2. Machine Learning [Slide diagram: Training Examples → Learning Algorithm → Learned Model; Test Examples → Learned Model → Classification/Labeling Results]

  3. Modeling • Decide how to represent learned models: • Decision rules • Linear functions • Markov models • … • Type chosen affects generalization accuracy (on new data)

  4. Generalization

  5. Example Representation • Set of Features: • Continuous • Discrete (ordered and unordered) • Binary • Sets vs. Sequences • Classes: • Continuous vs. discrete • Binary vs. multivalued • Disjoint vs. overlapping

  6. Learning Algorithms • Find a “good” hypothesis “consistent” with the training data • Many hypotheses may be consistent, so may need a “preference bias” • No hypothesis may be consistent, so need to find “nearly” consistent • May rule out some hypotheses to start with: • Feature reduction

  7. Estimating Generalization Accuracy • Accuracy on the training data says nothing about new examples! • Must train and test on different example sets • Estimate generalization accuracy over multiple train/test divisions • Sources of estimation error: • Bias: Systematic error in the estimate • Variance: How much the estimate changes between different runs

  8. Cross-validation • Divide the training data into k sets • Repeat for each set i: • Train on the other k-1 sets • Test on set i • Average the k accuracies (and compute statistics)
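
A minimal sketch of the cross-validation loop above, assuming hypothetical train(examples) and accuracy(model, examples) callables that stand in for whatever learner and metric are actually used:

```python
import random
from statistics import mean, stdev

def cross_validate(examples, train, accuracy, k=10, seed=0):
    """Estimate generalization accuracy with k-fold cross-validation."""
    data = list(examples)
    random.Random(seed).shuffle(data)
    folds = [data[i::k] for i in range(k)]              # k roughly equal folds
    scores = []
    for i in range(k):
        held_out = folds[i]                             # test on the i-th fold
        training = [x for j in range(k) if j != i for x in folds[j]]
        model = train(training)                         # train on the other k-1 folds
        scores.append(accuracy(model, held_out))
    return mean(scores), stdev(scores)                  # mean accuracy and its spread
```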

  9. Bootstrapping For a corpus of n examples: • Choose n examples randomly (with replacement) Note: We expect ~0.632n different examples • Train model, and evaluate: • acc0 = accuracy of model on non-chosen examples • accS = accuracy of model on the n training examples • Estimate accuracy as 0.632*acc0 + 0.368*accS • Average accuracies over b different runs Also note: there are other similar bootstrapping techniques
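
A sketch of the 0.632 bootstrap estimate described above, under the same assumption of hypothetical train and accuracy callables:

```python
import random
from statistics import mean

def bootstrap_632(examples, train, accuracy, b=50, seed=0):
    """Average of 0.632*acc0 + 0.368*accS over b bootstrap resamples."""
    rng = random.Random(seed)
    n = len(examples)
    estimates = []
    for _ in range(b):
        idx = [rng.randrange(n) for _ in range(n)]      # draw n indices with replacement
        chosen = set(idx)                               # ~0.632*n distinct examples on average
        sample = [examples[i] for i in idx]
        unseen = [examples[i] for i in range(n) if i not in chosen]
        if not unseen:                                  # can only happen for tiny n
            continue
        model = train(sample)
        acc0 = accuracy(model, unseen)                  # accuracy on never-chosen examples
        acc_s = accuracy(model, sample)                 # accuracy on the training resample
        estimates.append(0.632 * acc0 + 0.368 * acc_s)
    return mean(estimates)
```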

  10. Bootstrapping vs. Cross-validation • Cross-validation: • Equal participation of all examples • Dependency of class distribution in tests on distributions in training • Stratified cross-validation: equalize class dist. • Bootstrap: • Often has higher bias (fewer distinct examples) • Best for small datasets

  11. Natural Language Processing • Extract useful information from natural language texts (articles, books, web pages, queries, etc.) • Traditional method: Handcrafted lexicons, grammars, parsers • Statistical approach: Learn how to process language from a corpus of real usage

  12. Some Statistical NLP Tasks • Part of speech tagging - How to distinguish between book the noun, and book the verb. • Shallow parsing – Pick out phrases of different types from a text, such as the purple people eater or would have been going • Word sense disambiguation - How to distinguish between river bank and bank as a financial institution. • Alignment – Find the correspondence between words, sentences and paragraphs of a source text and its translation.

  13. A Paradigmatic Task • Language Modeling: Predict the next word of a text (probabilistically): P(wn | w1w2…wn-1) = m(wn | w1w2…wn-1) • To do this perfectly, we must capture true notions of grammaticality • So: Better approximation of prob. of “the next word” → Better language model

  14. Measuring “Surprise” • The lower the probability of the actual word, the more the model is “surprised”: H(wn | w1…wn-1) = -log2 m(wn | w1…wn-1) (The conditional entropy of wn given w1,n-1) Cross-entropy: Suppose the actual distribution of the language is p(wn | w1…wn-1), then our model is on average surprised by: Ep[H(wn|w1,n-1)] = Σw p(wn=w | w1,n-1) H(wn=w | w1,n-1) = Ep[-log2 m(wn | w1,n-1)]

  15. Estimating the Cross-Entropy How can we estimate Ep[H(wn|w1,n-1)] when we don’t (by definition) know p? Assume: • Stationarity: The language doesn’t change • Ergodicity: The language never gets “stuck” Then: Ep[H(wn|w1,n-1)] = lim(n→∞) (1/n) Σn H(wn | w1,n-1)

  16. Perplexity Commonly used measure of “model fit”: perplexity(w1,n, m) = 2^H(w1,n, m) = m(w1,n)^(-1/n) How many “choices” for next word on average? • Lower perplexity = better model
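
To make the definitions concrete, here is a small perplexity computation, where model_prob is any function returning m(w | history); the uniform_model below is a made-up stand-in so the snippet runs:

```python
import math

def perplexity(words, model_prob):
    """2 to the average per-word surprisal -log2 m(w | history)."""
    total = 0.0
    for i, w in enumerate(words):
        p = model_prob(w, words[:i])                    # m(w_i | w_1 ... w_{i-1})
        total += -math.log2(p)
    return 2 ** (total / len(words))

def uniform_model(word, history, vocab_size=1000):
    """Toy model: every word in a 1000-word vocabulary is equally likely."""
    return 1.0 / vocab_size

print(perplexity("the cat sat on the mat".split(), uniform_model))   # ~1000: 1000 "choices" per word
```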

  17. N-gram Models • Assume a “limited horizon”: P(wk | w1w2…wk-1) = P(wk | wk-n+1…wk-1) • Each word depends only on the last n-1 words • Specific cases: • Unigram model: P(wk) – words independent • Bigram model: P(wk | wk-1) • Learning task: estimate these probabilities from a given corpus
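
A sketch of the learning task for the bigram case, estimating P(wk | wk-1) by maximum likelihood from a toy corpus; the <s> and </s> padding symbols are an assumption of this sketch, not something from the slides:

```python
from collections import defaultdict

def bigram_mle(sentences):
    """Maximum-likelihood bigram estimates P(w | prev) = C(prev, w) / C(prev)."""
    counts = defaultdict(lambda: defaultdict(int))
    for sentence in sentences:
        words = ["<s>"] + sentence.split() + ["</s>"]
        for prev, w in zip(words, words[1:]):
            counts[prev][w] += 1
    probs = {}
    for prev, followers in counts.items():
        total = sum(followers.values())                 # C(prev)
        probs[prev] = {w: c / total for w, c in followers.items()}
    return probs

corpus = ["the cat sat on the mat", "the cat ate the rat"]
print(bigram_mle(corpus)["the"])    # {'cat': 0.5, 'mat': 0.25, 'rat': 0.25}
```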

  18. Using Bigrams • Compute probability of a sentence: W = The cat sat on the mat P(W) = P(The|START) P(cat|The) P(sat|cat) P(on|sat) P(the|on) P(mat|the) P(END|mat) • Generate a random text and examine for “reasonableness”
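
A self-contained illustration of both uses above, with a small hand-written bigram table (all probabilities here are invented for the example): the sentence probability is the product of bigram probabilities from START to END, and random generation walks the same chain:

```python
import random

# Invented bigram table m(w | prev); "<s>" and "</s>" play the roles of START and END.
BIGRAMS = {
    "<s>": {"the": 1.0},
    "the": {"cat": 0.5, "mat": 0.25, "rat": 0.25},
    "cat": {"sat": 0.5, "ate": 0.5},
    "sat": {"on": 1.0},
    "ate": {"the": 1.0},
    "on":  {"the": 1.0},
    "mat": {"</s>": 1.0},
    "rat": {"</s>": 1.0},
}

def sentence_prob(sentence):
    """P(W) as a product of bigram probabilities, including the END transition."""
    words = ["<s>"] + sentence.split() + ["</s>"]
    p = 1.0
    for prev, w in zip(words, words[1:]):
        p *= BIGRAMS.get(prev, {}).get(w, 0.0)          # unseen bigram -> probability 0
    return p

def generate(max_len=20):
    """Sample a random text by walking the bigram chain from <s>."""
    word, out = "<s>", []
    for _ in range(max_len):
        nxt = BIGRAMS[word]
        word = random.choices(list(nxt), weights=nxt.values())[0]
        if word == "</s>":
            break
        out.append(word)
    return " ".join(out)

print(sentence_prob("the cat sat on the mat"))   # 1.0*0.5*0.5*1.0*1.0*0.25*1.0 = 0.0625
print(generate())                                # e.g. "the cat ate the rat"
```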

  19. Maximum Likelihood Estimation • PMLE(w1…wn) = C(w1…wn) / N • PMLE(wn | w1…wn-1) = C(w1…wn) / C(w1…wn-1) • Problem: Data Sparseness!! • For the vast majority of possible n-grams, we get 0 probability, even in a very large corpus • The larger the context, the greater the problem • But there are always new cases not seen before!

  20. Smoothing • Idea: Take some probability away from seen events and assign it to unseen events Simple method (Laplace): Give every event an a priori count of 1 PLap(X) = (C(X)+1) / (N+B) where X is any entity and B is the number of entity types • Problem: Assigns too much probability to new events; the more event types there are, the worse this becomes
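
A sketch of add-one (Laplace) smoothing over unigram counts; the toy vocabulary deliberately includes a word that never occurs, to show it still gets non-zero probability:

```python
from collections import Counter

def laplace_probs(tokens, vocab):
    """P_Lap(w) = (C(w) + 1) / (N + B), where B is the number of word types in vocab."""
    counts = Counter(tokens)
    n, b = len(tokens), len(vocab)
    return {w: (counts[w] + 1) / (n + b) for w in vocab}

tokens = "the cat sat on the mat".split()
vocab = set(tokens) | {"dog"}                    # "dog" is never seen in the data
probs = laplace_probs(tokens, vocab)
print(probs["the"], probs["dog"])                # (2+1)/(6+6) = 0.25 and (0+1)/(6+6) ≈ 0.083
```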

  21. Interpolation Lidstone: PLid(X) = (C(X) + d) / (N + dB) [ d < 1 ] Johnson: equivalently, PLid(X) = m PMLE(X) + (1 – m)(1/B) where m = N/(N+dB) • How to choose d? • Doesn’t match low-frequency events well
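
A quick numerical check of the two formulas above (using d = 0.5 and the same toy data as before); the point is that the Lidstone estimate and the interpolated form give the same number:

```python
from collections import Counter

def lidstone_probs(tokens, vocab, d=0.5):
    """P_Lid(X) = (C(X) + d) / (N + dB)."""
    counts = Counter(tokens)
    n, b = len(tokens), len(vocab)
    return {w: (counts[w] + d) / (n + d * b) for w in vocab}

tokens = "the cat sat on the mat".split()
vocab = set(tokens) | {"dog"}
n, b, d = len(tokens), len(vocab), 0.5
m = n / (n + d * b)                              # Johnson's interpolation weight

p_lid = lidstone_probs(tokens, vocab, d)["the"]
p_int = m * (2 / n) + (1 - m) * (1 / b)          # m*P_MLE(the) + (1-m)*(1/B)
print(p_lid, p_int)                              # both equal 5/18 ≈ 0.278
```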

  22. Held-out Estimation Idea: Estimate the frequency of events in unseen data from separate “held-out” data • Divide the data into “training” and “held-out” subsets C1(X) = freq of X in training data C2(X) = freq of X in held-out data Tr = Σ{X: C1(X)=r} C2(X) Pho(X) = Tr / (Nr·N) where C1(X) = r and Nr is the number of types X with C1(X) = r
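
A sketch of held-out estimation for unigrams, following the notation above: Tr pools the held-out counts of all types whose training count is r, and Nr is how many such types there are. Taking N to be the size of the held-out data is an assumption of this sketch (it makes the estimates sum to one over the observed types):

```python
from collections import Counter

def held_out_probs(train_tokens, heldout_tokens):
    """P_ho(X) = T_r / (N_r * N), where r = C1(X) and N = len(heldout_tokens)."""
    c1 = Counter(train_tokens)                   # C1: counts in the training half
    c2 = Counter(heldout_tokens)                 # C2: counts in the held-out half
    n = len(heldout_tokens)
    types = set(c1) | set(c2)
    n_r, t_r = Counter(), Counter()
    for x in types:
        r = c1[x]
        n_r[r] += 1                              # N_r: number of types with training count r
        t_r[r] += c2[x]                          # T_r: their total held-out count
    return {x: t_r[c1[x]] / (n_r[c1[x]] * n) for x in types}

train = "the cat sat on the mat".split()
held  = "the dog sat on the log".split()
print(held_out_probs(train, held)["the"])        # T_2 / (N_2 * N) = 2 / (1 * 6) ≈ 0.33
```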

  23. Deleted Estimation Generalize to use all the data: • Divide data into 2 subsets (a, b ∈ {0, 1}): N^a_r = number of entities X s.t. Ca(X) = r T^ab_r = Σ{X: Ca(X)=r} Cb(X) Pdel(X) = (T^01_r + T^10_r) / (N·(N^0_r + N^1_r)) [C(X) = r] • Needs a large data set • Overestimates unseen data, underestimates infrequent data

  24. Good-Turing For observed items, discount the item count: r* = (r+1) E[Nr+1] / E[Nr] • The idea is that the chance of seeing the item one more time is about E[Nr+1] / E[Nr] For unobserved items, the total probability is: E[N1] / N • So, if we assume a uniform distribution over unknown items, we have: P(X) = E[N1] / (N0·N)
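
A rough Good-Turing sketch using raw Nr counts in place of the smoothed expectations E[Nr], so it is only reasonable for small r on a decent amount of data; all names are illustrative:

```python
from collections import Counter

def good_turing(tokens):
    """Discounted counts r* = (r+1) * N_{r+1} / N_r, plus the unseen mass N_1 / N."""
    counts = Counter(tokens)
    n = len(tokens)
    n_r = Counter(counts.values())               # N_r: number of types seen exactly r times
    r_star = {}
    for r in sorted(n_r):
        if n_r.get(r + 1, 0) > 0:
            r_star[r] = (r + 1) * n_r[r + 1] / n_r[r]
        else:
            r_star[r] = float(r)                 # no N_{r+1}: leave the count undiscounted
    unseen_mass = n_r.get(1, 0) / n              # total probability reserved for unseen items
    return r_star, unseen_mass

tokens = "the cat sat on the mat and the dog sat".split()
print(good_turing(tokens))                       # ({1: 0.4, 2: 3.0, 3: 3.0}, 0.5)
```

The inflated estimate for r = 2 in this tiny sample shows why raw Nr values are too noisy, which is exactly the motivation on the next slide for smoothing E[Nr] or applying the discount only below some threshold k.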

  25. Good-Turing Issues • Has problems with high-frequency items (consider rmax* = E[Nrmax+1]/E[Nrmax] = 0) Usual answers: • Use only for low-frequency items (r < k) • Smooth E[Nr] by function S(r) • How to divide probability among unseen items? • Uniform distribution • Estimate which seem more likely than others…

  26. Back-off Models • If the high-order n-gram has insufficient data, use a lower-order n-gram: Pbo(wi | wi-n+1,i-1) = { (1 - d(wi-n+1,i-1)) P(wi | wi-n+1,i-1) if enough data; α(wi-n+1,i-1) Pbo(wi | wi-n+2,i-1) otherwise } • Note the recursive formulation
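
A deliberately crude back-off sketch in the spirit of the formula above: use a discounted bigram estimate when the bigram has been seen often enough, otherwise fall back to a scaled unigram estimate. The fixed discount and back-off weight below are placeholders; computing proper weights α so that everything normalizes (as in Katz back-off) is omitted:

```python
from collections import Counter

class BackoffBigram:
    """Crude two-level back-off: discounted bigram when counts suffice, else scaled unigram."""
    def __init__(self, tokens, min_count=2, discount=0.1, alpha=0.4):
        self.unigrams = Counter(tokens)
        self.bigrams = Counter(zip(tokens, tokens[1:]))
        self.n = len(tokens)
        self.min_count, self.discount, self.alpha = min_count, discount, alpha

    def prob(self, word, prev):
        c = self.bigrams[(prev, word)]
        if c >= self.min_count:                           # "enough data": use the bigram
            return (1 - self.discount) * c / self.unigrams[prev]
        return self.alpha * self.unigrams[word] / self.n  # otherwise back off to the unigram

model = BackoffBigram("the cat sat on the mat the cat ate".split())
print(model.prob("cat", "the"), model.prob("mat", "the"))   # 0.6 and ~0.044
```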

  27. Linear Interpolation More generally, we can interpolate: Pint(wi | h) = Σk λk(h) Pk(wi | h) • Interpolation between different orders • Usually set weights by iterative training (gradient descent – EM algorithm) • Partition histories h into equivalence classes • Need to be responsive to the amount of data!
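
A sketch of interpolation across unigram, bigram, and trigram MLE estimates with fixed weights; in practice the λk would be fit on held-out data (e.g. by EM), possibly per equivalence class of histories, rather than hard-coded as here:

```python
from collections import Counter

class InterpolatedTrigram:
    """P_int(w | u, v) = l3*P_MLE(w|u,v) + l2*P_MLE(w|v) + l1*P_MLE(w)."""
    def __init__(self, tokens, lambdas=(0.5, 0.3, 0.2)):
        self.l3, self.l2, self.l1 = lambdas              # should sum to 1
        self.uni = Counter(tokens)
        self.bi = Counter(zip(tokens, tokens[1:]))
        self.tri = Counter(zip(tokens, tokens[1:], tokens[2:]))
        self.n = len(tokens)

    @staticmethod
    def _ratio(num, den):
        return num / den if den else 0.0                 # MLE is 0 when the context is unseen

    def prob(self, w, u, v):
        p3 = self._ratio(self.tri[(u, v, w)], self.bi[(u, v)])
        p2 = self._ratio(self.bi[(v, w)], self.uni[v])
        p1 = self.uni[w] / self.n
        return self.l3 * p3 + self.l2 * p2 + self.l1 * p1

model = InterpolatedTrigram("the cat sat on the mat the cat ate the rat".split())
print(model.prob("cat", "on", "the"))   # non-zero even though "on the cat" was never seen
```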
