This lecture covers topics such as n-grams, model evaluation, and the introduction of Markov models in computational linguistics.
CPSC 503 Computational Linguistics, Lecture 6. Giuseppe Carenini, Winter 2009
Today 25/9 • Finish n-grams • Model evaluation • Start Markov models
Knowledge-Formalisms Map (including probabilistic formalisms)
• Morphology: state machines (and probabilistic versions: finite state automata, finite state transducers, Markov models)
• Syntax: rule systems (and probabilistic versions, e.g., (probabilistic) context-free grammars)
• Semantics: logical formalisms (first-order logics)
• Pragmatics, Discourse and Dialogue: AI planners
Two problems with applying the n-gram approximation P(w1 … wn) ≈ ∏k P(wk | wk-N+1 … wk-1) to a word sequence:
Problem (1) • We may need to multiply many very small numbers (underflow!) • Easy solution: convert probabilities to logs and then add them instead of multiplying • To get the real probability back (if you need it), take the antilog
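A minimal sketch of the log-space trick; the toy bigram probabilities below are invented for illustration, not taken from a real corpus:

```python
import math

# Toy bigram probabilities, for illustration only.
bigram_prob = {("i", "want"): 0.0023, ("want", "to"): 0.31, ("to", "eat"): 0.26}

def sentence_logprob(bigrams, probs):
    """Sum log-probabilities instead of multiplying raw probabilities,
    which avoids numerical underflow for long sequences."""
    return sum(math.log2(probs[bg]) for bg in bigrams)

logp = sentence_logprob([("i", "want"), ("want", "to"), ("to", "eat")], bigram_prob)
print(logp)       # log2 of the sequence probability
print(2 ** logp)  # antilog: back to a real probability (only if needed)
```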
Problem (2) • The probability matrix for n-grams is sparse • How can we assign a probability to a sequence when one of its component n-grams has a count of zero? • Solutions: • Add-one smoothing • Good-Turing • Backoff and deleted interpolation
Add-One • Make the zero counts 1: add one to every count • Rationale: if you had seen these "rare" events, chances are you would only have seen them once • Smoothed unigram estimate: P*(wi) = (ci + 1) / (N + V); the corresponding discounted count is c* = (c + 1) N / (N + V)
Add-One: Bigram Counts • Smoothed bigram estimate: P*(wn | wn-1) = (C(wn-1 wn) + 1) / (C(wn-1) + V)
BERP: original vs. add-one smoothed bigram counts (comparison table)
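A minimal sketch of add-one smoothed bigram estimates; the tiny corpus and vocabulary below are invented for illustration:

```python
from collections import Counter

corpus = "i want to eat i want chinese food".split()  # toy corpus, illustration only
vocab = set(corpus)
V = len(vocab)

unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))

def p_add_one(w_prev, w):
    """P*(w | w_prev) = (C(w_prev, w) + 1) / (C(w_prev) + V)."""
    return (bigram_counts[(w_prev, w)] + 1) / (unigram_counts[w_prev] + V)

print(p_add_one("want", "to"))    # seen bigram
print(p_add_one("food", "want"))  # unseen bigram: no longer zero
```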
Add-One Smoothing: Problems • An excessive amount of probability mass is moved to the zero counts • When compared empirically with other smoothing techniques, it performs poorly -> not used in practice • Detailed discussion in [Gale and Church 1994]
Better smoothing techniques • Good-Turing discounting (clear example in the textbook) • More advanced: backoff and interpolation, which estimate an n-gram using "lower-order" n-grams • Backoff: rely on the lower-order n-grams when we have zero counts for a higher-order one; e.g., use unigrams to estimate zero-count bigrams • Interpolation: mix the probability estimates from all the n-gram estimators; e.g., a weighted interpolation of trigram, bigram and unigram counts (see the sketch below)
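A minimal sketch of simple linear interpolation of trigram, bigram and unigram estimates. The lambda weights are fixed by hand here purely for illustration; deleted interpolation would instead learn them on held-out data:

```python
def interp_prob(w1, w2, w3, tri, bi, uni, n_tokens, lambdas=(0.6, 0.3, 0.1)):
    """Estimate P(w3 | w1, w2) as a weighted mix of trigram, bigram and unigram MLEs.
    tri, bi, uni are raw count dicts; the lambdas should sum to 1.
    Denominators default to 1 when the history was never seen (the numerator is then 0 too)."""
    l3, l2, l1 = lambdas
    p_tri = tri.get((w1, w2, w3), 0) / max(bi.get((w1, w2), 0), 1)
    p_bi = bi.get((w2, w3), 0) / max(uni.get(w2, 0), 1)
    p_uni = uni.get(w3, 0) / n_tokens
    return l3 * p_tri + l2 * p_bi + l1 * p_uni
```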
N-Grams Summary • P(w1 … wn) is impossible to estimate directly: the sample space is much bigger than any realistic corpus, and the chain rule alone does not help • Markov assumption: P(wk | w1 … wk-1) ≈ P(wk | wk-N+1 … wk-1) • Unigram sample space? (|V|) • Bigram sample space? (|V|^2) • Trigram sample space? (|V|^3) • Sparse matrix: smoothing techniques • Look at practical issues: Sec. 4.8
You still need a big corpus… • The biggest one is the Web! • It is impractical to download every page from the Web to count the n-grams => rely on page counts (which are only approximations) • A page can contain an n-gram multiple times • Search engines round off their counts • Such "noise" is tolerable in practice
Today 25/9 • Finish n-grams • Model evaluation • Start Markov models
Model Evaluation: Goal • On a given corpus, you may want to compare: • 2-grams with 3-grams • two different smoothing techniques (given the same n-grams)
Model Evaluation: Key Ideas
• A: split the corpus into a training set and a testing set
• B: train the models Q1 and Q2 on the training set (counting, frequencies, smoothing)
• C: apply the models to the testing set and compare the results
Entropy • Def 1: a measure of uncertainty • Def 2: a measure of the information we need to resolve an uncertain situation • Let p(x) = P(X = x), where x ∈ X • H(p) = H(X) = -∑_{x ∈ X} p(x) log2 p(x) • It is normally measured in bits
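A minimal sketch of the entropy formula on a toy distribution (the distributions below are made up for illustration):

```python
import math

def entropy(p):
    """H(p) = -sum over x of p(x) * log2 p(x), in bits; zero-probability outcomes contribute 0."""
    return -sum(px * math.log2(px) for px in p.values() if px > 0)

# A uniform 4-way choice carries 2 bits of uncertainty; a skewed one carries much less.
print(entropy({"a": 0.25, "b": 0.25, "c": 0.25, "d": 0.25}))  # 2.0
print(entropy({"a": 0.97, "b": 0.01, "c": 0.01, "d": 0.01}))  # close to 0
```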
Model Evaluation • How different is our approximation q from the actual distribution p? • Relative entropy (KL divergence): D(p || q) = ∑_{x ∈ X} p(x) log2( p(x) / q(x) )
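A minimal sketch of the KL divergence formula; the two toy distributions are invented for illustration:

```python
import math

def kl_divergence(p, q):
    """D(p || q) = sum over x of p(x) * log2(p(x) / q(x)); assumes q(x) > 0 wherever p(x) > 0."""
    return sum(px * math.log2(px / q[x]) for x, px in p.items() if px > 0)

p = {"a": 0.5, "b": 0.5}   # "actual" distribution (toy)
q = {"a": 0.9, "b": 0.1}   # our approximation (toy)
print(kl_divergence(p, p))  # 0.0: identical distributions
print(kl_divergence(p, q))  # > 0: q diverges from p
```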
Entropy rate and the entropy of a language • Entropy rate: (1/n) H(W1 … Wn) • Entropy of a language (NL?): H(L) = lim n→∞ (1/n) H(W1 … Wn) • Assumptions: ergodic and stationary • Shannon-McMillan-Breiman theorem: H(L) = lim n→∞ -(1/n) log p(w1 … wn), i.e., the entropy can be computed by taking the average log probability of a very long sample
Cross-Entropy Applied to Language • Between the probability distribution P (the actual language) and another distribution Q (a model for P): H(P, Q) ≈ -(1/n) log Q(w1 … wn) for a long sample, with H(P, Q) ≥ H(P) • Between two models Q1 and Q2, the more accurate is the one that assigns higher probability to the test data => lower cross-entropy => lower perplexity
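A minimal sketch of how cross-entropy and its antilog, perplexity, are estimated in practice: average the model's negative log probability over a test sample. The `model_logprob` function here is a stand-in for any smoothed n-gram estimator, and the uniform model is purely illustrative:

```python
import math

def cross_entropy(test_tokens, model_logprob):
    """Approximate H(P, Q) as -(1/n) * sum over the test sample of log2 Q(w_i | history)."""
    n = len(test_tokens)
    total = sum(model_logprob(test_tokens[:i], w) for i, w in enumerate(test_tokens))
    return -total / n

def perplexity(test_tokens, model_logprob):
    return 2 ** cross_entropy(test_tokens, model_logprob)

# Stand-in "model": a uniform distribution over a 1000-word vocabulary.
uniform = lambda history, w: math.log2(1 / 1000)
print(perplexity("the food was good".split(), uniform))  # ≈ 1000 for the uniform model
```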
Model Evaluation: In Practice
• A: split the corpus into a training set and a testing set
• B: train the models Q1 and Q2 on the training set (counting, frequencies, smoothing)
• C: apply the models to the testing set and compare their cross-perplexities
k-fold cross-validation and t-test • Randomly divide the corpus into k subsets of equal size • Use each subset for testing (and all the others for training) • In practice you do k times what we saw on the previous slide • Now for each model you have k perplexities • Compare the models' average perplexities with a t-test
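A minimal sketch of the k-fold comparison, assuming a hypothetical `evaluate(train, test)` function that trains a model and returns its perplexity on the held-out fold; `scipy.stats.ttest_rel` performs the paired t-test on the k paired perplexities:

```python
import random
from scipy import stats

def k_fold_perplexities(corpus, k, evaluate, seed=0):
    """Randomly divide the corpus into k folds; use each fold for testing,
    the remaining k-1 folds for training, and collect the k perplexities."""
    data = list(corpus)
    random.Random(seed).shuffle(data)
    fold_size = len(data) // k
    scores = []
    for i in range(k):
        test = data[i * fold_size:(i + 1) * fold_size]
        train = data[:i * fold_size] + data[(i + 1) * fold_size:]
        scores.append(evaluate(train, test))
    return scores

# pp_q1 = k_fold_perplexities(corpus, 10, evaluate_q1)   # evaluate_* are hypothetical
# pp_q2 = k_fold_perplexities(corpus, 10, evaluate_q2)
# t, p = stats.ttest_rel(pp_q1, pp_q2)  # paired t-test on the k perplexity pairs
```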
Today 25/9 • Finish n-grams • Model evaluation • Start Markov models
Example of a Markov Chain: a state diagram over the states t, i, p, a, h, e, with transition probabilities on the arcs and two possible start states (t and i)
Markov Chain: Formal Description
• 1. Stochastic transition matrix A over the states {t, i, p, a, h, e}: a_ij = P(X_{t+1} = s_j | X_t = s_i), with each row summing to 1
• 2. Probability distribution over the initial states: here P(X1 = t) = .6, P(X1 = i) = .4
Markov Assumptions • Let X = (X1, .., Xt) be a sequence of random variables taking values in some finite set S = {s1, …, sn}, the state space. The Markov properties are: • (a) Limited Horizon: for all t, P(X_{t+1} | X1, .., Xt) = P(X_{t+1} | Xt) • (b) Time Invariant: for all t, P(X_{t+1} | Xt) = P(X2 | X1), i.e., the dependency does not change over time
Markov Chain: Probability of a Sequence of States • P(X1 … XT) = P(X1) ∏_{t=1}^{T-1} P(X_{t+1} | X_t) = π_{X1} a_{X1 X2} … a_{X_{T-1} X_T} • Example: similar to …….? (computing the probability of a word sequence with bigrams)
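A minimal sketch of the sequence probability π(X1) ∏ a(Xt, Xt+1). The initial probabilities follow the slide's example (.6 for t, .4 for i); the transition probabilities shown are invented placeholders, not read off the diagram:

```python
def sequence_prob(states, pi, a):
    """P(X1 .. XT) = pi[X1] * product over t of a[(Xt, Xt+1)] for a Markov chain."""
    p = pi[states[0]]
    for prev, nxt in zip(states, states[1:]):
        p *= a[(prev, nxt)]
    return p

pi = {"t": 0.6, "i": 0.4}                                   # from the example
a = {("t", "i"): 0.3, ("i", "p"): 1.0, ("p", "a"): 0.4}     # illustrative values only
print(sequence_prob(["t", "i", "p", "a"], pi, a))           # 0.6 * 0.3 * 1.0 * 0.4
```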
HMMs (and MEMMs): Intro • They are probabilistic sequence classifiers / sequence labelers: they assign a class/label to each unit in a sequence. We have already seen a non-probabilistic version… • Used extensively in NLP • Part-of-speech tagging, e.g., Brainpower_NN ,_, not_RB physical_JJ plant_NN ,_, is_VBZ now_RB a_DT firm_NN 's_POS chief_JJ asset_NN ._. • Partial parsing: [NP The HD box] that [NP you] [VP ordered] [PP from] [NP Shaw] [VP never arrived] • Named entity recognition: [John Smith PERSON] left [IBM Corp. ORG] last summer
Hidden Markov Model (State Emission): a state diagram with hidden states s1–s4, transition probabilities on the arcs, emission probabilities for the output symbols (a, b, i in the figure), and two possible start states
Hidden Markov Model: Formal Specification as a Five-Tuple
• Set of states
• Output alphabet
• Initial state probabilities
• State transition probabilities
• Symbol emission probabilities
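A minimal sketch of the five-tuple as a data structure (states, output alphabet, initial probabilities, transitions, emissions); the two-state model and its numbers are illustrative, not taken from the diagram:

```python
from dataclasses import dataclass

@dataclass
class HMM:
    states: list    # S: set of states
    alphabet: list  # K: output alphabet
    pi: dict        # initial state probabilities, pi[s] = P(X1 = s)
    a: dict         # state transition probabilities, a[(s, s_next)]
    b: dict         # symbol emission probabilities, b[(s, symbol)]

# Toy two-state model over the symbols {a, b}; all numbers are invented for illustration.
hmm = HMM(
    states=["s1", "s2"],
    alphabet=["a", "b"],
    pi={"s1": 0.6, "s2": 0.4},
    a={("s1", "s1"): 0.3, ("s1", "s2"): 0.7, ("s2", "s1"): 0.5, ("s2", "s2"): 0.5},
    b={("s1", "a"): 0.9, ("s1", "b"): 0.1, ("s2", "a"): 0.4, ("s2", "b"): 0.6},
)
```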
Next Time • Hidden Markov Models (HMMs) • Part-of-Speech (POS) tagging