
Markov Models & POS Tagging



Presentation Transcript


  1. Markov Models & POS Tagging Nazife Dimililer 23/10/2012

  2. Overview • Markov Models • Hidden Markov Models • POS Tagging

  3. Andrei Markov • Russian statistician (1856–1922) • Studied temporal probability models • Markov assumption: State_t depends only on a bounded subset of State_0:t-1 • First-order Markov process: P(State_t | State_0:t-1) = P(State_t | State_t-1) • Second-order Markov process: P(State_t | State_0:t-1) = P(State_t | State_t-2:t-1)

  4. Markov Assumption • “The future does not depend on the past, given the present.” • Sometimes this is called the first-order Markov assumption. • A second-order assumption would mean that each event depends on the previous two events. • This isn’t really a crucial distinction. • What’s crucial is that there is some limit on how far we are willing to look back.

  5. Markov Models • The Markov assumption means that there is only one probability to remember for each event type (e.g., word) to another event type. • Plus the probabilities of starting with a particular event. • This lets us define a Markov model as: • a finite state automaton in which • the states represent possible event types (e.g., the different words in our example) • the transitions represent the probability of one event type following another. • It is easy to depict a Markov model as a graph.
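The finite-state view above can be sketched in a few lines of code. A minimal illustration; the words and probabilities below are made up for the example, not taken from the slides:

```python
# A Markov model as a finite-state structure: states are event types
# (here, words) and transitions carry probabilities. All numbers below
# are illustrative only.

start = {"the": 0.6, "a": 0.4}            # probability of the first event
transitions = {                            # P(next event | current event)
    "the": {"cat": 0.5, "dog": 0.5},
    "a":   {"cat": 0.3, "dog": 0.7},
    "cat": {"sat": 1.0},
    "dog": {"sat": 1.0},
}

def sequence_probability(words):
    """P(w1..wn) = P(w1) * product of P(wi | wi-1): the first-order assumption."""
    p = start.get(words[0], 0.0)
    for prev, cur in zip(words, words[1:]):
        p *= transitions.get(prev, {}).get(cur, 0.0)
    return p

print(sequence_probability(["the", "cat", "sat"]))  # 0.6 * 0.5 * 1.0 = 0.3
```

Note that only one number is stored per pair of event types, which is exactly what the Markov assumption buys us.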

  6. What is a (Visible) Markov Model? • Graphical model (can be interpreted as a Bayesian net) • Circles indicate states • Arrows indicate probabilistic dependencies between states • State depends only on the previous state • “The past is independent of the future given the present.”

  7. Markov Model Formalization • (Figure: a chain of states S linked by transition arrows A.) • {S, P, A} • S : {s1…sN} are the values for the states • P = {pi} are the initial state probabilities • A = {aij} are the state transition probabilities

  8. Markov chain for weather

  9. Markov chain for words

  10. Markov chain = “First-order observable Markov Model” • A set of states: Q = q1, q2 … qN; the state at time t is qt • Transition probabilities: a set of probabilities A = a01, a02, …, an1, …, ann • Each aij represents the probability of transitioning from state i to state j • The set of these is the transition probability matrix A • Distinguished start and end states

  11. Markov chain = “First-order observable Markov Model” • Markov Assumption • Current state only depends on previous state P(qi | q1 … qi-1) = P(qi | qi-1)

  12. Another representation for start state • Instead of a start state • Special initial probability vector π • An initial distribution over the start states • Constraint: Σi πi = 1

  13. The weather model using π

  14. The weather model: specific example

  15. Markov chain for weather • What is the probability of 4 consecutive warm days? • Sequence is warm-warm-warm-warm • I.e., state sequence is 3-3-3-3 • P(3, 3, 3, 3) = π3 · a33 · a33 · a33 = 0.2 × (0.6)³ = 0.0432
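The slide's arithmetic can be checked directly; π3 = 0.2 and a33 = 0.6 are the values stated on the slide:

```python
# Probability of the state sequence 3-3-3-3 (warm-warm-warm-warm):
# start in state 3 with probability pi_3, then take the 3 -> 3
# transition three times.
pi_3, a_33 = 0.2, 0.6
p = pi_3 * a_33 ** 3   # 0.2 * 0.6^3 = 0.0432
```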

  16. How about? • Hot hot hot hot • Cold hot cold hot • What does the difference in these probabilities tell you about the real world weather info encoded in the figure?

  17. Markov Models • Set of states: {s1, …, sN} • Process moves from one state to another, generating a sequence of states si1, si2, …, sik, … • Markov chain property: probability of each subsequent state depends only on the previous state: P(sik | si1, …, sik-1) = P(sik | sik-1) • To define a Markov model, the following probabilities have to be specified: • transition probabilities aij = P(si | sj) and • initial probabilities πi = P(si)

  18. Example of Markov Model • (Figure: two states, Rain and Dry, with the transition probabilities below.) • Two states: ‘Rain’ and ‘Dry’ • Transition probabilities: P(‘Rain’|‘Rain’) = 0.3, P(‘Dry’|‘Rain’) = 0.7, P(‘Rain’|‘Dry’) = 0.2, P(‘Dry’|‘Dry’) = 0.8 • Initial probabilities: say P(‘Rain’) = 0.4, P(‘Dry’) = 0.6

  19. Calculation of sequence probability • By the Markov chain property, the probability of a state sequence can be found by the formula: P(si1, si2, …, sik) = P(sik | sik-1) P(sik-1 | sik-2) … P(si2 | si1) P(si1) • Suppose we want to calculate the probability of the sequence of states {‘Dry’, ’Dry’, ’Rain’, ’Rain’} in our example: • P({‘Dry’, ’Dry’, ’Rain’, ’Rain’}) = P(‘Rain’|’Rain’) P(‘Rain’|’Dry’) P(‘Dry’|’Dry’) P(‘Dry’) = 0.3 × 0.2 × 0.8 × 0.6 = 0.0288
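The same chain-rule computation, written out with the Rain/Dry transition table from the previous slide:

```python
# P(Dry, Dry, Rain, Rain) = P(Dry) * P(Dry|Dry) * P(Rain|Dry) * P(Rain|Rain)
p_init  = {"Rain": 0.4, "Dry": 0.6}
p_trans = {("Rain", "Rain"): 0.3, ("Rain", "Dry"): 0.7,   # keys are (prev, next)
           ("Dry",  "Rain"): 0.2, ("Dry",  "Dry"): 0.8}

seq = ["Dry", "Dry", "Rain", "Rain"]
p = p_init[seq[0]]
for prev, cur in zip(seq, seq[1:]):
    p *= p_trans[(prev, cur)]
# p = 0.6 * 0.8 * 0.2 * 0.3 = 0.0288
```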

  20. Hidden Markov Model (HMM) • Evidence can be observed, but the state is hidden • Three components • Priors (initial state probabilities) • State transition model • Evidence observation model • Changes are assumed to be caused by a stationary process • The transition and observation models do not change

  21. Hidden Markov models. • Set of states: • Process moves from one state to another generating a sequence of states : • Markov chain property: probability of each subsequent state depends only on what was the previous state: • States are not visible, but each state randomly generates one of M observations (or visible states)

  22. Hidden Markov models • Model is represented by M = (A, B, π) • To define a hidden Markov model, the following probabilities have to be specified: • matrix of transition probabilities A = (aij), aij = P(si | sj), • matrix of observation probabilities B = (bi(vm)), bi(vm) = P(vm | si), and • a vector of initial probabilities π = (πi), πi = P(si)
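As a sketch, the triple M = (A, B, π) can be written down as plain dictionaries. The numbers below are the Low/High pressure example from the slides that follow (with P(‘Dry’|‘High’) taken as 0.6 so that each row sums to one):

```python
A  = {"Low":  {"Low": 0.3, "High": 0.7},    # transition probabilities a_ij
      "High": {"Low": 0.2, "High": 0.8}}
B  = {"Low":  {"Rain": 0.6, "Dry": 0.4},    # observation probabilities b_i(v_m)
      "High": {"Rain": 0.4, "Dry": 0.6}}
pi = {"Low": 0.4, "High": 0.6}              # initial probabilities pi_i

# Every row of A and B, and pi itself, must be a probability distribution.
for row in list(A.values()) + list(B.values()) + [pi]:
    assert abs(sum(row.values()) - 1.0) < 1e-9
```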

  23. What is an HMM? • Graphical model • Circles indicate states • Arrows indicate probabilistic dependencies between states • HMM = Hidden Markov Model

  24. What is an HMM? • Green circles are hidden states • Dependent only on the previous state • “The past is independent of the future given the present.”

  25. What is an HMM? • Purple nodes are observed states • Dependent only on their corresponding hidden state

  26. HMM Formalism • (Figure: a chain of hidden states S, each emitting an observation K.) • {S, K, P, A, B} • S : {s1…sN} are the values for the hidden states • K : {k1…kM} are the values for the observations

  27. HMM Formalism • (Figure: hidden states S linked by transitions A, each emitting an observation K with probability B.) • {S, K, P, A, B} • P = {pi} are the initial state probabilities • A = {aij} are the state transition probabilities • B = {bik} are the observation state probabilities • Note: sometimes one uses B = {bijk}; the output then depends on the previous state / transition as well

  28. Example of Hidden Markov Model • (Figure: hidden states Low and High with transition probabilities 0.3, 0.7, 0.2, 0.8, emitting Rain and Dry with observation probabilities 0.6, 0.4, 0.4, 0.6; the numbers are spelled out on the next slide.)

  29. Example of Hidden Markov Model • Two states: ‘Low’ and ‘High’ atmospheric pressure. • Two observations: ‘Rain’ and ‘Dry’. • Transition probabilities: P(‘Low’|‘Low’) = 0.3, P(‘High’|‘Low’) = 0.7, P(‘Low’|‘High’) = 0.2, P(‘High’|‘High’) = 0.8 • Observation probabilities: P(‘Rain’|‘Low’) = 0.6, P(‘Dry’|‘Low’) = 0.4, P(‘Rain’|‘High’) = 0.4, P(‘Dry’|‘High’) = 0.6 • Initial probabilities: say P(‘Low’) = 0.4, P(‘High’) = 0.6

  30. Calculation of observation sequence probability • Suppose we want to calculate the probability of a sequence of observations in our example, {‘Dry’, ’Rain’}. • Consider all possible hidden state sequences: • P({‘Dry’,’Rain’}) = P({‘Dry’,’Rain’}, {‘Low’,’Low’}) + P({‘Dry’,’Rain’}, {‘Low’,’High’}) + P({‘Dry’,’Rain’}, {‘High’,’Low’}) + P({‘Dry’,’Rain’}, {‘High’,’High’}) • where the first term is: • P({‘Dry’,’Rain’}, {‘Low’,’Low’}) = P({‘Dry’,’Rain’} | {‘Low’,’Low’}) P({‘Low’,’Low’}) = P(‘Dry’|’Low’) P(‘Rain’|’Low’) P(‘Low’) P(‘Low’|’Low’) = 0.4 × 0.6 × 0.4 × 0.3 = 0.0288
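The full sum over hidden state sequences can be sketched by brute-force enumeration (feasible only for short sequences; efficient computation is the forward algorithm). The dictionaries repeat the slide's example, with P(‘Dry’|‘High’) taken as 0.6 so that the High row sums to one:

```python
from itertools import product

A  = {"Low": {"Low": 0.3, "High": 0.7}, "High": {"Low": 0.2, "High": 0.8}}
B  = {"Low": {"Rain": 0.6, "Dry": 0.4}, "High": {"Rain": 0.4, "Dry": 0.6}}
pi = {"Low": 0.4, "High": 0.6}

def observation_probability(obs):
    """Sum the joint probability over every possible hidden state sequence."""
    total = 0.0
    for states in product(A, repeat=len(obs)):
        p = pi[states[0]] * B[states[0]][obs[0]]
        for t in range(1, len(obs)):
            p *= A[states[t - 1]][states[t]] * B[states[t]][obs[t]]
        total += p
    return total

print(observation_probability(["Dry", "Rain"]))  # ≈ 0.232
```

The {‘Low’,‘Low’} term of this sum is the 0.0288 computed on the slide.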

  31. Main issues using HMMs • Evaluation problem: given the HMM M = (A, B, π) and the observation sequence O = o1 o2 … oK, calculate the probability that model M has generated sequence O. • Decoding problem: given the HMM M = (A, B, π) and the observation sequence O = o1 o2 … oK, calculate the most likely sequence of hidden states si that produced this observation sequence O. • Learning problem: given some training observation sequences O = o1 o2 … oK and the general structure of the HMM (numbers of hidden and visible states), determine the HMM parameters M = (A, B, π) that best fit the training data. • O = o1 … oK denotes a sequence of observations ok ∈ {v1, …, vM}.
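The decoding problem listed above is standardly solved with the Viterbi algorithm, which the slides do not spell out; a compact sketch, tried on the Low/High example (again with P(‘Dry’|‘High’) taken as 0.6):

```python
def viterbi(obs, states, pi, A, B):
    """Most likely hidden state sequence for obs (the decoding problem)."""
    # best[t][s]: probability of the best path ending in state s at time t
    best = [{s: pi[s] * B[s][obs[0]] for s in states}]
    back = []  # back[t][s]: predecessor of s on that best path
    for t in range(1, len(obs)):
        best.append({})
        back.append({})
        for s in states:
            prev = max(states, key=lambda r: best[t - 1][r] * A[r][s])
            best[t][s] = best[t - 1][prev] * A[prev][s] * B[s][obs[t]]
            back[t - 1][s] = prev
    # Trace back from the best final state.
    last = max(best[-1], key=best[-1].get)
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))

A  = {"Low": {"Low": 0.3, "High": 0.7}, "High": {"Low": 0.2, "High": 0.8}}
B  = {"Low": {"Rain": 0.6, "Dry": 0.4}, "High": {"Rain": 0.4, "Dry": 0.6}}
pi = {"Low": 0.4, "High": 0.6}
print(viterbi(["Dry", "Rain"], ["Low", "High"], pi, A, B))  # ['High', 'High']
```

The evaluation problem is solved by the same recurrence with max replaced by a sum (the forward algorithm); the learning problem is typically handled by Baum-Welch (EM).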

  32. HMM for Ice Cream • You are a climatologist in the year 2799 • Studying global warming • You can’t find any records of the weather in Baltimore, MD for the summer of 2008 • But you find Jason Eisner’s diary • Which lists how many ice creams Jason ate every day that summer • Our job: figure out how hot it was

  33. Hidden Markov Model • For Markov chains, the output symbols are the same as the states. • See hot weather: we’re in state hot • But in named-entity or part-of-speech tagging (and speech recognition and other things) • The output symbols are words • But the hidden states are something else • Part-of-speech tags • Named entity tags • So we need an extension! • A Hidden Markov Model is an extension of a Markov chain in which the input symbols are not the same as the states. • This means we don’t know which state we are in.

  34. POS Tagging : Current Performance Input: the lead paint is unsafe Output: the/Det lead/N paint/N is/V unsafe/Adj • How many tags are correct? • About 97% currently • But baseline is already 90% • Baseline is performance of stupidest possible method • Tag every word with its most frequent tag • Tag unknown words as nouns
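The baseline described on the slide (tag every word with its most frequent tag; tag unknown words as nouns) fits in a few lines. The tiny training list below is invented for illustration, echoing the slide's example sentence:

```python
from collections import Counter, defaultdict

# Hypothetical tagged training data (made up for the sketch).
train = [("the", "Det"), ("lead", "N"), ("paint", "N"), ("is", "V"),
         ("unsafe", "Adj"), ("the", "Det"), ("lead", "N"), ("lead", "V")]

counts = defaultdict(Counter)
for word, tag in train:
    counts[word][tag] += 1

def baseline_tag(word):
    """Tag a word with its most frequent training tag; unknown words become nouns."""
    if word in counts:
        return counts[word].most_common(1)[0][0]
    return "N"

print([baseline_tag(w) for w in ["the", "lead", "paint", "snarfle"]])
# ['Det', 'N', 'N', 'N']
```

Note how "lead" gets N, its more frequent tag, even in contexts where V would be correct; closing that gap is what the HMM's transition probabilities are for.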

  35. HMMs as POS Taggers • Introduce Hidden Markov Models for part-of-speech tagging/sequence classification • Cover the three fundamental questions of HMMs: • How do we fit the model parameters of an HMM? • Given an HMM, how do we efficiently calculate the likelihood of an observation w? • Given an HMM and an observation w, how do we efficiently calculate the most likely state sequence for w?

  36. HMMs • (Figure: a chain of states s0 → s1 → s2 → … → sn, each emitting a word w1, w2, …, wn.) • We want a model of sequences s and observations w • Assumptions: • States are tag n-grams • Usually a dedicated start and end state / word • Tag/state sequence is generated by a Markov model • Words are chosen independently, conditioned only on the tag/state • These are totally broken assumptions: why?

  37. Part of speech tagging • 8 (ish) traditional parts of speech • Noun, verb, adjective, preposition, adverb, article, interjection, pronoun, conjunction, etc • This idea has been around for over 2000 years (Dionysius Thrax of Alexandria, c. 100 B.C.) • Called: parts-of-speech, lexical category, word classes, morphological classes, lexical tags, POS

  38. POS examples • N noun chair, bandwidth, pacing • V verb study, debate, munch • ADJ adjective purple, tall, ridiculous • ADV adverb unfortunately, slowly • P preposition of, by, to • PRO pronoun I, me, mine • DET determiner the, a, that, those

  39. POS Tagging example • the/DET koala/N put/V the/DET keys/N on/P the/DET table/N

  40. POS Tagging • Words often have more than one POS: back • The back door = JJ • On my back = NN • Win the voters back = RB • Promised to back the bill = VB • The POS tagging problem is to determine the POS tag for a particular instance of a word. These examples from Dekang Lin

  41. POS tagging as a sequence classification task • We are given a sentence (an “observation” or “sequence of observations”) • Secretariat is expected to race tomorrow • She promised to back the bill • What is the best sequence of tags which corresponds to this sequence of observations? • Probabilistic view: • Consider all possible sequences of tags • Out of this universe of sequences, choose the tag sequence which is most probable given the observation sequence of n words w1…wn.

  42. Getting to HMM • We want, out of all sequences of n tags t1…tn, the single tag sequence such that P(t1…tn | w1…wn) is highest: t̂1…n = argmax t1…n P(t1…tn | w1…wn) • Hat ^ means “our estimate of the best one” • argmax x f(x) means “the x such that f(x) is maximized”

  43. Getting to HMM • This equation is guaranteed to give us the best tag sequence • But how to make it operational? How to compute this value? • Intuition of Bayesian classification: • Use Bayes rule to transform into a set of other probabilities that are easier to compute

  44. Using Bayes Rule • P(t1…tn | w1…wn) = P(w1…wn | t1…tn) P(t1…tn) / P(w1…wn) • The denominator P(w1…wn) is the same for every candidate tag sequence, so it can be dropped: • t̂1…n = argmax t1…n P(w1…wn | t1…tn) P(t1…tn)

  45. Likelihood and prior • t̂1…n = argmax t1…n P(w1…wn | t1…tn) P(t1…tn) • Likelihood: P(w1…wn | t1…tn) ≈ ∏i P(wi | ti) (each word depends only on its own tag) • Prior: P(t1…tn) ≈ ∏i P(ti | ti-1) (bigram tag transitions)

  46. Two kinds of probabilities (1) • Tag transition probabilities P(ti | ti-1) • Determiners likely to precede adjectives and nouns • That/DT flight/NN • The/DT yellow/JJ hat/NN • So we expect P(NN|DT) and P(JJ|DT) to be high • But P(DT|JJ) to be low • Compute P(NN|DT) by counting in a labeled corpus: P(ti | ti-1) = C(ti-1, ti) / C(ti-1)

  47. Two kinds of probabilities (2) • Word likelihood probabilities P(wi | ti) • VBZ (3sg Pres verb) likely to be “is” • Compute P(is|VBZ) by counting in a labeled corpus: P(wi | ti) = C(ti, wi) / C(ti)
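Both counting estimates, the transition probability from the previous slide and the word likelihood here, can be sketched over a toy labeled corpus. The tagged sentence below is made up for the example:

```python
from collections import Counter

# A one-sentence "labeled corpus" (hypothetical).
tagged = [("That", "DT"), ("flight", "NN"), ("is", "VBZ"), ("late", "JJ")]

tags        = [t for _, t in tagged]
tag_counts  = Counter(tags)                    # C(t)
tag_bigrams = Counter(zip(tags, tags[1:]))     # C(t_prev, t)
word_tag    = Counter((t, w) for w, t in tagged)   # C(t, w)

def p_trans(t_prev, t):
    """P(t | t_prev) = C(t_prev, t) / C(t_prev)"""
    return tag_bigrams[(t_prev, t)] / tag_counts[t_prev]

def p_emit(w, t):
    """P(w | t) = C(t, w) / C(t)"""
    return word_tag[(t, w)] / tag_counts[t]

print(p_trans("DT", "NN"), p_emit("is", "VBZ"))  # 1.0 1.0
```

On a real corpus these raw relative frequencies would normally be smoothed to avoid zero probabilities for unseen pairs.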

  48. POS tagging: likelihood and prior

  49. An Example: the verb “race” • Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NR • People/NNS continue/VB to/TO inquire/VB the/DT reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN • How do we pick the right tag?

  50. Disambiguating “race”
