
Hidden Markov Models & Conditional Random Fields


Presentation Transcript


  1. Hidden Markov Models & Conditional Random Fields Eran Segal Weizmann Institute

  2. Representing Time • Add the time dimension to variables: X → X(t) • Assumptions • Time can be discretized into interesting points t1, t2, ..., tn • Markov assumption holds • Distribution is stationary (time invariant or homogeneous)
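The two modeling assumptions can be stated compactly (my notation, following the X(t) convention above; this formalization is not on the slide):

    P(X(t+1) | X(0), ..., X(t)) = P(X(t+1) | X(t))        (Markov assumption)
    P(X(t+1) = x' | X(t) = x) is the same for every t      (stationarity / homogeneity)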

  3. Hidden Markov Model • Single (hidden) state variable • Single (observed) observation variable • Transition probability P(Y’|Y) assumed to be sparse • Usually encoded by a state transition graph • Observation probability P(X|Y) (Figure: unrolled network.)

  4. Hidden Markov Model • Single (hidden) state variable • Single (observed) observation variable • Transition probability P(Y’|Y) assumed to be sparse • Usually encoded by a state transition graph • Observation probability P(X|Y) (Figure: state transition representation of P(Y’|Y).)
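As a concrete illustration of slides 3–4 (a minimal sketch; the state space, observation alphabet, and all probabilities below are invented for the example, not taken from the slides), a discrete HMM is fully specified by an initial distribution, a transition matrix P(Y’|Y), and an observation matrix P(X|Y), and the unrolled network corresponds to sampling states and observations step by step:

    import numpy as np

    # Hypothetical 3-state, 2-symbol HMM; all numbers are made up for illustration.
    pi = np.array([1.0, 0.0, 0.0])           # initial distribution P(Y0)
    A = np.array([[0.7, 0.3, 0.0],           # A[i, j] = P(Y' = j | Y = i);
                  [0.0, 0.6, 0.4],           # zeros encode the sparse transition graph
                  [0.0, 0.0, 1.0]])
    B = np.array([[0.9, 0.1],                # B[i, k] = P(X = k | Y = i)
                  [0.2, 0.8],
                  [0.5, 0.5]])

    def sample(T, rng=np.random.default_rng(0)):
        """Unroll the network for T steps: sample hidden states and observations."""
        ys, xs = [], []
        y = rng.choice(len(pi), p=pi)
        for _ in range(T):
            ys.append(y)
            xs.append(rng.choice(B.shape[1], p=B[y]))
            y = rng.choice(len(A), p=A[y])
        return ys, xs

    print(sample(5))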

  5. Hidden Markov Models • Generative models • Assign a joint probability to paired label and observation sequences • Parameters are trained to maximize the joint likelihood of the training data • To make inference tractable, there are typically no long-range dependencies (Figure: unrolled network.)
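Spelled out for a length-T sequence, the joint probability a generative HMM assigns to a paired state sequence y and observation sequence x is the standard factorization (using the P(Y’|Y) and P(X|Y) notation of the previous slides):

    P(y1, ..., yT, x1, ..., xT) = P(y1) P(x1|y1) ∏t=2..T P(yt|yt-1) P(xt|yt)

Training a generative HMM means choosing these local probability tables to maximize this joint quantity over the training pairs.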

  6. Conditional Models • Specifies the probability of possible label sequences given the observations, P(Y|X) • Does not “waste” parameters on modeling P(X|Y) • Key advantage: • Distribution over Y can depend on non-independent features without modeling feature dependencies • Transition probabilities can depend on past and future • Two representations • Maximum Entropy Markov Models (MEMMs) • Conditional Random Fields (CRFs)

  7. Max Entropy Markov Models • Models the probability of the next state given the previous state and the observations • Training is by iterative scaling in the maximum entropy framework • Weakness: the label bias problem (Figure: HMM vs. MEMM graphical structures.)
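The per-state model referred to here is a maximum-entropy (log-linear) distribution over the next state; in the usual formulation (reconstructed from the MEMM literature, not shown on this slide):

    P(yt | yt-1, xt) = exp( Σk λk fk(yt, yt-1, xt) ) / Z(yt-1, xt)

where Z(yt-1, xt) normalizes over the possible next states yt. The normalization is local (per source state and per position), which is exactly what produces the label-bias problem described on the next slides.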

  8. Label-Bias Problem of MEMMs • Transitions from a state compete only with each other • Transition scores are conditional probabilities of next states given the current state and observation • This implies a “conservation of score mass” whereby the mass of a state is distributed among its next states • Observations do not affect the total mass passed to the next states • ⇒ Bias toward states with fewer outgoing transitions • States with a single outgoing transition ignore their observations entirely • Unlike in HMMs, observations after a branch point in a path are effectively ignored

  9. Label-Bias Problem: Example • A model for distinguishing ‘rob’ from ‘rib’ • Suppose we get the input sequence ‘rib’ • First step: ‘r’ matches both possible transitions, so both are equally likely • Next, ‘i’ is observed, but since y1 and y4 each have only one outgoing transition, both give probability 1 to their next state regardless of the observation (see the numeric sketch below) • This does not happen in HMMs • Note: if one word is more likely in the training data, it will win (Figure: state transition representation for ‘rob’ vs. ‘rib’.)
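A minimal numeric sketch of this example (the per-state distributions below are invented; only the transition structure follows the slide). Because every state's outgoing distribution is locally normalized, the path through the ‘rib’ branch and the path through the ‘rob’ branch receive the same score for the input ‘rib’, even though the ‘rob’ branch never matches the observed ‘i’:

    # Label bias in a locally normalized (MEMM-style) model for 'rob' vs 'rib'.
    # Each inner dict is a per-state distribution over next states and sums to 1.

    def next_state_dist(state, obs):
        if state == "y0":                 # branch point: 'r' is consistent with both words
            return {"y1": 0.5, "y4": 0.5}
        if state == "y1":                 # single outgoing transition (expects 'i'),
            return {"y2": 1.0}            # so the observation cannot change the score
        if state == "y4":                 # single outgoing transition (expects 'o'),
            return {"y5": 1.0}            # so the observation is ignored as well
        if state in ("y2", "y5"):         # both words end with 'b'
            return {"y3": 1.0}
        return {}

    def path_score(path, obs_seq):
        score = 1.0
        for (y, y_next), obs in zip(zip(path, path[1:]), obs_seq):
            score *= next_state_dist(y, obs).get(y_next, 0.0)
        return score

    obs = ["r", "i", "b"]
    print("rib path:", path_score(["y0", "y1", "y2", "y3"], obs))  # 0.5
    print("rob path:", path_score(["y0", "y4", "y5", "y3"], obs))  # 0.5 as well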

  10. Conditional Random Fields • Advantages of MEMMs without the label bias problem • Key difference • MEMMs use per-state model for conditional probabilities of next state given current state • CRFs have a single model for the joint probability of the entire sequence of labels given the observations • Thus, weights of different features at different states can trade off against each other • CRF training • Maximum likelihood or MAP • Loss function is convex, guaranteeing convergence to global optimum
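The training objective referred to here is the conditional log-likelihood of the labeled training pairs (x(i), y(i)), optionally with a Gaussian prior on the weights for MAP estimation (the prior term is the usual choice, not something stated on the slide):

    ℓ(λ, μ) = Σi log P(y(i) | x(i)) − (‖λ‖² + ‖μ‖²) / 2σ²

This objective is concave in the weights (equivalently, the negative log-likelihood loss is convex), which is why any local optimum found during training is global.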

  11. Conditional Random Fields • Let G = (V, E) be a graph with vertices V and edges E such that Y = (Yv), v ∈ V, i.e., Y is indexed by the vertices of G • Then (X, Y) is a CRF if, conditioned on X, the random variables Yv obey the Markov property with respect to the graph (written out below), where the conditioning set for Yv is its set of neighbors in G • Model only P(Y|X) (Figure: linear-chain CRF.)
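Written out, the Markov property referred to above is the standard CRF definition of Lafferty, McCallum, and Pereira (the slide's equation did not survive transcription):

    P(Yv | X, Yw, w ≠ v) = P(Yv | X, Yw, w ~ v)

where w ~ v means that w and v are neighbors in G: conditioned on the observations X and its neighbors in the graph, Yv is independent of all other label variables.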

  12. Conditional Random Fields • Joint probability distribution for trees over Y (its form is sketched below) • Cliques (and thus potentials) are the edges and vertices • x are the observed variables • y are the state variables • y[S] denotes the components of y associated with the vertices in S • fk is an edge feature with weight λk • gk is a vertex feature with weight μk • Note that features can depend on all of the variables in x
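In its standard form for tree-structured G (reconstructed from the features listed above; the slide's equation did not survive transcription), the conditional distribution is

    p(y | x) ∝ exp( Σ(e∈E) Σk λk fk(e, y[e], x) + Σ(v∈V) Σk μk gk(v, y[v], x) )

with a normalization constant that depends on x only, so the model never has to represent the distribution of the observations themselves.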

  13. Representing HMMs with CRFs • An HMM can be represented by a CRF with indicator features and log-probability weights (one reconstruction is sketched below) • Note that we define one feature for each state pair (y’,y) and one feature for each state-observation pair (y,x)
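One reconstruction of this standard equivalence (the slide's formulas did not survive transcription): use indicator features with log-probability weights,

    f(y’,y)(yt-1, yt) = 1[yt-1 = y, yt = y’],    λ(y’,y) = log P(y’ | y)
    g(y,x)(yt, xt)    = 1[yt = y, xt = x],       μ(y,x) = log P(x | y)

With these weights the CRF's conditional distribution over label sequences matches the HMM's P(Y|X); the HMM's initial-state distribution can be folded in with one extra per-state feature at the first position.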

  14. Parameter Estimation

  15. Applications

  16. Conditional Random Fields • Joint probability distribution for trees over Y • Cliques (and thus potentials) are the edges and vertices
