
Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data


Presentation Transcript


  1. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data Yingbo Max Wang, Christian Warloe, Yolanda Xiao, Wenlong Xiong

  2. Overview • Joint Probability with Markov Random Fields (MRF) • Conditional Random Fields (CRF), a special case of MRF • Inference for CRF • Parameter Estimation for CRF • Experimental Results

  3. Modeling Joint Probability • How do we model the joint probability distribution for a group of random variables? • With no independence assumption, the number of parameters is exponential: a full table for P(x_1 ... x_n) needs about (# of outcomes per variable) ^ (# of random variables) entries • With a complete independence assumption, P(x_1 ... x_n) = P(x_1) ... P(x_n), which needs only about (# of outcomes per variable) × (# of random variables) entries, but this is an oversimplification in most cases (the variables are actually correlated) • We need some middle ground • model dependence and independence between random variables efficiently
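As a concrete sanity check (numbers chosen only for illustration): with 10 binary variables, a full joint table has 2^10 = 1024 entries, while full independence leaves only 10 separate two-entry tables, i.e. about 20 numbers.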

  4. Markov Random Fields (MRF)

  5. Markov Random Fields (MRF) • MRF Definition: • Undirected graph G = (V, E) • Set of random variables X indexed by the nodes V • Edges represent direct dependencies between random variables • X forms an MRF with respect to G if it satisfies the local Markov property • Local Markov Property • each variable X_v is conditionally independent of all other variables, given its neighbors in the graph G • N(v) are the neighbors of v (nodes directly connected to X_v by a single edge)
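In symbols (standard notation, not from the slides), the local Markov property says:
    P(X_v | X_u, u ≠ v) = P(X_v | X_u, u ∈ N(v))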

  6. Markov Random Fields (MRF) • What is the significance of an MRF? • compact, graphical representation of dependencies between variables • each variable only depends on its immediate neighbors • these conditional independencies imply that we can factorize the joint probability • Factorization simplifies computations and reduces the amount of calculation needed • remember how the independence assumption simplified calculation? • How do we factorize the joint probability? • Factorize into functions on cliques • The Hammersley-Clifford Theorem proves this is valid

  7. Cliques • Clique Definition • A clique is a complete subgraph of G • a complete subgraph is a subset of vertices such that every 2 distinct vertices in the subgraph are adjacent • Example: • The red groups are cliques of a single node each • The orange groups are also cliques because: • A, B, C are all pairwise adjacent (but A is not adjacent to D) • C, D are adjacent (but D is not adjacent to A or B)

  8. Hammersley-Clifford Theorem • Called the fundamental theorem of random fields • Definition: a Markov Random Field is defined by the factorized joint probability written out below • C is the set of all cliques, x_c is the set of random variables in clique c • F_c is a "potential function" that acts on clique c (and is strictly positive) • Z is the partition function (a normalizing constant that makes the probability sum to 1) • P(X) is the joint probability of the set of random variables • The joint probability factorizes into a product of "clique potentials"
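Written out in the slides' notation, the Hammersley-Clifford factorization is:
    P(X = x) = (1/Z) ∏_{c ∈ C} F_c(x_c),    Z = ∑_x ∏_{c ∈ C} F_c(x_c)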

  9. Factorization Example • Cliques are: • (A), (B), (C), (D), (AB), (AC), (BC), (CD), (ABC) • Maximal cliques (not a subset of another clique): • (ABC), (CD) • Therefore, if we only consider maximal cliques:
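For this example, the factorization over maximal cliques is:
    P(A, B, C, D) = (1/Z) F_ABC(A, B, C) F_CD(C, D)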

  10. Clique Potentials • Clique potentials are usually written as an exponential function • { f_k } are local features defined on x_c (indexed by k) • w_k are weights for each feature f_k • The exponential form keeps the clique potential strictly positive • Parameterize the clique potential using user-defined local features • This allows the joint probability to be written as:
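In the slides' notation, a natural way to write this exponential-family form is:
    F_c(x_c) = exp( ∑_k w_k f_k(x_c) )
    P(x) = (1/Z) exp( ∑_{c ∈ C} ∑_k w_k f_k(x_c) )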

  11. Recap: MRF • Set of variables, some dependent, some independent • MRF lets us compactly model a joint distribution, with some independence assumptions • The Hammersley-Clifford theorem lets us factorize the joint probability into clique potentials • Clique potentials can be parameterized using local features and weights • TLDR: the joint probability is a normalized product of exponential clique potentials built from local features and weights

  12. Part-of-Speech Tagging • How would we use an MRF? • Part-of-Speech Tagging Problem: • Model 2 sequences of random variables (length N each) • X - input - sequence of words / a sentence (observations) • Y - output - sequence of labels / tags (hidden states)
  X: [bob ] [made] [her ] [happy    ] [the    ] [other    ] [day ]
  Y: [noun] [verb] [noun] [adjective] [article] [adjective] [noun]

  13. Discriminative vs Generative Models • But MRFs and HMMs are both generative models • A generative model uses a joint distribution P(X, Y) • We don't want to have to model P(X) explicitly, since at prediction time X is always observed and we only care about P(Y | X) • Modeling P(X) well requires making a lot of assumptions about the observations • A discriminative model • uses the conditional probability P(Y | X) • doesn't model P(X); it is simply conditioned on it • Conditional Random Fields are a discriminative special case of MRFs

  14. Conditional Random Fields (CRF) • We have a graph on a set of random variables {X, Y}, but then condition on (fix) the observed variables {X} • If, when conditioned on {X}, the random variables {Y} obey the Markov property with respect to the graph, then {X, Y} is a CRF

  15. Conditional Random Fields • We can define a conditional probability instead of a joint probability for CRFs • Z(x) is a normalization constant for x • The conditional probability factorizes into functions on cliques, just like MRF
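In the same spirit as the MRF factorization, the CRF's conditional probability can be written (slides' notation):
    P(y | x) = (1/Z(x)) ∏_{c ∈ C} F_c(y_c, x),    Z(x) = ∑_y ∏_{c ∈ C} F_c(y_c, x)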

  16. Linear Chain CRF • Same graph as a linear-chain MRF • Hidden states (labels) form a sequence, and are conditioned on the observations (words) • We observe the sequence X (white nodes) • We don't make any assumptions about the relationships among the Xs • The cliques are the nodes and edges of the chain • The CRF paper splits features into edge features and vertex features

  17. Defining the CRF Model • Conditional Probability • y is a sequence of hidden states, each of which can take one of a finite set of label values • x is a sequence of observations, each of which can take one of a finite set of observation values

  18. Defining the CRF Model • Conditional Probability • Features are given and fixed • f_k are features on "hidden state edges" (ex: Y_{i-1} is a noun and Y_i is a verb, given X) • g_k are features on "hidden state vertices" (ex: Y_i is a noun, given X) • lambda_k and mu_k are the parameters (weights) for each feature • Z(x) is a normalization constant that depends on the observations x
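Putting the pieces together for a linear chain (a compact form of the definition in Lafferty, McCallum and Pereira's paper, with edge features written directly on adjacent label pairs):
    p(y | x) = (1/Z(x)) exp( ∑_i ∑_k λ_k f_k(y_{i-1}, y_i, x) + ∑_i ∑_k μ_k g_k(y_i, x) )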

  19. Defining the CRF Model • Since the CRF is a linear chain, we can define "transition weights" from one hidden state in the sequence to the next hidden state • hidden state ( i ) takes on value y • hidden state ( i - 1 ) takes on value y'
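In symbols (following the paper's linear-chain setup), the weight of the transition from value y' at position i-1 to value y at position i is:
    Λ_i(y', y | x) = ∑_k λ_k f_k(y', y, x) + ∑_k μ_k g_k(y, x)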

  20. Defining the CRF Model • Define a matrix M_i, that represents every transition from hidden state ( i - 1 ) to hidden state ( i ) • Let’s look at an example first

  21. Conditional Probability Example • We have a hidden state sequence • plus artificial start and end states • We want to find the probability of this sequence of states, given X

  22. Y_S, Y_1, Y_2, Y_3, Y_E: hidden states • A, B, Start, End: values that the hidden states have taken • Edges in the graph connect consecutive hidden states • Looking at all the edges between two hidden states Y_{i-1} and Y_i, the possible transitions form a table whose rows are indexed by y' ∈ {S, A, B, E} (the value at position i-1) and whose columns by y ∈ {S, A, B, E} (the value at position i)

  23. Defining the CRF Model • Define a matrix M_i, that represents every transition from hidden state ( i - 1 ) to hidden state ( i ) • We can use this matrix to define Z and the P(Y|X)
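Using this transition matrix (the paper's matrix notation, with y_0 = start and y_{n+1} = end):
    M_i(y', y | x) = exp( Λ_i(y', y | x) )
    p(y | x) = ( ∏_{i=1}^{n+1} M_i(y_{i-1}, y_i | x) ) / Z(x)
    Z(x) = ( M_1(x) M_2(x) ⋯ M_{n+1}(x) )_{start, end}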

  24. Recap: CRF • Conditional Random Fields follow from MRF • Discriminative model instead of Generative • All the advantages of MRF (compactly models dependence assumptions) • Conditional Random Fields factor the conditional probability into: • features that act on cliques • weights for each feature • cliques are edges and nodes in graph • Questions: • How to perform inference? • How to train (parameter estimation)?

  25. Inference • How do we perform inference if we know model parameters? • How to find the most likely hidden state sequence y? • To predict the label sequence, we maximize the conditional probability: • We use the Viterbi Algorithm
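That is, the predicted labeling is ŷ = argmax_y p(y | x); since Z(x) does not depend on y, it can be dropped during the maximization.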

  26. Viterbi Algorithm • Given the model, find the most likely sequence of hidden states • Approach: Recursion + Dynamic Programming (same as HMM) • Update for HMM: • Update for CRF: • S is the set of values y can take on. i, j are values in S • delta_t ( j ) is the maximum "probability" of the most likely path ending at y_t = j
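The two updates referenced above, written in the slides' delta notation (the HMM recursion uses transition and emission probabilities; the CRF recursion uses the transition matrices M_t):
    HMM:  δ_t(j) = max_{i ∈ S} [ δ_{t-1}(i) · P(y_t = j | y_{t-1} = i) · P(x_t | y_t = j) ]
    CRF:  δ_t(j) = max_{i ∈ S} [ δ_{t-1}(i) · M_t(i, j | x) ]
A minimal Python sketch of the CRF case, assuming the transition matrices have already been computed and are passed in log space; function and variable names are illustrative, not from the paper:

    import numpy as np

    def crf_viterbi(log_M, start, end):
        """log_M: list of (K x K) arrays with log_M[i][y_prev, y] = log M_i(y_prev, y | x).
        start, end: indices of the artificial start and end states.
        Returns the most likely label sequence, with start/end stripped."""
        K = log_M[0].shape[0]
        delta = np.full(K, -np.inf)
        delta[start] = 0.0                     # every path begins in the start state
        backptr = []
        for M in log_M:                        # one transition matrix per position
            scores = delta[:, None] + M        # scores[y_prev, y]
            backptr.append(scores.argmax(axis=0))
            delta = scores.max(axis=0)         # delta_t(y): best log-score of a path ending in y
        path = [end]                           # trace the best path back from the end state
        for bp in reversed(backptr):
            path.append(bp[path[-1]])
        path.reverse()
        return path[1:-1]

    # Toy example: 2 labels plus start/end = 4 states; a length-3 sequence uses 4 matrices.
    # (A real CRF would assign -inf to transitions into start or out of end.)
    rng = np.random.default_rng(0)
    Ms = [np.log(rng.random((4, 4))) for _ in range(4)]
    print(crf_viterbi(Ms, start=0, end=3))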

  27. Calculating Marginal Probabilities • How do we calculate the most likely label for a specific state in the sequence? (or most likely transition for a pair of states?) • Use the Forward-Backward algorithm to calculate marginal probabilities • Probability of an edge/vertex is the normalized sum of all paths through that edge / vertex • Use forward and backward vectors to cache these sums

  28. Calculating Marginal Probabilities • To calculate probability of an edge being in a path: • To calculate probability of a vertex being in a path:
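In matrix form (the paper's notation), the forward and backward vectors and the resulting marginals are:
    α_0(y | x) = 1 if y = start, else 0;      α_i(x) = α_{i-1}(x) M_i(x)
    β_{n+1}(y | x) = 1 if y = end, else 0;    β_i(x)ᵀ = M_{i+1}(x) β_{i+1}(x)ᵀ
    edge:    P(Y_{i-1} = y', Y_i = y | x) = α_{i-1}(y' | x) · M_i(y', y | x) · β_i(y | x) / Z(x)
    vertex:  P(Y_i = y | x) = α_i(y | x) · β_i(y | x) / Z(x)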

  29. Parameter Estimation for CRF • We want to find the best values for μ,λ

  30. Objective Function • How do we define which parameters are best? • Normalized Log Likelihood Function
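One standard way to write this objective (θ = (λ, μ), with p̃ the empirical distribution of the training pairs):
    O(θ) = ∑_{x,y} p̃(x, y) log p_θ(y | x)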

  31. Improved Iterative Scaling Algorithm • We want to change the parameters in a way that increases the log likelihood • Trying to maximize this directly results in a set of highly coupled equations • We instead maximize a simpler lower bound

  32. Improved Iterative Scaling Algorithm • Take the derivative and set to zero to find the parameter change that maximizes the increase in likelihood

  33. Improved Iterative Scaling Algorithm • Take the derivative and set it to zero to find the parameter change that maximizes the increase in likelihood (continued from the previous slide)
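Roughly, following the IIS derivation: setting the derivative of the lower bound to zero gives, for each edge update δλ_k (and analogously for each vertex update δμ_k),
    Ẽ[f_k] = ∑_{x,y} p̃(x) p_θ(y | x) f_k(x, y) exp( δλ_k T(x, y) )
where Ẽ denotes the expectation under the empirical distribution and T(x, y) = ∑_i ∑_k f_k(...) + ∑_i ∑_k g_k(...) is the total feature count of the pair (x, y). Because T(x, y) varies with (x, y), this cannot be solved for δλ_k in closed form, which motivates Algorithms S and T.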

  34. Algorithm S • How do we sum over varying T(x,y)? • How do we sum over all y (exponential number of combinations)?

  35. Algorithm S • Idea 1: Use a slack feature (i.e. upper bound) S instead of T(x,y)
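Concretely (the paper's Algorithm S), define a slack feature
    s(x, y) = S − ∑_i ∑_k f_k(y_{i-1}, y_i, x) − ∑_i ∑_k g_k(y_i, x)
with the constant S chosen large enough that s(x, y) ≥ 0 for every training pair; the total feature count can then be treated as the constant S in the update equations.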

  36. Algorithm S • Idea 2: Since each feature only depends on a single edge or vertex, sum over all possible edges/vertices instead of sequences (using marginal probabilities)
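Because each feature touches only one edge or vertex, the model expectations needed in the updates can be computed from the forward-backward marginals, e.g. for an edge feature:
    E[f_k] = ∑_x p̃(x) ∑_i ∑_{y', y} P(Y_{i-1} = y', Y_i = y | x) f_k(y', y, x)
and similarly for the vertex features g_k using the vertex marginals.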

  37. Putting it Together

  38. Final Update Equations • Define update equation for μk similarly, using marginal probability of vertex instead of edge
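With the slack feature in place, the updates decouple into closed form (Ẽ is the empirical expectation, E the model expectation computed from the marginals above):
    δλ_k = (1/S) log( Ẽ[f_k] / E[f_k] ),    δμ_k = (1/S) log( Ẽ[g_k] / E[g_k] )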

  39. Improving on Algorithm S • S is usually very large (proportional to the length of the longest training sequence) • Dataset has sequences of varying length • Large S causes parameter updates to be very small • Long time to convergence • Can we use a better approximation of T(x,y)?

  40. Algorithm T • Instead of taking a global upper bound on T(x,y), take the upper bound given x (per-sequence S calculation):
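That is, define T(x) = max_y T(x, y), the maximum total feature count for a given observation sequence x.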

  41. Algorithm T • Group sums by the values of T(x)
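Grouping training sequences by t = T(x), the update condition for each δλ_k becomes a polynomial equation in β_k = exp(δλ_k) (paraphrasing the paper's Algorithm T; the coefficients a_{k,t} are model expectations of f_k restricted to sequences with T(x) = t):
    ∑_t a_{k,t} β_k^t = Ẽ[f_k]
and similarly for the vertex parameters μ_k.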

  42. Algorithm T • We can use Newton’s method to find the root of the resulting polynomials
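A tiny Python sketch of that root-finding step, assuming nonnegative coefficients a[t] (the a_{k,t} above) and a positive target expectation; this is a generic Newton iteration written for illustration, not code from the paper:

    def positive_root(a, target, beta=1.0, tol=1e-12, max_iter=100):
        """Solve sum_t a[t] * beta**t = target for its unique positive root.
        Assumes a[t] >= 0 with at least one positive coefficient for t >= 1,
        so the polynomial is increasing and convex for beta > 0."""
        for _ in range(max_iter):
            p = sum(c * beta**t for t, c in enumerate(a)) - target
            dp = sum(t * c * beta**(t - 1) for t, c in enumerate(a) if t >= 1)
            step = p / dp
            beta -= step
            if abs(step) < tol:
                break
        return beta

The parameter update is then recovered as δλ_k = log(β_k).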

  43. Experimental Results • Experiments • Modeling Mixed-Order Sources • Part-of-Speech (POS) Tagging • Models Tested • Hidden Markov Model (HMM): Generative • Conditional Random Field (CRF): Discriminative • Maximum-Entropy Markov Model (MEMM): Discriminative • Normalizes locally at each hidden state, rather than normalizing probabilities globally over the whole sequence • Suffers from the Label Bias Problem

  44. Modeling Mixed-Order Sources • Data Generation • Synthetic data from randomly chosen HMMs that mix first-order and second-order models • State transition probability: p_α(y_i | y_{i−1}, y_{i−2}) = α p_2(y_i | y_{i−1}, y_{i−2}) + (1 − α) p_1(y_i | y_{i−1}) • Emission probability: p_α(x_i | y_i, x_{i−1}) = α p_2(x_i | y_i, x_{i−1}) + (1 − α) p_1(x_i | y_i) • Training and testing data: 1,000 sequences of length 25 each • Training and Testing • Train the CRF with Algorithm S, then use the Viterbi Algorithm to label a test set • The MEMMs and CRFs do not use overlapping features of the observations

  45. Modeling Mixed-Order Sources • Results • Error rates increase for all models as the data become "more second-order" • CRF typically outperforms MEMM, except for a few cases with small error rate (α < 0.01) • Possibly due to an insufficient number of CRF training iterations • HMM almost always outperforms MEMM • CRF typically outperforms HMM when the data are mostly second-order (α > 1/2)
  [Scatter plots compare the models' error rates pairwise; point sets are distinguished by α < 1/2, α > 1/2, and α < 0.01.]

  46. Part-of-Speech (POS) Tagging • Dataset • Penn Treebank part-of-speech tagset • 45 syntactic tags • 50% training data, 50% testing data • Experiment #1 • First-order HMM, MEMM, and CRF • Results • Accuracy: CRF > HMM > MEMM • The MEMM suffers from the Label Bias Problem
