## Natural Language Processing COMPSCI 423/723


Rohit Kate

**What is Probability?**

• In ordinary language, probability is the degree of certainty of an event: "It is very probable that it will rain today."
• Probability theory gives a formal mathematical framework for working with numerical estimates of the certainty of events. Using it, one can:
  • Predict the likelihood of combinations of events
  • Predict the most likely outcome
  • Predict something given that something else happened

**Why is Probability Theory Important for NLP?**

• To obtain estimates for the various outcomes of an ambiguity, for example, to predict the most likely parse or interpretation of an ambiguous sentence given the context or background knowledge:
  • Time flies like an arrow.
    • Time goes by fast: 0.9
    • A particular type of flies, "time flies", like an arrow: 0.05
    • Measure the speed of flies the way you would measure the speed of an arrow: 0.03
    • …
• Instead of using ad hoc arithmetic to encode estimates of certainty, probability theory is preferable because it has sound mathematics behind it, for example rules for combining probabilities

**Definitions**

• Sample space (Ω): the space of possible outcomes
  • Outcomes of throwing a die: {1,2,3,4,5,6}
• Event: a subset of the sample space
  • An even number will show up: {2,4,6}
  • The number 2 will show up: {2}
• Probability function (or probability distribution): a mapping from events to real numbers in [0,1] such that:
  • P(Ω) = 1
  • For any events α and β, if α ∩ β = ∅ (disjoint) then P(α ∪ β) = P(α) + P(β)

**An Example**

• For throwing a die, P({1,2,3,4,5,6}) = 1 (some outcome shows up)
• Suppose each basic outcome is equally likely; since the outcomes are all disjoint and their probabilities add up to 1, we have P({1}) = P({2}) = … = P({6}) = 1/6
• P(an even number shows up) = P({2,4,6}) = P({2}) + P({4}) + P({6}) = 1/6 + 1/6 + 1/6 = 1/2
• P(an even number or 3 shows up) = 1/2 + 1/6 = 2/3
• P(an even number or 6 shows up) ≠ 1/2 + 1/6. Why? The events are not disjoint: P({2,4,6} ∪ {6}) = P({2,4,6}) = 1/2

**Interpretation of Probability**

• Frequentist interpretation: P({3}) = 1/6 means that if a die is thrown many times, 3 will show up 1/6th of the time. But what would P(it will rain tomorrow) = 1/2 mean?
• Subjective interpretation: one's degree of belief that the event will happen
• The mathematical rules hold for both interpretations.

**Estimating Probabilities**

• For well-defined sample spaces and events, probabilities can be estimated analytically: P({3}) = 1/6 (assuming a fair die)
• For many other sample spaces analytical estimation is not possible, for example P(a teenager will drink and drive)
• In these cases probabilities can be estimated empirically from a good sample: P(a teenager will drink and drive) = (# of teenagers who drink and drive) / (# of teenagers)

**Conditional Probability**

• The updated probability of an event given that some other event happened
• P({2}) = 1/6, but what is P({2} given that an even number showed up)?
• Written P({2}|{2,4,6}), or in general P(A|B)
• P(A|B) = P(A∩B)/P(B) (for P(B) > 0)
• P({2}|{2,4,6}) = P({2})/P({2,4,6}) = (1/6)/(1/2) = 1/3

**Multiplicative and Chain Rules**

• Multiplicative rule: P(A∩B) = P(B)P(A|B) = P(A)P(B|A)
• Generalization of the rule, the chain rule: P(A1∩A2∩…∩An) = P(A1)P(A2|A1)P(A3|A1∩A2)…P(An|A1∩…∩An−1), or in any order of the As
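The die examples above are small enough to verify by brute-force enumeration. The following minimal Python sketch (an illustration added here, not part of the original slides) computes event probabilities under the uniform distribution and checks the conditional-probability definition and the multiplicative rule:

```python
from fractions import Fraction

# Fair die: each of the six outcomes is equally likely.
SAMPLE_SPACE = {1, 2, 3, 4, 5, 6}

def prob(event):
    """P(event) = |event| / |sample space| under the uniform distribution."""
    return Fraction(len(event & SAMPLE_SPACE), len(SAMPLE_SPACE))

def cond_prob(a, b):
    """P(A|B) = P(A ∩ B) / P(B), defined only when P(B) > 0."""
    return prob(a & b) / prob(b)

two = {2}
even = {2, 4, 6}

print(prob(even))              # 1/2
print(cond_prob(two, even))    # 1/3, matching P({2}|{2,4,6}) on the slide

# Multiplicative rule: P(A ∩ B) = P(B) P(A|B) = P(A) P(B|A)
assert prob(two & even) == prob(even) * cond_prob(two, even)
assert prob(two & even) == prob(two) * cond_prob(even, two)
```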
**Independence**

• Two events A and B are independent of each other if P(A∩B) = P(A)P(B), or equivalently P(A) = P(A|B) or P(B) = P(B|A), i.e. the happening of event B does not change the probability of A, or vice versa. Otherwise the events are dependent.
• Example:
  • P({1,2}) = 1/3, P(Even) = 1/2
  • P({1,2} ∩ Even) = P({2}) = 1/6 = P({1,2}) × P(Even)
  • P({1,2}|Even) = P({1,2} ∩ Even)/P(Even) = (1/6)/(1/2) = 1/3. Hence P({1,2}|Even) = P({1,2})
  • Learning that an even number showed up does not change the probability that a one or a two showed up; hence these are independent events.

**Independence**

• P({2}|Even) = P({2} ∩ Even)/P(Even) = P({2})/P({2,4,6}) = (1/6)/(1/2) = 1/3, so P({2}) ≠ P({2}|Even)
• Learning that an even number showed up increases the probability that the number was 2; hence these are not independent events
• The outcome of a second throw of a die is assumed to be independent of the first throw: P(consecutive {2}) = P({2}) × P({2}) = 1/6 × 1/6 = 1/36

**Independence**

• If A1, A2, …, An are independent then
  P(A1∩A2∩…∩An) = P(A1)P(A2|A1)P(A3|A1∩A2)…P(An|A1∩…∩An−1) (chain rule)
  = P(A1)P(A2)P(A3)…P(An)
• The independence assumption is often used in NLP to simplify the computation of complicated probabilities.
• For example, the probability of a parse tree is often simplified as the product of the probabilities of generating its individual productions, e.g. for the parse [S [NP [Article The] [NN girl]] [VP [Verb ate] [NP [Article the] [NN cake]]]]

**Conditional Independence**

• Two events A and B are conditionally independent given C if P(A∩B|C) = P(A|C)P(B|C)
• Conditional independence is encountered more often than unconditional independence.

**Maximum Likelihood Estimate**

• P(a teenager will drink and drive) = (# of teenagers who drink and drive) / (# of teenagers)
• Suppose that out of a sample of 5 teenagers, one drinks and drives: P(a teenager will drink and drive) = 1/5
• Relative frequency estimates can be proven to be maximum likelihood estimates (MLE) because they maximize the probability of generating the sampled data
• Any other probability value explains the data with lower probability

**Maximum Likelihood Estimates**

• If P(TDD) = 1/5 then P(not TDD) = 4/5
• Probability of the sampled data, assuming each teenager is independent:
  P(Data) = (1/5) × (4/5)^4 = 0.08192
• It will be lower for any other value of P(TDD). For P(TDD) = 1/6:
  P(Data) = (1/6) × (5/6)^4 ≈ 0.0804
• For P(TDD) = 1/4:
  P(Data) = (1/4) × (3/4)^4 ≈ 0.0791
• Whenever they can be computed, simple frequency counts are not only intuitive but theoretically also the best probability estimates

**Bayes' Theorem**

• Lets us calculate P(B|A) in terms of P(A|B):
  P(B|A) = P(A|B)P(B) / P(A)
• For example, using Bayes' theorem we can calculate P(Hypothesis|Evidence) in terms of P(Evidence|Hypothesis), which is usually easier to estimate.

**Bayes' Theorem**

• Simple proof from the definition of conditional probability:
  P(B|A) = P(A∩B) / P(A)   (def. cond. prob.)
  P(A|B) = P(A∩B) / P(B)   (def. cond. prob.)
  so P(A∩B) = P(A|B)P(B), and therefore
  P(B|A) = P(A|B)P(B) / P(A)   QED
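As a quick numeric check of the theorem, here is a short Python sketch (illustrative, not from the slides) that recovers P({2}|Even) = 1/3 from the die example via Bayes' theorem:

```python
from fractions import Fraction

# Die example: A = "an even number showed up", B = "the number was 2".
p_b = Fraction(1, 6)          # P({2})
p_a = Fraction(1, 2)          # P(Even) = P({2,4,6})
p_a_given_b = Fraction(1)     # P(Even | {2}) = 1, since 2 is even

# Bayes' theorem: P(B|A) = P(A|B) P(B) / P(A)
p_b_given_a = p_a_given_b * p_b / p_a
print(p_b_given_a)            # 1/3, matching the direct computation of P({2}|Even)
```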
**Random Variable**

• Represents a measurable value associated with outcomes, for example the number that showed up on the die, or the sum of the numbers in consecutive throws of a die
• Let X represent the number that showed up; P(X=2) is the probability that 2 showed up
• Let Z represent the sum of the numbers that showed up when the die was thrown twice; P(Z > 5) is the probability that the sum was greater than 5

**Probability Distribution**

• Probability distribution: a specification of the probabilities for all the values of a random variable (they should add to 1)
• Example: X represents the number that shows up when a die is thrown; for a fair die, P(X=1) = P(X=2) = … = P(X=6) = 1/6
• Given a probability distribution, the probability of any event over the random variable can be computed, e.g. P(X>5) or P(X=2 or 3)

**Joint Probability Distribution**

• The joint probability distribution for a set of random variables X1,…,Xn gives the probability of every combination of values: P(X1,…,Xn)
• Given a joint probability distribution, the probability of any event over the random variables can be computed
• Example: S: shape (circle, square); C: color (red, blue); L: label (positive, negative)

**Joint Probability Distribution**

• Joint distribution of P(S,C,L): the original slide gives a table with one probability for each of the eight (shape, color, label) combinations

**Marginal Probability Distributions**

• The probability of any conjunction (an assignment of values to some subset of the variables) can be calculated by summing the appropriate subset of values from the joint distribution
• In general, the distribution of a subset of the random variables, e.g. P(C) or P(C∧S), can be computed from the joint distribution; these are called marginal probability distributions

**Conditional Probability Distributions**

• Once marginal probabilities are computed, conditional probabilities can also be calculated
• In general, distributions of subsets of the random variables with conditions, e.g. P(L|C∧S), can also be computed; these are called conditional probability distributions
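The table of P(S,C,L) from the original slide was lost in conversion, but the marginal and conditional computations it supports are easy to sketch in code. The joint probabilities below are assumed, illustrative values only (they sum to 1), not the slide's actual numbers:

```python
# Hypothetical joint distribution P(S, C, L); these numbers are
# illustrative only, NOT the values from the original slide.
joint = {
    ('circle', 'red',  'positive'): 0.20, ('circle', 'red',  'negative'): 0.05,
    ('circle', 'blue', 'positive'): 0.02, ('circle', 'blue', 'negative'): 0.30,
    ('square', 'red',  'positive'): 0.02, ('square', 'red',  'negative'): 0.20,
    ('square', 'blue', 'positive'): 0.01, ('square', 'blue', 'negative'): 0.20,
}
assert abs(sum(joint.values()) - 1.0) < 1e-9  # a proper distribution sums to 1

def marginal(**fixed):
    """Sum the joint over all cells consistent with the fixed values."""
    names = ('S', 'C', 'L')
    return sum(p for cell, p in joint.items()
               if all(cell[names.index(k)] == v for k, v in fixed.items()))

# Marginals: P(C=red) and P(C=red ∧ S=circle)
print(marginal(C='red'))              # 0.47
print(marginal(C='red', S='circle'))  # 0.25

# Conditional: P(L=positive | C=red ∧ S=circle) = P(L ∧ C ∧ S) / P(C ∧ S)
print(marginal(L='positive', C='red', S='circle')
      / marginal(C='red', S='circle'))  # 0.8
```

The same two-step pattern, marginalize then divide, is exactly what Homework 1 at the end of this section asks for, with the real table substituted for the assumed one.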
**Language Models**

Most of these slides have been adapted from Raymond Mooney's slides from his NLP course at UT Austin.

**What is a Language Model (LM)?**

• Given a sentence, how likely is it to be a sentence of the language?
  • The dog bit the man. => very likely, or 0.75
  • Dog man the the bit. => very unlikely, or 0.002
  • The dog bit man. => likely, or 0.15
• A probabilistic model is better than a formal grammar model, which will only give a binary decision
• To specify a correct probability distribution, the probabilities of all sentences in the language must sum to 1

**What are the Uses of an LM?**

• Speech recognition
  • "I ate a cherry" is a more likely sentence than "Eye eight uh Jerry"
• OCR & handwriting recognition
  • More probable sentences are more likely correct readings
• Machine translation
  • More likely sentences are probably better translations
• Generation
  • More likely sentences are probably better NL generations
• Context-sensitive spelling correction
  • "Their are problems wit this sentence."

**What are the Uses of an LM?**

• A language model also supports predicting the completion of a sentence:
  • Please turn off your cell _____
  • Your program does not ______
• Predictive text input systems can guess what you are typing and give choices on how to complete it.

**What is the Probability of a Sentence?**

• By the chain rule:
  P(A1∩A2∩…∩An) = P(A1)P(A2|A1)P(A3|A1∩A2)…P(An|A1∩…∩An−1)
  P(Please,turn,off,your,cell,phone) = P(Please) P(turn|Please) P(off|Please,turn) P(your|Please,turn,off) P(cell|Please,turn,off,your) P(phone|Please,turn,off,your,cell)
• Estimate the above probabilities from a large corpus
• Too many probabilities (parameters) to estimate
• They become sparse and cannot be estimated well

**What is the Probability of a Sentence?**

• Approximate the probability by making independence assumptions
• Bigram approximation:
  P(Please,turn,off,your,cell,phone) ≈ P(Please) P(turn|Please) P(off|turn) P(your|off) P(cell|your) P(phone|cell)
• Trigram approximation:
  P(Please,turn,off,your,cell,phone) ≈ P(Please) P(turn|Please) P(off|Please,turn) P(your|turn,off) P(cell|off,your) P(phone|your,cell)

**N-Gram Model Formulas**

• Word sequences: $w_1^n = w_1 \ldots w_n$
• Chain rule of probability: $P(w_1^n) = \prod_{k=1}^{n} P(w_k \mid w_1^{k-1})$
• Bigram approximation: $P(w_1^n) \approx \prod_{k=1}^{n} P(w_k \mid w_{k-1})$
• N-gram approximation: $P(w_1^n) \approx \prod_{k=1}^{n} P(w_k \mid w_{k-N+1}^{k-1})$

**Estimating Probabilities**

• N-gram conditional probabilities can be estimated from raw text based on the relative frequency of word sequences; these are the maximum likelihood estimates
• To have a consistent probabilistic model, append a unique start (<s>) and end (</s>) symbol to every sentence and treat these as additional words
• Bigram: $P(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n)}{C(w_{n-1})}$
• N-gram: $P(w_n \mid w_{n-N+1}^{n-1}) = \frac{C(w_{n-N+1}^{n-1} w_n)}{C(w_{n-N+1}^{n-1})}$

**Generative Model of a Language**

• An N-gram model can be seen as a probabilistic automaton for generating sentences:
  Start with an <s> symbol.
  Until </s> is generated, stochastically pick the next word based on the conditional probability of each word given the previous N−1 words.

**Training and Testing a Language Model**

• A language model must be trained on a large corpus of text to estimate good parameter (probability) values
• The model can be evaluated based on its ability to predict a high probability for a disjoint (held-out) test corpus
• Ideally, the training (and test) corpus should be representative of the actual application data

**Unknown Words**

• How do we handle words in the test corpus that did not occur in the training data, i.e. out-of-vocabulary (OOV) words?
• Train a model that includes an explicit symbol for an unknown word (<UNK>):
  • Choose a vocabulary in advance and replace all other words in the training corpus with <UNK>, or
  • Replace the first occurrence of each word in the training data with <UNK>.

**Evaluation of an LM**

• Ideally, evaluate the use of the model in the end application (extrinsic evaluation)
  • Realistic
  • Expensive
• Alternatively, evaluate the model on its ability to model a test corpus (intrinsic evaluation)
  • Less realistic
  • Cheaper

**Perplexity**

• A measure of how well a model "fits" the test data
• Uses the probability that the model assigns to the test corpus, normalizes for the number of words N in the test corpus, and takes the inverse:
  $PP(W) = P(w_1 w_2 \ldots w_N)^{-1/N}$
• Measures the weighted average branching factor in predicting the next word (lower is better)
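To make the estimation and perplexity formulas concrete, here is a self-contained Python sketch of an unsmoothed bigram model; the toy corpus and function names are illustrative, not from the slides. Note how a single unseen bigram zeroes out an entire sentence, which is the sparse-data problem the smoothing slides below address:

```python
import math
from collections import Counter

def train_bigram(sentences):
    """MLE bigram estimates: P(w_n | w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1})."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        words = ['<s>'] + sent.split() + ['</s>']
        unigrams.update(words[:-1])            # count contexts only
        bigrams.update(zip(words, words[1:]))
    return {bg: c / unigrams[bg[0]] for bg, c in bigrams.items()}

def sentence_prob(model, sent):
    words = ['<s>'] + sent.split() + ['</s>']
    p = 1.0
    for bg in zip(words, words[1:]):
        p *= model.get(bg, 0.0)                # unseen bigram => probability 0
    return p

def perplexity(model, sentences):
    """PP = P(corpus)^(-1/N), computed in log space for numerical stability."""
    log_p, n = 0.0, 0
    for sent in sentences:
        words = ['<s>'] + sent.split() + ['</s>']
        for bg in zip(words, words[1:]):
            log_p += math.log(model[bg])       # assumes every bigram was seen
            n += 1
    return math.exp(-log_p / n)

train = ['the dog bit the man', 'the man bit the dog']
model = train_bigram(train)
print(sentence_prob(model, 'the dog bit the man'))   # 0.0625
print(sentence_prob(model, 'dog man the the bit'))   # 0.0: starts with unseen bigram
print(perplexity(model, train))                      # low on its own training data
```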
**Sample Perplexity Evaluation**

• Models trained on 38 million words from the Wall Street Journal (WSJ) using a 19,979-word vocabulary
• Evaluated on a disjoint set of 1.5 million WSJ words (the original slide tabulates the resulting perplexities for unigram, bigram, and trigram models)

**Smoothing**

• Since there is a combinatorial number of possible word sequences, many rare (but not impossible) combinations never occur in training, so MLE incorrectly assigns zero to many parameters (also known as the sparse data problem)
• If a new combination occurs during testing, it is given a probability of zero and the entire sequence gets a probability of zero
• In practice, parameters are smoothed (also known as regularized) to reassign some probability mass to unseen events
• Adding probability mass to unseen events requires removing it from seen ones (discounting) in order to maintain a joint distribution that sums to 1

**Laplace (Add-One) Smoothing**

• "Hallucinate" additional training data in which each word occurs exactly once in every possible (N−1)-gram context and adjust estimates accordingly, where V is the total number of possible words (i.e. the vocabulary size):
  • Bigram: $P(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n) + 1}{C(w_{n-1}) + V}$
  • N-gram: $P(w_n \mid w_{n-N+1}^{n-1}) = \frac{C(w_{n-N+1}^{n-1} w_n) + 1}{C(w_{n-N+1}^{n-1}) + V}$
• Tends to reassign too much mass to unseen events, so it can be adjusted to add 0 < δ < 1 instead of 1 (normalized by δV instead of V)
• More advanced smoothing techniques have also been developed

**Model Combination**

• As N increases, the power (expressiveness) of an N-gram model increases, but the ability to estimate accurate parameters from sparse data decreases (i.e. the smoothing problem gets worse)
• A general approach is to combine the results of multiple N-gram models of increasing complexity (i.e. increasing N)

**Interpolation**

• Linearly combine the estimates of N-gram models of increasing order
• Interpolated trigram model:
  $\hat{P}(w_n \mid w_{n-2}, w_{n-1}) = \lambda_1 P(w_n \mid w_{n-2}, w_{n-1}) + \lambda_2 P(w_n \mid w_{n-1}) + \lambda_3 P(w_n)$, where $\sum_i \lambda_i = 1$
• Learn proper values for the λi by training to (approximately) maximize the likelihood of an independent development (also known as tuning) corpus

**Backoff**

• Only use a lower-order model when data for the higher-order model is unavailable (i.e. its count is zero)
• Recursively back off to weaker models until data is available:
  $P_{katz}(w_n \mid w_{n-N+1}^{n-1}) = \begin{cases} P^*(w_n \mid w_{n-N+1}^{n-1}) & \text{if } C(w_{n-N+1}^{n}) > 0 \\ \alpha(w_{n-N+1}^{n-1})\, P_{katz}(w_n \mid w_{n-N+2}^{n-1}) & \text{otherwise} \end{cases}$
  where P* is a discounted probability estimate that reserves mass for unseen events and the α's are backoff weights
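A minimal sketch of the interpolation idea, assuming fixed λ weights rather than weights tuned on a development corpus; the function and the toy dictionaries are illustrative, not from the slides:

```python
def interpolated_trigram_prob(w1, w2, w3, unigram, bigram, trigram,
                              lambdas=(0.6, 0.3, 0.1)):
    """Interpolated estimate of P(w3 | w1, w2).

    `unigram`, `bigram`, and `trigram` are dicts of MLE estimates keyed
    by word tuples. The lambdas are fixed here for illustration; they
    would normally be tuned on a held-out development corpus, and they
    must sum to 1 so the result is still a proper distribution.
    """
    l1, l2, l3 = lambdas
    assert abs(l1 + l2 + l3 - 1.0) < 1e-9
    return (l1 * trigram.get((w1, w2, w3), 0.0)   # trigram evidence
            + l2 * bigram.get((w2, w3), 0.0)      # bigram evidence
            + l3 * unigram.get((w3,), 0.0))       # unigram fallback

# Even with an unseen trigram, the estimate stays nonzero as long as
# lower-order evidence exists (toy numbers below are made up):
uni = {('phone',): 0.01}
bi = {('cell', 'phone'): 0.6}
tri = {}
print(interpolated_trigram_prob('your', 'cell', 'phone', uni, bi, tri))  # 0.181
```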
**A Problem for N-Gram LMs: Long-Distance Dependencies**

• Often the local context does not provide the most useful predictive clues; these are instead provided by long-distance dependencies
• Syntactic dependencies:
  • "The man next to the large oak tree near the grocery store on the corner is tall."
  • "The men next to the large oak tree near the grocery store on the corner are tall."
• Semantic dependencies:
  • "The bird next to the large oak tree near the grocery store on the corner flies rapidly."
  • "The man next to the large oak tree near the grocery store on the corner talks rapidly."
• More complex models of language that use syntax and semantics are needed to handle such dependencies

**Domain-Specific LMs**

• Using a domain-specific corpus, one can build a domain-specific language model
• For example, train a language model for each difficulty level (grades 1 to 12)
• Then automatically predict the difficulty level of a new document and recommend it for that grade level

**A Large LM**

• Google released a 5-gram LM trained on one trillion words in 2006
• Available through the Linguistic Data Consortium (LDC) (not free): http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T13
• Data sizes:
  • File sizes: approx. 24 GB compressed (gzip'ed) text files
  • Number of tokens: 1,024,908,267,229
  • Number of sentences: 95,119,665,584
  • Number of unigrams: 13,588,391
  • Number of bigrams: 314,843,401
  • Number of trigrams: 977,069,902
  • Number of fourgrams: 1,313,818,354
  • Number of fivegrams: 1,176,470,663

**Homework 1, Due Next Week in Class**

Find the marginal distributions P(C∧S) and P(C) from the joint distribution shown in slide #23. Compute the conditional distribution P(S|C) from these.