
Statistical Methods


Presentation Transcript


  1. Statistical Methods • Traditional grammars may be “brittle” • Statistical methods are built on formal theories • Vary in complexity from simple trigrams to conditional random fields • Can be used for language identification, text classification, information retrieval, and information extraction

  2. N-Grams • Text is composed of characters (or words or phonemes) • An N-gram is a sequence of n consecutive characters (or words, ...) • unigram, bigram, trigram • Technically, an N-gram model is a Markov chain of order n-1 • P(c_i | c_{1:i-1}) = P(c_i | c_{i-n+1:i-1}) • Estimate N-gram probabilities by counting over a large corpus (see the sketch below)
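
As an illustration of the counting step, here is a minimal sketch of a character-trigram model built from a corpus string; the corpus, the zero-probability handling, and the function names are placeholders, not anything prescribed by the slides:

```python
from collections import Counter

def char_trigram_model(corpus):
    """Estimate P(c_i | c_{i-2:i-1}) by counting character trigrams in a corpus."""
    trigram_counts = Counter(corpus[i:i + 3] for i in range(len(corpus) - 2))
    bigram_counts = Counter(corpus[i:i + 2] for i in range(len(corpus) - 1))

    def prob(c, context):
        # context is the two preceding characters
        if bigram_counts[context] == 0:
            return 0.0
        return trigram_counts[context + c] / bigram_counts[context]

    return prob

# Usage: prob = char_trigram_model("the dog ran. the dog sat."); prob("e", "th")
```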

  3. Example – Language Identification • Use P(c_i | c_{i-2:i-1}, l), where l ranges over languages • About 100,000 characters of each language are needed • l* = argmax_l P(l | c_{1:N}) = argmax_l P(l) ∏_i P(c_i | c_{i-2:i-1}, l) • Learn the model from a corpus • P(l), the prior probability of a given language, can be estimated • Other examples: spelling correction, genre classification, and named-entity recognition
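
A minimal sketch of the argmax above, assuming one trigram model per language built as in the previous sketch; the prior table and the probability floor for unseen trigrams are illustrative assumptions:

```python
import math

def identify_language(text, models, priors):
    """models[l] is a prob(c, context) function; priors[l] approximates P(l)."""
    best_lang, best_score = None, float("-inf")
    for lang, prob in models.items():
        score = math.log(priors[lang])
        for i in range(2, len(text)):
            # crude floor so unseen trigrams do not zero out the whole product
            score += math.log(prob(text[i], text[i - 2:i]) or 1e-12)
        if score > best_score:
            best_lang, best_score = lang, score
    return best_lang
```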

  4. Smoothing • Problem: what if a particular n-gram does not appear in the training corpus? • Its probability would be 0 – it should be a small but positive number • Smoothing – adjusting the probability of low-frequency counts • Laplace: given n observations, use 1/(n+2) instead of 0 for an unseen event • Backoff model: back off to (n-1)-grams when an n-gram has not been seen (see the sketch below)
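
A sketch of add-one (Laplace) smoothing generalized to a character vocabulary, so unseen trigrams get a small positive probability instead of 0; the counters are assumed to come from a counting pass like the one sketched earlier:

```python
def laplace_prob(trigram_counts, bigram_counts, c, context, vocab_size):
    """Add-one smoothing: every trigram count is treated as one larger than observed."""
    return (trigram_counts[context + c] + 1) / (bigram_counts[context] + vocab_size)
```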

  5. Model Evaluation • Use cross-validation (split corpus into training and evaluation sets) • Need a metric for evaluation • Can use perplexity to describe the probability of a sequence • Perplexity(c_{1:N}) = P(c_{1:N})^(-1/N) • Can be thought of as the reciprocal of the probability, normalized by the sequence length
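
A sketch of computing perplexity with a trigram model, done in log space to avoid underflow; skipping the first two characters and flooring unseen trigrams are simplifications:

```python
import math

def perplexity(text, prob):
    """Perplexity(c_{1:N}) = P(c_{1:N})^(-1/N); lower is better."""
    n = len(text)
    log_p = sum(math.log(prob(text[i], text[i - 2:i]) or 1e-12) for i in range(2, n))
    return math.exp(-log_p / n)
```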

  6. N-gram Word Models • Can be used for text classification • Example: spam vs. ham • Problem: out-of-vocabulary words • Trick: during training, replace the first occurrence of each word with <UNK> and use the word itself thereafter; at test time, treat any unknown word as <UNK> (see the sketch below) • Calculate probabilities from a corpus, then randomly generate phrases
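
A sketch of the <UNK> trick described above; the input is assumed to be a list of pre-tokenized words:

```python
def apply_unk(tokens):
    """Replace the first occurrence of each word with <UNK>, keep later occurrences."""
    seen, out = set(), []
    for word in tokens:
        if word in seen:
            out.append(word)
        else:
            out.append("<UNK>")
            seen.add(word)
    return out

# At test time, any word not in the training vocabulary is mapped to "<UNK>".
```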

  7. Example – Spam Detection • Text classification problem • Train models for P(message|spam) and P(message|ham) using n-grams • Calculate P(message|spam) P(spam) and P(message|ham) P(ham) and take whichever is greater (see the sketch below)
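
A minimal sketch of that comparison using unigram word probabilities; the probability tables and priors are assumed to have been estimated from spam and ham corpora, and the floor for unseen words is an arbitrary choice:

```python
import math

def classify(words, p_word_spam, p_word_ham, p_spam, p_ham):
    """Compare P(message|spam)P(spam) with P(message|ham)P(ham) in log space."""
    spam_score = math.log(p_spam) + sum(math.log(p_word_spam.get(w, 1e-9)) for w in words)
    ham_score = math.log(p_ham) + sum(math.log(p_word_ham.get(w, 1e-9)) for w in words)
    return "spam" if spam_score > ham_score else "ham"
```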

  8. Spam Detection – Other Methods • Represent the message as a set of feature/value pairs • Apply a classification algorithm to the feature vector • Strongly depends on the features chosen • Data compression • Data compression algorithms such as LZW look for commonly recurring sequences and replace later copies with pointers to earlier ones • Append the new message to the list of spam messages and compress, do the same for ham, and classify according to whichever compresses more compactly (see the sketch below)
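
A sketch of the compression idea; LZW itself is not in the Python standard library, so zlib's LZ77-based compressor stands in for it here, which keeps the spirit (shared sequences compress well) without matching the slide's algorithm exactly:

```python
import zlib

def classify_by_compression(message, spam_corpus, ham_corpus):
    """Classify by which corpus the message adds fewer compressed bytes to."""
    def added_bytes(corpus):
        base = len(zlib.compress(corpus.encode()))
        combined = len(zlib.compress((corpus + message).encode()))
        return combined - base
    return "spam" if added_bytes(spam_corpus) < added_bytes(ham_corpus) else "ham"
```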

  9. Information Retrieval • Think WWW and search engines • Characterized by • Corpus of documents • Queries in some query language • Result set • Presentation of the result set (some ordering) • Methods: simple Boolean keyword models, IR scoring functions, the PageRank algorithm, the HITS algorithm

  10. IR Scoring Function - BM25 • Okapi project (Robertson et al.) • Three factors: • The frequency with which a query word appears in the document (TF) • The inverse document frequency (IDF) – inversely related to the number of documents the word appears in • The length of the document • BM25(d_j, q_{1:N}) = Σ_i IDF(q_i) · TF(q_i, d_j) · (k+1) / (TF(q_i, d_j) + k · (1 - b + b · |d_j|/L)), where |d_j| is the length of the document, L is the average document length, and k and b are tuned parameters (a sketch follows)
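
A sketch of that scoring function; the term-frequency and document-frequency tables are assumed to be precomputed, and the default k and b values are common choices rather than anything mandated by the slide:

```python
import math

def bm25(query_terms, doc_tf, doc_len, avg_len, df, n_docs, k=2.0, b=0.75):
    """BM25 score of one document; doc_tf maps term -> count in this document."""
    score = 0.0
    for term in query_terms:
        tf = doc_tf.get(term, 0)
        idf = math.log((n_docs - df.get(term, 0) + 0.5) / (df.get(term, 0) + 0.5))
        score += idf * tf * (k + 1) / (tf + k * (1 - b + b * doc_len / avg_len))
    return score
```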

  11. BM25 cont'd.

  12. Precision and Recall • Precision measures the proportion of the documents in the result set that are actually relevant, e.g., if the result set contains 30 relevant documents and 10 non-relevant documents, precision is .75 • Recall is the proportion of relevant documents that are in the result set, e.g., if 30 relevant documents are in the result set out of a possible 50, recall is .60
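
The worked numbers above, as a small sketch over sets of document ids:

```python
def precision_recall(result_set, relevant):
    """result_set and relevant are sets of document ids."""
    retrieved_relevant = len(result_set & relevant)
    precision = retrieved_relevant / len(result_set)
    recall = retrieved_relevant / len(relevant)
    return precision, recall

# 30 relevant out of 40 retrieved, 50 relevant overall -> precision 0.75, recall 0.60
```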

  13. IR Refinement • Pivoted document length normalization • Longer documents tend to be favored • Instead of document length, use a different normalization function that can be tuned • Use word stems • Use synonyms • Look at metadata

  14. PageRank Algorithm (Google) • Count the links that point to the page • Weight links from “high-quality sites” higher • Minimizes the effect of creating lots of pages that point to the chosen page • PR(p) = (1-d)/N + d · Σ_i PR(x_i)/C(x_i), where PR(p) is the PageRank of p, d is a damping factor (typically about 0.85), N is the total number of pages in the corpus, x_i ranges over the pages that link to p, and C(x_i) is the count of the total number of out-links on page x_i (a sketch of the iteration follows)
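
A minimal sketch of iterating the PageRank equation over a small link graph; the fixed iteration count and the assumption that every page has at least one out-link are simplifications:

```python
def pagerank(links, d=0.85, iterations=50):
    """links: {page: [pages it links to]}; returns {page: PR(page)}."""
    pages = list(links)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        pr = {
            p: (1 - d) / n + d * sum(pr[x] / len(links[x]) for x in pages if p in links[x])
            for p in pages
        }
    return pr
```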

  15. Information Extraction • Ability to answer questions • Possibilities range from simple template matching to full-blown language-understanding systems • May be domain-specific or general • Used as a database front end or for WWW searching • Examples: AskMSR, IBM's Watson, Wolfram Alpha, Siri

  16. Template Matching • Simple template matching (Weizenbaum's Eliza) • Regular-expression matching – finite state automata • Relational extraction methods – FASTUS: • Processing is done in stages: tokenization, complex-word handling, basic-group handling, complex-phrase handling, structure merging • Each stage uses an FSA

  17. Stochastic Methods for NLP • Probabilistic Context-Free Parsers • Probabilistic Lexicalized Context-Free Parsers • Hidden Markov Models – Viterbi Algorithm • Statistical Decision-Tree Models

  18. Markov Chain • Discrete random process: the system is in various states and we move from state to state • The probability of moving to a particular next state (a transition) depends solely on the current state and not on previous states (the Markov property) • May be modeled by a finite state machine with probabilities on the edges (see the sketch below)
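
A small sketch of a Markov chain as a transition table plus a random walk; the two weather states and their probabilities are made up purely for illustration:

```python
import random

TRANSITIONS = {  # made-up example chain
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}

def walk(state, steps):
    """Sample a path; the next state depends only on the current one (Markov property)."""
    path = [state]
    for _ in range(steps):
        choices = TRANSITIONS[state]
        state = random.choices(list(choices), weights=list(choices.values()))[0]
        path.append(state)
    return path
```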

  19. Hidden Markov Model • Each state (or transition) may produce an output • The outputs are visible to the viewer, but the underlying Markov model is not • The problem is often to infer the path through the model given a sequence of outputs • The probabilities associated with the transitions are known a priori • There may be more than one start state; the probability of each start state may also be known

  20. Uses of HMMs • Parts-of-speech (POS) tagging • Speech recognition • Handwriting recognition • Machine translation • Cryptanalysis • Many other non-NLP applications

  21. Viterbi Algorithm • Used to find the most likely sequence of states (the Viterbi path) in an HMM that leads to a given sequence of observed events • Runs in time proportional to (number of observations) * (number of states)^2 • Can be modified if the state depends on the last n states (instead of just the last state); this takes time (number of observations) * (number of states)^n

  22. Viterbi Algorithm - Assumptions • The system at any given time is in one particular state • There are a finite number of states • Transitions have an associated incremental metric • Events are cumulative over a path, i.e., additive in some sense

  23. Viterbi Algorithm - Code • See http://en.wikipedia.org/wiki/Viterbi_algorithm; a sketch is given below
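
A compact dictionary-based sketch of the algorithm (along the lines of the Wikipedia version, not a transcription of it); the states and the start, transition, and emission probability tables are supplied by the caller:

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Most likely state sequence for the observations; O(T * |states|^2)."""
    # V[t][s] = (probability of the best path ending in s at time t, that path)
    V = [{s: (start_p[s] * emit_p[s][observations[0]], [s]) for s in states}]
    for obs in observations[1:]:
        layer = {}
        for s in states:
            prob, path = max(
                (V[-1][prev][0] * trans_p[prev][s] * emit_p[s][obs], V[-1][prev][1] + [s])
                for prev in states
            )
            layer[s] = (prob, path)
        V.append(layer)
    return max(V[-1].values())[1]  # path of the highest-probability final state
```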

  24. Example - Using HMMs • Using HMMs to parse seminar announcements • Look for different features: Speaker, date, etc. • Could use one big HMM for all features or separate HMMs for each feature • Advantages: resistant to noise, can be trained from data, easily updated • Can be used to generate output as well as parse

  25. Example: HMM for speaker recognition

  26. Conditional Random Fields • An HMM models the full joint probability of observations and hidden states – too much work • Instead, model the conditional probability of the hidden attributes given the observations • Given a text e_{1:N}, find the hidden state sequence X_{1:N} that maximizes P(X_{1:N} | e_{1:N}) • A Conditional Random Field (CRF) does this • Linear-chain CRF: variables in temporal sequence (see the sketch below)
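
A minimal sketch of how a linear-chain CRF scores one candidate label sequence; the feature functions and weights are hypothetical, and the normalizing constant Z (needed for the actual conditional probability) is omitted:

```python
def crf_log_score(labels, observations, feature_functions, weights):
    """Unnormalized log-score; exp(score) / Z gives P(X_{1:N} | e_{1:N})."""
    score = 0.0
    for i in range(len(observations)):
        prev = labels[i - 1] if i > 0 else None  # feature functions see adjacent labels
        for f, w in zip(feature_functions, weights):
            score += w * f(prev, labels[i], observations, i)
    return score
```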

  27. Automated Template Construction • Start with examples of output, e.g., author-title pairs • Match over a large corpus, noting the order and the prefix, suffix, and intermediate text • Generate templates from the matches • Sensitive to noise

  28. Types of Grammars - Chomsky • Recursively Enumerable: unrestricted rules • Context-Sensitive: the right-hand side must contain at least as many symbols as the left-hand side • Context-Free: the left-hand side contains a single non-terminal symbol • Regular: the left-hand side is a single non-terminal, the right-hand side is a terminal symbol optionally followed by a non-terminal symbol (equivalent in power to regular expressions)

  29. Probabilistic CFG • 1. sent <- np, vp. p(sent) = p(r1) * p(np) * p(vp). • 2. np <- noun. p(np) = p(r2) * p(noun). • ... • 9. noun <- dog. p(noun) = p(dog). • The probabilities are taken from a particular corpus of text (a sketch of computing a parse probability follows)
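
A small sketch of how a parse tree's probability multiplies rule probabilities; the rule probabilities below are made up for the toy grammar, not taken from any corpus:

```python
RULE_PROB = {  # made-up probabilities for a toy grammar
    ("sent", ("np", "vp")): 0.9,
    ("np", ("noun",)): 0.4,
    ("noun", ("dog",)): 0.05,
}

def tree_prob(tree):
    """tree = (symbol, [children]); a leaf is just a string (a word)."""
    if isinstance(tree, str):
        return 1.0
    symbol, children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    p = RULE_PROB[(symbol, rhs)]
    for child in children:
        p *= tree_prob(child)
    return p

# e.g. tree_prob(("np", [("noun", ["dog"])])) == 0.4 * 0.05
```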

  30. Probabilistic Lexicalized CFG • 1. sent <- np(noun), vp(verb). p(sent) = p(r1) * p(np) * p(vp) * p(verb|noun). • 2. np <- noun. p(np) = p(r2) * p(noun). • ... • 9. noun <- dog. p(noun) = p(dog). • Note that we've introduced the probability of a particular verb given a particular noun
