

Presentation Transcript


  1. Outline • Applications: • Spelling correction • Formal Representation: • Weighted FSTs • Algorithms: • Bayesian Inference (Noisy channel model) • Methods to determine weights • Hand-coded • Corpus-based estimation • Dynamic Programming • Shortest path

  2. Detecting and Correcting Spelling Errors • Sources of lexical/spelling errors • Speech: lexical access and recognition errors (more later) • Text: typing and cognitive • OCR: recognition errors • Applications: • Spell checking • Hand-writing recognition of zip codes, signatures, Graffiti • Issues: • Correct non-words in isolation (dg for dog, why not dig?) • Correcting non-words could lead to valid words • Homophone substitution: “parents love there children”; “Lets order a desert after dinner” • Correcting words in context

  3. Patterns of Error • Human typists make different types of errors from OCR systems -- why? • Error classification I: performance-based: • Insertion: catt • Deletion: ct • Substitution: car • Transposition: cta • Error classification II: cognitive • People don’t know how to spell (nucular/nuclear; potatoe/potato) • Homonymous errors (their/there)

  4. Probability: Refresher • Population: 10 Princeton students • 4 vegetarians • 3 CS majors • What is the probability that a randomly chosen student (rcs) is a vegetarian? p(v) = 0.4 • That a rcs is a CS major? p(c) = 0.3 • That a rcs is a vegetarian and CS major? p(c,v) = 0.2 • That a vegetarian is a CS major? p(c|v) = 0.5 • That a CS major is a vegetarian? p(v|c) = 0.66 • That a non-CS major is a vegetarian? p(v|c’) = ??
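The slide's arithmetic can be checked in a few lines of Python. Note that the 2 students who are both CS majors and vegetarians is implied by p(c,v) = 0.2, and the open question p(v|c') follows from it:

```python
# Toy population from the slide: 10 students, 4 vegetarians, 3 CS majors,
# and (as p(c,v) = 0.2 implies) 2 students who are both.
n, veg, cs, both = 10, 4, 3, 2

p_v = veg / n               # p(v)   = 0.4
p_c = cs / n                # p(c)   = 0.3
p_cv = both / n             # p(c,v) = 0.2
p_c_given_v = p_cv / p_v    # p(c|v) = 0.5
p_v_given_c = p_cv / p_c    # p(v|c) ≈ 0.66

# The question left open on the slide: p(v | c') --
# vegetarians who are not CS majors, out of the 7 non-CS students.
p_v_given_not_c = (veg - both) / (n - cs)   # 2/7 ≈ 0.286
```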

  5. Bayes Rule and Noisy Channel model • We know the joint probabilities • p(c,v) = p(c) p(v|c) (chain rule) • p(v,c) = p(c,v) = p(v) p(c|v) • So, we can define the conditional probability p(c|v) in terms of the prior probabilities p(c) and p(v) and the likelihood p(v|c). • “Noisy channel” metaphor: channel corrupts the input; recover the original. • think cell-phone conversations!! • Hearer’s challenge: decode what the speaker said (w), given a channel-corrupted observation (O). Source model Channel model

  6. How do we use this model to correct spelling errors? • Simplifying assumptions: • We only have to correct non-word errors • Each non-word (O) differs from its correct word (w) by one step (insertion, deletion, substitution, or transposition) • Generate and Test Method (Kernighan et al. 1990): • Generate a word using one insertion, deletion, substitution, or transposition operation • Test whether the resulting word is in the dictionary • Example:
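The generate-and-test step above can be sketched as follows; the five-word dictionary is invented for the example:

```python
def one_edit_candidates(typo, dictionary):
    """Generate every string one edit away from `typo` (insertion, deletion,
    substitution, transposition), then keep only those in the dictionary."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(typo[:i], typo[i:]) for i in range(len(typo) + 1)]
    deletes = {a + b[1:] for a, b in splits if b}
    transposes = {a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1}
    substitutions = {a + ch + b[1:] for a, b in splits if b for ch in letters}
    insertions = {a + ch + b for a, b in splits for ch in letters}
    return (deletes | transposes | substitutions | insertions) & dictionary

words = {"cat", "carat", "cart", "coat", "chat"}
print(sorted(one_edit_candidates("caat", words)))
# → ['carat', 'cart', 'cat', 'chat', 'coat']
```

All five dictionary words turn out to be one edit from "caat", which is exactly why the next slide needs a scoring function to rank them.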

  7. How do we decide which correction is most likely? • Validate the generated word in a dictionary. • But there may be multiple valid words; how do we rank them? • Rank them based on a scoring function: P(w | typo) ∝ P(typo | w) * P(w) • Note there could be other scoring functions • Propose the n-best solutions • Estimate the likelihood P(typo|w) and the prior P(w) • Count events in a corpus to estimate these probabilities • Labeled versus unlabeled corpus • For spelling correction, what do we need? • Word occurrence information (an unlabeled corpus) • A corpus of labeled spelling errors • Approximate word replacement by local letter-replacement probabilities: a confusion matrix over letters

  8. Cat vs Carat • Estimating the Prior: Suppose we look at the occurrence of cat and carat in a large (50M word) AP news corpus • cat occurs 6500 times, so p(cat) = .00013 • carat occurs 3000 times, so p(carat) = .00006 • Estimating the likelihood: Now we need to find out if inserting an ‘a’ after an ‘a’ is more likely than deleting an ‘r’ after an ‘a’ in a corrections corpus of 50K corrections ( p(typo|word)) • suppose ‘a’ insertion after ‘a’ occurs 5000 times (p(+a)=.1) and ‘r’ deletion occurs 7500 times (p(-r)=.15) • Scoring function: p(word|typo) ∝ p(typo|word) * p(word) • p(cat|caat) = p(+a) * p(cat) = .1 * .00013 = .000013 • p(carat|caat) = p(-r) * p(carat) = .15 * .00006 = .000009
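The slide's numbers can be reproduced directly:

```python
# Reproducing the cat-vs-carat arithmetic for the typo "caat".
corpus_words = 50_000_000   # 50M-word AP news corpus
corrections = 50_000        # 50K-correction corpus

p_cat = 6500 / corpus_words      # prior p(cat)   = 0.00013
p_carat = 3000 / corpus_words    # prior p(carat) = 0.00006
p_ins_a = 5000 / corrections     # likelihood P(insert 'a' after 'a') = 0.1
p_del_r = 7500 / corrections     # likelihood P(delete 'r' after 'a') = 0.15

score_cat = p_ins_a * p_cat      # 0.000013
score_carat = p_del_r * p_carat  # 0.000009
# "cat" wins: 0.000013 > 0.000009
```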

  9. Encoding One-Error Correction as WFSTs • Let Σ = {c, a, r, t} • One-edit model: identity arcs (c:c, a:a, r:r, t:t), deletions (c:ε, a:ε, r:ε, t:ε), insertions (ε:c, ε:a, ε:r, ε:t), substitutions (c:a, c:r, c:t, a:c, a:t, …) • Dictionary model: an acceptor over the dictionary words, e.g. c a r a t • One-Error spelling correction: Input ● Edit ● Dictionary

  10. Issues • What if there are no instances of carat in corpus? • Smoothing algorithms • Estimate of P(typo|word) may not be accurate • Training probabilities on typo/word pairs • What if there is more than one error per word?

  11. Minimum Edit Distance • How can we measure how different one word is from another word? • How many operations will it take to transform one word into another? caat --> cat, fplc --> fireplace (*treat abbreviations as typos??) • Levenshtein distance: smallest number of insertion, deletion, or substitution operations that transform one string into another (ins=del=subst=1) • Alternative: weight each operation by training on a corpus of spelling errors to see which is most frequent

  12. Computing Levenshtein Distance • Dynamic Programming algorithm • Solution for a problem is a function of the solutions of subproblems • d[i,j] contains the distance between the prefixes s1…si and t1…tj • d[i,j] is computed by combining the distances of shorter substrings using insertion, deletion, and substitution operations. • The optimal edit operations are recovered by storing back-pointers.
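A minimal sketch of the DP recurrence; the `sub_cost` parameter covers the variant where a substitution costs 2 (i.e., as much as a deletion plus an insertion):

```python
def edit_distance(s, t, sub_cost=1):
    """Dynamic-programming edit distance; insertions and deletions cost 1."""
    m, n = len(s), len(t)
    # d[i][j] holds the distance between prefixes s[:i] and t[:j].
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                # delete all of s[:i]
    for j in range(n + 1):
        d[0][j] = j                # insert all of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            same = s[i - 1] == t[j - 1]
            d[i][j] = min(d[i - 1][j] + 1,                        # deletion
                          d[i][j - 1] + 1,                        # insertion
                          d[i - 1][j - 1] + (0 if same else sub_cost))
    return d[m][n]

print(edit_distance("caat", "cat"))                          # → 1
print(edit_distance("intention", "execution", sub_cost=2))   # → 8
```

(Storing back-pointers alongside `d` would recover the optimal operation sequence, as the slide notes.)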

  13. Edit Distance Matrix NB: errors • Cost = 1 for insertions and deletions; cost = 2 for substitutions • Recompute the matrix with insertions = deletions = substitutions = 1

  14. Levenshtein Distance with WFSTs • Let Σ = {c, a, r, t} • Edit model: identity arcs (c:c, a:a, r:r, t:t), deletions (c:ε, a:ε, r:ε, t:ε), insertions (ε:c, ε:a, ε:r, ε:t), substitutions (c:a, c:r, c:t, a:c, a:t, …) • The two sentences to be compared are encoded as FSTs. • Levenshtein distance between two sentences: Dist(s1, s2) = s1 ● Edit ● s2

  15. Spelling Correction with WFSTs • Dictionary: FST representation of words • Isolated word spelling correction: • AllCorrections(w) = w ● Edit ● Dictionary • BestCorrection(w) = Bestpath (w ● Edit ● Dictionary) • Spelling correction in context: “parents love there children” • S = w1, w2, … wn • Spelling correction of wi • Generate possible edits for wi • Pick the edit that fits best in context • Use an n-gram language model (LM) to rank the alternatives. • “love there” vs “love their”; “there children” vs “their children” • SentenceCorrection (S) = F(S) ● Edit ● LM
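Ranking alternatives in context can be sketched with a toy bigram model; the counts below are invented for illustration, not taken from any corpus:

```python
# Invented bigram counts standing in for an n-gram LM trained on real text.
bigram_counts = {
    ("love", "their"): 1200, ("love", "there"): 30,
    ("their", "children"): 2500, ("there", "children"): 15,
}

def score(words):
    """Product of bigram counts as a crude stand-in for an LM probability."""
    s = 1
    for pair in zip(words, words[1:]):
        s *= bigram_counts.get(pair, 1)   # unseen bigrams get a neutral 1
    return s

candidates = [["love", "their", "children"], ["love", "there", "children"]]
best = max(candidates, key=score)
print(best)   # → ['love', 'their', 'children']
```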

  16. Aoccdrnig to a rscheearch at an Elingsh uinervtisy, it deosn't mttaer in waht oredr the ltteers in a wrod are, the olny iprmoetnt tihng is taht the frist and lsat ltteers are at the rghit pclae. The rset can be a toatl mses and you can sitll raed it wouthit a porbelm. Tihs is bcuseae we do not raed ervey lteter by itslef but the wrod as a wlohe. • Can humans understand ‘what is meant’ as opposed to ‘what is said/written’? • How? • http://www.mrc-cbu.cam.ac.uk/personal/matt.davis/Cmabrigde/

  17. Summary • We can apply probabilistic modeling to NL problems like spell-checking • Noisy channel model, Bayesian method • Training priors and likelihoods on a corpus • Dynamic programming approaches allow us to solve large problems that can be decomposed into subproblems • e.g. Minimum Edit Distance algorithm • A number of Speech and Language tasks can be cast in this framework. • Generate alternatives using a generator • Select the best / rank the alternatives using a model • If the generator and the model are encodable as FSTs, decoding becomes composition followed by a search for the best path.

  18. Word Classes and Tagging

  19. Word Classes and Tagging • Words can be grouped into classes based on a number of criteria. • Application-independent criteria • Syntactic class (Nouns, Verbs, Adjectives…) • Proper names (People names, country names…) • Dates, currencies • Application-specific criteria • Product names (Ajax, Slurpee, Lexmark 3100) • Service names (7-cents plan, GoldPass) • Tagging: Categorizing the words of a sentence into one of the classes.

  20. Syntactic Classes in English: Open Class Words • Nouns: • Defined semantically: words for people, places, things • Defined syntactically: words that take determiners • Count nouns: nouns that can be counted • One book, two computers, hundred men • Mass nouns: nouns that represent homogeneous groups, can occur without articles. • snow, salt, milk, water, hair • Proper nouns; common nouns • Verbs: words for actions and processes • Hit, love, run, fly, differ, go • Adjectives: words for describing qualities and properties (modifiers) of objects • White, black, old, young, good, bad • Adverbs: words for describing modifiers of actions • Unfortunately, John walked home extremely slowly yesterday • Subclasses: locative (home), degree (very), manner (slowly), temporal (yesterday)

  21. Syntactic Classes in English: Closed Class Words • Closed Class words: • fixed set for a language • Typically high frequency words • Prepositions: relational words for describing relations among objects and events • In, on, before, by • Particles: looked up, throw out • Articles/Determiners: definite versus indefinite • Indefinite: a, an • Definite: the • Conjunctions: used to join two phrases, clauses, sentences. • Coordinating conjunctions: and, or, but • Subordinating conjunctions: that, since, because • Pronouns: shorthand to refer to objects and events. • Personal pronouns: he, she, it, they, us • Possessive pronouns: my, your, ours, theirs, his, hers, its, one’s • Wh-pronouns: whose, what, who, whom, whomever • Auxiliary verbs: used to mark tense, aspect, polarity, mood of an action • Tense: past, present, future • Aspect: completed or on-going • Polarity: negation • Mood: possible, suggested, necessary, desired; depicted by modal verbs (can, do, have, may, might) • Copula: “be” connects a subject to a predicate (John is a teacher) • Other word classes: Interjections (ah, oh, alas); negatives (not, no); politeness (please, sorry); greetings (hello, goodbye).

  22. Tagset • Tagset: set of tags to use; depends on the application. • Basic tags; tags with some morphology • Composition of a number of subtags • Agglutinative languages • Popular tagsets for English • Penn Treebank Tagset: 45 tags • CLAWS tagset: 61 tags • C7 tagset: 146 tags • How do we decide how many tags to use? • Application utility • Ease of disambiguation • Annotation consistency • The “IN” tag in the Penn Treebank tagset covers both subordinating conjunctions and prepositions • The “TO” tag represents both the preposition “to” and the infinitival marker (“to read”) • Supertags: fold syntactic information into the tagset • of the order of 1000 tags

  23. Tagging: Disambiguating Words • Three different models • ENGTWOL model (Karlsson et al. 1995) • Transformation-based model (Brill 1995) • Hidden Markov Model tagger • ENGTWOL tagger • Constraint-based tagger • 1,100 hand-written constraints to rule out invalid combinations of tags. • Use of probabilistic constraints and syntactic information • Transformation-based model • Start with the most likely assignment • Make note of the context when the most likely assignment is wrong. • Induce a transformation rule that corrects the most likely assignment to the correct tag in that context. • Rules can be seen as α → β / δ __ γ (rewrite α as β in the context δ _ γ) • Compilable into an FST
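One transformation-based correction step can be sketched as follows; the example rule (NN → VB after TO) is the classic illustration, and the helper name is ours, not Brill's:

```python
def apply_rule(tags, from_tag, to_tag, prev_tag):
    """Rewrite from_tag -> to_tag whenever the previous tag is prev_tag
    (one Brill-style transformation: alpha -> beta in context delta __)."""
    out = list(tags)
    for i in range(1, len(out)):
        if out[i] == from_tag and out[i - 1] == prev_tag:
            out[i] = to_tag
    return out

# Start from the most likely (unigram) tag assignment for "to race",
# then apply the learned correction: NN after TO should be VB.
initial = ["TO", "NN"]
print(apply_rule(initial, "NN", "VB", "TO"))   # → ['TO', 'VB']
```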

  24. Again, the Noisy Channel Model (Source → Noisy Channel → Decoder) • Input to the channel: part-of-speech sequence T • Output from the channel: a word sequence W • Decoding task: find T’ = argmaxT P(T|W) • Using Bayes Rule: T’ = argmaxT P(W|T) P(T) / P(W) • And since P(W) doesn’t change for any hypothetical T’: • T’ = argmaxT P(W|T) P(T) • P(W|T) is the Emit Probability, and P(T) is the prior, or Contextual Probability

  25. Stochastic Tagging: Markov Assumption • The tagging model is approximated using Markov assumptions. • T’ = argmaxT P(T) * P(W|T) • Markov (first-order) assumption: P(T) ≈ ∏i P(ti | ti-1) • Independence assumption: P(W|T) ≈ ∏i P(wi | ti) • Thus: T’ = argmaxT ∏i P(wi | ti) * P(ti | ti-1) • The probability distributions are estimated from an annotated corpus. • Maximum Likelihood Estimate • P(w|t) = count(w,t)/count(t) • P(ti|ti-1) = count(ti-1, ti)/count(ti-1) • Don’t forget to smooth the counts!! • There are other means of estimating these probabilities.
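The maximum-likelihood estimates can be sketched with `collections.Counter`; the six-token corpus is invented, and smoothing is omitted:

```python
from collections import Counter

# A tiny, illustrative tagged corpus.
tagged = [("the", "DT"), ("dog", "NN"), ("runs", "VBZ"),
          ("the", "DT"), ("cat", "NN"), ("runs", "VBZ")]

tag_counts = Counter(t for _, t in tagged)
word_tag_counts = Counter(tagged)
tags = [t for _, t in tagged]
bigram_tag_counts = Counter(zip(tags, tags[1:]))

def p_word_given_tag(w, t):
    """MLE emit probability: count(w,t) / count(t)."""
    return word_tag_counts[(w, t)] / tag_counts[t]

def p_tag_given_prev(t, prev):
    """MLE transition probability: count(prev, t) / count(prev)."""
    return bigram_tag_counts[(prev, t)] / tag_counts[prev]

print(p_word_given_tag("dog", "NN"))   # → 0.5
print(p_tag_given_prev("NN", "DT"))    # → 1.0
```

With counts this sparse the need for smoothing is obvious: any unseen word/tag pair gets probability zero.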

  26. Best Path Search • Search for the best path pervades many Speech and NLP problems. • ASR: best path through a composition of acoustic, pronunciation and language models • Tagging: best path through a composition of lexicon and contextual model • Edit distance: best path through a search space set up by insertion, deletion and substitution operations. • In general: • Decisions/operations create a weighted search space • Search for the best sequence of decisions • Dynamic programming solution • Sometimes the score is only relevant. • Most often the path (sequence of states; derivation) is relevant.

  27. Multi-stage decision problems: tagging “The dog runs .” • States: BOS, DT, NN, VB, NNS, VBZ, ., EOS • Transition probabilities: P(DT|BOS) = 1; P(NN|DT) = 0.9; P(VB|DT) = 0.1; P(NNS|NN) = 0.3; P(VBZ|NN) = 0.7; P(NNS|VB) = 0.7; P(VBZ|VB) = 0.3; P(.|NNS) = 0.3; P(.|VBZ) = 0.7; P(EOS|.) = 1 • Emission probabilities: P(the|DT) = 0.999; P(dog|NN) = 0.99; P(dog|VB) = 0.01; P(runs|NNS) = 0.63; P(runs|VBZ) = 0.37; P(.|.) = 0.999

  28. Multi-stage decision problems: “The dog runs .” • Find the state sequence through this space that maximizes ∏i P(wi|ti) * P(ti|ti-1) • cost(BOS, EOS) = 1 * cost(DT, EOS) • cost(DT, EOS) = max{ P(the|DT) * P(NN|DT) * cost(NN, EOS), P(the|DT) * P(VB|DT) * cost(VB, EOS) }
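The best path through the trellis can be found with a small Viterbi sketch. The probabilities are those printed on slide 27, with the punctuation-tag entries (shown as blanks in the transcript) reconstructed from context:

```python
# Transition P(tag | prev) and emission P(word | tag) tables from the slide.
trans = {("BOS", "DT"): 1.0, ("DT", "NN"): 0.9, ("DT", "VB"): 0.1,
         ("NN", "NNS"): 0.3, ("NN", "VBZ"): 0.7,
         ("VB", "NNS"): 0.7, ("VB", "VBZ"): 0.3,
         ("NNS", "."): 0.3, ("VBZ", "."): 0.7, (".", "EOS"): 1.0}
emit = {("the", "DT"): 0.999, ("dog", "NN"): 0.99, ("dog", "VB"): 0.01,
        ("runs", "NNS"): 0.63, ("runs", "VBZ"): 0.37, (".", "."): 0.999}

def viterbi(words):
    # best[tag] = (probability of the best path ending in tag, that path)
    best = {"BOS": (1.0, [])}
    for w in words:
        nxt = {}
        for (prev, tag), pt in trans.items():
            if prev in best and (w, tag) in emit:
                p = best[prev][0] * pt * emit[(w, tag)]
                if tag not in nxt or p > nxt[tag][0]:
                    nxt[tag] = (p, best[prev][1] + [tag])
        best = nxt
    # P(EOS|.) = 1, so the final transition doesn't change the ranking.
    tag, (p, path) = max(best.items(), key=lambda kv: kv[1][0])
    return path, p

tags, prob = viterbi(["the", "dog", "runs", "."])
print(tags)   # → ['DT', 'NN', 'VBZ', '.']
```

"runs" is tagged VBZ rather than NNS: the strong P(VBZ|NN) = 0.7 transition outweighs the higher emission P(runs|NNS) = 0.63.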

  29. Two ways of reasoning • Forward approach (Backward reasoning) • Compute the best way to get from a state to the goal state. • Backward approach (Forward reasoning) • Compute the best way from the source state to get to a state. • A combination of these two approaches is used in unsupervised training of HMMs. • Forward-backward algorithm (Appendix D)
