Understanding the Probabilistic Spell Checker Based on Noisy Channel Model
This lecture note outlines the principles of a probabilistic spell checker utilizing the Noisy Channel Model. It explores the formulation of the problem, where given a potentially misspelled word (t), the goal is to find the most probable correct word (w) that maximizes P(w|t). Key concepts include the application of Bayes' rule, the construction of a confusion matrix for capturing error patterns, and the computation of probabilities for insertion, deletion, substitution, and transposition errors. Understanding these elements is crucial for developing robust spelling correction systems.
CS621/CS449 Artificial Intelligence, Lecture Notes Set 8: 27/10/2004
Outline
• Probabilistic Spell Checker (continued from the Noisy Channel Model)
• Confusion Matrix
Probabilistic Spell Checker: Noisy Channel Model
• The spell-checker problem is formulated with the Noisy Channel Model: the correct word w = (w1, w2, ..., wn) passes through a noisy channel and comes out as the wrongly spelt word t = (t1, t2, ..., tm).
• Given t, find the most probable w: the guess at the correct word is the string ŵ for which P(w|t) is maximum, i.e. ŵ = argmax_w P(w|t).
Probabilistic Spell Checker
• Applying Bayes' rule:
  ŵ = argmax_w P(w|t) = argmax_w P(t|w) P(w) / P(t) = argmax_w P(t|w) P(w),
  since P(t) is the same for every candidate word w.
• Why apply Bayes' rule? Finding P(w|t) vs. P(t|w): estimated directly, either would have to be computed by counting whole-pair occurrences c(w,t) (or c(t,w)) and normalizing; Bayes' rule lets P(t|w) be decomposed into letter-level error probabilities instead.
• Assumptions:
  • t is obtained from w by a single error (insertion, deletion, substitution, or transposition).
  • The words consist only of letters of the alphabet.
Confusion Matrix
• A 26×26 data structure that stores the error counts c(x, y) between letters.
• There are separate matrices for insertion, deletion, substitution, and transposition errors.
• Substitution: the number of instances in which x is wrongly substituted by y in the training corpus (denoted sub(x, y)).
Confusion Matrix
• Insertion: the number of times a letter y is wrongly inserted after x (denoted ins(x, y)).
• Transposition: the number of times xy is wrongly transposed to yx (denoted trans(x, y)).
• Deletion: the number of times y is wrongly deleted after x (denoted del(x, y)).
Confusion Matrix
• If x and y are letters of the alphabet:
  • sub(x, y) = # times y is written for x (substitution)
  • ins(x, y) = # times x is written as xy
  • del(x, y) = # times xy is written as x
  • trans(x, y) = # times xy is written as yx
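The four counters can be filled from a training corpus of (misspelling, correct word) pairs, each differing by exactly one error as the lecture assumes. A minimal sketch (the pair-classification logic and the '#' start-of-word marker are my own illustrative choices, not from the slides):

```python
from collections import Counter

sub = Counter()    # sub[(x, y)]: y written where x was intended
ins = Counter()    # ins[(x, y)]: y wrongly inserted after x
dele = Counter()   # del[(x, y)]; named 'dele' since 'del' is a keyword
trans = Counter()  # trans[(x, y)]: intended xy written as yx

def record_error(t, w):
    """Classify the single error turning correct w into typed t, and count it."""
    if len(t) == len(w):
        # Same length: substitution or adjacent transposition.
        diffs = [i for i, (a, b) in enumerate(zip(w, t)) if a != b]
        if len(diffs) == 1:
            i = diffs[0]
            sub[(w[i], t[i])] += 1
        elif (len(diffs) == 2 and diffs[1] == diffs[0] + 1
              and w[diffs[0]] == t[diffs[1]] and w[diffs[1]] == t[diffs[0]]):
            trans[(w[diffs[0]], w[diffs[1]])] += 1
    elif len(t) == len(w) + 1:
        # t has one extra letter: insertion.
        i = next((k for k in range(len(w)) if t[k] != w[k]), len(w))
        prev = t[i - 1] if i > 0 else '#'   # '#' marks the word start
        ins[(prev, t[i])] += 1
    elif len(t) == len(w) - 1:
        # t is one letter short: deletion.
        i = next((k for k in range(len(t)) if t[k] != w[k]), len(t))
        prev = w[i - 1] if i > 0 else '#'
        dele[(prev, w[i])] += 1

record_error("aple", "apple")  # a 'p' deleted after 'p'
record_error("teh", "the")     # 'he' transposed to 'eh'
```

The x in ins(x, y) and del(x, y) is the letter preceding the error, which is why a context character is recorded even for errors at the start of a word.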
Probabilities
• P(t|w) = P(t|w)S + P(t|w)I + P(t|w)D + P(t|w)X, where
  P(t|w)S = sub(x, y) / count of x
  P(t|w)I = ins(x, y) / count of x
  P(t|w)D = del(x, y) / count of x
  P(t|w)X = trans(x, y) / count of x
  and x, y are the letters involved in the error.
• The four error types are treated as mutually exclusive events; under the single-error assumption, at most one term is non-zero for a given (t, w) pair.
Example
• The correct document contains the ws; the wrong document contains the ts.
• Direct estimate: P(maple|aple) = #(maple was intended when aple was typed) / #(aple).
• P(apple|aple) and P(applet|aple) are calculated similarly.
• Such word-pair counts are tiny, so the direct estimate suffers from data sparsity.
• Hence, apply Bayes' rule and estimate P(t|w) from letter-level confusion counts instead.
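The sparsity problem on this slide can be made concrete with a toy direct estimate (all counts below are invented for illustration):

```python
# Direct pair-level MLE of P(w|t): #(w intended when t was typed) / #(t typed).
pair_count = {("aple", "apple"): 3, ("aple", "maple"): 1}  # hypothetical counts
t_count = {"aple": 4}

def p_w_given_t(w, t):
    """Direct estimate of P(w|t) from whole-pair counts."""
    return pair_count.get((t, w), 0) / t_count[t]

# Any (t, w) pair never seen in the corpus gets probability exactly zero,
# e.g. p_w_given_t("applet", "aple"), even though 'applet' is a plausible
# correction. Decomposing via Bayes into P(t|w) (letter-level confusion
# counts) and P(w) (unigram counts) sidesteps this sparsity.
```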