
Probabilistic Detection of Context-Sensitive Spelling Errors

Johnny Bigert, Royal Institute of Technology, Sweden, johnny@kth.se


Presentation Transcript


  1. Probabilistic Detection of Context-Sensitive Spelling Errors Johnny Bigert Royal Institute of Technology, Sweden johnny@kth.se

  2. What? Context-Sensitive Spelling Errors • Example: Nice whether today. • All words are found in the dictionary • If context is considered, the spelling of "whether" is incorrect

  3. Why? Why do we need detection of context-sensitive spelling errors? • These errors are quite frequent (reported as 16-40% of all errors) • Larger dictionaries leave more errors undetected • They cannot be found by regular spell checkers!

  4. Why not? What about proposing corrections for the errors? • An interesting topic, but not the topic of this article • Detection is imperative, correction is an aid

  5. Related work? Are there no algorithms doing this already? • A full parser is perfect for the job Drawbacks: • high accuracy is required • not available for many languages • manual labor is expensive • not robust

  6. Related work? Are there no other algorithms? • Several other algorithms (e.g. Winnow) • Some do correction Drawbacks: • They require a set of easily confused words • Normally, you don’t know your spelling errors beforehand

  7. Why? What are the benefits of this algorithm? • Find any error • Avoid extensive manual work • Robustness

  8. How? Prerequisites • We use PoS tag trigram frequencies from an annotated corpus • We are given a sentence, and apply a PoS tagger

  9. How? Basic assumption • If any tag trigram frequency is low, that part is probably ungrammatical
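
The basic assumption can be sketched in a few lines of Python. This is a minimal illustration, not the author's implementation: the toy tag sequences, the tag names, the function names `count_trigrams` and `flag_rare_trigrams`, and the threshold value are all assumptions made for the example.

```python
from collections import Counter

def count_trigrams(tagged_corpus):
    """Count PoS tag trigrams over a corpus given as lists of tags."""
    counts = Counter()
    for tags in tagged_corpus:
        for i in range(len(tags) - 2):
            counts[tuple(tags[i:i + 3])] += 1
    return counts

def flag_rare_trigrams(tags, counts, threshold):
    """Return start positions of trigrams whose corpus count is below threshold."""
    return [i for i in range(len(tags) - 2)
            if counts[tuple(tags[i:i + 3])] < threshold]

# Hypothetical toy corpus of tag sequences (tag names are illustrative only).
corpus = [
    ["ADJ", "NOUN", "ADV"],
    ["ADJ", "NOUN", "ADV"],
    ["DET", "ADJ", "NOUN", "VERB"],
]
counts = count_trigrams(corpus)

# "Nice whether today": suppose the tagger reads "whether" as a conjunction.
suspect = flag_rare_trigrams(["ADJ", "CONJ", "ADV"], counts, threshold=1)
# "Nice weather today": the intended word yields a trigram seen in the corpus.
correct = flag_rare_trigrams(["ADJ", "NOUN", "ADV"], counts, threshold=1)
```

With raw counts like these, any trigram missing from the corpus is flagged, which is why the following slides replace the raw count with a generalized frequency.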

  10. But? But don't you often encounter rare or unseen trigrams? • Yes, unfortunately • We modify the notion of frequency • Find and use other, "syntactically close" PoS trigrams

  11. Close? What is the syntactic distance between two PoS tags? • A probability that one tag is replaceable by another • Retain grammaticality • Distances extracted from corpus • Unsupervised learning algorithm

  12. Then? The algorithm • We have a generalized PoS tag trigram frequency • If the frequency is below a threshold, the text is probably ungrammatical
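
One way to picture a generalized trigram frequency is sketched below. The slides do not give the exact formula, so the weighting here is an assumption: each substitute trigram contributes its corpus count multiplied by the product of the per-position replacement probabilities. The function name `generalized_frequency`, the counts, the probabilities and the threshold are all hypothetical.

```python
from itertools import product

def generalized_frequency(trigram, counts, replace_prob):
    """Sum corpus counts of nearby trigrams, weighted by replaceability.

    ASSUMPTION: the weight of a substitute trigram is the product of the
    per-position probabilities; the slides do not state the exact formula.
    """
    # Candidate substitutes per position: the tag itself (weight 1.0) plus
    # any tags listed as possible replacements for it.
    options = []
    for tag in trigram:
        candidates = {tag: 1.0}
        candidates.update(replace_prob.get(tag, {}))
        options.append(list(candidates.items()))

    total = 0.0
    for combo in product(*options):
        substitute = tuple(tag for tag, _ in combo)
        weight = 1.0
        for _, p in combo:
            weight *= p
        total += weight * counts.get(substitute, 0)
    return total

# Hypothetical numbers for illustration only.
counts = {("DET", "NOUN", "VERB"): 40, ("DET", "ADJ", "NOUN"): 25}
replace_prob = {"PROPN": {"NOUN": 0.5}}  # P(PROPN is replaceable by NOUN)

unseen = ("DET", "PROPN", "VERB")        # never observed in the toy counts
score = generalized_frequency(unseen, counts, replace_prob)
is_suspect = score < 5.0                 # threshold is an assumed value
```

In this toy setup the unseen trigram borrows half of the count of its close neighbour, so it is no longer flagged merely because it is absent from the corpus.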

  13. Result? Summary so far • Unsupervised learning • Automatic algorithm • Detection of any error • No manual labor! • Alas, phrase boundaries cause problems

  14. Phrases? What about phrases? • PoS tag trigrams overlapping two phrases are very productive • Rare phrases, rare trigrams • Transformations!

  15. Transform? How do we transform a phrase? • Shallow parser • Transform phrases to their most common form • Normally, the head • Benefits: retain grammaticality, fewer rare trigrams, longer tagger scope

  16. Example? Example of phrase transformation • Only [the paintings that are old]NP are for sale • Only [the paintings]NP are for sale

  17. Then what? How do we use the transformations? • Apply tagger to transformed sentence • Run first part of algorithm again • If any transformation yields only trigrams with high frequency, the sentence is ok • Otherwise, probable error
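
The decision rule on this slide can be sketched as follows. This is an illustration under stated assumptions: the shallow parser and the tagger are not shown, the tag sequences and scores are invented, the name `sentence_is_ok` is hypothetical, and treating the untransformed sentence as one of the variants is an assumption.

```python
def sentence_is_ok(variants, trigram_score, threshold):
    """Accept the sentence if at least one variant has no low-scoring trigram.

    `variants` holds tag sequences: the original sentence (an assumption)
    plus each phrase-transformed version after re-tagging.
    """
    for tags in variants:
        trigrams = [tuple(tags[i:i + 3]) for i in range(len(tags) - 2)]
        if all(trigram_score(t) >= threshold for t in trigrams):
            return True
    return False

# Hypothetical trigram scores; unlisted trigrams score 0.
scores = {
    ("ADV", "DET", "NOUN"): 30,
    ("DET", "NOUN", "VERB"): 40,
    ("NOUN", "VERB", "PREP"): 20,
    ("VERB", "PREP", "NOUN"): 25,
}

def score(trigram):
    return scores.get(trigram, 0)

# "Only the paintings that are old are for sale" (illustrative tags).
original = ["ADV", "DET", "NOUN", "PRON", "VERB", "ADJ", "VERB", "PREP", "NOUN"]
# "Only the paintings are for sale" after reducing the noun phrase to its head.
transformed = ["ADV", "DET", "NOUN", "VERB", "PREP", "NOUN"]

ok_with_transform = sentence_is_ok([original, transformed], score, threshold=5)
ok_without = sentence_is_ok([original], score, threshold=5)
```

Here the long noun phrase produces rare trigrams in the original sentence, while the transformed sentence contains only frequent ones, so the sentence is accepted.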

  18. Result? Summary • Trigram part, fully automatic • Phrase part, could use machine learning of rules for shallow parser • Finds many difficult error types • Threshold determines precision/recall trade-off

  19. Evaluation? Fully automatic evaluation • Introduce artificial context-sensitive spelling errors (using the software Missplel) • Automated evaluation procedure for 1, 2, 5, 10 and 20% misspelled words (using the software AutoEval)
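
Because the errors are inserted artificially, their positions are known, and detections can be scored without manual annotation. The sketch below shows that scoring step only. It does not use or imitate Missplel or AutoEval; the function name `precision_recall`, the word positions, and the convention of returning 0.0 for an empty set are assumptions.

```python
def precision_recall(injected, detected):
    """Score detected error positions against the known injected positions."""
    injected, detected = set(injected), set(detected)
    true_positives = len(injected & detected)
    # Convention chosen here: report 0.0 when a ratio is undefined.
    precision = true_positives / len(detected) if detected else 0.0
    recall = true_positives / len(injected) if injected else 0.0
    return precision, recall

# Hypothetical word positions in a test text.
injected_errors = {3, 17, 42}        # where artificial errors were inserted
flagged_by_detector = {3, 42, 50, 61}  # where the detector raised an alarm

precision, recall = precision_recall(injected_errors, flagged_by_detector)
```

Repeating this measurement for several threshold values would trace out the precision/recall trade-off mentioned in the summary slide.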

  20. Results? 1% errors

  21. Results? 2% errors
