Globalisation & Computer systems

benard

Presentation Transcript


  1. Globalisation & Computer systems Week 8 • Finish Text processes part 1 • Searching strings and regular expressions • Practical: regular expressions in UNIX • Text processes part 2 • Spell checkers

  2. Searching Text elements • The objects of a text • Depends on perspective • Different text processes operate over different objects

  3. Regular Expressions • Basis of all web-based and word-processor-based searches

  4. Regular Expressions • Basis of all web-based and word-processor-based searches • Definition 1. An algebraic notation for describing a string

  5. Regular Expressions • Basis of all web-based and word-processor-based searches • Definition 1. An algebraic notation for describing a string • Definition 2. A set of rules that you can use to specify one or more items, such as words in a file, by using a single character string (Sarwar et al.)

  6. Regular Expressions • regular expression, text corpus • regular expression algebra has variants: Perl, Unix tools • Unix tools: egrep, sed, awk

  7. Regular Expressions • Find occurrences of /Nokia/ in the text

  8. Regular Expressions • Find occurrences of /Nokia/ in the text egrep -n 'Nokia' nokia_corpus.txt

  9. Regular Expressions egrep -n 'Nokia' nokia_corpus.txt

  10. Regular Expressions • Suppress case distinctions • Nokia or nokia

  11. Regular Expressions • set operator egrep -n '[Nn]okia' nokia_corpus.txt
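
The set operator can be tried outside UNIX as well. The following is a minimal sketch in Python, whose `re` module accepts the same `[Nn]okia` syntax; the helper name `grep_n` and the three sample lines are assumptions made for illustration and are not taken from nokia_corpus.txt.

```python
import re

# Hypothetical sample lines standing in for nokia_corpus.txt.
sample = [
    "Nokia shares fell on Tuesday.",
    "Analysts said nokia would recover.",
    "The handset market is slowing.",
]

def grep_n(pattern, lines):
    """Return (line_number, line) pairs for matching lines, roughly like egrep -n."""
    regex = re.compile(pattern)
    return [(n, line) for n, line in enumerate(lines, start=1) if regex.search(line)]

literal_hits = grep_n(r"Nokia", sample)    # capitalised form only
set_hits = grep_n(r"[Nn]okia", sample)     # either N or n in first position

print(literal_hits)
print(set_hits)
```

With these sample lines, the literal pattern finds only line 1, while the set pattern finds lines 1 and 2.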

  12. Regular Expressions • Suppress other features, for example singular share or plural shares

  13. Regular Expressions • optional operator egrep -n 'shares?' nokia_corpus.txt

  14. Regular Expressions egrep -n 'shares?' nokia_corpus.txt
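
As a rough illustration of the optional operator, the sketch below applies `shares?` to one invented sentence using Python's `re` module. It also shows a side effect worth knowing: the pattern matches inside longer words. The `\b` word-boundary variant is an addition here, not part of the lecture.

```python
import re

# Invented sentence, used only to show what the pattern picks up.
text = "One share rose while other shares fell; shareholders were unmoved."

# 's?' makes the final s optional, so both singular and plural match.
loose = re.findall(r"shares?", text)

# Adding word boundaries (an assumption beyond the slides) excludes "shareholders".
whole_words = re.findall(r"\bshares?\b", text)

print(loose)
print(whole_words)
```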

  15. Regular Expressions • Kleene operators: • /string*/ “zero or more occurrences of previous character” • /string+/ “1 or more occurrences of previous character”

  16. Regular Expressions • Wildcard operator: • /string./ “any character after the previous character”

  17. Regular Expressions • Wildcard operator: • /string./ “any character after the previous character” • Combine wildcard and kleene: • /string.*/ “zero or more instances of any character after the previous character” • /string.+/ “one or more instances of any character after the previous character”
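
The difference between `*`, `+` and `.` is easiest to see on small inputs. Below is a minimal sketch using Python's `re` module; the test strings are invented for illustration.

```python
import re

# Kleene star: zero or more of the previous character (here 'a').
star = re.findall(r"ba*t", "bt bat baat")

# Kleene plus: one or more of the previous character.
plus = re.findall(r"ba+t", "bt bat baat")

line = "Quarterly profits fell, but profit warnings eased."

# Wildcard: exactly one further character after "profit".
one_more = re.findall(r"profit.", line)

# Wildcard plus star: everything from the first "profit" to the end of the line.
rest_of_line = re.search(r"profit.*", line).group()

print(star, plus, one_more, rest_of_line, sep="\n")
```

Because `.*` takes as much as it can, `profit.*` always runs to the end of the line. With `egrep -n` that makes no difference to which lines are reported, only to how much of each line counts as the match.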

  18. Regular Expressions egrep -n 'profit.*' nokia_corpus.txt

  19. Regular Expressions • Anchors • Beginning of line operator: ^ egrep '^said' nokia_corpus.txt • End of line operator: $ egrep 'said$' nokia_corpus.txt
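
A small sketch of the two anchors, using Python's `re` module on three invented lines. Note that `^` is written before the text it anchors and `$` after it.

```python
import re

# Invented lines; each is treated as one line of a corpus.
lines = ["said the chairman", "the chairman said", "he said nothing"]

# ^ anchors the match to the beginning of the line.
starts_with_said = [l for l in lines if re.search(r"^said", l)]

# $ anchors the match to the end of the line.
ends_with_said = [l for l in lines if re.search(r"said$", l)]

print(starts_with_said)
print(ends_with_said)
```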

  20. Regular Expressions • Disjunction: • set operator /[Ss]tring/ “a string which begins with either S or s” • Range /[A-Z]tring/ “a string beginning with a capital letter” • pipe | /string1|string2/ “either string 1 or string 2”

  21. Regular Expressions • Disjunction egrep -n 'weak|warning|drop' nokia_corpus.txt egrep -n 'weak.*|warn.*|drop.*' nokia_corpus.txt

  22. Regular Expressions • Negation: /[^a-z]tring/ “any string that does not begin with a lowercase letter”
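
The set, range, pipe and negation operators can be compared side by side. This is a minimal sketch with Python's `re` module; the word list and the sentence are invented, and `re.fullmatch` is used so that each pattern has to cover the whole word.

```python
import re

words = ["String", "string", "5tring", "_tring"]

# Set: first character is S or s.
set_op = [w for w in words if re.fullmatch(r"[Ss]tring", w)]

# Range: first character is any capital letter.
range_op = [w for w in words if re.fullmatch(r"[A-Z]tring", w)]

# Negation: first character is anything except a lowercase letter.
negated = [w for w in words if re.fullmatch(r"[^a-z]tring", w)]

# Pipe: any one of several alternative strings.
sentence = "Nokia issued a warning after weak sales and a drop in demand."
pipe = re.findall(r"weak|warning|drop", sentence)

print(set_op, range_op, negated, pipe, sep="\n")
```

The negated set still requires some character in first position, so it accepts digits and punctuation as well as capital letters.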

  23. Regular Expressions • Precedence (highest to lowest) • Parentheses ( ) • Kleene and optional operators * + ? • Sequences and anchors • Disjunction operator | • /supply|iers/

  24. Regular Expressions • Precedence (highest to lowest) • Parentheses ( ) • Kleene and optional operators * + ? • Sequences and anchors • Disjunction operator | • /supply|iers/ matches /supply/ or /iers/

  25. Regular Expressions • Precedence (highest to lowest) • Parentheses ( ) • Kleene and optional operators * + ? • Sequences and anchors • Disjunction operator | • /supply|iers/ matches /supply/ or /iers/ • /suppl(y|iers)/

  26. Regular Expressions • Precedence (highest to lowest) • Parentheses ( ) • Kleene and optional operators * + ? • Sequences and anchors • Disjunction operator | • /supply|iers/ matches /supply/ or /iers/ • /suppl(y|iers)/ matches /supply/ or /suppliers/
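
The effect of precedence can be checked directly. In the sketch below (Python's `re` module, invented word list), the ungrouped pattern also accepts any word that merely contains "iers", whereas the parenthesised pattern does not.

```python
import re

words = ["supply", "suppliers", "carriers", "supplier"]

# Without parentheses the pipe splits the whole pattern: "supply" OR "iers".
ungrouped = [w for w in words if re.search(r"supply|iers", w)]

# With parentheses the pipe only chooses the ending after "suppl".
grouped = [w for w in words if re.search(r"suppl(y|iers)", w)]

print(ungrouped)
print(grouped)
```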

  27. Spelling dictionaries • aim? • given a sequence of symbols: • 1. identify misspelled strings • 2. generate a list of possible ‘candidate’ correct strings • 3. select most probable candidate from the list

  28. Spelling dictionaries • Implementation: • Probabilistic framework • Bayes' rule • noisy channel model

  29. Spelling dictionaries • Types of spelling error • actual word errors • non-word errors

  30. Spelling dictionaries • Types of spelling error • actual word errors • /piece/ instead of /peace/ • /there/ instead of /their/ • non-word errors

  31. Spelling dictionaries • Types of spelling error • actual word errors • /piece/ instead of /peace/ • /there/ instead of /their/ • non-word errors • /graffe/ instead of /giraffe/

  32. Spelling dictionaries • Types of spelling error • actual word errors • /piece/ instead of /peace/ • /there/ instead of /their/ • non-word errors • /graffe/ instead of /giraffe/ • of all errors in typewritten texts, 80% are non-word errors

  33. Spelling dictionaries • non-word errors • Cognitive errors • /seperate/ instead of /separate/ • phonetically equivalent sequence of symbols has been substituted • due to lack of knowledge about spelling conventions

  34. Spelling dictionaries • non-word errors • Cognitive errors • Typographic (‘typo’) errors • influenced by keyboard • e.g. substitution of /w/ for /e/ due to its adjacency on the keyboard • /thw/ instead of /the/

  35. Spelling dictionaries • non-word errors • noisy channel model • The actual word has been passed through a noisy communication channel • This has distorted the word, thereby changing it in some way • The misspelled word is the distorted version of the actual word • Aim: recover the actual word by hypothesising about the possible ways in which it could have been distorted

  36. Spelling dictionaries • non-word errors • noisy channel model • What are the possible distortions? • insertion • deletion • substitution • transposition • all of these viewed as transformations that take place in the noisy channel
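
The four distortions can be sketched as a function that applies every possible single transformation to a string. This is one common way of writing it, not necessarily the one used in the course; the function name `single_edits` and the restriction to lowercase a-z are assumptions.

```python
import string

def single_edits(word, alphabet=string.ascii_lowercase):
    """Return every string one insertion, deletion, substitution or transposition away."""
    # All ways of cutting the word into a left part and a right part.
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletions = {a + b[1:] for a, b in splits if b}
    transpositions = {a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1}
    substitutions = {a + c + b[1:] for a, b in splits if b for c in alphabet}
    insertions = {a + c + b for a, b in splits for c in alphabet}
    return deletions | transpositions | substitutions | insertions

print("the" in single_edits("thw"))         # substitution of e for w
print("giraffe" in single_edits("graffe"))  # insertion of i
```

Most of the strings produced are not words at all, which is why the result has to be filtered against a dictionary before candidates are ranked.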

  37. Spelling dictionaries • Implementing spelling identification and correction algorithm

  38. Spelling dictionaries • Implementing spelling identification and correction algorithm • STAGE 1: compare each string in document with a list of legal strings; if no corresponding string in list mark as misspelled • STAGE 2: generate list of candidates • Apply any single transformation to the typo string • Filter the list by checking against a dictionary • STAGE 3: assign probability values to each candidate in the list • STAGE 4: select best candidate
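
STAGE 1 amounts to a dictionary lookup. The sketch below covers that stage only; the four-word `lexicon`, the function name `find_nonwords` and the simple lowercase tokenisation are assumptions made for illustration.

```python
import re

# Hypothetical list of legal strings; a real system would use a full word list.
lexicon = {"the", "giraffe", "ate", "leaves"}

def find_nonwords(text, legal_strings):
    """Return the tokens of text that do not appear in the list of legal strings."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in legal_strings]

misspelled = find_nonwords("Thw graffe ate the leaves.", lexicon)
print(misspelled)
```

A lookup of this kind can only flag non-word errors. An actual word error such as /piece/ for /peace/ passes the check because the typed string is itself a legal string.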

  39. Spelling dictionaries • STAGE 3 • prior probability • given all the words in English, is this candidate more likely to be what the typist meant than that candidate? • P(c) = count(c)/N, where count(c) is the number of times candidate c occurs in a corpus and N is the total number of words in that corpus • likelihood • Given the possible errors, or transformations, how likely is it that error y has operated on candidate x to produce the typo? • P(t|c), calculated using a corpus of errors, or transformations • Bayes' rule: • get the product of the prior probability and the likelihood • P(c) × P(t|c)
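
The ranking step can be illustrated with a small worked calculation for the typo /thw/. All numbers below are made up for the example: the corpus counts, the corpus size and the likelihood values are assumptions, not figures from a real corpus or error model.

```python
# Hypothetical corpus counts for three candidate corrections of "thw".
counts = {"the": 60000, "thaw": 20, "two": 1500}
corpus_size = 1_000_000  # hypothetical N, total words in the corpus

# Hypothetical likelihoods P(t|c): chance that the channel turns c into "thw".
likelihood = {"the": 0.001, "thaw": 0.0005, "two": 0.0002}

def score(candidate):
    """Product of prior and likelihood: P(c) * P(t|c)."""
    prior = counts[candidate] / corpus_size
    return prior * likelihood[candidate]

# Rank candidates from most to least probable and pick the top one.
ranked = sorted(counts, key=score, reverse=True)
best = ranked[0]

print(ranked)
print(best)
```

Under these invented figures /the/ wins, mainly because its prior is so much larger than that of the other candidates.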

  40. Spelling dictionaries • non-word errors • Implementing spelling identification and correction algorithm • STAGE 1: identify misspelled words • STAGE 2: generate list of candidates • STAGE 3a: rank candidates by probability • STAGE 3b: select best candidate • Implement: • noisy channel model • Bayes' rule

  41. Next week Resources for globalisation • Machine translation • Translation memory
