Globalisation & Computer systems

Globalisation & Computer systems Week 8 • Finish Text processes part 1 • Searching strings and regular expressions • Practical: regular expressions in UNIX • Text processes part 2 • Spell checkers

Searching Text elements • The objects of a text • Depends on perspective • Different text processes operate over different objects

Regular Expressions • Basis of all web-based and word-processor-based searches • Definition 1. An algebraic notation for describing a string • Definition 2. A set of rules that you can use to specify one or more items, such as words in a file, by using a single character string (Sarwar et al.)

Regular Expressions • regular expression, text corpus • regular expression algebra has variants: Perl, Unix tools • Unix tools: egrep, sed, awk

Regular Expressions • Find occurrences of /Nokia/ in the text: egrep -n 'Nokia' nokia_corpus.txt

Regular Expressions • Suppress case distinctions • Nokia or nokia

Regular Expressions • set operator: egrep -n '[Nn]okia' nokia_corpus.txt

Regular Expressions • Suppress other features, for example singular share or plural shares

Regular Expressions • optional operator: egrep -n 'shares?' nokia_corpus.txt
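The literal, set and optional operators introduced so far can be tried outside the shell as well. The following is a minimal sketch in Python, whose `re` module supports the same three operators; the sample lines and the helper `grep_n` are hypothetical stand-ins for nokia_corpus.txt and `egrep -n`, not part of the course material.

```python
import re

# Hypothetical sample lines standing in for nokia_corpus.txt.
lines = [
    "Nokia said its shares fell.",
    "Analysts expect nokia to recover.",
    "Each share lost two percent.",
]

def grep_n(pattern, lines):
    """Return (line_number, line) pairs for lines containing a match, like egrep -n."""
    return [(i, line) for i, line in enumerate(lines, start=1)
            if re.search(pattern, line)]

literal = grep_n(r"Nokia", lines)          # exact string only
either_case = grep_n(r"[Nn]okia", lines)   # set operator: N or n
optional_s = grep_n(r"shares?", lines)     # optional operator: share or shares

for number, line in either_case:
    print(f"{number}:{line}")
```

With these sample lines, the literal pattern finds only line 1, the set operator finds lines 1 and 2, and the optional operator finds lines 1 and 3.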

Regular Expressions • Kleene operators: • /string*/ “zero or more occurrences of previous character” • /string+/ “1 or more occurrences of previous character”

Regular Expressions • Wildcard operator: • /string./ “any character after the previous character” • Combine wildcard and Kleene: • /string.*/ “zero or more instances of any character after the previous character” • /string.+/ “one or more instances of any character after the previous character”

Regular Expressions egrep -n 'profit.*' nokia_corpus.txt
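To see what the Kleene and wildcard operators actually consume, it helps to print the matched text rather than the whole line. A small sketch in Python follows, assuming a made-up sample sentence; note that `.*` is greedy and runs to the end of the line.

```python
import re

# Hypothetical sample line, not taken from the corpus.
line = "Net profits rose while the profit margin narrowed."

# /profits*/ : "profit" followed by zero or more "s"
star = re.findall(r"profits*", line)       # ['profits', 'profit']

# /profits+/ : "profit" followed by one or more "s"
plus = re.findall(r"profits+", line)       # ['profits']

# /profit./ : "profit" plus exactly one further character of any kind
wildcard = re.findall(r"profit.", line)    # ['profits', 'profit ']

# /profit.*/ : "profit" plus everything up to the end of the line
greedy = re.search(r"profit.*", line).group()

print(star, plus, wildcard)
print(greedy)
```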

Regular Expressions • Anchors • Beginning of line operator: ^ egrep '^said' nokia_corpus.txt • End of line operator: $ egrep 'said$' nokia_corpus.txt
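The position of each anchor matters: `^` goes before the string and `$` goes after it. A short Python sketch, using hypothetical sample lines, shows the difference.

```python
import re

# Hypothetical sample lines.
lines = [
    "said the chairman on Monday",
    "Profits will fall, he said",
    "Nokia said nothing",
]

# ^ anchors the match to the beginning of the line.
starts_with_said = [l for l in lines if re.search(r"^said", l)]

# $ anchors the match to the end of the line, so it follows the string.
ends_with_said = [l for l in lines if re.search(r"said$", l)]

# With $ placed first, nothing can match: no text can follow the end of a line.
misplaced_anchor = [l for l in lines if re.search(r"$said", l)]

print(starts_with_said)
print(ends_with_said)
print(misplaced_anchor)
```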

Regular Expressions • Disjunction: • set operator /[Ss]tring/ “a string which begins with either S or s” • Range /[A-Z]tring/ “a string beginning with a capital letter” • pipe | /string1|string2/ “either string 1 or string 2”

Regular Expressions • Disjunction • egrep -n 'weak|warning|drop' nokia_corpus.txt • egrep -n 'weak.*|warn.*|drop.*' nokia_corpus.txt

Regular Expressions • Negation: /[^a-z]tring/ “any string that does not begin with a lowercase letter”
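The set, range, pipe and negation operators can be compared side by side on a handful of test strings. The sketch below uses Python's `re` module; the word list and sample sentence are invented for illustration.

```python
import re

# Hypothetical test strings.
words = ["String", "string", "Xtring", "9tring"]

# Set operator: first character is S or s.
set_matches = [w for w in words if re.fullmatch(r"[Ss]tring", w)]

# Range: first character is any capital letter.
range_matches = [w for w in words if re.fullmatch(r"[A-Z]tring", w)]

# Negation: first character is anything except a lowercase letter.
negated_matches = [w for w in words if re.fullmatch(r"[^a-z]tring", w)]

# Pipe: any one of the alternative strings.
line = "Nokia issued a profit warning after weak sales"
pipe_matches = re.findall(r"weak|warning|drop", line)

print(set_matches, range_matches, negated_matches, pipe_matches)
```

Note that the negated set also accepts digits and punctuation, not only capital letters, which is why "9tring" is matched.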

Regular Expressions • Precedence (highest to lowest) • Parentheses ( ) • Kleene and optional operators * + ? • Anchors and sequences • Disjunction operator | • /supply|iers/ matches /supply/ or /iers/ • /suppl(y|iers)/ matches /supply/ or /suppliers/
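Because disjunction has the lowest precedence, the pipe splits the whole pattern unless parentheses limit its scope. A brief Python sketch on a made-up phrase illustrates the two readings.

```python
import re

# Hypothetical sample text.
text = "our suppliers supply parts"

# Without parentheses the alternatives are the whole strings "supply" and "iers".
ungrouped = [m.group() for m in re.finditer(r"supply|iers", text)]

# With parentheses the choice applies only to the ending after "suppl".
grouped = [m.group() for m in re.finditer(r"suppl(y|iers)", text)]

print(ungrouped)  # ['iers', 'supply']
print(grouped)    # ['suppliers', 'supply']
```

The ungrouped pattern finds only the fragment "iers" inside "suppliers", which is rarely what the searcher intended.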

Spelling dictionaries • aim? • given a sequence of symbols: • 1. identify misspelled strings • 2. generate a list of possible ‘candidate’ correct strings • 3. select most probable candidate from the list

Spelling dictionaries • Implementation: • Probabilistic framework • Bayesian rule • noisy channel model

Spelling dictionaries • Types of spelling error • actual word errors • /piece/ instead of /peace/ • /there/ instead of /their/ • non-word errors • /graffe/ instead of /giraffe/ • Of all errors in typewritten texts, 80% are non-word errors

Spelling dictionaries • non-word errors • Cognitive errors • /seperate/ instead of /separate/ • phonetically equivalent sequence of symbols has been substituted • due to lack of knowledge about spelling conventions

Spelling dictionaries • non-word errors • Cognitive errors • Typographic (‘typo’) errors • influenced by keyboard • e.g. substitution of /w/ for /e/ due to its adjacency on the keyboard • /thw/ instead of /the/

Spelling dictionaries • non-word errors • noisy channel model • The actual word has been passed through a noisy communication channel • This has distorted the word, thereby changing it in some way • The misspelled word is the distorted version of the actual word • Aim: recover the actual word by hypothesising about the possible ways in which it could have been distorted

Spelling dictionaries • non-word errors • noisy channel model • What are the possible distortions? • insertion • deletion • substitution • transposition • all of these viewed as transformations that take place in the noisy channel
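The four distortions can be sketched as string operations. The function below is one common way to do this (it follows the widely used single-edit approach, not necessarily the method intended in the lecture); the name `single_edits` and the lowercase-only alphabet are assumptions. Applying all four operations to the typo also covers their inverses: an insertion undoes a deletion made in the channel, and vice versa.

```python
import string

def single_edits(word, alphabet=string.ascii_lowercase):
    """Return every string that is one insertion, deletion, substitution
    or transposition away from `word` (hypothetical helper)."""
    # Every way of cutting the word into a left and a right part.
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]

    deletions = {left + right[1:] for left, right in splits if right}
    transpositions = {left + right[1] + right[0] + right[2:]
                      for left, right in splits if len(right) > 1}
    substitutions = {left + ch + right[1:]
                     for left, right in splits if right for ch in alphabet}
    insertions = {left + ch + right
                  for left, right in splits for ch in alphabet}

    return (deletions | transpositions | substitutions | insertions) - {word}

print("giraffe" in single_edits("graffe"))  # insertion restores the missing i
print("the" in single_edits("thw"))         # substitution of e for w
print("the" in single_edits("teh"))         # transposition of e and h
```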

Spelling dictionaries • Implementing spelling identification and correction algorithm • STAGE 1: compare each string in the document with a list of legal strings; if there is no corresponding string in the list, mark it as misspelled • STAGE 2: generate list of candidates • Apply any single transformation to the typo string • Filter the list by checking against a dictionary • STAGE 3: assign probability values to each candidate in the list • STAGE 4: select best candidate
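Stage 1 and the filtering step of Stage 2 both reduce to lookups in a word list. The sketch below assumes a tiny hypothetical list `LEGAL_WORDS`, a simple lowercase tokeniser, and a set of raw candidates supplied by hand; a real system would use a full dictionary and generate the raw candidates automatically.

```python
import re

# Hypothetical list of legal strings; a real dictionary would be far larger.
LEGAL_WORDS = {"the", "giraffe", "ate", "leaves", "a"}

def find_misspellings(text, legal_words):
    """STAGE 1: return the tokens that do not appear in the list of legal strings."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in legal_words]

def filter_candidates(raw_candidates, legal_words):
    """STAGE 2 (filter): keep only candidates that are themselves legal strings."""
    return sorted(set(raw_candidates) & legal_words)

flagged = find_misspellings("Thw graffe ate the leaves", LEGAL_WORDS)
kept = filter_candidates({"giraffe", "graffes", "gaffe", "xraffe"}, LEGAL_WORDS)

print(flagged)  # ['thw', 'graffe']
print(kept)     # ['giraffe']
```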

Spelling dictionaries • STAGE 3 • prior probability • given all the words in English, is this candidate more likely to be what the typist meant than that candidate? • P(c) = count(c)/N, where count(c) is the number of times candidate c occurs in a corpus of N words • likelihood • given the possible errors, or transformations, how likely is it that a particular error has operated on candidate c to produce the typo t? • P(t|c), calculated using a corpus of errors, or transformations • Bayesian rule: • get the product of the prior probability and the likelihood • P(c) × P(t|c)
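The scoring step can be illustrated with a toy calculation for the typo "thw". All numbers below are invented for illustration: the corpus counts, the corpus size and the channel likelihoods do not come from any real corpus or error collection.

```python
# Hypothetical counts of each candidate in a corpus of CORPUS_SIZE words.
CORPUS_SIZE = 10_000
corpus_counts = {"the": 600, "thaw": 1, "tow": 3}

# Hypothetical P(t|c): probability that the channel turns candidate c into "thw".
likelihood = {"the": 0.002, "thaw": 0.001, "tow": 0.0005}

def prior(candidate):
    """P(c): relative frequency of the candidate in the corpus."""
    return corpus_counts[candidate] / CORPUS_SIZE

def score(candidate):
    """P(c) x P(t|c): the quantity used to rank the candidates."""
    return prior(candidate) * likelihood[candidate]

# Rank from most to least probable, then pick the best candidate.
ranked = sorted(corpus_counts, key=score, reverse=True)
best = ranked[0]

for candidate in ranked:
    print(f"{candidate}: {score(candidate):.2e}")
print("best candidate:", best)
```

With these made-up figures "the" wins because it is both frequent and reachable by a plausible keyboard slip; a rare word needs a much higher likelihood to outrank it.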

Spelling dictionaries • non-word errors • Implementing spelling identification and correction algorithm • STAGE 1: identify misspelled words • STAGE 2: generate list of candidates • STAGE 3a: rank candidates by probability • STAGE 3b: select best candidate • Implement: • noisy channel model • Bayesian rule

Next week Resources for globalisation • Machine translation • Translation memory
