Explore how regular expressions form the basis of web and word processor searches, with examples of UNIX tools usage. Learn about text processes, symbolic sequences, and important operators like Kleene and disjunction. Discover the implementation of spelling dictionaries and common types of spelling errors. Enhance your understanding of text manipulation in computer systems.
Globalisation & Computer systems Week 8 • Finish Text processes part 1 • Searching strings and regular expressions • Practical: regular expressions in UNIX • Text processes part 2 • Spell checkers
Searching Text elements • The objects of a text • Depends on perspective • Different text processes operate over different objects
Regular Expressions • Basis of all web-based and word-processor-based searches • Definition 1. An algebraic notation for describing a set of strings • Definition 2. A set of rules that you can use to specify one or more items, such as words in a file, by using a single character string (Sarwar et al.)
Regular Expressions • regular expression, text corpus • regular expression algebra has variants: Perl, Unix tools • Unix tools: egrep, sed, awk
Regular Expressions • Find occurrences of /Nokia/ in the text egrep -n 'Nokia' nokia_corpus.txt
Regular Expressions • Suppress case distinctions • Nokia or nokia
Regular Expressions • set operator egrep -n '[Nn]okia' nokia_corpus.txt
Regular Expressions • Suppress other features, for example singular share or plural shares
Regular Expressions • optional operator egrep -n 'shares?' nokia_corpus.txt
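The set and optional operators behave the same way in Python's `re` module, which is another variant of the regular expression algebra. The sketch below imitates `egrep -n` on a few invented lines; the sample text and the helper name `grep_n` are assumptions for illustration, not the contents of nokia_corpus.txt.

```python
import re

# Hypothetical sample lines standing in for nokia_corpus.txt (not the real corpus).
corpus = [
    "Nokia said profits fell.",
    "Analysts expect nokia to recover.",
    "Each share lost value.",
    "The shares dropped again.",
]

def grep_n(pattern, lines):
    """Rough analogue of `egrep -n`: numbers (from 1) of the lines containing a match."""
    return [n for n, line in enumerate(lines, start=1) if re.search(pattern, line)]

nokia_hits = grep_n(r"[Nn]okia", corpus)  # set operator: N or n
share_hits = grep_n(r"shares?", corpus)   # optional operator: the final s may be absent
print(nokia_hits, share_hits)
```

Without the set operator, `grep_n(r"Nokia", corpus)` would miss the lower-case spelling on the second line.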
Regular Expressions • Kleene operators: • /string*/ “zero or more occurrences of previous character” • /string+/ “1 or more occurrences of previous character”
Regular Expressions • Wildcard operator: • /string./ “any character after the previous character” • Combine wildcard and kleene: • /string.*/ “zero or more instances of any character after the previous character” • /string.+/ “one or more instances of any character after the previous character”
Regular Expressions egrep -n 'profit.*' nokia_corpus.txt
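A point that is easy to miss is that `*` and `+` apply only to the single character before them, so /string*/ also matches "strin". The following sketch checks this with Python's `re`, whose syntax for these operators agrees with egrep; the helper `whole` and the test strings are illustrative assumptions.

```python
import re

def whole(pattern, text):
    """True when the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None

# * and + apply only to the character immediately before them (here: g).
print(whole(r"string*", "strin"))     # zero occurrences of g is allowed
print(whole(r"string*", "stringgg"))  # so are three
print(whole(r"string+", "strin"))     # + requires at least one g

# . stands for exactly one character of any kind.
print(whole(r"string.", "strings"))
print(whole(r"string.", "string"))

# .* and .+ differ only in whether nothing at all may follow the prefix.
print(whole(r"string.*", "string"))
print(whole(r"string.+", "string"))

# As with 'profit.*': the match runs on to the end of the line.
tail = re.search(r"profit.*", "Nokia profits fell sharply").group()
print(tail)
```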
Regular Expressions • Anchors • Beginning of line operator: ^ egrep '^said' nokia_corpus.txt • End of line operator: $ egrep 'said$' nokia_corpus.txt
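The position of each anchor matters: `^` goes before the pattern and `$` goes after it. A small sketch in Python's `re`, using invented sample lines, shows both anchors and what happens when `$` is placed in front of the pattern instead.

```python
import re

# Hypothetical sample lines, each treated as one line of a corpus.
lines = ["said the analyst", "the analyst said", "he said it twice"]

starts = [l for l in lines if re.search(r"^said", l)]  # "said" at the start of the line
ends = [l for l in lines if re.search(r"said$", l)]    # "said" at the end of the line
never = [l for l in lines if re.search(r"$said", l)]   # end of line followed by more text

print(starts)
print(ends)
print(never)
```

The last pattern asks for the end of the line and then four more characters, so it cannot match any of these lines.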
Regular Expressions • Disjunction: • set operator /[Ss]tring/ “a string which begins with either S or s” • Range /[A-Z]tring/ “a string beginning with a capital letter” • pipe | /string1|string2/ “either string 1 or string 2”
Regular Expressions • Disjunction egrep -n 'weak|warning|drop' nokia_corpus.txt egrep -n 'weak.*|warn.*|drop.*' nokia_corpus.txt
Regular Expressions • Negation: /[^a-z]tring/ “any string that does not begin with a lower-case letter”
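The set, range, negation and pipe forms can be compared side by side. This is a sketch in Python's `re`, which uses the same notation for these operators; the word list and the sentence are made up for the comparison.

```python
import re

def whole(pattern, text):
    """True when the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None

words = ["String", "string", "9tring", "Xtring"]

set_hits = [w for w in words if whole(r"[Ss]tring", w)]    # S or s
range_hits = [w for w in words if whole(r"[A-Z]tring", w)]  # any capital letter
negated = [w for w in words if whole(r"[^a-z]tring", w)]    # anything except a-z

# The pipe chooses between whole alternatives, found here in left-to-right order.
pipe_hits = re.findall(r"weak|warning|drop", "a profit warning after weak sales")

print(set_hits, range_hits, negated, pipe_hits)
```

Note that the negated set still requires one character in that position, so it accepts a digit such as 9 as well as capital letters.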
Regular Expressions • Precedence, from highest to lowest • Parentheses ( ) • Kleene and optional operators * + ? • Anchors and sequences • Disjunction operator | • /supply|iers/ matches /supply/ or /iers/ • /suppl(y|iers)/ matches /supply/ or /suppliers/
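The effect of precedence on the supply/suppliers example can be checked directly. In this sketch (Python's `re`, which applies the same precedence for these operators) the names `without_parens` and `with_parens` are just labels chosen for the illustration.

```python
import re

def whole(pattern, text):
    """True when the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None

without_parens = r"supply|iers"   # sequence binds tighter than |: "supply" or "iers"
with_parens = r"suppl(y|iers)"    # parentheses limit the choice to the ending

tests = ["supply", "suppliers", "iers"]
results_without = {t: whole(without_parens, t) for t in tests}
results_with = {t: whole(with_parens, t) for t in tests}

print(results_without)
print(results_with)
```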
Spelling dictionaries • aim? • given a sequence of symbols: • 1. identify misspelled strings • 2. generate a list of possible ‘candidate’ correct strings • 3. select most probable candidate from the list
Spelling dictionaries • Implementation: • Probabilistic framework • Bayesian rule • noisy channel model
Spelling dictionaries • Types of spelling error • actual word errors • /piece/ instead of /peace/ • /there/ instead of /their/ • non-word errors • /graffe/ instead of /giraffe/ • of all errors in typewritten texts, 80% are non-word errors
Spelling dictionaries • non-word errors • Cognitive errors • /seperate/ instead of /separate/ • phonetically equivalent sequence of symbols has been substituted • due to lack of knowledge about spelling conventions
Spelling dictionaries • non-word errors • Cognitive errors • Typographic (‘typo’) errors • influenced by keyboard • e.g. substitution of /w/ for /e/ due to its adjacency on the keyboard • /thw/ instead of /the/
Spelling dictionaries • non-word errors • noisy channel model • The actual word has been passed through a noisy communication channel • This has distorted the word, thereby changing it in some way • The misspelled word is the distorted version of the actual word • Aim: recover the actual word by hypothesising about the possible ways in which it could have been distorted
Spelling dictionaries • non-word errors • noisy channel model • What are the possible distortions? • insertion • deletion • substitution • transposition • all of these viewed as transformations that take place in the noisy channel
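To recover the actual word, a checker can apply the same four transformations to the typo and see which results are real words. The sketch below is one possible way to do that, not a definitive implementation: the function name `single_edits`, the lower-case alphabet and the four-word dictionary are assumptions made for the example.

```python
import string

def single_edits(word, alphabet=string.ascii_lowercase):
    """All strings one insertion, deletion, substitution or transposition away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletions = {a + b[1:] for a, b in splits if b}
    transpositions = {a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1}
    substitutions = {a + c + b[1:] for a, b in splits if b for c in alphabet}
    insertions = {a + c + b for a, b in splits for c in alphabet}
    return (deletions | transpositions | substitutions | insertions) - {word}

# Hypothetical miniature dictionary; a real checker would use a full word list.
DICTIONARY = {"the", "thaw", "thy", "two"}

# Keep only the edited strings that are legal words.
candidates = single_edits("thw") & DICTIONARY
print(sorted(candidates))
```

Here "the" and "thy" are reached by substitution and "thaw" by insertion, while "two" is more than one transformation away from "thw" and is filtered out.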
Spelling dictionaries • Implementing a spelling identification and correction algorithm • STAGE 1: compare each string in the document with a list of legal strings; if there is no corresponding string in the list, mark it as misspelled • STAGE 2: generate a list of candidates • Apply any single transformation to the typo string • Filter the list by checking against a dictionary • STAGE 3: assign probability values to each candidate in the list • STAGE 4: select the best candidate
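STAGE 1 amounts to a lookup against the list of legal strings. A minimal sketch follows, assuming a tiny invented word list, lower-casing of the text and a simple letters-only tokeniser; none of these choices comes from the slides.

```python
import re

# Hypothetical list of legal strings; a real checker would load a full dictionary.
LEGAL = {"the", "giraffe", "ate", "leaves", "from", "tree"}

def find_misspelled(text, legal=LEGAL):
    """STAGE 1: return the strings in text that have no entry in the legal list."""
    tokens = re.findall(r"[a-z]+", text.lower())  # assumption: words are runs of a-z
    return [t for t in tokens if t not in legal]

flagged = find_misspelled("Thw graffe ate leaves from the tree")
print(flagged)
```

This stage can only catch non-word errors: an actual word error such as /there/ for /their/ is in the legal list and passes unnoticed.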
Spelling dictionaries • STAGE 3 • prior probability • given all the words in English, is this candidate more likely to be what the typist meant than that candidate? • P(c) = count(c)/N, where count(c) is the number of times candidate c occurs in a corpus and N is the total number of words in that corpus • likelihood • given the possible errors, or transformations, how likely is it that a particular error has operated on candidate c to produce the typo t? • P(t|c), calculated using a corpus of errors, or transformations • Bayesian rule: • get the product of the prior probability and the likelihood • P(c) × P(t|c)
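The ranking step can be sketched as a product of the two probabilities for each candidate. Every number below is invented purely to show the arithmetic for the typo /thw/; real priors would come from a word-frequency corpus and real likelihoods from a corpus of errors, and the names used are assumptions for the example.

```python
# All figures are made up for illustration only.
TOTAL_WORDS = 1_000_000                                       # N
WORD_COUNTS = {"the": 60_000, "thaw": 20, "thy": 50}          # count(c)
ERROR_LIKELIHOOD = {"the": 1e-3, "thaw": 5e-4, "thy": 1e-4}   # P(t|c) for t = "thw"

def score(candidate):
    """Prior probability times likelihood: P(c) x P(t|c)."""
    prior = WORD_COUNTS[candidate] / TOTAL_WORDS
    return prior * ERROR_LIKELIHOOD[candidate]

# Rank the candidates from most to least probable and pick the first.
ranked = sorted(WORD_COUNTS, key=score, reverse=True)
best = ranked[0]
print(ranked, best)
```

With these invented figures the very frequent word /the/ wins easily, which illustrates how the prior can dominate when the likelihoods are of similar size.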
Spelling dictionaries • non-word errors • Implementing a spelling identification and correction algorithm • STAGE 1: identify misspelled words • STAGE 2: generate list of candidates • STAGE 3a: rank candidates by probability • STAGE 3b: select best candidate • Implement: • noisy channel model • Bayesian rule
Next week Resources for globalisation • Machine translation • Translation memory