1 / 38

Regular Expression and Finite State Machine

Regular Expression and Finite State Machine. Based on Slides by Jim Martin. Regular Expressions and Text Searching. Everybody does it Emacs, vi, perl, grep, sed, awk, etc.. REs Character sequence Kleene star Character set, complement set Anchors Disjunction Grouping. Some Examples.

makani
Télécharger la présentation

Regular Expression and Finite State Machine

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Regular Expression and Finite State Machine Based on Slides by Jim Martin

  2. Regular Expressions and Text Searching • Everybody does it • Emacs, vi, perl, grep, sed, awk, etc.. • REs • Character sequence • Kleene star • Character set, complement set • Anchors • Disjunction • Grouping

  3. Some Examples Courtesy of Kathy McCoy

  4. Courtesy of Kathy McCoy

  5. RE E.G. /pupp(y|ies)/ Morphological variants of ‘puppy’ / (.+)ier and \1ier / happier and happier, fuzzier and fuzzier Courtesy of Kathy McCoy

  6. Optionality and Repetition • /[Ww]oodchucks?/ matches woodchucks, Woodchucks, woodchuck, Woodchuck • /colou?r/ matches color or colour • /he{3}/ matches heee • /(he){3}/ matches hehehe • /(he){3,} matches a sequence of at least 3 he’s Courtesy of Kathy McCoy

  7. Operator Precedence Hierarchy 1. Parentheses () 2. Counters * + ? {} 3. Sequence of Anchors the ^my end$ 4. Disjunction | Examples /moo+/ /try|ies/ /and|or/ Courtesy of Kathy McCoy

  8. A Simple Exercise • Write a regular expression to find all instances of the determiner “the”: The recent attempt by the police to retain their current rates of pay has not gathered much favor with the southern factions. Courtesy of Kathy McCoy

  9. A Simple Exercise • Write a regular expression to find all instances of the determiner “the”: /the/ The recent attempt by the police to retain their current rates of pay has not gathered much favor with the southern factions. Courtesy of Kathy McCoy

  10. A Simple Exercise • Write a regular expression to find all instances of the determiner “the”: /[Tt]he/ The recent attempt by the police to retain their current rates of pay has not gathered much favor with the southern factions. Courtesy of Kathy McCoy

  11. A Simple Exercise • Write a regular expression to find all instances of the determiner “the”: /\b[Tt]he\b/ The recent attempt by the police to retain their current rates of pay has not gathered much favor with the southern factions. Courtesy of Kathy McCoy

  12. The Two Kinds of Errors • The process we just went through was based on fixing errors in the regular expression • Errors where some of the instances were missed (judged to not be instances when they should have been) – False negatives • Errors where the instances were included (when they should not have been) – False positives • This is pretty much going to be the story of the rest of the course! Courtesy of Kathy McCoy

  13. Finite State Automata as Graphs • Regular expressions can be viewed as a textual way of specifying the structure of finite-state automata. • Let’s start with the sheep language from the text • /baa+!/

  14. Sheep FSA • We can say the following things about this machine • It has 5 states • At least b, a, and ! are in its alphabet • q0 is the start state • q4 is an accept state • It has 5 transitions

  15. But note • There are other machines that correspond to this language • More on this one later

  16. Morphology • Morphology is the study of the ways that words are built up from smaller meaningful units called morphemes • We can usefully divide morphemes into two classes • Stems: The core meaning bearing units • Affixes: Bits and pieces that adhere to stems to change their meanings and grammatical functions

  17. Morphology • We can also divide morphology up into two broad classes • Inflectional • Derivational

  18. Inflectional Morphology • Inflectional morphology concerns the combination of stems and affixes where the resulting word • Has the same word class as the original • Serves a grammatical/semantic purpose different from the original

  19. Nouns and Verbs (English) • Nouns are simple (not really) • Markers for plural and possessive • Verbs are only slightly more complex • Markers appropriate to the tense of the verb

  20. Regulars and Irregulars • Ok so it gets a little complicated by the fact that some words misbehave (refuse to follow the rules) • Mouse/mice, goose/geese, ox/oxen • Go/went, fly/flew • The terms regular and irregular will be used to refer to words that follow the rules and those that don’t.

  21. Regular and Irregular Verbs • Regulars… • Walk, walks, walking, walked, walked • Irregulars • Eat, eats, eating, ate, eaten • Catch, catches, catching, caught, caught • Cut, cuts, cutting, cut, cut

  22. Derivational Morphology • Derivational morphology is the messy stuff that no one ever taught you. • Quasi-systematicity • Irregular meaning change • Changes of word class

  23. Derivational Examples • Verb/Adj to Noun

  24. Derivational Examples • Noun/Verb to Adj

  25. Compute • Many paths are possible… • Start with compute • Computer -> computerize -> computerization • Computation -> computational • Computer -> computerize -> computerizable • Compute -> computee

  26. Stemming vs Morphology • Sometimes you just need to know the stem of a word and you don’t care about the structure. • In fact you may not even care if you get the right stem, as long as you get a consistent string. • This is stemming… it most often shows up in IR applications

  27. Stemming in IR • Run a stemmer on the documents to be indexed • Run a stemmer on users queries • Match • This is basically a form of hashing

  28. Porter Stemmer • No lexicon needed • Basically a set of staged sets of rewrite rules that strip suffixes • Handles both inflectional and derivational suffixes • Doesn’t guarantee that the resulting stem is really a stem (see first bullet) • Lack of guarantee doesn’t matter for IR

  29. wear wear wearable wearabl wearer wearer wearied weari wearier wearier weariest weariest wearily wearili weariness weari wearing wear wearisome wearisom wearisomely wearisom wears wear weather weather weathercock weathercock weathercocks weathercock web web Webb webb Webber webber webs web Webster webster Websterville webstervil wedded wedd wedding wedd weddings wedd wedge wedg wedged wedg wedges wedg wedging wedg Porter Stemmer: Examples

  30. static RuleList step1a_rules[] = { {101, "sses", "ss", 3, 1, 0, NULL}, {102, "ies", "i", 2, 0, 0, NULL}, {103, "ss", "ss", 1, 1, 0, NULL}, {104, "s", LAMBDA, 0, -1, 0, NULL}, {000, NULL, NULL, 0, 0, 0, NULL} }; static RuleList step1b_rules[] = { {105, "eed", "ee", 2, 1, 0, NULL}, {106, "ed", LAMBDA, 1, -1, -1, ContainsVowel}, {107, "ing", LAMBDA, 2, -1, -1, ContainsVowel}, {000, NULL, NULL, 0, 0, 0, NULL} };

  31. static RuleList step1b1_rules[] = { {108, "at", "ate", 1, 2, 0, NULL}, {109, "bl", "ble", 1, 2, 0, NULL}, {110, "iz", "ize", 1, 2, 0, NULL}, {111, "bb", "b", 1, 0, 0, NULL}, {112, "dd", "d", 1, 0, 0, NULL}, {113, "ff", "f", 1, 0, 0, NULL}, {114, "gg", "g", 1, 0, 0, NULL}, {115, "mm", "m", 1, 0, 0, NULL}, {116, "nn", "n", 1, 0, 0, NULL}, {117, "pp", "p", 1, 0, 0, NULL}, {118, "rr", "r", 1, 0, 0, NULL}, {119, "tt", "t", 1, 0, 0, NULL}, {120, "ww", "w", 1, 0, 0, NULL}, {121, "xx", "x", 1, 0, 0, NULL}, {122, LAMBDA, "e", -1, 0, 0, AddAnE}, {000, NULL, NULL, 0, 0, 0, NULL} };

  32. static RuleList step1c_rules[] = { {123, "y", "i", 0, 0, -1, ContainsVowel}, {000, NULL, NULL, 0, 0, 0, NULL} }; static RuleList step2_rules[] = { {203, "ational", "ate", 6, 2, 0, NULL}, {204, "tional", "tion", 5, 3, 0, NULL}, {205, "enci", "ence", 3, 3, 0, NULL}, {206, "anci", "ance", 3, 3, 0, NULL}, {207, "izer", "ize", 3, 2, 0, NULL}, {208, "abli", "able", 3, 3, 0, NULL}, {209, "alli", "al", 3, 1, 0, NULL}, {210, "entli", "ent", 4, 2, 0, NULL}, {211, "eli", "e", 2, 0, 0, NULL}, {213, "ousli", "ous", 4, 2, 0, NULL},

  33. static RuleList step3_rules[] = { {301, "icate", "ic", 4, 1, 0, NULL}, {302, "ative", LAMBDA, 4, -1, 0, NULL}, {303, "alize", "al", 4, 1, 0, NULL}, {304, "iciti", "ic", 4, 1, 0, NULL}, {305, "ical", "ic", 3, 1, 0, NULL}, {308, "ful", LAMBDA, 2, -1, 0, NULL}, {309, "ness", LAMBDA, 3, -1, 0, NULL}, {000, NULL, NULL, 0, 0, 0, NULL} };

  34. static RuleList step4_rules[] = { {401, "al", LAMBDA, 1, -1, 1, NULL}, {402, "ance", LAMBDA, 3, -1, 1, NULL}, {403, "ence", LAMBDA, 3, -1, 1, NULL}, {405, "er", LAMBDA, 1, -1, 1, NULL}, {406, "ic", LAMBDA, 1, -1, 1, NULL}, {407, "able", LAMBDA, 3, -1, 1, NULL}, {408, "ible", LAMBDA, 3, -1, 1, NULL}, {409, "ant", LAMBDA, 2, -1, 1, NULL}, {410, "ement", LAMBDA, 4, -1, 1, NULL}, {411, "ment", LAMBDA, 3, -1, 1, NULL},

  35. Problems with Stemming

  36. Soundex • You work as a telephone information operator. Someone calls looking for our senior theory professor… • What do you type as your query string?

  37. Soundex • Keep the first letter • Drop non-initial occurrences of vowels, h, w and y • Replace the remaining letters with numbers according to group (e.g.. b, f, p, and v -> 1 • Replace strings of identical numbers with a single number (333 -> 3) • Drop any numbers beyond a third one

  38. Soundex • Effect is to map (hash) all similar sounding transcriptions to the same code. • Structure your directory so that it can be accessed by code as well as by correct spelling • Used for census records, phone directories, author searches in libraries etc.

More Related