Regular Expression and Finite State Machine

Regular Expression and Finite State Machine Based on Slides by Jim Martin

Regular Expressions and Text Searching • Everybody does it • Emacs, vi, perl, grep, sed, awk, etc.. • REs • Character sequence • Kleene star • Character set, complement set • Anchors • Disjunction • Grouping

Some Examples Courtesy of Kathy McCoy

Courtesy of Kathy McCoy

RE E.G. /pupp(y|ies)/ Morphological variants of ‘puppy’ / (.+)ier and \1ier / happier and happier, fuzzier and fuzzier Courtesy of Kathy McCoy

Optionality and Repetition • /[Ww]oodchucks?/ matches woodchucks, Woodchucks, woodchuck, Woodchuck • /colou?r/ matches color or colour • /he{3}/ matches heee • /(he){3}/ matches hehehe • /(he){3,} matches a sequence of at least 3 he’s Courtesy of Kathy McCoy

Operator Precedence Hierarchy 1. Parentheses () 2. Counters * + ? {} 3. Sequence of Anchors the ^my end$ 4. Disjunction | Examples /moo+/ /try|ies/ /and|or/ Courtesy of Kathy McCoy

A Simple Exercise • Write a regular expression to find all instances of the determiner “the”: The recent attempt by the police to retain their current rates of pay has not gathered much favor with the southern factions. Courtesy of Kathy McCoy

A Simple Exercise • Write a regular expression to find all instances of the determiner “the”: /the/ The recent attempt by the police to retain their current rates of pay has not gathered much favor with the southern factions. Courtesy of Kathy McCoy

A Simple Exercise • Write a regular expression to find all instances of the determiner “the”: /[Tt]he/ The recent attempt by the police to retain their current rates of pay has not gathered much favor with the southern factions. Courtesy of Kathy McCoy

A Simple Exercise • Write a regular expression to find all instances of the determiner “the”: /\b[Tt]he\b/ The recent attempt by the police to retain their current rates of pay has not gathered much favor with the southern factions. Courtesy of Kathy McCoy

The Two Kinds of Errors • The process we just went through was based on fixing errors in the regular expression • Errors where some of the instances were missed (judged to not be instances when they should have been) – False negatives • Errors where the instances were included (when they should not have been) – False positives • This is pretty much going to be the story of the rest of the course! Courtesy of Kathy McCoy

Finite State Automata as Graphs • Regular expressions can be viewed as a textual way of specifying the structure of finite-state automata. • Let’s start with the sheep language from the text • /baa+!/

Sheep FSA • We can say the following things about this machine • It has 5 states • At least b, a, and ! are in its alphabet • q0 is the start state • q4 is an accept state • It has 5 transitions

But note • There are other machines that correspond to this language • More on this one later

Morphology • Morphology is the study of the ways that words are built up from smaller meaningful units called morphemes • We can usefully divide morphemes into two classes • Stems: The core meaning bearing units • Affixes: Bits and pieces that adhere to stems to change their meanings and grammatical functions

Morphology • We can also divide morphology up into two broad classes • Inflectional • Derivational

Inflectional Morphology • Inflectional morphology concerns the combination of stems and affixes where the resulting word • Has the same word class as the original • Serves a grammatical/semantic purpose different from the original

Nouns and Verbs (English) • Nouns are simple (not really) • Markers for plural and possessive • Verbs are only slightly more complex • Markers appropriate to the tense of the verb

Regulars and Irregulars • Ok so it gets a little complicated by the fact that some words misbehave (refuse to follow the rules) • Mouse/mice, goose/geese, ox/oxen • Go/went, fly/flew • The terms regular and irregular will be used to refer to words that follow the rules and those that don’t.

Regular and Irregular Verbs • Regulars… • Walk, walks, walking, walked, walked • Irregulars • Eat, eats, eating, ate, eaten • Catch, catches, catching, caught, caught • Cut, cuts, cutting, cut, cut

Derivational Morphology • Derivational morphology is the messy stuff that no one ever taught you. • Quasi-systematicity • Irregular meaning change • Changes of word class

Derivational Examples • Verb/Adj to Noun

Derivational Examples • Noun/Verb to Adj

Compute • Many paths are possible… • Start with compute • Computer -> computerize -> computerization • Computation -> computational • Computer -> computerize -> computerizable • Compute -> computee

Stemming vs Morphology • Sometimes you just need to know the stem of a word and you don’t care about the structure. • In fact you may not even care if you get the right stem, as long as you get a consistent string. • This is stemming… it most often shows up in IR applications

Stemming in IR • Run a stemmer on the documents to be indexed • Run a stemmer on users queries • Match • This is basically a form of hashing

Porter Stemmer • No lexicon needed • Basically a set of staged sets of rewrite rules that strip suffixes • Handles both inflectional and derivational suffixes • Doesn’t guarantee that the resulting stem is really a stem (see first bullet) • Lack of guarantee doesn’t matter for IR

wear wear wearable wearabl wearer wearer wearied weari wearier wearier weariest weariest wearily wearili weariness weari wearing wear wearisome wearisom wearisomely wearisom wears wear weather weather weathercock weathercock weathercocks weathercock web web Webb webb Webber webber webs web Webster webster Websterville webstervil wedded wedd wedding wedd weddings wedd wedge wedg wedged wedg wedges wedg wedging wedg Porter Stemmer: Examples

static RuleList step1a_rules[] = { {101, "sses", "ss", 3, 1, 0, NULL}, {102, "ies", "i", 2, 0, 0, NULL}, {103, "ss", "ss", 1, 1, 0, NULL}, {104, "s", LAMBDA, 0, -1, 0, NULL}, {000, NULL, NULL, 0, 0, 0, NULL} }; static RuleList step1b_rules[] = { {105, "eed", "ee", 2, 1, 0, NULL}, {106, "ed", LAMBDA, 1, -1, -1, ContainsVowel}, {107, "ing", LAMBDA, 2, -1, -1, ContainsVowel}, {000, NULL, NULL, 0, 0, 0, NULL} };

static RuleList step1b1_rules[] = { {108, "at", "ate", 1, 2, 0, NULL}, {109, "bl", "ble", 1, 2, 0, NULL}, {110, "iz", "ize", 1, 2, 0, NULL}, {111, "bb", "b", 1, 0, 0, NULL}, {112, "dd", "d", 1, 0, 0, NULL}, {113, "ff", "f", 1, 0, 0, NULL}, {114, "gg", "g", 1, 0, 0, NULL}, {115, "mm", "m", 1, 0, 0, NULL}, {116, "nn", "n", 1, 0, 0, NULL}, {117, "pp", "p", 1, 0, 0, NULL}, {118, "rr", "r", 1, 0, 0, NULL}, {119, "tt", "t", 1, 0, 0, NULL}, {120, "ww", "w", 1, 0, 0, NULL}, {121, "xx", "x", 1, 0, 0, NULL}, {122, LAMBDA, "e", -1, 0, 0, AddAnE}, {000, NULL, NULL, 0, 0, 0, NULL} };

static RuleList step1c_rules[] = { {123, "y", "i", 0, 0, -1, ContainsVowel}, {000, NULL, NULL, 0, 0, 0, NULL} }; static RuleList step2_rules[] = { {203, "ational", "ate", 6, 2, 0, NULL}, {204, "tional", "tion", 5, 3, 0, NULL}, {205, "enci", "ence", 3, 3, 0, NULL}, {206, "anci", "ance", 3, 3, 0, NULL}, {207, "izer", "ize", 3, 2, 0, NULL}, {208, "abli", "able", 3, 3, 0, NULL}, {209, "alli", "al", 3, 1, 0, NULL}, {210, "entli", "ent", 4, 2, 0, NULL}, {211, "eli", "e", 2, 0, 0, NULL}, {213, "ousli", "ous", 4, 2, 0, NULL},

static RuleList step3_rules[] = { {301, "icate", "ic", 4, 1, 0, NULL}, {302, "ative", LAMBDA, 4, -1, 0, NULL}, {303, "alize", "al", 4, 1, 0, NULL}, {304, "iciti", "ic", 4, 1, 0, NULL}, {305, "ical", "ic", 3, 1, 0, NULL}, {308, "ful", LAMBDA, 2, -1, 0, NULL}, {309, "ness", LAMBDA, 3, -1, 0, NULL}, {000, NULL, NULL, 0, 0, 0, NULL} };

static RuleList step4_rules[] = { {401, "al", LAMBDA, 1, -1, 1, NULL}, {402, "ance", LAMBDA, 3, -1, 1, NULL}, {403, "ence", LAMBDA, 3, -1, 1, NULL}, {405, "er", LAMBDA, 1, -1, 1, NULL}, {406, "ic", LAMBDA, 1, -1, 1, NULL}, {407, "able", LAMBDA, 3, -1, 1, NULL}, {408, "ible", LAMBDA, 3, -1, 1, NULL}, {409, "ant", LAMBDA, 2, -1, 1, NULL}, {410, "ement", LAMBDA, 4, -1, 1, NULL}, {411, "ment", LAMBDA, 3, -1, 1, NULL},

Problems with Stemming

Soundex • You work as a telephone information operator. Someone calls looking for our senior theory professor… • What do you type as your query string?

Soundex • Keep the first letter • Drop non-initial occurrences of vowels, h, w and y • Replace the remaining letters with numbers according to group (e.g.. b, f, p, and v -> 1 • Replace strings of identical numbers with a single number (333 -> 3) • Drop any numbers beyond a third one

Soundex • Effect is to map (hash) all similar sounding transcriptions to the same code. • Structure your directory so that it can be accessed by code as well as by correct spelling • Used for census records, phone directories, author searches in libraries etc.

Regular Expression and Finite State Machine

Regular Expression and Finite State Machine

Presentation Transcript

Finite State Machine

Finite State Machine Continued

Chapter #8: Finite State Machine Design 8.5 Finite State Machine Word Problems

Finite State Machine

Finite State Machine Minimization

Regular Languages Regular Expressions Finite-State Automata

Regular Expressions and Finite State Automata

Lab 2 – Finite State Machine

Flip-flop and Finite State Machine

Finite State Machine

Finite State Machine: Realization

Finite State Machine

Chapter #8: Finite State Machine Design 8.1 - 8.2 Finite State Machine Design

Finite State Machine: Introduction

Finite State Machine

Finite Automata (Finite State Machine)

Regular Expressions and Finite State Automata

Finite State Machine (FSM)

Finite State Machine Analysis

Finite State Machine Design

The Finite State MacHine

Finite State Machine(FSM)