This lecture covers the fundamentals of Regular Expressions (Regex) and Finite Automata, crucial concepts in software design. It explores how regular expressions define languages through sets of strings. The lecture highlights various Regex rules including disjunction, Kleene closure, and wildcards, providing practical examples. Additionally, the session explains how finite automata recognize valid strings in a language, drawing parallels with compiler mechanisms. This comprehensive overview equips learners with essential tools for pattern matching and string manipulation in programming.
Foundations of Software Design Lecture 22: Regular Expressions and Finite Automata Marti Hearst Fall 2002
Regular Expressions
• Language = set of strings
• A language is defined by a regular expression: it is the set of strings that match the expression.

Regular Exp.     Corresponding Set of Strings
ε (empty)        {""}
a                {"a"}
(ab)*            {"", "ab", "abab", "ababab", ...}
a | b | c        {"a", "b", "c"}
(a | b | c)*     {"", "a", "b", "c", "aa", "ab", ..., "bccabb", ...}

Adapted from lecture by Ras Bodik, http://www.cs.wisc.edu/~bodik/cs536.html
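The table above can be checked mechanically. The sketch below is illustrative only: it uses Python's `re` module instead of Perl, and the helper name `language_up_to` is an assumption of this example, not part of the lecture. It lists every string up to a given length that a pattern matches in full, which gives a finite slice of the (possibly infinite) language.

```python
import re
from itertools import product

def language_up_to(pattern, alphabet, max_len):
    """Return every string over `alphabet` with length <= max_len that the
    pattern matches in full (a finite slice of the pattern's language)."""
    compiled = re.compile(pattern)
    found = []
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            candidate = "".join(chars)
            if compiled.fullmatch(candidate):
                found.append(candidate)
    return found

print(language_up_to(r"(ab)*", "ab", 6))      # ['', 'ab', 'abab', 'ababab']
print(language_up_to(r"a|b|c", "abc", 2))     # ['a', 'b', 'c']
print(len(language_up_to(r"(a|b|c)*", "abc", 2)))  # 13
```

`fullmatch` is used because membership in a language is a statement about the whole string, not about a substring.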
Regex rules (Perl)
• Goal: match patterns
• // A string of characters matches the same string
    /woodchuck/       “how much wood does a woodchuck chuck?”
    /p/               “you are a good programmer”
    /pink elephant/   “this is not a pink elephant.”
    /!/               “Keep your hands to yourself!”
• [] Disjunction
    /[wW]ood/         “how much wood does a Woodchuck chuck?”
    /[abcd]*/         “you are a good programmer”
    /[A-Za-z]*/       (any letter sequence)
• Special rule: when ^ is FIRST WITHIN BRACKETS it means NOT
    /[^A-Z]*/         (any sequence of characters that are not upper-case letters)
    /a\^b/            “look up a^b now” (outside brackets ^ is an anchor, so a literal caret must be escaped)
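These rules can be tried out directly. The following is a small sketch in Python, whose `re` syntax agrees with Perl for these constructs; the sentence "Hello World" is a made-up input for this example.

```python
import re

text = "how much wood does a Woodchuck chuck?"

# A literal pattern matches the same characters wherever they first occur.
print(re.search(r"wood", text).group())        # wood

# Brackets match any ONE of the listed characters.
print(re.findall(r"[wW]ood", text))            # ['wood', 'Wood']

# A caret placed first inside brackets negates the set.
print(re.findall(r"[^A-Z]+", "Hello World"))   # ['ello ', 'orld']

# Outside brackets an unescaped caret is an anchor, so it has to be
# escaped to match a literal '^'.
print(re.search(r"a^b", "look up a^b now"))            # None
print(re.search(r"a\^b", "look up a^b now").group())   # a^b

# Because * allows zero repetitions, a search can succeed with an empty
# match at position 0 before it ever reaches one of the listed characters.
print(repr(re.search(r"[abcd]*", "you are a good programmer").group()))  # ''
```

The last line is worth noting: a pattern made only of starred items always matches, so `+` is usually the better choice when at least one character is wanted.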
Regex Rules, cont.
• ? The preceding character or nothing
    /woodchucks?/     “how much wood does a woodchuck chuck?”
    /behaviou?r/      “behaviour is the British spelling”
• * Kleene closure (Kleene star); zero or more occurrences of the preceding character or regular expression
    /baa*/            ba, baa, baaa, baaaa …
    /ba*/             b, ba, baa, baaa, baaaa …
    /[ab]*/           "" (the empty string), a, b, ab, ba, baaa, aaabbb, …
    /[0-9][0-9]*/     any unsigned integer (one or more digits)
• + Kleene plus (positive closure); one or more occurrences of the preceding character or regular expression
    /ba+/             ba, baa, baaa, baaaa …
• . Wildcard; matches any single character at that position
    /p.nt/            pant, pint, punt
    /cat.*cat/        a string in which “cat” appears at least twice
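A quick way to compare these operators is to run several patterns over the same candidate strings. This is a sketch in Python rather than Perl; `matches_all` is a hypothetical helper written for this example.

```python
import re

def matches_all(pattern, candidates):
    """Return the candidates that the pattern matches in full."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

words = ["b", "ba", "baa", "baaa"]
print(matches_all(r"ba*", words))    # ['b', 'ba', 'baa', 'baaa']
print(matches_all(r"ba+", words))    # ['ba', 'baa', 'baaa']
print(matches_all(r"baa*", words))   # ['ba', 'baa', 'baaa']

print(matches_all(r"woodchucks?", ["woodchuck", "woodchucks", "woodchuckss"]))
# ['woodchuck', 'woodchucks']
print(matches_all(r"behaviou?r", ["behavior", "behaviour"]))
# ['behavior', 'behaviour']
print(matches_all(r"p.nt", ["pant", "pint", "punt", "paint"]))
# ['pant', 'pint', 'punt']
print(matches_all(r"[0-9][0-9]*", ["0", "42", "007", "", "4a"]))
# ['0', '42', '007']
```

The second and third lines give the same result, which illustrates that `x+` is shorthand for `xx*`.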
Regex Rules, cont.
• | Disjunction
    /(cats?|dogs?)+/      “It’s raining cats and a dog.”
• ( ) Grouping
    /(gupp(y|ies))*/      “His guppy is the king of guppies.”
• ^ $ \b Anchors (start of line, end of line, word boundary)
    /^The/                “The cat in the hat.”
    /^The end\.$/         “The end.”
    /^The .* end\.$/      “The bitter end.”
    /(the)*/              “I saw him the other day.” (“the” also occurs inside “other”)
    /(\bthe\b)*/          “I saw him the other day.” (\b restricts “the” to a whole word)
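The effect of the anchors is easiest to see by comparing match positions. A minimal sketch in Python follows; the non-capturing group `(?:...)` is a Python/Perl extension used here only so that `findall` returns whole matches, and the string "The end?" is a made-up counterexample.

```python
import re

sentence = "I saw him the other day."

# Without word boundaries, "the" is also found inside "other".
print([m.start() for m in re.finditer(r"the", sentence)])       # [10, 15]
# With \b on both sides, only the standalone word is found.
print([m.start() for m in re.finditer(r"\bthe\b", sentence)])   # [10]

# ^ and $ tie the pattern to the start and end of the line.
print(bool(re.search(r"^The", "The cat in the hat.")))           # True
print(bool(re.search(r"^The end\.$", "The end.")))               # True
print(bool(re.search(r"^The .* end\.$", "The bitter end.")))     # True
print(bool(re.search(r"^The end\.$", "The end?")))               # False

# | chooses between alternatives; parentheses limit its scope.
print(re.findall(r"cats?|dogs?", "It's raining cats and a dog."))
# ['cats', 'dog']
print(re.findall(r"gupp(?:y|ies)", "His guppy is the king of guppies."))
# ['guppy', 'guppies']
```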
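Regexp for Dollars
• No commas
    /\$[0-9]+(\.[0-9][0-9])?/
• With commas
    /\$[0-9][0-9]?[0-9]?(,[0-9][0-9][0-9])*(\.[0-9][0-9])?/
• With or without commas
    /\$[0-9][0-9]?[0-9]?((,[0-9][0-9][0-9])*|[0-9]*)(\.[0-9][0-9])?/
(The dollar sign is escaped because an unescaped $ is the end-of-line anchor; spaces inside a pattern are significant, so none are used.)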
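To see how the three dollar patterns differ, they can be run against a few sample amounts. This sketch uses Python; the constant names, the `classify` helper, and the sample amounts are assumptions of this example. The dollar sign is escaped in each pattern so that it is read as a literal character.

```python
import re

NO_COMMAS = r"\$[0-9]+(\.[0-9][0-9])?"
WITH_COMMAS = r"\$[0-9][0-9]?[0-9]?(,[0-9][0-9][0-9])*(\.[0-9][0-9])?"
EITHER = r"\$[0-9][0-9]?[0-9]?((,[0-9][0-9][0-9])*|[0-9]*)(\.[0-9][0-9])?"

def classify(amount):
    """Report which of the three patterns match the whole amount, as a
    tuple (no_commas, with_commas, either)."""
    return tuple(
        bool(re.fullmatch(p, amount)) for p in (NO_COMMAS, WITH_COMMAS, EITHER)
    )

for amount in ["$12", "$1234.56", "$1,234.56", "$1,23", "$1234,567"]:
    print(amount, classify(amount))
# $12 (True, True, True)
# $1234.56 (True, False, True)
# $1,234.56 (False, True, True)
# $1,23 (False, False, False)
# $1234,567 (False, False, False)
```

The last two rows show that the combined pattern still rejects malformed or mixed groupings.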
Regexps and Substitutions
• s/(title)/<\1>/
    title → <title>
  (\1 refers to the text matched by the first parenthesized group)
• /the (.*)er they are, the \1er they will be/i
    The bigger they are, the bigger they will be.
• /the (.*)er they (.*), the \1er they \2/i
    The bigger they were, the bigger they were
  (the i modifier ignores case, so “the” also matches “The”)
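Backreferences can be exercised with a short script. This is a sketch in Python, where `re.sub` plays the role of Perl's `s///` and `re.IGNORECASE` is used so that the lower-case "the" in the pattern also matches "The". The sentence with "faster" is a made-up counterexample.

```python
import re

# In a replacement, \1 stands for the text captured by the first group.
print(re.sub(r"(title)", r"<\1>", "title"))   # <title>

# In a pattern, \1 must match the SAME text the first group captured.
ONE_GROUP = r"the (.*)er they are, the \1er they will be"
print(bool(re.search(ONE_GROUP,
                     "The bigger they are, the bigger they will be.",
                     re.IGNORECASE)))          # True
print(bool(re.search(ONE_GROUP,
                     "The bigger they are, the faster they will be.",
                     re.IGNORECASE)))          # False

# Two groups, two backreferences.
TWO_GROUPS = r"the (.*)er they (.*), the \1er they \2"
m = re.search(TWO_GROUPS, "The bigger they were, the bigger they were",
              re.IGNORECASE)
print(m.groups())                              # ('bigg', 'were')
```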
How to Simulate a Therapist
• Eliza examples
    s/.* all .*/IN WHAT WAY/
    s/.* I am (sad|depressed).*/I AM SORRY TO HEAR YOU ARE \1/
    s/.* I am (happy|glad).*/I AM GLAD TO HEAR YOU ARE \1/
    s/.* always .*/CAN YOU THINK OF A SPECIFIC EXAMPLE/
    s/.*/TELL ME ABOUT YOUR MOTHER/
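The Eliza rules can be read as an ordered list in which the first matching rule produces the reply. The sketch below makes that reading explicit in Python. It is a simplification under stated assumptions: the `respond` function, the use of `fullmatch` plus `Match.expand` in place of Perl's `s///`, the upper-casing of the reply, and the sample inputs are all choices of this example, not of the lecture.

```python
import re

# (pattern, reply template) pairs, tried from top to bottom.
RULES = [
    (r".* all .*", "IN WHAT WAY"),
    (r".* I am (sad|depressed).*", r"I AM SORRY TO HEAR YOU ARE \1"),
    (r".* I am (happy|glad).*", r"I AM GLAD TO HEAR YOU ARE \1"),
    (r".* always .*", "CAN YOU THINK OF A SPECIFIC EXAMPLE"),
    (r".*", "TELL ME ABOUT YOUR MOTHER"),   # catch-all, must come last
]

def respond(utterance):
    """Return the reply of the first rule whose pattern matches the whole
    utterance, with \\1 replaced by the captured word."""
    for pattern, template in RULES:
        m = re.fullmatch(pattern, utterance)
        if m:
            return m.expand(template).upper()
    return ""  # unreachable for single-line input because of the catch-all

print(respond("Men are all alike."))            # IN WHAT WAY
print(respond("Lately I am depressed."))        # I AM SORRY TO HEAR YOU ARE DEPRESSED
print(respond("They are always bugging us."))   # CAN YOU THINK OF A SPECIFIC EXAMPLE
print(respond("Hello"))                         # TELL ME ABOUT YOUR MOTHER
```

Note that each specific pattern begins with `.* ` (any text followed by a space), so an utterance that starts directly with "I am" falls through to the catch-all rule.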
Practice: Regexps for email addresses
• Handle cases like these:
    hearst@sims.berkeley.edu
    Wacky@yahoo.com
    Vip@ic-arda.gov
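One possible answer to the exercise, offered as a sketch and not as a full email validator: the pattern below is an assumption of this example, covers only the three cases listed, and is much simpler than real address syntax. It is written in Python and built from named pieces.

```python
import re

# Hypothetical building blocks chosen for this exercise only.
LOCAL = r"[A-Za-z0-9._-]+"     # part before the @
LABEL = r"[A-Za-z0-9-]+"       # one dot-separated piece of the host name
EMAIL = rf"{LOCAL}@{LABEL}(\.{LABEL})+"   # host needs at least one dot

def looks_like_email(text):
    """True if the whole text fits the simplified pattern above."""
    return re.fullmatch(EMAIL, text) is not None

for address in ["hearst@sims.berkeley.edu", "Wacky@yahoo.com",
                "Vip@ic-arda.gov", "user@localhost", "@yahoo.com"]:
    print(address, looks_like_email(address))
# hearst@sims.berkeley.edu True
# Wacky@yahoo.com True
# Vip@ic-arda.gov True
# user@localhost False
# @yahoo.com False
```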
Regexps and Scanners
• Regular expressions are used to define the language recognized by the scanner for the parser
• We create rules in which names stand for regular expressions
• Example:
    digit: [0-9]
    letter: [A-Za-z]
Precedence of operators
• What is the difference?
    letter letter | digit*
    letter (letter | digit)*
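The difference can be observed by expanding the named rules `letter` and `digit` and testing both expressions on the same strings. This is a sketch in Python; the variable names and the sample strings are assumptions of this example. It relies on the usual precedence, in which `*` binds tightest, then concatenation, then `|`.

```python
import re

LETTER = "[A-Za-z]"
DIGIT = "[0-9]"

# letter letter | digit*   is read as   (letter letter) | (digit*)
UNGROUPED = f"{LETTER}{LETTER}|{DIGIT}*"
# letter (letter | digit)*   is a letter followed by any mix of letters and digits
GROUPED = f"{LETTER}({LETTER}|{DIGIT})*"

def in_language(pattern, text):
    """True if the whole text belongs to the pattern's language."""
    return re.fullmatch(pattern, text) is not None

for s in ["ab", "", "123", "a", "x27", "tmp2", "2x"]:
    print(repr(s), in_language(UNGROUPED, s), in_language(GROUPED, s))
# 'ab' True True
# '' True False
# '123' True False
# 'a' False True
# 'x27' False True
# 'tmp2' False True
# '2x' False False
```

Without parentheses the expression accepts either exactly two letters or a run of digits (including the empty string); with parentheses it describes identifiers.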
Practicing Regexps
• Describe (in English) the language defined by each of the following regular expressions:
    letter (letter | digit*)
    digit digit* "." digit digit*
Three Equivalent Representations
• Regular expressions
• Finite automata
• Regular languages
Each can describe the others.
Adapted from Jurafsky & Martin 2000
Finite Automata
• An FA is similar to a compiler in that:
  • A compiler recognizes legal programs in some (source) language.
  • A finite-state machine recognizes legal strings in some language.
• Example: Pascal identifiers
  • sequences of one or more letters or digits, starting with a letter:
[State diagram: start state S, accepting state A; S → A on letter; A → A on letter | digit]
Finite-Automata State Graphs
• A state
• The start state
• An accepting state
• A transition
[Figure: the graphical symbol for each element; the transition arrow is labeled with its input symbol, a]
Finite Automata
• Transition: s1 --a--> s2
• Is read: in state s1, on input “a”, go to state s2
• If end of input:
  • If in accepting state => accept
  • Otherwise => reject
• If no transition possible (got stuck) => reject
• FSA = Finite State Automaton
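The accept/reject procedure above translates almost line for line into code. The following is a minimal table-driven sketch in Python; the function name `accepts` and the two-state example automaton (s1 to s2 on "a", then s2 to s2 on "b") are hypothetical and were chosen only to illustrate the rules.

```python
def accepts(transitions, start, accepting, text):
    """Simulate a deterministic finite automaton on `text`.

    `transitions` maps (state, symbol) to the next state."""
    state = start
    for symbol in text:
        key = (state, symbol)
        if key not in transitions:
            return False              # no transition possible: stuck, reject
        state = transitions[key]      # in state s1 on input a, go to s2
    return state in accepting         # end of input: accept only in an accepting state

# Hypothetical automaton: s1 --a--> s2, s2 --b--> s2, with s2 accepting.
TRANSITIONS = {("s1", "a"): "s2", ("s2", "b"): "s2"}

for word in ["a", "abb", "", "ba", "aa"]:
    print(repr(word), accepts(TRANSITIONS, "s1", {"s2"}, word))
# 'a' True
# 'abb' True
# '' False
# 'ba' False
# 'aa' False
```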
Language defined by FSA
• The language defined by an FSA is the set of strings accepted by the FSA.
• In the language of the FSA shown below:
    x, tmp2, XyZzy, position27
• Not in the language of the FSA shown below:
    123, a?, 13apples
[State diagram: start state S, accepting state A; S → A on letter; A → A on letter | digit]
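The membership claims above can be verified by coding the identifier automaton directly. This is a sketch in Python under stated assumptions: edges are labeled with character classes instead of single characters, "letter" is taken to mean ASCII letters, and the names `char_class` and `is_identifier` are made up for this example.

```python
import string

def char_class(ch):
    """Map a character to the edge label used in the state graph."""
    if ch in string.ascii_letters:
        return "letter"
    if ch in string.digits:
        return "digit"
    return "other"

# S --letter--> A, A --letter--> A, A --digit--> A; A is the accepting state.
EDGES = {("S", "letter"): "A", ("A", "letter"): "A", ("A", "digit"): "A"}

def is_identifier(text):
    """Run the identifier automaton; True if it ends in state A."""
    state = "S"
    for ch in text:
        state = EDGES.get((state, char_class(ch)))
        if state is None:
            return False   # stuck: no edge for this character
    return state == "A"

print([w for w in ["x", "tmp2", "XyZzy", "position27"] if is_identifier(w)])
# ['x', 'tmp2', 'XyZzy', 'position27']
print([w for w in ["123", "a?", "13apples"] if is_identifier(w)])
# []
```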
Example: Integer Literals
• FSA that accepts integer literals with an optional + or - sign:
• /(\+|-)?[0-9]+/
• Note: two different edges from S to A (one labeled +, one labeled -)
[State diagram: start state S, intermediate state A, accepting state B; S → A on + and on -; S → B on digit; A → B on digit; B → B on digit]
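Since the regular expression and the automaton are meant to describe the same language, one can test that they agree. The sketch below, in Python, hand-codes the three-state automaton (with the edges as read from the state graph, which is an assumption of this example) and compares it with the regular expression on every string up to length 4 over a small test alphabet. The names `fsa_accepts` and `agree_on_all` and the alphabet are hypothetical.

```python
import re
from itertools import product

INTEGER_RE = re.compile(r"(\+|-)?[0-9]+")

def fsa_accepts(text):
    """States: S (start), A (sign seen), B (at least one digit, accepting)."""
    state = "S"
    for ch in text:
        if state == "S" and ch in "+-":
            state = "A"                  # the two sign edges from S to A
        elif ch in "0123456789":
            state = "B"                  # digit edges from S, A and B all lead to B
        else:
            return False                 # stuck
    return state == "B"

def agree_on_all(alphabet="+-07a", max_len=4):
    """Check that the automaton and the regular expression give the same
    answer on every string over `alphabet` up to `max_len` characters."""
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            s = "".join(chars)
            if fsa_accepts(s) != bool(INTEGER_RE.fullmatch(s)):
                return False
    return True

print(fsa_accepts("-42"), fsa_accepts("+"), fsa_accepts("4-2"))  # True False False
print(agree_on_all())                                            # True
```

An exhaustive check over short strings is not a proof of equivalence, but it catches most transcription mistakes in either representation.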