360 likes | 511 Vues
CMSC 723: Intro to Computational Linguistics. Lecture 2: February 4, 2004 Regular Expressions and Finite State Automata Professor Bonnie J. Dorr Dr. Nizar Habash TAs: Nitin Madnani and Nate Waisbrot. Regular Expressions and Finite State Automata. REs: Language for specifying text strings
E N D
CMSC 723: Intro to Computational Linguistics Lecture 2: February 4, 2004 Regular Expressions and Finite State Automata Professor Bonnie J. DorrDr. Nizar Habash TAs: Nitin Madnani and Nate Waisbrot
Regular Expressions and Finite State Automata • REs: Language for specifying text strings • Search for document containing a string • Searching for “woodchuck” • How much wood would a woodchuckchuck if a woodchuck would chuck wood? • Searching for “woodchucks” with an optional final “s” • Finite-state automata (FSA)(singular: automaton)
Regular Expressions • Basic regular expression patterns • Perl-based syntax (slightly different from other notations for regular expressions) • Disjunctions /[wW]oodchuck/
Regular Expressions • Ranges [A-Z] • Negations [^Ss]
*+ Stephen Cole Kleene Regular Expressions • Optional characters ? ,* and + • ? (0 or 1) • /colou?r/ colororcolour • * (0 or more) • /oo*h!/ oh! or Ooh! or Ooooh! • + (1 or more) • /o+h!/ oh! or Ooh! or Ooooh! • Wild cards . • /beg.n/ begin or began or begun
Regular Expressions • Anchors ^ and $ • /^[A-Z]/ “Ramallah, Palestine” • /^[^A-Z]/ “¿verdad?” “really?” • /\.$/ “It is over.” • /.$/ ? • Boundaries \b and \B • /\bon\b/ “on my way” “Monday” • /\Bon\b/ “automaton” • Disjunction | • /yours|mine/ “it is either yours or mine”
Disjunction, Grouping, Precedence • Column 1 Column 2 Column 3 …How do we express this? /Column [0-9]+ */ /(Column [0-9]+ *)*/ • Precedence • Parenthesis () • Counters * + ? {} • Sequences and anchors the ^my end$ • Disjunction | • REs are greedy!
Perl Commands While($line=<STDIN>){ if ($line =~ /the/){ print “MATCH: $line”; } }
Writing correct expressions Exercise: write a Perl regular expression to match the English article “the”: /the/ /[tT]he/ /\b[tT]he\b/ /[^a-zA-Z][tT]he[^a-zA-Z]/ /(^|[^a-zA-Z])[tT]he[^a-zA-Z]/
A more complex example Exercise: Write a regular expression that will match “any PC with more than 500MHz and 32 Gb of disk space for less than $1000”: /$[0-9]+/ /$[0-9]+\.[0-9][0-9]/ /\b$[0-9]+(\.[0-9][0-9])?\b/ /\b$[0-9][0-9]?[0-9]?(\.[0-9][0-9])?\b/ /\b[0-9]+ *([MG]Hz|[Mm]egahertz| [Gg]igahertz)\b/ /\b[0-9]+ *(Mb|[Mm]egabytes?)\b/ /\b[0-9](\.[0-9]+) *(Gb|[Gg]igabytes?)\b/
Substitutions and Memory Substitute as many times as possible! • Substitutions s/colour/color/ s/colour/color/g Case insensitive matching s/colour/color/i • Memory (\1,\2, etc. refer back to matches) s/([0-9]+)/<\1>/ /the (.*)er they were, the \1er they will be/ /the (.*)er they (.*), the \1er they \2/
Eliza [Weizenbaum, 1966] User: Men are all alike ELIZA: IN WHAT WAY User: They’re always bugging us about something or other ELIZA: CAN YOU THINK OF A SPECIFIC EXAMPLE? User: Well, my boyfriend made me come here ELIZA: YOUR BOYFRIEND MADE YOU COME HERE User: He says I’m depressed much of the time ELIZA: I AM SORRY TO HEAR THAT YOU ARE DEPRESSED
Eliza-style regular expressions Step 1: replace first person references with second person references s/.* YOU ARE (depressed|sad) .*/I AM SORRY TO HEAR YOU ARE \1/ s/.* YOU ARE (depressed|sad) .*/WHY DO YOU THINK YOU ARE \1/ s/.* all .*/IN WHAT WAY/ s/.* always .*/CAN YOU THINK OF A SPECIFIC EXAMPLE/ s/\bI(’m| am)\b/YOU ARE/g s/\bmy\b/YOUR/g S/\bmine\b/YOURS/g Step 2: use additional regular expressions to generate replies Step 3: use scores to rank possible transformations
Finite-state Automata • Finite-state automata (FSA) • Regular languages • Regular expressions
baa! baaa! baaaa! baaaaa! ... /baa+!/ a b a a ! q0 q1 q2 q3 q4 finalstate state transition Finite-state Automata (Machines)
q0 a b a ! b a b a a ! 0 1 2 3 4 Input Tape REJECT
q4 q0 q1 q2 q3 q3 a b a a a b a a ! ! 0 1 2 3 4 Input Tape ACCEPT
Finite-state Automata • Q: a finite set of N states • Q = {q0, q1, q2, q3, q4} • : a finite input alphabet of symbols • = {a, b, !} • q0: the start state • F: the set of final states • F = {q4} • (q,i): transition function • Given state q and input symbol i, return new state q' • (q3,!) q4
D-RECOGNIZE function D-RECOGNIZE (tape, machine) returns accept or rejectindex Beginning of tapecurrent-state Initial state of machineloopif End of input has been reached thenif current-state is an accept state thenreturn acceptelsereturn rejectelsiftransition-table [current-state, tape[index]] is empty thenreturn rejectelsecurrent-state transition-table [current-state, tape[index]]index index + 1end
! ! b ! b ! b b a a qF Adding a failing state a b a a ! q0 q1 q2 q3 q4
Adding an “all else” arc a b a a ! q0 q1 q2 q3 q4 = = = = qF
Languages and Automata • Can use FSA as a generator as well as a recognizer • Formal language L: defined by machine M that both generates and recognizes all and only the strings of that language. • L(M) = {baa!, baaa!, baaaa!, …} • Regular languages vs. non-regular languages
Languages and Automata • Deterministic vs. Non-deterministic FSAs • Epsilon () transitions
Using NFSAs to accept strings • Backup: add markers at choice points, then possibly revisit unexplored arcs at marked choice point. • Look-ahead: look ahead in input • Parallelism: look at alternatives in parallel
More about FSAs • Transducers • Equivalence of DFSAs and NFSAs • Recognition as search: depth-first, breadth-search
Regular languages • Regular languages are characterized by FSAs • For every NFSA, there is an equivalent DFSA. • Regular languages are closed under concatenation, Kleene closure, union.
Readings for next time • J&M Chapter 3