
CMSC 723: Intro to Computational Linguistics

Lecture 2: February 4, 2004. Regular Expressions and Finite State Automata. Professor Bonnie J. Dorr, Dr. Nizar Habash. TAs: Nitin Madnani and Nate Waisbrot.


Presentation Transcript


  1. CMSC 723: Intro to Computational Linguistics. Lecture 2: February 4, 2004. Regular Expressions and Finite State Automata. Professor Bonnie J. Dorr, Dr. Nizar Habash. TAs: Nitin Madnani and Nate Waisbrot

  2. Regular Expressions and Finite State Automata
     • REs: Language for specifying text strings
     • Search for document containing a string
     • Searching for “woodchuck”
     • How much wood would a woodchuck chuck if a woodchuck would chuck wood?
     • Searching for “woodchucks” with an optional final “s”
     • Finite-state automata (FSA) (singular: automaton)

  3. Regular Expressions
     • Basic regular expression patterns
     • Perl-based syntax (slightly different from other notations for regular expressions)
     • Disjunctions /[wW]oodchuck/

  4. Regular Expressions
     • Ranges [A-Z]
     • Negations [^Ss]
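The bracket patterns on the last two slides can be tried out directly. This is a minimal sketch in Python's `re` module, which accepts the same bracket syntax as the Perl notation used here; the sample strings are illustrative assumptions, not taken from the lecture.

```python
import re

# Disjunction: the first letter may be lowercase or uppercase.
woodchuck = re.compile(r"[wW]oodchuck")
# Range: any single uppercase ASCII letter.
upper = re.compile(r"[A-Z]")
# Negation: any single character that is neither "S" nor "s".
not_s = re.compile(r"[^Ss]")

text = "Woodchucks chuck wood"  # assumed sample sentence
first_word = woodchuck.search(text).group()
uppercase_letters = upper.findall(text)
first_non_s = not_s.search("Sssnake").group()

print(first_word, uppercase_letters, first_non_s)
```

Note that the caret only means negation when it is the first character inside the brackets.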

  5. Regular Expressions
     • Optional characters ?, * and + (the * and + operators are named after Stephen Cole Kleene)
     • ? (0 or 1)
       • /colou?r/ → color or colour
     • * (0 or more)
       • /oo*h!/ → oh! or ooh! or ooooh!
     • + (1 or more)
       • /o+h!/ → oh! or ooh! or ooooh!
     • Wild cards .
       • /beg.n/ → begin or began or begun
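A quick way to check the counters and the wildcard is to test each pattern against a few candidate strings. The sketch below uses Python's `re.fullmatch`; the helper `fully_matches` and the candidate lists are hypothetical names and data chosen for illustration.

```python
import re

def fully_matches(pattern, candidates):
    """Return the candidates that the pattern matches from start to end."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

# ? makes the preceding "u" optional.
colour = fully_matches(r"colou?r", ["color", "colour", "colouur"])
# One "o" followed by zero or more further "o"s.
star = fully_matches(r"oo*h!", ["h!", "oh!", "ooooh!"])
# One or more "o"s: the same language as the pattern above.
plus = fully_matches(r"o+h!", ["h!", "oh!", "ooooh!"])
# The dot stands for exactly one character.
dot = fully_matches(r"beg.n", ["begin", "began", "begun", "begn"])

print(colour, star, plus, dot)
```

The second and third patterns accept the same strings, which is why `+` can be seen as shorthand for "one copy, then `*`".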

  6. Regular Expressions
     • Anchors ^ and $
       • /^[A-Z]/ → “Ramallah, Palestine”
       • /^[^A-Z]/ → “¿verdad?” “really?”
       • /\.$/ → “It is over.”
       • /.$/ → ? (any final character, since the unescaped . is a wildcard)
     • Boundaries \b and \B
       • /\bon\b/ → matches “on my way” but not “Monday”
       • /\Bon\b/ → “automaton”
     • Disjunction |
       • /yours|mine/ → “it is either yours or mine”
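The anchor and boundary examples above can be verified with a short script. This is a sketch using Python's `re.search`, which treats `^`, `$`, `\b` and `\B` the same way for these single-line inputs; the variable names are illustrative.

```python
import re

# Anchors: ^ ties the match to the start, $ to the end of the string.
starts_upper = re.search(r"^[A-Z]", "Ramallah, Palestine") is not None
starts_non_upper = re.search(r"^[^A-Z]", "¿verdad?") is not None
ends_with_period = re.search(r"\.$", "It is over.") is not None
last_char = re.search(r".$", "really?").group()

# Boundaries: \b is a word boundary, \B is a position inside a word.
on_as_word = re.search(r"\bon\b", "on my way") is not None
on_in_monday = re.search(r"\bon\b", "Monday") is not None
on_as_suffix = re.search(r"\Bon\b", "automaton").start()

# Disjunction: either alternative may match.
pronouns = re.findall(r"yours|mine", "it is either yours or mine")

print(starts_upper, starts_non_upper, ends_with_period, last_char)
print(on_as_word, on_in_monday, on_as_suffix, pronouns)
```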

  7. Disjunction, Grouping, Precedence
     • Column 1 Column 2 Column 3 … How do we express this?
         /Column [0-9]+ */
         /(Column [0-9]+ *)*/
     • Precedence (highest to lowest)
       • Parenthesis ()
       • Counters * + ? {}
       • Sequences and anchors: the ^my end$
       • Disjunction |
     • REs are greedy!
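The difference between the two "Column" patterns, and the effect of precedence and greediness, shows up when the patterns are run. The following is a sketch in Python's `re`; the greedy example string is an assumption added for illustration.

```python
import re

line = "Column 1 Column 2 Column 3"

# Without parentheses, the star applies only to the space before it,
# so one "Column N" plus any trailing spaces is matched.
single = re.match(r"Column [0-9]+ *", line).group()

# With parentheses, the star repeats the whole group.
repeated = re.match(r"(Column [0-9]+ *)*", line).group()

# Counters bind tighter than sequences: the star covers only the "e".
star_scope = [s for s in ["the", "theee", "thethe"] if re.fullmatch(r"the*", s)]

# Greedy matching: the longest possible run of letters is taken.
greedy = re.search(r"[a-z]*", "once upon a time").group()

print(repr(single), repr(repeated), star_scope, greedy)
```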

  8. Perl Commands
       # Print every input line that contains "the".
       while ($line = <STDIN>) {
           if ($line =~ /the/) {
               print "MATCH: $line";
           }
       }

  9. Writing correct expressions
     Exercise: write a Perl regular expression to match the English article “the”:
       /the/
       /[tT]he/
       /\b[tT]he\b/
       /[^a-zA-Z][tT]he[^a-zA-Z]/
       /(^|[^a-zA-Z])[tT]he[^a-zA-Z]/
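Each pattern in the exercise repairs a false positive or a false negative of the one before it. The sketch below counts matches of all five patterns on one test sentence; the sentence and the helper name `count_matches` are assumptions made for this illustration.

```python
import re

patterns = [
    r"the",                            # misses "The", hits "other"
    r"[tT]he",                         # finds "The", still hits "other"
    r"\b[tT]he\b",                     # whole words only, but "_" counts as a word character
    r"[^a-zA-Z][tT]he[^a-zA-Z]",       # allows "the_", but needs a character before "the"
    r"(^|[^a-zA-Z])[tT]he[^a-zA-Z]",   # also accepts "The" at the start of the line
]

def count_matches(pattern, text):
    """Count non-overlapping matches of pattern in text."""
    return sum(1 for _ in re.finditer(pattern, text))

text = "The other one saw the_ and the cat."  # assumed test sentence
counts = [count_matches(p, text) for p in patterns]
print(counts)
```

On this sentence only the last pattern finds all three intended occurrences ("The", "the_", "the") without also matching inside "other".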

  10. A more complex example
      Exercise: write a regular expression that will match “any PC with more than 500MHz and 32 Gb of disk space for less than $1000”:
        /\$[0-9]+/
        /\$[0-9]+\.[0-9][0-9]/
        /\$[0-9]+(\.[0-9][0-9])?\b/
        /\$[0-9][0-9]?[0-9]?(\.[0-9][0-9])?\b/
        /\b[0-9]+ *([MG]Hz|[Mm]egahertz|[Gg]igahertz)\b/
        /\b[0-9]+ *(Mb|[Mm]egabytes?)\b/
        /\b[0-9]+(\.[0-9]+)? *(Gb|[Gg]igabytes?)\b/
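The price and hardware patterns can be exercised on a sample advertisement. This is a sketch in Python's `re` under two stated assumptions: the dollar sign is escaped so that it is read literally, and no word boundary is placed before it, because `$` is not a word character and `\b` would then only match after a letter or digit. The advertisement text is invented for illustration.

```python
import re

# Any dollar amount, with optional cents.
price = re.compile(r"\$[0-9]+(\.[0-9][0-9])?\b")
# At most three digits before the decimal point, i.e. less than $1000.
cheap = re.compile(r"\$[0-9][0-9]?[0-9]?(\.[0-9][0-9])?\b")
# Processor speed and disk size.
speed = re.compile(r"\b[0-9]+ *([MG]Hz|[Mm]egahertz|[Gg]igahertz)\b")
disk = re.compile(r"\b[0-9]+(\.[0-9]+)? *(Gb|[Gg]igabytes?)\b")

ad = "Fast PC: 600 MHz, 40 Gb disk, only $899.99"  # assumed sample text
found_price = price.search(ad).group()
found_speed = speed.search(ad).group()
found_disk = disk.search(ad).group()
is_cheap = cheap.search(ad) is not None
too_expensive = cheap.search("$1000") is None

print(found_price, found_speed, found_disk, is_cheap, too_expensive)
```

The patterns only locate the numbers; checking "more than 500MHz" or "32 Gb" still requires comparing the extracted values.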

  11. Advanced operators

  12. Substitutions and Memory
      • Substitutions
          s/colour/color/
          s/colour/color/g    (substitute as many times as possible)
          s/colour/color/i    (case-insensitive matching)
      • Memory (\1, \2, etc. refer back to matches)
          s/([0-9]+)/<\1>/
          /the (.*)er they were, the \1er they will be/
          /the (.*)er they (.*), the \1er they \2/
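For comparison, the same substitutions and back-references can be written with Python's `re.sub` and `re.search`. One difference worth noting: `re.sub` replaces every occurrence by default, so the behaviour of Perl's substitution without `g` corresponds to `count=1`. The sample strings are assumptions for illustration.

```python
import re

# s/colour/color/ : first occurrence only.
first_only = re.sub(r"colour", "color", "colour and colour", count=1)
# s/colour/color/g : every occurrence.
every = re.sub(r"colour", "color", "colour and colour")
# s/colour/color/i : ignore case while matching.
ignore_case = re.sub(r"colour", "color", "Colour", flags=re.IGNORECASE)

# \1 in the replacement refers back to the first parenthesised group.
bracketed = re.sub(r"([0-9]+)", r"<\1>", "the 35 boxes")

# \1 inside the pattern forces the same text to appear twice.
memory = r"the (.*)er they were, the \1er they will be"
same_word = re.search(memory, "the bigger they were, the bigger they will be")
different_word = re.search(memory, "the bigger they were, the faster they will be")

print(first_only, every, ignore_case, bracketed)
print(same_word is not None, different_word is not None)
```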

  13. Eliza [Weizenbaum, 1966]
      User: Men are all alike
      ELIZA: IN WHAT WAY
      User: They’re always bugging us about something or other
      ELIZA: CAN YOU THINK OF A SPECIFIC EXAMPLE?
      User: Well, my boyfriend made me come here
      ELIZA: YOUR BOYFRIEND MADE YOU COME HERE
      User: He says I’m depressed much of the time
      ELIZA: I AM SORRY TO HEAR THAT YOU ARE DEPRESSED

  14. Eliza-style regular expressions
      Step 1: replace first person references with second person references
        s/\bI('m| am)\b/YOU ARE/g
        s/\bmy\b/YOUR/g
        s/\bmine\b/YOURS/g
      Step 2: use additional regular expressions to generate replies
        s/.* YOU ARE (depressed|sad) .*/I AM SORRY TO HEAR YOU ARE \1/
        s/.* YOU ARE (depressed|sad) .*/WHY DO YOU THINK YOU ARE \1/
        s/.* all .*/IN WHAT WAY/
        s/.* always .*/CAN YOU THINK OF A SPECIFIC EXAMPLE/
      Step 3: use scores to rank possible transformations
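The first two steps can be combined into a small responder. This is a sketch only, in Python rather than Perl: the function name `respond`, the rule tables, the upper-casing of the reply, and the fallback of echoing the transformed input are all assumptions, and the scoring of step 3 is left out, so the first matching reply rule simply wins.

```python
import re

# Step 1: swap first person references for second person references.
PERSON_SWAPS = [
    (r"\bI('m| am)\b", "YOU ARE"),
    (r"\bmy\b", "YOUR"),
    (r"\bmine\b", "YOURS"),
]

# Step 2: reply templates, tried in order (no ranking, unlike step 3).
REPLIES = [
    (r".* YOU ARE (depressed|sad) .*", r"I AM SORRY TO HEAR YOU ARE \1"),
    (r".* all .*", "IN WHAT WAY"),
    (r".* always .*", "CAN YOU THINK OF A SPECIFIC EXAMPLE"),
]

def respond(utterance):
    """Apply the person swaps, then return the first reply whose pattern matches."""
    text = utterance
    for pattern, replacement in PERSON_SWAPS:
        text = re.sub(pattern, replacement, text)
    for pattern, template in REPLIES:
        match = re.search(pattern, text)
        if match:
            # expand() fills \1 in the template from the matched group.
            return match.expand(template).upper()
    return text.upper()  # assumed fallback: echo the transformed input

print(respond("Men are all alike"))
print(respond("He says I'm depressed much of the time"))
```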

  15. Finite-state Automata
      • Finite-state automata (FSA)
      • Regular languages
      • Regular expressions

  16. Finite-state Automata (Machines)
      • baa! baaa! baaaa! baaaaa! ...
      • /baa+!/
      • [Figure: FSA with states q0, q1, q2, q3, q4, state transitions labeled b, a, a, a, !, and final state q4]

  17. [Figure: input tape a b a ! b; the machine is in q0 and the result is REJECT]

  18. [Figure: input tape b a a a !; the machine passes through q0, q1, q2, q3, q3, q4 and the result is ACCEPT]

  19. Finite-state Automata
      • Q: a finite set of N states
        • Q = {q0, q1, q2, q3, q4}
      • Σ: a finite input alphabet of symbols
        • Σ = {a, b, !}
      • q0: the start state
      • F: the set of final states
        • F = {q4}
      • δ(q,i): transition function
        • Given state q and input symbol i, return new state q'
        • δ(q3,!) → q4
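The five components map naturally onto plain data structures. The sketch below is one possible Python encoding; the slide only states δ(q3,!) → q4 explicitly, so the remaining entries of the table are an assumption read off the expression /baa+!/, and all variable names are illustrative.

```python
# Q: the finite set of states.
STATES = {"q0", "q1", "q2", "q3", "q4"}
# Sigma: the input alphabet.
ALPHABET = {"a", "b", "!"}
# q0: the start state.
START = "q0"
# F: the set of final states.
FINALS = {"q4"}

# delta as a table from (state, symbol) to the next state.
# Only ("q3", "!") -> "q4" is given explicitly; the rest is assumed from /baa+!/.
DELTA = {
    ("q0", "b"): "q1",
    ("q1", "a"): "q2",
    ("q2", "a"): "q3",
    ("q3", "a"): "q3",
    ("q3", "!"): "q4",
}

def delta(state, symbol):
    """Return the next state, or None when no transition is defined."""
    return DELTA.get((state, symbol))

print(delta("q3", "!"), delta("q0", "a"))
```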

  20. State-transition Tables

  21. D-RECOGNIZE
      function D-RECOGNIZE(tape, machine) returns accept or reject
        index ← Beginning of tape
        current-state ← Initial state of machine
        loop
          if End of input has been reached then
            if current-state is an accept state then
              return accept
            else
              return reject
          elsif transition-table[current-state, tape[index]] is empty then
            return reject
          else
            current-state ← transition-table[current-state, tape[index]]
            index ← index + 1
        end
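The pseudocode translates almost line for line into a runnable function. In this Python sketch the machine is passed as three separate arguments (transition table, start state, accept states) instead of one `machine` object, which is an assumption about representation; the table for /baa+!/ is likewise assumed from the earlier slides.

```python
def d_recognize(tape, transition_table, start, accept_states):
    """Deterministic recognition: True for accept, False for reject."""
    index = 0
    current_state = start
    while True:
        if index == len(tape):
            # End of input: accept only in an accept state.
            return current_state in accept_states
        key = (current_state, tape[index])
        if key not in transition_table:
            # Empty table cell: no legal move, so reject.
            return False
        current_state = transition_table[key]
        index += 1

# Assumed transition table for /baa+!/.
TABLE = {
    ("q0", "b"): "q1",
    ("q1", "a"): "q2",
    ("q2", "a"): "q3",
    ("q3", "a"): "q3",
    ("q3", "!"): "q4",
}

print(d_recognize("baaa!", TABLE, "q0", {"q4"}))
print(d_recognize("aba!b", TABLE, "q0", {"q4"}))
```

The loop reads each tape symbol exactly once, so recognition takes time linear in the length of the input.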

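  22. Adding a failing state
      [Figure: the FSA with states q0, q1, q2, q3, q4 and an added fail state qF; every transition on a, b or ! that the original machine leaves undefined leads to qF]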

  23. Adding an “all else” arc
      [Figure: the FSA with states q0, q1, q2, q3, q4, where “all else” arcs lead to the fail state qF]
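Adding a fail state amounts to filling every empty cell of the transition table. The sketch below does this for a table keyed by (state, symbol); the function name `add_fail_state`, the default name "qF", and the /baa+!/ table are assumptions for illustration.

```python
def add_fail_state(table, states, alphabet, fail="qF"):
    """Return a copy of table in which every missing transition goes to fail."""
    complete = dict(table)
    for state in list(states) + [fail]:
        for symbol in alphabet:
            # Keep existing transitions; route everything else to the fail state.
            complete.setdefault((state, symbol), fail)
    return complete

# Assumed transition table for /baa+!/.
TABLE = {
    ("q0", "b"): "q1",
    ("q1", "a"): "q2",
    ("q2", "a"): "q3",
    ("q3", "a"): "q3",
    ("q3", "!"): "q4",
}

complete = add_fail_state(TABLE, ["q0", "q1", "q2", "q3", "q4"], ["a", "b", "!"])
print(len(complete), complete[("q0", "a")], complete[("q3", "!")])
```

Once the table is complete, the fail state is never left again, so a recognizer can run to the end of the tape without checking for empty cells.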

  24. Languages and Automata
      • Can use FSA as a generator as well as a recognizer
      • Formal language L: defined by machine M that both generates and recognizes all and only the strings of that language.
        • L(M) = {baa!, baaa!, baaaa!, …}
      • Regular languages vs. non-regular languages
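To see an FSA acting as a generator, one can walk its arcs and collect the symbols along every path that ends in a final state. The sketch below enumerates L(M) up to a length bound, since the language itself is infinite; the function name `generate`, the bound, and the /baa+!/ table are assumptions.

```python
from collections import deque

def generate(table, start, finals, max_length):
    """List all strings of at most max_length symbols that the FSA accepts."""
    results = []
    queue = deque([(start, "")])
    while queue:
        state, prefix = queue.popleft()
        if state in finals:
            results.append(prefix)
        if len(prefix) < max_length:
            # Follow every outgoing arc of the current state.
            for (source, symbol), target in sorted(table.items()):
                if source == state:
                    queue.append((target, prefix + symbol))
    return results

# Assumed transition table for /baa+!/.
TABLE = {
    ("q0", "b"): "q1",
    ("q1", "a"): "q2",
    ("q2", "a"): "q3",
    ("q3", "a"): "q3",
    ("q3", "!"): "q4",
}

strings = generate(TABLE, "q0", {"q4"}, 6)
print(strings)
```

Because the queue is processed first-in, first-out, shorter strings are produced before longer ones.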

  25. Languages and Automata
      • Deterministic vs. Non-deterministic FSAs
      • Epsilon (ε) transitions

  26. Using NFSAs to accept strings
      • Backup: add markers at choice points, then possibly revisit unexplored arcs at marked choice point.
      • Look-ahead: look ahead in input
      • Parallelism: look at alternatives in parallel

  27. Using NFSAs

  28. More about FSAs
      • Transducers
      • Equivalence of DFSAs and NFSAs
      • Recognition as search: depth-first, breadth-first
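Treating recognition as search means keeping an agenda of unexplored (state, tape position) pairs; taking the newest entry first gives depth-first search, taking the oldest first gives breadth-first search. The sketch below illustrates this in Python under several assumptions: the function name `nd_recognize`, the particular non-deterministic machine for the sheep language (with two arcs labeled a leaving q2), and the omission of ε-transitions.

```python
from collections import deque

# Assumed NFSA for /baa+!/: state q2 has two arcs on "a".
NFSA = {
    ("q0", "b"): {"q1"},
    ("q1", "a"): {"q2"},
    ("q2", "a"): {"q2", "q3"},
    ("q3", "!"): {"q4"},
}

def nd_recognize(tape, table, start, finals, breadth_first=False):
    """Search the space of (state, index) pairs; True if any path accepts."""
    agenda = deque([(start, 0)])
    while agenda:
        # Oldest entry first is breadth-first, newest first is depth-first.
        state, index = agenda.popleft() if breadth_first else agenda.pop()
        if index == len(tape):
            if state in finals:
                return True
            continue  # dead end: back up to another choice point
        for target in sorted(table.get((state, tape[index]), ())):
            agenda.append((target, index + 1))
    return False

print(nd_recognize("baaa!", NFSA, "q0", {"q4"}))
print(nd_recognize("baaa!", NFSA, "q0", {"q4"}, breadth_first=True))
```

Both strategies give the same answer; they differ only in the order in which alternatives are explored.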

  29. Recognition using NFSAs

  30. NFSA Recognition of “baaa!”

  31. Breadth-first Recognition of “baaa!”

  32. Regular languages
      • Regular languages are characterized by FSAs
      • For every NFSA, there is an equivalent DFSA.
      • Regular languages are closed under concatenation, Kleene closure, union.
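The equivalence of NFSAs and DFSAs is usually shown with the subset construction, in which each DFSA state stands for a set of NFSA states. The lecture does not spell the construction out, so the following Python sketch is an illustration under assumptions: the function names, the NFSA for the sheep language, and the lack of ε-transition handling are all choices made here.

```python
def determinize(nfsa, start, finals, alphabet):
    """Subset construction for an NFSA without epsilon transitions."""
    start_set = frozenset([start])
    dfsa, dfsa_finals = {}, set()
    pending, seen = [start_set], {start_set}
    while pending:
        current = pending.pop()
        # A set of NFSA states is final if it contains any final state.
        if current & finals:
            dfsa_finals.add(current)
        for symbol in alphabet:
            target = frozenset(t for s in current for t in nfsa.get((s, symbol), ()))
            if not target:
                continue  # no NFSA state can move on this symbol
            dfsa[(current, symbol)] = target
            if target not in seen:
                seen.add(target)
                pending.append(target)
    return dfsa, start_set, dfsa_finals

def accepts(dfsa, start, finals, tape):
    """Run the deterministic machine on tape."""
    state = start
    for symbol in tape:
        state = dfsa.get((state, symbol))
        if state is None:
            return False
    return state in finals

# Assumed NFSA for /baa+!/.
NFSA = {
    ("q0", "b"): {"q1"},
    ("q1", "a"): {"q2"},
    ("q2", "a"): {"q2", "q3"},
    ("q3", "!"): {"q4"},
}

dfsa, dfsa_start, dfsa_finals = determinize(NFSA, "q0", {"q4"}, ["a", "b", "!"])
print(len(dfsa), accepts(dfsa, dfsa_start, dfsa_finals, "baaa!"))
```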

  33. Concatenation

  34. Kleene Closure

  35. Union

  36. Readings for next time
      • J&M Chapter 3
