LING 388: Language and Computers Sandiway Fong Lecture 6: 9/19
Administrivia • homework 2 acknowledgements mailed out today @ 2:45pm (no wifi: apologies for the delay) • if you didn’t get an email, please resubmit • will be reviewed on Thursday • this Thursday: lab class, meet in SBS 224 • homework 3
Today’s Topic • Regular Expressions (RE)
Regular Expressions • (formally) equivalent to • finite state automata (FSA), and • regular grammars • [Figure: FSA, Regular Expressions, Regular Grammars] • used in • string pattern matching • typically for a single word form • search text: Unix (e)grep, Perl, Microsoft Word • caution: • differences in notation and implementation
Regular Expressions • Regular Expressions shorthand for describing sets of strings • String • sequence of zero or more characters • (typically, unbroken by spaces) • Examples • aaa • john • mary45 • mary 45 • NT$ • (empty string)
Regular Expressions • Regular Expressions • shorthand • stringⁿ • exactly n occurrences of string • n = 0,1,2,3,... • examples • a⁴b³ = aaaabbb • (uv)² = uvuv • ((ab)²(ba)²)² = ababbabaababbaba • Note: • parentheses are used to group sequences of characters (strings)
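The exponent shorthand on this slide can be checked with a short sketch. Python is an assumption here (the slides use no programming language), and the helper name `power` is hypothetical. Python's `*` operator on strings repeats a string n times, which mirrors "exactly n occurrences of string".

```python
def power(s: str, n: int) -> str:
    """Return s concatenated with itself n times (hypothetical helper)."""
    return s * n

# a to the 4th followed by b to the 3rd
a4b3 = power("a", 4) + power("b", 3)
print(a4b3)        # aaaabbb

# (uv) squared: the parentheses group uv into one unit
uv2 = power("uv", 2)
print(uv2)         # uvuv

# ((ab)^2 (ba)^2)^2: inner powers are expanded first, then the outer one
nested = power(power("ab", 2) + power("ba", 2), 2)
print(nested)      # ababbabaababbaba

# n = 0 gives the empty string
print(repr(power("a", 0)))   # ''
```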
Regular Expressions • shorthand for describing sets of strings • string+ • set of one or more occurrences of string • i.e. the set {string¹, string², string³, ...} • Note: set is infinite • examples • a+ = {a, aa, aaa, aaaa, aaaaa, …} • (abc)+ = {abc, abcabc, abcabcabc, …}
Regular Expressions • shorthand for describing sets of strings • string* • set of zero or more occurrences of string • i.e. the set {string⁰, string¹, string², string³, ...} • string⁰ = ε (the empty string) • examples • a* = {ε, a, aa, aaa, aaaa, …} • (abc)* = {ε, abc, abcabc, …} • Note: • a a* = a+ • a {ε, a, aa, aaa, aaaa, …} = {a, aa, aaa, aaaa, aaaaa, …} • Language = a set of strings
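A minimal sketch of the difference between + and *, assuming Python's `re` module as the matcher (the slides do not prescribe one). The helper name `in_language` is hypothetical. `re.fullmatch` tests whether the whole string belongs to the set the expression describes. The sets are infinite, so the last check only compares a a* with a+ on a handful of samples.

```python
import re

def in_language(pattern: str, s: str) -> bool:
    """True if the whole string s is in the set described by pattern."""
    return re.fullmatch(pattern, s) is not None

tests = ["", "a", "aaa"]

# a+ : one or more a's, so the empty string is excluded
plus = [in_language(r"a+", s) for s in tests]
print(plus)    # [False, True, True]

# a* : zero or more a's, so the empty string is included
star = [in_language(r"a*", s) for s in tests]
print(star)    # [True, True, True]

# (abc)* repeats the whole parenthesized group, not just the last character
print(in_language(r"(abc)*", "abcabc"), in_language(r"(abc)*", "abcab"))  # True False

# a a* and a+ agree on every sample string tried here
samples = ["", "a", "aa", "aaaa", "b", "ab"]
same = all(in_language(r"aa*", s) == in_language(r"a+", s) for s in samples)
print(same)    # True
```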
Regular Expressions • Wildcard Characters: match a range of characters • . (period) matches any single character • examples • .+ed = set of all strings of length 3 or greater containing ed and having at least one character preceding it • matches: worked, bed, pre-education • does not match: ed, education • .*fix = set of all strings of length 3 or greater containing fix • matches: prefix, infix, infixed, suffix, fix
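The wildcard examples can be tried out as a sketch, again assuming Python's `re` module rather than any tool named in the lecture. `re.search` looks for the pattern anywhere in the string, which corresponds to the "containing" wording on the slide. The variable names are made up for illustration.

```python
import re

ed_words = ["worked", "bed", "pre-education", "ed", "education"]
# .+ed : at least one character, then ed, anywhere in the string
ed_hits = [w for w in ed_words if re.search(r".+ed", w)]
print(ed_hits)    # ['worked', 'bed', 'pre-education']

fix_words = ["prefix", "infix", "infixed", "suffix", "fix"]
# .*fix : zero or more characters, then fix, so bare "fix" also qualifies
fix_hits = [w for w in fix_words if re.search(r".*fix", w)]
print(fix_hits)   # ['prefix', 'infix', 'infixed', 'suffix', 'fix']
```

The strings ed and education are rejected by .+ed because nothing precedes their only occurrence of ed.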
Regular Expressions • Wildcard Characters: match a range of characters • [characters] (list of matching characters) matches any single character in the list • examples • [sz]ation • organization • organisation • Note: no comma inside the brackets; a comma there would itself count as a matching character • [a-z] • any character in the range lowercase a to z • Note: not uppercase • [0-9] • any digit • ASCII (American Standard Code for Information Interchange) chart: computers only understand numbers
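A small sketch of bracketed character lists and ranges, assuming Python's `re` module (an assumption, not the lecture's tool). It also prints a few character codes to show why a range such as a-z makes sense: ranges follow the numeric codes in the ASCII chart. Inside brackets a comma is an ordinary character, which the last line demonstrates.

```python
import re

# either s or z in the bracketed position
spellings = ["organization", "organisation", "organitation"]
ok = [w for w in spellings if re.fullmatch(r"organi[sz]ation", w)]
print(ok)          # ['organization', 'organisation']

# [a-z] covers lowercase only
lower_q = re.fullmatch(r"[a-z]", "q") is not None
upper_q = re.fullmatch(r"[a-z]", "Q") is not None
print(lower_q, upper_q)    # True False

# [0-9] picks out the digits
digits = re.findall(r"[0-9]", "mary45")
print(digits)      # ['4', '5']

# character codes behind the ranges
print(ord("a"), ord("z"), ord("A"), ord("0"))   # 97 122 65 48

# a comma inside the brackets is matched literally
comma_matches = re.fullmatch(r"[s,z]", ",") is not None
print(comma_matches)   # True
```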
Regular Expressions • One of the most popular programs for searching files and returning lines that match a regular expression pattern is called GREP • name comes from Unix ed command g/re/p • “search globally for lines matching the regular expression, and print them” • [Source: http://en.wikipedia.org/wiki/Grep]
Regular Expressions: grep • Excerpts from the manpage • The caret ^ and the dollar sign $ are metacharacters that respectively match the empty string at the beginning and end of a line. • The symbol \b matches the empty string at the edge of a word. • The symbols \< and \> respectively match the empty string at the beginning and end of a word. • terminology • word • unbroken sequence of digits, underscores and letters
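The anchors can be illustrated with a sketch in Python's `re` module, which is an assumption here since the slide quotes the grep manpage. `^`, `$` and `\b` behave as described. Python's `re` has no `\<` or `\>`, so the last two lines emulate them with lookarounds next to `\b`; that emulation is one possible approach, not grep's own mechanism.

```python
import re

line = "bed and breakfast in bed"

# ^ anchors the match at the beginning of the line, $ at the end
starts = [m.start() for m in re.finditer(r"^bed", line)]
ends = [m.start() for m in re.finditer(r"bed$", line)]
print(starts, ends)        # [0] [21]

# \b matches at the edge of a word; the underscore counts as a word character,
# so bed_1 is a single word and bed inside it is not bounded on the right
whole_words = re.findall(r"\bbed\b", "bed bedding embed bed_1")
print(whole_words)         # ['bed']

# emulating beginning-of-word and end-of-word positions
word_starts = [m.start() for m in re.finditer(r"\b(?=\w)", "ab cd")]
word_ends = [m.start() for m in re.finditer(r"\b(?<=\w)", "ab cd")]
print(word_starts, word_ends)   # [0, 3] [2, 5]
```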
Regular Expressions: grep • Excerpts from the manpage • A regular expression may be followed by one of several repetition operators: • ? The preceding item is optional and matched at most once. • * The preceding item will be matched zero or more times. • + The preceding item will be matched one or more times. • {n} The preceding item is matched exactly n times • {n,} The preceding item is matched n or more times. • {n,m} The preceding item is matched at least n times, but not more than m times.
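As a sketch, each repetition operator can be applied to the single item a and tested against strings of zero to four a's. Python's `re` module is assumed here in place of grep; for these six operators it uses the same notation. The helper name `whole` and the `accepted` table are made up for illustration.

```python
import re

def whole(pattern: str, s: str) -> bool:
    """True if pattern matches the entire string s."""
    return re.fullmatch(pattern, s) is not None

tests = ["", "a", "aa", "aaa", "aaaa"]
patterns = [r"a?", r"a*", r"a+", r"a{2}", r"a{2,}", r"a{2,3}"]

# for each operator, list which test strings it accepts
accepted = {p: [s for s in tests if whole(p, s)] for p in patterns}
for p, hits in accepted.items():
    print(p, hits)
# a?      ['', 'a']
# a*      ['', 'a', 'aa', 'aaa', 'aaaa']
# a+      ['a', 'aa', 'aaa', 'aaaa']
# a{2}    ['aa']
# a{2,}   ['aa', 'aaa', 'aaaa']
# a{2,3}  ['aa', 'aaa']
```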
Regular Expressions: GNU grep • Excerpts from the manpage • concatenation • Two regular expressions may be concatenated; the resulting regular expression matches any string formed by concatenating two substrings that respectively match the concatenated subexpressions. • disjunction • Two regular expressions may be joined by the infix operator |; the resulting regular expression matches any string matching either subexpression.
Regular Expressions: Examples • Regular Expression • gupp(y|ies) • examples • guppy • guppies • Regular Expression • beds? • examples • bed • beds
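Both examples combine concatenation with one other operator: gupp is concatenated with the disjunction (y|ies), and bed with an optional s. A minimal sketch, assuming Python's `re` module and a few extra test words that are not on the slide:

```python
import re

# gupp followed by either y or ies
guppy_hits = [w for w in ["guppy", "guppies", "guppys", "gupp"]
              if re.fullmatch(r"gupp(y|ies)", w)]
print(guppy_hits)   # ['guppy', 'guppies']

# bed followed by an optional s (? applies only to the preceding s)
bed_hits = [w for w in ["bed", "beds", "bedss", "be"]
            if re.fullmatch(r"beds?", w)]
print(bed_hits)     # ['bed', 'beds']
```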
Regular Expressions: Examples • Example • \b99 matches • 99 in “there are 99 bottles …” • but not • 99 in “there are 299 bottles …” • Note: • $ is not a word character, so in $99 there is a word edge between $ and 99, and \b99 will match 99 here • word • unbroken sequence of digits, underscores and letters
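The three cases can be checked with a sketch in Python's `re` module (an assumption; the slide's definition of word comes from the grep manpage, and Python's `\b` treats digits, underscores and letters as word characters in the same way). The sentence containing $99 is a made-up test string.

```python
import re

texts = ["there are 99 bottles", "there are 299 bottles", "it costs $99"]

# position where \b99 matches in each text, or None if it does not match
positions = []
for text in texts:
    m = re.search(r"\b99", text)
    positions.append(m.start() if m else None)
print(positions)   # [10, None, 10]
```

In 299 the position between 2 and 99 has word characters on both sides, so it is not a word edge and \b fails there.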
Regular Expressions: Examples • Example (sheeptalk) • ba! • baa! • baaa! … • regular expression • baa*! • ba+!
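The two sheeptalk expressions describe the same set of strings, and, as stated earlier in the lecture, a regular expression is formally equivalent to a finite state automaton. The sketch below assumes Python and a hand-written transition table; the state numbering and the names `TRANSITIONS`, `ACCEPT` and `fsa_accepts` are hypothetical, chosen only for illustration. It compares both expressions and the automaton on a few sample strings, which illustrates the equivalence without proving it.

```python
import re

# Hypothetical FSA: state 0 -b-> 1 -a-> 2, state 2 loops on a, 2 -!-> 3 (accepting)
TRANSITIONS = {
    (0, "b"): 1,
    (1, "a"): 2,
    (2, "a"): 2,
    (2, "!"): 3,
}
ACCEPT = {3}

def fsa_accepts(s: str) -> bool:
    """Run the automaton over s; reject as soon as no transition applies."""
    state = 0
    for ch in s:
        state = TRANSITIONS.get((state, ch))
        if state is None:
            return False
    return state in ACCEPT

samples = ["ba!", "baa!", "baaa!", "b!", "ba", "aba!", "baa!!"]
results = []
for s in samples:
    r1 = re.fullmatch(r"baa*!", s) is not None
    r2 = re.fullmatch(r"ba+!", s) is not None
    results.append((s, r1, r2, fsa_accepts(s)))
    print(s, r1, r2, fsa_accepts(s))
```

On these samples the three columns agree: only ba!, baa! and baaa! are accepted.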
Regular Expressions: Microsoft Word • terminology: • wildcard search
Next Time • In the lab, we’ll be doing some regular expression exercises using Microsoft Word