This resource provides an overview of key concepts in computational linguistics, emphasizing the importance of regular expressions for string manipulation. It covers foundational topics such as phonetics, morphology, syntax, semantics, and pragmatics, before delving into the practical applications of regular expressions using Perl. Students will learn to match patterns, use character classes, apply anchors, and perform substitutions effectively. This guide is essential for efficiently managing and searching text data in linguistic research and applications.
CSE 467/567 Computational Linguistics
Carl Alphonce, cse-467-alphonce@cse.buffalo.edu
Computer Science & Engineering, University at Buffalo
Levels of processing
• phonetics/phonology – sounds
• morphology – word structure
• syntax – sentence structure
• semantics – meaning
• pragmatics – goals of language use
• discourse – utterances in context
Words: the building blocks of sentences
Words have internal structure
• readable = read + able
• readability = read + able + ity
• the structure of words can be described using a regular grammar
Chomsky hierarchy
Problem
• I often need to find an e-mail, but I have thousands of e-mails in my various folders.
• Suppose I want to find an e-mail about geese. The e-mail may mention “geese” or “goose”; also, if the word appears at the start of a sentence, its initial letter will be capitalized.
• The search therefore needs to match “goose”, “geese”, “Goose” or “Geese”.
Regular expressions (in Perl)
“a regular expression is an algebraic notation for characterizing a set of strings” [p. 22]
Regular expressions are commonly used to specify search strings. For example, the UNIX utility program grep lets the user specify a pattern to search for in files.
Sequences of characters
A sequence of characters is matched by writing it between slashes: /…/
Examples:
/a/ matches the character ‘a’
/fred/ matches the string ‘fred’
Note: /fred/ does not match the string ‘Fred’! In other words, patterns are case-sensitive.
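The case-sensitivity point can be checked with a small sketch. It uses Python's `re` module rather than Perl (an assumption of this example; the pattern itself is written the same way in both), and the test strings are made up for illustration.

```python
import re

# Python counterpart of the Perl pattern /fred/ (Python is an assumption here).
pattern = re.compile(r"fred")

# search() looks for the pattern anywhere in the string.
lower_hit = pattern.search("alfred") is not None
upper_hit = pattern.search("Fred") is not None

print(lower_hit)  # True: 'alfred' contains 'fred'
print(upper_hit)  # False: capital 'F' does not match lowercase 'f'
```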
Character disjunction (character classes)
Square brackets are used to indicate disjunction of characters.
Examples:
/[Ff]/ matches either ‘f’ or ‘F’
/[Ff]red/ matches either ‘fred’ or ‘Fred’
This form of disjunction applies only at the character level. A set of characters in square brackets is sometimes referred to as a character class.
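A minimal sketch of a character class in action, written with Python's `re` module as a stand-in for Perl (the bracket syntax is the same). The word list is hypothetical.

```python
import re

# [Ff] accepts exactly one character: either 'F' or 'f'.
pattern = re.compile(r"[Ff]red")

candidates = ["fred", "Fred", "FRED", "bred"]
matches = [w for w in candidates if pattern.search(w)]

print(matches)  # ['fred', 'Fred']
```

‘FRED’ is rejected because the class only covers the first letter; the remaining ‘red’ is still case-sensitive.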
Ranges
Sometimes it is useful to specify “any digit” or “any letter”.
“Any digit” can be written as /[0123456789]/, since any of the ten digits satisfies the pattern.
An alternative is to use a special range notation: /[0-9]/
Any letter can be specified as /[A-Za-z]/
Range notation does not extend the power of regular expressions, but gives us a convenient way to express them.
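The claim that range notation is only a shorthand can be illustrated by running both spellings on the same text. This sketch assumes Python's `re` module in place of Perl; the sample text is arbitrary.

```python
import re

text = "CSE 467/567"

# The explicit class and the range form describe the same set of characters.
digits_long = re.findall(r"[0123456789]", text)
digits_range = re.findall(r"[0-9]", text)
letters = re.findall(r"[A-Za-z]", text)

print(digits_range)                 # ['4', '6', '7', '5', '6', '7']
print(digits_long == digits_range)  # True
print(letters)                      # ['C', 'S', 'E']
```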
Complementing character classes
To search for a character that is not in a character class, place a caret (^) immediately after the opening square bracket.
Examples:
/[^a]/ matches any single character except ‘a’
/[^0-9]/ matches any single character except a digit
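A short sketch of complemented classes, using Python's `re` module as an assumed substitute for Perl. The last line shows that the caret negates only when it is the first character inside the brackets; elsewhere it is an ordinary character.

```python
import re

# Every character of 'banana' that is not 'a'.
non_a = re.findall(r"[^a]", "banana")

# Every character of 'a1b2' that is not a digit.
non_digits = re.findall(r"[^0-9]", "a1b2")

# Caret in a non-initial position: matches 'a' or a literal '^'.
literal_caret = re.findall(r"[a^]", "a^b")

print(non_a)          # ['b', 'n', 'n']
print(non_digits)     # ['a', 'b']
print(literal_caret)  # ['a', '^']
```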
Matching 0 or 1 occurrence
The ‘?’ matches zero or one occurrence of the preceding expression.
Examples:
/a?/ matches ‘a’ or ‘’ (nothing)
/cats?/ matches ‘cat’ or ‘cats’
Note that the “preceding expression”, in these examples, is a single letter. We’ll see how to form longer expressions later.
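As a rough check of the optional marker, the sketch below tests whole words against the pattern. It assumes Python's `re` module instead of Perl, and uses `fullmatch` so that the entire word must fit the pattern; the word list is made up.

```python
import re

words = ["cat", "cats", "catss", "ca"]

# 's?' allows zero or one 's' after 'cat'; fullmatch requires the whole word to fit.
accepted = [w for w in words if re.fullmatch(r"cats?", w)]

# /a?/ also accepts the empty string, since zero occurrences are allowed.
empty_ok = re.fullmatch(r"a?", "") is not None

print(accepted)  # ['cat', 'cats']
print(empty_ok)  # True
```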
The Kleene star and plus
The Kleene star (*) matches zero or more occurrences of the preceding expression.
Examples:
/a*/ matches ‘’, ‘a’, ‘aa’, ‘aaa’, etc.
/[ab]*/ matches ‘’, ‘a’, ‘b’, ‘aa’, ‘ab’, ‘ba’, ‘bb’, etc.
The plus (+) matches one or more occurrences of the preceding expression.
The plus is a convenience rather than a necessity: /[ab]+/ is equivalent to /[ab][ab]*/
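The equivalence of /[ab]+/ and /[ab][ab]*/ can be spot-checked on a few strings. This is only a sketch on hand-picked samples, not a proof, and it assumes Python's `re` module in place of Perl.

```python
import re

star = re.compile(r"[ab]*")
plus = re.compile(r"[ab]+")
expanded = re.compile(r"[ab][ab]*")

samples = ["", "a", "ab", "bba", "abc"]

# For each sample, record whether the whole string fits each pattern.
star_ok = [star.fullmatch(s) is not None for s in samples]
plus_ok = [plus.fullmatch(s) is not None for s in samples]
expanded_ok = [expanded.fullmatch(s) is not None for s in samples]

print(star_ok)                 # [True, True, True, True, False]
print(plus_ok)                 # [False, True, True, True, False]
print(plus_ok == expanded_ok)  # True
```

The only sample where star and plus disagree is the empty string, which star accepts and plus rejects.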
Wildcard
The period (.) matches any single character except the newline (\n).
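A small sketch of the wildcard and its newline exception, assuming Python's `re` module as a stand-in for Perl. The `re.DOTALL` flag shown at the end is Python's counterpart to Perl's /s modifier, which lets the period match a newline as well.

```python
import re

text = "cat cot c\nt c t"

# '.' stands for any one character, including a space, but not a newline.
default_hits = re.findall(r"c.t", text)

# With DOTALL, the newline between 'c' and 't' is also accepted.
dotall_hits = re.findall(r"c.t", "c\nt", re.DOTALL)

print(default_hits)  # ['cat', 'cot', 'c t']
print(dotall_hits)   # ['c\nt']
```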
Anchors
Anchors are used to restrict a match to a particular position within a string.
^ anchors to the start of a string
$ anchors to the end of a string
/[Ff]red/ matches both ‘Fred’ and ‘Fred is home’
/^[Ff]red$/ matches ‘Fred’ but not ‘Fred is home’
\b anchors to a word boundary
\B anchors to a non-boundary
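The effect of each anchor can be seen in a brief sketch. It assumes Python's `re` module rather than Perl (these four anchors are written identically), and the ‘cat’ examples are hypothetical additions used to illustrate \b and \B.

```python
import re

def hits(pattern, text):
    """Return True if the pattern matches somewhere in the text."""
    return re.search(pattern, text) is not None

unanchored = hits(r"[Ff]red", "Fred is home")      # matches the leading 'Fred'
anchored_long = hits(r"^[Ff]red$", "Fred is home")  # fails: text continues after 'Fred'
anchored_exact = hits(r"^[Ff]red$", "Fred")         # matches: the whole string is 'Fred'

word_cat = hits(r"\bcat\b", "the cat sat")      # 'cat' stands alone as a word
inner_cat = hits(r"\bcat\b", "concatenate")     # 'cat' is buried inside a word
non_boundary = hits(r"\Bcat\B", "concatenate")  # \B requires letters on both sides

print(unanchored, anchored_long, anchored_exact)  # True False True
print(word_cat, inner_cat, non_boundary)          # True False True
```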
Conjunction
Two regular expressions are conjoined by juxtaposition (placing the expressions side by side).
Examples:
/a/ matches ‘a’
/m/ matches ‘m’
/am/ matches ‘am’ but not ‘a’ or ‘m’ alone
Disjunction
We have already seen disjunction of characters using the square bracket notation.
General disjunction is expressed using the vertical bar (|), also called the pipe symbol.
This form of disjunction allows us to match any one of the alternative patterns, not just single characters as with the [ ] form.
Example: /goose|geese/ matches ‘goose’ or ‘geese’
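A minimal sketch of general disjunction, assuming Python's `re` module as a substitute for Perl (the pipe symbol works the same way). The sentence being searched is invented for illustration.

```python
import re

text = "One goose, two geese, three ducks"

# Each alternative is a whole pattern, not just a single character.
found = re.findall(r"goose|geese", text)

print(found)  # ['goose', 'geese']
```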
Grouping
• Parentheses, ‘(’ and ‘)’, are used to group subpatterns of a larger pattern.
• Ex: /[Gg](ee|oo)se/ matches ‘goose’, ‘geese’, ‘Goose’ and ‘Geese’
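Where the parentheses go matters, because the vertical bar binds more loosely than juxtaposition. The sketch below compares /[Gg](ee|oo)se/ with /[Gg](ee)|(oo)se/, which reads as “[Gg]ee, or else oose”. It assumes Python's `re` module in place of Perl, and the word list is hypothetical.

```python
import re

# Alternation confined to the vowels: g/G, then 'ee' or 'oo', then 'se'.
grouped = re.compile(r"[Gg](ee|oo)se")

# Alternation spanning the whole pattern: '[Gg]ee' or 'oose'.
split = re.compile(r"[Gg](ee)|(oo)se")

words = ["goose", "geese", "Goose", "Geese", "loose", "Gee"]

grouped_hits = [w for w in words if grouped.fullmatch(w)]
split_hits = [w for w in words if split.fullmatch(w)]

print(grouped_hits)  # ['goose', 'geese', 'Goose', 'Geese']
print(split_hits)    # ['Gee']
```

Only the first pattern covers the four spellings needed for the e-mail search problem.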
Replacement
In addition to matching, we can do replacements when a match is found.
Example: To replace the British spelling ‘colour’ with the American spelling ‘color’, we can write:
s/colour/color/
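A sketch of the same substitution using Python's `re.sub` (Python is an assumption here; the slide's form is Perl's s/// operator). One difference worth noting: Perl's s/colour/color/ replaces only the first match unless the /g modifier is added, whereas `re.sub` replaces every match unless `count` is given.

```python
import re

text = "The colour of colourful things"

# Like Perl's s/colour/color/ : replace only the first occurrence.
first_only = re.sub(r"colour", "color", text, count=1)

# Like Perl's s/colour/color/g : replace every occurrence.
all_matches = re.sub(r"colour", "color", text)

print(first_only)   # The color of colourful things
print(all_matches)  # The color of colorful things
```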
Registers – saving matches
• To save the text matched by part of a pattern for later reuse, Perl provides registers
• Each parenthesized part of a pattern saves its match in a register; registers are named \#, where # is the number of the register
• Ex.
DE DO DO DO DE DA DA DA
IS ALL I WANT TO SAY TO YOU
/(D[AEO].)*/ will match the first line
/(D[AEO])(.D[AEO])\2\2\s\1(.D[AEO])\3\3/ matches it more specifically
This pattern also matches strings like DA DE DE DE DA DO DO DO
\s matches a whitespace character
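The register pattern can be tried out in a short sketch. It assumes Python's `re` module rather than Perl (numbered backreferences such as \1 are written the same way), and it writes the pattern with no literal spaces between its pieces, since a space inside a pattern must match a space in the text and the separators here are already consumed by ‘.’ and \s. The third test string is a hypothetical counterexample.

```python
import re

# Group 1: first syllable. Group 2: separator plus second syllable, repeated twice via \2.
# After one whitespace, \1 repeats the first syllable; group 3 is then repeated twice via \3.
pattern = re.compile(r"(D[AEO])(.D[AEO])\2\2\s\1(.D[AEO])\3\3")

lyric = pattern.fullmatch("DE DO DO DO DE DA DA DA")
variant = pattern.fullmatch("DA DE DE DE DA DO DO DO")
broken = pattern.fullmatch("DE DO DO DA DE DA DA DA")

print(lyric.groups())       # ('DE', ' DO', ' DA')
print(variant is not None)  # True
print(broken is None)       # True: the third syllable does not repeat ' DO'
```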
For more information
• Perl regular expression tutorial (perlretut)
• http://perldoc.perl.org/perlretut.html
• Perl regular expression reference page (perlre)
• http://perldoc.perl.org/perlre.html