Mastering Regular Expressions: Learning Effective Pattern Matching Techniques

Regular Expression (1)
Learning Objectives:
• To understand the concept of regular expressions
• To learn commonly used operations involving regular expressions and pattern matching
• To learn the special cases that occur in regular expression pattern matching

Simple Uses of Regular Expressions
• In Perl, we can make Shakespeare a regular expression by enclosing it in slashes:
    if(/Shakespeare/){
        print $_;
    }
• What is tested in the if-statement? Answer: $_.
• Can you write an even shorter statement using &&?

Simple Uses of Regular Expressions
    if(/Shakespeare/){
        print $_;
    }
• The previous example tests only one line, and prints the line if it contains Shakespeare.
• To work on all lines, add a loop:
    while(<>){
        if(/Shakespeare/){
            print;
        }
    }

Simple Uses of Regular Expressions
• What if we are not sure how to spell Shakespeare?
• The first part, Shak, is certainly easy, and there must be an r near the end.
• How can we express this idea?
  grep:
    grep "Shak.*r" movie > result
  Perl:
    while(<>){
        if(/Shak.*r/){
            print;
        }
    }
• .* means "zero or more of any character" (any character except the newline).
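The wildcard idea above can be tried without an input file. The following is a minimal sketch: the helper name looks_like_shakespeare and the sample lines are made up for illustration and are not part of the original slides.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return 1 if the line contains "Shak", then zero or more characters,
# then an "r" somewhere later on the same line; otherwise return 0.
sub looks_like_shakespeare {
    my ($line) = @_;
    return $line =~ /Shak.*r/ ? 1 : 0;
}

# Hypothetical sample lines: three spellings that match and one that does not.
my @lines = ("Shakespeare in Love", "Shakspear", "Shaksper", "Shake it off");
for my $line (@lines) {
    print "$line\n" if looks_like_shakespeare($line);
}
```

Because .* also accepts zero characters, the pattern is deliberately loose: any line with "Shak" followed later by an "r" is accepted, so it may match more than intended.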

Single-Character Patterns
• The dot "." matches any single character except the newline (\n).
• For example, the pattern /a./ matches any two-character sequence that starts with a, except "a\n".
• Use \. if you really want to match a period.
    $ cat test
    hi
    hi bob.
    $ cat sub3
    #!/usr/local/bin/perl5 -w
    while(<>){
        if(/\./){
            print;
        }
    }
    $ sub3 test
    hi bob.
    $

Single-Character Groups (1)
• If you want to match one character out of a group of characters, use [ ]:
    /[abcde]/
  This matches a string containing any one of the first 5 lowercase letters, while:
    /[aeiouAEIOU]/
  matches any of the 5 vowels in either upper or lower case.

Single-Character Groups (2)
• If you want ] in the group, put a backslash before it, or put it as the first character in the list:
    /[abcde]]/    # matches one of a-e followed by a literal ]
    /[abcde\]]/   # okay
    /[]abcde]/    # also okay
• Use - for ranges of characters (like a through z):
    /[0123456789]/   # any single digit
    /[0-9]/          # same
• If you want - in the list, put a backslash before it, or put it at the beginning or end:
    /[X-Z]/    # matches X, Y, Z
    /[X\-Z]/   # matches X, -, Z
    /[XZ-]/    # matches X, Z, -
    /[-XZ]/    # matches -, X, Z

Single-Character Groups (3)
• More range examples:
    /[0-9\-]/        # match 0-9, or minus
    /[0-9a-z]/       # match any digit or lowercase letter
    /[a-zA-Z0-9_]/   # match any letter, digit, underscore
• There is also a negated character group, which starts with a ^ immediately after the left bracket. It matches any single character not in the list.
    /[^0123456789]/   # match any single non-digit
    /[^0-9]/          # same
    /[^aeiouAEIOU]/   # match any single non-vowel
    /[^\^]/           # match any single character except ^
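To see how the different groups behave on the same input, the sketch below counts how many characters of a string fall into each group. It relies on two features not covered in these slides, qr// (a precompiled pattern) and the /g flag (find every match); the helper name count_matches and the sample string are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Count how many single characters in $text match the character group $group.
# $group is a precompiled pattern (qr//); the /g flag finds every match,
# and assigning through an empty list yields the number of matches.
sub count_matches {
    my ($text, $group) = @_;
    my $count = () = $text =~ /$group/g;
    return $count;
}

my $text = "Room 42-B";   # made-up sample string
print "digits:          ", count_matches($text, qr/[0-9]/), "\n";
print "digits or minus: ", count_matches($text, qr/[0-9\-]/), "\n";
print "vowels:          ", count_matches($text, qr/[aeiouAEIOU]/), "\n";
print "non-digits:      ", count_matches($text, qr/[^0-9]/), "\n";
```

Note that the digit count and the non-digit count add up to the length of the string, since a negated group matches exactly the characters the plain group rejects.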

Single-Character Groups (4)
• For convenience, some common character groups are predefined:
    Predefined        Group           Negated          Negated Group
    \d (a digit)      [0-9]           \D (non-digit)   [^0-9]
    \w (word char)    [a-zA-Z0-9_]    \W (non-word)    [^a-zA-Z0-9_]
    \s (space char)   [ \t\n\r\f]     \S (non-space)   [^ \t\n\r\f]
• \d matches any digit
• \w matches any letter, digit, or underscore
• \s matches any space, tab, newline, carriage return, or form feed
• You can use these predefined groups inside other groups:
    /[\da-fA-F]/   # match any hexadecimal digit
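As a small illustration of a predefined group used inside a bracketed group, the sketch below checks whether a whole string consists of hexadecimal digits. The helper name is_hex_string and the candidate strings are hypothetical; the anchors ^ and $ and the + quantifier ("one or more") are used here ahead of their introduction.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return 1 if $s consists of one or more hexadecimal digits, otherwise 0.
# \d supplies the digits; a-f and A-F supply the letters.
sub is_hex_string {
    my ($s) = @_;
    return $s =~ /^[\da-fA-F]+$/ ? 1 : 0;
}

# Made-up candidates: two valid hex strings and two invalid ones.
for my $candidate ("1aF", "DEADBEEF", "xyz", "12 34") {
    printf "%-10s %s\n", $candidate, is_hex_string($candidate) ? "hex" : "not hex";
}
```

Without the surrounding brackets, /\da-fA-F/ would instead look for a digit followed by the literal text "a-fA-F".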

Split (1)
• The split function allows you to break a string into fields.
• split takes a regular expression and a string, and breaks up the string wherever the pattern occurs.
    $ cat split1
    #!/usr/local/bin/perl5 -w
    $line = "Bill Shakespeare in love with Bill Gates";
    @fields = split(/ /,$line);   # split $line using space as delimiter
    print "$fields[0] $fields[3] $fields[6]\n";
    $ split1
    Bill love Gates
    $

Split (2)
• You can use $_ with split.
• With no arguments, split works on $_ and splits on whitespace.
    $ cat split2
    #!/usr/local/bin/perl5 -w
    $_ = "Bill Shakespeare in love with Bill Gates";
    @fields = split;   # split $_ using whitespace (default) as delimiter
    print "$fields[0] $fields[3] $fields[6]\n";
    $ split2
    Bill love Gates
    $
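The two forms of split are not quite interchangeable. The sketch below, with hypothetical helper names and a made-up input containing two consecutive spaces, shows the difference: the pattern / / splits at every single space, while the special whitespace form split(' ', ...), which is what split uses by default, treats a run of whitespace as one delimiter.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Split at every single space; consecutive spaces produce empty fields.
sub split_single_space {
    my ($line) = @_;
    return split(/ /, $line);
}

# Split on runs of whitespace, ignoring leading whitespace
# (the same behavior as split with no arguments on $_).
sub split_whitespace {
    my ($line) = @_;
    return split(' ', $line);
}

my $line = "Bill  Gates";   # note the two spaces
my @by_space      = split_single_space($line);
my @by_whitespace = split_whitespace($line);
print scalar(@by_space), " fields with / /\n";
print scalar(@by_whitespace), " fields with ' '\n";
```

For input whose fields may be separated by several spaces or tabs, the whitespace form is usually the safer choice.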

Pattern Memory (1)
• How would we match a pattern that starts and ends with the same letter or word?
• For this, we need to remember part of the match.
• Use ( ) around any part of a pattern to put the text it matches into memory (the parentheses do not change what the pattern matches).
• To recall memory, include a backslash followed by an integer.
    /Bill(.)Gates\1/

Pattern Memory (2)
• Example:
    /Bill(.)Gates\1/
  This example matches a string containing Bill, followed by any single non-newline character, followed by Gates, followed by that same single character.
• So, it matches:
    Bill!Gates!
    Bill-Gates-
  but not:
    Bill?Gates!
    Bill-Gates_
  (Note that /Bill.Gates./ would match all four.)

Pattern Memory (3)
• More examples:
    /a(.)b(.)c\2d\1/
• This example matches a string containing a, then a single character (#1), then b, then another single character (#2), then c, then character #2 again, then d, then character #1 again.
• So it matches: a-b!c!d-

Pattern Memory (4)
• The remembered part can be more than a single character.
• For example:
    /a(.*)b\1c/
• This example matches an a, followed by any number of characters (even zero), followed by b, followed by the same sequence of characters, followed by c.
• So it matches aBillbBillc and abc, but not aBillbBillGatesc.
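The matching and non-matching strings listed on the Pattern Memory slides can be checked directly. This is a minimal sketch; the helper names same_separator and repeated_middle are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# True (1) if the character after Bill is repeated after Gates.
sub same_separator {
    my ($s) = @_;
    return $s =~ /Bill(.)Gates\1/ ? 1 : 0;
}

# True (1) if the text between a and b appears again between b and c.
sub repeated_middle {
    my ($s) = @_;
    return $s =~ /a(.*)b\1c/ ? 1 : 0;
}

for my $s ("Bill!Gates!", "Bill-Gates-", "Bill?Gates!", "Bill-Gates_") {
    printf "%-14s %s\n", $s, same_separator($s) ? "match" : "no match";
}
for my $s ("aBillbBillc", "abc", "aBillbBillGatesc") {
    printf "%-18s %s\n", $s, repeated_middle($s) ? "match" : "no match";
}
```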

Or
• How do we pick from a set of alternatives when the alternatives are longer than one character?
• The following example matches either Gates or Clinton or Shakespeare:
    /Gates|Clinton|Shakespeare/
• For single-character alternatives, /[abc]/ is the same as /a|b|c/.
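Alternation combines naturally with parentheses: wrapping the alternatives in ( ) remembers which one matched, and after a successful match Perl makes that text available in $1. The sketch below assumes a hypothetical helper which_name and made-up input lines.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the first of the three names found in $line, or undef if none occurs.
# The parentheses capture whichever alternative matched into $1.
sub which_name {
    my ($line) = @_;
    return $1 if $line =~ /(Gates|Clinton|Shakespeare)/;
    return;
}

for my $line ("Bill Clinton", "Gates met Clinton", "Bill Smith") {
    my $name = which_name($line);
    print "$line: ", (defined $name ? $name : "no match"), "\n";
}
```

When several alternatives occur in the same line, the match that starts earliest in the string wins, regardless of the order of the alternatives in the pattern.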

Anchoring Patterns
• Anchors require that the pattern be at the beginning or end of the line.
• ^ matches the beginning of the line (only if ^ is the first character of the pattern):
    /^Bill/    # match lines that begin with Bill
    /^Gates/   # match lines that begin with Gates
    /Bill\^/   # match lines containing Bill^ somewhere
    /\^/       # match lines containing ^
• $ matches the end of the line (only if $ is the last character of the pattern):
    /Bill$/    # match lines that end with Bill
    /Gates$/   # match lines that end with Gates
    /$Bill/    # match the contents of the scalar variable $Bill
    /\$/       # match lines containing $
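The anchor examples can be checked with a short script. This is a sketch with hypothetical helper names; it also shows one detail worth knowing when reading lines from a file: $ matches at the very end of the string and also just before a trailing newline, so lines read with <> do not need to be chomped first.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# True (1) if the line begins with Bill.
sub begins_with_bill {
    my ($line) = @_;
    return $line =~ /^Bill/ ? 1 : 0;
}

# True (1) if the line ends with Gates (a trailing newline is allowed).
sub ends_with_gates {
    my ($line) = @_;
    return $line =~ /Gates$/ ? 1 : 0;
}

# True (1) if the line contains a literal caret after Bill.
sub has_bill_caret {
    my ($line) = @_;
    return $line =~ /Bill\^/ ? 1 : 0;
}

for my $line ("Bill Gates\n", "Mr Bill\n", "Gates, Bill\n") {
    my $shown = $line;
    chomp $shown;   # remove the newline only for display
    printf "%-12s begins=%d ends=%d\n",
        $shown, begins_with_bill($line), ends_with_gates($line);
}
```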

Using =~ (1)
• What if you want to match a different variable than $_?
• Answer: Use =~.
• Examples:
    $name = "Bill Shakespeare";
    $name =~ /^Bill/;     # true
    $name =~ /(.)\1/;     # also true (matches ll)
    if($name =~ /(.)\1/){
        print "$name\n";
    }

Using =~ (2)
• An example using =~ to match <STDIN>:
    $ cat match1
    #!/usr/local/bin/perl5 -w
    print "Quit (y/n)? ";
    if(<STDIN> =~ /^[yY]/){
        print "Quitting\n";
        exit;
    }
    print "Continuing\n";
    $ match1
    Quit (y/n)? y
    Quitting
    $

Ignoring Case
• In the previous example, we used [yY] to match either upper or lower case.
• Perl has an "ignore case" option for pattern matching: /somepattern/i
    $ cat match1a
    #!/usr/local/bin/perl5 -w
    print "Quit (y/n)? ";
    if(<STDIN> =~ /^y/i){
        print "Quitting\n";
        exit;
    }
    print "Continuing\n";
    $ match1a
    Quit (y/n)? Y
    Quitting
    $

Slash and Backslash
• If your pattern has a slash character (/), you must precede each one with a backslash (\):
    $ cat slash1
    #!/usr/local/bin/perl5 -w
    print "Enter path: ";
    $path = <STDIN>;
    if($path =~ /^\/usr\/local\/bin/){
        print "Path is /usr/local/bin\n";
    }
    $ slash1
    Enter path: /usr/local/bin
    Path is /usr/local/bin
    $

Different Pattern Delimiters
• If your pattern has lots of slash characters (/), you can also use a different pattern delimiter with the form: m#somepattern#
• The # can be replaced by any non-alphanumeric, non-whitespace character.
    $ cat slash1a
    #!/usr/local/bin/perl5 -w
    print "Enter path: ";
    $path = <STDIN>;
    if($path =~ m#^/usr/local/bin#){
    # if($path =~ m@^/usr/local/bin@){   # also works
        print "Path is /usr/local/bin\n";
    }
    $ slash1a
    Enter path: /usr/local/bin
    Path is /usr/local/bin
    $
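The =~ operator, the /i option, and alternative delimiters can be exercised together without reading from the terminal. In this sketch the helper names is_yes and under_usr_local_bin and the sample inputs are made up for illustration; the two path subs show that the escaped-slash form and the m# # form accept the same strings.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# True (1) if the answer starts with y or Y.
sub is_yes {
    my ($answer) = @_;
    return $answer =~ /^y/i ? 1 : 0;
}

# True (1) if the path starts with /usr/local/bin, using escaped slashes.
sub under_usr_local_bin_escaped {
    my ($path) = @_;
    return $path =~ /^\/usr\/local\/bin/ ? 1 : 0;
}

# The same test written with # as the pattern delimiter.
sub under_usr_local_bin {
    my ($path) = @_;
    return $path =~ m#^/usr/local/bin# ? 1 : 0;
}

for my $answer ("y\n", "Yes\n", "no\n") {
    my $shown = $answer;
    chomp $shown;   # remove the newline only for display
    print "$shown: ", (is_yes($answer) ? "Quitting" : "Continuing"), "\n";
}
for my $path ("/usr/local/bin/perl", "/usr/bin/perl") {
    print "$path: ", (under_usr_local_bin($path) ? "yes" : "no"), "\n";
}
```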

Special Read-Only Variables (1)
• After a successful pattern match, the variables $1, $2, $3,… are set to the same values as \1, \2, \3,…
• You can use $1, $2, $3,… later in your program.
    $ cat read1
    #!/usr/local/bin/perl5 -w
    $_ = "Bill Shakespeare in Love";
    /(\w+)\W+(\w+)/;   # match first two words
    # $1 is now "Bill" and $2 is now "Shakespeare"
    print "The first name of $2 is $1\n";
    $ read1
    The first name of Shakespeare is Bill

Special Read-Only Variables (2)
• You can also get the values of $1, $2, $3,… directly by placing the match in a list context:
    $ cat read2
    #!/usr/local/bin/perl5 -w
    $_ = "Bill Shakespeare in Love";
    ($first, $last) = /(\w+)\W+(\w+)/;
    print "The first name of $last is $first\n";
    $ read2
    The first name of Shakespeare is Bill

Special Read-Only Variables (3)
• Other read-only variables:
• $& is the part of the string that matched the pattern
• $` is the part of the string before the match
• $' is the part of the string after the match
    $ cat read3
    #!/usr/local/bin/perl5 -w
    $_ = "Bill Shakespeare in Love";
    / in /;
    print "Before: $`\n";
    print "Match: $&\n";
    print "After: $'\n";
    $ read3
    Before: Bill Shakespeare
    Match:  in 
    After: Love
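One caution when using these variables: they keep the values from the last successful match, so it is safest to read them only after checking that the match succeeded. The sketch below wraps both techniques in subs; the helper names first_two_words and split_around are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the first two words of $text using a match in list context,
# or an empty list if there are fewer than two words.
sub first_two_words {
    my ($text) = @_;
    my ($first, $second) = $text =~ /(\w+)\W+(\w+)/;
    return defined $second ? ($first, $second) : ();
}

# Return (before, match, after) around the first " in ",
# or an empty list if " in " does not occur.
sub split_around {
    my ($text) = @_;
    if ($text =~ / in /) {
        # Copy the special variables right away; a later match overwrites them.
        my ($before, $match, $after) = ($`, $&, $');
        return ($before, $match, $after);
    }
    return;
}

my ($first, $last) = first_two_words("Bill Shakespeare in Love");
print "The first name of $last is $first\n";

my ($before, $match, $after) = split_around("Bill Shakespeare in Love");
print "Before: [$before]\nMatch: [$match]\nAfter: [$after]\n";
```

The brackets in the output make the spaces around the matched " in " visible.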

Repeat {n}
• /(fred){5,15}/
  Match from five to fifteen repetitions of "fred"
• /a{5,}/
  Match five or more repetitions of "a"
• /\w{8}/
  Match exactly 8 word characters
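A repeat count limits how much one match consumes, not how long the whole string may be: without anchors, a longer string still matches because part of it satisfies the pattern. The sketch below illustrates this with hypothetical helper names; the x operator is used only to build test strings such as "fred" repeated 20 times.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# True (1) if $s contains 5 to 15 consecutive copies of "fred" somewhere.
sub contains_freds {
    my ($s) = @_;
    return $s =~ /(fred){5,15}/ ? 1 : 0;
}

# True (1) only if the whole string is 5 to 15 copies of "fred".
sub is_only_freds {
    my ($s) = @_;
    return $s =~ /^(fred){5,15}$/ ? 1 : 0;
}

# True (1) only if the whole string is exactly 8 word characters.
sub is_eight_word_chars {
    my ($s) = @_;
    return $s =~ /^\w{8}$/ ? 1 : 0;
}

for my $n (4, 5, 20) {
    my $s = "fred" x $n;
    printf "%2d copies: contains=%d whole=%d\n",
        $n, contains_freds($s), is_only_freds($s);
}
print "pass_w0rd: ", is_eight_word_chars("pass_w0rd"), "\n";
print "passw0rd: ", is_eight_word_chars("passw0rd"), "\n";
```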
