Regular Expressions

The matching operator takes two operands. The first is the regular expression (or matching pattern) to search for, which is placed between the slashes of the m// operator. The second operand is the string in which to search, which is bound to the match operator using =~.

The first test uses the matching operator, m//, to search for the string snow inside variable $string. By default, regular expressions are case sensitive. Thus, a similar search for the string SNOW returns false, and the associated print statement does not execute.

In the last two tests, rather than searching for the literal characters “$pattern,” the matching operator interpolates the value of $pattern (the string and) into the search pattern.

    #!/usr/bin/perl
    # Fig. 8.1: fig08_01.pl
    # Simple matching example.

    use strict;
    use warnings;

    my $string = 'It is winter and there is snow on the roof.';
    my $pattern = 'and';

    print "String is: '$string'\n\n";

    print "Found 'snow'\n" if $string =~ m/snow/;

    print "Found 'SNOW'\n" if $string =~ m/SNOW/;

    print "Found 'on the'\n" if $string =~ m/on the/;

    print "Found '$pattern'\n" if $string =~ m/$pattern/;

    print "Found '$pattern there'\n" if $string =~ m/$pattern there/;

Output:

    String is: 'It is winter and there is snow on the roof.'

    Found 'snow'
    Found 'on the'
    Found 'and'
    Found 'and there'
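Two details that Fig. 8.1 does not cover can be sketched as follows: the /i modifier, which makes a match case insensitive, and \Q...\E, which quotes metacharacters in an interpolated value. The helper contains() is a hypothetical name introduced only for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# contains() is a hypothetical helper, not part of Fig. 8.1.
# It returns 1 if $pattern occurs literally in $string and 0 otherwise.
sub contains {
    my ( $string, $pattern, $ignore_case ) = @_;

    # \Q...\E quotes regex metacharacters in the interpolated value,
    # so a pattern such as 'roof.' matches only a literal period.
    if ( $ignore_case ) {
        return $string =~ m/\Q$pattern\E/i ? 1 : 0;
    }

    return $string =~ m/\Q$pattern\E/ ? 1 : 0;
}

my $string = 'It is winter and there is snow on the roof.';

print "SNOW, case sensitive:   ", contains( $string, 'SNOW', 0 ), "\n";
print "SNOW, case insensitive: ", contains( $string, 'SNOW', 1 ), "\n";
```

Without \Q...\E, any metacharacters stored in $pattern would be treated as part of the regular expression, which may or may not be what the caller intends.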

Place the pattern, denoting the string to be replaced, between the first two slashes of the substitution operator, s///. Between the second two slashes, place the replacement string that will be substituted for whatever the first pattern matched.

When no string is bound with =~, the substitution operator works on the string currently stored in $_. The substitution operator can also use delimiters other than /, such as the parentheses in s(world)(planet).

The global modifier, /g, at the end of the regular expression causes the substitution operator to replace every occurrence of the first pattern (planet) with the second pattern (world). With /g, the s/// operator returns the number of substitutions it made; the assignment operator takes this value and assigns it to $matches.

    #!/usr/bin/perl
    # Fig. 8.2: fig08_02.pl
    # Substitution Example

    use strict;
    use warnings;

    my $string = "Hello to the world";

    print "The original string is: \"$string\"\n";
    $string =~ s/world/planet/;
    print "s/world/planet/ changes string: $string \n";

    our $_ = $string;
    print "The original string is: \"$_\"\n";
    s/planet/world/;
    print "s/planet/world/ changes string: $_ \n";

    print "The original string is: \"$_\"\n";
    s(world)(planet);
    # print $_, the string that s(world)(planet) actually changed
    print "s(world)(planet) changes string: $_ \n";

    $string = "This planet is our planet.";
    print "$string\n";
    my $matches = $string =~ s/planet/world/g;
    print "$matches occurrences of planet were changed to world.\n";
    print "The new string is: $string\n";

Output:

    The original string is: "Hello to the world"
    s/world/planet/ changes string: Hello to the planet
    The original string is: "Hello to the planet"
    s/planet/world/ changes string: Hello to the world
    The original string is: "Hello to the world"
    s(world)(planet) changes string: Hello to the planet
    This planet is our planet.
    2 occurrences of planet were changed to world.
    The new string is: This world is our world.
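The return value of s///g in Fig. 8.2 can be wrapped in a small helper. The following is a minimal sketch; replace_all() is a hypothetical name, and it treats its arguments as literal text rather than as patterns.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# replace_all() is a hypothetical helper, not part of Fig. 8.2.
# It returns the changed copy of $text and the number of replacements.
sub replace_all {
    my ( $text, $from, $to ) = @_;

    # s///g returns the number of substitutions made, or a false value
    # (the empty string) when nothing matched; "|| 0" turns that into 0.
    my $count = ( $text =~ s/\Q$from\E/$to/g ) || 0;

    return ( $text, $count );
}

my ( $new, $count ) =
   replace_all( 'This planet is our planet.', 'planet', 'world' );
print "$count replaced: $new\n";
```

Because the sub works on its own copy of the argument, the caller's string is left unchanged.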

In function number1, the pattern \d is called a special character. It matches any digit.

In function number2, brackets ([]) enclose the character class to separate it from the surrounding pattern. Inside the brackets, a dash indicates a range. So, [0-9] matches any digit, like \d.

    #!/usr/bin/perl
    # Fig 8.3: fig08_03.pl
    # Determine if a string has a digit.

    use strict;
    use warnings;

    my $string1 = "hello there";
    my $string2 = "this one has a 2";

    number1( $string1 );
    number1( $string2 );
    number2( $string1 );
    number2( $string2 );

    sub number1
    {
       my $string = shift();

       if ( $string =~ /\d/ ) {
          print "'$string' has a digit.\n";
       }
       else {
          print "'$string' has no digit.\n";
       }
    }

    sub number2
    {
       my $string = shift();

       if ( $string =~ /[0-9]/ ) {
          print "'$string' has a digit.\n";
       }
       else {
          print "'$string' has no digit.\n";
       }
    }

Output:

    'hello there' has no digit.
    'this one has a 2' has a digit.
    'hello there' has no digit.
    'this one has a 2' has a digit.
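For ASCII text, \d and [0-9] behave the same, as Fig. 8.3 shows. They can differ for digits outside ASCII: by default \d also matches characters such as U+0663 (ARABIC-INDIC DIGIT THREE), while [0-9] does not. The sketch below illustrates this; has_digit() and has_ascii_digit() are hypothetical names.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helpers, not part of Fig. 8.3.
sub has_digit       { return $_[0] =~ /\d/    ? 1 : 0 }
sub has_ascii_digit { return $_[0] =~ /[0-9]/ ? 1 : 0 }

my $ascii  = "this one has a 2";
my $arabic = "\x{0663}";    # ARABIC-INDIC DIGIT THREE

# Print only the match results, not the strings themselves, to avoid
# "Wide character" warnings on an output handle without an encoding.
for my $string ( $ascii, $arabic ) {
    printf "\\d: %d  [0-9]: %d\n",
       has_digit( $string ), has_ascii_digit( $string );
}
```

When only the ten ASCII digits should be accepted, for example in form validation, [0-9] is the more conservative choice.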

The condition in function finish searches $string to determine whether it contains one of the strings stop, quit or end, and does not contain not or don't. If both tests succeed, the condition is true and "alright, we're finished." is displayed; otherwise, "ok, lets keep going." is displayed.

    #!/usr/bin/perl
    # Fig. 8.5: fig08_05.pl
    # Using alternation.

    use strict;
    use warnings;

    my $string1 = "i think we should stop";
    my $string2 = "lets continue";
    my $string3 = "i don't want to end";

    finish( $string1 );
    finish( $string2 );
    finish( $string3 );

    sub finish
    {
       my $string = shift();
       print "$string\n";

       if ( $string =~ /stop|quit|end/ &&
            $string !~ /not|don't/ ) {
          print "alright, we're finished.\n";
       }
       else {
          print "ok, lets keep going.\n";
       }
    }

Output:

    i think we should stop
    alright, we're finished.
    lets continue
    ok, lets keep going.
    i don't want to end
    ok, lets keep going.

We want to search for “hello” or “hi,” and then “there.” However, Perl interprets the space in the pattern the same as any other character, and alternation applies to everything on either side of the |. So, “hi there” is considered as one whole string and thus as one option for the alternation operator; the other option is “hello” by itself. That is why all three strings match in the first part of the example.

In the second part of this example, the alternation expression hello|hi is separated from the rest of the pattern with parentheses. This pattern is the one that we wanted to match in the first place.

    #!/usr/bin/perl
    # Fig. 8.6: fig08_06.pl
    # Showing the dangers of using alternation without parentheses.

    use strict;
    use warnings;

    my $string1 = "hello";
    my $string2 = "hello there";
    my $string3 = "hi there";

    print "$string1\n$string2\n$string3\n";

    print "watch this:\n";

    print "1: how are you?\n" if ( $string1 =~ m/hello|hi there/ );
    print "2: how are you?\n" if ( $string2 =~ m/hello|hi there/ );
    print "3: how are you?\n" if ( $string3 =~ m/hello|hi there/ );

    print "now watch this:\n";

    print "1: how are you?\n"
       if ( $string1 =~ m/(hello|hi) there/ );
    print "2: how are you?\n"
       if ( $string2 =~ m/(hello|hi) there/ );
    print "3: how are you?\n"
       if ( $string3 =~ m/(hello|hi) there/ );

Output:

    hello
    hello there
    hi there
    watch this:
    1: how are you?
    2: how are you?
    3: how are you?
    now watch this:
    2: how are you?
    3: how are you?
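When the parentheses in Fig. 8.6 are needed only for grouping, the non-capturing form (?:...) does the same job without storing anything in $1. A minimal sketch, where greets() is a hypothetical name:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# greets() is a hypothetical helper, not part of Fig. 8.6.
# (?:hello|hi) limits the alternation to the two greetings without
# creating a capture group.
sub greets {
    my ( $string ) = @_;
    return $string =~ m/(?:hello|hi) there/ ? 1 : 0;
}

for my $string ( 'hello', 'hello there', 'hi there' ) {
    printf "%-12s %d\n", $string, greets( $string );
}
```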

The asterisk (*) quantifier tells the regular-expression engine to match zero or more occurrences of the preceding pattern. The plus (+) quantifier tells the engine to match one or more occurrences of the preceding pattern. The question mark (?) quantifier tells the engine to match zero or one occurrence of the preceding pattern.

    #!/usr/bin/perl
    # Fig. 8.7: fig08_07.pl
    # Some quantifiers.

    use strict;
    use warnings;

    my $string = "11000";

    change1( $string );
    change2( $string );
    change3( $string );

    $string = "1010001";

    change1( $string );
    change2( $string );
    change3( $string );

    sub change1
    {
       my $string = shift();
       print " Original string: $string\n";
       $string =~ s/1\d*1/22/;
       print "After s/1\\d*1/22/: $string\n\n";
    }

    sub change2
    {
       my $string = shift();
       print " Original string: $string\n";
       $string =~ s/1\d+1/22/;
       print "After s/1\\d+1/22/: $string\n\n";
    }

    sub change3
    {
       my $string = shift();
       print " Original string: $string\n";
       $string =~ s/1\d?1/22/;
       print "After s/1\\d?1/22/: $string\n\n";
    }

Output:

     Original string: 11000
    After s/1\d*1/22/: 22000

     Original string: 11000
    After s/1\d+1/22/: 11000

     Original string: 11000
    After s/1\d?1/22/: 22000

     Original string: 1010001
    After s/1\d*1/22/: 22

     Original string: 1010001
    After s/1\d+1/22/: 22

     Original string: 1010001
    After s/1\d?1/22/: 220001
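The three quantifiers in Fig. 8.7 are shorthand for counted quantifiers of the form {min,max}, which appear later in this chapter as \d{3} and \d{4}: * is {0,}, + is {1,} and ? is {0,1}. The sketch below compares each pair on the two test strings from the figure; change() is a hypothetical helper.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# change() is a hypothetical helper, not part of Fig. 8.7. It applies
# one substitution to a copy of $string and returns the result.
sub change {
    my ( $string, $regex ) = @_;
    $string =~ s/$regex/22/;
    return $string;
}

# Each row pairs a shorthand quantifier with its counted equivalent.
my @pairs = (
    [ '*', qr/1\d*1/, qr/1\d{0,}1/  ],
    [ '+', qr/1\d+1/, qr/1\d{1,}1/  ],
    [ '?', qr/1\d?1/, qr/1\d{0,1}1/ ],
);

for my $string ( '11000', '1010001' ) {
    for my $pair ( @pairs ) {
        my ( $name, $short, $counted ) = @$pair;
        my $same = change( $string, $short ) eq change( $string, $counted );
        printf "%-8s %s gives %-8s counted form agrees: %s\n",
           $string, $name, change( $string, $short ), $same ? 'yes' : 'no';
    }
}
```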

When the quantifier is greedy (no ? after the quantifier), the dot will match as many characters as it possibly can, leaving the here to match at the end of the third sentence. When the quantifier is not greedy (.*?), the dot matches as few characters as possible, leaving the here to match at the end of the second sentence.

    #!/usr/bin/perl
    # Fig. 8.9: fig08_09.pl
    # Greedy and non-greedy quantifiers.

    use strict;
    use warnings;

    my $string1 =
       "Hello there. Nothing here. There could be something here.";
    my $string2 = $string1;

    print "$string1\n";
    $string1 =~ s/N.*here\.//;
    print "$string1\n";
    print "$string2\n";
    $string2 =~ s/N.*?here\.//;
    print "$string2\n\n";

Output:

    Hello there. Nothing here. There could be something here.
    Hello there. 
    Hello there. Nothing here. There could be something here.
    Hello there.  There could be something here.
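To see exactly which text each pattern in Fig. 8.9 consumes, the match can be captured instead of deleted. A minimal sketch, where first_match() is a hypothetical helper:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# first_match() is a hypothetical helper, not part of Fig. 8.9.
# It returns the text matched by $regex, or undef if there is no match.
sub first_match {
    my ( $string, $regex ) = @_;
    return $string =~ /($regex)/ ? $1 : undef;
}

my $string = "Hello there. Nothing here. There could be something here.";

print "greedy:     '", first_match( $string, qr/N.*here\./ ),  "'\n";
print "non-greedy: '", first_match( $string, qr/N.*?here\./ ), "'\n";
```

The greedy pattern captures everything from Nothing to the end of the string, while the non-greedy pattern stops at the first here followed by a period.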

The look-behind assertion takes the form (?<=value1)value2, where we check whether value1 occurs right before value2. The text matched by value1 is not part of the match, so it is not replaced.

The first look-behind assertion, (?<=i ), tests whether the string contains “i ” right before “be.” If so, “be” is replaced with “am.” The second and third substitutions work the same way for “we” and “he.”

    #!/usr/bin/perl
    # Fig. 8.10: fig08_10.pl
    # Testing the look behind assertion.

    use strict;
    use warnings;

    my $string1 = "i be hungry.";
    my $string2 = "we be here.";
    my $string3 = "he be where?";

    conjugate( $string1 );
    conjugate( $string2 );
    conjugate( $string3 );

    sub conjugate
    {
       my $string = shift;
       print "$string\n";
       $string =~ s/(?<=i )be/am/;
       $string =~ s/(?<=we )be/are/;
       $string =~ s/(?<=he )be/is/;
       print "$string\n";
    }

Output:

    i be hungry.
    i am hungry.
    we be here.
    we are here.
    he be where?
    he is where?
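The patterns in Fig. 8.10 look only at the characters immediately around “be,” so they would also rewrite a phrase such as “the bees.” One possible refinement, offered here as a sketch rather than as the figure's method, adds \b word-boundary assertions; this variant of conjugate() returns the string instead of printing it.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# A variant of conjugate() from Fig. 8.10 (an assumption for
# illustration). \b requires a word boundary, so the pronoun and the
# verb must each be whole words.
sub conjugate {
    my ( $string ) = @_;
    $string =~ s/(?<=\bi )be\b/am/;
    $string =~ s/(?<=\bwe )be\b/are/;
    $string =~ s/(?<=\bhe )be\b/is/;
    return $string;
}

for my $string ( "i be hungry.", "we be here.", "he be where?", "the bees" ) {
    print "$string\n", conjugate( $string ), "\n";
}
```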

The expression \4 is called a backreference. In regular expressions, parentheses capture bits of a string that can be referenced later in the same pattern with a \ followed by a number that indicates the set of parentheses that captured the value. The first pattern below looks for four word characters followed by the same four in reverse order (an even-length palindrome); the second allows a single middle character (an odd-length palindrome). The \W* between the pieces skips spaces and punctuation.

    #!/usr/bin/perl
    # Fig. 8.11: fig08_11.pl
    # Using backreferencing to find palindromes.

    use strict;
    use warnings;

    my $string1 = "madam im adam";
    my $string2 = "the motto means something";
    my $string3 = "no palindrome here";

    findPalindrome( $string1 );
    findPalindrome( $string2 );
    findPalindrome( $string3 );

    sub findPalindrome
    {
       my $string = shift();

       if ( $string =~
               /(\w)\W*(\w)\W*(\w)\W*(\w)\W*\4\W*\3\W*\2\W*\1/
            or $string =~
               /(\w)\W*(\w)\W*(\w)\W*(\w)\W*\3\W*\2\W*\1/ ) {
          print "$string - ",
                "has a palindrome of at least 7 characters.\n";
       }
       else {
          print "$string - has no long palindromes.\n";
       }
    }

Output:

    madam im adam - has a palindrome of at least 7 characters.
    the motto means something - has a palindrome of at least 7 characters.
    no palindrome here - has no long palindromes.
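A smaller use of the backreference mechanism from Fig. 8.11 is finding a word that is accidentally typed twice in a row. This is a sketch for illustration; find_doubled_word() is a hypothetical name.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# find_doubled_word() is a hypothetical helper, not part of Fig. 8.11.
# (\w+) captures a word and \1 requires the same text to appear again
# after some whitespace. The \b assertions keep the match on whole words.
sub find_doubled_word {
    my ( $string ) = @_;
    return $string =~ /\b(\w+)\s+\1\b/ ? $1 : undef;
}

for my $string ( "this is is a test", "no repeats here" ) {
    my $word = find_doubled_word( $string );
    print "$string - ",
          defined $word ? "doubled word: $word\n" : "no doubled word\n";
}
```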

This regular expression searches for the places where a letter might need to be capitalized (the information that gets captured in $1), finds the letter to be capitalized (stored in $3) and capitalizes it (using \u).

The character class [.!?] tells the regular-expression engine to search for a period, an exclamation point or a question mark. The character class is alternated with \A, which matches the beginning of a string. Next, we have the character class [a-z], which matches any lowercase letter. So, the whole pattern tells the engine to find a sentence-ending punctuation mark or the start of the string, followed by any amount of whitespace, followed by a lowercase letter.

The first part will be captured in $1 and the lowercase letter in $3. The punctuation mark (or the empty match of \A) is captured in $2, but because that group is nested inside the first one, it is also part of $1.

    #!/usr/bin/perl
    # Fig. 8.12: fig08_12.pl
    # Capitalize all sentences.

    use strict;
    use warnings;

    my $string1 = "lets see. there should be two things capitalized.";
    my $string2 = "This string is fine.";
    my $string3 = "this could use some work. what needs to be fixed?";
    my $string4 = "yes! another string to be capitalized.";
    my $string5 = "all done? yes.";

    capitalize( $string1 );
    capitalize( $string2 );
    capitalize( $string3 );
    capitalize( $string4 );
    capitalize( $string5 );

    sub capitalize
    {
       my $string = shift();
       print "$string\n";
       $string =~ s/(([.!?]|\A)\s*)([a-z])/$1\u$3/g;
       print "$string\n";
    }

Output:

    lets see. there should be two things capitalized.
    Lets see. There should be two things capitalized.
    This string is fine.
    This string is fine.
    this could use some work. what needs to be fixed?
    This could use some work. What needs to be fixed?
    yes! another string to be capitalized.
    Yes! Another string to be capitalized.
    all done? yes.
    All done? Yes.
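Because $2 in Fig. 8.12 is never used, the inner group can be made non-capturing with (?:...), which leaves only two numbered captures. The following sketch is a variant of the figure's capitalize() that returns the result instead of printing it.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# A variant of capitalize() from Fig. 8.12 (an assumption for
# illustration). With (?:...) the punctuation-or-start alternation is
# grouped but not captured, so the letter is in $2 rather than $3.
sub capitalize {
    my ( $string ) = @_;
    $string =~ s/((?:[.!?]|\A)\s*)([a-z])/$1\u$2/g;
    return $string;
}

print capitalize( "lets see. there should be two things capitalized." ), "\n";
print capitalize( "all done? yes." ), "\n";
```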

The /x modifier allows the programmer to add comments and extra whitespace into a pattern in the program’s source code. Because whitespace in the pattern is ignored under /x, the space between talking and dog is written as the octal escape \040.

The search pattern is split over multiple lines. This format allows a programmer to use comments in the middle of a regular expression to explain complicated matching patterns.

    #!/usr/bin/perl
    # Fig. 8.13: fig08_13.pl
    # Using the x modifier.

    use strict;
    use warnings;

    my $string = "hello there. i am looking for a talking dog.";

    print "$string\n";

    $string =~ s/        # start the pattern
       talking           # match talking
       \040              # here is a space
       dog\.             # and then dog and a period
       /what?/x;         # replace it with 'what?'
    print "$string\n";

Output:

    hello there. i am looking for a talking dog.
    hello there. i am looking for a what?
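The reason Fig. 8.13 spells the space as \040 can be checked directly: under /x an unescaped space in the pattern is ignored, while an escaped space or a space inside a bracketed character class still matches. A minimal sketch, where matches() is a hypothetical helper:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# matches() is a hypothetical helper, not part of Fig. 8.13.
sub matches {
    my ( $string, $regex ) = @_;
    return $string =~ $regex ? 1 : 0;
}

my $string = "i am looking for a talking dog.";

# The plain space is ignored under /x, so this looks for "talkingdog".
print "plain space:   ", matches( $string, qr/talking dog/x ),    "\n";

# Three ways to match a real space under /x.
print "octal escape:  ", matches( $string, qr/talking\040dog/x ), "\n";
print "escaped space: ", matches( $string, qr/talking\ dog/x ),   "\n";
print "bracketed:     ", matches( $string, qr/talking[ ]dog/x ),  "\n";
```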

The regular expression in findScalar looks for a dollar sign (which needs to be escaped in the pattern) followed by one or more word characters. The regular expression in findArray looks for an @ followed by one or more word characters.

In scalar context, the /g modifier makes each match start where the previous successful match on the same string ended. Each time the loop condition executes, the matching operator therefore finds the next substring that matches, and the loop ends when no further match is found.

    #!/usr/bin/perl
    # Fig 8.14: fig08_14.pl
    # Search perl code for variables.

    use strict;
    use warnings;

    my $string = '$one $two @three $four @five $six $seven @eight';

    findScalar( $string );
    findArray( $string );

    sub findScalar
    {
       my $string = shift();

       while ( $string =~ m/\$(\w+)/g ) {
          print "scalar name: $1\n";
       }

       print "\n";
    }

    sub findArray
    {
       my $string = shift();

       while ( $string =~ m/@(\w+)/g ) {
          print "array name: $1\n";
       }

       print "\n";
    }

Output:

    scalar name: one
    scalar name: two
    scalar name: four
    scalar name: six
    scalar name: seven

    array name: three
    array name: five
    array name: eight
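Fig. 8.14 uses m//g in scalar context inside a while loop. In list context, the same operator returns all captured substrings at once, which can be sketched as follows; scalar_names() and array_names() are hypothetical names.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helpers, not part of Fig. 8.14. Assigning the result of
# m//g to an array puts the match in list context, so it returns every
# captured name in one step.
sub scalar_names {
    my ( $code ) = @_;
    my @names = $code =~ m/\$(\w+)/g;
    return @names;
}

sub array_names {
    my ( $code ) = @_;
    my @names = $code =~ m/\@(\w+)/g;
    return @names;
}

my $string = '$one $two @three $four @five $six $seven @eight';

print "scalars: ", join( ', ', scalar_names( $string ) ), "\n";
print "arrays:  ", join( ', ', array_names( $string ) ),  "\n";
```

The while-loop form is the better fit when each match needs its own processing step; the list form is convenient when only the collected names are needed.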

The form element specifies the form’s method as POST, and the action is to run the Perl script fig08_16.pl, a CGI script that processes the information sent from the form to the Web server. The last two input elements specify a submit and a reset button for the form.

    <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">

    <html>
    <head>
    <title>form page</title>
    </head>

    <body>
    <p>here's my test form</p>
    <form method = "post" action = "/cgi-bin/fig08_16.pl">

    <p>First name:
    <input name = "firstName" type = "text" size = "20"></p>

    <p>Last name:
    <input name = "lastName" type = "text" size = "20"></p>

    <p>Phone number:
    <input name = "phone" type = "text" size = "20"></p>

    <p>Date (MM/DD/YY):
    <input name = "date" type = "text" size = "20"></p>

    <p>Time (HH:MM:SS):
    <input name = "time" type = "text" size = "20"></p>

    <input type = "submit" value = "submit">
    <input type = "reset" value = "reset">

    </form>
    </body>

    </html>

The parameters from the Web page are stored into variables that are used later in the code to formulate the part of the Web page that will be returned to the client. The calls to header and start_html begin the document that will be returned to the client.

The condition in each of the first two if structures is true if the entire string consists of one or more word characters (the match must span from the beginning to the end of the string, because of the ^ and $ assertions). In the print statements, \L puts the string that follows in lowercase letters and \u makes its first letter uppercase.

In the phone-number pattern, we use ?: so that the value in a set of parentheses is not captured. The ?: does not apply to the nested parentheses around \d{3}, which still capture. The first part of the alternation captures three digits and stores them in $1 if the three digits are enclosed in parentheses. Otherwise, an attempt is made to match the other case, where the first three digits are captured and stored in $2. Each -? checks for a dash, which the user may or may not enter. The next two groups capture the next three digits and the last four digits of the phone number, storing them in $3 and $4, respectively. The print statement formats and outputs the area code and phone number.

The date pattern first checks for a one followed by one of the digits 0, 1 or 2 (i.e., months 10, 11 and 12: October, November and December). If one of these numbers is found, its value is stored in $1. Otherwise, we check the other half of the alternation, which matches an optional 0 followed by a digit from 1 through 9, denoting the first nine months of the year. This result is also stored in $1. The second group checks for the day of the month and stores it in $2. The ? allows (at most) one leading digit, this digit being 0, 1 or 2, which may not occur at all; the alternative 3[01] covers days 30 and 31. Note that this group also accepts 0 and 00. For the year, we store two digits (\d\d) in $3. The print statement formats and outputs the date. The last if structure works similarly to parse the time and format it for output.

    #!/usr/bin/perl
    # Fig. 8.16: fig08_16.pl
    # Form processing CGI program.

    use strict;
    use warnings;
    use CGI ':standard';

    my $firstName = param( "firstName" );
    my $lastName = param( "lastName" );
    my $phone = param( "phone" );
    my $date = param( "date" );
    my $time = param( "time" );

    print header();
    print start_html( -title => "form page" );

    if ( $firstName =~ /^\w+$/ ) {
       print "<p>Hello there \L\u$firstName.</p>";
    }

    if ( $lastName =~ /^\w+$/ ) {
       print "<p>Hello there Mr./Ms. \L\u$lastName.</p>";
    }

    if ( $phone =~ /^        # beginning of line
            (?:1-?)?         # optional 1-
            (?:              # start alternate
               \(            # left paren
               (\d{3})       # capture three digits
               \)            # right paren
               |             # or
               (\d{3})       # capture three digits
            )                # end alternate
            -?               # optional dash
            (\d{3})          # capture three more digits
            -?               # optional dash
            (\d{4})          # capture the final four digits
            $/x )            # end of line, with x modifier
    {
       print "<p>Your phone number is ", $1 || $2 , " - $3 - $4.</p>";
    }

    if ( $date =~ m#^(1[012]|0?[1-9])/([012]?\d|3[01])/(\d\d)$# ) {
       print "<p>The date is $1 / $2 / $3.</p>";
    }

    if ( $time =~ m#^(1[012]|[1-9]):([0-5]\d):([0-5]\d)$# ) {
       print "<p>The time is $1 : $2 : $3.</p>";
    }

    print end_html();
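The phone-number pattern from Fig. 8.16 can be tried outside a CGI environment by moving it into a small function. This is a sketch under that assumption; format_phone() and $PHONE are hypothetical names, and the pattern itself is the one from the figure.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# $PHONE and format_phone() are hypothetical names, not part of
# Fig. 8.16. Comments inside the pattern avoid the dollar and at signs,
# which would be interpolated even inside a comment.
my $PHONE = qr/^
    (?:1-?)?            # optional 1-
    (?:                 # start alternate
        \( (\d{3}) \)   # area code in parentheses, group 1
      |                 # or
        (\d{3})         # bare area code, group 2
    )                   # end alternate
    -?                  # optional dash
    (\d{3})             # next three digits, group 3
    -?                  # optional dash
    (\d{4})             # final four digits, group 4
$/x;

# Returns the formatted number, or undef if the input does not match.
sub format_phone {
    my ( $phone ) = @_;
    return unless $phone =~ $PHONE;

    # Only one of the two area-code groups takes part in a match.
    my $area = defined $1 ? $1 : $2;
    return "$area - $3 - $4";
}

for my $phone ( '(555)123-4567', '1-555-123-4567', '5551234567', '123' ) {
    my $formatted = format_phone( $phone );
    print "$phone: ", defined $formatted ? $formatted : 'not recognized', "\n";
}
```

Testing with defined rather than with || avoids treating a captured value as false, although a three-digit string is never false in practice.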
