Chapter 11: Regular Expressions and Matching The match operator has the following form.

gari

Presentation Transcript


  1. Chapter 11: Regular Expressions and Matching • The match operator has the following form. • m/pattern/ • A pattern can be an ordinary string or a generalized string containing metacharacters. • The binding operator, =~, is used to "bind" the matching operator onto a string. • "yesterday" =~ m/yes/ • Here the pattern is an ordinary three-character string. • The entire expression evaluates to a Boolean value, true (1) in this case, since the pattern yes is a substring of "yesterday".

  2. Since matching expressions result in Boolean values, they are usually used in a conditional. • $str="yesterday"; • if($str =~ m/yes/) { • print "The pattern yes was found in $str.\n"; • } • For demonstration, we will usually only show the matching expression. • Example: • $str="yesterday"; • $str =~ m/ester/ #true • $str =~ m/Ester/ #false • $str =~ m/yet/ #false

  3. Some notes: • The !~ operator is the negated form of the binding operator. It returns true if the match does not find the pattern in the string. We will more often use =~. • if($response !~ m/yes/){ • print "yes was not found in your response.\n"; • } • The matching operator can be simplified syntactically: when the delimiters are slashes, the leading m is optional. For example, the following two expressions are equivalent. • $str =~ m/yes/ • $str =~ /yes/

  4. The match operator can be bound not only onto string literals and variables, but also onto expressions that evaluate to strings. Because =~ binds more tightly than the concatenation operator, such an expression must be parenthesized. • $str1="wilde"; • $str2="beest"; • ($str1.$str2) =~ /debe/ #true • Example: A server-side "platform sniff" done by matching against the HTTP_USER_AGENT environment variable. • This example features the first pattern which is not merely a sequence of characters. The match • $info =~ /(Unix|Linux)/ • is true if either Unix or Linux is a substring of whatever is stored in the $info variable. • See source file os.cgi.
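
The following is a minimal sketch of binding a match onto an expression, not the contents of os.cgi. The value of $info is a made-up stand-in for a user-agent string. Since =~ binds more tightly than the . operator, the sketch parenthesizes the concatenation so that the match applies to the joined string.

```perl
use strict;
use warnings;

my $str1 = "wilde";
my $str2 = "beest";

# Parentheses make the match apply to the concatenated string "wildebeest".
my $joined_matches = (($str1 . $str2) =~ /debe/) ? 1 : 0;

# Alternation: true if either Unix or Linux is a substring.
# $info is a hypothetical stand-in for a user-agent string.
my $info = "Mozilla/5.0 (X11; Linux x86_64)";
my $is_unix_like = ($info =~ /(Unix|Linux)/) ? 1 : 0;

print "joined: $joined_matches, unix-like: $is_unix_like\n";
```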

  5. A regular expression is a set of rules which define a generalized string. • For simplicity we call regular expressions patterns. • The syntax for a pattern is /pattern/ . • A pattern is like a double quoted string in that variables are interpolated and escape sequences are interpreted. • But a pattern is much more powerful than a string and can contain wildcards, character classes, and quantifiers, just to name a few features which make patterns (regular expressions) much more general than ordinary strings.

  6. Metacharacters • Characters which have special meaning in patterns are called metacharacters. • [ ] ( ) { } | \ + ? . * ^ $ • To be used literally inside a pattern, they must be escaped with a backslash, which removes their special meaning. • if($sentence =~ m/\?/){ • print "Your sentence seems to be a question.\n"; • }

  7. Normal characters • These include ordinary ASCII characters which are not metacharacters. • Normal characters include letters, numbers, the underscore, and a few other characters such as @ % & = ; : , which are not reserved metacharacters in patterns. (Since patterns interpolate variables, an @ followed by a name is still read as an array.) • Normal characters need not be escaped when testing for matches. • if($sentence =~ m/;/){ • print "Your sentence seems to contain an independent clause.\n"; • }

  8. Escaped characters • Escaping in patterns works just like escaping characters in ordinary strings. • For example, \* stands for one *, and \( stands for one (. • The following tests whether $str contains the three-character string "(b)". • $str =~ /\(b\)/ • Example values for $str which would yield true and false values in the above match. • true: "(b)" , "(a)(b)(c)" • false: "(ab)" , "( b )"
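
As a quick check of the escaped-parenthesis pattern, the sketch below runs it against the four sample values from the slide. The %result hash is only an illustrative helper.

```perl
use strict;
use warnings;

my @samples = ("(b)", "(a)(b)(c)", "(ab)", "( b )");

my %result;
for my $s (@samples) {
    # \( and \) match literal parentheses instead of opening a group.
    $result{$s} = ($s =~ /\(b\)/) ? "true" : "false";
    print "$s => $result{$s}\n";
}
```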

  9. Escape sequences that stand for one character • Some escaped characters stand literally for only one character, like escaped metacharacters. • Some stand for one invisible character, such as a whitespace character. Just like with ordinary strings, \n stands for one newline character, and \t stands for one tab character. • The following tests whether $str contains two consecutive newline characters. • $str =~ /\n\n/ • true: "a\n\nb" , "a\n\n\n\tb" • false: "\na\n" , "a\n \nb"

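  10. Escape sequences that stand for a class of characters • These represent only one character in a pattern, but that one character can be any character in the specified group. • \d stands for one digit, \w for one word character (a letter, digit, or underscore), and \s for one whitespace character. • The capitalized forms \D, \W, and \S are their opposites.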

  11. The following tests whether $str contains a four-character sequence that looks like a year in the 1900s. • $str =~ /19\d\d/ • true: "1921" , "34192176" • false: "191a" , "34192-76" • The following tests whether $str contains a non-whitespace character (i.e. it is not the empty string or merely a sequence of whitespace characters). • $str =~ /\S/ • true: "x" , "()" • false: "" , " ", "\n"
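
The two class-escape tests above can be tried directly. This is a small sketch using the slide's own sample values; the hash names are illustrative only.

```perl
use strict;
use warnings;

# \d matches exactly one digit, so /19\d\d/ needs "19" plus two more digits.
my %year_like;
for my $s ("1921", "34192176", "191a", "34192-76") {
    $year_like{$s} = ($s =~ /19\d\d/) ? 1 : 0;
    print "$s => $year_like{$s}\n";
}

# \S matches one non-whitespace character, so empty or blank strings fail.
my %has_content;
for my $s ("x", "()", "", " ", "\n") {
    $has_content{$s} = ($s =~ /\S/) ? 1 : 0;
}
```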

  12. Wildcard • A period . stands for any one character, except a newline. • The following tests whether $str contains a three character substring that is c and t with anything in between, except a newline. • $str =~ /c.t/ • true: "cat" , "arc&tangent" • false: "ct" , "cart" , "arc\ntangent"

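  13. Escape sequences that match locations • These sequences do not actually represent a character. Rather, they represent locations within the string being matched. • \A stands for the beginning of the string, \z for the end of the string, and \b for a word boundary.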

  14. The following tests whether $str begins with T • $str =~ /\AT/ • true: "Tom" , "The beest" • false: "tom" , "AT&T" • The following tests whether $str begins with The . • $str =~ /\AThe/ • true: "Thelma" , "The beest" • false: "That" , "the beest" • The following tests whether $str contains the word cat but not as part of any bigger word. • $str =~ /\bcat\b/ • true: "cat" , "my cat" • false: "cats" , "concatenate"

  15. Note: When matching locations, the escape sequence does not "use up" a character. That is, an expression such as • $str =~ /ing\z/ • only tests for the three character string ing at the end of $str.
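
A short sketch of the three location escapes used above (\A, \b, and \z). The strings "matching" and "rings" are added here purely for illustration.

```perl
use strict;
use warnings;

# \A anchors the pattern to the beginning of the string.
my $thelma = ("Thelma" =~ /\AThe/) ? 1 : 0;
my $that   = ("That"   =~ /\AThe/) ? 1 : 0;

# \b matches at a word boundary without consuming a character.
my $my_cat      = ("my cat"      =~ /\bcat\b/) ? 1 : 0;
my $concatenate = ("concatenate" =~ /\bcat\b/) ? 1 : 0;

# \z anchors the pattern to the end of the string.
my $matching = ("matching" =~ /ing\z/) ? 1 : 0;
my $rings    = ("rings"    =~ /ing\z/) ? 1 : 0;

print "Thelma: $thelma, That: $that\n";
print "my cat: $my_cat, concatenate: $concatenate\n";
print "matching: $matching, rings: $rings\n";
```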

  16. Character Classes • Square brackets [] in a pattern define a class. • The whole class matches only one character, and only if the character belongs to the class. • The following tests whether $str contains a three-character string beginning with one of r, b, or c, and followed by at. • $str =~ /[rbc]at/ • true: "rat" , "bat" , "cat" , "concatenate" , "battery" • false: "mat" , "at"

  17. The escape sequences \d, \w, and \s and their opposites can be used inside a class. • A dash (-) can be used between two characters to denote a range of characters. • For example, the class • [\dA-F] • stands for one character that is either a numeric digit or one of the upper case letters A-F. It is equivalent to [0123456789ABCDEF]. • The following tests whether $str contains a two-digit hexadecimal number as formatted in query string encoding. • $str =~ /%[\dA-F][\dA-F]/ • true: "%0A" , "data=Hi,%0A%0Dmy name is..." • false: "%0a" , "%3"
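
To illustrate the class with a range, the sketch below compares the short class [\dA-F] with the spelled-out class on the sample strings above. The hash names are illustrative, and the comparison assumes plain ASCII input.

```perl
use strict;
use warnings;

# Single quotes keep the sample strings literal.
my @samples = ('%0A', 'data=Hi,%0A%0Dmy name is...', '%0a', '%3');

my %short_form;   # result using the range-based class
my %long_form;    # result using the spelled-out class
for my $s (@samples) {
    $short_form{$s} = ($s =~ /%[\dA-F][\dA-F]/) ? 1 : 0;
    $long_form{$s}  = ($s =~ /%[0123456789ABCDEF][0123456789ABCDEF]/) ? 1 : 0;
    print "$s => $short_form{$s}\n";
}
```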

  18. Alternatives • The | character serves like an or by creating alternatives. • The following tests whether $str contains any of the three patterns. • $str =~ /cat|dog|ferret/ • true: "cat" , "dog" , "ferret" , "my cat" , "cats and dogs" , "doggedly" • false: "hamster" , "dodge the cart" • The alternatives are tested from left to right. • The alternatives themselves can be more complicated patterns.

  19. Grouping and Capturing • Parentheses () are used for grouping in patterns. • The following tests whether $str contains one of the three alternatives, then a space, then food. • $str =~ /(cat|dog|ferret) food/ • true: "cat food" , "dog food" , "ferret food" , "I like cat food and dog food" • false: "cats food" , "rat food" , "dogfood" • With several alternatives, it is often desirable to capture which of the alternatives caused the successful match. That is, a mere truth value indicating a match doesn't indicate which match actually occurred.

  20. The special, built-in variables $1, $2, $3, … automatically capture the text matched by each parenthesized group. • $str = "Do you have ferret food?"; • $str =~ /(cat|dog|ferret) food/ • Here, $1 is assigned the value "ferret" since that alternative provides the match. The pattern has only one group, so $2, $3, … are undefined. • If more than one match is present, only the left-most match is recorded, since matching stops at the first position in the string where the pattern succeeds. • $str = "Do you have dog food or ferret food?" ; • $str =~ /(cat|dog|ferret) food/ • Here, "dog" is assigned to $1, but $2 is undefined even though there is a second match.

  21. Multiple groups can populate more of the special variables. • $str = "Purina cat chow"; • $str =~ /(cat|dog|ferret) (food|chow)/ • $1 is assigned the value "cat" and $2 is assigned the value "chow". Captured matches are assigned into the special variables in order, starting from the left-most group. • Groups can be collected into a larger group. • $str = "Purina cat chow"; • $str =~ /((cat|dog|ferret) (food|chow))/ • $1 is assigned "cat chow" , $2 is assigned "cat" , and $3 is assigned "chow". Groups are numbered by the position of their opening parenthesis, counting from the left.
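
The nested-group example can be sketched as follows. The variables $whole, $animal, and $meal are illustrative names for copies of $1, $2, and $3.

```perl
use strict;
use warnings;

my $str = "Purina cat chow";
my ($whole, $animal, $meal) = ("", "", "");

if ($str =~ /((cat|dog|ferret) (food|chow))/) {
    # Groups are numbered by their opening parenthesis, left to right.
    ($whole, $animal, $meal) = ($1, $2, $3);
}

print "$whole / $animal / $meal\n";   # prints "cat chow / cat / chow"
```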

  22. Note: After a successful match, the special capturing variables can be read like ordinary variables in the code that follows. • if ($data =~ /(cat|dog|ferret) (food|chow)/ ) { • print "The match <b>$1 $2</b> was found."; • } • So if $data is "Purina cat chow is now", then the print statement would generate: The match <b>cat chow</b> was found. • The variables keep the captured values until another successful match replaces them or until the enclosing block ends (Perl scopes them dynamically to that block rather than making them true globals). A failed match leaves them unchanged.

  23. Other special variables • There is some degree of "capturing" even when grouping is not used. • $` (prematch - the part before the match), • $& (match - the matched part) • $' (postmatch - the part after the match). • After this is executed • "I like cats and bats." =~ /[rbc]at/ • $& contains "cat" • $` contains "I like " • $' contains "s and bats." • In general, the original string is equivalent to the concatenation of the three special variables. $`. $&. $'
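
A small sketch of the prematch, match, and postmatch variables, using the sentence from the slide. The names $pre, $match, $post, and $rebuilt are illustrative.

```perl
use strict;
use warnings;

my $text = "I like cats and bats.";
my ($pre, $match, $post) = ("", "", "");

if ($text =~ /[rbc]at/) {
    # Copy the special variables right away; a later match would overwrite them.
    ($pre, $match, $post) = ($`, $&, $');
}

# Gluing the three parts back together gives the original string.
my $rebuilt = $pre . $match . $post;

print "prematch:  [$pre]\n";
print "match:     [$match]\n";
print "postmatch: [$post]\n";
```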

  24. Quantifiers • A quantifier is always put after the character (or class of characters) to be quantified. • /x+/ -- matches one or more x's in a row • /[aeiou]{3}/ -- matches any three vowels in a row • /c.*t/ -- matches a c followed by a t with 0 or more of any character in between

  25. The following tests whether $str contains at least one b character in between an a and c. • $str =~ /ab+c/ • true: "abc", "abbc" , "abbbc" , "aabcc" • false: "ac" , "aBc" • The following tests whether $str contains a sequence of exactly 3 b characters in between an a and c. • $str =~ /ab{3}c/ • true: "abbbc", "aabbbcc" • false: "abbc" , "abbbbc" • The following tests whether $str contains a sequence of at least 2 b characters in between an a and c. • $str =~ /ab{2,}c/ • true: "abbc", "abbbc" , "aabbbbcc" • false: "abc" , "aBBc"
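
The quantifier examples can be checked with a small table-driven sketch. The @cases structure is an illustrative choice; it relies on the fact that variables are interpolated inside a pattern.

```perl
use strict;
use warnings;

# Each entry: [pattern source, sample string, expected result].
my @cases = (
    ['ab+c',    'aabcc',    1],
    ['ab+c',    'ac',       0],
    ['ab{3}c',  'aabbbcc',  1],
    ['ab{3}c',  'abbbbc',   0],
    ['ab{2,}c', 'aabbbbcc', 1],
    ['ab{2,}c', 'abc',      0],
);

my $failures = 0;
for my $case (@cases) {
    my ($pattern, $sample, $expected) = @$case;
    # $pattern is interpolated into the match, just like in a quoted string.
    my $got = ($sample =~ /$pattern/) ? 1 : 0;
    $failures++ if $got != $expected;
    print "/$pattern/ on $sample => $got\n";
}
print "mismatches: $failures\n";
```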

  26. It gets interesting when quantifiers are mixed with the special character classes. • The following tests to see if $str contains an alphanumeric word (chunk of consecutive alphanumeric characters). • $str =~ /\w+/ • true: "beest" , "1234" , "R2D2" , "x" , "##xyz##" • false: "####" , "" , " " • The following tests to see if $str contains one or more consecutive digits (i.e. is there an integer inside). • $str =~ /\d+/ • true: "1" , "121 Elm. St." , "R2D2" , "##1##" , "3.14" • false: "a" , "####" , "" , " "

  27. The following tests to see if $str contains a substring that looks like a (possibly negative) integer. That is, does $str contain zero or one - characters, followed by one or more consecutive digits. • $str =~ /-?\d+/ • true: "2" , "-2" , "-3.14" , "3-21.7" • false: "xyx" , "x-y" , "-x" • The following tests to see if there is at least one whitespace character in $str. • $str =~ /\s+/ • true: " " , "  " , " xyy" , "The End" • false: "" , "xyz" , "TheEnd"
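
To see which substring the integer pattern actually picks up, the sketch below wraps it in a group and reads $1. The %found hash and the "(no match)" marker are illustrative.

```perl
use strict;
use warnings;

my %found;
for my $s ("2", "-2", "-3.14", "3-21.7", "x-y") {
    # The parentheses capture whatever the pattern matched into $1.
    $found{$s} = ($s =~ /(-?\d+)/) ? $1 : "(no match)";
    print "$s => $found{$s}\n";
}
```

Because matching starts from the left, "3-21.7" yields "3" rather than "-21".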

  28. The following matches any two-digit hexadecimal number. That is, it matches any occurrence of two consecutive characters from the class [0123456789abcdefABCDEF]. • /[\da-fA-F]{2}/ • The quantified pattern is equivalent to the longer pattern /[\da-fA-F][\da-fA-F]/. • For the next example, suppose we have dates that are roughly formatted, but in the general form • month_name day_number, year • We wish to create a pattern capable of factoring out inconsistent formatting and capturing the three date parts. For example, it should handle both dates below. • jan 1,2002 • MARCH 22, 02

  29. The following tests whether $date contains (a group of one or more letters, lower or upper-case), followed by one or more whitespace characters, followed by (a group of one or more digits), followed by a comma and then zero or more whitespace characters, followed by (a group of one or more digits). • $date =~ /([a-zA-Z]+)\s+(\d+),\s*(\d+)/ • Since there are three groups, the month is captured into $1, the day into $2, and the year into $3.
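
Applied to the two sample dates, the pattern can be exercised as in the sketch below. The @parsed array and its month, day, and year keys are illustrative names.

```perl
use strict;
use warnings;

my @parsed;
for my $date ("jan 1,2002", "MARCH 22, 02") {
    if ($date =~ /([a-zA-Z]+)\s+(\d+),\s*(\d+)/) {
        # Copy the captures before the next match overwrites them.
        push @parsed, { month => $1, day => $2, year => $3 };
    }
}

for my $d (@parsed) {
    print "month=$d->{month} day=$d->{day} year=$d->{year}\n";
}
```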

  30. Quantifiers are greedy by default • That means a quantified pattern will attempt to match as much as possible. ("Matching is greedy.") • The following expression tests for a < character, followed by one or more of anything (wildcard), followed by a > character. • "<h1>Title</h1>" =~ /<.+>/ • The quantifier's greediness passes up "<h1>", which would otherwise be a match. So the pattern matches the whole string in this case.

  31. To overcome the greediness (match as little as possible), an extra ? character is placed after the quantifier. • For example, to find HTML tags, the pattern <.+?> would be used. It basically says test for a < character followed by one or more of anything until the first > character is found. • The following would only match "<h1>". • "<h1>Title</h1>" =~ /<.+?>/
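
The difference between the greedy and non-greedy forms shows up in the matched text, which the sketch below reads from $&. The names $greedy and $lazy are illustrative.

```perl
use strict;
use warnings;

my $html = "<h1>Title</h1>";
my ($greedy, $lazy) = ("", "");

# Greedy: .+ grabs as much as it can, so the match runs to the last >.
$greedy = $& if $html =~ /<.+>/;

# Non-greedy: .+? stops at the first > it can reach.
$lazy = $& if $html =~ /<.+?>/;

print "greedy: $greedy\n";   # prints "greedy: <h1>Title</h1>"
print "lazy:   $lazy\n";     # prints "lazy:   <h1>"
```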

  32. Command modifiers • The behavior of the matching operator can be altered by using a command modifier, which is placed after the operator. • string_expression =~ /pattern/command_modifier • Case insensitive matching • The command modifier i specifies that the matching should be done in a case insensitive fashion. • if($str =~ /be/i) { • print "The string contains either be, Be, bE, or BE."; • }
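
A brief sketch comparing the same pattern with and without the i modifier. The sample strings "maybe" and "ab" are added here for illustration.

```perl
use strict;
use warnings;

my %sensitive;
my %insensitive;
for my $s ("be", "Be", "bE", "BE", "maybe", "ab") {
    $sensitive{$s}   = ($s =~ /be/)  ? 1 : 0;
    $insensitive{$s} = ($s =~ /be/i) ? 1 : 0;
    print "$s => /be/ $sensitive{$s}, /be/i $insensitive{$s}\n";
}
```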
