Regex is Fun David Clawson SplunkYoda
Regular Expressions
“A regular expression is a pattern which specifies a set of strings of characters; it is said to match certain strings.” - Ken Thompson
• QED Text Editor written by Ken in the 1970s
• Invented in the 1940s
• Help celebrate its 70th year
How is Regex used in Python? Python's built-in "re" module provides excellent support for regular expressions, with a modern and fairly complete regex flavor. Before Python 3.11, the only significant features missing from Python's regex syntax were atomic grouping, possessive quantifiers, and Unicode properties; 3.11 added the first two, and Unicode properties (\p{...}) are still unsupported. To use regular expressions in Python, first import the module into your script with "import re".
How is Regex used in Python? Call re.search(regex, subject) to apply a regex pattern to a subject string. The function returns None if the matching attempt fails, and a Match object otherwise. The Match object stores details about the part of the string matched by the regular expression pattern. Since None evaluates to False, you can easily use re.search() in an if statement.
How is Regex used in Python? Do not confuse re.search() with re.match(). Both apply a pattern to a subject string and return a Match object or None, but re.search() tries the pattern at successive positions throughout the string until it finds a match, whereas re.match() only attempts the pattern at the very start of the string.
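The difference between re.search() and re.match() can be sketched as follows. The subject string is invented for illustration.

```python
import re

subject = "error code 404 returned"  # made-up sample text

# re.search() scans the whole string and returns the first match anywhere.
found = re.search(r"\d+", subject)

# re.match() only tries at position 0, so it fails here:
# the string starts with "error", not with a digit.
at_start = re.match(r"\d+", subject)

if found:  # a Match object is truthy, None is falsy
    print(found.group(), found.span())

print(at_start)
```

Because a failed attempt returns None, the result can go straight into an if statement, as the slide describes.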
How is Regex used in Python? To get all matches from a string, call re.findall(regex, subject). This returns a list of all non-overlapping regex matches in the string. "Non-overlapping" means that the string is searched from left to right, and the next match attempt starts beyond the previous match. If the regex contains exactly one capturing group, re.findall() returns a list of the text matched by that group; with two or more groups it returns a list of tuples, each tuple containing the text matched by each capturing group. The overall regex match is not included unless you place the entire regex inside a capturing group.
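A minimal sketch of how the shape of the re.findall() result depends on the number of capturing groups. The subject string is invented for illustration.

```python
import re

subject = "a=1, b=22, c=333"  # made-up sample text

# No capturing groups: a list of the full matches.
no_groups = re.findall(r"\w=\d+", subject)

# One capturing group: a list of strings holding only that group's text.
one_group = re.findall(r"\w=(\d+)", subject)

# Two capturing groups: a list of tuples, one tuple per match.
two_groups = re.findall(r"(\w)=(\d+)", subject)

print(no_groups)
print(one_group)
print(two_groups)
```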
How is Regex used in Python? Often more memory-efficient than re.findall() is re.finditer(regex, subject). It returns an iterator that lets you loop over the regex matches in the subject string one at a time: for m in re.finditer(regex, subject). The for-loop variable m is a Match object with the details of the current match.
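The loop over re.finditer() might look like this. The subject string is made up, reusing two IP addresses from the sample Apache log in this deck.

```python
import re

subject = "10.23.10.11 talked to 10.100.0.11"  # made-up sentence

# Each m is a Match object: group() is the matched text,
# span() is its (start, end) position in the subject.
spans = []
for m in re.finditer(r"\d+\.\d+\.\d+\.\d+", subject):
    spans.append((m.group(), m.span()))
    print(m.group(), m.span())
```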
How is Regex used in Splunk?
• Field extraction: | rex field=_raw "%UC_CALLMANAGER-(?<Severity>\d+)-EndPointUnregistered:
• Configure line breaking: LINE_BREAKER = ([\r\n]+)
• Filtering and routing data to queues: REGEX = (?m)^EventCode=(592|593)
• Many more...
Regex Testing Tools
• RegExr: http://gskinner.com/RegExr/
• Reggy: http://reggyapp.com/
• RegexPal: http://regexpal.com/
• RegexBuddy: http://www.regexbuddy.com/
• Lars Olav Torvik: http://regex.larsolavtorvik.com/
• Rubular: http://rubular.com/
Regex Reference Texts
• http://www.regular-expressions.info/reference.html - from the creators of RegexBuddy
• Introducing Regular Expressions by Michael Fitzgerald
• Mastering Regular Expressions by Jeffrey Friedl
• Regular Expressions Cookbook by Jan Goyvaerts
• Regular Expressions Pocket Reference by Tony Stubblebine
Basic Concepts of Regular Expressions Because Knowing leads to Doing
Simple Pattern Matching
• Matching String Literals
• Matching Digits and Non-Digits
• Matching Word and Non-Word Characters
• Matching Whitespace
• Matching Any Character
Matching String Literals
Sample Apache log:
10.23.10.11 www.iamcool.com 10.100.0.11 - - [06/Dec/2012:14:39:03 -0800] "GET /Facelift/answers/swelling HTTP/1.1" 301 20 14932 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
A literal string match of the first IP address would be: 10.23.10.11 (to match the dots literally rather than as "any character", escape them: 10\.23\.10\.11)
Matching Digits and Non-Digits: \d or \D or [0-9]
• \d - matches a digit
• \D - matches any non-digit character (letters, whitespace, punctuation, and so on)
• [0-9] - matches any single digit (this is called a character class)
• [^0-9] - matches any character that is not a digit
Matching Words and Non-Words: \w or \W
• \w - matches any word character; essentially the same as the character class [a-zA-Z0-9_] (note the underscore)
• \W - matches any non-word character
Matching Whitespace: \s or \S
• \s - matches whitespace (spaces, tabs, line feeds, and carriage returns)
• \S - matches any character that is not whitespace; same as [^\s]
Matching Any Character: dot (.)
• . - matches any character except line-ending characters
• \b - matches a word boundary without consuming any characters
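The shorthand classes from the last few slides can be tried out in Python's re module. This is a small sketch on an invented string, not an example from the original deck.

```python
import re

text = "id_7 = 42\n"  # made-up sample text ending in a line feed

digits = re.findall(r"\d", text)        # every digit
non_digits = re.findall(r"\D", text)    # everything else, letters included
words = re.findall(r"\w+", text)        # runs of letters, digits, underscore
spaces = re.findall(r"\s", text)        # spaces and the trailing line feed
any_chars = re.findall(r".", text)      # every character except the line feed
word_starts = re.findall(r"\b\w", text) # first character after each word boundary

print(digits, words, word_starts)
```

Note that \D matches the letters and the underscore as well as whitespace and punctuation, and that "id_7" is a single word because the underscore counts as a word character.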
Boundaries and Alternation
• Matching the Beginning and End of Line
• List of Regex Special Characters
• Alternation and Regex Options
• Subpatterns
• Capturing and Named Groups
• Character Classes
• Negated Character Classes
Matching Beginning and End of Line ^ OR $ ^ - matches the beginning of a line $ - matches the end of a line
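In Python's re module, ^ and $ only match at line boundaries when multi-line mode is on; otherwise they anchor to the whole string. A minimal sketch, using the two event codes from the Splunk slide plus an invented third one (600):

```python
import re

# Three lines of made-up event data; 600 is invented for contrast.
text = "EventCode=592\nEventCode=600\nEventCode=593"

# Without multi-line mode, ^ matches only at the start of the whole string.
first_only = re.findall(r"^EventCode=(\d+)", text)

# With (?m), ^ and $ also match at the start and end of every line.
every_line = re.findall(r"(?m)^EventCode=(\d+)$", text)

print(first_only)
print(every_line)
```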
List of Regex Special Characters: . ^ * + ? | ( ) { } [ ] \ -
• . - matches any character
• ^ - matches the beginning of the line
• * - matches zero or more
• + - matches one or more
• ? - matches zero or one
• | - used for alternation (choice of patterns to match)
• () - used for grouping
• {} - used as a quantifier
• [] - used with character classes
• \ - used to make a character literal or to introduce a special regex sequence
• - - the hyphen is used in a character class range
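Alternation and Options: | or (?…)
• | - gives a choice of alternate patterns to match, e.g. (THE|The|the)
• (?i) - case insensitive
• (?J) - allow duplicate names
• (?m) - multi-line: ^ and $ match at the start and end of each line
• (?s) - single-line: the dot also matches line-ending characters
• (?U) - ungreedy: quantifiers are lazy by default
• (?x) - extended: ignore whitespace and comments in the pattern
• (?-…) - unset or turn off options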
Subpatterns: group(s) within a group
• (THE|The|the) - has three subpatterns
• [tT]h(e|eir) - matches the, The, their, Their
Capturing and Named Groups: () and (?<name>…) or (?P<name>…)
• Groups store their content in memory
• (it is) (time to eat) - the two groups are captured as $1 and $2
• (?<Severity>\d) - Splunk creates a field named Severity from this named group
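For comparison, here is a sketch of numbered and named groups in Python, where the re module accepts only the (?P<name>…) spelling. The log line is hypothetical, modeled on the pattern in the Splunk field-extraction slide.

```python
import re

# Numbered groups: $1 and $2 in other tools are group(1) and group(2) here.
m = re.search(r"(it is) (time to eat)", "it is time to eat")
numbered = m.groups()

# Hypothetical log line, invented for illustration.
event = "%UC_CALLMANAGER-3-EndPointUnregistered: device unregistered"

# Named group, Python spelling: (?P<Severity>...)
m2 = re.search(r"%UC_CALLMANAGER-(?P<Severity>\d+)-EndPointUnregistered", event)
severity = m2.group("Severity")  # look the group up by name
fields = m2.groupdict()          # all named groups as a dict

print(numbered, severity, fields)
```

The groupdict() result is roughly what a field extraction produces: a mapping from field name to extracted value.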
Character Classes: []
• [aeiou] - only matches the characters inside of the brackets
• [0-9] - matches a range of characters, using a hyphen
• [a-zA-Z0-9] - matches all alphanumeric characters
Negated Character Classes: [^…]
*** Super important, especially for Splunk field extractions ***
• [^aeiou] - matches any character that is not a lowercase vowel (consonants, but also digits, spaces, and punctuation)
• [^\s] - matches any character that is not whitespace
Quantifiers
• Greedy, Lazy, Possessive
• Matching a Certain Number of Times
Greedy, Lazy, Possessive: * + ?
• * - match zero or more times
• .* - will match all of the remaining characters on the line (want to avoid this)
• + - match one or more times
• \d+ - match all of the digits until there aren't any more (greedy)
• ? - match 0 or 1 of the preceding token; colou?r matches either color or colour
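The difference between a greedy and a lazy quantifier is easiest to see side by side. A short sketch on an invented string, where adding ? after the quantifier makes it lazy:

```python
import re

html = "<b>bold</b> and <i>italic</i>"  # made-up sample text

# Greedy: .* grabs as much as possible, then backtracks to the LAST ">".
greedy = re.search(r"<.*>", html).group()

# Lazy: .*? grabs as little as possible, stopping at the FIRST ">".
lazy = re.search(r"<.*?>", html).group()

# ? as a quantifier: the "u" is optional.
spellings = re.findall(r"colou?r", "color colour colr")

print(greedy)
print(lazy)
print(spellings)
```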
Matching a Certain Number of Times: {}
• \d{3} - matches exactly 3 digits
• \d{1,3} - matches a range of 1 to 3 digits
• \d{1,} - same as \d+
• \d{0,} - same as \d*
• \d{0,1} - same as \d?
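The equivalences above can be checked directly. In this sketch the subject string is invented, and the \b word boundaries are added so that \d{3} does not match the first three digits of a longer number.

```python
import re

subject = "7 42 365 2012"  # made-up sample text

# Whole numbers of exactly three digits, and of one to three digits.
exactly_three = re.findall(r"\b\d{3}\b", subject)
one_to_three = re.findall(r"\b\d{1,3}\b", subject)

# Each brace form returns the same matches as its shorthand.
same_plus = re.findall(r"\d{1,}", subject) == re.findall(r"\d+", subject)
same_star = re.findall(r"\d{0,}", subject) == re.findall(r"\d*", subject)
same_opt = re.findall(r"\d{0,1}", subject) == re.findall(r"\d?", subject)

print(exactly_three, one_to_three, same_plus, same_star, same_opt)
```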
Optimized Regular Expressions Because fast is elegant!
Optimize Regular Expressions Capture groups add overhead and can hurt overall performance; use them only when necessary.
Optimize Regular Expressions Try to “factor” on the left, when you can, while exposing required text. Less alternation is better.
Optimize Regular Expressions Try to “factor” on the right when input text is close to end of the line. Most regex engines will anchor at end of line when “$” is present.
Optimize Regular Expressions Typically exposing required or literal text makes the engine execute the regex faster
Optimize Regular Expressions Unnecessary parentheses add overhead. As above, use them only when necessary.
Optimize Regular Expressions The character class/set (indicated by []) will add unnecessary overhead when not needed.
Optimize Regular Expressions Anchoring the regex at the beginning of the line will result in improved performance with most regex engines.
Optimize Regular Expressions I said, anchor the regex!
Optimize Regular Expressions Using a negated character class/set instead of lazy/greedy quantifiers will typically result in faster regexes. Lazy/greedy quantifiers will make the regex engines backtrack which ultimately impacts overall performance.
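A sketch of the substitution described here, on an invented key="value" event. Both patterns extract the same values; the performance difference is the slide's claim and is not measured in this example.

```python
import re

# Hypothetical event text, invented for illustration.
event = 'user="alice" action="login" status="ok"'

# Lazy quantifier: expands one character at a time, checking for the quote.
lazy = re.findall(r'"(.*?)"', event)

# Negated character class: consumes everything that is not a quote in one run.
negated = re.findall(r'"([^"]*)"', event)

print(lazy)
print(negated)
```

The negated class also states the intent more precisely: a value is "anything up to the next quote", which is usually what a field extraction means.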
Optimize Regular Expressions Full alternation is more expensive than partial alternation. Also, in this case the regex engine will alternate only AFTER ‘bri’ has been matched.
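The original example patterns for this slide are not in the text, so the following is only an assumed illustration of full versus partial alternation. The word list (brian, bridge, bright) is hypothetical, chosen to share the 'bri' prefix the slide mentions.

```python
import re

# Full alternation: every alternative is tried from the first character.
full = re.compile(r"\b(?:brian|bridge|bright)\b")

# Partial alternation: the shared prefix "bri" is factored out, so the
# engine only alternates after "bri" has matched.
partial = re.compile(r"\bbri(?:an|dge|ght)\b")

words = "brian crossed the bright bridge briskly"  # made-up sample text
full_hits = full.findall(words)
partial_hits = partial.findall(words)

print(full_hits)
print(partial_hits)
```

Both patterns find the same words; only the amount of work per failed position differs.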
Optimize Regular Expressions Leading the engine to a match by placing the most popular match first may result in faster execution in some engines.
Optimize Regular Expressions Specifying an exact position inside the string and leading the engine to a match will help improve performance drastically compared to using a simple greedy/lazy quantifier.
Optimize Regular Expressions If ‘a’ is near the end of the input string, the regex will match faster, as less backtracking will be required.
Optimize Regular Expressions If ‘a’ is near the beginning of the input string the regex engine will match faster.
Optimize Regular Expressions Ex. in ‘ :destination’ the second regex fails faster.
Optimize Regular Expressions Same as above, using different notation. Explanation: Atomic grouping or possessive quantifiers instruct the regex engine not to keep the states captured by * or + therefore preventing it from unsuccessfully backtracking and in turn failing faster.