Regular expressions step by step Tamás Váradi varadi@nytud.hu BTANT129 w6
What are they? • Regular expressions (regexp) define a pattern, which may match a whole series of strings • Powerful, compact, fast • Useful for all sorts of text processing tasks
Where can I use them? • In text editors/word processors (even in MS Word to some extent!) like: • Textpad, EditPad Pro (to name but two) • Special programs to search a set of files: • grep, egrep, sed (free) • PowerGREP • Visual REGEXP • In programming languages • Perl, Python and other so-called scripting languages
What about INTEX? • Yes, INTEX has a built-in regexp facility • But it is a little limited and peculiar (INTEX offers graphs as an alternative) • In this lecture, we are going to cover regular expressions as used in the text processing tools mentioned above
Is there a standard variety? • More or less • There are variants that differ in • notation • features (expressive power, elegance etc.) • Here we'll concentrate on what you can expect regular expressions to do
First things first • Any character will match itself • Except characters with a special meaning (metacharacters): \ | ( ) [ { ^ $ * + ? . < > • The pattern is applied to the text from top to bottom and left to right, like a sliding window moving over the text
Special characters • . will match any one character • ? will match the preceding character zero times or once (at most once) • + will match the preceding character one or more times (at least once) • * will match the preceding character zero or more times • {n,m} will match the preceding character at least n and at most m times
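The {n,m} quantifier can be tried out with a minimal sketch like the one below. It assumes Python's built-in re module (one of the script languages mentioned earlier); the pattern and the candidate strings are made up for illustration.

```python
import re

# c{2,3}at: the letter c at least 2 and at most 3 times, followed by "at".
pattern = re.compile(r"c{2,3}at")

candidates = ["cat", "ccat", "cccat", "ccccat"]

# fullmatch succeeds only if the whole string fits the pattern.
matched = [s for s in candidates if pattern.fullmatch(s)]
print(matched)  # ['ccat', 'cccat']
```

Other tools may write the same quantifier slightly differently, so check the documentation of the tool you use.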
Examples • .at matches bat, cat, fat, pat, rat • c*at matches at and cat and ccat, cccat etc. • guess what c* will match and why? • c+at matches cat and ccat, cccat etc. but not at • c?at matches at and cat
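The examples above can be checked with a short sketch, again assuming Python's re module. The helper whole_matches and the word list are hypothetical names introduced only for this illustration.

```python
import re

words = ["at", "bat", "cat", "ccat", "cccat"]

def whole_matches(pattern, candidates):
    """Return the candidates that the pattern matches in full."""
    return [w for w in candidates if re.fullmatch(pattern, w)]

dot = whole_matches(r".at", words)    # exactly one character before "at"
star = whole_matches(r"c*at", words)  # zero or more c's before "at"
plus = whole_matches(r"c+at", words)  # one or more c's before "at"
opt = whole_matches(r"c?at", words)   # at most one c before "at"

print(dot)   # ['bat', 'cat']
print(star)  # ['at', 'cat', 'ccat', 'cccat']
print(plus)  # ['cat', 'ccat', 'cccat']
print(opt)   # ['at', 'cat']

# c* on its own is satisfied by zero c's, so it matches the empty
# string at the very first position, even in a text without any c.
empty_span = re.search(r"c*", "bat").span()
print(empty_span)  # (0, 0)
```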
Anchor points • A regexp is matched against the text at any point where the first char of the regexp matches a char in the target text (a sliding window) • matching is done line by line by default • ^ : match at the beginning of the line • $ : match at the end of the line
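A small sketch of the two anchors, assuming Python's re module. Note one assumption specific to Python: its ^ and $ apply to the whole string unless the re.MULTILINE flag is given, so the flag is used here to imitate the line-by-line behaviour of tools such as grep. The sample text is invented.

```python
import re

text = "cat sat\nthe cat\ncat"

# Without anchors, the pattern is found wherever it occurs.
anywhere = re.findall(r"cat", text)

# With re.MULTILINE, ^ and $ match at the start and end of every line.
at_start = [m.start() for m in re.finditer(r"^cat", text, re.MULTILINE)]
at_end = [m.start() for m in re.finditer(r"cat$", text, re.MULTILINE)]

# Both anchors together: the line must consist of "cat" and nothing else.
whole_line = re.findall(r"^cat$", text, re.MULTILINE)

print(len(anywhere))  # 3
print(at_start)       # [0, 16]
print(at_end)         # [12, 16]
print(whole_line)     # ['cat']
```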
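Groups and alternations • (bla)* : parentheses group characters so that a quantifier applies to the whole group (zero or more repetitions of bla) • Sir|Madam : the bar separates alternatives (matches either Sir or Madam)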
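The effect of grouping and alternation can be sketched as follows, assuming Python's re module. The test strings and the "Dear ..." pattern are made up for illustration.

```python
import re

# The parentheses make * repeat the whole group "bla" ...
grouped = re.fullmatch(r"(bla)*", "blablabla") is not None
# ... whereas without them * repeats only the final "a".
ungrouped = re.fullmatch(r"bla*", "blablabla") is not None
ungrouped_a = re.fullmatch(r"bla*", "blaaa") is not None

print(grouped, ungrouped, ungrouped_a)  # True False True

# Alternation inside a group: either "Sir" or "Madam" after "Dear ".
greeting = re.compile(r"Dear (Sir|Madam)")
titles = [m.group(1) for m in greeting.finditer("Dear Sir, Dear Madam, Dear All")]
print(titles)  # ['Sir', 'Madam']
```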
Character classes • [aeiou] matches any one character of the set • [^aeiou] matches any one character that is not in the set • [a-zA-Z0-9] consecutive characters can be referred to with a range • Note: whatever the length of the set, it always represents a single character in the pattern, so it is a single-character alternation (an 'or' relation between characters)
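A brief sketch of character classes, assuming Python's re module; the sample text is invented. The last two lines illustrate the note above: a class stands for exactly one character.

```python
import re

text = "Regexp 101"

vowels = re.findall(r"[aeiou]", text)       # one character from the set
others = re.findall(r"[^aeiou]", text)      # one character not in the set
alnum = re.findall(r"[a-zA-Z0-9]", text)    # ranges inside a class

print(vowels)  # ['e', 'e']
print(others)  # ['R', 'g', 'x', 'p', ' ', '1', '0', '1']
print(alnum)   # ['R', 'e', 'g', 'e', 'x', 'p', '1', '0', '1']

# A class always consumes a single character, however long the set is.
one_char = re.fullmatch(r"[bcr]at", "bat") is not None
two_chars = re.fullmatch(r"[bcr]at", "brat") is not None
print(one_char, two_chars)  # True False
```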
Extended features • \d a digit • \D a non-digit • \s a whitespace character (space, tab, carriage return, newline) • \S a non-whitespace character • \w a word character • \W a non-word character • \b a word boundary • \n a newline • \t a tab
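Some of these shorthands can be tried with the sketch below, assuming Python's re module. Exactly which characters count as digits or word characters varies between tools, so treat the output as specific to Python; the sample strings are invented.

```python
import re

text = "Room 42\tis open"   # contains a real tab character

digits = re.findall(r"\d+", text)   # runs of digits
words = re.findall(r"\w+", text)    # runs of word characters
pieces = re.split(r"\s+", text)     # split on any run of whitespace

print(digits)  # ['42']
print(words)   # ['Room', '42', 'is', 'open']
print(pieces)  # ['Room', '42', 'is', 'open']

# \b matches only at a word boundary, so "is" inside "this" is skipped.
loose = re.findall(r"is", "this is it")
whole = re.findall(r"\bis\b", "this is it")
print(loose)  # ['is', 'is']
print(whole)  # ['is']
```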
Longest vs. shortest match • When using quantifiers with non-literal characters (".", "\w", "\S" etc.) one can easily get unintended matches • .+ longest match (default) • .+? shortest match
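The difference between the longest and the shortest match can be sketched as follows, assuming Python's re module (not every tool supports the .+? form). The quoted sample text is made up.

```python
import re

text = 'say "yes" or "no"'

# .+ takes as much as it can: from the first quote to the last one.
greedy = re.findall(r'".+"', text)

# .+? takes as little as it can: each quoted word is matched separately.
lazy = re.findall(r'".+?"', text)

print(greedy)  # ['"yes" or "no"']
print(lazy)    # ['"yes"', '"no"']
```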
The escape character • Problem: What if we want to find characters that are metacharacters for regexp (\ | ( ) [ { ^ $ * + ? . < >)? • Solution: They have to be preceded by "\" to strip them of their special meaning, e.g. \( \$ \[ \? etc.
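Escaping can be illustrated with a minimal sketch, assuming Python's re module. The sample strings are invented; re.escape is a Python helper that adds the backslashes automatically and is shown here only as a convenience, since other tools may not offer an equivalent.

```python
import re

# An unescaped dot matches any character; an escaped dot only a real dot.
unescaped = re.findall(r"5.00", "5.00 5x00")
escaped = re.findall(r"5\.00", "5.00 5x00")
print(unescaped)  # ['5.00', '5x00']
print(escaped)    # ['5.00']

text = "Price (net): $5.00?"

# \( \) \$ \. \? are matched literally because of the backslash.
brackets = re.search(r"\(net\)", text).group()
price = re.findall(r"\$\d+\.\d+\?", text)
print(brackets)  # (net)
print(price)     # ['$5.00?']

# re.escape builds a pattern that matches its argument literally.
literal = re.escape("$5.00?")
matches_itself = re.fullmatch(literal, "$5.00?") is not None
print(matches_itself)  # True
```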
Things to do • Look up the tutorial at http://www.zvon.org/other/PerlTutorial/Output/contents.html • Download one of the tools (Visual REGEXP, PowerGREP, EditPad Pro) and experiment with texts • Follow the tutorial of EditPad Pro, which you can find in its Help