
Workbook 8, and 9

Workbook 8, and 9. Pace Center for Business and Technology. String Processing Tools. Key Concepts The wc command counts the number of characters, words, and lines in a file. When applied to structured data, the wc command can become a versatile counting tool.


Presentation Transcript


  1. Workbook 8, and 9 Pace Center for Business and Technology

  2. String Processing Tools Key Concepts • The wc command counts the number of characters, words, and lines in a file. When applied to structured data, the wc command can become a versatile counting tool. • The cat command has options that allow representation of nonprinting characters such as NEWLINE. • The head and tail commands have options that allow you to print only a certain number of lines or a certain number of bytes (one byte usually correlates to one character) from a file.

  3. Revisiting cat, head, and tail Revisiting cat We have been using the cat command to simply display the contents of files. Usually, the cat command generates a faithful copy of its input, without performing any edits or conversions. When called with one of the following command line switches, however, the cat command will indicate the presence of tabs, line feeds, and other control sequences, using the following conventions. Using the -A command line switch, the whitespace structure of the file becomes evident, as tabs are replaced with ^I, and line feeds are decorated with $. E.g. cat -A /etc/hosts
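The effect of -A can be sketched without touching /etc/hosts. The sample text below is made up for illustration, and the -A switch is assumed to be the GNU coreutils version of cat.

```shell
# Hypothetical sample line: two fields separated by a TAB, ended by a line feed.
marked=$(printf 'localhost\t127.0.0.1\n' | cat -A)
echo "$marked"        # the TAB shows up as ^I and the line feed as $

# Text with no trailing line feed gets no $ marker at the end.
unterminated=$(printf 'Mozart' | cat -A)
echo "$unterminated"
```

A missing $ at the end of the last line is a quick way to spot a file that does not end in a line feed.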

  4. Revisiting head and tail For example, the following file contains a list of four musicians. Linux (and Unix) text files generally adhere to a convention that the last character of the file must be a line feed for the last line of text. Following the cat of the file musicians.mac, which does not contain any conventional Linux line feed characters, the bash prompt is not displayed in its usual location.

  5. Revisiting head and tail
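As a minimal sketch of the head and tail options named in the key concepts, the following uses a throwaway five-line file (its contents are invented for this example) to select lines with -n and bytes with -c.

```shell
# Build a small scratch file; the contents are purely illustrative.
tmpfile=$(mktemp)
printf 'one\ntwo\nthree\nfour\nfive\n' > "$tmpfile"

first_two=$(head -n 2 "$tmpfile")    # first two lines
last_two=$(tail -n 2 "$tmpfile")     # last two lines
first_bytes=$(head -c 3 "$tmpfile")  # first three bytes

echo "$first_two"
echo "$last_two"
echo "$first_bytes"
rm -f "$tmpfile"
```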

  6. The wc (Word Count) Command When used without any command line switches, wc will report on the number of characters, lines, and words. Command line switches can be combined to return any combination of character count, line count or word count.
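The individual counts can be requested one at a time, as in this small sketch. The file contents are invented for the example; note that -c counts bytes, which matches the character count only for single-byte text.

```shell
# Scratch file with two lines and three words (illustrative content).
tmpfile=$(mktemp)
printf 'alpha beta\ngamma\n' > "$tmpfile"

lines=$(wc -l < "$tmpfile")   # number of line feeds
words=$(wc -w < "$tmpfile")   # number of whitespace-separated words
bytes=$(wc -c < "$tmpfile")   # number of bytes

echo "lines=$lines words=$words bytes=$bytes"
rm -f "$tmpfile"
```

Reading the file from standard input keeps the file name out of the output, which is convenient when the number is going to be used by another command.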

  7. How To Recognize A Real Character Text files are composed using an alphabet of characters. Some characters are visible, such as numbers and letters. Some characters are used for horizontal distance, such as spaces and TAB characters. Some characters are used for vertical movement, such as carriage returns and line feeds. A line in a text file is a series of any character other than a NEWLINE (line feed) character and then a NEWLINE character. Additional lines in the file immediately follow the first line. While a computer represents characters as numbers, the exact value used for each symbol varies depending on which alphabet has been chosen. The most common alphabet for English speakers is ASCII; the “Latin-1” (ISO 8859-1) encoding extends ASCII with the accented characters used by Western European languages. Different human languages are represented by different computer encoding rules, so the exact numeric value for a given character depends on the human language being recorded.

  8. So, What Is A Word? A word is a group of printing characters, such as letters and digits, surrounded by white space, such as space characters or horizontal TAB characters. Notice that our definition of a word does not include any notion of “meaning”. Only the form of the word is important, not its semantics. As far as Linux is concerned, a line such as:

  9. Chapter 2.  Finding Text: grep Key Concepts • grep is a command that prints lines that match a specified text string or pattern. • grep is commonly used as a filter to reduce output to only desired items. • grep -r will recursively grep files underneath a given directory. • grep -v prints lines that do NOT match a specified text string or pattern. • Many other command line switches allow users to specify grep's output format.

  10. Searching Text File Contents using grep In an earlier Lesson, we saw how the wc program can be used to count the characters, words and lines in text files. In this Lesson we introduce the grep program, a handy tool for searching text file contents for specific words or character sequences. The name grep comes from the ed editor command g/re/p, which globally searches for a regular expression and prints the matching lines. What, you may well ask, is a regular expression and why on earth should I want to search for one? We will provide a more formal definition of regular expressions in a later Lesson, but for now it is enough to know that a regular expression is simply a way of describing a pattern, or template, to match some sequence of characters. A simple regular expression would be “Hello”, which matches exactly five characters: “H”, “e”, two consecutive “l” characters, and a final “o”. More powerful search patterns are possible and we shall examine them in the next section. The figure below gives the general form of the grep command line:

  11. Searching Text File Contents using grep The following table summarizes some of grep's more commonly used command line switches. Consult the grep(1) man page (or invoke grep --help) for more.

  12. Show All Occurrences of a String in a File Under Linux, there are often several ways of accomplishing the same task. For example, to see if a file contains the word “even”, you could just visually scan the file: Reading the file, we see that the file does indeed contain the letters “even”. Using this method on a large file suffers because we could easily miss one word in a file of several thousand, or even several hundred thousand, words. We can use the grep tool to search through the file for us in an automatic search: Here we searched for a word using its exact spelling. Instead of just a literal string, the pattern argument can also be a general template for matching more complicated character sequences; we shall explore that in a later Lesson.
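The slide's sample file is not reproduced in this transcript, so the sketch below uses an invented three-line file to show the same kind of search.

```shell
# Illustrative file; the original example file is not available here.
tmpfile=$(mktemp)
printf 'one two three\nfour five six\nseven eight nine\n' > "$tmpfile"

# Print every line that contains the letters "even".
matches=$(grep even "$tmpfile")
echo "$matches"
rm -f "$tmpfile"
```

Only the third line is printed, because "seven" contains the letters "even".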

  13. Searching in Several Files at Once An easy way to search several files is just to name them on the grep command line: Perhaps we are more interested in just discovering which file mentions the word “nine” than actually seeing the line itself. Adding the -l switch to the grep line does just that:
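A rough illustration of both forms, using two made-up files (first.txt and second.txt are hypothetical names):

```shell
tmpdir=$(mktemp -d)
printf 'one two three\n' > "$tmpdir/first.txt"
printf 'seven eight nine\n' > "$tmpdir/second.txt"

# With several files, each matching line is prefixed with its file name.
with_lines=$(cd "$tmpdir" && grep nine first.txt second.txt)

# With -l, only the names of files containing a match are listed.
names_only=$(cd "$tmpdir" && grep -l nine first.txt second.txt)

echo "$with_lines"
echo "$names_only"
rm -rf "$tmpdir"
```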

  14. Searching Directories Recursively Grep can also search all the files in a whole directory tree with a single command. This can be handy when working with a large number of files. The easiest way to understand this is to see it in action. In the directory /etc/sysconfig are text files that contain much of the configuration information about a Linux system. The Linux name for the first Ethernet network device on a system is “eth0”, so you can find which file contains the configuration for eth0 by letting the grep -r command do the searching for you:

  15. Searching Directories Recursively Every file in /etc/sysconfig that mentions eth0 is shown in the results. We can further limit the files listed to only those referring to an actual device by filtering the grep -r output through a grep DEVICE: This shows a common use of grep as a filter to simplify the outputs of other commands. If only the names of the files were of interest, the output can be simplified with the -l command line switch.
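Since the contents of /etc/sysconfig differ from system to system, this sketch builds a small stand-in directory tree. The file names and contents are assumptions made for the example.

```shell
# Stand-in for /etc/sysconfig; names and contents are invented.
tmpdir=$(mktemp -d)
mkdir -p "$tmpdir/network-scripts"
printf 'DEVICE=eth0\nONBOOT=yes\n' > "$tmpdir/network-scripts/ifcfg-eth0"
printf '# bring up eth0 first\n' > "$tmpdir/notes"

# Recursive search, then a second grep as a filter on the first one's output.
filtered=$(cd "$tmpdir" && grep -r eth0 . | grep DEVICE)

# Only the names of the files that mention eth0.
files=$(cd "$tmpdir" && grep -rl eth0 . | sort)

echo "$filtered"
echo "$files"
rm -rf "$tmpdir"
```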

  16. Inverting grep By default, grep shows only the lines matching the search pattern. Usually, this is what you want, but sometimes you are interested in the lines that do not match the pattern. In these instances, the -v command line switch inverts grep's operation.
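A short sketch of the inversion, using an invented word list:

```shell
# Keep only the lines that do NOT contain "an".
remaining=$(printf 'apple\nbanana\ncherry\n' | grep -v an)
echo "$remaining"
```

Here "banana" is dropped and the other two lines pass through.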

  17. Getting Line Numbers Often you may be searching a large file that has many occurrences of the pattern. Grep will list each line containing one or more matches, but how is one to locate those lines in the original file? Using the grep -n command will also list the line number of each matching line. The file /usr/share/dict/words contains a list of common dictionary words. Identify which line contains the word “dictionary”: You might also want to combine the -n switch with the -r switch when searching all the files below a directory:
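Because /usr/share/dict/words is not installed everywhere, the sketch below feeds grep a tiny invented word list instead.

```shell
# -n prefixes each matching line with its line number in the input.
numbered=$(printf 'alpha\nbeta\ndictionary\ngamma\n' | grep -n dictionary)
echo "$numbered"
```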

  18. Limiting Matching to Whole Words Remember the file containing our nursery rhyme earlier? Suppose we wanted to retrieve all lines containing the word “at”. If we try the command: Do you see what happened? We matched the “at” string, whether it was an isolated word or part of a larger word. The grep command provides the -w switch to imply that the specified pattern should only match entire words. The -w switch considers a sequence of letters, numbers, and underscore characters, surrounded by anything else, to be a word.
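The nursery rhyme file is not included in this transcript, so the following sketch uses two invented lines to contrast a plain search with a whole-word search.

```shell
# Illustrative text; not the original rhyme.
rhyme=$(printf 'The cat sat on the mat\nLook at that\n')

# Plain search: "at" is found inside cat, sat, mat and that as well.
any=$(printf '%s\n' "$rhyme" | grep -c at)

# Whole-word search: only a free-standing "at" counts.
whole=$(printf '%s\n' "$rhyme" | grep -w at)

echo "$any"
echo "$whole"
```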

  19. Ignoring Case The string “Bob” has a meaning quite different from the string “bob”. However, sometimes we want to find either one, regardless of whether the word is capitalized or not. The grep -i command solves just this problem.
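A small sketch with an invented list of names, combining -i with -c to count the matching lines:

```shell
# Case-insensitive count of lines containing "bob".
count=$(printf 'Bob\nbob\nBOB\nrob\n' | grep -ic bob)
echo "$count"
```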

  20. Examples: Finding Simple Character Strings Verify that your computer has the system account “lp”, used for the line printer tools. Hint: the file /etc/passwd contains one line for each user account on the system.

  21. Chapter 3.  Introduction to Regular Expressions Key Concepts • Regular expressions are a standard Unix syntax for specifying text patterns. • Regular expressions are understood by many commands, including grep, sed, vi, and many scripting languages. • Within regular expressions, . and [] are used to match characters. • Within regular expressions, +, *, and ? specify a number of consecutive occurrences. • Within regular expressions, ^ and $ specify the beginning and end of a line. • Within regular expressions, (, ), and | specify alternative groups. • The regex(7) man page provides complete details.

  22. Introducing Regular Expressions In the previous chapter you saw grep used to match either a whole word or part of a word. This by itself is very powerful, especially in conjunction with arguments like -i and -v, but it is not appropriate for all search scenarios. Here are some examples of searches that the grep usage you've learned so far would not be able to do: First, suppose you had a file that looked like this:

  23. Introducing Regular Expressions What if you wanted to pull out just the names of the people in people_and_pets.txt? A command like grep -w Name: would match the 'Name:' line for each person, but also the 'Name:' line for each person's pet. How could we match only the 'Name:' lines for people? Well, notice that the lines for pets' names are all indented, meaning that those lines begin with whitespace characters instead of text. Thus, we could achieve our goal if we had a way to say "Show me all lines that begin with 'Name:'". Another example: Suppose you and a friend both witnessed a hit-and-run car accident. You both got a look at the fleeing car's license plate and yet each of you recalls a slightly different number. You read the license number as "4I35VBB" but your friend read it as "413SV88". It seems that what you read as an 'I' in the second character, your friend read as a '1'. Similar differences appear in your interpretations of other parts of the license like '5' vs 'S' and 'BB' vs '88'. The police, having taken both of your statements, now need to narrow down the suspects by querying their database of license plates for plates that might match what you saw.

  24. Introducing Regular Expressions One solution might be to do separate queries for "4I35VBB" and "413SV88" but doing so assumes that one of you is exactly right. What if the perpetrator's license number was actually "4135VB8"? In other words, what if you were right about some of the characters in question but your friend was right about others? It would be more effective if the police could query for a pattern that effectively said: "Show me all license numbers that begin with a '4', followed by an 'I' or a '1', followed by a '3', followed by a '5' or an 'S', followed by a 'V', followed by two characters that are each either a 'B' or an '8'". Query scenarios like these can be solved using regular expressions. While computer scientists sometimes use the term "regular expression" (or "regex" for short) to describe any method of describing complex patterns, in Linux and many programming languages the term refers to a very specific set of special characters used for solving problems like the above. Regular expressions are supported by a large number of tools including grep, vi, find and sed.

  25. Introducing Regular Expressions To introduce the usage of regular expressions, let's look at some solutions to two problems introduced earlier. Don't worry if these seem a bit complicated; the remainder of the unit will start from scratch and cover regular expressions in great detail. A regex that could solve the first problem, where we wanted to say "Show me all lines that begin with 'Name:'" might look like this: ...that's it! Regular expressions are all about the use of special characters, called metacharacters, to represent advanced query parameters. The caret ("^"), as shown here, means "Lines that begin with...". Note, by the way, that the regular expression was put in single-quotes. This is a good habit to get into early on as it prevents bash from interpreting special characters that were meant for grep.
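The file people_and_pets.txt is not shown in this transcript, so the layout below is a guess at its shape (people flush left, pets indented). Under that assumption, the anchored search can be sketched as:

```shell
# Assumed layout: person names start the line, pet names are indented.
tmpfile=$(mktemp)
printf 'Name: Joe\n    Pet:\n        Name: Rex\nName: Ann\n' > "$tmpfile"

# The caret anchors the match to the start of the line,
# so the indented pet name is skipped.
people=$(grep '^Name:' "$tmpfile")
echo "$people"
rm -f "$tmpfile"
```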

  26. Introducing Regular Expressions Ok, so what about the second problem? That one involved a much more complicated query: "Show me all license numbers that begin with a '4', followed by an 'I' or a '1', followed by a '3', followed by a '5' or an 'S', followed by a 'V', followed by two characters that are each either a 'B' or an '8'". This could be represented by a regular expression that looks like this: Wow, that's pretty short considering how long it took to write out what we were looking for! There are only two types of regex metacharacters used here: square brackets ('[]') and curly braces ('{}'). When two or more characters are shown within square brackets it means "any one of these". So '[B8]' near the end of the expression means "'B' or '8'". When a number is shown within curly braces it means "this many of the preceding item". Thus, '[B8]{2}' means "two characters that are each either a 'B' or an '8'". Pretty powerful stuff! Now that you've gotten a taste of what regular expressions are and how they can be used, let's start from scratch and cover them in depth.
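The slide's own expression is missing from this transcript. One way to write the query described above is sketched below; the ^ and $ anchors and the list of candidate plates are additions made for this example.

```shell
# Hypothetical plate database, one plate per line.
plates=$(printf '4I35VBB\n413SV88\n4135VB8\n4235VB8\n4I35VB\n')

# 4, then I or 1, then 3, then 5 or S, then V, then two of B or 8.
# The braces need extended syntax, hence grep -E.
suspects=$(printf '%s\n' "$plates" | grep -E '^4[I1]3[5S]V[B8]{2}$')
echo "$suspects"
```

Both witness readings match, and so does the mixed plate 4135VB8, which neither of the two separate literal queries would have found.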

  27. Regular Expressions, Extended Regular Expressions, and the grep Command As the Unix implementation of regular expression syntax has evolved, new metacharacters have been introduced. In order to preserve backward compatibility, commands usually choose to implement either basic regular expressions or extended regular expressions. In order to not become bogged down with the differences, this Lesson will introduce the extended syntax, summarizing differences at the end of the discussion. One of the most common uses for regular expressions is specifying search patterns for the grep command. As was mentioned in the previous Lesson, there are three versions of the grep command. Reiterating, the three differ in how they interpret regular expressions.

  28. Regular Expressions, Extended Regular Expressions, and the grep Command fgrep The fgrep command is designed to be a "fast" grep. The fgrep command does not support regular expressions, but instead interprets every character in the specified search pattern literally. grep The grep command interprets each pattern using the original, basic regular expression syntax. egrep The egrep command interprets each pattern using extended regular expression syntax. Because we are not yet making a distinction between the basic and extended regular expression syntax, the egrep command should be used whenever the search pattern contains regular expressions.
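The three interpretations can be compared side by side. This sketch uses grep -F and grep -E, which behave like fgrep and egrep respectively (recent GNU grep releases warn that the fgrep and egrep names are obsolescent); the input lines are invented.

```shell
samples=$(printf 'abbc\nab+c\na..c\n')

# Fixed strings: the dots are literal, so only "a..c" matches.
fixed=$(printf '%s\n' "$samples" | grep -F 'a..c')

# Basic syntax: each dot matches any character, so all three lines match.
basic=$(printf '%s\n' "$samples" | grep 'a..c')

# In basic syntax "+" is an ordinary character; in extended syntax
# it means "one or more of the preceding item".
basic_plus=$(printf '%s\n' "$samples" | grep 'ab+c')
extended_plus=$(printf '%s\n' "$samples" | grep -E 'ab+c')

echo "$fixed"; echo "$basic"; echo "$basic_plus"; echo "$extended_plus"
```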

  29. Anatomy of a Regular Expression In our discussion of the grep program family, we were introduced to the idea of using a pattern to identify the file content of interest. Our examples were carefully constructed so that the pattern contained exactly the text for which we were searching. We were careful to use only literal characters in our regular expressions; a literal character matches only itself. So when we used “hello” as the regular expression, we were using a five-character regular expression composed only of literal characters. While this let us concentrate on learning how to operate the grep program, it didn't allow us to get a full appreciation of the power of regular expressions. Before we see regular expressions in use, we shall first see how they are constructed.

  30. Anatomy of a Regular Expression A regular expression is a sequence of: Literal Characters Literal characters match only themselves. Examples of literals are letters, digits and most special characters (see below for the exceptions). Wildcards Wildcard characters match any character. Within a regular expression, a period (“.”) matches any character, be it a space, a letter, a digit, punctuation, anything. Modifiers A modifier alters the meaning of the immediately preceding pattern character. For example, the expression “ab*c” matches the strings “ac”, “abc”, “abbc”, “abbbc”, and so on, because the asterisk (“*”) is a modifier that means “any number of (including zero)”. Thus, our pattern means to match any sequence of characters consisting of one “a”, a (possibly empty) series of “b” characters, and a final “c” character. Anchors Anchors establish the context for the pattern, such as "the beginning of a line", or "the end of a word". For example, the expression “cat” would match any occurrence of the three letters, while “^cat” would only match lines that begin “cat”.

  31. Taking Literals Literally Literals are straightforward because each literal character in a regular expression matches one, and only one, copy of itself in the searched text. Uppercase characters are distinct from lowercase characters, so that “A” does not match “a”. Wildcards The "dot" wildcard The character “.” is used as a placeholder, to match one of any character. In the following example, the pattern matches any occurrence of the literal characters “x” and “s”, separated by exactly two other characters.
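The slide's example output is not reproduced here. A comparable sketch with an invented word list:

```shell
# "x", any two characters, then "s".
dotted=$(printf 'expose\nexcess\ntaxes\naxis\n' | grep 'x..s')
echo "$dotted"
```

In "taxes" and "axis" only one character separates the "x" from the "s", so those lines are not printed.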

  32. Bracket Expressions: Ranges of Literal Characters Normally a literal character in a regex pattern matches exactly one occurrence of itself in the searched text. Suppose we want to search for the string “hello” regardless of how it is capitalized: we want to match “Hello” and “HeLLo” as well. How might we do that? A regex feature called a bracket expression solves this problem neatly. A bracket expression is a range of literals enclosed in square brackets (“[” and “]”). For example, the regex pattern “[Hh]” is a character range that matches exactly one character: either an uppercase “H” or a lowercase “h” letter. Notice that it doesn't matter how large the set of characters within the range is, the set matches exactly one character, if it matches any at all. A bracket expression that matches the set of lowercase vowels could be written “[aeiou]” and would match exactly one vowel. In the following example, bracket expressions are used to find words from the file /usr/share/dict/words. In the first case, the first five words that contain three consecutive (lowercase) vowels are printed. In the second case, the first 5 words that contain lowercase letters in the pattern of vowel-consonant-vowel-consonant-vowel-consonant are printed.
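As a rough sketch of both ideas, using invented word lists rather than /usr/share/dict/words:

```shell
# One bracket expression per letter matches "hello" in any capitalization.
greetings=$(printf 'Hello\nHeLLo\nhelp\n' | grep '[Hh][Ee][Ll][Ll][Oo]')

# Three bracket expressions in a row: three consecutive lowercase vowels.
three_vowels=$(printf 'beautiful\nqueue\nbanana\nrhythm\n' | grep '[aeiou][aeiou][aeiou]')

echo "$greetings"
echo "$three_vowels"
```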

  33. Bracket Expressions: Ranges of Literal Characters If the first character of a bracket expression is a “^”, the interpretation is inverted, and the bracket expression will match any single occurrence of a character not included in the range. For example, the expression “[^aeiou]” would match any character that is not a vowel. The following example first lists words which contain three consecutive vowels, and secondly lists words which contain three consecutive consonant-vowel pairs.
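A sketch of the inverted form, again with an invented word list. Note that “[^aeiou]” matches any character that is not a lowercase vowel, which includes digits, punctuation and uppercase letters as well as consonants.

```shell
# Three consecutive (non-vowel, vowel) pairs.
pairs=$(printf 'banana\nqueue\ntomato\nsky\n' | grep '[^aeiou][aeiou][^aeiou][aeiou][^aeiou][aeiou]')
echo "$pairs"
```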

  34. Range Expressions vs. Character Classes: Old School and New School Another way to express a character range is by giving the start- and end-letters of the sequence this way: “[a-d]” would match any character from the set a, b, c or d. A typical usage of this form would be “[0-9]” to represent any single digit, or “[A-Z]” to represent all capital letters.

  35. Range Expressions vs. Character Classes: Old School and New School As an alternative to such quandaries, modern regular expressions make use of character classes. Character classes match any single character, using language specific conventions to decide if a given character is uppercase or lowercase, or if it should be considered part of the alphabet or punctuation. The following table lists some supported character classes, and the ASCII equivalent range expression, where appropriate.

  36. Range Expressions vs. Character Classes: Old School and New School Character classes avoid problems you may run into when using regular expressions on systems that use different character encoding schemes where letters are ordered differently. For example, suppose you were to run the command: On a Red Hat Enterprise Linux system, this would match every word in the file, not just those that contain capital letters as one might assume. This is because in the default RHEL locale, which uses the Unicode (UTF-8) character encoding, characters are collated case-insensitively, so that [A-Z] is equivalent to [AaBbCc...etc].

  37. Range Expressions vs. Character Classes: Old School and New School On older systems, though, a different character encoding scheme is used where alphabetization is done case-sensitively. On such systems [A-Z] would be equivalent to [ABC...etc]. Character classes avoid this pitfall. You can run: on any system regardless of the encoding scheme being used and it will only match lines that contain capital letters. For more details about the predefined range expressions, consult the grep manual page. For more information on character encoding schemes under Linux, refer back to chapter 8.3. To learn about how character encoding schemes are used to support other languages in Red Hat Enterprise Linux, begin with the locale manual page.
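The command on the slide is not reproduced in this transcript. A minimal sketch of a character class search, with invented input, might look like this:

```shell
# [[:upper:]] matches one uppercase letter, whatever the locale's ordering.
capitals=$(printf 'alpha\nBeta\ngamma\nDELTA\n' | grep '[[:upper:]]')
echo "$capitals"
```

The class name goes inside a bracket expression, which is why the brackets appear doubled.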

  38. Common Modifier Characters We saw a common usage of a regex modifier in our earlier example “ab*c” to match an a and c character with some number of b letters in between. The “*” character changed the interpretation of the literal b character from matching exactly one letter to matching any number of b's. Here is a list of some common modifier characters: b? The question mark (“?”) means “either one or none”: the literal character is considered to be optional in the searched text. For example, the regex pattern “ab?c” matches the strings “ac”, and “abc”, but not “abbc”. b* The asterisk (“*”) modifier means “any number of (including zero)” of the preceding literal character. The regex pattern “ab*c” matches the strings “ac”, “abc”, “abbc”, and so on.

  39. Common Modifier Characters b+ The plus (“+”) modifier means “one or more”, so the regex pattern “b+” matches a non-empty sequence of b's. The regex pattern “ab+c” matches the strings “abc” and “abbc”, but does not match “ac”. b{m,n} The brace modifier is used to specify a range of between m and n occurrences of the preceding character. The regex pattern “ab{2,4}c” would match “abbc”, “abbbc”, and “abbbbc”, but not “abc” or “abbbbbc”. b{n} With only one integer, the brace modifier is used to specify exactly n occurrences for the preceding character.
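The four modifiers can be tried against one small set of strings. In this sketch the patterns are anchored with ^ and $ (an addition for the example) so that each pattern has to account for the whole line.

```shell
strings=$(printf 'ac\nabc\nabbc\nabbbbbc\n')

optional=$(printf '%s\n' "$strings" | grep -E '^ab?c$')      # zero or one b
any_number=$(printf '%s\n' "$strings" | grep -E '^ab*c$')    # zero or more b's
one_or_more=$(printf '%s\n' "$strings" | grep -E '^ab+c$')   # at least one b
two_to_four=$(printf '%s\n' "$strings" | grep -E '^ab{2,4}c$')  # two to four b's

echo "$optional"; echo "$any_number"; echo "$one_or_more"; echo "$two_to_four"
```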

  40. Common Modifier Characters In the following example, egrep prints lines from /usr/share/dict/words that contain patterns which start with a (capital or lowercase) “a”, might or might not next have a (lowercase) “b”, but then definitely follow with a (lowercase) “a”. The following example prints lines which contain patterns which start “al”, then use the “.*” expression to specify 0 or more occurrences of any character, followed by the pattern “bra”.
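The slide's output is not available in this transcript, so the sketch below reconstructs both searches from the description, using invented word lists in place of /usr/share/dict/words.

```shell
# "a" or "A", an optional "b", then "a".
optional_b=$(printf 'Aaron\nabacus\nzebra\n' | grep -E '[Aa]b?a')

# "al", any run of characters (possibly empty), then "bra".
glue=$(printf 'algebra\ncalibrate\nalphabet\nzebra\n' | grep -E 'al.*bra')

echo "$optional_b"
echo "$glue"
```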

  41. Common Modifier Characters Notice we found variations on the words algebra and calibrate. For the former, the .* expression matched “ge”, while for the latter, it matched the letter “i”. The expression “.*”, which is interpreted as "0 or more of any character", shows up often in regex patterns, acting as the "stretchable glue" between two patterns of significance. As a subtlety, we should note that the modifier characters are greedy: they always match the longest possible input string. For example, given the regex pattern:

  42. Anchored Searches Four additional search modifier characters are available: ^foo A caret (“^”) matches the beginning of a line. Our example “^foo” matches the string “foo” only when it is at the beginning of a line. foo$ A dollar sign (“$”) matches the end of a line. Our example “foo$” matches the string “foo” only at the end of a line, immediately before the newline character. \<foo\> By themselves, the less than sign (“<”) and the greater than sign (“>”) are literals. Using the backslash character to escape them transforms them into meaning “beginning of a word” and “end of a word”, respectively. Thus the pattern “\<cat\>” matches the word “cat” but not the word “catalog”. You will frequently see both ^ and $ used together. The regex pattern “^foo$” matches a whole line that contains only “foo” and would not match that line if it contained any spaces. The \< and \> are also usually used as pairs.

  43. Anchored Searches In the following example, the first search lists all lines that contain the letters “ion” anywhere on the line. The second search only lists lines which end in “ion”.
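A sketch of both searches with an invented word list, followed by the word anchors. The \< and \> anchors are assumed to be available as in GNU grep.

```shell
words=$(printf 'ion\nlion\nionize\nnation\n')

anywhere=$(printf '%s\n' "$words" | grep -c 'ion')   # count lines containing "ion"
at_end=$(printf '%s\n' "$words" | grep 'ion$')       # only lines ending in "ion"

# Word anchors: "cat" as a whole word, not as part of "catalog".
whole_word=$(printf 'cat\ncatalog\nthe cat sat\n' | grep '\<cat\>')

echo "$anywhere"; echo "$at_end"; echo "$whole_word"
```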

  44. Coming to Terms with Regex Grouping The same way that you can use parentheses to group terms within a mathematical expression, you also use parentheses to collect regular expression pattern specifiers into groups. This lets the modifier characters “?”, “*” and “+” apply to groups of regex specifiers instead of only the immediately preceding specifier. Suppose we need a regular expression to match either “foo” or “foobar”. We could write the regex as “foo(bar)?” and get the desired results. This lets the “?” modifier apply to the whole string “bar” instead of only the preceding “r” character. Grouping regex specifiers using parentheses becomes even more flexible when the pipe symbol (“|”) is used to separate alternative patterns. Using alternatives, we could rewrite our previous example as “(foo|foobar)”. Writing this as “foo|foobar” is simpler and works just as well, because just like mathematics, regex specifiers have precedence. While you are learning, always enclose your groups in parentheses.
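Grouping and alternation can be sketched as follows. The anchors and the sample lines are additions for this example, chosen so that the effect of the group is visible.

```shell
samples=$(printf 'foo\nfoobar\nfoob\nbar\n')

# The "?" applies to the whole group "bar", not just to the "r".
optional_group=$(printf '%s\n' "$samples" | grep -E '^foo(bar)?$')

# Alternation inside a group: the line is exactly "foo" or exactly "bar".
either=$(printf '%s\n' "$samples" | grep -E '^(foo|bar)$')

echo "$optional_group"
echo "$either"
```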

  45. Coming to Terms with Regex Grouping In the following example, the first search prints all lines from the file /usr/share/dict/words which contain four consecutive vowels (compare the syntax to that used when first introducing range expressions, above). The second search finds words that contain a double “o” or a double “e”, followed (somewhere) by a double “e”.

  46. Escaping Meta-Characters Sometimes you need to match a character that would ordinarily be interpreted as a regular expression wildcard or modifier character. To temporarily disable the special meaning of these characters, simply escape them using the backslash (“\”) character. For example, the regex pattern “cat.” would match the letters “cat” followed by any character: “cats” or “catchup”. To match only the letters “cat.” at the end of a sentence, use the regex pattern “cat\.” to disable interpreting the period as a wildcard character. Note one distracting exception to this rule. When the backslash character precedes a “<” or “>” character, it enables the special interpretation (anchoring the beginning or ending of a word) instead of disabling the special interpretation. Shudder. It even gets worse - see the footnote at the bottom of the following table.
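The difference the backslash makes can be seen with three invented lines:

```shell
lines=$(printf 'I saw a cat.\ncats sleep\ncatchup\n')

# Unescaped: the dot matches any character, so every line matches.
unescaped=$(printf '%s\n' "$lines" | grep -c 'cat.')

# Escaped: only a literal period after "cat" matches.
escaped=$(printf '%s\n' "$lines" | grep 'cat\.')

echo "$unescaped"
echo "$escaped"
```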

  47. Summary of Linux Regular Expression Syntax The following table summarizes regular expression syntax, and identifies which components are found in basic regular expression syntax, and which are found only in the extended regular expression syntax.

  48. Summary of Linux Regular Expression Syntax The following table summarizes regular expression syntax, and identifies which components are found in basic regular expression syntax, and which are found only in the extended regular expression syntax.

  49. Regular Expressions are NOT File Globbing When first encountering regular expressions, students understandably confuse regular expressions with pathname expansion (file globbing). Both are used to match patterns in text. Both share similar metacharacters (“*”, “?”, “[...]”, etc.). However, they are distinctly different. The following table compares and contrasts regular expressions and file globbing.

  50. Regular Expressions are NOT File Globbing In the following example, the first argument is a regular expression, specifying text which starts with an “l” and ends “.conf”, while the second argument is a file glob which specifies all files in the /etc directory whose filename starts with “l” and ends “.conf”. Take a close look at the second line of output. Why was it matched by the specified regular expression? Why does the line containing the text “krb5.conf” match the expression? The “l” is found way back in the word “default”! In a similar vein, when specifying regular expressions on the bash command line, care must be taken to quote or escape the regex meta-characters, lest they be expanded away by the bash shell with unexpected results. In all of the examples found in this discussion, the first argument to the egrep command is protected with single quotes for just this reason.
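The slide's command and output are not reproduced here. The contrast can be sketched with a scratch directory standing in for /etc; the file names and the sample line are invented, and the regular expression is one possible reading of "starts with an l and ends .conf".

```shell
# As a regular expression: an "l" anywhere, then anything, then ".conf".
# The "l" in "default" is enough for the line to match.
regex_hit=$(printf '%s\n' 'default settings are read from krb5.conf' | grep -E 'l.*\.conf')

# As a file glob: names that START with "l" and end in ".conf".
tmpdir=$(mktemp -d)
touch "$tmpdir/ld.so.conf" "$tmpdir/krb5.conf"
globbed=$(cd "$tmpdir" && echo l*.conf)

echo "$regex_hit"
echo "$globbed"
rm -rf "$tmpdir"
```

The regular expression is single-quoted so that the shell passes it to grep untouched, while the glob is left unquoted so that the shell expands it.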
