TEXT PROCESSING UTILITIES

THE cat COMMAND
    $ cat emp1.lst
    2233 | shukla | g.m | sales | 12/12/52 | 20000
    9876 | sharma |d.g.m |product | 12/03 60 | 15000
    7898 | akash |dir. |mark. | 11/06/70 |9000
    3456 | tiwary |g.m |product | 05/02/89 |23000
    1234 | kumar | mgr |accnts | 18/03/79 |15000
    3456 | anil |chman |sales | 30/02/69 |40000
    6789 | lalith |mrg | mark. | 17/01/80 |60000
    5678 | a | d | m | 12/12/80 |12000
    This is the emp
    database which stores
    the information about various
    employees.
    that is employeenumber.
    emp name
    designation
    department
    date of birth
    and their salary.
• Each record holds six fields separated by |: employee number, name, designation, department, date of birth and salary. The closing lines of plain text are part of the file itself.

DISPLAYING THE BEGINNING OF A FILE – THE head COMMAND
• The head command, as the name implies, displays the top lines of a file. When used without an option, it displays the first ten lines of the file named as its argument.

    $ head emp.lst
    2233 | shukla | g.m | sales | 12/12/52 | 20000
    9876 | sharma |d.g.m |product| 12/03 60 | 15000
    7898 | akash |dir. |mark. | 11/06/70 |9000
    3456 | tiwary |g.m |product| 05/02/89 |23000
    1234 | kumar | mgr |accnts | 18/03/79 |15000
    3456 | anil |chman |sales | 30/02/69 |40000
    6789 | lalith |mrg | mark. | 17/01/80 |60000
    5678 | a | d | m | 12/12/80 |12000
    This is the emp
    database which stores

You can specify the line count and display, say, the first three lines of the file. Use the - symbol followed by a numeric argument (head -n 3 is the equivalent POSIX form).
• Ex:
    $ head -3 emp.lst
    2233 | shukla | g.m | sales | 12/12/52 | 20000
    9876 | sharma |d.g.m |product| 12/03 60 | 15000
    7898 | akash |dir. |mark. | 11/06/70 |9000
• If the line count specified exceeds the number of lines actually present in the file, head displays the entire file.
• You can also find the "record length" by counting the characters in the first line of the file:
    $ head -1 emp.lst | wc -c
    47

head also works with multiple files. For each file it prints the filename as a header, followed by the lines extracted:
    $ head -2 emp.lst f1.lst
    ==> emp.lst <==
    2233 | shukla | g.m | sales | 12/12/52 | 20000
    9876 | sharma |d.g.m|product| 12/03 60 | 15000

    ==> f1.lst <==
    root tty7 2009-07-25 09:56 (:0)
    root pts/1 2009-07-25 09:56 (:0)
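A minimal sketch of the head behaviours described above, run against a throwaway file. The file name sample.lst and its five lines are assumptions made for illustration; head -n 3 is the POSIX spelling of the head -3 form.

```shell
# Hypothetical sample file: five short lines in a temporary directory.
workdir=$(mktemp -d)
printf '%s\n' one two three four five > "$workdir/sample.lst"

# First three lines (same as "head -3").
first_three=$(head -n 3 "$workdir/sample.lst")
printf '%s\n' "$first_three"

# A count larger than the file simply yields the whole file.
line_total=$(head -n 50 "$workdir/sample.lst" | wc -l | tr -d ' ')
echo "lines returned for -n 50: $line_total"

# Character count of the first line, newline included.
record_length=$(head -n 1 "$workdir/sample.lst" | wc -c | tr -d ' ')
echo "record length: $record_length"
```

The tr -d ' ' after wc only strips the padding that some wc implementations add in front of the number.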

DISPLAYING THE END OF A FILE – THE tail COMMAND
• The tail command displays the end of the file. It provides an additional method of addressing lines, and can also extract information in units of blocks and characters.
• Like head, it displays the last ten lines when used without options.
• Ex:
    $ tail -3 emp.lst
    department
    date of birth
    and their salary.

    $ tail emp.lst
    This is the emp
    database which stores
    the information about various
    employees.
    that is employeenumber.
    emp name
    designation
    department
    date of birth
    and their salary.

    [itlaxmi@snist ~]$ tail -40c emp.lst
    artment
    date of birth
    and their salary.
• Ex:
    $ tail -v emp.lst
    ==> emp.lst <==
    This is the emp
    database which stores
    the information about various
    employees.
    that is employeenumber.
    emp name
    designation
    department
    date of birth
    and their salary.

The disadvantage of head and tail is that neither can display a range of lines on its own. Moreover, what is displayed is final: having displayed the first 50 lines of a file, we cannot move back and view, say, the first 10 again.
• -v: with this option, tail always prints the header giving the file name, even when only one file is named (as in the tail -v example).

tail can also address lines from the beginning of the file instead of the end. The +count option allows you to do that, where count represents the line number from which the selection should begin.
• Ex:
    $ tail -n +8 emp.lst
    5678 | a | d | m | 12/12/80 |12000
    This is the emp
    database which stores
    the information about various
    employees.
    that is employeenumber.
    emp name
    designation
    department
    date of birth
    and their salary.
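Although neither command selects a range by itself, the two can be combined in a pipeline. The sketch below assumes a hypothetical ten-line file, lines.txt: tail -n +4 starts the selection at line 4 and head -n 3 stops it after three lines.

```shell
# Hypothetical ten-line file: "line 1" to "line 10".
workdir=$(mktemp -d)
printf 'line %s\n' 1 2 3 4 5 6 7 8 9 10 > "$workdir/lines.txt"

# Lines 4 to 6: start at line 4, then keep the first three of what remains.
middle=$(tail -n +4 "$workdir/lines.txt" | head -n 3)
printf '%s\n' "$middle"

# Last two lines, for comparison with the +count form.
ending=$(tail -n 2 "$workdir/lines.txt")
printf '%s\n' "$ending"
```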

SLITTING A FILE VERTICALLY – THE cut COMMAND
• While head and tail slice a file horizontally, you can slice a file vertically with the cut command. cut identifies both columns and fields.
• Syntax: cut <options> <character or field list> <file(s)>
• Ex: Store the first 5 lines of the file emp.lst in a file named shortlist.
    $ head -5 emp.lst > shortlist

    $ cat shortlist
    2233 | shukla | g.m | sales | 12/12/52 | 20000
    9876 | sharma |d.g.m|product| 12/03 60 | 15000
    7898 | akash |dir. |mark. | 11/06/70 |9000
    3456 | tiwary |g.m |product| 05/02/89 |23000
    1234 | kumar | mgr |accnts | 18/03/79 |15000
• cut can be used to extract specific columns from this file. Use the -c (columns) option for cutting columns:
    $ cut -c5-20 shortlist
    | shukla | g.m
    | sharma |d.g.m
    | akash |dir.
    | tiwary |g.m
    | kumar | mgr
• Column numbers must immediately follow the option. Ranges are permitted, and commas separate the column chunks.

    $ cut -c2-5,10-15,40- shortlist
    233 ukla || 20000
    876 arma || 15000
    898 ash ||9000
    456 wary ||23000
    234 mar ||15000
• The expression 40- indicates column number 40 to the end of the line.
• Tracking fields by column position is tedious, and the file may not contain fixed-length records.
• You can extract specific fields using two options: -d (delimiter) to specify the field delimiter and -f (field) to specify the field list.
• When you use the -f option, don't forget to use the -d option too, unless the file has the default delimiter (the tab).
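A small sketch of field cutting with -d and -f. The file staff.lst and its unpadded records are assumptions made for illustration (the real shortlist pads its fields with spaces); the options are the two named above.

```shell
# Hypothetical, unpadded version of three employee records.
workdir=$(mktemp -d)
cat > "$workdir/staff.lst" <<'EOF'
2233|shukla|g.m|sales|12/12/52|20000
9876|sharma|d.g.m|product|12/03/60|15000
7898|akash|dir.|mark.|11/06/70|9000
EOF

# Fields 2 and 3 (name and designation); the delimiter is kept between them.
name_desig=$(cut -d'|' -f2,3 "$workdir/staff.lst")
printf '%s\n' "$name_desig"

# Field 6 only (salary).
salaries=$(cut -d'|' -f6 "$workdir/staff.lst")
printf '%s\n' "$salaries"
```

Quoting the delimiter matters here: an unquoted | would be read by the shell as a pipe.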

PASTING FILES – THE paste COMMAND
• What you "cut" with the previous command can be pasted with the paste command.
• In this respect it resembles the cat command. But while cat joins files end to end, one after another, paste joins them side by side, line by line.
    $ cut -d"|" -f6 shortlist | tee clist2
    20000
    15000
    9000
    23000
    15000
• cut was used to create two files, clist1 and clist2, containing two cut-out portions of the same file.

When using the -d option with several files on the command line, you can specify more than one delimiter. For example:
    $ paste -d" |#~" file1 file2 file3 file4 file5
• The above example uses the space character to join file1 and file2, the | character to join file2 and file3, and so forth.
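A minimal sketch of paste with one delimiter and with a delimiter list. The three files (names, desigs, pay) and their contents are hypothetical stand-ins for cut-out columns.

```shell
# Three hypothetical single-column files with two lines each.
workdir=$(mktemp -d)
printf '%s\n' shukla sharma > "$workdir/names"
printf '%s\n' g.m d.g.m > "$workdir/desigs"
printf '%s\n' 20000 15000 > "$workdir/pay"

# One delimiter: corresponding lines are joined with "|".
two_cols=$(paste -d'|' "$workdir/names" "$workdir/desigs")
printf '%s\n' "$two_cols"

# Delimiter list " |": a space joins files 1 and 2, "|" joins files 2 and 3.
three_cols=$(paste -d' |' "$workdir/names" "$workdir/desigs" "$workdir/pay")
printf '%s\n' "$three_cols"
```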

ORDERING A FILE – THE sort COMMAND
• Sorts the contents of a file.
• It can merge multiple sorted files and store the result in the specified output file.
• When the command is invoked without options, it sorts on the entire line:
• Ex:
    $ sort shortlist
    1234 | kumar | mgr |accnts | 18/03/79 |15000
    2233 | shukla | g.m | sales | 12/12/52 | 20000
    3456 | tiwary |g.m |product| 05/02/89 |23000
    7898 | akash |dir. |mark. | 11/06/70 |9000
    9876 | sharma |d.g.m|product| 12/03 60 | 15000

Sorting starts with the first character of each line in the file. If the first character of two lines is the same, the second character of each line is compared, and so on.
• The sorting is done according to the ASCII collating sequence. That is, whitespace sorts first, followed by numerals, uppercase letters and lowercase letters, in that order; punctuation marks fall in the gaps between these groups.
• Like cut and paste, sort also works on fields, and the default field separator is the space character. The -t option, followed immediately by the delimiter, overrides the default. This lets you sort the file on any field, for instance, the second field (name):
    $ sort -t"|" -k2 shortlist
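A minimal sketch of sorting on one field. The file staff.lst and its unpadded records are assumptions for illustration. Two details go beyond the text above: -k2,2 confines the key to the second field alone (a bare -k2 runs from field 2 to the end of the line), and LC_ALL=C is set so that the plain ASCII order is used whatever the user's locale.

```shell
# Hypothetical, unpadded records: number|name|designation.
workdir=$(mktemp -d)
cat > "$workdir/staff.lst" <<'EOF'
2233|shukla|g.m
9876|sharma|d.g.m
7898|akash|dir.
EOF

# Sort on the name field only, in byte (ASCII) order.
by_name=$(LC_ALL=C sort -t'|' -k2,2 "$workdir/staff.lst")
printf '%s\n' "$by_name"

# Same key, reversed with -r.
by_name_desc=$(LC_ALL=C sort -t'|' -k2,2 -r "$workdir/staff.lst")
printf '%s\n' "$by_name_desc"
```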

The sort order can be reversed with the -r (reverse) option.
• Ex:
    $ sort -r shortlist
    9876 | sharma |d.g.m|product| 12/03 60 | 15000
    7898 | akash |dir. |mark. | 11/06/70 |9000
    3456 | tiwary |g.m |product| 05/02/89 |23000
    2233 | shukla | g.m | sales | 12/12/52 | 20000
    1234 | kumar | mgr |accnts | 18/03/79 |15000
• We can sort the contents of several files in one shot, as in:
    $ sort file1 file2 file3

Instead of displaying the sorted output on the screen, we can store it in a file with the -o option:
    $ sort -o result clist1
    $ cat result
    akash |dir.
    kumar | mgr
    sharma |d.g.m
    shukla | g.m
    tiwary |g.m
• To check whether a file has actually been sorted, use the -c option:
    $ sort -c shortlist

Sorting on secondary key:
• You can sort on more than one key, i.e., you can provide a secondary key to sort. For example, if the primary key is the 3rd field and the secondary key is the 2nd field, you need to specify, for every -k option, where the sort key ends. This is done in this way:
    $ sort -t"|" -k3,3 -k2,2 shortlist
    9876 | sharma |d.g.m|product| 12/03 60 | 15000
    7898 | akash |dir. |mark. | 11/06/70 |9000
    2233 | shukla | g.m | sales | 12/12/52 | 20000
    3456 | tiwary |g.m |product| 05/02/89 |23000
    1234 | kumar | mgr |accnts | 18/03/79 |15000
• This sorts the file by designation and then by name. The -k3,3 option indicates that sorting starts on the 3rd field and ends on the same field.
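To see what the secondary key changes, the sketch below sorts a hypothetical file (made-up employee numbers, unpadded fields) first on designation alone and then on designation plus name. LC_ALL=C is an added assumption that pins the order to plain ASCII.

```shell
# Hypothetical records; tiwary is listed before shukla on purpose.
workdir=$(mktemp -d)
cat > "$workdir/staff.lst" <<'EOF'
5001|tiwary|g.m
5002|shukla|g.m
5003|kumar|mgr
5004|akash|dir.
EOF

# Primary key only: the two g.m records tie, and sort falls back to
# comparing the whole line, so 5001 (tiwary) stays ahead of 5002 (shukla).
primary_only=$(LC_ALL=C sort -t'|' -k3,3 "$workdir/staff.lst" | cut -d'|' -f2)
printf '%s\n' "$primary_only"

# With the name as secondary key, the tie is broken alphabetically.
with_secondary=$(LC_ALL=C sort -t'|' -k3,3 -k2,2 "$workdir/staff.lst" | cut -d'|' -f2)
printf '%s\n' "$with_secondary"
```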

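Sorting on columns:
• You can also specify a character position within a field as the beginning of the sort key. For example, to sort the file according to the year of birth, you sort on the column positions that hold the year within the 5th field:
    $ sort -t"|" -k5.7,5.8 shortlist
    2233 | shukla | g.m | sales | 12/12/52 | 20000
    9876 | sharma |d.g.m|product| 12/03 60 | 15000
    1234 | kumar | mgr |accnts | 18/03/79 |15000
    7898 | akash |dir. |mark. | 11/06/70 |9000
    3456 | tiwary |g.m |product| 05/02/89 |23000
• Positions 7 and 8 hold the year only when the date starts in the first column of the field. In this file every field begins with a space, so 5.7,5.8 actually selects the slash and the first digit of the year, which is why 79 appears before 70 above. Here -k5.8,5.9 selects the two year digits.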

Numeric sort (-n):
• When sort acts on numerals, strange things can happen.
    [itlaxmi@snist ~]$ cat > nfile
    2
    4
    10
    27
    [itlaxmi@snist ~]$ sort nfile
    10
    2
    27
    4
• This is probably not what you expected, but the ASCII collating sequence places 1 above 2, and 2 above 4. That's why 10 preceded 2 and 27 preceded 4. This can be overridden by the -n (numeric) option.

    [itlaxmi@snist ~]$ sort -n nfile
    2
    4
    10
    27
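The same effect shows up when the numbers sit inside a field. This sketch assumes a hypothetical two-field file, pay.lst (name|salary), and compares the default character sort with -n on the second field; LC_ALL=C is again an added assumption to keep the order predictable.

```shell
# Hypothetical name|salary records.
workdir=$(mktemp -d)
cat > "$workdir/pay.lst" <<'EOF'
shukla|20000
akash|9000
tiwary|23000
EOF

# Character comparison: "20000" and "23000" sort before "9000".
as_text=$(LC_ALL=C sort -t'|' -k2,2 "$workdir/pay.lst" | cut -d'|' -f1)
printf '%s\n' "$as_text"

# Numeric comparison of the same field.
as_number=$(LC_ALL=C sort -n -t'|' -k2,2 "$workdir/pay.lst" | cut -d'|' -f1)
printf '%s\n' "$as_number"
```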

Removing Repeated Lines (-u):
• The -u (unique) option lets you remove repeated lines from a file. To find the unique designations that occur in the file, cut out the designation field and pipe it to sort:
    $ cut -d"|" -f3 e.lst | sort -u | tee desg.lst
    dir.
    g.m
    mgr
Merge sort (-m):
• When sort is used with multiple filenames as arguments, it concatenates them and sorts them collectively.
• When large files are sorted in this way, performance often suffers. The -m (merge) option can merge two or more files that have already been sorted individually.
    $ sort -m f1 f2 f3

sort options
    Option       Description
    -tchar       Uses delimiter char to identify fields
    -k n         Sorts on nth field
    -k m,n       Starts sort on mth field and ends sort on nth field
    -k m.n       Starts sort on nth column of mth field
    -u           Removes repeated lines
    -n           Sorts numerically
    -r           Reverses sort order
    -f           Folds lowercase to equivalent uppercase (case-insensitive sort)
    -m list      Merges sorted files in list
    -c           Checks if the file is sorted
    -o flname    Places output in file flname

THE uniq COMMAND
• Duplicate entries often creep in through faulty data entry. Unix offers a special tool to handle these records: the uniq command.
• The command is most useful when placed in pipelines, and can be used as an SQL-type query tool (like DISTINCT).
• Ex:
    $ cat dept.lst
    01 | accounts | 6213
    01 | accounts | 6213
    02 | admin | 5423
    03 | marketing | 6521
    03 | marketing | 6521
    $ uniq dept.lst
    01 | accounts | 6213
    02 | admin | 5423
    03 | marketing | 6521

uniq simply fetches one copy of each redundant record and writes it to the standard output.
• Since uniq requires a sorted file as input, the general procedure is to sort a file and pipe the output to uniq. The following pipeline produces the same output, except that the output is saved in a file:
    $ sort dept.lst | uniq - ulist
    [itlaxmi@snist d1]$ cat ulist
    01 | accounts | 6213
    02 | admin | 5423
    03 | marketing | 6521
• Like sort, uniq also accepts the output filename as an argument. Since this is done without an option (unlike -o in sort), make sure that you don't specify multiple filenames as input to this command; uniq reads only one file at a time.

If we use two filenames, uniq simply processes the first file and overwrites the second with its output, so you lose the data in the second file.
• If uniq is merely to select unique lines, it is preferable to use sort -u. But uniq has a couple of options which can be used to make simple database queries.
• Ex: To determine the designation that occurs uniquely in the file e.lst, cut out the 3rd field, sort it, and then pipe it to uniq.
    $ cat e.lst
    2233 | shukla | g.m | sales | 12/12/52 | 20000
    9876 | sharma | mgr |product| 12/03 60 | 15000
    7898 | akash | dir. |mark. | 11/06/70 |9000
    3456 | tiwary | g.m |product| 05/02/89 |23000
    1234 | kumar | mgr |accnts | 18/03/79 |1500
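A hedged sketch of such queries, assuming a hypothetical file desig.txt that already holds one designation per line. The options used are standard uniq options that the text above does not spell out: -u prints only the lines that occur once, -d prints one copy of each line that repeats, and -c prefixes every line with its count.

```shell
# Hypothetical list of designations, one per line, with repeats.
workdir=$(mktemp -d)
printf '%s\n' g.m mgr dir. g.m mgr > "$workdir/desig.txt"

# uniq needs sorted input, so sort first in every pipeline.
only_once=$(LC_ALL=C sort "$workdir/desig.txt" | uniq -u)
echo "occurs once: $only_once"

repeated=$(LC_ALL=C sort "$workdir/desig.txt" | uniq -d)
printf 'repeated: %s\n' $repeated

# Frequency of each designation (the count's padding varies between systems).
LC_ALL=C sort "$workdir/desig.txt" | uniq -c
```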

LINE NUMBERING – THE nl COMMAND
• There is a separate command in the UNIX system that has elaborate schemes for numbering lines: the nl command.
• nl numbers only logical lines, i.e. lines containing something apart from the newline character.
• By default, nl simply adds line numbers to its input, and prints them in a space six characters wide:
• Ex:
    $ nl clist1
    1 shukla | g.m
    2 sharma |d.g.m
    3 akash |dir.
    4 tiwary |g.m
    5 kumar | mgr

nl uses the tab character to separate the numbers from the text. Use the -w (width) option to specify the width of the number format, and -s (separator) to specify the separator:
• Ex:
    $ nl -w2 -s":" clist1
    1: shukla | g.m
    2: sharma |d.g.m
    3: akash |dir.
    4: tiwary |g.m
    5: kumar | mgr

To have leading zeroes in the first field, use the -n option:
• Ex:
    $ nl -w2 -s":" -nrz clist1
    01: shukla | g.m
    02: sharma |d.g.m
    03: akash |dir.
    04: tiwary |g.m
    05: kumar | mgr
• The -n option, followed immediately by the parameter rz, right-justifies the number, with leading zeroes filling the gaps. The other format you can use is ln, which left-justifies the number and omits the leading zeroes.

In many applications, you have code tables starting from a number different from 1 (or 01 or 001). The -v option, followed by a number, determines the initial value that is to be used to number the lines. You can use the number 40 as the initial value:
• Ex:
    $ nl -w2 -s":" -nrz -v40 clist1
    40: shukla | g.m
    41: sharma |d.g.m
    42: akash |dir.
    43: tiwary |g.m
    44: kumar | mgr
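The nl options above can be combined freely. A minimal sketch, assuming a hypothetical three-line file called names: a width of 3, a colon-and-space separator, zero-filled numbers and a starting value of 40.

```shell
# Hypothetical three-line input file.
workdir=$(mktemp -d)
printf '%s\n' shukla sharma akash > "$workdir/names"

# -w3: three-character number field, -s': ': separator string,
# -nrz: right-justified with leading zeroes, -v40: start counting at 40.
numbered=$(nl -w3 -s': ' -nrz -v40 "$workdir/names")
printf '%s\n' "$numbered"
```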

TRANSLATING CHARACTERS – THE tr COMMAND
• The tr (translate) filter manipulates individual characters in a line.
• It translates characters using one or two compact expressions:
• Syntax: tr options expression1 expression2 standard input
• tr takes input only from the standard input; it doesn't take a filename as argument.
• By default, it translates each character in expression1 to its mapped counterpart in expression2.
• The 1st character in the 1st expression is replaced with the 1st character in the 2nd expression, and similarly for the other characters.

Using ASCII octal values and escape sequences:
• tr also uses octal values and escape sequences to represent characters.
• To have each field on a separate line, replace the "|" with the LF character (octal value 012):
    $ tr '|' '\012' < emp.lst | head -n 6
    2233
    shukla
    g.m
    sales
    12/12/52
    20000

Deleting characters (-d):
• To delete the characters "|" and "/" from the file:
    $ tr -d '|/' < shortlist | head -n 2
    2233 shukla g.m sales 121252 20000
    9876 sharma d.g.m product 1203 60 15000
Compressing Multiple Consecutive Characters (-s):
• We can eliminate all redundant spaces in files with delimited fields with the -s (squeeze) option.
• The -s option squeezes multiple consecutive occurrences of its argument to a single character.
    $ tr -s ' ' < shortlist | head -n 3
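The four uses of tr covered above, gathered in one sketch. The sample record and the replacement characters (~ and -) are assumptions chosen for illustration; input arrives on standard input through a pipe, since tr takes no filename.

```shell
# Hypothetical record with "|" and "/" as separators.
record='2233|shukla|12/12/52'

# Character-for-character mapping: "|" becomes "~", "/" becomes "-".
mapped=$(printf '%s\n' "$record" | tr '|/' '~-')
echo "$mapped"

# Octal escape: turn each "|" into a newline (LF, octal 012).
split=$(printf '%s\n' "$record" | tr '|' '\012')
printf '%s\n' "$split"

# -d deletes every "|" and "/".
deleted=$(printf '%s\n' "$record" | tr -d '|/')
echo "$deleted"

# -s squeezes each run of spaces down to a single space.
squeezed=$(printf 'a   b    c\n' | tr -s ' ')
echo "$squeezed"
```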

File Utilities
• cut
• paste
• head
• tail
• cmp
• comm
• diff

Filters
• A group of commands, each of which accepts some data as input, performs some manipulation on it, and produces some output. Since they perform some filtering action on the data, they are appropriately called filters.
• grep
• egrep
• fgrep
• sed
• awk
• sort
• uniq
• nl

SEARCHING FOR A PATTERN – THE grep COMMAND
• grep (global regular expression print) scans a file for the occurrence of a pattern.
• It uses a couple of options and, depending on their usage, outputs the lines containing the pattern, the filenames or the line numbers.
• Syntax: grep <options> <pattern> <filename(s)>
• Most of grep's options are shared by the other members of its family (egrep and fgrep).

In addition to options, grep requires an expression to represent the pattern to be searched for. The first argument (barring the options) is always treated as the expression, and the remaining ones as filenames.
• grep looks for all occurrences of the expression in its input and, by default, outputs the lines containing the expression.

Ex:
    $ grep "sales" e.lst
    2233 | shukla | g.m | sales | 12/12/52 | 20000
• When grep is used with multiple filenames, it displays the filenames along with the output.
    $ grep "sales" e.lst shortlist
    e.lst:2233 | shukla | g.m | sales | 12/12/52 | 20000
    shortlist:2233 | shukla | g.m | sales | 12/12/52 | 20000

Because grep is also a filter, it can search its standard input for the pattern and store the output in a file:
    $ who | grep itlaxmi > fff
Quoting in grep:
• Quoting is essential if the search string consists of more than one word, or uses any of the shell's special characters like *, $ etc.
• grep simply returns the prompt when the pattern can't be located.
    $ grep president shortlist
    $

grep options
    Option       Significance
    -c           Displays count of the lines that match
    -l           Displays list of the filenames only
    -n           Displays line numbers along with the lines
    -v           Doesn't display lines matching expression
    -i           Ignores case for matching
    -h           Omits filenames when handling multiple files
    -f flname    Takes expressions from file flname (egrep and fgrep only on older systems; POSIX grep also accepts it)
    -x           Displays lines matched in entirety (fgrep only on older systems; POSIX grep also accepts it)

Examples
• 1.
    $ grep -h mgr emp.lst shortlist
    1234 | kumar | mgr |accnts | 18/03/79 |15000
    1234 | kumar | mgr |accnts | 18/03/79 |15000
• 2.
    $ grep -c 'mgr' e.lst emp.lst
    e.lst:2
    emp.lst:1
• 3.
    $ grep -n 'mgr' e.lst emp.lst
    e.lst:2:9876 | sharma | mgr |product| 12/03 60 | 15000
    e.lst:5:1234 | kumar | mgr |accnts | 18/03/79 |1500
    emp.lst:5:1234 | kumar | mgr |accnts | 18/03/79 |15000

Examples
• 4.
    $ grep -v 'mgr' e.lst
    2233 | shukla | g.m | sales | 12/12/52 | 20000
    7898 | akash | dir.|mark. | 11/06/70 |9000
    3456 | tiwary | g.m |product| 05/02/89 |23000
• The -v option selects the lines that do not match, so it can be used to drop lines from the output; the file itself is not changed.
• 5.
    $ grep -l 'mgr' *.lst
    desg.lst
    desig.lst
    e1.lst
    e.lst
    emp1.lst
    emp.lst
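The options from the table can be tried on a small file without touching the real data. This sketch assumes a hypothetical staff.lst with unpadded fields, in which one designation is deliberately written in uppercase to show the effect of -i.

```shell
# Hypothetical records; "MGR" is uppercase on purpose.
workdir=$(mktemp -d)
cat > "$workdir/staff.lst" <<'EOF'
2233|shukla|g.m|sales
9876|sharma|mgr|product
7898|akash|dir.|mark.
1234|kumar|MGR|accnts
EOF

# -c counts matching lines; adding -i makes the match case-insensitive.
exact_count=$(grep -c 'mgr' "$workdir/staff.lst")
any_case_count=$(grep -c -i 'mgr' "$workdir/staff.lst")
echo "exact: $exact_count, ignoring case: $any_case_count"

# -n prefixes each matching line with its line number.
numbered=$(grep -n 'mgr' "$workdir/staff.lst")
echo "$numbered"

# -v inverts the selection: names of everyone who is not a manager.
non_managers=$(grep -v -i 'mgr' "$workdir/staff.lst" | cut -d'|' -f2)
printf '%s\n' "$non_managers"
```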
