
Processing Text

Course Objectives Covered

  1. Executing Commands at the Command Line (3036)

  2. Common Command Line Tasks (3036)

  3. Piping and Redirection (3036)

  4. Creating, Viewing, and Appending Files (3036)

The simplest text processing utility of all is cat, a derivative of the word concatenate. By default, it will display the entire contents of a file on the screen (standard output). However, a number of useful options can be used with it, including the following:

  • -b to number nonblank lines

  • -E to show a dollar sign ($) at the end of each line (marking the newline)

  • -T to show all tabs as "^I"

  • -v to show nonprinting characters except tabs and newlines

  • -A to show the same as -v combined with -E and -T

To illustrate the uses of cat, assume that there is a four-line file named example with the following contents:

How much wood
could a woodchuck chuck
if a woodchuck 
could chuck wood?

To view the contents of the file on the screen, exactly as they appear in the preceding example, the command is

cat example

To view the file with lines numbered, the command, and the output generated, will be

cat -b example
  1  How much wood
  2  could a woodchuck chuck
  3  if a woodchuck 
  4  could chuck wood?

Note the tab characters that follow the line numbers. They were added by the numbering process and are not in the file itself; they appear only in the display, as the following command shows (a tab stored in the file would appear as ^I):

cat -Ab example
  1  How much wood$
  2  could a woodchuck chuck$
  3  if a woodchuck$
  4  could chuck wood?$

The only nonprintable characters within the file are the newline characters at the end of each line, which appear as dollar signs.
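The difference between numbering every line and numbering only nonblank lines can be checked with a short script. This is a minimal sketch, assuming a cat that supports both -b and -n (an option not listed above that numbers all lines), as the GNU and BSD versions do. The demo file and its contents are hypothetical.

```shell
# Hypothetical demo file: two text lines separated by one blank line.
demo=$(mktemp)
printf 'first line\n\nthird line\n' > "$demo"

# -n numbers every line, including the blank one.
all_numbered=$(cat -n "$demo" | grep -c '^ *[0-9]')

# -b numbers only the nonblank lines.
nonblank_numbered=$(cat -b "$demo" | grep -c '^ *[0-9]')

echo "cat -n numbered $all_numbered lines; cat -b numbered $nonblank_numbered"
rm -f "$demo"
```

With this input, -n numbers three lines and -b numbers two.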

One of the most common uses of the cat utility is to quickly create a text file. From the command line, you can specify no file at all to display and redirect the output to a given filename. This then accepts keyboard input and places it in the new file until the end-of-file character is received (the key sequence is Ctrl+D, by default).

The following example includes a dollar sign ($) prompt to show this operation in process:

$ cat > example
Peter Piper picked a peck of pickled peppers
A peck of pickled peppers Peter Piper picked.
If Peter Piper picked a peck of pickled peppers,
Where's the peck of pickled peppers Peter Piper picked?
{press Ctrl+D}
$

The Ctrl+D sequence is pressed on a line by itself and signifies the end of the file. Viewing the contents of the directory (via the ls utility) will show that the file has now been created, and its contents can be viewed like this:

cat example

Note that the single redirection (>) creates a file named example if it did not exist before, and overwrites it if it did. To add to an existing file, use the append operator (>>).
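In a script, where nobody is available to press Ctrl+D, a here-document can stand in for the keyboard input. The sketch below assumes a POSIX shell; the scratch file held in the notes variable is made up for illustration.

```shell
notes=$(mktemp)   # hypothetical scratch file

# The here-document plays the role of typed input; EOF marks where it ends.
cat > "$notes" <<'EOF'
Peter Piper picked a peck of pickled peppers
EOF

# A second > would overwrite the file; >> appends to it instead.
cat >> "$notes" <<'EOF'
A peck of pickled peppers Peter Piper picked.
EOF

line_count=$(wc -l < "$notes" | tr -d ' ')
first_line=$(head -n 1 "$notes")
echo "$line_count lines; first line: $first_line"
rm -f "$notes"
```

Because the second command appends, the file ends up with both lines rather than only the last one.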

NOTE

The Ctrl+D keyboard sequence is the typical default for specifying an end-of-file operation. Like almost everything in Linux, this can be changed, customized, and so on. To see the settings for your session, use this command:

stty -a

and look for "eof = ".

A utility that is useful in limited circumstances is tac, which displays the contents of files in reverse line order (the name is cat spelled backward). Instead of displaying a file from line 1 to the end of the file, it shows the file from the last line back to line 1, as illustrated in the following example:

$ tac example
Where's the peck of pickled peppers Peter Piper picked?
If Peter Piper picked a peck of pickled peppers,
A peck of pickled peppers Peter Piper picked.
Peter Piper picked a peck of pickled peppers
$

nl, head, and tail

Three simple commands can be used to view all or parts of files: nl, head, and tail. The first, nl, is used to number the lines, and is similar to cat -b. Both will number the lines of display, and by default, neither will number blank lines. There are certain options that nl can utilize to alter the display:

  • -i allows you to change the increment (default is 1).

  • -v allows you to change the starting number (default 1).

  • -n changes the alignment of the display:

    • -nln aligns the display on the left.

    • -nrn aligns the display on the right.

    • -nrz uses leading zeros.

  • -s uses a specified string between the line number and the text (default is a tab).
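The nl options above can be combined in one command. The following is a small sketch rather than a recommended format: the three-line input file is hypothetical, and -w (which sets the width of the number column) is a standard nl option that is not covered in the list.

```shell
list=$(mktemp)   # hypothetical three-line file
printf 'alpha\nbeta\ngamma\n' > "$list"

# Start at 10, step by 5, pad with zeros to 3 digits, and put ": " after the number.
numbered=$(nl -v 10 -i 5 -n rz -w 3 -s ': ' "$list")
echo "$numbered"
# 010: alpha
# 015: beta
# 020: gamma
rm -f "$list"
```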

The second utility to examine is head. As the name implies, this utility is used to look at the top portion of a file: by default, the first 10 lines. You can change the number of lines displayed by using a dash followed by the number of lines to display; the equivalent and more portable spelling is the -n option followed by the number, as in head -n 3. The following examples assume there is a text file named numbers with 200 lines in it counting from "one" to "two hundred":

$ head numbers
one
two
three
four
five
six
seven
eight
nine
ten
$
$ head -3 numbers
one
two
three
$
$ head -50 numbers
one
two
three
{skipping for space purposes}
forty-eight
forty-nine
fifty
$

NOTE

When printing multiple files, head places a header before each listing identifying what file it is displaying. The -q option suppresses the headers.
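The header behavior described in the note can be observed with two small files. This sketch assumes a head that supports -q, as the GNU version does; the file contents are hypothetical.

```shell
first=$(mktemp); second=$(mktemp)   # hypothetical sample files
printf 'one\ntwo\nthree\n' > "$first"
printf 'uno\ndos\ntres\n' > "$second"

# -n 2 is the portable spelling of -2.
top_two=$(head -n 2 "$first")

# With two files, head prints a "==> name <==" header before each listing.
header_count=$(head -n 1 "$first" "$second" | grep -c '^==> ')

# -q suppresses those headers, leaving only the file contents.
quiet=$(head -q -n 1 "$first" "$second")

echo "$header_count headers without -q; with -q the output is:"
echo "$quiet"
rm -f "$first" "$second"
```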

The tail command has several modes in which it can operate. By default, it is the opposite of head, and shows the end of file rather than the beginning. Once again, it defaults to the number 10 to display, but that can be changed by using the dash and a number:

$ tail numbers
one hundred ninety-one
one hundred ninety-two
one hundred ninety-three
one hundred ninety-four
one hundred ninety-five
one hundred ninety-six
one hundred ninety-seven
one hundred ninety-eight
one hundred ninety-nine
two hundred
$
$ tail -3 numbers
one hundred ninety-eight
one hundred ninety-nine
two hundred
$
$ tail -50 numbers
one hundred fifty-one
one hundred fifty-two
one hundred fifty-three
{skipping for space purposes}
one hundred ninety-eight
one hundred ninety-nine
two hundred
$

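The tail utility goes beyond this functionality, however, by accepting a plus sign (+) in front of the number. This allows you to specify a starting line from which you will see the rest of the file. The historical form tail +50 is obsolete and is not accepted by every version of tail, so the portable -n form is used here. For example

$ tail -n +50 numbers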

This will start with line 50 (skipping the first 49) and display all the rest of the file—151 lines in this case. Another useful option is -f, which allows you to follow a file. The command

$ tail -f numbers

will display the last 10 lines of the file, but then stay open—following the file—and display any new lines that are appended to the file. To break out of the endless monitoring loop, you must press the interrupt key sequence, which is Ctrl+C by default on most systems.
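The line arithmetic above can be verified without a prepared file. This is a minimal sketch that generates a hypothetical 200-line file holding the digits 1 through 200 instead of the spelled-out words, and uses the portable -n spelling throughout. The -f option is left out because it does not return until interrupted.

```shell
nums=$(mktemp)   # hypothetical stand-in for the numbers file
awk 'BEGIN { for (i = 1; i <= 200; i++) print i }' > "$nums"

last_three=$(tail -n 3 "$nums")

# Start at line 50 and show everything after it.
from_50_first=$(tail -n +50 "$nums" | head -n 1)
from_50_count=$(tail -n +50 "$nums" | wc -l | tr -d ' ')

echo "tail -n +50 starts at $from_50_first and prints $from_50_count lines"
rm -f "$nums"
```

Skipping the first 49 of 200 lines leaves 151, which matches the count given in the text.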

NOTE

To find the interrupt key sequence for your session, use the command

stty -a

and look for "intr = ".

cut, paste, and join

The ability to separate columns that could constitute data fields from a file is provided by the cut utility. The default delimiter used is the tab, and the -f option is used to specify the desired field. For example, suppose there is a text file named august with three columns, looking like this:

one  two  three
four  five  six
seven  eight  nine
ten  eleven  twelve

Then the following command

cut -f2 august

will return

two
five
eight
eleven

However, the following example

cut -f1,3 august

will return the first and third fields:

one  three
four  six
seven  nine
ten  twelve

A number of options are available with this command; the two to be familiar with (besides -f) are -c and -d:

  • -c allows you to specify characters instead of fields.

  • -d allows you to specify a delimiter other than the tab.
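As a quick illustration of -d and -c, the sketch below cuts a single colon-separated record. The record is hypothetical, written in the style of an /etc/passwd entry.

```shell
# A hypothetical record in the colon-separated style of /etc/passwd.
record='alice:x:1001:1001:Alice Example:/home/alice:/bin/bash'

# Fields 1 and 7, using a colon as the delimiter instead of the tab.
login_and_shell=$(printf '%s\n' "$record" | cut -d: -f1,7)

# Characters 1 through 5, regardless of any delimiter.
first_five=$(printf '%s\n' "$record" | cut -c1-5)

echo "$login_and_shell"   # alice:/bin/bash
echo "$first_five"        # alice
```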

To illustrate how to use the other options, consider the ls -l command, which shows permissions, number of links, owner, group, size, date, and filename, all separated by whitespace. In the listing assumed here, three space characters separate the permissions from the number of links. If you only want to see who is saving files in the directory, and are not interested in the other data, you can use

ls -l | cut -d" " -f5

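Because the delimiter is a single space, each of the three consecutive spaces counts as its own delimiter. The command therefore skips the permissions (first field), the two empty fields between the spaces (second and third fields), and the number of links (fourth field), and displays the owner (fifth field), ignoring everything that follows. Another way to look at this is by character position: in this listing the permissions take up 10 characters, followed by 3 characters of whitespace, then the number of links, and the whitespace that follows. The owner therefore begins with the 16th character and continues for the length of the name. These positions depend on the version of ls and on the width of each column, so check them against your own listing. The command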

ls -l | cut -c16

will return the 16th character—the first letter of the owner's name. If an assumption is made that most users will use eight characters or less for their name, the command

ls -l | cut -c16-24

will return those entries in the name field.

The name of the file begins with the 55th character, but it can be impossible to determine how many characters after that to take because some filenames will be considerably longer than others. A solution to this is to begin with the 55th character, and not specify an ending character (meaning that the entire rest of the line is taken), as in this example:

ls -l | cut -c55-
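Because counting spaces and character positions is fragile, one common workaround is to squeeze each run of spaces down to a single space before cutting. The sketch below uses the -s option of tr, a standard option that this chapter does not otherwise cover. The listing line is hypothetical rather than real ls output.

```shell
# A hypothetical line of ls -l output, with three spaces after the permissions.
listing='drwxr-xr-x   2 root  staff  4096 Jan  1 12:00 docs'

# Counting every space as a delimiter, the owner is field 5, as described above.
owner_by_count=$(printf '%s\n' "$listing" | cut -d' ' -f5)

# After squeezing runs of spaces, the owner is field 3 whatever the column widths.
owner_squeezed=$(printf '%s\n' "$listing" | tr -s ' ' | cut -d' ' -f3)

# The filename is then field 9 through the end of the line.
name=$(printf '%s\n' "$listing" | tr -s ' ' | cut -d' ' -f9-)

echo "$owner_by_count $owner_squeezed $name"   # root root docs
```

The squeezed form keeps working when a wider link count or owner name shifts the columns, although it would also collapse repeated spaces inside a filename.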

Paste

Whereas the cut utility extracts fields from a file, paste and join combine them. The simpler of the two is paste: it has few features and merely takes one line from one source and combines it with the corresponding line from another source. For example, if the contents of fileone are

Indianapolis
Columbus
Peoria
Livingston
Scottsdale

And the contents of filetwo are

Indiana
Ohio
Illinois
Montana
Arizona

Then the following (including prompts) would be the display generated:

$ paste fileone filetwo
Indianapolis  Indiana
Columbus  Ohio
Peoria  Illinois
Livingston  Montana
Scottsdale  Arizona
$

If there were more lines in fileone than filetwo, the pasting would continue, but with blank entries following the tab. The tab character is always the default delimiter, but that can be changed to anything by using the -d option:

$ paste -d"," fileone filetwo
Indianapolis,Indiana
Columbus,Ohio
Peoria,Illinois
Livingston,Montana
Scottsdale,Arizona
$

You can also use the -s option to output all of fileone on a single line, followed by a newline and then all of filetwo:

$ paste -s fileone filetwo
Indianapolis  Columbus  Peoria  Livingston  Scottsdale  
Indiana  Ohio  Illinois  Montana  Arizona
$
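The case of files with different lengths, mentioned above, can be tried with a short script. This is a sketch with hypothetical three-line and two-line inputs; it also combines -s with -d, which the examples above use only separately.

```shell
cities=$(mktemp); states=$(mktemp)   # hypothetical inputs of unequal length
printf 'Indianapolis\nColumbus\nPeoria\n' > "$cities"
printf 'Indiana\nOhio\n' > "$states"

# The third line has no partner, so nothing follows its delimiter.
pasted=$(paste -d',' "$cities" "$states")

# -s joins all lines of one file into a single line.
serial=$(paste -s -d',' "$cities")

echo "$pasted"
# Indianapolis,Indiana
# Columbus,Ohio
# Peoria,
echo "$serial"   # Indianapolis,Columbus,Peoria
rm -f "$cities" "$states"
```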

Join

You can think of the join utility as a greatly enhanced version of paste. It is critically important, however, to know that the utility can only work if the files being joined share a common field. For example, if join were used in the same example as paste was earlier, the result would be

$ join fileone filetwo
$

In other words, there is no display. join must find a common field between the files in question and, by default, expects that common field to be the first. For example, assume that fileone now contains these entries:

11111  Indianapolis
22222  Columbus
33333  Peoria
44444  Livingston
55555  Scottsdale

And the contents of filetwo are

11111  Indiana  500 race
22222  Ohio  Buckeye State
33333  Illinois  Wrigley Field
44444  Montana  Yellowstone Park
55555  Arizona  Grand Canyon

Then the following (including prompts) would be the display generated:

$ join fileone filetwo
11111  Indianapolis  Indiana  500 race
22222  Columbus  Ohio  Buckeye State
33333  Peoria  Illinois  Wrigley Field
44444  Livingston  Montana  Yellowstone Park
55555  Scottsdale  Arizona  Grand Canyon
$

The commonality of the first field was identified and the matching entries were combined. Whereas paste blindly took from each file to create the display, join combines only lines whose join fields match exactly and, of critical importance, it expects both files to be sorted on that field. It reads the two files in a single forward pass, so a line that is out of order can cause later matches to be missed. For example, suppose filetwo had an additional line in the middle:

11111  Indiana  500 race
22222  Ohio  Buckeye State
66666  Tennessee  Smokey Mountains
33333  Illinois  Wrigley Field
44444  Montana  Yellowstone Park
55555  Arizona  Grand Canyon

Then the following (including prompts) would be the display generated:

$ join fileone filetwo
11111  Indianapolis  Indiana  500 race
22222  Columbus  Ohio  Buckeye State
$

As soon as join meets the out-of-order line, the remaining matches are lost. Because it assumes sorted input, join keeps advancing through fileone looking for a key at least as large as 66666, reaches the end of that file, and stops; it never goes back to pair 33333, 44444, and 55555 with the lines that follow in filetwo. (Some versions of join also print a warning that the input is not in sorted order.) To illustrate one more time, using the original filetwo:

$ tac filetwo > filethree
$ join fileone filethree
55555  Scottsdale  Arizona  Grand Canyon
$

Even though a match exists for every line in both files, only one match is found, because filethree is in descending order while fileone is in ascending order.

NOTE

It is highly recommended that you overcome problems with join by first sorting each of the files to be used to get them in like order.
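The recommendation in the note can be sketched as follows. The file names and contents are hypothetical, and LC_ALL=C is set on both sort and join as an assumption so that the two tools agree on the ordering.

```shell
dir=$(mktemp -d)   # hypothetical working directory
printf '11111 Indianapolis\n22222 Columbus\n33333 Peoria\n' > "$dir/cities"
# Deliberately out of order on the join field.
printf '33333 Illinois\n11111 Indiana\n22222 Ohio\n' > "$dir/states"

# Sort each file so the join fields are in the same order in both.
LC_ALL=C sort "$dir/cities" > "$dir/cities.sorted"
LC_ALL=C sort "$dir/states" > "$dir/states.sorted"

joined=$(LC_ALL=C join "$dir/cities.sorted" "$dir/states.sorted")
echo "$joined"
# 11111 Indianapolis Indiana
# 22222 Columbus Ohio
# 33333 Peoria Illinois
rm -rf "$dir"
```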

You are not limited to join's defaults of matching on the first field of each file and outputting all columns. The -1 option lets you specify which field to use as the matching field in the first file, whereas the -2 option does the same for the second file. For example, if the second field of fileone were to match the third field of filetwo, the syntax would be

$ join -1 2 -2 3 fileone filetwo

The -o option is used to specify output fields in the format {file.field}. Thus to only print the second field of fileone and the third field of filetwo on matching lines, the syntax would be

$ join -o 1.2,2.3 fileone filetwo
Indianapolis  500
Columbus  Buckeye
Peoria  Wrigley
Livingston  Yellowstone
Scottsdale  Grand
$
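The -1, -2, and -o options can be used together. In this sketch the two hypothetical files keep their key in different fields, and both are already in order on that key.

```shell
dir=$(mktemp -d)   # hypothetical working directory
printf 'Indianapolis 11111\nColumbus 22222\n' > "$dir/cities"   # key is field 2
printf '11111 Indiana\n22222 Ohio\n' > "$dir/states"            # key is field 1

# Match field 2 of the first file against field 1 of the second,
# then print only the city (1.1) and the state (2.2).
paired=$(LC_ALL=C join -1 2 -2 1 -o 1.1,2.2 "$dir/cities" "$dir/states")
echo "$paired"
# Indianapolis Indiana
# Columbus Ohio
rm -rf "$dir"
```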

Sort, Count, Format, and Translate

It is often necessary to not only display text, but to manipulate and modify it a bit before the output is shown, or simply gather information on it. Four utilities are examined in this section: sort, wc, fmt, and tr.

sort

The sort utility sorts the lines of a file in alphabetical order, and displays the output. The importance of alphabetical order, versus any other, cannot be overstated. For example, assume that the fileone file contains the following lines:

Indianapolis Indiana
Columbus
Peoria
Livingston
Scottsdale
1
2
3
4
5
6
7
8
9
10
11
12

When a sort is done on the file, the result becomes

$ sort fileone
1
10
11
12
2
3
4
5
6
7
8
9
Columbus
Indianapolis Indiana
Livingston
Peoria
Scottsdale
$

The cities are "correctly" sorted in alphabetical order. The numbers, however, are also in alphabetical order, which puts every number starting with "1" before every number starting with "2," and then every number starting with "3," and so on.

Thankfully, the sort utility includes some options to add a great deal of flexibility to the output. Among those options are the following:

  • -d to sort in dictionary order, considering only blanks and alphanumeric characters

  • -f to sort lowercase letters the same as uppercase

  • -i to ignore nonprinting characters

  • -n to sort in numerical order versus alphabetical

  • -r to reverse the order of the output

Thus the display can be changed to

$ sort -n fileone
Columbus
Indianapolis Indiana
Livingston
Peoria
Scottsdale
1
2
3
4
5
6
7
8
9
10
11
12
$

NOTE

The sort utility treats blank lines like any other lines; because they are empty, they are placed at the beginning of the output. The -b option does not affect them: it tells sort to ignore leading blanks (spaces and tabs) at the start of each line when comparing.
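The effect of the -n, -r, and -f options can be seen on a few lines of input. This is a minimal sketch with made-up data; LC_ALL=C is an assumption added so that the default order is plain byte order on any system.

```shell
# Default order compares character by character, so "10" sorts before "2".
plain=$(printf '10\n9\n2\n' | LC_ALL=C sort)

# -n compares by numeric value; adding -r reverses the result.
numeric=$(printf '10\n9\n2\n' | LC_ALL=C sort -n)
reversed=$(printf '10\n9\n2\n' | LC_ALL=C sort -nr)

# In byte order uppercase letters come first; -f treats "a" like "A".
folded=$(printf 'B\na\nC\n' | LC_ALL=C sort -f)

echo "$plain" | paste -s -d' ' -      # 10 2 9
echo "$numeric" | paste -s -d' ' -    # 2 9 10
echo "$reversed" | paste -s -d' ' -   # 10 9 2
echo "$folded" | paste -s -d' ' -     # a B C
```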

wc

The wc utility (named for "word count") displays information about the file in terms of three values: number of lines, words, and characters. The last entry in the output is the name of the file, thus the output would be

$ wc fileone
  17  18  86  fileone
$

You can choose to see only some of the output by using the following options:

  • -c to show only the number of bytes/characters

  • -l to see only the number of lines

  • -w to see only the number of words

In all cases, the name of the file still appears, for example

$ wc -l fileone
  17  fileone
$

To keep the name from appearing, have wc read the file from standard input, for example with input redirection:

$ wc -l < fileone
  17
$
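Reading from standard input is also the usual way to capture a bare count in a shell variable. The sketch below uses a hypothetical two-line file; the tr -d ' ' step is a precaution, since some versions of wc pad the number with leading spaces.

```shell
sample=$(mktemp)   # hypothetical two-line file
printf 'one two\nthree\n' > "$sample"

lines=$(wc -l < "$sample" | tr -d ' ')
words=$(wc -w < "$sample" | tr -d ' ')
bytes=$(wc -c < "$sample" | tr -d ' ')

echo "$lines lines, $words words, $bytes bytes"   # 2 lines, 3 words, 14 bytes
rm -f "$sample"
```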

fmt

The fmt utility formats the text by creating output to a specific width. The default width is 75 characters, but a different value can be specified with the -w option. Short lines are combined to create longer ones unless the -s option is used, and the existing spacing between words is kept unless -u is used. The -u option enforces uniformity and places one space between words and two spaces at the end of each sentence.

The following example shows how the fileone lines are combined to create a 75-character display:

$ fmt fileone
Indianapolis Indiana Columbus Peoria Livingston Scottsdale 1 2 3 4 5 6
7 8 9 10 11 12
$

To change the output to 60 characters, use this example:

$ fmt -w60 fileone
Indianapolis Indiana Columbus Peoria Livingston Scottsdale
1 2 3 4 5 6 7 8 9 10 11 12
$

NOTE

A bare number given as the first option is treated as a width, thus fmt -60 fileone will give the same result as fmt -w60 fileone.
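Because the exact line breaks that fmt chooses can differ between versions, the following sketch checks only two properties of the result: no output line is wider than the requested width, and no words are lost. The input file is hypothetical.

```shell
words=$(mktemp)   # hypothetical file with one short word per line
printf 'Indianapolis\nColumbus\nPeoria\nLivingston\nScottsdale\n' > "$words"

wrapped=$(fmt -w 30 "$words")

# Length of the longest output line.
longest=$(printf '%s\n' "$wrapped" | awk '{ if (length($0) > max) max = length($0) } END { print max + 0 }')

# Number of words after formatting.
word_total=$(printf '%s\n' "$wrapped" | wc -w | tr -d ' ')

echo "$wrapped"
echo "longest line: $longest characters; words: $word_total"
rm -f "$words"
```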

tr

The tr (translate) utility can convert one set of characters to another. Use the following example to change all lowercase characters to uppercase:

$ tr '[a-z]' '[A-Z]' < fileone
INDIANAPOLIS INDIANA
COLUMBUS
PEORIA
LIVINGSTON
SCOTTSDALE
$

NOTE

It is extremely important to realize that the syntax of tr only accepts two character sets, not the name of the file. You must feed the name of the file into the utility by directing input (as in the example given), by piping to it (|), or using a similar operation.

Not only can you give character sets as literal strings, but you can also specify a number of named character classes, each written inside [: and :], including

  • lower—All lowercase characters

  • upper—All uppercase characters

  • print—All printable characters

  • punct—Punctuation characters

  • space—All whitespace (blank can be used for horizontal whitespace only)

  • alnum—Alpha characters and numbers

  • digit—Numbers only

  • cntrl—Control characters

  • alpha—Letters only

  • graph—Printable characters but not whitespace

For example, the output shown earlier can also be obtained like this:

$ tr '[:lower:]' '[:upper:]' < fileone
INDIANAPOLIS INDIANA
COLUMBUS
PEORIA
LIVINGSTON
SCOTTSDALE
$
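Since tr reads only standard input, redirection and a pipe are interchangeable ways of feeding it. The sketch below shows both on a hypothetical one-line file and also runs the translation in the other direction.

```shell
sample=$(mktemp)   # hypothetical one-line file
printf 'Peoria 12\n' > "$sample"

# Feed the file by redirection and by a pipe; the result is the same.
via_redirect=$(tr '[:lower:]' '[:upper:]' < "$sample")
via_pipe=$(cat "$sample" | tr '[:lower:]' '[:upper:]')

# Swapping the two classes converts uppercase to lowercase instead.
lowered=$(tr '[:upper:]' '[:lower:]' < "$sample")

echo "$via_redirect"   # PEORIA 12
echo "$lowered"        # peoria 12
rm -f "$sample"
```

Digits and spaces pass through unchanged because they belong to neither class.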

Other Useful Utilities

A number of other useful text utilities are included with Linux. Some of these have limited usefulness and are intended only for a specific purpose, but are given because knowing of their existence and purpose can make your life with Linux considerably easier.

In alphabetical order, the additional utilities are as follows:

  • expand—Allows you to expand tab characters into spaces. The default number of spaces per tab is 8, but this can be changed using the -t option. The opposite of this utility is unexpand.

  • file—This utility will look at an entry's signature and report what type of file it is—ASCII text, GIF image, and so on. The definitions it returns (and thus the files it can correctly identify) are defined in a file called magic. This file typically resides in /usr/share/misc or /etc.

  • more—Used to display only one screen of output at a time.

  • od—Can perform an octal dump to show the contents of files other than ASCII text files. Used with the -x option, it does a hexadecimal dump, and with the -c option, it shows printable characters as themselves and other bytes as backslash escapes or octal codes.

  • pr—Converts the file into a format suitable for printed pages, including a default header with date and time of last modification, filename, and page numbers. The default header can be overwritten with the -h option, and the -l option allows you to specify the number of lines to include on each page (the default is 66). Default page width is 72 characters, but a different value can be specified with the -w option. The -d option can be used to double-space the output, and -m can be used to print numerous files in column format.

  • split—Chops a single file into multiple files. The default is that a new file is created for every 1,000 lines of the original file. Using the -b option, you can avoid the thousand-line splitting and specify a number of bytes to be put into each output file, or use -l to specify a number of lines.

  • uniq—This utility examines the entries in a file, comparing each line with the one directly preceding it, and drops repeated lines so that one copy of each remains. Because only adjacent lines are compared, the input is usually sorted first.
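Two of the utilities in this list are easy to try on short input. This sketch uses made-up data to show why uniq is normally paired with sort, and how expand fills out to the next tab stop.

```shell
# Hypothetical input with a repeated line that is not adjacent to its twin.
unsorted='b
a
b'

# uniq alone compares only neighboring lines, so all three lines survive.
raw_count=$(printf '%s\n' "$unsorted" | uniq | wc -l | tr -d ' ')

# Sorting first puts the duplicates next to each other.
deduped=$(printf '%s\n' "$unsorted" | LC_ALL=C sort | uniq)

# With tab stops every 4 columns, the tab after "a" becomes 3 spaces.
spaced=$(printf 'a\tb\n' | expand -t 4)

echo "uniq alone kept $raw_count lines"   # uniq alone kept 3 lines
echo "$deduped" | paste -s -d' ' -        # a b
echo "[$spaced]"                          # [a   b]
```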
