Processing Text

Course Objectives Covered

  1. Executing Commands at the Command Line (3036)

  2. Common Command Line Tasks (3036)

  3. Piping and Redirection (3036)

  4. Creating, Viewing, and Appending Files (3036)

The simplest text processing utility of all is cat, a derivative of the word concatenate. By default, it will display the entire contents of a file on the screen (standard output). However, a number of useful options can be used with it, including the following:

  • -b to number nonblank lines

  • -E to show a dollar sign ($) at the end of each line (the newline)

  • -T to show all tabs as "^I"

  • -v to show nonprinting characters except tabs and newlines

  • -A to show the same as -v combined with -E and -T

To illustrate the uses of cat, assume that there is a four-line file named example with the following contents:

How much wood
could a woodchuck chuck
if a woodchuck 
could chuck wood?

To view the contents of the file on the screen, exactly as they appear in the preceding example, the command is

cat example

To view the file with lines numbered, the command, and the output generated, will be

cat -b example
  1  How much wood
  2  could a woodchuck chuck
  3  if a woodchuck 
  4  could chuck wood?

Note the inclusion of the tab characters that were not there before, but were added by the numbering process. They are not truly in the file, but only added to the display, as can be witnessed with the following command:

cat -Ab example
  1  How much wood$
  2  could a woodchuck chuck$
  3  if a woodchuck$ 
  4  could chuck wood?$

The only nonprintable characters within the file are the newlines at the end of each line, which appear as dollar signs.
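
As a minimal sketch of the -E and -T options described above, the following builds a throwaway two-line file containing a single tab and shows how the tab and the line endings are marked. The file name sample and the temporary directory are assumptions made for illustration, and GNU cat is assumed.

```shell
# Create a scratch directory and a sample file with one embedded tab.
workdir=$(mktemp -d)
printf 'How much\twood\ncould a woodchuck chuck\n' > "$workdir/sample"

# -E appends $ to each line end; -T displays each tab as ^I (GNU cat).
marked=$(cat -ET "$workdir/sample")
printf '%s\n' "$marked"
# prints:
# How much^Iwood$
# could a woodchuck chuck$

rm -r "$workdir"
```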

One of the most common uses of the cat utility is to quickly create a text file. From the command line, you can specify no file at all to display and redirect the output to a given filename. This then accepts keyboard input and places it in the new file until the end-of-file character is received (the key sequence is Ctrl+D, by default).

The following example includes a dollar sign ($) prompt to show this operation in process:

$ cat > example
Peter Piper picked a peck of pickled peppers
A peck of pickled peppers Peter Piper picked.
If Peter Piper picked a peck of pickled peppers,
Where's the peck of pickled peppers Peter Piper picked?
{press Ctrl+D}
$

The Ctrl+D sequence is pressed on a line by itself and signifies the end of the file. Viewing the contents of the directory (via the ls utility) will show that the file has now been created, and its contents can be viewed like this:

cat example

Note that the single redirection (>) creates a file named example if it did not exist before, and overwrites it if it did. To add to an existing file, use the append operator (>>).
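
The difference between the two redirections can be sketched as follows. The scratch directory and the three sample lines are assumptions for illustration only.

```shell
# Work in a scratch directory so no real file named example is touched.
workdir=$(mktemp -d)
target="$workdir/example"

printf 'first line\n' > "$target"     # creates the file
printf 'second line\n' > "$target"    # overwrites it: "first line" is gone
printf 'third line\n' >> "$target"    # appends to the existing contents

contents=$(cat "$target")
printf '%s\n' "$contents"
# prints:
# second line
# third line

rm -r "$workdir"
```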

NOTE

The Ctrl+D keyboard sequence is the typical default for specifying an end-of-file operation. Like almost everything in Linux, this can be changed, customized, and so on. To see the settings for your session, use this command:

stty -a

and look for "eof = ".

There is a utility of use in limited circumstances—tac—which will display the contents of files in reverse order (tac is cat in reverse order). Instead of displaying a file from line 1 to the end of the file, it shows the file from the end of the file to line 1, as illustrated in the following example:

$ tac example
Where's the peck of pickled peppers Peter Piper picked?
If Peter Piper picked a peck of pickled peppers,
A peck of pickled peppers Peter Piper picked.
Peter Piper picked a peck of pickled peppers
$

nl, head, and tail

Three simple commands can be used to view all or parts of files: nl, head, and tail. The first, nl, is used to number the lines, and is similar to cat -b. Both will number the lines of display, and by default, neither will number blank lines. There are certain options that nl can utilize to alter the display:

  • -i allows you to change the increment (default is 1).

  • -v allows you to change the starting number (default is 1).

  • -n changes the alignment of the display:

    • -nln aligns the display on the left.

    • -nrn aligns the display on the right.

    • -nrz uses leading zeros.

  • -s uses a specified string between the line number and the text (default is a tab).
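
A short sketch combining these nl options follows. The three-line sample file is hypothetical, and the -w option (number width), which is not covered in the list above, is added here only to keep the output narrow.

```shell
# Build a three-line sample file in a scratch directory.
workdir=$(mktemp -d)
printf 'alpha\nbeta\ngamma\n' > "$workdir/sample"

# Start at 10, step by 5, zero-pad to width 3, and separate with ": ".
numbered=$(nl -v 10 -i 5 -n rz -w 3 -s ': ' "$workdir/sample")
printf '%s\n' "$numbered"
# prints:
# 010: alpha
# 015: beta
# 020: gamma

rm -r "$workdir"
```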

The second utility to examine is head. As the name implies, this utility is used to look at the top portion of a file: by default, the first 10 lines. You can change the number of lines displayed by using a dash followed by the number of lines to display. The following examples assume there is a text file named numbers with 200 lines in it counting from "one" to "two hundred":

$ head numbers
one
two
three
four
five
six
seven
eight
nine
ten
$
$ head -3 numbers
one
two
three
$
$ head -50 numbers
one
two
three
{skipping for space purposes}
forty-eight
forty-nine
fifty
$

NOTE

When printing multiple files, head places a header before each listing identifying what file it is displaying. The -q option suppresses the headers.

The tail command has several modes in which it can operate. By default, it is the opposite of head, and shows the end of the file rather than the beginning. Once again, it displays 10 lines by default, but that can be changed by using the dash and a number:

$ tail numbers
one hundred ninety-one
one hundred ninety-two
one hundred ninety-three
one hundred ninety-four
one hundred ninety-five
one hundred ninety-six
one hundred ninety-seven
one hundred ninety-eight
one hundred ninety-nine
two hundred
$
$ tail -3 numbers
one hundred ninety-eight
one hundred ninety-nine
two hundred
$
$ tail -50 numbers
one hundred fifty-one
one hundred fifty-two
one hundred fifty-three
{skipping for space purposes}
one hundred ninety-eight
one hundred ninety-nine
two hundred
$

The tail utility goes beyond this functionality, however, by including a plus (+) option. This allows you to specify the line at which the display starts; everything from that line to the end of the file is shown. For example:

$ tail +50 numbers

This will start with line 50 (skipping the first 49) and display all the rest of the file, 151 lines in this case. Some versions of tail do not accept the bare +50 form; tail -n +50 numbers is the portable equivalent. Another useful option is -f, which allows you to follow a file. The command

$ tail -f numbers

will display the last 10 lines of the file, but then stay open—following the file—and display any new lines that are appended to the file. To break out of the endless monitoring loop, you must press the interrupt key sequence, which is Ctrl+C by default on most systems.
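
The head and tail behavior described in this section can be sketched with the portable -n forms. As an assumption for illustration, the numbers file here is generated with seq and holds the digits 1 through 200 rather than the spelled-out words used in the examples above.

```shell
# Generate a 200-line stand-in for the numbers file.
workdir=$(mktemp -d)
seq 1 200 > "$workdir/numbers"

first3=$(head -n 3 "$workdir/numbers")    # lines 1 through 3
last3=$(tail -n 3 "$workdir/numbers")     # lines 198 through 200
# -n +50 starts at line 50; count how many lines that leaves.
from50=$(tail -n +50 "$workdir/numbers" | wc -l | tr -d ' ')

printf '%s\n' "$first3" "$last3" "$from50"

rm -r "$workdir"
```

The final value printed is 151, matching the count given in the text for a display that starts at line 50.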

NOTE

To find the interrupt key sequence for your session, use the command

stty -a

and look for "intr = ".

cut, paste, and join

The ability to separate columns that could constitute data fields from a file is provided by the cut utility. The default delimiter used is the tab, and the -f option is used to specify the desired field. For example, suppose there is a text file named august with three columns, looking like this:

one  two  three
four  five  six
seven  eight  nine
ten  eleven  twelve

Then the following command

cut -f2 august

will return

two
five
eight
eleven

Similarly, the following example

cut -f1,3 august

will return the first and third fields:

one  three
four  six
seven  nine
ten  twelve

A number of options are available with this command; the two to be familiar with (besides -f) are -c and -d:

  • -c allows you to specify characters instead of fields.

  • -d allows you to specify a delimiter other than the tab.
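
A minimal sketch of -d and -c follows, using a hypothetical colon-delimited file named accounts; the names and numbers in it are made up for illustration.

```shell
# Build a small colon-delimited sample file.
workdir=$(mktemp -d)
printf 'alice:x:1000:Alice Example\nbob:x:1001:Bob Example\n' > "$workdir/accounts"

# -d: makes the colon the delimiter; -f1,3 selects fields one and three.
names_ids=$(cut -d: -f1,3 "$workdir/accounts")
# -c1-3 selects the first three characters of each line.
prefixes=$(cut -c1-3 "$workdir/accounts")

printf '%s\n' "$names_ids" "$prefixes"
# prints:
# alice:1000
# bob:1001
# ali
# bob

rm -r "$workdir"
```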

To illustrate how to use the other options, consider that ls -l shows permissions, number of links, owner, group, size, date, and filename, all separated by whitespace. The examples that follow assume a listing with three spaces between the permissions and the number of links; exact column widths vary between systems. If you only want to see who is saving files in the directory, and are not interested in the other data, you can use

ls -l | cut -d" " -f5

Because every single space counts as a delimiter, this will skip the permissions (first field), the two empty fields produced by the run of three spaces (second and third fields), and the number of links (fourth field), and display the owner (fifth field), ignoring everything following. Another way to look at this is that, in this layout, the permissions take up 10 characters, followed by whitespace of 3 characters, then the number of links, and whitespace that follows. The owner then begins with the 16th character and continues for the length of the name. The command

ls -l | cut -c16

will return the 16th character, the first letter of the owner's name. If an assumption is made that most users will use eight characters or fewer for their name, the command

ls -l | cut -c16-24

will return characters 16 through 24, which is enough to cover such names.

The name of the file begins with the 55th character, but it can be impossible to determine how many characters after that to take because some filenames will be considerably longer than others. A solution to this is to begin with the 55th character, and not specify an ending character (meaning that the entire rest of the line is taken), as in this example:

ls -l | cut -c55-
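
Because the fixed positions above depend on how a particular system pads its ls -l columns, one hedged alternative is to squeeze each run of spaces into a single space before cutting. The sketch below uses tr -s, an option not covered in this chapter, and a canned line in the style of ls -l output; the owner, group, and filename in it are hypothetical.

```shell
# A canned line in the style of ls -l output (made-up owner and file).
listing='-rw-r--r--   1 alice  staff  120 Jan  1 12:00 notes.txt'

# tr -s ' ' squeezes each run of spaces into one, so the fields are
# numbered consistently: 1=permissions, 2=links, 3=owner.
owner=$(printf '%s\n' "$listing" | tr -s ' ' | cut -d' ' -f3)
printf '%s\n' "$owner"
# prints: alice
```

With real output the same pipeline would be written as ls -l | tr -s ' ' | cut -d' ' -f3, with the caveat that the first line of a listing is usually a "total" line that has no owner field.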

Paste

Whereas the cut utility extracts fields from a file, paste and join combine them. The simpler of the two is paste: it has few features and merely takes one line from one source and combines it with another line from another source. For example, if the contents of fileone are

Indianapolis
Columbus
Peoria
Livingston
Scottsdale

And the contents of filetwo are

Indiana
Ohio
Illinois
Montana
Arizona

Then the following (including prompts) would be the display generated:

$ paste fileone filetwo
Indianapolis  Indiana
Columbus  Ohio
Peoria  Illinois
Livingston  Montana
Scottsdale  Arizona
$

If there were more lines in fileone than filetwo, the pasting would continue, but with blank entries following the tab. The tab character is always the default delimiter, but that can be changed to anything by using the -d option:

$ paste -d"," fileone filetwo
Indianapolis,Indiana
Columbus,Ohio
Peoria,Illinois
Livingston,Montana
Scottsdale,Arizona
$

You can also use the -s option to output all of fileone on a single line, followed by a newline and then filetwo:

$ paste -s fileone filetwo
Indianapolis  Columbus  Peoria  Livingston  Scottsdale  
Indiana  Ohio  Illinois  Montana  Arizona
$
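
The uneven-length case and the combination of -s with -d can be sketched as follows. The files cities and states are shortened, hypothetical stand-ins for fileone and filetwo.

```shell
# Three cities but only two states, to show the uneven case.
workdir=$(mktemp -d)
printf 'Indianapolis\nColumbus\nPeoria\n' > "$workdir/cities"
printf 'Indiana\nOhio\n' > "$workdir/states"

# The third output line has an empty second column after the comma.
side_by_side=$(paste -d, "$workdir/cities" "$workdir/states")
# -s with -d, joins all lines of one file into one comma-separated line.
one_line=$(paste -s -d, "$workdir/cities")

printf '%s\n' "$side_by_side" "$one_line"
# prints:
# Indianapolis,Indiana
# Columbus,Ohio
# Peoria,
# Indianapolis,Columbus,Peoria

rm -r "$workdir"
```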

Join

You can think of the join utility as a greatly enhanced version of paste. It is critically important, however, to know that the utility can only work if the files being joined share a common field. For example, if join were used in the same example as paste was earlier, the result would be

$ join fileone filetwo
$

In other words, there is no display. join must find a common field between the files in question and, by default, expects that common field to be the first. For example, assume that fileone now contains these entries:

11111  Indianapolis
22222  Columbus
33333  Peoria
44444  Livingston
55555  Scottsdale

And the contents of filetwo are

11111  Indiana  500 race
22222  Ohio  Buckeye State
33333  Illinois  Wrigley Field
44444  Montana  Yellowstone Park
55555  Arizona  Grand Canyon

Then the following (including prompts) would be the display generated:

$ join fileone filetwo
11111  Indianapolis  Indiana  500 race
22222  Columbus  Ohio  Buckeye State
33333  Peoria  Illinois  Wrigley Field
44444  Livingston  Montana  Yellowstone Park
55555  Scottsdale  Arizona  Grand Canyon
$

The commonality of the first field was identified and the matching entries were combined. Whereas paste blindly took from each file to create the display, join combines only lines whose join fields match exactly and, of critical importance, it expects both files to be sorted on that field. To illustrate, suppose filetwo had an additional, out-of-order line in the middle:

11111  Indiana  500 race
22222  Ohio  Buckeye State
66666  Tennessee  Smokey Mountains
33333  Illinois  Wrigley Field
44444  Montana  Yellowstone Park
55555  Arizona  Grand Canyon

Then the following (including prompts) would be the display generated:

$ join fileone filetwo
11111  Indianapolis  Indiana  500 race
22222  Columbus  Ohio  Buckeye State
$

As soon as the files fall out of sorted order, the remaining matches are missed. join reads both files in a single pass, always moving past whichever current line has the smaller join field, and it never goes back to lines it has already passed. Here, 66666 sorts after every remaining entry in fileone, so fileone is exhausted before the 33333 line of filetwo is ever reached. To illustrate one more time, using the original filetwo:

$ tac filetwo > filethree
$ join fileone filethree
55555  Scottsdale  Arizona  Grand Canyon
$

Even though a match exists for every line in both files, only one match is found, because filethree is in descending order.

NOTE

You can avoid these problems with join by first sorting each of the files on the join field so that both are in the same order.
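
A minimal sketch of the sort-then-join approach follows. The file names cities and states and their deliberately shuffled contents are assumptions for illustration, and LC_ALL=C is set so that sort and join agree on the ordering.

```shell
# Two files that share a first field but are not in sorted order.
workdir=$(mktemp -d)
printf '33333 Peoria\n11111 Indianapolis\n22222 Columbus\n' > "$workdir/cities"
printf '22222 Ohio\n33333 Illinois\n11111 Indiana\n' > "$workdir/states"

# Sort both files on the join field with the same collation join will use.
LC_ALL=C sort "$workdir/cities" > "$workdir/cities.sorted"
LC_ALL=C sort "$workdir/states" > "$workdir/states.sorted"

joined=$(LC_ALL=C join "$workdir/cities.sorted" "$workdir/states.sorted")
printf '%s\n' "$joined"
# prints:
# 11111 Indianapolis Indiana
# 22222 Columbus Ohio
# 33333 Peoria Illinois

rm -r "$workdir"
```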

You are not limited to join's defaults of matching on the first field and outputting all columns. The -1 option lets you specify what field to use as the matching field in fileone, whereas the -2 option lets you specify what field to use as the matching field in filetwo. For example, if the second field of fileone were to match with the third field of filetwo, the syntax would be

$ join -1 2 -2 3 fileone filetwo

The -o option is used to specify output fields in the format {file.field}. Thus to only print the second field of fileone and the third field of filetwo on matching lines, the syntax would be

$ join -o 1.2 2.3 fileone filetwo
Indianapolis  500
Columbus  Buckeye
Peoria  Wrigley
Livingston  Yellowstone
Scottsdale  Grand
$
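
The same field selection can be sketched with the comma-separated form of the -o list, which is the form POSIX specifies. The two short files below are hypothetical and already sorted on the first field.

```shell
# Two small files, already sorted on the shared first field.
workdir=$(mktemp -d)
printf '11111 Indianapolis\n22222 Columbus\n' > "$workdir/cities"
printf '11111 Indiana 500\n22222 Ohio Buckeye\n' > "$workdir/states"

# -o 1.2,2.3 prints field 2 of the first file and field 3 of the second.
selected=$(LC_ALL=C join -o 1.2,2.3 "$workdir/cities" "$workdir/states")
printf '%s\n' "$selected"
# prints:
# Indianapolis 500
# Columbus Buckeye

rm -r "$workdir"
```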

Sort, Count, Format, and Translate

It is often necessary to not only display text, but to manipulate and modify it a bit before the output is shown, or simply gather information on it. Four utilities are examined in this section: sort, wc, fmt, and tr.

sort

The sort utility sorts the lines of a file in alphabetical order, and displays the output. The importance of alphabetical order, versus any other, cannot be overstated. For example, assume that the fileone file contains the following lines:

Indianapolis Indiana
Columbus
Peoria
Livingston
Scottsdale
1
2
3
4
5
6
7
8
9
10
11
12

When a sort is done on the file, the result becomes

$ sort fileone
1
10
11
12
2
3
4
5
6
7
8
9
Columbus
Indianapolis Indiana
Livingston
Peoria
Scottsdale
$

The cities are "correctly" sorted in alphabetical order. The numbers, however, are also in alphabetical order, which puts every number starting with "1" before every number starting with "2," and then every number starting with "3," and so on.

Thankfully, the sort utility includes some options to add a great deal of flexibility to the output. Among those options are the following:

  • -d to sort in dictionary (phone directory) order, considering only letters, digits, and blanks

  • -f to sort lowercase letters the same as uppercase

  • -i to ignore nonprinting characters

  • -n to sort in numerical order versus alphabetical

  • -r to reverse the order of the output

Thus the display can be changed to

$ sort -n fileone
Columbus
Indianapolis Indiana
Livingston
Peoria
Scottsdale
1
2
3
4
5
6
7
8
9
10
11
12
$
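
The difference between the default ordering and -n, along with -r, can be sketched on a three-line file. The file name values and its contents are assumptions for illustration.

```shell
# Three numbers chosen so that text order and numeric order differ.
workdir=$(mktemp -d)
printf '10\n9\n2\n' > "$workdir/values"

text_order=$(sort "$workdir/values")          # character by character
numeric_order=$(sort -n "$workdir/values")    # by numeric value
reversed=$(sort -nr "$workdir/values")        # by numeric value, descending

printf '%s\n' "$text_order" "$numeric_order" "$reversed"
# prints 10 2 9, then 2 9 10, then 10 9 2 (one value per line)

rm -r "$workdir"
```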

NOTE

The sort utility treats blank lines as part of the input and, by default, places them at the beginning of the output. The -b option does not remove them; it tells sort to ignore leading blanks (spaces and tabs) on each line when comparing.

wc

The wc utility (named for "word count") displays information about the file in terms of three values: number of lines, words, and characters. The last entry in the output is the name of the file, thus the output would be

$ wc fileone
  17  18  86  fileone
$

You can choose to see only some of the output by using the following options:

  • -c to show only the number of bytes/characters

  • -l to see only the number of lines

  • -w to see only the number of words

In all cases, the name of the file still appears, for example

$ wc -l fileone
  17  fileone
$

To keep the name from appearing, feed the file to wc on standard input, for example with input redirection:

$ wc -l < fileone
  17
$
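
The three counts can be captured individually, as in this sketch. The two-line sample file is hypothetical, and tr -d ' ' is used only to strip the padding that some versions of wc add.

```shell
# A two-line, three-word, fourteen-byte sample file.
workdir=$(mktemp -d)
printf 'one two\nthree\n' > "$workdir/sample"

# Reading from standard input keeps the filename out of the output.
lines=$(wc -l < "$workdir/sample" | tr -d ' ')
words=$(wc -w < "$workdir/sample" | tr -d ' ')
bytes=$(wc -c < "$workdir/sample" | tr -d ' ')

printf 'lines=%s words=%s bytes=%s\n' "$lines" "$words" "$bytes"
# prints: lines=2 words=3 bytes=14

rm -r "$workdir"
```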

fmt

The fmt utility formats text by reflowing it to a specific width. The default width is 75 characters, but a different value can be specified with the -w option. Short lines are combined to create longer ones unless the -s option is used, and the existing spacing between words is preserved unless -u is used. The -u option enforces uniformity and places one space between words and two spaces at the end of each sentence.

The following example shows how the fileone lines are combined to create a 75-character display:

$ fmt fileone
Indianapolis Indiana Columbus Peoria Livingston Scottsdale 1 2 3 4 5 6
7 8 9 10 11 12
$

To change the output to 60 characters, use this example:

$ fmt -w60 fileone
Indianapolis Indiana Columbus Peoria Livingston Scottsdale
1 2 3 4 5 6 7 8 9 10 11 12
$

NOTE

A bare number given as an option to fmt is treated as a width, thus fmt -60 fileone will give the same result as fmt -w60 fileone.
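
As a rough sketch of -w, the following reflows one long line to a width of 20. The sample text is an assumption, and because the exact break points depend on the fmt implementation, no expected output is shown; the only property relied on is that no output line exceeds the requested width.

```shell
# One long input line (62 characters) to be reflowed.
text='one two three four five six seven eight nine ten eleven twelve'

# Reflow to at most 20 characters per line.
wrapped=$(printf '%s\n' "$text" | fmt -w 20)
printf '%s\n' "$wrapped"
```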

tr

The tr (translate) utility can convert one set of characters to another. Use the following example to change all lowercase characters to uppercase:

$ tr '[a-z]' '[A-Z]' < fileone
INDIANAPOLIS INDIANA
COLUMBUS
PEORIA
LIVINGSTON
SCOTTSDALE
{the numbers 1 through 12 follow, unchanged}
$

NOTE

It is extremely important to realize that the syntax of tr only accepts two character sets, not the name of the file. You must feed the name of the file into the utility by directing input (as in the example given), by piping to it (|), or using a similar operation.

Not only can you give character sets as string options, but you can also specify a number of predefined character classes, each written between [: and :], including

  • lower—All lowercase characters

  • upper—All uppercase characters

  • print—All printable characters

  • punct—Punctuation characters

  • space—All whitespace (blank can be used for horizontal whitespace only)

  • alnum—Alpha characters and numbers

  • digit—Numbers only

  • cntrl—Control characters

  • alpha—Letters only

  • graph—Printable characters but not whitespace

For example, the output shown earlier can also be obtained like this:

$ tr '[:lower:]' '[:upper:]' < fileone
INDIANAPOLIS INDIANA
COLUMBUS
PEORIA
LIVINGSTON
SCOTTSDALE
{the numbers 1 through 12 follow, unchanged}
$
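
The character classes can also be used for deletion rather than translation. The sketch below assumes two made-up sample strings and uses the -d option of tr, which is not covered above, to remove every digit.

```shell
# Translate lowercase to uppercase using character classes.
upper=$(printf '%s\n' 'Indianapolis Indiana' | tr '[:lower:]' '[:upper:]')

# -d deletes every character in the given set instead of translating it.
no_digits=$(printf '%s\n' 'abc123def' | tr -d '[:digit:]')

printf '%s\n' "$upper" "$no_digits"
# prints:
# INDIANAPOLIS INDIANA
# abcdef
```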

Other Useful Utilities

A number of other useful text utilities are included with Linux. Some of these have limited usefulness and are intended only for a specific purpose, but are given because knowing of their existence and purpose can make your life with Linux considerably easier.

In alphabetical order, the additional utilities are as follows:

  • expand—Allows you to expand tab characters into spaces. The default number of spaces per tab is 8, but this can be changed using the -t option. The opposite of this utility is unexpand.

  • file—This utility will look at an entry's signature and report what type of file it is—ASCII text, GIF image, and so on. The definitions it returns (and thus the files it can correctly identify) are defined in a file called magic. This file typically resides in /usr/share/misc or /etc.

  • more—Used to display only one screen of output at a time.

  • od—Can perform an octal dump to show the contents of files other than ASCII text files. Used with the -x option, it does a hexadecimal dump, and with the -c option, it shows printable characters as themselves and other bytes as backslash escapes or octal values.

  • pr—Converts the file into a format suitable for printed pages, including a default header with date and time of last modification, filename, and page numbers. The default header can be overridden with the -h option, and the -l option allows you to specify the number of lines to include on each page (the default is 66). Default page width is 72 characters, but a different value can be specified with the -w option. The -d option can be used to double-space the output, and -m can be used to print multiple files side by side, one per column.

  • split—Chops a single file into multiple files. The default is that a new file is created for every 1,000 lines of the original file. Using the -b option, you can avoid the thousand-line splitting and specify a number of bytes to be put into each output file, or use -l to specify a number of lines.

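  • uniq—This utility will examine entries in a file, comparing the current line with the one directly preceding it, and by default prints one copy of each run of identical adjacent lines. Because only neighboring lines are compared, the input is usually sorted first.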
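
As a brief sketch of uniq from the list above, the following sorts a hypothetical file of repeated state names and then filters it. The -d option, which prints only the lines that occur more than once, goes beyond what the list describes.

```shell
# A sample file with duplicates that are not adjacent.
workdir=$(mktemp -d)
printf 'Ohio\nIndiana\nOhio\nMontana\nIndiana\n' > "$workdir/states"

# uniq compares only neighboring lines, so sort first.
distinct=$(LC_ALL=C sort "$workdir/states" | uniq)
# -d keeps one copy of each line that was repeated.
repeated=$(LC_ALL=C sort "$workdir/states" | uniq -d)

printf '%s\n' "$distinct" "$repeated"
# prints:
# Indiana
# Montana
# Ohio
# Indiana
# Ohio

rm -r "$workdir"
```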
