Exploratory Data Analysis

Dec 23, 2023

📄 Contents

␡

⎙ Print

Page 1 of 6 Next >

This chapter is from the book 

The AWK Programming Language, 2nd Edition

Learn More Buy

The previous chapter described a number of small scripts for personal use, often idiosyncratic or specialized. In this chapter, we’re going to do something that is also typical of how Awk is used in real life: we’ll use it along with other tools to informally explore some real data, with the goal of seeing what it looks like. This is called exploratory data analysis or EDA, a term first used by the pioneering statistician John Tukey.

Tukey invented a number of basic data visualization techniques like boxplots, inspired the statistical programming language S that led to the widely-used R language, co-invented the Fast Fourier Transform, and coined the words “bit” and “software.” The authors knew John Tukey as a friend and colleague at Bell Labs in the 1970s and 1980s, where among a large number of very smart and creative people, he stood out as someone special.

The essence of exploratory data analysis is to play with the data before making hypotheses or drawing conclusions. As Tukey himself said,

“Finding the question is often more important than finding the answer. Exploratory data analysis is an attitude, a flexibility, and a reliance on display, NOT a bundle of techniques.”

In many cases, that involves counting things, computing simple statistics, arranging data in different ways, looking for patterns, commonalities, outliers and oddities, and drawing basic graphs and other visual displays. The emphasis is on small, quick experiments that might give some insight, rather than polish or refinement; those come later when we have a better sense of what the data might be telling us.

For EDA, we typically use standard Unix tools like the shell, wc, diff, sort, uniq, grep, and of course regular expressions. These combine well with Awk, and often with other languages like Python.

We will also encounter a variety of file formats, including comma- or tab-separated values (CSV and TSV), JSON, HTML, and XML. Some of these, like CSV and TSV, are easily processed in Awk, while others are sometimes better handled with other tools.

3.1 The Sinking of the Titanic

Our first dataset is based on the sinking of the Titanic on April 15, 1912. This example was chosen, not entirely by coincidence, by one of the authors, who was at the time on a trans-Atlantic boat trip, passing not far from the site where the Titanic sank.

Summary Data: titanic.tsv

The file titanic.tsv, adapted from Wikipedia, contains summary data about the Titanic’s passengers and crew. As is common with datasets in CSV and TSV format, the first line is a header that identifies the data in the lines that follow. Columns are separated by tabs.

Type    Class   Total   Lived   Died
Male    First   175     57      118
Male    Second  168     14      154
Male    Third   462     75      387
Male    Crew    885     192     693
Female  First   144     140     4
Female  Second  93      80      13
Female  Third   165     76      89
Female  Crew    23      20      3
Child   First   6       5       1
Child   Second  24      24      0
Child   Third   79      27      52

Many (perhaps all) datasets contain errors. As a quick check here, each line should have five fields, and the total in the third field should equal field four (lived) plus field five (died). This program prints any line where those conditions do not hold:

NF != 5 || $3 != $4 + $5

If the data is in the right format and the numbers are correct, this should produce a single line of output, the header:

Type    Class    Total    Lived    Died

Once we’ve done this minimal check, we can look at other things. For example, how many people are there in each category?

The categories that we want to count are not identified by numbers, but by words like Male and Crew. Fortunately, the subscripts or indices of Awk arrays can be arbitrary strings of characters, so gender["Male"] and class["Crew"] are valid expressions.

Arrays that allow arbitrary strings as subscripts are called associative arrays; other languages provide the same facility with names like dictionary, map or hashmap. Associative arrays are remarkably convenient and flexible, and we will use them extensively.

NR > 1 { gender[$1] += $3; class[$2] += $3 }

END {
  for (i in gender) print i, gender[i]
  print ""
  for (i in class) print i, class[i]
}

gives

Male 1690
Child 109
Female 425

Crew 908
First 325
Third 706
Second 285

Awk has a special form of the for statement for iterating over the indices of an associative array:

for (i in array) { statements }

sets the variable i in turn to each index of the array, and the statements are executed with that value of i. The elements of the array are visited in an unspecified order; you can’t count on any particular order.

What about survival rates? How did social class, gender and age affect the chance of survival among passengers? With this summary data we can do some simple experiments, for example, computing the survival rate for each category.

NR > 1 { printf("%6s  %6s  %6.1f%%\n", $1, $2, 100 * $4/$3) }

We can sort the output of this test by piping it through the Unix command sort -k3 -nr (sort by third field in reverse numeric order) to produce

 Child  Second    100.0%
Female   First     97.2%
Female    Crew     87.0%
Female  Second     86.0%
 Child   First     83.3%
Female   Third     46.1%
 Child   Third     34.2%
  Male   First     32.6%
  Male    Crew     21.7%
  Male   Third     16.2%
  Male  Second      8.3%

Evidently women and children did survive better on average.

Note that these examples treat the header line of the dataset as a special case. If you’re doing a lot of experiments, it may be easier to remove the header from the data file than to ignore it explicitly in every program.

Passenger Data: passengers.csv

The file passengers.csv is a larger file that contains detailed information about passengers, though it does not contain anything about crew members. The original file is a merger of a widely used machine-learning dataset with another list from Wikipedia. It has 11 columns including home town, lifeboat assignment, and ticket price:

"row.names","pclass","survived","name","age","embarked",
   "home.dest","room","ticket","boat","sex"
...
"11","1st",0,"Astor, Colonel John Jacob",47,"Cherbourg",
   "New York, NY","","17754 L224 10s 6d","(124)","male"
...

How big is the file? We can use the Unix wc command to count lines, words and characters:

$ wc passengers.csv
    1314    6794  112466 passengers.csv

or a two-line Awk program like the one we saw in Chapter 1:

    { nc += length($0) + 1; nw += NF }
END { print NR, nw, nc, FILENAME }

Except for spacing, they produce the same results when the input is a single file.

The file format of passengers.csv is comma-separated values. Although CSV is not rigorously defined, one common definition says that any field that contains a comma or a double quote (") must be surrounded by double quotes. Any field may be surrounded by quotes, whether it contains commas and quotes or not. An empty field is just "", and a quote within a field is represented by a doubled quote, as in """,""", which represents ",". Input fields in CSV files may contain newline characters. For more details, see Section A.5.2.

This is more or less the format used by Microsoft Excel and other spreadsheet programs like Apple Numbers and Google Sheets. It is also the default input format for data frames in Python’s Pandas library and in R.

In versions of Awk since 2023, the command-line argument --csv causes input lines to be split into fields according to this rule. Setting the field separator to a comma explicitly with FS=, does not treat comma field separators specially, so this is useful only for the simplest form of CSV: no quotes. With older versions of Awk it may be easiest to convert the data to a different form using some other system, like an Excel spreadsheet or a Python CSV module.

Another useful alternative format is tab-separated values or TSV. The idea is the same, but simpler: fields are separated by single tabs, and there is no quoting mechanism so fields may not contain embedded tabs or newlines. This format is easily handled by Awk, by setting the field separator to a tab with FS="\t" or equivalently with the command-line argument -F"\t".

As an aside, it’s wise to verify whether a file is in the proper format before relying on its contents. For example, to check whether all records have the same number of fields, you could use

awk '{print NF}' file | sort | uniq -c | sort -nr

The first sort command brings all instances of a particular value together; then the command uniq -c replaces each sequence of identical values by a single line with a count and the value; and finally sort -nr sorts the result numerically in reverse order, so the largest values come first.

For passengers.csv, using the --csv option to process CSV input properly, this produces

1314 11

Every record has the same number of fields, which is necessary for valid data in this dataset, though not sufficient. If some lines have different numbers of fields, now use Awk to find them, for example with NF != 11 in this case.

With a version of Awk that does not handle CSV, the output using -F, will be different:

This shows that almost all fields contain embedded commas.

By the way, generating CSV is straightforward. Here’s a function to_csv that converts a string to a properly quoted string by doubling each quote and surrounding the result with quotes. It’s an example of a function that could go into a personal library.

# to_csv - convert s to proper "..."

function to_csv(s) {
  gsub(/"/, "\"\"", s)
  return "\"" s "\""
}

(Note how quotes are quoted with backslashes.)

We can use this function within a loop to insert commas between elements of an array to create a properly formatted CSV record for an associative array, or for an indexed array like the fields of a line, as illustrated in the functions rec_to_csv and arr_to_csv:

# rec_to_csv - convert a record to csv

function rec_to_csv(   s, i) {
  for (i = 1; i < NF; i++)
    s = s to_csv($i) ","
  s = s to_csv($NF)
  return s
}

# arr_to_csv - convert an indexed array to csv

function arr_to_csv(arr,   s, i, n) {
  n = length(arr)
  for (i = 1; i <= n; i++)
    s = s to_csv(arr[i]) ","
  return substr(s, 1, length(s)-1) # remove trailing comma
}

The following program selects the five attributes class, survival, name, age, and gender, from the original file, and converts the output to tab-separated values.

NR > 1 { OFS="\t"; print $2, $3, $4, $5, $11 }

It produces output like this:

1st 0   Allison, Miss Helen Loraine  2  female
1st 0   Allison, Mr Hudson Joshua Creighton 30  male
1st 0   Allison, Mrs Hudson J.C. (Bessie Waldo Daniels) 25  female
1st 1   Allison, Master Hudson Trevor   0.9167 male

Most ages are integers, but a handful are fractions, like the last line above. Helen Allison was two years old; Master Hudson Allison appears to have been 11 months old, and was the only survivor in his family. (From other sources, we know that the Allison’s chauffeur, George Swane, age 18, also died, but the family’s maid and cook both survived.)

How many infants were there? Running the command

$4 < 1

with tab as the field separator produces eight lines:

1st 1   Allison, Master Hudson Trevor    0.9167 male
2nd 1   Caldwell, Master Alden Gates     0.8333 male
2nd 1   Richards, Master George Sidney   0.8333 male
3rd 1   Aks, Master Philip   0.8333 male
3rd 0   Danbom, Master Gilbert Sigvard Emanuel   0.3333 male
3rd 1   Dean, Miss Elizabeth Gladys (Millvena)   0.1667 female
3rd 0   Peacock, Master Alfred Edward   0.5833  male
3rd 0   Thomas, Master Assad Alexander  0.4167  male

Exercise 3-1. Modify the word count program to produce a separate count for each of its input files, as the Unix wc command does.

Some Further Checking

Another set of questions to explore is how well the two data sources agree. They both come from Wikipedia, but it is not always a perfectly accurate source. Suppose we check something absolutely basic, like how many passengers there were in the passengers file:

$ awk 'END {print NR}' passengers.csv
1314

This count includes one header line, so there were 1313 passengers. On the other hand, this program adds up the counts for non-crew members from the third field of the summary file:

$ awk '!/Crew/ { s += $3 }; END { print s }' titanic.tsv
1316

That’s a discrepancy of three people, so something is wrong.

As another example, how many children were there?

awk --csv '$5 <= 12' passengers.csv

produces 100 lines, which doesn’t match the 109 children in titanic.tsv. Perhaps children are those 13 or younger? That gives 105. Younger than 14? That’s 112. We can guess what age is being used by counting passengers who are called “Master”:

awk --csv '/Master/ {print $5}' passengers.csv | sort -n

The largest age in this population is 13, so that’s perhaps the best guess, though not definitive.

In both of these cases, numbers that ought to be the same are in fact different, which suggests that the data is still flaky. When exploring data, you should always be prepared for errors and inconsistencies in form and content. A big part of the job is to be sure that you have identified and dealt with potential problems before starting to draw conclusions.

In this section, we’ve tried to show how simple computations can help identify such problems. If you collect a set of tools for common operations, like isolating fields, grouping by category, printing the most common and least common entries, and so on, you’ll be better able to perform such checks.

Exercise 3-2. Write some of these tools for yourself, according to your own needs and tastes.

Page 1 of 6 Next >

🔖 Save To Your Account

InformIT Promotional Mailings & Special Offers

I would like to receive exclusive offers and hear about products from InformIT and its family of brands. I can unsubscribe at any time.

Privacy Notice

Overview

Pearson Education, Inc., 221 River Street, Hoboken, New Jersey 07030, (Pearson) presents this site to provide information about products and services that can be purchased through this site.

This privacy notice provides an overview of our commitment to privacy and describes how we collect, protect, use and share personal information collected through this site. Please note that other Pearson websites and online products and services have their own separate privacy policies.

Collection and Use of Information

To conduct business and deliver products and services, Pearson collects and uses personal information in several ways in connection with this site, including:

Questions and Inquiries

For inquiries and questions, we collect the inquiry or question, together with name, contact details (email address, phone number and mailing address) and any other additional information voluntarily submitted to us through a Contact Us form or an email. We use this information to address the inquiry and respond to the question.

Online Store

For orders and purchases placed through our online store on this site, we collect order details, name, institution name and address (if applicable), email address, phone number, shipping and billing addresses, credit/debit card information, shipping options and any instructions. We use this information to complete transactions, fulfill orders, communicate with individuals placing orders or visiting the online store, and for related purposes.

Surveys

Pearson may offer opportunities to provide feedback or participate in surveys, including surveys evaluating Pearson products, services or sites. Participation is voluntary. Pearson collects information requested in the survey questions and uses the information to evaluate, support, maintain and improve products, services or sites, develop new products and services, conduct educational research and for other purposes specified in the survey.

Contests and Drawings

Occasionally, we may sponsor a contest or drawing. Participation is optional. Pearson collects name, contact information and other information specified on the entry form for the contest or drawing to conduct the contest or drawing. Pearson may collect additional personal information from the winners of a contest or drawing in order to award the prize and for tax reporting purposes, as required by law.

Newsletters

If you have elected to receive email newsletters or promotional mailings and special offers but want to unsubscribe, simply email information@informit.com.

Service Announcements

On rare occasions it is necessary to send out a strictly service related announcement. For instance, if our service is temporarily suspended for maintenance we might send users an email. Generally, users may not opt-out of these communications, though they can deactivate their account information. However, these communications are not promotional in nature.

Customer Service

We communicate with users on a regular basis to provide requested services and in regard to issues relating to their account we reply via email or phone in accordance with the users' wishes when a user submits their information through our Contact Us form.

Other Collection and Use of Information

Application and System Logs

Pearson automatically collects log data to help ensure the delivery, availability and security of this site. Log data may include technical information about how a user or visitor connected to this site, such as browser type, type of computer/device, operating system, internet service provider and IP address. We use this information for support purposes and to monitor the health of the site, identify problems, improve service, detect unauthorized access and fraudulent activity, prevent and respond to security incidents and appropriately scale computing resources.

Web Analytics

Pearson may use third party web trend analytical services, including Google Analytics, to collect visitor information, such as IP addresses, browser types, referring pages, pages visited and time spent on a particular site. While these analytical services collect and report information on an anonymous basis, they may use cookies to gather web trend information. The information gathered may enable Pearson (but not the third party web trend services) to link information with application and system log data. Pearson uses this information for system administration and to identify problems, improve service, detect unauthorized access and fraudulent activity, prevent and respond to security incidents, appropriately scale computing resources and otherwise support and deliver this site and its services.

Cookies and Related Technologies

This site uses cookies and similar technologies to personalize content, measure traffic patterns, control security, track use and access of information on this site, and provide interest-based messages and advertising. Users can manage and block the use of cookies through their browser. Disabling or blocking certain cookies may limit the functionality of this site.

Do Not Track

This site currently does not respond to Do Not Track signals.

Security

Pearson uses appropriate physical, administrative and technical security measures to protect personal information from unauthorized access, use and disclosure.

Children

This site is not directed to children under the age of 13.

Marketing

Pearson may send or direct marketing communications to users, provided that

Pearson will not use personal information collected or processed as a K-12 school service provider for the purpose of directed or targeted advertising.
Such marketing is consistent with applicable law and Pearson's legal obligations.
Pearson will not knowingly direct or send marketing communications to an individual who has expressed a preference not to receive marketing.
Where required by applicable law, express or implied consent to marketing exists and has not been withdrawn.

Pearson may provide personal information to a third party service provider on a restricted basis to provide marketing solely on behalf of Pearson or an affiliate or customer for whom Pearson is a service provider. Marketing preferences may be changed at any time.

Correcting/Updating Personal Information

If a user's personally identifiable information changes (such as your postal address or email address), we provide a way to correct or update that user's personal data provided to us. This can be done on the Account page. If a user no longer desires our service and desires to delete his or her account, please contact us at customer-service@informit.com and we will process the deletion of a user's account.

Choice/Opt-out

Users can always make an informed choice as to whether they should proceed with certain services offered by InformIT. If you choose to remove yourself from our mailing list(s) simply visit the following page and uncheck any communication you no longer want to receive: www.informit.com/u.aspx.

Sale of Personal Information

Pearson does not rent or sell personal information in exchange for any payment of money.

While Pearson does not sell personal information, as defined in Nevada law, Nevada residents may email a request for no sale of their personal information to NevadaDesignatedRequest@pearson.com.

Supplemental Privacy Statement for California Residents

California residents should read our Supplemental privacy statement for California residents in conjunction with this Privacy Notice. The Supplemental privacy statement for California residents explains Pearson's commitment to comply with California law and applies to personal information of California residents collected in connection with this site and the Services.

Sharing and Disclosure

Pearson may disclose personal information, as follows:

As required by law.
With the consent of the individual (or their parent, if the individual is a minor)
In response to a subpoena, court order or legal process, to the extent permitted or required by law
To protect the security and safety of individuals, data, assets and systems, consistent with applicable law
In connection the sale, joint venture or other transfer of some or all of its company or assets, subject to the provisions of this Privacy Notice
To investigate or address actual or suspected fraud or other illegal activities
To exercise its legal rights, including enforcement of the Terms of Use for this site or another contract
To affiliated Pearson companies and other companies and organizations who perform work for Pearson and are obligated to protect the privacy of personal information consistent with this Privacy Notice
To a school, organization, company or government agency, where Pearson collects or processes the personal information in a school setting or on behalf of such organization, company or government agency.

Links

This web site contains links to other sites. Please be aware that we are not responsible for the privacy practices of such other sites. We encourage our users to be aware when they leave our site and to read the privacy statements of each and every web site that collects Personal Information. This privacy statement applies solely to information collected by this web site.

Requests and Contact

Please contact us about this Privacy Notice or if you have any requests or questions relating to the privacy of your personal information.

Changes to this Privacy Notice

We may revise this Privacy Notice through an updated posting. We will identify the effective date of the revision in the posting. Often, updates are made to provide greater clarity or to comply with changes in regulatory requirements. If the updates involve material changes to the collection, protection, use or disclosure of Personal Information, Pearson will provide notice of the change through a conspicuous notice on this site or other appropriate way. Continued use of the site after the effective date of a posted revision evidences acceptance. Please contact us if you have questions or concerns about the Privacy Notice or any objection to any revisions.

Last Update: November 17, 2020

Email Address