Exploratory Data Analysis

The previous chapter described a number of small scripts for personal use, often idiosyncratic or specialized. In this chapter, we’re going to do something that is also typical of how Awk is used in real life: we’ll use it along with other tools to informally explore some real data, with the goal of seeing what it looks like. This is called exploratory data analysis or EDA, a term first used by the pioneering statistician John Tukey.

Tukey invented a number of basic data visualization techniques like boxplots, inspired the statistical programming language S that led to the widely-used R language, co-invented the Fast Fourier Transform, and coined the words “bit” and “software.” The authors knew John Tukey as a friend and colleague at Bell Labs in the 1970s and 1980s, where among a large number of very smart and creative people, he stood out as someone special.

The essence of exploratory data analysis is to play with the data before making hypotheses or drawing conclusions. As Tukey himself said,

“Finding the question is often more important than finding the answer. Exploratory data analysis is an attitude, a flexibility, and a reliance on display, NOT a bundle of techniques.”

In many cases, that involves counting things, computing simple statistics, arranging data in different ways, looking for patterns, commonalities, outliers and oddities, and drawing basic graphs and other visual displays. The emphasis is on small, quick experiments that might give some insight, rather than polish or refinement; those come later when we have a better sense of what the data might be telling us.

For EDA, we typically use standard Unix tools like the shell, wc, diff, sort, uniq, grep, and of course regular expressions. These combine well with Awk, and often with other languages like Python.

We will also encounter a variety of file formats, including comma- or tab-separated values (CSV and TSV), JSON, HTML, and XML. Some of these, like CSV and TSV, are easily processed in Awk, while others are sometimes better handled with other tools.

3.1 The Sinking of the Titanic

Our first dataset is based on the sinking of the Titanic on April 15, 1912. This example was chosen, not entirely by coincidence, by one of the authors, who was at the time on a trans-Atlantic boat trip, passing not far from the site where the Titanic sank.

Summary Data: titanic.tsv

The file titanic.tsv, adapted from Wikipedia, contains summary data about the Titanic’s passengers and crew. As is common with datasets in CSV and TSV format, the first line is a header that identifies the data in the lines that follow. Columns are separated by tabs.

Type    Class   Total   Lived   Died
Male    First   175     57      118
Male    Second  168     14      154
Male    Third   462     75      387
Male    Crew    885     192     693
Female  First   144     140     4
Female  Second  93      80      13
Female  Third   165     76      89
Female  Crew    23      20      3
Child   First   6       5       1
Child   Second  24      24      0
Child   Third   79      27      52

Many (perhaps all) datasets contain errors. As a quick check here, each line should have five fields, and the total in the third field should equal field four (lived) plus field five (died). This program prints any line where those conditions do not hold:

NF != 5 || $3 != $4 + $5

If the data is in the right format and the numbers are correct, this should produce a single line of output, the header:

Type    Class    Total    Lived    Died
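To run it, the one-line program can be given directly on the command line; one way to invoke it, with the field separator set to a tab, is

awk -F'\t' 'NF != 5 || $3 != $4 + $5' titanic.tsv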

Once we’ve done this minimal check, we can look at other things. For example, how many people are there in each category?

The categories that we want to count are not identified by numbers, but by words like Male and Crew. Fortunately, the subscripts or indices of Awk arrays can be arbitrary strings of characters, so gender["Male"] and class["Crew"] are valid expressions.

Arrays that allow arbitrary strings as subscripts are called associative arrays; other languages provide the same facility with names like dictionary, map or hashmap. Associative arrays are remarkably convenient and flexible, and we will use them extensively.

NR > 1 { gender[$1] += $3; class[$2] += $3 }

END {
  for (i in gender) print i, gender[i]
  print ""
  for (i in class) print i, class[i]
}

gives

Male 1690
Child 109
Female 425

Crew 908
First 325
Third 706
Second 285

Awk has a special form of the for statement for iterating over the indices of an associative array:

for (i in array) { statements }

sets the variable i in turn to each index of the array, and the statements are executed with that value of i. The elements of the array are visited in an unspecified order; you can’t count on any particular order.
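If a particular order matters, one common idiom (a sketch, not part of the original program) is to pipe print's output through a sort command from within Awk. For example, the second loop in the END action above could be written as

for (i in class) print i, class[i] | "sort -k2 -nr"

which lists the class totals by decreasing count.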

What about survival rates? How did social class, gender and age affect the chance of survival among passengers? With this summary data we can do some simple experiments, for example, computing the survival rate for each category.

NR > 1 { printf("%6s  %6s  %6.1f%%\n", $1, $2, 100 * $4/$3) }

We can sort the output of this program by piping it through the Unix command sort -k3 -nr (sort by third field in reverse numeric order) to produce

 Child  Second    100.0%
Female   First     97.2%
Female    Crew     87.0%
Female  Second     86.0%
 Child   First     83.3%
Female   Third     46.1%
 Child   Third     34.2%
  Male   First     32.6%
  Male    Crew     21.7%
  Male   Third     16.2%
  Male  Second      8.3%
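Putting the pieces together, if the one-line rate program is saved in a file, say survival.awk (an illustrative name), the whole pipeline might look like this:

awk -F'\t' -f survival.awk titanic.tsv | sort -k3 -nr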

Evidently women and children did, on average, survive at higher rates.

Note that these examples treat the header line of the dataset as a special case. If you’re doing a lot of experiments, it may be easier to remove the header from the data file than to ignore it explicitly in every program.
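For example, a one-time command like this makes a header-free copy (the output file name is arbitrary):

awk 'NR > 1' titanic.tsv >titanic-body.tsv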

Passenger Data: passengers.csv

The file passengers.csv is a larger file that contains detailed information about passengers, though it does not contain anything about crew members. The original file is a merger of a widely used machine-learning dataset with another list from Wikipedia. It has 11 columns including home town, lifeboat assignment, and ticket price:

"row.names","pclass","survived","name","age","embarked",
   "home.dest","room","ticket","boat","sex"
...
"11","1st",0,"Astor, Colonel John Jacob",47,"Cherbourg",
   "New York, NY","","17754 L224 10s 6d","(124)","male"
...

How big is the file? We can use the Unix wc command to count lines, words and characters:

$ wc passengers.csv
    1314    6794  112466 passengers.csv

or a two-line Awk program like the one we saw in Chapter 1:

    { nc += length($0) + 1; nw += NF }
END { print NR, nw, nc, FILENAME }

Except for spacing, they produce the same results when the input is a single file.
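For instance, if the two-line program is stored in a file called wc.awk (an assumed name), it can be run as

awk -f wc.awk passengers.csv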

The file format of passengers.csv is comma-separated values. Although CSV is not rigorously defined, one common definition says that any field that contains a comma or a double quote (") must be surrounded by double quotes. Any field may be surrounded by quotes, whether it contains commas and quotes or not. An empty field is just "", and a quote within a field is represented by a doubled quote, as in """,""", which represents ",". Input fields in CSV files may contain newline characters. For more details, see Section A.5.2.
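For example, the following single record has three fields: the string New York, NY with its embedded comma, an empty field, and the string He said "hello" with its embedded quotes:

"New York, NY","","He said ""hello"""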

This is more or less the format used by Microsoft Excel and other spreadsheet programs like Apple Numbers and Google Sheets. It is also the default input format for data frames in Python’s Pandas library and in R.

In versions of Awk since 2023, the command-line argument --csv causes input lines to be split into fields according to these rules. Setting the field separator to a comma explicitly with FS="," does not handle quoted fields, so it is useful only for the simplest form of CSV, in which no field is quoted or contains an embedded comma. With older versions of Awk it may be easiest to convert the data to a different form with some other tool, such as an Excel spreadsheet or Python's csv module.
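As an illustration, with --csv this command prints the home.dest column (field 7) intact even when its values, like "New York, NY", contain commas; with FS set to a plain comma, such values would be split across two fields:

awk --csv '{ print $7 }' passengers.csv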

Another useful alternative format is tab-separated values or TSV. The idea is the same, but simpler: fields are separated by single tabs, and there is no quoting mechanism so fields may not contain embedded tabs or newlines. This format is easily handled by Awk, by setting the field separator to a tab with FS="\t" or equivalently with the command-line argument -F"\t".

As an aside, it’s wise to verify whether a file is in the proper format before relying on its contents. For example, to check whether all records have the same number of fields, you could use

awk '{print NF}' file | sort | uniq -c | sort -nr

The first sort command brings all instances of a particular value together; then the command uniq -c replaces each sequence of identical values by a single line with a count and the value; and finally sort -nr sorts the result numerically in reverse order, so the largest values come first.

For passengers.csv, using the --csv option to process CSV input properly, this produces

1314 11

Every record has the same number of fields, which is necessary for valid data in this dataset, though not sufficient. If some lines had a different number of fields, we could use Awk to find them, for example with the program NF != 11 in this case.
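A sketch of that check, printing the line number and field count of each offending record:

awk --csv 'NF != 11 { print NR, NF }' passengers.csv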

With a version of Awk that does not handle CSV, the output using -F',' will be different:

624 12
517 13
155 14
 15 15
  3 11

This shows that almost all records contain at least one field with an embedded comma.

By the way, generating CSV is straightforward. Here’s a function to_csv that converts a string to a properly quoted string by doubling each quote and surrounding the result with quotes. It’s an example of a function that could go into a personal library.

# to_csv - convert s to proper "..."

function to_csv(s) {
  gsub(/"/, "\"\"", s)
  return "\"" s "\""
}

(Note how quotes are quoted with backslashes.)

We can use this function within a loop that inserts commas between elements to create a properly formatted CSV record, either from the fields of the current input line or from an indexed array, as illustrated in the functions rec_to_csv and arr_to_csv:

# rec_to_csv - convert a record to csv

function rec_to_csv(   s, i) {
  for (i = 1; i < NF; i++)
    s = s to_csv($i) ","
  s = s to_csv($NF)
  return s
}

# arr_to_csv - convert an indexed array to csv

function arr_to_csv(arr,   s, i, n) {
  n = length(arr)
  for (i = 1; i <= n; i++)
    s = s to_csv(arr[i]) ","
  return substr(s, 1, length(s)-1) # remove trailing comma
}

The following program selects the five attributes class, survival, name, age, and gender from the original file and converts the output to tab-separated values.

NR > 1 { OFS="\t"; print $2, $3, $4, $5, $11 }
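Since the input is CSV, this program has to be run with the --csv option; a possible invocation, using illustrative file names that we will reuse below, is

awk --csv -f select5.awk passengers.csv >five.tsv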

It produces output like this:

1st 0   Allison, Miss Helen Loraine  2  female
1st 0   Allison, Mr Hudson Joshua Creighton 30  male
1st 0   Allison, Mrs Hudson J.C. (Bessie Waldo Daniels) 25  female
1st 1   Allison, Master Hudson Trevor   0.9167 male

Most ages are integers, but a handful are fractions, like the last line above. Helen Allison was two years old; Master Hudson Allison appears to have been 11 months old, and was the only survivor in his family. (From other sources, we know that the Allisons' chauffeur, George Swane, age 18, also died, but the family's maid and cook both survived.)

How many infants were there? Running the program

$4 < 1

with tab as the field separator produces eight lines:

1st 1   Allison, Master Hudson Trevor    0.9167 male
2nd 1   Caldwell, Master Alden Gates     0.8333 male
2nd 1   Richards, Master George Sidney   0.8333 male
3rd 1   Aks, Master Philip   0.8333 male
3rd 0   Danbom, Master Gilbert Sigvard Emanuel   0.3333 male
3rd 1   Dean, Miss Elizabeth Gladys (Millvena)   0.1667 female
3rd 0   Peacock, Master Alfred Edward   0.5833  male
3rd 0   Thomas, Master Assad Alexander  0.4167  male
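For the record, the actual command, using the illustrative five.tsv file from above, would be

awk -F'\t' '$4 < 1' five.tsv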

Exercise 3-1. Modify the word count program to produce a separate count for each of its input files, as the Unix wc command does.

Some Further Checking

Another set of questions to explore is how well the two data sources agree. They both come from Wikipedia, but it is not always a perfectly accurate source. Suppose we check something absolutely basic, like how many passengers there were in the passengers file:

$ awk 'END {print NR}' passengers.csv
1314

This count includes one header line, so there were 1313 passengers. On the other hand, this program adds up the counts for non-crew members from the third field of the summary file:

$ awk '!/Crew/ { s += $3 }; END { print s }' titanic.tsv
1316

That’s a discrepancy of three people, so something is wrong.

As another example, how many children were there?

awk --csv '$5 <= 12' passengers.csv

produces 100 lines, which doesn’t match the 109 children in titanic.tsv. Perhaps children are those 13 or younger? That gives 105. Younger than 14? That’s 112. We can guess what age is being used by counting passengers who are called “Master”:
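To get the counts directly rather than by eyeballing output, pipe the matching lines through wc -l, changing the age threshold for each experiment:

awk --csv '$5 <= 12' passengers.csv | wc -l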

awk --csv '/Master/ {print $5}' passengers.csv | sort -n

The largest age in this population is 13, so that’s perhaps the best guess, though not definitive.

In both of these cases, numbers that ought to be the same are in fact different, which suggests that the data is still flaky. When exploring data, you should always be prepared for errors and inconsistencies in form and content. A big part of the job is to be sure that you have identified and dealt with potential problems before starting to draw conclusions.

In this section, we’ve tried to show how simple computations can help identify such problems. If you collect a set of tools for common operations, like isolating fields, grouping by category, printing the most common and least common entries, and so on, you’ll be better able to perform such checks.
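As one plausible example of such a tool (the name, column convention, and defaults are up to you), a tiny shell script might report the ten most common values in a chosen column of a tab-separated file:

# common - print the ten most frequent values of column $1 in file $2
awk -F'\t' -v col="$1" '{ print $col }' "$2" |
    sort | uniq -c | sort -nr | sed 10q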

Exercise 3-2. Write some of these tools for yourself, according to your own needs and tastes.
