Home > Articles

This chapter is from the book

3.3 Grouping Data

Let’s take a look at the question of how many distinct values there are in a dataset. The sequence we showed above with sort and uniq -c is run so frequently that it probably ought to be a script, though at this point we’ve used it so many times that we can type it quickly and accurately. Here are some “distinct value” questions for the Titanic data, which we’ll use because it’s smaller.

How many male passengers and female passengers are there?

$ awk --csv '{g[$11]++}
     END {for (i in g) print i, g[i]}' passengers.csv
female 463
sex 1
male 850

That seems right — “sex” is the column header, and all the other values are either male or female, as expected. Very similar programs could check passenger classes, survival status, and age. For instance, checking ages reveals that no age is given for 258 of the 1313 passengers.

If we count the number of different ages with

$ awk --csv '{g[$5]++}
      END {for (i in g) print i, g[i]}' passengers.csv | sort -n

we see a sequence of lines like this:

...
 1 4
1 4
 2 6
2 7
 3 6
3 2
...

About half of the age fields contain a spurious space! That could easily throw off some future computation if it’s not corrected.

More generally, sorting is a powerful technique for spotting anomalies in data, because it brings together pieces of text that share a common prefix but differ thereafter. We can see an example if we try to count honorifics, like Mr or Colonel. A quick list can be produced by printing the second word of the name field; this catches most of the obvious ones:

$ awk --csv '{split($4, name, " ")
   print name[2]}' passengers.csv | sort | uniq -c | sort -nr
 728 Mr
 229 Miss
 191 Mrs
  56 Master
  16 Ms
   7 Dr
   6 Rev
   ...
$

This produces a long tail of spurious non-honorifics, but also suggests places where the program could be improved; for example, removing punctuation would eliminate these differences:

6 Rev
1 Rev.
1 Mlle.
1 Mlle

This experiment also reveals one Colonel and one Col, presumably both referring to the same rank.

It’s also interesting that Ms was in use more than 50 years before it became common in modern times, though we don’t know what social status or condition it was meant to indicate.

In a similar vein, we can answer questions like how many breweries, beer styles, and reviewers are in the beer dataset:

{ brewery[$2]++; style[$8]++; reviewer[$7]++ }
END { print length(brewery), "breweries," length(style), "styles,"
            length(reviewer), "reviewers" }

produces

5744 breweries, 105 styles, 33389 reviewers

When applied to an array, the function length returns the number of elements.

Variations of this code can answer questions like how popular the various styles are:

{ style[$8]++ }
END { for (i in style) print style[i], i }

yields (when sorted and run through the head and tail program of Section 2.2)

117586 American IPA
85977 American Double / Imperial IPA
63469 American Pale Ale (APA)
54129 Russian Imperial Stout
50705 American Double / Imperial Stout

...

686 Gose
609 Faro
466 Roggenbier
297 Kvass
241 Happoshu

If you’re going to do much of this kind of selecting fields and computing their statistics, it might be worth writing a handful of short scripts, rather like those we talked about in Chapter 2. One script could select a particular field, while a separate script could do the sorting and uniquing.

InformIT Promotional Mailings & Special Offers

I would like to receive exclusive offers and hear about products from InformIT and its family of brands. I can unsubscribe at any time.