3.3 Grouping Data
Let’s take a look at the question of how many distinct values there are in a dataset. The sequence we showed above with sort and uniq -c is run so frequently that it probably ought to be a script, though at this point we’ve used it so many times that we can type it quickly and accurately. Here are some “distinct value” questions for the Titanic data, which we’ll use because it’s smaller.
How many male passengers and female passengers are there?
$ awk --csv '{g[$11]++} END {for (i in g) print i, g[i]}' passengers.csv female 463 sex 1 male 850
That seems right — “sex” is the column header, and all the other values are either male or female, as expected. Very similar programs could check passenger classes, survival status, and age. For instance, checking ages reveals that no age is given for 258 of the 1313 passengers.
If we count the number of different ages with
$ awk --csv '{g[$5]++} END {for (i in g) print i, g[i]}' passengers.csv | sort -n
we see a sequence of lines like this:
... 1 4 1 4 2 6 2 7 3 6 3 2 ...
About half of the age fields contain a spurious space! That could easily throw off some future computation if it’s not corrected.
More generally, sorting is a powerful technique for spotting anomalies in data, because it brings together pieces of text that share a common prefix but differ thereafter. We can see an example if we try to count honorifics, like Mr or Colonel. A quick list can be produced by printing the second word of the name field; this catches most of the obvious ones:
$ awk --csv '{split($4, name, " ") print name[2]}' passengers.csv | sort | uniq -c | sort -nr 728 Mr 229 Miss 191 Mrs 56 Master 16 Ms 7 Dr 6 Rev ... $
This produces a long tail of spurious non-honorifics, but also suggests places where the program could be improved; for example, removing punctuation would eliminate these differences:
6 Rev 1 Rev. 1 Mlle. 1 Mlle
This experiment also reveals one Colonel and one Col, presumably both referring to the same rank.
It’s also interesting that Ms was in use more than 50 years before it became common in modern times, though we don’t know what social status or condition it was meant to indicate.
In a similar vein, we can answer questions like how many breweries, beer styles, and reviewers are in the beer dataset:
{ brewery[$2]++; style[$8]++; reviewer[$7]++ } END { print length(brewery), "breweries," length(style), "styles," length(reviewer), "reviewers" }
produces
5744 breweries, 105 styles, 33389 reviewers
When applied to an array, the function length returns the number of elements.
Variations of this code can answer questions like how popular the various styles are:
{ style[$8]++ } END { for (i in style) print style[i], i }
yields (when sorted and run through the head and tail program of Section 2.2)
117586 American IPA 85977 American Double / Imperial IPA 63469 American Pale Ale (APA) 54129 Russian Imperial Stout 50705 American Double / Imperial Stout ... 686 Gose 609 Faro 466 Roggenbier 297 Kvass 241 Happoshu
If you’re going to do much of this kind of selecting fields and computing their statistics, it might be worth writing a handful of short scripts, rather like those we talked about in Chapter 2. One script could select a particular field, while a separate script could do the sorting and uniquing.