3.2 Beer Ratings
Our second dataset is a collection of nearly 1.6 million ratings of beer, originally from RateBeer.com, a site for beer enthusiasts. This dataset is so large that it’s not feasible to study every line to be sure of its properties, so we have to rely on tools like Awk to explore and validate the data.
The data comes from Kaggle, a site for experimenting with machine-learning algorithms. You can find the original at https://www.kaggle.com/datasets/rdoume/-beerreviews; we are grateful to RateBeer, Kaggle, and the creator of the dataset itself for providing such an interesting collection of data.
Let’s start with some of the basic parameters: how big is the file and what does it look like? For a raw count, nothing beats the wc command:
$ time wc reviews.csv 1586615 12171013 180174429 reviews.csv real 0m0.629s user 0m0.585s sys 0m0.037s
Not surprisingly, wc is fast but as we’ve seen before, it’s easy to write a wc equivalent in Awk:
$ time awk '{ nc += length($0) + 1; nw += NF }
END { print NR, nw, nc, FILENAME }' reviews.csv
1586615 12170527 179963813 reviews.csv
real    0m9.402s
user    0m9.159s
sys     0m0.125s
Awk is an order of magnitude slower for this specific test. Awk is fast enough for most purposes, but there are times when other programs are more appropriate. Somewhat surprisingly, Gawk is five times faster, taking only 1.9 seconds.
Something else is more surprising, however: wc and Awk differ in the number of words and characters they count. We’ll dig into this later, but as a preview, wc is counting bytes (and thus implicitly assuming that the input is entirely ASCII), while Awk is counting Unicode UTF-8 characters. Here’s an example rating where the two programs come up with legitimately different answers:
95,Löwenbräu AG,1257106630,4,4,3,atis,Munich Helles Lager,4,4,
       Löwenbräu Urtyp,5.4,33038
UTF-8 is a variable-length encoding: ASCII characters are a single byte, and other languages use two or three bytes per character. The characters with umlauts are two bytes long in UTF-8. There are also some records with Asian characters, which are three bytes long. In such cases, wc will report more characters than Awk will.
The original data has 13 attributes but we will only use five of them here: brewery name, overall review, beer style, beer name, and alcohol content (percentage of alcohol by volume, or ABV). We created a new file with these attributes, and also converted the format from its original CSV to TSV by setting the output field separator OFS. This produces lines like this. (Long lines have been split into two, marked by a backslash at the end.)
Amstel Brouwerij B. V. 3.5 Light Lager Amstel Light 3.5 Bluegrass Brewing Co. 4 American Pale Ale (APA) American Pale Ale 5.79 Hoppin' Frog Brewery 2.5 Winter Warmer Frosted Frog Christmas Ale 8.6
This shrinks the file from 180 megabytes to 113 megabytes, still large but more manageable.
We can see a wide range of ABV values in these sample lines, which suggests a question: What’s the maximum value, the strongest beer that has been reviewed? This is easily answered with this program:
NR > 1 && $5 > maxabv { maxabv = $5; brewery = $1; name = $4 }
END { print maxabv, brewery, name }
which produces
57.7 Schorschbräu Schorschbräu Schorschbock 57%
This value is stunningly high, about 10 times the content of normal beer, so on the surface it looks like a data error. But a trip to the web confirms its legitimacy. That raises a follow-up question, whether this value is a real outlier, or merely the tip of a substantial alcoholic iceberg. If we look for brews of say 10 percent or more:
$5 >= 10 { print $1, $4, $5 }
we get over 195,000 reviews, which suggests that high-alcohol beer is popular, at least among people who contribute to RateBeer.
Of course that raises yet more questions, this time about low-alcohol beer. What about beer with less than say 0.5 percent, which is the legal definition of alcohol-free, at least in parts of the USA?
$5 <= 0.5 { print $1, $4, $5 }
This produces only 68,800 reviews, which suggests that low-alcohol beer is significantly less popular.
What ratings are associated with high and low alcohol?
$ awk -F'\t' '$5 >= 10 {rate += $2; nrate++}
    END {print rate/nrate, nrate}' rev.tsv
3.93702 194359
$ awk -F'\t' '$5 <= 0.5 {rate += $2; nrate++}
    END {print rate/nrate, nrate}' rev.tsv
3.61408 68808
$ awk -F'\t' '{rate += $2; nrate++}
    END {print rate/nrate, nrate}' rev.tsv
3.81558 1586615
This may or may not be statistically significant, but the average rating of high-alcohol beers is higher than the overall average rating, which in turn is higher than low-alcohol beers. (This is consistent with the personal preferences of at least one of the authors.)
But wait! Further checking reveals that there are 67,800 reviews that don’t list an ABV at all; the field is empty! Let’s re-run the low-alcohol computation with a proper test:
$ awk -F'\t' '$5 != "" && $5 <= 0.5 {rate += $2; nrate++}
    END {print rate/nrate, nrate}' rev.tsv
2.58895 1023
One doesn’t have to be a beer aficionado to guess that beer without alcohol isn’t going to be popular or highly rated.
The moral of these examples is that one has to look at all the data carefully. How many fields are empty or have an explicitly non-useful value like “N/A”? What is the range of values in a column? What are the distinct values? Answering such questions should be part of the initial exploration, and creating some simple scripts to automate the process can be a good investment.