3.4 Unicode Data
As befits a drink that knows no national boundaries, the names of beers use many non-ASCII characters. The Awk program charfreq counts the number of times each distinct Unicode code point occurs in the input. (A code point is often a character, but some characters are made up of multiple code points.)
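The difference between a character and a code point can be seen in a small Python sketch. Using Python here instead of Awk is a choice of convenience, and the sample strings are illustrative assumptions, not data from the beer file. The letter é can be stored either as one precomposed code point or as a plain e followed by a combining accent, so a program that counts code points would count the two forms differently.

```python
import unicodedata

# Two ways to write the same visible character, e with an acute accent.
precomposed = "\u00e9"      # single code point U+00E9
decomposed = "e\u0301"      # 'e' followed by combining acute accent U+0301

def code_points(s):
    """Return the code points of s as a list of U+XXXX strings."""
    return [f"U+{ord(c):04X}" for c in s]

print(code_points(precomposed))   # one code point
print(code_points(decomposed))    # two code points

# NFC normalization composes the pair into the single code point.
composed = unicodedata.normalize("NFC", decomposed)
print(composed == precomposed)
```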
    # charfreq - count frequency of characters in input

    awk '
    { n = split($0, ch, "")
      for (i = 1; i <= n; i++)
        tab[ch[i]]++
    }
    END {
      for (i in tab)
        print i "\t" tab[i]
    } ' $* | sort -k2 -nr
Splitting each line with an empty string as the field separator puts each character into a separate element of an array ch, and those characters are counted in tab; the accumulated counts are displayed at the end, sorted into decreasing frequency order.
This program is not very fast on this data, taking 250 seconds on a 2015 MacBook Air. Here’s an alternate version that’s more than twice as fast, just under 105 seconds:
    # charfreq2 - alternate version of charfreq

    awk '
    { n = length($0)
      for (i = 1; i <= n; i++)
        tab[substr($0, i, 1)]++
    }
    END {
      for (i in tab)
        print i "\t" tab[i]
    } ' $* | sort -k2 -nr
Rather than using split, it extracts the characters one at a time with substr. The substring function substr(s,m,n) returns the substring of s of length n that begins at position m (starting at 1), or the empty string if the range implied by m and n is outside the string. If n is omitted, the substring extends to the end of s. Full details are in Section A.2.1 of the reference manual.
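For readers more used to 0-based slicing, the behavior of substr described above can be modeled roughly in Python. This is only a sketch for integer arguments: the function below is hypothetical, and the way it clamps an out-of-range request to the ends of the string is an assumption, not something taken from the reference manual.

```python
def substr(s, m, n=None):
    """Rough Python model of Awk's substr(s, m, n) for integer m and n.

    Positions are 1-based. Clamping the requested range to the string
    is an assumption about how out-of-range arguments are handled.
    """
    start = m                                   # first position wanted
    end = len(s) + 1 if n is None else m + n    # one past the last position
    start = max(start, 1)                       # clamp to the string
    end = min(end, len(s) + 1)
    if start >= end:
        return ""                               # range lies outside the string
    return s[start - 1:end - 1]                 # convert to a 0-based slice

print(substr("hello", 2, 3))         # three characters starting at position 2
print(substr("hello", 4))            # from position 4 to the end
print(repr(substr("hello", 9, 2)))   # start is past the end of the string
```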
Gawk, the GNU version of Awk, is again much faster: 72 seconds for the first version and 42 seconds for the second.
What about another language? For comparison, we wrote a simple Python version of charfreq:
    # charfreq - count frequency of characters in input

    freq = {}
    with open('../beer/reviews.csv', encoding='utf-8') as f:
        for ch in f.read():
            if ch == '\n':
                continue
            if ch in freq:
                freq[ch] += 1
            else:
                freq[ch] = 1
    for ch in freq:
        print(ch, freq[ch])
The Python version takes 45 seconds, so it’s about the same as Gawk, at the price of having to write explicit file-handling code. (The authors are not Pythonistas, so this program can surely be improved.)
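One possible tidying of the Python version, offered as a sketch and not as a measured improvement, uses collections.Counter from the standard library and reads the input a line at a time instead of all at once. The function name char_freq and the in-memory sample are assumptions made for illustration; no timing was done for this variant.

```python
from collections import Counter
import io

def char_freq(lines):
    """Count each code point in an iterable of text lines, ignoring newlines."""
    freq = Counter()
    for line in lines:
        freq.update(line.rstrip("\n"))   # Counter.update counts each character
    return freq

# A small in-memory sample stands in for the real input file.
sample = io.StringIO("Black (\u9ed1)\nK\u00f6lsch\n")
for ch, count in char_freq(sample).most_common():
    print(f"{ch}\t{count}")
```

To run it on the review data, an open file can be passed in place of the sample, for example the object returned by open('../beer/reviews.csv', encoding='utf-8'). The most_common method returns the counts in decreasing order, which stands in for the sort step of the Awk versions.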
There are 195 distinct characters in the file, excluding the newline at the end of each line. The most frequent character is a space, followed by printable characters:
             10586176
    ,        19094985
    e        12308925
    r        8311408
    4        7269630
    a        7014111
    5        6993858
    ...
There are quite a few characters from European languages, like umlauts from German, and a modest number of Japanese and Chinese characters:
             1
             1
             1
             1
             1
             1
    黑       229
The final character is 黑 (hēi, black), which appears in the name of a potent Imperial stout called simply “Black,” with the Chinese character as its alternate name:
Mikkeller ApS,2, American Double / Imperial Stout, Black (黑), 17.5
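When an unfamiliar character turns up in the frequency table, its code point and official Unicode name can be looked up with Python's standard unicodedata module. The sketch below is illustrative only: the helper describe is hypothetical, and the three sample characters were picked by hand to stand for an ASCII letter, a German umlaut, and the Chinese character for black.

```python
import unicodedata

def describe(ch):
    """Return 'U+XXXX NAME' for a single code point."""
    name = unicodedata.name(ch, "<unnamed>")   # default avoids ValueError
    return f"U+{ord(ch):04X} {name}"

# Hand-picked sample characters, not output from charfreq.
for ch in ["e", "\u00fc", "\u9ed1"]:
    print(describe(ch))
```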