Home > Articles > Programming > C/C++

  • Print
  • + Share This
This chapter is from the book

Case Study: Histograms

Suppose that you are presented with a large body of text. Is it possible to tell interesting things about the authors from the typical length of their sentences, their words, and so on? Some literary detectives think so; in this case study, your job is to collect these statistics and present them as a bar graph, otherwise known as a histogram. You already have one tool for the job—C++ input streams, which read in strings separated by spaces. C++ input streams are well suited to this task because they are not line based; text is naturally parsed into tokens, which are usually (but not always) words. For example, "He said, let's go." is read as {"He" "said," "let's" "go."}. So when you read in a token, you need to look at the end of each word for punctuation. The basic strategy is to read each word and punctuation character and to increment word and sentence frequency tables. These tables don't have to be very long, so arrays will work well.

First, the arrays need to be initialized to zero (never assume that this is already true!). The standard algorithm fill_n() is easier to use in this case than an explicit for loop:

const int MAXWORD = 30, MAXSENTENCE = 120;

int word_lengths[MAXWORD];
int sentence_lengths[MAXSENTENCE];

void init_arrays() 
{
 fill_n(word_lengths,MAXWORD,0);
 fill_n(sentence_lengths,MAXSENTENCE,0);
}

Here is collect_file_stats(), which reads each word of the text using >>. Then it looks at the last character of the word, which is word[len-1] because strings are like arrays. If this character is punctuation, it must be removed from the string. The word length frequency table can be updated with a single statement: ++word_lengths[len]. The next few lines count words in sentences.

bool collect_file_stats(string file, int word_lengths[], 
                        int sentence_lengths[])
 {
  ifstream in(file.c_str());
  if (! in) return false; // sorry, can't open file!
  int wcount = 0;  // will count no. of words in each sentence
  string word;
  while (in >> word) {
    int len = word.length();
    char ech = word[len-1];   // last character of word
    if (ech == `.' || ech == `,' || ech == `;') {
       word = word.substr(0,len-1); // chop off punctuation
       —len;
    }
    ++word_lengths[len];
    ++wcount;
    if (ech == `.') {
      ++sentence_lengths[wcount];
      wcount = 0;
    } 
  } // end of while(...)
 } // end of collect_file_stats
;> init_arrays();
;> collect_file_stats("chap1.txt",word_lengths,sentence_lengths);
;> show_arr(word_lengths,15);
0 266 854 866 806 472 312 316 232 106 104 56 62 30 4
;> show_arr(sentence_lengths,15);
0 4 0 10 0 4 12 8 6 6 4 4 6 4 8

The result of using this code is two frequency tables, word_lengths and sentence_lengths. word_lengths[1] shows the number of words with one character, word_lengths[2] shows the number of words with two characters, up to about 15 or so characters. I have used the useful function show_arr() from earlier in the chapter to show you the values when the first chapter of The Hound of the Baskervilles is analyzed.

Two Approaches to Histograms

The easiest way to print out a histogram is to print it on its side. You can fill strings or character arrays with a chosen character, but the easiest method is just to write a character out to cout. (You can make the bars thicker by nesting another loop inside the i loop.)

void histo_1 (int values[], int sz)
{
 for(int i = 0; i < sz; i++) {
   for(int k = 0; k < values[i]; k++) cout << `*';
   cout << endl;
 }
}
;> histo_1(sentence_lengths,10);
0
1 ****
2
3 **********
4
5 ****
6 ************
7 ********
8 ******
9 ******

Note that this method is just not going to work with word_lengths, since the values are very large. These values need to be scaled by choosing a maximum value for the display (say 60 characters across) and generating an array within those bounds. The maximum value of the input data is found as before, and the input data is multiplied by the scaling factor, which is just the ratio of the desired maximum value to the actual maximum value.

void scale_data(int in_data[], int out_data[], int sz, int imax)
{
  int maxval = *max_element(in_data,in_data+sz);
  for(int i = 0; i < sz; i++)
     out_data[i] = (imax*in_data[i])/maxval;
}

;> int scaled_lengths[16];
;> scale_data(word_lengths,scaled_lengths,15,60);
;> show_arr(scaled_lengths,15);
0 18 59 60 55 32 21 21 16 7 7 3 4 2 0
;> histo_1(scaled_lengths,15);
0
1 ******************
2 ***********************************************************
3 ************************************************************
4 *******************************************************
5 ********************************
6 *********************
7 *********************
8 ****************
9 *******
10 *******
11 ***
12 ****
13 **
14
;> 

How can you create a histogram that is upright? You start with the maximum value as a level, and then you run along the array, comparing values to the level. If the values are greater than or equal to the level, you write out a string, and if they are less than the level, you write out spaces:

void histo_2 (int values[], int sz)
{
  int level = max_arr(values,sz);   // std method to do this?
  for(; level > 0; level—) {      // leave out 1st part of for-loop
    for(int i = 0; i < sz; i++)
      if (values[i] >= level) cout << "**** ";
                         else cout << "     ";
    cout << endl;
\  }
}
;> scale_data(word_lengths,scaled_lengths,10,20);
;> histo_2(scaled_lengths,10);
               ****
          **** ****
          **** **** ****
          **** **** ****
          **** **** ****
          **** **** ****
          **** **** ****
          **** **** ****
          **** **** ****
          **** **** ****
          **** **** **** ****
          **** **** **** ****
          **** **** **** ****
          **** **** **** **** **** ****
     **** **** **** **** **** **** ****
     **** **** **** **** **** **** **** ****
     **** **** **** **** **** **** **** ****
     **** **** **** **** **** **** **** ****
     **** **** **** **** **** **** **** **** ****
     **** **** **** **** **** **** **** **** ****
**** **** **** **** **** **** **** **** **** ****

This example hardly exploits the supercharged graphics of modern machines, of course. You can also create histograms by using the Turtle Graphics interface of UnderC for Windows. Appendix B, "A Short Library Reference" gives you all the information you need about using Turtle Graphics for now. I will be discussing it further in Chapter 4, "Programs and Libraries."

  • + Share This
  • 🔖 Save To Your Account