Home > Articles > Programming > Ruby

  • Print
  • + Share This
  • 💬 Discuss

1.5 Basic Data Structures

Now that we got into Hamlet, let's analyze the text a bit further. For example, for each Dramatis Persona, we'd like to collect some information, such as how many words they say in total, and how rich their vocabulary is. To do that, we need to be able to associate several data items with one persona.4 To group such information in one place, we can define a data structure as follows:

struct PersonaData {
   uint totalWordsSpoken;
   uint[string] wordCount;
}

In D you get structs and then you get classes. They share many amenities but have different charters: structs are value types, whereas classes are meant for dynamic polymorphism and are accessed solely by reference. That way confusions, slicing-related bugs, and comments à la // No! Do NOT inherit! do not exist. When you design a type, you decide upfront whether it'll be a monomorphic value or a polymorphic reference. C++ famously allows defining ambiguous-gender types, but their use is rare, error-prone, and objectionable enough to warrant simply avoiding them by design.

In our case, we just need to collect some data and we have no polymorphic ambitions, so using struct is a good choice. Let's now define an associative array mapping persona names to PersonaData values:

PersonaData[string] info;

and all we have to do is fill info appropriately from hamlet.txt. This needs some work because a character's paragraph may extend on several lines, so we need to do some simple processing to coalesce physical lines into paragraphs. To figure out how to do that, let's take a look at a short fragment from hamlet.txt, dumped verbatim below (with leading spaces made visible for clarity):

␣␣Pol. Marry, I will teach you! Think yourself a baby
␣␣␣␣That you have ta'en these tenders for true pay,
␣␣␣␣Which are not sterling. Tender yourself more dearly,
␣␣␣␣Or (not to crack the wind of the poor phrase,
␣␣␣␣Running it thus) you'll tender me a fool.
␣␣Oph. My lord, he hath importun'd me with love
␣␣␣␣In honourable fashion.
␣␣Pol. Ay, fashion you may call it. Go to, go to!

Whether or not Polonius' enthusiasm about goto was a factor in his demise is, even to this day, a matter of speculation. Regardless of that, let's note how each character's line is preceded by exactly two spaces, followed by the (possibly contracted) character's name, followed by a period and a space, finally followed by the actual content of the line. If a logical line extends to multiple physical lines, the continuations are always preceded by exactly four spaces. We could do such simple pattern matching by using a regular expression engine (found in the std.regex module), but we want to learn arrays so let's match things "by hand." We only enlist the help of the Boolean function a.startsWith(b), defined by std.algorithm, which tells whether a starts with b.

The main driver reads input lines, concatenates them in logical paragraphs (ignoring everything that doesn't fit our pattern), passes complete paragraphs to an accumulator function, and at the end prints the desired information.

import std.algorithm, std.ctype, std.regex,
   std.range, std.stdio, std.string;

struct PersonaData {
   uint totalWordsSpoken;
   uint[string] wordCount;
}

void main() {
   // Accumulates information about dramatic personae
   PersonaData[string] info;
   // Fill info
   string currentParagraph;
   foreach (line; stdin.byLine()) {
       if (line.startsWith("   ")
             && line.length > 4
             && isalpha(line[4])) {
          // Persona is continuing a line
          currentParagraph ~= line[3 .. $];
       } else if (line.startsWith(" ")
             && line.length > 2
             && isalpha(line[2])) {
          // Persona just started speaking
          addParagraph(currentParagraph, info);
          currentParagraph = line[2 .. $].idup;
       }
   }
   // Done, now print collected information
   printResults(info);
}

After we've equipped ourselves with information on how arrays work, the code should be self-explanatory, save for the presence of .idup. Why is it needed, and what if we forgot about it?

The foreach loop that reads from stdin deposits successive lines of text in the variable line. Because it would be wasteful to allocate a brand new string for each line read, the contents of line is reused every pass through the loop. As such, if you want to squirrel away the contents of a line, you better make a copy of it. Obviously currentParagraph is meant to indeed save text, so duplication is needed, hence the presence of .idup. Now, if we forgot .idup and subsequently the code would still compile and run, the results would be nonsensical and the bug rather hard to find. Having a part of a program modify data held in a different part of the program is very unpleasant to track down because it's a non-local effect (just how many .idups could one forget in a large program?). Fortunately, that's not the case because the types of line and currentParagraph reflect their respective capabilities: line has type char[], i.e., an array of characters that could be overwritten at any time; whereas currentParagraph has type string, which is an array of characters that cannot be individually modified. The two cannot refer to the same memory content because line would break the promise of currentParagraph. So the compiler refuses to compile the erroneous code and demands a copy, which you provide in the form of .idup, and everybody's happy. (The "i" in "idup" stands for "immutable.") On the other hand, when you copy string values around, there's no more need to duplicate the underlying data—they can all point to the same memory because it's known neither will overwrite it, which makes string copying at the same time safe and efficient. Better yet, strings can be shared across threads without problems because, again, there's never contention. Immutability is really cool indeed. If, on the other hand, you need to modify individual characters intensively, you may want to operate on char[], at least temporarily.

PersonaData as defined above is very simple, but in general structs can define not only data, but also other entities such as private sections, member functions, unittests, operators, constructors, and destructor. By default, each data member of a structure is initialized with its default initializer (zero for integral numbers, NaN for floating point numbers,5 and null for arrays and other indirect-access types). Let's now implement addParagraph that slices and dices a line of text and puts it into the associative array.

The line as served by main has the form "Ham. To be, or not to be-that is the question:" We need to find the first ". " to distinguish the persona's name from the actual line. To do so, we use the find function. a.find(b) returns the right-hand portion of a starting with the first occur-rence of b. (If no occurrence is found, find returns an empty string.) While we're at it, we should also do the right thing when collecting the vocabulary. First, we must convert the sentence to lowercase such that capitalized and non-capitalized words count as the same vocabulary element. That's easily taken care of with a call to tolower. Second, we must eliminate a strong source of noise: punctuation that makes for example "him." and "him" count as distinct words. To clean up the vocabulary, all we need to do is pass an additional parameter to split mentioning a regular expression that eliminates all chaff: regex("[ \t,.;:?]+"). With that argument, the split function will consider any sequence of the characters mentioned in between '[' and ']' as part of word separators. That being said, we're ready to do a lot of good stuff in just little code:

void addParagraph(string line, ref PersonaData[string] info) {
   // Figure out persona and sentence
   line = strip(line);
   auto sentence = line.find(". ");
   if (sentence.empty) {
      return;
   }
   auto persona = line[0 .. $ - sentence.length];
   sentence = tolower(strip(sentence[2 .. $]));
   // Get the words spoken
   auto words = split(sentence, regex("[ \t,.;:?]+"));
   // Insert or update information
   auto data = persona in info;
   if (data) {
      // heard this persona before
      data.totalWordsSpoken += words.length;
      foreach (word; words) ++data.wordCount[word];
   } else {
      // first time this persona speaketh
      PersonaData newData;
      newData.totalWordsSpoken = words.length;
      foreach (word; words) newData.wordCount[word] = 1;
      info[persona] = newData;
   }
}

The expression persona in info not only tells whether a given string is present as a key in an associative array, but also provides a handy pointer to the corresponding value in the array. In D there is no need (as it is in C and C++) to use '->' to access data referenced by a pointer—the regular field access operator '.' works unambiguously. If the key was not found, our code creates and inserts a brand new PersonaData, which concludes the addParagraph function.

Finally, let's implement printResults to print a quick summary for each persona:

void printResults(PersonaData[string] info) {
   foreach (persona, data; info) {
      writefln("%20s %6u %6u", persona, data.totalWordsSpoken,
         data.wordCount.length);
   }
}

Ready for a test drive? Save and run!

     Queen  1104    500
       Ros   738    338
       For    55     45
      Fort    74     61
 Gentlemen     4      3
     Other   105     75
      Guil   349    176
       Mar   423    231
      Capt    92     66
      Lord    70     49
      Both    44     24
       Oph   998    401
     Ghost   683    350
       All    20     17
    Player    16     14
      Laer  1507    606
       Pol  2626    870
    Priest    92     66
       Hor  2129    763
      King  4153   1251
Cor., Volt    11     11
 Both [Mar     8      8
       Osr   379    179
      Mess   110     79
    Sailor    42     36
   Servant    11     10
Ambassador    41     34
      Fran    64     47
     Clown   665    298
      Gent   101     77
       Ham 11901   2822
       Ber   220    135
      Volt   150    112
       Rey    80     37

Now that's some fun stuff. Unsurprisingly, our friend "Ham" gets the lion's share by a large margin. Voltemand's (Volt) role is rather interesting: he doesn't have many words to say, but in these few words he does his best in displaying a solid vocabulary, not to mention the Sailor, who almost doesn't repeat himself. Also compare the well-rounded Queen with Ophelia: the Queen has about 10% more words to say than Ophelia, but her vocabulary is about 25% larger.

As you can see, the output has some noise in it (such as "Both [Mar"), easy to fix by a diligent programmer and hardly affecting the important statistics. Nevertheless, fixing the last little glitches would be an instructive (and recommended) exercise.

  • + Share This
  • 🔖 Save To Your Account

Discussions

comments powered by Disqus