In text mining it seems obvious that we should use all the data we can get our hands on when drawing conclusions. The temptation is always to use the broadest possible query to select the data set, because we don’t want to miss anything that might be important. The problem with such an all-inclusive strategy is that it often adds noise that obscures the signal we are trying to detect. For instance, if I’m doing a study for a chocolate candy manufacturer and simply enter the query “chocolate”, the vast majority of the data I collect will have nothing whatever to do with chocolate candy. This will make it much harder to detect the relevant trends and themes related to chocolate candy, because they will be obscured by unrelated topics such as the color chocolate or chocolate ice cream. So the query “chocolate candy” might actually make more sense, even though it leaves out a lot of relevant data. As long as we have enough data, adding more that is mostly irrelevant could actually make our analysis less effective.
But how much data is enough? The answer may surprise you. It doesn’t really take as much data as you might think to spot a potentially interesting trend or correlation. To see why, let’s try a simple thought experiment. Say we are given a coin and we are told that it may or may not be “loaded”, where a loaded coin is one that nearly always comes up heads when flipped, whereas a normal coin will only come up heads half the time. How many flips of the coin will it take for me to decide, with 99% confidence, whether the coin is fair or loaded? If every flip comes up heads, the answer is 7. A fair coin has a 50% chance (1/2) of coming up heads on the first flip, a 25% chance (1/4) of coming up heads twice in a row… and less than a 1% chance (1/128, or about .008) of coming up heads seven times in a row. So in this simple experiment I only needed seven data points to tell that something was probably amiss with the coin.
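The arithmetic behind that answer is easy to check. Here is a minimal sketch in Python; the function name `flips_needed` and the parameter `alpha` (the chance we are willing to accept that a fair coin fooled us, 1% here) are my own labels for illustration. It assumes every flip comes up heads.

```python
def flips_needed(alpha: float = 0.01) -> int:
    """Smallest number of consecutive heads that a fair coin
    would produce with probability below alpha."""
    n = 1
    while 0.5 ** n >= alpha:
        n += 1
    return n


# Chance that a fair coin produces n heads in a row, for n = 1..7.
for n in range(1, 8):
    print(n, 0.5 ** n)

print(flips_needed(0.01))  # 7, because 1/128 is about 0.0078
```

Loosening the threshold to 5% (`flips_needed(0.05)`) brings the answer down to 5 flips.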
But if seven examples are enough to draw a conclusion from a simple experiment, why do we typically use thousands of examples to draw conclusions from text? There are actually a couple of reasons. Partly it’s because we frequently don’t get to design our experiment before the data is generated. We basically have to take whatever data is given to us, and some of it is certain to be redundant or irrelevant for our purposes. The other issue is that we usually aren’t simply trying to determine the answer to one yes/no question (e.g. “is the coin loaded or not”) but rather are looking across thousands of potential features and correlations to find a handful that are potentially interesting. When you have to cover more bases, you naturally need more data to do it with.
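To put a rough number on “covering more bases”, we can stretch the coin analogy to many coins at once. The sketch below is my own illustration, not a method from any particular text mining tool. It assumes we want 99% confidence across all the coins together, and it uses a simple Bonferroni-style split of the 1% error budget across the questions being asked. The names `flips_needed_per_coin` and `num_questions` are hypothetical.

```python
def flips_needed_per_coin(alpha: float = 0.01, num_questions: int = 1) -> int:
    """Consecutive heads needed per coin when the overall error budget
    alpha is divided evenly across num_questions coins (Bonferroni-style)."""
    per_question_alpha = alpha / num_questions
    n = 1
    while 0.5 ** n >= per_question_alpha:
        n += 1
    return n


one = flips_needed_per_coin(0.01, num_questions=1)
many = flips_needed_per_coin(0.01, num_questions=1000)

print(one)          # 7 flips for a single coin
print(many)         # 17 flips for each of 1,000 coins
print(many * 1000)  # 17000 flips in total
```

Real text is messier than this, since any given feature shows up in only a fraction of the documents, but the direction is the same: the more questions you ask of the data, the more data you need.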
So the better and more relevant the data, and the more focused the subject of the analysis, the less data you actually need to get an accurate picture. Typically, when I get a fairly focused set of short documents (paragraphs) that are relevant to the subject under study, I can get a pretty good picture of between 25 and 50 themes using between 1,000 and 10,000 documents. A set of around 500 documents usually turns out to be too small to be interesting (it might even be easier just to read the documents one by one than to try to analyze them using text mining techniques). Once I get above 100,000 documents, I will usually either sample the data or divide it into smaller chunks using some other feature of interest.
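Both options for an oversized collection are simple to sketch. The Python below is only an illustration under my own assumptions: documents are dictionaries, the `"source"` field stands in for whatever “feature of interest” you have, and the sample size of 10,000 is just the upper end of the range mentioned above. The function names are hypothetical.

```python
import random
from collections import defaultdict

MAX_DOCS = 100_000    # threshold from the text above
SAMPLE_SIZE = 10_000  # assumption: upper end of the 1,000 to 10,000 range


def sample_documents(docs, sample_size=SAMPLE_SIZE, seed=42):
    """Return a reproducible random sample without replacement."""
    if len(docs) <= sample_size:
        return list(docs)
    rng = random.Random(seed)
    return rng.sample(docs, sample_size)


def split_by_feature(docs, feature):
    """Group documents into smaller chunks by the value of one field."""
    chunks = defaultdict(list)
    for doc in docs:
        chunks[doc.get(feature)].append(doc)
    return dict(chunks)


# Toy corpus: 150,000 made-up documents with a hypothetical "source" field.
docs = [
    {"text": f"review {i}", "source": "blog" if i % 3 == 0 else "forum"}
    for i in range(150_000)
]

if len(docs) > MAX_DOCS:
    sample = sample_documents(docs)           # option 1: sample
    chunks = split_by_feature(docs, "source") # option 2: divide
    print(len(sample))
    print({name: len(group) for name, group in chunks.items()})
```

Fixing the random seed keeps the sample reproducible, which helps when you want to rerun the same analysis later.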
The moral of the story is that adding more data is not a panacea. Being thoughtful about what you want to study and why, and then carefully selecting data that is relevant to those objectives, will produce much better results in the end.