There are many ways that text mining can help the analyst gain a quick understanding of a collection of similar documents. These include summary word statistics describing the frequently occurring words and phrases that the documents in the collection share. Graphical displays such as scatter plots may help to show relationships between documents in a collection. One of the most powerful ways to understand a collection is by viewing its examples directly. Unfortunately, this is also a very time-consuming approach. One way to make the process of looking at example documents more effective is through automatic selection of Typical Examples.
Typical Examples are analogous to the concept of customer archetypes in Market Research. Market Researchers will sometimes create a composite customer from a database of customers in a market segment. Sometimes referred to as an “archetype”, the characteristics of this fictional consumer will usually consist of the mean (or median) values of the various data points contained in that segment.
The value of an archetype is that it quickly communicates to the consumers of the market research the most salient features of the market segment. It gets behind the numbers to create a “story” that helps the company decision makers to “see” the customer more clearly and thus make better decisions about how to meet that customer segment's needs.
We can use a similar technique in text mining to describe a category of text examples. By selecting the most typical example, we in a sense create an archetype for the text category. This example serves as a single representative document that quickly summarizes the contents of the category. Just like the archetype consumer, the most typical example document serves to tell a story that makes the numbers more explicit to those interpreting text mining results.
So how do we find this “typical example”? The answer is surprisingly simple. Using a vector space model based on dictionary terms, we calculate a centroid (average vector) of all examples in the category. We then calculate the distance between the vector of each example in the category and this centroid. The example with the smallest distance (that is, the one most similar to the centroid) is the “most typical”. All the examples can be sorted using this distance as the ordering criterion, so that the analyst can read them in “most typical” order. The numeric feature space centroid that describes the category mathematically is then communicated to the analyst via the example that most nearly resembles it.
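The steps just described can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the term dictionary and the four short documents are invented, the vectors are raw term counts, and cosine distance is used as the distance measure. The text does not prescribe a particular weighting or metric, so any of these choices could be swapped for another.

```python
import math
from collections import Counter


def vectorize(text, dictionary):
    """Count how often each dictionary term occurs in the text."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in dictionary]


def centroid(vectors):
    """Average the vectors component by component."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_distance(a, b):
    """1 - cosine similarity; assumed metric, others would also work."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # a vector with no dictionary terms is treated as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def rank_by_typicality(docs, dictionary):
    """Return (distance, document) pairs, most typical first."""
    vectors = [vectorize(doc, dictionary) for doc in docs]
    center = centroid(vectors)
    scored = [(cosine_distance(vec, center), doc) for vec, doc in zip(vectors, docs)]
    return sorted(scored, key=lambda pair: pair[0])


# Hypothetical dictionary and category contents, for illustration only.
dictionary = ["printer", "jam", "paper", "toner", "refund"]
docs = [
    "printer jam paper",
    "printer paper jam again printer",
    "toner refund",
    "printer jam",
]

ranked = rank_by_typicality(docs, dictionary)
for distance, doc in ranked:
    print(f"{distance:.3f}  {doc}")
```

In this toy category the longer printer jam report sorts first as the most typical example, and the toner refund note sorts last, because its terms contribute little to the centroid.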
This simple technique has many applications in the interpretation of text mining results. It can be used to quickly understand the essence of individual categories in a taxonomy of documents. It can also be used to order the results of a search query, bringing to the top those documents that are most like all the others in the result set. The reverse ordering can even be used to help find outliers and determine the boundaries of a text category.
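As a rough sketch of the reordering idea, the same distance to the centroid can be sorted in either direction: ascending to bring the most typical results to the top, descending to surface candidate outliers. The document identifiers and term-count vectors below are hypothetical, and cosine distance is again an assumed choice rather than something the technique requires.

```python
import math


def centroid(vectors):
    """Average the vectors component by component."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_distance(a, b):
    """1 - cosine similarity (assumed metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


# Hypothetical result set: document id -> term-count vector.
result_set = {
    "doc_a": [2, 1, 0],
    "doc_b": [1, 1, 0],
    "doc_c": [3, 1, 0],
    "doc_d": [0, 0, 3],
}

center = centroid(list(result_set.values()))
distances = {doc_id: cosine_distance(vec, center) for doc_id, vec in result_set.items()}

# Ascending distance: documents most like the rest of the result set come first.
typical_first = sorted(distances, key=distances.get)

# Descending distance: likely outliers come first, hinting at the category boundary.
outliers_first = sorted(distances, key=distances.get, reverse=True)

print("typical first: ", typical_first)
print("outliers first:", outliers_first)
```

Here doc_d, which shares no terms with the other three, heads the outlier ordering. How far down that list the category boundary lies remains a judgment call for the analyst.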
Ultimately, the greatest insights from text mining come not from summary statistics, but from the text of the examples themselves. Choosing the right examples is crucial to getting the most value out of the analysis. Typical examples are one way to ensure that the examples we read are the ones that yield the most insight into our unstructured information.