A number of improvements can be made to the general strategy I've shown here:
- The original implementation used a naïve approach to discover news topics. It would be relatively easy to incorporate the Natural Language Toolkit (NLTK) for Python to do actual term analysis. Bigram and trigram analysis could then capture multiword terms (such as "white house" and "sierra leone") that are lost when only individual words are considered.
- Time-resolved clustering, using techniques such as Bayesian network analysis, could help to determine which terms are related and highlight previously unknown relationships.
- Additional time-resolved analysis might track how a term's ranking changes over time, which could help evaluate the effectiveness of public relations efforts or the impact of major events such as political campaigns.
- Clustering can be further improved by adding data to the system. Currently, the implementation clusters based only on terms that appear in the headlines. Using the full body of each news story would provide additional context and produce more robust groupings.
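To make the bigram idea from the first item concrete, here is a minimal sketch that uses only the standard library. It is not the original implementation: the function name `headline_bigrams`, the `min_count` threshold, and the sample headlines are all assumptions made for illustration, and the tokenizer is a deliberately simple regular expression.

```python
import re
from collections import Counter


def headline_bigrams(headlines, min_count=2):
    """Count adjacent word pairs across headlines.

    Returns a dict of the pairs seen at least min_count times.
    (Hypothetical helper; the name and threshold are illustrative.)
    """
    counts = Counter()
    for headline in headlines:
        # Crude tokenizer: lowercase runs of letters and apostrophes.
        words = re.findall(r"[a-z']+", headline.lower())
        # Pair each word with its successor within the same headline.
        counts.update(zip(words, words[1:]))
    return {pair: n for pair, n in counts.items() if n >= min_count}


# Made-up sample headlines, not real feed data.
headlines = [
    "White House announces new budget",
    "Budget talks stall at White House",
    "Aid reaches Sierra Leone",
    "Sierra Leone elections begin",
]
frequent = headline_bigrams(headlines)
```

With these sample headlines, only the pairs ("white", "house") and ("sierra", "leone") occur twice, so they are the only ones kept. Raw counts are a blunt instrument; NLTK's collocation tools score candidate pairs statistically and would likely be a better fit for a real feed.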
It would be interesting to build a desktop application that worked this way. The storage requirements for any single user would be minimal, but the number of feeds to poll for data would be relatively large. It may be better to simply provide a second RSS feed of the Monkey News output.
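As a rough sketch of what publishing such a feed might involve, the following builds a minimal RSS 2.0 document with the standard library's `xml.etree.ElementTree`. The function name `build_rss`, the item dictionary layout, and the example.com URLs are placeholders assumed for illustration, not part of the original system.

```python
import xml.etree.ElementTree as ET


def build_rss(title, link, description, items):
    """Return a minimal RSS 2.0 document as a string.

    items is a list of dicts with "title" and "link" keys
    (an assumed layout chosen for this sketch).
    """
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    # RSS 2.0 requires these three channel elements.
    for tag, text in (("title", title), ("link", link),
                      ("description", description)):
        ET.SubElement(channel, tag).text = text
    for item in items:
        node = ET.SubElement(channel, "item")
        ET.SubElement(node, "title").text = item["title"]
        ET.SubElement(node, "link").text = item["link"]
    return ET.tostring(rss, encoding="unicode")


# Placeholder data; the URLs are not real endpoints.
feed_xml = build_rss(
    "Monkey News",
    "http://example.com/monkeynews",
    "Clustered news headlines",
    [{"title": "white house", "link": "http://example.com/topic/1"}],
)
```

A real feed would probably also carry publication dates and per-item descriptions, but the channel and item elements above are enough for a feed reader to parse.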