RSS Clustering: A Unique Approach for Managing Your RSS Feeds
- Jul 8, 2005
The rapid growth of RDF Site Summary (RSS) feeds has facilitated the consumption of information that has been part of the promise of the Internet. RSS, a standard XML format that works well for dynamic sites such as blogs and news outlets, enables consumers to easily digest updated information without having to periodically visit every web site they track. One consequence, however, is that many RSS users are unable to manage their feeds easily due to an overwhelming number of new stories. For subjects such as world news, many of the stories are redundant, which burdens readers with sorting out which stories they've already read. To deal with the twin problems of flooding and redundancy, I've developed an application that reduces the number of items to read and uses the overlapping information to divine interesting topics. In this article, I explain how my prototype system gathers, distills, and presents world news in order to give the reader a more pleasant and efficient RSS experience.
Organizing RSS Feeds
The growth of the quantity of content on the Internet is due in no small part to the ability of any user of the medium to publish to a wide readership, with little effort, at a nominal fee. This revolution in the ability to communicate to a broad audience is as sweeping as the development of the Gutenberg press, with published material that's significantly more accessible to the average world citizen. As a result, the amount of content that a savvy Internet user will want to view as a steady stream can grow substantially. Automated tools have been developed to facilitate this process, helping to ease the burden of keeping current.
RSS aggregation, the process in which various RSS feeds are collected and presented to the user, is typically used as a means of gathering and browsing large amounts of information efficiently. This technique works well for sites that regularly update or change their content. To collect and view these RSS feeds, an application called an RSS aggregator is used. A variety of these aggregators exist for Windows, Mac OS X, and various open source systems such as Linux and BSD. However, the scalability of these RSS aggregator tools is rapidly challenged when a user attempts to monitor more than a few dozen sites using the approach offered by first-generation RSS aggregators. First-generation RSS aggregators are much like an email or a Usenet client, presenting information in a spiral fashion: feed, headline, story. Second-generation RSS aggregators integrate more closely with a web browser, but suffer from similar problems. They also have the more visible difficulty of competing for a very small space in the browser toolbar, which often doesn't have room for more than a half dozen feeds to be front and center.
A typical RSS aggregator has a layout similar to a Usenet reader or a mail user agent. As shown in Figure 1, this model uses a three-pane layout, with navigation typically going from the list of feeds (panel 1) to the list of headlines or stories for the current feed (panel 2) to the extracted entry for a particular title (panel 3). New articles are typically indicated by highlighting the feed and headline.
Figure 1 Layout of the user interface of a typical RSS aggregator application. The feeds list is often on the left side, but sometimes at the top. The stories within any single feed are usually listed in a window above the individual entry being viewed.
The main goal of an RSS reader is to automatically gather content from dynamic sites and highlight new material (see Figure 2). The RSS aggregator application automates this process by periodically polling the subscriptions that the user has chosen, looking for new material. While this system works well for a small number of RSS feeds, this usage model quickly breaks down when the number of feeds grows to several dozen or even hundreds. Under these circumstances, the stream of new content is transformed into a flood of material, and the aggregator tool has only automated the gathering of more material to read. Because an RSS reader gathers material that the user requests, it's reasonable to assume that the user may want to examine all of it. However, a typical human reaction to information overload is to begin "triaging" material: discarding it or skimming it rapidly. In both cases, information is lost. Furthermore, the information is usually presented without any semantic tags to indicate which material offers the highest value. The user is left to make that determination. Finally, for feeds of a similar nature (news feeds concerning global events, for example), a significant amount of redundancy exacerbates the problem.
Figure 2 The RSS aggregator uses color to indicate which stories have not been read (red) and which have been read (black). Individual feeds are indicated as blue boxes.
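The polling step described above can be sketched in a few lines of Python. This is only an illustration, not the code of any particular aggregator: the sample feed, the function names `parse_items` and `poll`, and the choice of an item's link as its identity are all assumptions made for the example. The sample also uses the simpler RSS 2.0 layout rather than the RDF-based format.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed content; a real aggregator would download this
# from each subscribed URL on every polling cycle.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example World News</title>
    <item><title>Story A</title><link>http://example.com/a</link></item>
    <item><title>Story B</title><link>http://example.com/b</link></item>
  </channel>
</rss>"""


def parse_items(feed_xml):
    """Return (link, title) pairs for every <item> in an RSS 2.0 document."""
    root = ET.fromstring(feed_xml)
    return [(item.findtext("link"), item.findtext("title"))
            for item in root.iter("item")]


def poll(feed_xml, seen_links):
    """Return titles of items not seen before, recording them as seen.

    Assumption: an item's link uniquely identifies it across polls.
    """
    new_titles = []
    for link, title in parse_items(feed_xml):
        if link not in seen_links:
            seen_links.add(link)
            new_titles.append(title)
    return new_titles


seen = set()
first_poll = poll(SAMPLE_FEED, seen)   # both stories are new
second_poll = poll(SAMPLE_FEED, seen)  # nothing has changed since
```

The first poll reports every story as new and the second reports none, which is the unread/read distinction shown in Figure 2. Note that nothing in this loop ranks or deduplicates stories across feeds; every new item from every feed lands in front of the reader.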
As a means of improving the scalability of the RSS aggregation approach, I have begun performing second-order analysis on aggregated materials to make use of the redundancy in the information. I dub this technique RSS clustering because it groups stories by topic (see Figure 3). The redundancy observed in any collection of RSS feeds can be used for two main purposes:
- Highlighting the interesting bits of news within a pool of feeds. This scheme is based on the assumption that the importance of a topic is proportional to the number of entries in which it appears.
- Clustering entries around these topics. By clustering the entries, we reduce the volume of information presented to the user at any one time.
Figure 3 Stories that are related by a common topic are grouped to indicate their relationship and to streamline the RSS reading process. These stories are pulled from individual feeds.
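To make the two uses of redundancy concrete, here is a minimal Python sketch. It is not the prototype's actual algorithm, which this section does not spell out. It assumes that a "topic" is simply a non-stopword term shared by at least two headlines, and the stopword list, the function names, and the sample headlines are all invented for illustration.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list (an assumption, not from the article).
STOPWORDS = {"a", "an", "and", "as", "at", "for", "in", "of", "on",
             "the", "to", "with"}


def terms(headline):
    """Lowercase a headline and return its set of non-stopword terms."""
    words = re.findall(r"[a-z0-9]+", headline.lower())
    return {w for w in words if w not in STOPWORDS}


def cluster_by_topic(entries, min_entries=2):
    """Group (feed, headline) entries under each term they share.

    A term counts as a topic only if it appears in at least
    min_entries entries. Topics are returned most frequent first,
    with ties broken alphabetically.
    """
    by_term = defaultdict(list)
    for feed, headline in entries:
        for term in terms(headline):
            by_term[term].append((feed, headline))
    clusters = {t: e for t, e in by_term.items() if len(e) >= min_entries}
    return sorted(clusters.items(), key=lambda kv: (-len(kv[1]), kv[0]))


# Made-up headlines standing in for entries pulled from three feeds.
ENTRIES = [
    ("feed-1", "Earthquake strikes northern Chile"),
    ("feed-2", "Chile earthquake leaves thousands without power"),
    ("feed-3", "Relief effort begins after Chile earthquake"),
    ("feed-1", "Central bank holds interest rates"),
    ("feed-2", "Interest rates unchanged, bank says"),
]

ranked = cluster_by_topic(ENTRIES)
topics = [topic for topic, _ in ranked]
```

With these sample headlines, the terms shared by all three earthquake stories rank ahead of the terms shared by the two banking stories, so the reader sees a handful of topics rather than five separate entries. A real system would likely need stemming, phrase detection, and a way to merge overlapping topics such as "chile" and "earthquake" into one cluster.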
This technique is not new; it has been demonstrated by sites such as Google News, Topix.net, Daypop, and to a larger extent Popdex and Blogdex. All of these sites aggregate dynamic content and use a set of popularity heuristics to determine topics and content found interesting by the community at large: the news publication community for news sites Topix.net, Daypop, and Google News; and the blog publication community for blog sites Blogdex and Popdex. This setup acts as a collaborative filter mechanism.
Topic Mapping and Its Relationship to RSS Clustering
The basic concept behind RSS clustering isn't novel, although this implementation is believed to be one of the first openly described systems to perform such an analysis. One of the roots of this approach lies in topic maps. Topic mapping uses a similar approach to RSS clustering, but is much more mature. It provides an efficient way to identify related material within a large corpus of data. When visualized, topic maps show occurrences on the basis of topics.