
RSS Clustering: A Unique Approach for Managing Your RSS Feeds

Date: Jul 8, 2005


Content syndicated via RDF site summary (RSS) feeds is a great feature of the Net, with one big drawback: WTMI (way too much information). Jose Nazario discusses a custom RSS aggregation approach that allows the user to handle large volumes of RSS data, as well as find interesting trends within the flood.

The rapid growth of RDF site summary (RSS) feeds has facilitated the kind of information consumption that has long been part of the promise of the Internet. RSS, a standard XML format that works well for dynamic sites such as blogs and news outlets, enables consumers to easily digest updated information without having to periodically visit every web site they track. One consequence, however, is that many RSS users are unable to manage their feeds easily due to an overwhelming number of new stories. For subjects such as world news, many of the stories are redundant, adding a burden to readers to sort out which stories they've already read. To deal with the twin problems of flooding and redundancy, I've developed an application that reduces the number of items to read and uses the overlapping information to divine interesting topics. In this article, I explain how my prototype system gathers, distills, and presents world news in order to give the reader a more pleasant and efficient RSS experience.

Organizing RSS Feeds

The growth of the quantity of content on the Internet is due in no small part to the ability of any user of the medium to publish to a wide readership, with little effort and at nominal cost. This revolution in the ability to communicate to a broad audience is as sweeping as the development of the Gutenberg press, with published material that's significantly more accessible to the average world citizen. As a result, the amount of content that a savvy Internet user will want to view as a steady stream can grow substantially. Automated tools have been developed to facilitate this process, helping to ease the burden of keeping current.

RSS Aggregators

RSS aggregation, the process in which various RSS feeds are collected and presented to the user, is typically used as a means of gathering and browsing large amounts of information efficiently. This technique works well for sites that regularly update or change their content. To collect and view these RSS feeds, an application called an RSS aggregator is used. A variety of these aggregators exist for Windows, Mac OS X, and various open source systems such as Linux and BSD. However, the scalability of these tools is rapidly challenged when a user attempts to monitor more than a few dozen sites. First-generation RSS aggregators work much like an email or Usenet client, presenting information in a drill-down fashion: feed, headline, story. Second-generation RSS aggregators integrate more closely with a web browser, but suffer from similar problems. They also have the more visible difficulty of competing for a very small space in the browser toolbar, which often doesn't have room for more than a half dozen feeds to be front and center.

A typical RSS aggregator has a layout similar to a Usenet reader or a mail user agent. As shown in Figure 1, this model uses a three-pane layout, with navigation typically going from the list of feeds (panel 1) to the list of headlines or stories for the current feed (panel 2) to the extracted entry for a particular title (panel 3). New articles are typically indicated by highlighting the feed and headline.

Figure 1

Figure 1 Layout of the user interface of a typical RSS aggregator application. The feeds list is often on the left side, but sometimes at the top. The stories within any single feed are usually listed in a window above the individual entry being viewed.

The main goal of an RSS reader is to automatically gather content from dynamic sites and highlight new material (see Figure 2). The RSS aggregator application automates this process by periodically polling the subscriptions that the user has chosen, looking for new material. While this system works well for a small number of RSS feeds, this usage model quickly breaks down when the number of feeds grows to several dozen or even hundreds. Under these circumstances, the stream of new content is transformed into a flood of material, and the aggregator tool has automated the gathering of more material to read. Because an RSS reader gathers material that the user requests, it's reasonable to assume that the user may want to examine all of it. However, a typical human reaction to information overload is to begin "triaging" material—discarding it or skimming it rapidly. In both cases, information is lost. Furthermore, the information is usually presented without any semantic tags to indicate which material offers the highest value. The user is left to make that determination. Finally, for feeds of a similar nature (news feeds concerning global events, for example), a significant amount of redundancy exacerbates the problem.

Figure 2

Figure 2 The RSS aggregator uses color to indicate which stories have not been read (red) and which have been read (black). Individual feeds are indicated as blue boxes.

RSS Clustering

As a means of improving the scalability of the RSS aggregation approach, I have begun performing second-order analysis on aggregated materials to make use of the redundancy in the information. I dub this technique RSS clustering because it groups stories by topic (see Figure 3). The redundancy observed in any collection of RSS feeds can be used for two main purposes: to collapse related stories into a single group, reducing the number of items the user must read; and to identify, from the amount of overlap, which topics are currently of the greatest interest.

Figure 3

Figure 3 Stories that are related by a common topic are grouped to indicate their relationship and to streamline the RSS reading process. These stories are pulled from individual feeds.

This technique is not new or novel; it has been demonstrated by sites such as Google News, Topix.net, Daypop, and to a larger extent Popdex and Blogdex. All of these sites aggregate dynamic content and use a set of popularity heuristics to determine topics and content found interesting by the community at large: the news publication community for Google News, Topix.net, and Daypop; and the blog publication community for Blogdex and Popdex. This setup acts as a collaborative filtering mechanism.

Topic Mapping and Its Relationship to RSS Clustering

The basic concept behind RSS clustering isn't novel, although this implementation is believed to be one of the first openly described systems to perform such an analysis. One of the roots of this approach lies in topic maps. Topic mapping uses a similar approach to RSS clustering, but is much more mature. It provides an efficient way to identify related material within a large corpus of data. When visualized, topic maps show occurrences on the basis of topics.

A New Approach to RSS Consumption

My approach goes one step beyond topic mapping by clustering similar topic occurrences. Here's a quick view of the method (I'll get to specifics in a second):

  1. Collect the data by downloading and parsing the RSS feeds.
  2. Normalize the data into a specific format.
  3. Normalize the case, remove punctuation, and strip the stop words.
  4. Stem the remaining terms to find their root words.
  5. Correlate the data to find interesting terms.
  6. Cluster the news stories by terms of interest.
  7. Display the results.

Collecting the Data

We collect the data by downloading and parsing the RSS feeds in question. This can be a collection of feeds from sources on any topic you want. The technique works best if you have a focused topic and feeds surrounding that topic; if you want a broader set of topics, you'll need to scale up the quantity of feeds appropriately.
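
As a concrete sketch of this step, the following Python fragment uses the feedparser library (an assumption; the article doesn't name its parser) to pull a list of feeds into a flat list of stories. The feed URLs are placeholders.

import feedparser  # Universal Feed Parser; the article doesn't name its parser, so this is an assumption

FEEDS = [
    "http://example.com/world/rss.xml",   # placeholder feed URLs
    "http://example.org/news/index.rdf",
]

def collect(feed_urls):
    """Download each feed and return a flat list of (source, title, summary, link) tuples."""
    stories = []
    for url in feed_urls:
        parsed = feedparser.parse(url)
        for entry in parsed.entries:
            stories.append((url,
                            entry.get("title", ""),
                            entry.get("summary", ""),
                            entry.get("link", "")))
    return stories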

Normalizing the Data

Once you have the feeds, you normalize the data into a format that you can easily work with. This may mean inserting the entries into a database as rows or storing them in some other format. Storing in XML is a possibility, but you'll have to use additional XML processing tools to gather the data, incurring a lot of overhead for processing and complexity. I find it most useful to rip the data from the feeds into a small database and then re-extract it for analysis. This strategy allows me to select a fixed number of stories for analysis and also to store the data over time for more long-term analysis.
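
A minimal sketch of this step, assuming SQLite as the small database (the article doesn't specify one); the table layout is purely illustrative.

import sqlite3

def store(stories, db_path="stories.db"):
    """Insert (source, title, summary, link) tuples into a small SQLite table."""
    conn = sqlite3.connect(db_path)
    conn.execute("""CREATE TABLE IF NOT EXISTS stories
                    (source TEXT, title TEXT, summary TEXT, link TEXT UNIQUE)""")
    conn.executemany("INSERT OR IGNORE INTO stories VALUES (?, ?, ?, ?)", stories)
    conn.commit()
    conn.close()

def load_recent(limit=1000, db_path="stories.db"):
    """Re-extract a fixed number of stories for analysis."""
    conn = sqlite3.connect(db_path)
    rows = conn.execute("SELECT source, title, summary, link FROM stories LIMIT ?",
                        (limit,)).fetchall()
    conn.close()
    return rows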

Removing Stop Words

Next, we need to normalize the case, remove punctuation, and strip stop words from the data to get at the bare and interesting terms. Stop words are terms that appear in almost any entry and don't give any meaning, such as I, you, and—even terms like world or national. Punctuation removal is pretty obvious; basically, you want to get at the bare words. In this step, a paragraph like the following:

A man named Fred was stopped outside of your home today. He was walking
a dog named Chief past a fire hydrant on Maple Street.

would end up looking like this:

man named fred stopped outside home today walking dog named chief past
fire hydrant maple street

Additional criteria are always possible to weed out the uninteresting bits.
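
A minimal sketch of this normalization step; the stop word list shown is a tiny illustrative subset, not the full list a real system would use.

import string

# A tiny illustrative subset of a stop word list; a real list would be much longer.
STOP_WORDS = {"a", "an", "and", "the", "i", "you", "he", "she", "it", "was",
              "of", "on", "to", "your", "world", "national"}

def normalize(text):
    """Lowercase the text, strip punctuation, and drop stop words."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [word for word in text.split() if word not in STOP_WORDS]

# normalize("A man named Fred was stopped outside of your home today.")
# -> ['man', 'named', 'fred', 'stopped', 'outside', 'home', 'today']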

Determining the Root Words

Now we'll stem the remaining words to find their root words. This step accomplishes a simple task: It reduces false differences and increases legitimate overlap. Words such as fires, fired, and fire all look the same once their endings have been removed. The sentence from the previous step would now look like this:

man name fred stop outside home today walk dog name chief past
fire hydrant maple street
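
The article doesn't name its stemmer; as a sketch, the Porter stemmer from NLTK could fill this step. Its output won't match the hand-worked example exactly (it reduces maple to mapl, for instance), but it collapses fires, fired, and fire as intended.

from nltk.stem import PorterStemmer  # NLTK is an assumption; the article's stemmer is unspecified

stemmer = PorterStemmer()

def stem_terms(terms):
    """Reduce each normalized term to a root form so that variants overlap."""
    return [stemmer.stem(term) for term in terms]

# stem_terms(["fires", "fired", "fire"]) -> ['fire', 'fire', 'fire']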

Correlating the Data

We discover the "interesting" terms by correlating the data by any number of means. A simple, static approach is to count the terms left over from the previous steps and rank them by those counts. More advanced methods include natural language processing (NLP) approaches that discover the subjects of the sentences. Time-based systems can look at a window of time and count appearances of interesting terms; or, if you like the Daypop style, look at the relative trajectories of terms over time. In that setup, the interesting topics aren't the ones that are the most popular, but the fastest rising. Another approach may be to look at topics that appear together, such as a subject and a location.

A simple frequency count of the earlier example, pruned and stemmed, would look like this:

2 name
1 walk
1 today
1 street
1 stop
1 past
1 outside
1 maple
1 man
1 hydrant
1 home
1 fred
1 fire
1 dog
1 chief
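
A sketch of the simple counting approach using collections.Counter, reusing the normalize and stem_terms helpers sketched in the earlier steps.

from collections import Counter

def term_counts(stories):
    """Tally stemmed terms across all stories.

    Each story is a (source, title, summary, link) tuple; normalize() and
    stem_terms() are the helper functions sketched earlier."""
    counts = Counter()
    for source, title, summary, link in stories:
        counts.update(stem_terms(normalize(title + " " + summary)))
    return counts

# term_counts(stories).most_common(20) lists the top candidate topics,
# analogous to the frequency list above.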

Clustering the Stories

We cluster news stories by finding which stories are related to the terms of interest. Some people will allow a story to be reused multiple times for any terms that it contains, but I prefer to use a story only once. This plan reduces the size of the output (but risks that the term used to pull it in the first place was not the best subject of the story).
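
A sketch of the one-story-per-cluster policy described above, again reusing the helpers from the earlier sketches; the top_n cutoff is an illustrative parameter, not one taken from the article.

def cluster(stories, counts, top_n=25):
    """Group stories under the most frequent terms, using each story at most once."""
    # Precompute each story's term set once; helpers come from the earlier sketches.
    story_terms = [(story, set(stem_terms(normalize(story[1] + " " + story[2]))))
                   for story in stories]
    used = set()
    clusters = {}
    for term, _ in counts.most_common(top_n):
        group = []
        for story, terms in story_terms:
            if term in terms and story not in used:
                group.append(story)
                used.add(story)
        if len(group) > 1:  # keep only terms that actually correlate several stories
            clusters[term] = group
    return clusters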

Displaying the Results

Results can be displayed in any number of formats. The following section shows the results of clustering rendered as a graph, for example; alternatively, a text format can be used for direct reading. The display should give some indication of which stories are grouped together and should facilitate navigation of the output.
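
As one possible rendering, the following sketch writes the clusters as an undirected graph in Graphviz's DOT format, which neato can then lay out in the style of Figure 4; the star-shaped arrangement of each cluster is an assumption of this sketch.

def write_dot(clusters, path="clusters.dot"):
    """Write each cluster as a star: the shared term at the center, story titles around it."""
    with open(path, "w") as out:
        out.write("graph rss_clusters {\n")
        for term, group in clusters.items():
            for source, title, summary, link in group:
                out.write('  "%s" -- "%s";\n' % (term, title.replace('"', "'")))
        out.write("}\n")

# Lay out and render with, for example:  neato -Tsvg clusters.dot -o clusters.svg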

Program Specs

The steps previously described have been implemented in a small toolkit written in the Python language. Here are some details:

A list of more than 60 individual RSS feeds from several dozen world news sites was processed using this technique. Of 1,154 unique stories, 458 were correlated to other stories within this data set. The clustering that resulted from this process is shown in Figure 4. The process took approximately two minutes on a 1.4 GHz Pentium 4 processor running Python 2.3, including fetching all of the URLs in the feed list. Processing with neato took approximately three minutes.

Figure 4

Figure 4 Sample output of RSS clustering with input data from 66 sources (NYT, BBC, AP, UPI, etc.). This data was gathered on 31 August 2004. The clustering output was processed using the neato tool from the Graphviz toolkit.

The headline at the center of each cluster is the story with the most detailed description, not necessarily the one closest to all of the other stories. With Adobe's SVG viewer, available for Linux, Windows, and OS X, the user can use the right mouse button to open the SVG contextual menu and then zoom or move the image.

At this point, no metrics are included to evaluate the quality of the groupings, which makes it difficult to measure the impact of any intended improvements to the approach.

A more practical use of the mechanism is shown in the next couple of figures. I've used these methods to develop a personal news site called Monkey News, using several dozen news feeds from around the world. Because of the structure of a typical news feed, subjects are clearly indicated. A typical news headline, for example, usually includes a subject and an action, and often a geographic location. Most daily newspapers follow this format for regular stories, since a headline must clearly indicate to readers what the news is and whether they will want to read it. Magazines and other, less-frequent sources of news often don't identify the subject so clearly. These sources are difficult to use in the aggregation system without deeper analysis.

In this system, the top six topics are indicated in the header. Stories are gathered by topic and arranged in descending order of popularity. The system presents this information as a static HTML page updated every two hours (see Figure 5).

Figure 5

Figure 5 Image capture of the front page of the Monkey News site, showing the top six subjects and the first two clusters of stories grouped by topic. Stories appear in descending order of popularity as determined by topic mention from the various sites.
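
A sketch of how such a static page might be generated; the markup, the popularity ordering by group size, and the script name in the cron example are illustrative, not taken from the Monkey News code.

def write_page(clusters, path="index.html"):
    """Render the clusters as a static page, most popular topic first (HTML escaping omitted)."""
    ordered = sorted(clusters.items(), key=lambda item: len(item[1]), reverse=True)
    with open(path, "w") as out:
        out.write("<html><body>\n")
        out.write("<p>Top topics: %s</p>\n" % ", ".join(term for term, _ in ordered[:6]))
        for term, group in ordered:
            source, title, summary, link = group[0]   # the lead story carries the description
            out.write('<h2><a href="%s">%s</a></h2>\n<p>%s</p>\n<ul>\n' % (link, title, summary))
            for source, title, summary, link in group[1:]:
                out.write('<li><a href="%s">%s</a> (%s)</li>\n' % (link, title, source))
            out.write("</ul>\n")
        out.write("</body></html>\n")

# A cron entry such as "0 */2 * * * python monkeynews.py" would regenerate the page
# every two hours; "monkeynews.py" is a hypothetical driver script name.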

An individual group is shown in Figure 6. The top of the grouping includes the headline, source identification, and a small paragraph describing the topic. Additional sources are grouped underneath this initial story, allowing for browsing of these stories.

Figure 6

Figure 6 Screen capture of an individual story block from Monkey News. The grouping is clearly identifiable based on the topic, and the usability of a single group is evident. The top story in the group has the description of the story, with supplemental links and sources clearly identified.

Using this method, the number of news sources can be easily expanded without increasing the burden on the reader; instead, the accuracy of the groups is improved. The popularity of a topic is inferred from the quantity of sources "talking" about a subject, each of which provides a "vote" for that topic. This information is inherent in the collection of stories and feeds and doesn't come from external sources.

Monkey News has proven to be a useful site. During the 2004 presidential election, for example, the viewpoints of various news sites were constantly available. During the invasion of Iraq by U.S.-led forces, a constant stream of updates was available from various sources both foreign and domestic. While this information could have been gathered using a traditional RSS aggregator, the number of stories would have been overwhelming. This new approach clearly reduces clutter and improves readability over the basic, flat approach of an RSS aggregator. More than 1,000 stories a day are gathered, grouped, and presented using this system.

Evaluation of the Method

While the method described above works well for handling large quantities of world news feeds, it fails in a number of other situations. These inherent limitations in the approach require significant changes to the method.

Future Work

A number of improvements can be made to the general strategy I've shown here:

It would be interesting to build a desktop application that acted in this fashion. The storage requirements for any single user would be minimal, but the quantity of feeds to poll for data would be relatively large. It may be better to simply provide a second RSS feed of the Monkey News output.
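
As a sketch of that last idea, the clustered output could itself be serialized as a minimal RSS 2.0 feed; the channel metadata below is placeholder only.

from xml.sax.saxutils import escape

def write_rss(clusters, path="clusters.rss"):
    """Serialize the clustered output as a minimal RSS 2.0 feed, one item per cluster."""
    items = []
    for term, group in clusters.items():
        source, title, summary, link = group[0]
        items.append("<item><title>%s (%d sources)</title><link>%s</link>"
                     "<description>%s</description></item>"
                     % (escape(title), len(group), escape(link), escape(summary)))
    with open(path, "w") as out:
        out.write('<?xml version="1.0"?>\n<rss version="2.0"><channel>\n'
                  "<title>Clustered news</title>\n"
                  "<link>http://example.invalid/</link>\n"  # placeholder channel link
                  "<description>Grouped stories</description>\n"
                  + "\n".join(items)
                  + "\n</channel></rss>\n")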
