A New Approach to RSS Consumption
My approach goes one step beyond topic mapping by clustering similar topic occurrences. Here's a quick view of the method (I'll get to specifics in a second):
- Collect the data by downloading and parsing the RSS feeds.
- Normalize the data into a specific format.
- Normalize the case, remove punctuation, and strip the stop words.
- Stem the remaining terms to find their root words.
- Correlate the data to find interesting terms.
- Cluster the news stories by terms of interest.
- Display the results.
Collecting the Data
We collect the data by downloading and parsing the RSS feeds in question. This can be a collection of feeds from sources on any topic that you want. The method works best if you have a focused topic and feeds surrounding that topic. If you want a broader set of topics, you'll need to scale up the quantity of feeds appropriately.
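The article doesn't prescribe a parser, so here is a minimal sketch using Python's standard-library ElementTree to pull title/description pairs out of an RSS 2.0 document; the `parse_rss` helper and the sample feed are mine, for illustration (in practice you'd fetch each feed URL and run it through the same function):

```python
import xml.etree.ElementTree as ET

def parse_rss(xml_text):
    """Extract (title, description) pairs from an RSS 2.0 document."""
    root = ET.fromstring(xml_text)
    stories = []
    for item in root.iter("item"):  # each <item> is one news story
        title = item.findtext("title", default="")
        desc = item.findtext("description", default="")
        stories.append((title, desc))
    return stories

# A tiny inline feed stands in for a downloaded one.
sample = """<rss version="2.0"><channel><title>Demo</title>
<item><title>Fire on Maple Street</title>
<description>A man named Fred was stopped outside his home.</description></item>
</channel></rss>"""

print(parse_rss(sample)[0][0])  # -> Fire on Maple Street
```

A dedicated feed library would also handle RSS 1.0 and Atom variants; the stdlib version keeps the sketch self-contained.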
Normalizing the Data
Once you have the feeds, you normalize the data into a format that you can easily work with. This may mean inserting the entries into a database as rows or storing them in some other format. Storing in XML is a possibility, but then you'll need additional XML processing tools to get the data back out, which adds processing overhead and complexity. I find it most useful to rip the data from the feeds into a small database and then re-extract it for analysis. This strategy allows me to select a fixed number of stories for analysis and also to store the data over time for more long-term analysis.
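As a concrete sketch of the database approach, here is what ripping stories into a small SQLite table and re-extracting a fixed number of them might look like; the schema and helper are my own illustration, not a prescribed format:

```python
import sqlite3

# In-memory database for illustration; a file path would persist stories
# over time for the longer-term analysis mentioned above.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE stories (
    id INTEGER PRIMARY KEY,
    feed TEXT, title TEXT, body TEXT, fetched TEXT
)""")

def store(feed, title, body):
    """Insert one parsed feed entry as a row."""
    conn.execute(
        "INSERT INTO stories (feed, title, body, fetched) "
        "VALUES (?, ?, ?, datetime('now'))",
        (feed, title, body),
    )

store("example-feed", "Fire on Maple Street", "A man named Fred was stopped.")
conn.commit()

# Re-extract a fixed number of the most recent stories for analysis.
rows = conn.execute(
    "SELECT title, body FROM stories ORDER BY id DESC LIMIT 100"
).fetchall()
print(len(rows))  # -> 1
```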
Removing Stop Words
Next, we need to normalize the case, remove punctuation, and strip stop words from the data to get at the bare, interesting terms. Stop words are terms that appear in almost any entry and carry no meaning on their own, such as I, you, and the; in a news context, even terms like world or national qualify. Punctuation removal is pretty obvious; basically, you want to get at the bare words. In this step, a paragraph like the following:
A man named Fred was stopped outside of your home today. He was walking a dog named Chief past a fire hydrant on Maple Street.
would end up looking like this:
man named fred stopped outside home today walking dog named chief past fire hydrant maple street
Additional criteria are always possible to weed out the uninteresting bits.
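This normalization step can be sketched in a few lines of Python; the stop list below is a tiny illustrative sample (a real one would run to hundreds of words), and it reproduces the example transformation above:

```python
import string

# Abbreviated stop list for illustration only.
STOP_WORDS = {"a", "an", "the", "i", "you", "he", "she", "it",
              "was", "of", "on", "your", "and", "to", "in"}

def clean(text):
    """Lowercase, strip punctuation, and drop stop words."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [w for w in text.split() if w not in STOP_WORDS]

para = ("A man named Fred was stopped outside of your home today. "
        "He was walking a dog named Chief past a fire hydrant on Maple Street.")
print(" ".join(clean(para)))
# -> man named fred stopped outside home today walking dog named chief
#    past fire hydrant maple street
```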
Determining the Root Words
Now we'll stem the remaining words to find their root words. This step accomplishes a simple task: It reduces false differences and increases legitimate overlap. Words such as fires, fired, and fire all look the same once their endings have been removed. The sentence from the previous step would now look like this:
man name fred stop outside home today walk dog name chief past fire hydrant maple street
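A production system would use a proper Porter-style stemmer, but a crude suffix stripper is enough to show the idea and happens to reproduce the example above; the rules here are a deliberately simplified sketch, not a complete algorithm:

```python
VOWELS = "aeiou"

def _ends_cvc(word):
    # Ends consonant-vowel-consonant, final consonant not w, x, or y.
    return (len(word) >= 3
            and word[-1] not in VOWELS + "wxy"
            and word[-2] in VOWELS
            and word[-3] not in VOWELS)

def stem(word):
    """Crude suffix stripper; real systems use a Porter-style stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            word = word[:-len(suffix)]
            if word[-1] == word[-2] and word[-1] not in VOWELS + "lsz":
                word = word[:-1]   # stopped -> stopp -> stop
            elif _ends_cvc(word):
                word += "e"        # named -> nam -> name
            break
    return word

cleaned = ("man named fred stopped outside home today walking "
           "dog named chief past fire hydrant maple street")
print(" ".join(stem(w) for w in cleaned.split()))
# -> man name fred stop outside home today walk dog name chief
#    past fire hydrant maple street
```

Note that fires, fired, and fire all stem to fire, which is exactly the overlap this step is after.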
Correlating the Data
We discover the "interesting" terms by correlating the data by any number of means. A simple, static approach is to count the terms left over from the previous steps and order them by those counts. More advanced methods include natural language processing (NLP) approaches to discover the subjects of the sentences. Time-based systems can look at a window of time and count appearances of interesting terms; or, if you like the Daypop style, look at the relative trajectories of terms over time. In that setup, the interesting topics aren't the ones that are the most popular, but the fastest rising. Another approach may be to look at topics that appear together, such as a subject and a location.
A simple frequency count of the earlier example, pruned and stemmed, would look like this:
2 name
1 walk
1 today
1 street
1 stop
1 past
1 outside
1 maple
1 man
1 hydrant
1 home
1 fred
1 fire
1 dog
1 chief
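The simple frequency-count approach is a one-liner with Python's standard library; this sketch counts the pruned and stemmed terms from the running example:

```python
from collections import Counter

stemmed = ("man name fred stop outside home today walk dog name "
           "chief past fire hydrant maple street").split()

counts = Counter(stemmed)
for term, n in counts.most_common():  # highest count first
    print(n, term)
# First line printed: 2 name
```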
Clustering the Stories
We cluster news stories by finding which stories are related to the terms of interest. Some people will allow a story to be reused multiple times for any terms that it contains, but I prefer to use a story only once. This plan reduces the size of the output (but risks that the term used to pull it in the first place was not the best subject of the story).
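A minimal sketch of the use-each-story-once policy: walk the terms of interest in ranked order and assign each story to the first (highest-ranked) term it contains. The function and sample stories are my own illustration:

```python
def cluster(stories, interesting_terms):
    """Group each story under the highest-ranked term it mentions.

    stories: list of (title, set_of_stemmed_terms).
    interesting_terms: terms ordered from most to least interesting.
    Each story is assigned to at most one cluster.
    """
    clusters = {term: [] for term in interesting_terms}
    for title, terms in stories:
        for term in interesting_terms:   # highest-ranked matching term wins
            if term in terms:
                clusters[term].append(title)
                break                    # use the story only once
    return {t: s for t, s in clusters.items() if s}

stories = [
    ("Fire on Maple Street", {"fire", "maple", "street"}),
    ("Hydrant Inspections Resume", {"fire", "hydrant", "inspect"}),
    ("Dog Show Winners", {"dog", "show", "winner"}),
]
print(cluster(stories, ["fire", "dog"]))
# -> {'fire': ['Fire on Maple Street', 'Hydrant Inspections Resume'],
#     'dog': ['Dog Show Winners']}
```

The `break` is what enforces the single-use rule; dropping it gives the reuse-friendly variant other people prefer.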
Displaying the Results
Results can be displayed in any number of formats. For example, the following section shows the results of clustering as a directed graph; alternatively, you can use a text format for direct reading. The display should give some indication of which stories are grouped together and should facilitate navigation of the output.
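For the plain-text option, a renderer can be as simple as the following sketch, which prints each cluster's term as a header with its stories indented beneath it (the format is my own, chosen only to show the grouping):

```python
def render_text(clusters):
    """Render {term: [titles]} clusters as an indented text report."""
    lines = []
    for term, titles in clusters.items():
        lines.append(f"== {term} ({len(titles)} stories) ==")
        for title in titles:
            lines.append(f"  - {title}")
    return "\n".join(lines)

clusters = {
    "fire": ["Fire on Maple Street", "Hydrant Inspections Resume"],
    "dog": ["Dog Show Winners"],
}
print(render_text(clusters))
# == fire (2 stories) ==
#   - Fire on Maple Street
#   - Hydrant Inspections Resume
# == dog (1 stories) ==
#   - Dog Show Winners
```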